A simplistic explanation of automatic-differentiation-based backpropagation is that the AD software calculates the "gradient" of each layer's output with respect to its input (which is itself the output of the preceding layer), and multiplies all of these together to yield the end-to-end gradient - that is, the derivative of the scalar objective function with respect to the optimizable parameters. Since the input and output of a hidden layer are very often both vectors, "Jacobian matrix" is the more proper term to use in place of "gradient". There are numerous articles introducing the mathematical definition of the Jacobian matrix and its role in backpropagation, and many people are aware that what AD software actually computes in backpropagation is a chain of vector-Jacobian products (VJPs, each giving the gradient of the objective function with respect to an intermediate vector), rather than the Jacobian matrices themselves.

When we are only interested in the gradient of the scalar loss function with respect to the parameters, there is no need to build the intermediate VJPs explicitly: we can obtain the VJP at any node simply by asking the automatic differentiator for the gradient of the loss function with respect to that node. However, explicit VJPs become essential when we want to multiply a Jacobian (of one vector node with respect to another) by an arbitrary vector that is not the derivative of any scalar node in the graph (i.e., not something of the form dL/dv) - say, in Gauss-Newton optimization, where we need the product between a Jacobian and a step vector. Moreover, writing the backpropagation process as a chain of VJPs provides insight into how things really work. We assume readers have basic knowledge of the mathematical definitions of the gradient, Jacobian matrix, and Hessian matrix, and preferably of the basic principle of backpropagation.
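
As a quick illustration (using PyTorch here for concreteness; the function `f`, the input `x`, and the vector `v` are made up for this sketch), the snippet below computes the product between an arbitrary vector and a Jacobian without ever forming the Jacobian explicitly:

```python
# A minimal sketch: computing v^T J for an arbitrary vector v, where
# J = dy/dx and y = f(x) is a vector node in the graph.
import torch

def f(x):
    # An arbitrary "layer" mapping R^3 -> R^2, chosen only for illustration.
    W = torch.tensor([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])
    return torch.tanh(W @ x)

x = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
y = f(x)

# v is an arbitrary vector with the same shape as y; it is NOT the gradient
# of any scalar loss with respect to y (it could be, e.g., a step vector in
# a Gauss-Newton-type scheme).
v = torch.tensor([0.5, -1.0])

# Passing v as grad_outputs makes reverse-mode AD return the vector-Jacobian
# product v^T (dy/dx) instead of requiring y to be a scalar.
vjp, = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp)  # shape (3,): v^T J
```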

This notebook investigates how to obtain VJPs in different AD packages - namely Autograd, TensorFlow, and PyTorch - and clarifies the behaviors of the relevant functions implemented in these packages. The Gauss-Newton optimization code by Saugat Kandel (https://github.com/saugatkandel/sopt) was used heavily as a reference when I tried to figure out how the make_vjp function in Autograd works. If you are looking for a working implementation of second-order optimization algorithms, please check out that repo.
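
As a preview, the sketch below shows my understanding of the basic call pattern of Autograd's make_vjp (the function `f`, the weight matrix `W`, the input `x`, and the vector `v` are invented for illustration); its behavior is looked at in more detail in the rest of the notebook:

```python
# A minimal sketch of the typical make_vjp call pattern in Autograd, with a
# made-up vector-valued function f.
import autograd.numpy as np
from autograd import make_vjp

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

def f(x):
    # An arbitrary "layer" mapping R^3 -> R^2.
    return np.tanh(np.dot(W, x))

x = np.array([0.1, 0.2, 0.3])

# make_vjp(f)(x) returns (vjp_fn, f(x)); vjp_fn maps a vector v (with the
# same shape as the output) to v^T J, where J = df/dx.
vjp_fn, y = make_vjp(f)(x)

v = np.array([0.5, -1.0])   # an arbitrary vector, not the gradient of any scalar
print(vjp_fn(v))            # v^T J, same shape as x
```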