
### A friendly Introduction to Backpropagation in Python | Sushant Choudhary

Original source (sushant-choudhary.github.io)
Clipped on: 2017-11-26

$$\frac{\partial f}{\partial z} = x + y$$

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = z$$

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y} = z \cdot 1 = z$$

Here, q is just a forwardAddGate with inputs x and y, and f is a forwardMultiplyGate with inputs z and q. The last two equations above are key: when calculating the gradient of the entire circuit with respect to x (or y) we merely calculate the gradient of the gate q with respect to x (or y) and magnify it by a factor equal to the gradient of the circuit with respect to the output of gate q.
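This forward pass can be sketched in Python using the gate names above (a minimal sketch with the example values used below; the original article's code may differ):

```python
# Forward pass for the circuit f = (x + y) * z, built from two gates.
def forwardAddGate(a, b):
    return a + b

def forwardMultiplyGate(a, b):
    return a * b

x, y, z = -2, 5, -4            # example inputs from the text
q = forwardAddGate(x, y)       # q = x + y = 3
f = forwardMultiplyGate(q, z)  # f = q * z = -12
```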

For inputs to this circuit x=-2, y=5, z=-4, it is straightforward to compute that

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = -4 \cdot 1 = -4$$

Let’s see what’s going on here. By itself, $\frac{\partial q}{\partial x}$ equals 1, i.e., increasing x increases the output of gate q. However, in the larger circuit f, the output is increased by a *reduction* in the output of q, since $\frac{\partial f}{\partial q} = z = -4$ is negative. Hence the goal, which is to maximize the output of the larger circuit f, is served by reducing q, for which x needs to be reduced.
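A quick check makes this concrete (an illustrative snippet using the example values from the text, not code from the original article): nudging x downward does increase f.

```python
# f = (x + y) * z; with x=-2, y=5, z=-4 the output is -12.
# Since df/dx = z = -4 is negative, decreasing x should increase f.
def f(x, y, z):
    return (x + y) * z

base = f(-2, 5, -4)       # -12
lower_x = f(-2.1, 5, -4)  # (-2.1 + 5) * -4 = -11.6, which is larger than -12
```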

Hopefully, it is clear now that in this circuit, to calculate the gradient with respect to any input, we just calculate the gradient of the simpler gate that directly takes that input, with respect to that input, and then multiply the result by the gradient of the circuit with respect to that gate's output (chain rule).
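Written out directly for this circuit, the recipe looks like this (a sketch; the variable names are mine, not the article's):

```python
# Local gradients at each gate, combined via the chain rule.
x, y, z = -2.0, 5.0, -4.0
q = x + y   # add gate:      q = 3
f = q * z   # multiply gate: f = -12

# gradient of the circuit with respect to the output of gate q, and to z
df_dq = z   # -4
df_dz = q   #  3
# local gradients of the add gate
dq_dx = 1.0
dq_dy = 1.0
# chain rule: local gradient magnified by the circuit's gradient w.r.t. q
df_dx = df_dq * dq_dx   # -4
df_dy = df_dq * dq_dy   # -4
```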

But in a more complex circuit, that gate might feed into multiple other gates before the output stage, so it is best to do the chain computation backwards, starting from the output stage. This is backpropagation.
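One way to sanity-check the gradients computed backwards from the output is a numerical estimate, nudging each input by a small step h (an illustrative check I've added; it is not from the original article):

```python
# Numerical gradients for f = (x + y) * z via finite differences.
def f(x, y, z):
    return (x + y) * z

h = 1e-6
x, y, z = -2.0, 5.0, -4.0
df_dx = (f(x + h, y, z) - f(x, y, z)) / h   # ~ -4
df_dy = (f(x, y + h, z) - f(x, y, z)) / h   # ~ -4
df_dz = (f(x, y, z + h) - f(x, y, z)) / h   # ~  3
```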