Jay Taylor's notes
A friendly Introduction to Backpropagation in Python | Sushant Choudhary
Here, q is just a forwardAddGate with inputs x and y, and f is a forwardMultiplyGate with inputs z and q. The last two equations above are key: when calculating the gradient of the entire circuit with respect to x (or y) we merely calculate the gradient of the gate q with respect to x (or y) and magnify it by a factor equal to the gradient of the circuit with respect to the output of gate q.
For inputs to this circuit x=-2, y=5, z=-4 it is straightforward to compute that $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = -4 \cdot 1 = -4$.
Let’s see what’s going on here. On its own, $\frac{\partial q}{\partial x}$ equals 1, i.e., increasing x increases the output of gate q. However, in the larger circuit (f) the output is increased by a reduction in the output of q, since $\frac{\partial f}{\partial q} = z = -4$ is a negative number. Hence the goal, which is to maximize the output of the larger circuit f, is served by reducing q, for which x needs to be reduced.
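The sign argument above can be checked numerically. The sketch below assumes that forwardAddGate and forwardMultiplyGate simply return the sum and the product of their two inputs (their definitions are not shown in this section). The helper forwardCircuit, the perturbation h and the step size are names and values introduced here purely for illustration.

```python
# Assumed definitions: the gates are taken to be a plain sum and a plain product.
def forwardAddGate(a, b):
    return a + b

def forwardMultiplyGate(a, b):
    return a * b

def forwardCircuit(x, y, z):
    # Hypothetical helper: f = z * q, with q = x + y
    q = forwardAddGate(x, y)
    return forwardMultiplyGate(z, q)

x, y, z = -2, 5, -4
h = 1e-4      # small perturbation, chosen arbitrarily
step = 0.01   # small step size, chosen arbitrarily

f0 = forwardCircuit(x, y, z)  # -4 * (-2 + 5) = -12

# Finite-difference estimate of df/dx; should be close to the analytic value -4
grad_x_numeric = (forwardCircuit(x + h, y, z) - f0) / h

# Moving x along the gradient reduces x (the gradient is negative) and raises f
x_new = x + step * grad_x_numeric
f_new = forwardCircuit(x_new, y, z)

print(grad_x_numeric, x_new, f_new)
```

Because the gradient is negative, stepping along it lowers x, which lowers q and therefore raises f, matching the reasoning in the text.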
Hopefully it is now clear that, in this circuit, calculating the gradient with respect to any input takes two steps: calculate the gradient of the simpler gate that directly takes that input, with respect to that input, and then multiply the result by the gradient of the circuit with respect to that gate's output (the chain rule).
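As a minimal sketch of that two-step recipe, the analytic gradients for all three inputs can be written out directly. The variable names (df_dq, dq_dx and so on) are chosen here for illustration and are not taken from the original code.

```python
x, y, z = -2, 5, -4

# Forward pass through the two gates
q = x + y   # add gate: q = 3
f = z * q   # multiply gate: f = -12

# Local gradients of the multiply gate f = z * q
df_dq = z   # -4
df_dz = q   # 3

# Local gradients of the add gate q = x + y
dq_dx = 1
dq_dy = 1

# Chain rule: local gradient times the circuit's gradient w.r.t. the gate output
df_dx = dq_dx * df_dq   # -4
df_dy = dq_dy * df_dq   # -4

print(df_dx, df_dy, df_dz)
```

Note that z feeds the output gate directly, so its gradient is just the multiply gate's local gradient and needs no further chaining.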
In a more complex circuit, however, that gate might lead into several other gates before the output stage, so it is best to carry out the chain computation backwards, starting from the output stage. This is backpropagation.
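To illustrate the backward sweep, the sketch below extends the circuit with one extra multiply gate and a hypothetical fourth input w, so that f = (z * q) * w. The extra gate, the input w = 2 and all variable names are assumptions made for this example only. Each gate's gradient is computed once, starting from the output, and reused by everything upstream of it.

```python
x, y, z = -2, 5, -4
w = 2   # hypothetical extra input, not part of the original circuit

# Forward pass, gate by gate
q = x + y   # add gate: 3
m = z * q   # multiply gate: -12
f = m * w   # extra multiply gate (assumed): -24

# Backward pass, starting from the output stage
df_df = 1            # gradient of the output with respect to itself
df_dm = w * df_df    # through f = m * w: 2
df_dw = m * df_df    # -12
df_dq = z * df_dm    # through m = z * q: -8
df_dz = q * df_dm    # 6
df_dx = 1 * df_dq    # through q = x + y: -8
df_dy = 1 * df_dq    # -8

print(df_dx, df_dy, df_dz, df_dw)
```

Working backwards means df_dm is computed a single time and then shared by the gradients for q, z, x and y, rather than being recomputed for every input.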