Why factor in the gradient?

Make changes that help the most
Almost linear at first: swift change, helps avoid local minima
What if the miss (error) is high and the gradient is small?
- Could be a noisy outlier
- If not an outlier, other examples will also support change
- If a hidden unit, expect other units to change
- If an output unit, expect hidden layer to change
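
A minimal sketch of the high-miss, small-gradient case, assuming a sigmoid unit and the usual delta rule (delta = miss * gradient). The notes do not name the activation function, so the sigmoid and the names `sigmoid` and `delta` here are illustrative assumptions, not the author's code.

```python
import math

def sigmoid(net):
    """Logistic activation (an assumption; the notes do not name one)."""
    return 1.0 / (1.0 + math.exp(-net))

def delta(target, net):
    """Error signal for one unit: miss scaled by the activation gradient."""
    out = sigmoid(net)
    miss = target - out            # how far the output is from the target
    gradient = out * (1.0 - out)   # sigmoid derivative, largest at net = 0
    return miss * gradient

# Unit in the near-linear middle range: gradient is at its maximum (0.25).
mid_delta = delta(target=1.0, net=0.0)

# Saturated unit that is badly wrong: miss is close to -1, but the
# gradient is tiny, so the resulting change is small.
saturated_delta = delta(target=0.0, net=6.0)

print(mid_delta, saturated_delta)
```

The saturated unit has nearly twice the miss of the mid-range unit, yet its delta is far smaller, so a single noisy outlier cannot easily drag a committed unit away from its current value.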
Initiate changes over subset of training examples
Model must fit, or change to fit
"Co-evolving" layers
Output unit IS-A meta-concept