Why factor in the gradient?
- Make changes that help the most
- Almost linear at first: swift changes, helps avoid local minima
- What if the miss (error) is high and the gradient is small?
- Could be a noisy outlier
- If not an outlier, other examples will also support change
- If a hidden unit, expect other units to change
- If an output unit, expect hidden layer to change
- Initiate changes over subset of training examples
- Model must fit, or change to fit
- "Co-evolving" layers
- Output unit ISA Meta-concept
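
The case of a large miss with a small gradient can be illustrated with a minimal sketch. It assumes a sigmoid activation and the usual delta-rule error term, (target - output) times the activation's derivative; the notes above name neither, and the function names `sigmoid` and `output_delta` are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic activation (an assumption; the notes do not name one)."""
    return 1.0 / (1.0 + math.exp(-net))


def output_delta(target, net):
    """Error term for a sigmoid output unit: (target - out) * f'(net)."""
    out = sigmoid(net)
    gradient = out * (1.0 - out)  # derivative of the sigmoid at net
    return (target - out) * gradient


# Unit in the near-linear middle of its range: the gradient peaks at 0.25,
# so a moderate miss still produces a sizeable change.
delta_mid = output_delta(target=1.0, net=0.0)

# Saturated unit with a large miss: the gradient is tiny, so this single
# example produces only a small change, even though the miss is nearly 1.
delta_saturated = output_delta(target=1.0, net=-6.0)

print(delta_mid, delta_saturated)
```

Under these assumptions, the saturated unit moves only slightly on one example. If the miss is not a noisy outlier, other training examples push in the same direction and the small changes accumulate.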