#optimization
Every summary, chronological. Filter by category, tag, or source from the rail.
Tag · #optimization
How Adam's Variance Normalization Fixes SGD's Frequency Bias
Standard SGD fails to optimize rare tokens because they receive infrequent gradient updates. Adam solves this by using variance normalization to automatically amplify the effective learning rate for rare parameters.
MarkTechPost
Showing 1 of 1