- Adam: A Method for Stochastic Optimization
- Adam-mini: Use Fewer Learning Rates To Gain More
- Estimation of Non-Normalized Statistical Models by Score Matching
- Full Parameter Fine-tuning for Large Language Models with Limited Resources - "LOMO"
- How to Train Your Energy-Based Models
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
- Large Batch Training of Convolutional Networks
- LoRA: Low-Rank Adaptation of Large Language Models
- Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
- Parameter-efficient fine-tuning of large-scale pre-trained language models - "Delta-tuning"
- SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
- SGDR: Stochastic Gradient Descent with Warm Restarts
- Stochastic Average Gradient: A Simple Empirical Investigation
- The Marginal Value of Adaptive Gradient Methods in Machine Learning
Resources
- Convex Optimization by Ryan Tibshirani
- Convex Optimization by Stephen Boyd and Lieven Vandenberghe
- Foundations for Optimization and Optimization by Mark Walker
- Short Lectures on Optimization by Michel Bierlaire
- bentrevett/a-tour-of-pytorch-optimizers - A tour of different optimization algorithms in PyTorch
Concepts
Losses
- Cross-entropy
- KL divergence
- Hinge loss - Wikipedia
- (max) margin loss
- See also Loss functions for classification - Wikipedia (a short PyTorch sketch of these losses follows below)
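The losses listed above are all available directly in PyTorch (see the optimizer-tour repo under Resources). A minimal sketch comparing how they are called, using made-up logits and targets purely for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of 4 examples over 3 classes (made-up numbers).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2, 0.3],
                       [-0.5, 0.0, 2.5],
                       [1.0, 1.0, 1.0]])
targets = torch.tensor([0, 1, 2, 0])

# Cross-entropy: softmax followed by negative log-likelihood of the true class.
ce = F.cross_entropy(logits, targets)

# KL divergence between the model distribution and a reference distribution.
# F.kl_div expects log-probabilities as input and probabilities as target.
log_probs = F.log_softmax(logits, dim=1)
reference = torch.full_like(log_probs, 1.0 / 3.0)  # uniform reference distribution
kl = F.kl_div(log_probs, reference, reduction="batchmean")

# Multi-class (max) margin / hinge-style loss: penalises any class whose score
# comes within a margin of the true class's score.
margin = F.multi_margin_loss(logits, targets, margin=1.0)

print(f"cross-entropy={ce:.4f}  KL={kl:.4f}  margin={margin:.4f}")
```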
Optimization of Logistic and Multinomial Regression
- Chapter 5 (Logistic Regression) of Speech and Language Processing by Daniel Jurafsky & James H. Martin contains explanations and derivations of the optimisation of logistic and multinomial regression (see the gradient-descent sketch after this list). Relevant sections are:
- §5.4 Learning in Logistic Regression
- §5.5 The cross-entropy loss function
- §5.6.1 The Gradient for Logistic Regression (and the worked example in 5.6.3)
- §5.8 Learning in Multinomial Logistic Regression
- §5.10 Advanced: Deriving the Gradient Equation
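As a companion to §5.6.1, here is a minimal NumPy sketch of batch gradient descent for binary logistic regression. The synthetic data, learning rate, and step count are made up for illustration; the key line is the gradient of the mean cross-entropy loss, `X.T @ (sigmoid(X @ w) - y) / n`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (made up for illustration).
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
lr = 0.5
for step in range(500):
    p = sigmoid(X @ w)                # predicted probabilities
    grad = X.T @ (p - y) / n          # gradient of the mean cross-entropy loss
    w -= lr * grad                    # gradient descent update

p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)  # clip to avoid log(0)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print("learned weights:", w, "cross-entropy:", round(loss, 4))
```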
- A comparison of numerical optimizers for logistic regression from Thomas P. Minka (October 22, 2003; revised Mar 26, 2007)
- This post (responding to the question of how scikit-learn optimisers work) briefly explains the following (see also the scikit-learn solver sketch after this list):
- Newton's method
- Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS)
- LIBLINEAR, the Library for Large Linear Classification (from ICML 2008)
- SAG
- SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives
- Broyden-Fletcher-Goldfarb-Shanno algorithm - Wikipedia
- How does the L-BFGS work? (Cross Validated post)
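Several of the methods above (L-BFGS, a Newton variant, LIBLINEAR, SAG, SAGA) are exposed directly through the `solver` parameter of scikit-learn's `LogisticRegression`. A minimal sketch comparing them on a toy dataset; the dataset choice and `max_iter` are arbitrary, and all solvers optimise the same regularised logistic loss, differing only in how they reach the optimum:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SAG/SAGA in particular converge much faster on standardised features.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Same objective, different optimisation algorithms (quasi-Newton, Newton-type,
# LIBLINEAR, incremental-gradient methods).
for solver in ["lbfgs", "newton-cg", "liblinear", "sag", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=1000).fit(X_train, y_train)
    print(f"{solver:10s} test accuracy: {clf.score(X_test, y_test):.3f}")
```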
Lagrangian Optimisation
- Lagrange Multipliers | Geometric Meaning & Full Example from Dr. Trefor Bazett - excellent visual intuition motivating the Lagrange multiplier method via contours (level curves) and gradients: at the constrained max/min, the gradient of the constraint curve is a scalar multiple of the gradient of the objective (the loss, or in economics the "utility" function); a small symbolic sketch follows this list
- Lagrangian Optimization: Example from EconomicsHelpDesk (June 9, 2013)
- Lagrange multipliers, using tangency to solve constrained optimization from Khan Academy
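As a small companion to the videos above, here is a sketch of the standard textbook example (maximise a "utility" function f(x, y) = x*y subject to a linear budget constraint; the specific numbers are made up), solved symbolically with SymPy by imposing stationarity of the Lagrangian, which encodes the tangency condition grad f = lambda * grad g:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x * y               # objective ("utility" / loss) function
g = x + 2 * y - 10      # constraint g(x, y) = 0 (budget line, made-up numbers)

# Lagrangian L = f - lambda * g; setting all partial derivatives to zero
# gives grad f = lambda * grad g together with the constraint itself.
L = f - lam * g
stationarity = [sp.diff(L, v) for v in (x, y, lam)]

solutions = sp.solve(stationarity, [x, y, lam], dict=True)
for sol in solutions:
    print(sol, "objective value:", f.subs(sol))
# Expected solution for these numbers: x = 5, y = 5/2, lambda = 5/2, f = 25/2.
```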