Title: Adam-mini: Use Fewer Learning Rates To Gain More
Authors: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
Published: 24th June 2024 (Monday) @ 16:56:41
Link: http://arxiv.org/abs/2406.16793v6

Abstract

We propose Adam-mini, an optimizer that achieves performance on par with or better than AdamW with a 50% smaller memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/√v). By investigating the Hessian structure of neural nets, we find that Adam’s v might not function at its full potential. We find that 99.9% of these learning rates in v could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure, and (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models sized from 39M to 13B parameters for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
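
To make the core idea concrete, below is a minimal, hedged sketch (not the authors' released implementation) of the block-wise second-moment update the abstract describes. It assumes, for simplicity, that each parameter tensor is its own "block"; the actual partition in the paper follows its Hessian-based principle. The function name adam_mini_step and the state layout are illustrative choices, not APIs from the paper.

```python
# A minimal sketch of the block-wise idea behind Adam-mini, assuming one
# parameter tensor per block. Instead of keeping a per-coordinate second
# moment v as in Adam, each block keeps a single scalar v built from the
# mean of the squared gradients in that block, so the whole block shares
# one learning rate.
import torch


def adam_mini_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update step. `state` maps each param to {"m": tensor, "v": float, "t": int}."""
    for p, g in zip(params, grads):
        s = state.setdefault(p, {"m": torch.zeros_like(p), "v": 0.0, "t": 0})
        s["t"] += 1
        # First moment is kept per coordinate, exactly as in Adam.
        s["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # Second moment is a single scalar per block: mean of squared gradients.
        s["v"] = beta2 * s["v"] + (1 - beta2) * g.pow(2).mean().item()
        # Bias correction, as in Adam.
        m_hat = s["m"] / (1 - beta1 ** s["t"])
        v_hat = s["v"] / (1 - beta2 ** s["t"])
        # One step size, lr / (sqrt(v_hat) + eps), shared by the whole block.
        p.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))
```

The memory saving comes from v shrinking from one float per parameter to one float per block, which is where the roughly 50% reduction relative to AdamW's {m, v} state originates.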