How to Improve AI Models While Training Only 0.1% of Parameters

Estimated Reading Time: 9 minutes
- AdaMix is a groundbreaking Parameter-Efficient Fine-Tuning (PEFT) method that significantly improves AI model performance.
- It achieves state-of-the-art results by training only 0.1% to 0.2% of a model’s parameters, even outperforming full model fine-tuning.
- AdaMix utilizes a mixture of adaptation modules, stochastic routing, and a unique module merging mechanism to maintain efficiency without sacrificing performance.
- This innovation drastically reduces computational costs, memory, and storage requirements for deploying large language models.
- AdaMix democratizes access to advanced AI capabilities, making specialized AI more accessible and economically viable for widespread deployment.
- The Scaling Challenge: Why Traditional Fine-Tuning is Unsustainable
- AdaMix: Unlocking Peak Performance with Minimal Parameters
- The Engineering Behind AdaMix’s Efficiency and Effectiveness
- Real-World Impact: Democratizing Advanced AI
- 3 Actionable Steps for AI Practitioners
- Conclusion
- Frequently Asked Questions (FAQ)
The landscape of Artificial Intelligence is evolving at an unprecedented pace, driven by the capabilities of increasingly large pre-trained language models (PLMs) like GPT-3 and MT-NLG. These colossal models, boasting hundreds of billions of parameters, have redefined what’s possible in natural language understanding (NLU) and generation (NLG). However, this power comes with a significant cost: fine-tuning these behemoths for specific downstream tasks requires updating an equally massive number of parameters, demanding immense computational resources, memory, and storage.
Imagine needing to store a full copy of a 175-billion-parameter model for every new task – the infrastructure requirements quickly become unsustainable. This challenge has spurred the development of Parameter-Efficient Fine-Tuning (PEFT) techniques, which aim to reduce the trainable parameters while maintaining performance. While PEFT methods have offered a partial solution, they often fall short of the performance achieved by full model fine-tuning. Until now.
A groundbreaking approach called AdaMix is changing this paradigm. By tuning a mere 0.1% to 0.2% of a PLM’s parameters, AdaMix not only outperforms existing state-of-the-art PEFT methods but also surpasses the performance of full model fine-tuning for various NLU and NLG tasks. This innovation promises to democratize access to advanced AI capabilities by drastically cutting down on the resources required for deployment and specialization.
Authors:
- (1) Yaqing Wang, Purdue University (wang5075@purdue.edu);
- (2) Sahaj Agarwal, Microsoft (sahagar@microsoft.com);
- (3) Subhabrata Mukherjee, Microsoft Research (submukhe@microsoft.com);
- (4) Xiaodong Liu, Microsoft Research (xiaodl@microsoft.com);
- (5) Jing Gao, Purdue University (jinggao@purdue.edu);
- (6) Ahmed Hassan Awadallah, Microsoft Research (hassanam@microsoft.com);
- (7) Jianfeng Gao, Microsoft Research (jfgao@microsoft.com).
Table of Links (from Paper):
- Abstract and 1. Introduction
- Background
- 2.1 Mixture-of-Experts
- 2.2 Adapters
- Mixture-of-Adaptations
- 3.1 Routing Policy
- 3.2 Consistency regularization
- 3.3 Adaptation module merging and 3.4 Adaptation module sharing
- 3.5 Connection to Bayesian Neural Networks and Model Ensembling
- Experiments
- 4.1 Experimental Setup
- 4.2 Key Results
- 4.3 Ablation Study
- Related Work
- Conclusions
- Limitations
- Acknowledgment and References
- Appendix
- A. Few-shot NLU Datasets
- B. Ablation Study
- C. Detailed Results on NLU Tasks
- D. Hyper-parameter
Abstract:
Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task resulting in increased cost for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced where small trainable components are injected in the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules – given the underlying PEFT method of choice – introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby (Houlsby et al., 2019) or a mixture of low rank decomposition matrices like LoRA (Hu et al., 2021) to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the same computational cost and the number of tunable parameters as the underlying PEFT method. By only tuning 0.1 − 0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks. Code and models are made available at https://aka.ms/AdaMix.
1. Introduction:
Standard fine-tuning of large pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2019) to downstream tasks requires updating all model parameters. Given the ever-increasing size of PLMs (e.g., 175 billion parameters for GPT-3 (Brown et al., 2020) and 530 billion parameters for MTNLG (Smith et al., 2022)), even the fine-tuning step becomes expensive as it requires storing a full copy of model weights for every task. To address these challenges, recent works have developed parameter-efficient fine-tuning (PEFT) techniques. These approaches typically underperform standard full model fine-tuning, but significantly reduce the number of trainable parameters. There are many varieties of PEFT methods, including prefix-tuning (Li and Liang, 2021) and prompt-tuning (Lester et al., 2021) to condition frozen language models via natural language task descriptions, low dimensional projections using adapters (Houlsby et al., 2019; Pfeiffer et al., 2020, 2021) and more recently using low-rank approximation (Hu et al., 2021). Figure 1 shows the performance of some popular PEFT methods with varying number of tunable parameters. We observe a significant performance gap with respect to full model tuning where all PLM parameters are updated.
In this paper, we present AdaMix, a mixture of adaptation modules approach, and show that it outperforms SOTA PEFT methods and also full model fine-tuning while tuning only 0.1 − 0.2% of PLM parameters.
In contrast to traditional PEFT methods that use a single adaptation module in every Transformer layer, AdaMix uses several adaptation modules that learn multiple views of the given task. In order to design this mixture of adaptations, we take inspiration from sparsely-activated mixture-of-experts (MoE) models. In traditional dense models (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020)), all model weights are activated for every input example. MoE models induce sparsity by activating only a subset of the model weights for each incoming input.
Consider adapters (Houlsby et al., 2019), one of the most popular PEFT techniques, to illustrate our method. A feedforward layer (FFN) down-projects the hidden representation to a low dimension d (the bottleneck dimension), and a second FFN projects it back up to match the dimensionality of the next layer. Instead of using a single adapter, we introduce multiple project-down and project-up FFNs in each Transformer layer. We route each input example to one project-down and one project-up FFN, resulting in the same computational cost (FLOPs) as using a single adapter. For methods like LoRA (Hu et al., 2021), which decompose the update to the pre-trained weights into low-rank matrices (A and B), we introduce multiple low-rank decompositions and route input examples among them in the same way.
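To make the mixture-of-adapters idea concrete, here is a minimal PyTorch-style sketch (not the authors' implementation; the class and argument names are illustrative) of a layer component that holds several project-down and project-up FFNs but activates only one of each per forward pass:

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Illustrative mixture of bottleneck adapters for one Transformer layer.

    Holds several project-down and project-up FFNs; a forward pass uses
    exactly one of each, so the FLOPs match those of a single adapter.
    """

    def __init__(self, hidden_dim: int, bottleneck_dim: int, num_experts: int = 4):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Linear(hidden_dim, bottleneck_dim) for _ in range(num_experts)
        )
        self.up = nn.ModuleList(
            nn.Linear(bottleneck_dim, hidden_dim) for _ in range(num_experts)
        )
        self.act = nn.GELU()
        self.num_experts = num_experts

    def forward(self, hidden_states: torch.Tensor, down_idx: int, up_idx: int) -> torch.Tensor:
        # Residual adapter: h + up(act(down(h))), using one down and one up FFN.
        x = self.act(self.down[down_idx](hidden_states))
        return hidden_states + self.up[up_idx](x)
```

Because only one project-down/project-up pair is active for a given input, the forward cost stays that of a single bottleneck adapter regardless of how many modules the mixture contains.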
We discuss different routing mechanisms and show that stochastic routing yields good performance while eliminating the need to introduce any additional parameters for module selection. To alleviate the training instability that can arise from randomly selecting different adaptation modules at different training steps, we leverage consistency regularization and the sharing of adaptation modules during stochastic routing.
The introduction of multiple adaptation modules increases the number of adaptation parameters. This does not increase computational cost, but it does increase storage cost. To address this, we develop a merging mechanism that combines the weights of the different adaptation modules into a single module in each Transformer layer, keeping the number of adaptation parameters the same as that of a single adaptation module. Our merging mechanism is inspired by model weight averaging in model soups (Wortsman et al., 2022) and MultiBERTs (Sellam et al., 2022). Weight averaging of models with different random initializations has been shown to improve model performance in recent works (Matena and Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020) that show the optimized models lie in the same basin of the error landscape. While those works fine-tune independent models, we extend the idea to parameter-efficient fine-tuning with randomly initialized adaptation modules and a frozen language model.
Overall, our work makes the following contributions:
- (a) We develop a new method AdaMix as a mixture of adaptations for parameter-efficient fine-tuning (PEFT) of large language models. Given any PEFT method of choice like adapters and low-rank decompositions, AdaMix improves downstream task performance over the underlying PEFT method.
- (b) AdaMix is trained with stochastic routing and adaptation module merging to retain the same computational cost (e.g., FLOPs, #tunable adaptation parameters) and benefits of the underlying PEFT method. To better understand how AdaMix works, we demonstrate its strong connections to Bayesian Neural Networks and model ensembling.
- (c) By tuning only 0.1 − 0.2% of a pre-trained language model’s parameters, AdaMix is the first PEFT method to outperform full model fine-tuning methods for all NLU tasks on GLUE, and outperforms other competing methods for NLG and few-shot NLU tasks.
Practical benefits of PEFT methods. The most significant benefit of PEFT methods comes from the reduction in memory and storage usage. For a Transformer, the VRAM consumption can be significantly reduced as we do not need to keep track of optimizer states for the frozen parameters. PEFT methods also allow multiple tasks to share the same copy of the full (frozen) PLM. Hence, the storage cost for introducing a new task can be reduced by up to 444x (from 355MB to 0.8MB with RoBERTa-large encoder in our setting).
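As a rough illustration of these savings, the back-of-envelope arithmetic below uses the figures quoted above (355MB per fine-tuned RoBERTa-large copy vs. 0.8MB of adaptation weights per task); the 100-task scenario is purely illustrative.

```python
# Back-of-envelope arithmetic on the storage figures quoted above.
full_copy_mb = 355.0      # one fine-tuned RoBERTa-large copy per task
adamix_task_mb = 0.8      # per-task AdaMix adaptation weights
num_tasks = 100           # illustrative number of downstream tasks

per_task_reduction = full_copy_mb / adamix_task_mb
full_ft_total = num_tasks * full_copy_mb                    # one full copy per task
adamix_total = full_copy_mb + num_tasks * adamix_task_mb    # one frozen PLM + per-task adapters

print(f"Per-task storage reduction: ~{per_task_reduction:.0f}x")               # ~444x
print(f"Full fine-tuning, {num_tasks} tasks: {full_ft_total / 1000:.1f} GB")   # 35.5 GB
print(f"AdaMix, {num_tasks} tasks: {adamix_total:.0f} MB")                     # ~435 MB
```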
We present background on Mixture-of-Experts (MoE) and adapters in Section 2.
2. Background
2.1 Mixture-of-Experts
2.2 Adapters
The predominant methodology for task adaptation is to tune all of the trainable parameters of the PLM for every task. This raises significant resource challenges both during training and deployment. A recent study (Aghajanyan et al., 2021) shows that PLMs have a low intrinsic dimension that can match the performance of the full parameter space.
To adapt PLMs for downstream tasks with a small number of parameters, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach for lightweight tuning.
This paper is available on arxiv under CC BY 4.0 DEED license.
The Scaling Challenge: Why Traditional Fine-Tuning is Unsustainable
The incredible success of large language models stems from their ability to learn vast amounts of information during pre-training. However, to apply these general models to specific tasks – like sentiment analysis, question answering, or code generation – they need to be fine-tuned. Historically, this meant updating all the model’s parameters, a process known as full model fine-tuning. While effective, this approach quickly becomes resource-prohibitive:
- Enormous Storage Costs: Each fine-tuned model requires a complete copy of the PLM’s weights, leading to terabytes of storage for multiple tasks.
- High Computational Demands: Updating billions of parameters is computationally intensive, requiring powerful GPUs and significant energy.
- Memory Constraints: Running these fine-tuned models in production demands vast amounts of VRAM, limiting the number of models that can be served simultaneously.
To mitigate these issues, Parameter-Efficient Fine-Tuning (PEFT) methods emerged. Techniques like prefix-tuning, prompt-tuning, LoRA (Low-Rank Adaptation), and adapters were developed to inject small, trainable components into the frozen PLM, significantly reducing the number of parameters that need updating. While these methods offered much-needed relief in terms of resource usage, they often came with a trade-off: a noticeable performance gap compared to full model fine-tuning. This left AI developers in a dilemma: efficiency or top-tier performance?
AdaMix: Unlocking Peak Performance with Minimal Parameters
AdaMix offers a compelling answer to this dilemma. Its core innovation lies in moving beyond a single adaptation module per Transformer layer, a common practice in conventional PEFT methods. Instead, AdaMix introduces a mixture of adaptation modules.
This concept draws inspiration from Mixture-of-Experts (MoE) models, which induce sparsity by activating only a subset of their weights for each input. AdaMix applies this principle to adaptation modules. For example, if using adapters (small feedforward layers inserted into the PLM), instead of just one project-down and one project-up FFN in each Transformer layer, AdaMix incorporates several. Similarly, for LoRA, which uses low-rank decomposition matrices, AdaMix deploys multiple such decompositions.
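For the LoRA variant, the sketch below shows what a mixture of low-rank decompositions could look like in PyTorch. It is illustrative only: `MixtureOfLoRA`, its arguments, and the initialization choices are assumptions, not code from the AdaMix release.

```python
import torch
import torch.nn as nn

class MixtureOfLoRA(nn.Module):
    """Illustrative mixture of low-rank (A, B) pairs attached to a frozen linear layer.

    Each forward pass routes the input through a single (A, B) pair, so the
    extra compute matches plain LoRA at the same rank.
    """

    def __init__(self, frozen_linear: nn.Linear, rank: int = 8, num_experts: int = 4):
        super().__init__()
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad = False          # the PLM weight stays frozen
        in_f, out_f = frozen_linear.in_features, frozen_linear.out_features
        self.A = nn.ParameterList(
            nn.Parameter(torch.randn(rank, in_f) * 0.01) for _ in range(num_experts)
        )
        self.B = nn.ParameterList(
            nn.Parameter(torch.zeros(out_f, rank)) for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # Low-rank correction from the selected (A, B) pair added to the frozen output.
        delta = x @ self.A[expert_idx].T @ self.B[expert_idx].T
        return self.base(x) + delta

# Example usage with an illustrative 1024-dim projection:
# layer = MixtureOfLoRA(nn.Linear(1024, 1024))
```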
The beauty of AdaMix is that it can leverage any underlying PEFT method (adapters, LoRA, etc.) and enhance its performance. Crucially, it achieves this without increasing the computational cost (FLOPs) during inference, matching that of using a single adaptation module.
The Engineering Behind AdaMix’s Efficiency and Effectiveness
Achieving superior performance while tuning only a fraction of parameters involves several clever mechanisms:
1. Stochastic Routing
Instead of activating all adaptation modules for every input, AdaMix employs a routing mechanism. Input examples are directed to a specific subset of the available project-up and project-down FFNs (or low-rank matrices). Stochastic routing, in particular, has proven effective. This method randomly selects modules, eliminating the need for additional parameters to manage module selection and keeping the overall parameter count low.
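A minimal sketch of such a parameter-free routing step is shown below, assuming the mixture layout from the earlier adapter example; the function name and the `merged_idx` convention are illustrative assumptions.

```python
import torch

def stochastic_route(num_experts: int, training: bool, merged_idx: int = 0) -> tuple[int, int]:
    """Pick one project-down and one project-up module per training step.

    There is no learned router: indices are drawn uniformly at random during
    training, adding zero routing parameters. After merging (or for
    deterministic evaluation), a single fixed module is used instead.
    """
    if training:
        down_idx = int(torch.randint(num_experts, (1,)))
        up_idx = int(torch.randint(num_experts, (1,)))
        return down_idx, up_idx
    return merged_idx, merged_idx
```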
2. Consistency Regularization and Module Sharing
Randomness, while efficient, can introduce instability during training. To counteract this, AdaMix utilizes consistency regularization, ensuring more stable learning across different module selections. Additionally, sharing certain adaptation modules further enhances stability and efficiency.
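One common way to implement this kind of consistency regularization is a symmetric KL term between two stochastic forward passes of the same batch. The sketch below is an illustration of that idea, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two stochastic forward passes.

    logits_a and logits_b come from the same batch routed through different
    randomly selected adaptation modules; penalizing their disagreement
    stabilizes training under stochastic routing.
    """
    p = F.log_softmax(logits_a, dim=-1)
    q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# Illustrative total objective:
# total_loss = task_loss + lambda_consistency * consistency_loss(logits_a, logits_b)
```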
3. Adaptation Module Merging
While AdaMix uses multiple adaptation modules during training to learn diverse “views” of a task, this could theoretically increase storage requirements for the adaptation parameters. To address this, AdaMix employs a brilliant merging mechanism. Inspired by techniques like “model soups” and weight averaging, it combines the weights from different adaptation modules into a single, consolidated module in each Transformer layer post-training. This ensures that the final deployed model has the same number of adaptation parameters as if only a single module was used, effectively retaining the storage benefits of the underlying PEFT method.
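A simple way to picture the merge is uniform weight averaging across the modules of the earlier `MixtureOfAdapters` sketch; this is an illustration, and the paper's exact merging formula may differ.

```python
import torch

@torch.no_grad()
def merge_adapters(mixture) -> None:
    """Collapse a mixture of adapters into its first module by weight averaging.

    A simple uniform average across experts, sketched for illustration. After
    merging, inference uses a single module with the same parameter count and
    FLOPs as the underlying PEFT method.
    """
    for group in (mixture.down, mixture.up):
        avg_weight = torch.stack([m.weight for m in group]).mean(dim=0)
        avg_bias = torch.stack([m.bias for m in group]).mean(dim=0)
        group[0].weight.copy_(avg_weight)
        group[0].bias.copy_(avg_bias)
    # Deployment then routes every input to index 0 (see the routing sketch above).
```

After merging, every input is routed to the single consolidated module, so serving cost and storage match the underlying PEFT method exactly.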
This ingenious combination allows AdaMix to capture the benefits of ensemble-like learning (multiple “experts” or “views”) during training, then distill that knowledge into a compact, deployable form. This approach also draws conceptual links to Bayesian Neural Networks and model ensembling, where combining multiple perspectives often leads to more robust and accurate predictions.
Real-World Impact: Democratizing Advanced AI
The practical benefits of AdaMix are transformative. Consider a large enterprise needing to deploy a single foundational LLM for hundreds of diverse internal applications – from specialized legal document analysis to customer support bots for different product lines, or internal knowledge management systems. Each application requires fine-tuning. With traditional methods, this would mean storing hundreds of copies of a colossal PLM, leading to immense server costs, deployment complexity, and slow iteration cycles.
With AdaMix, the enterprise can keep a single, frozen copy of the base PLM. For each new task, only a minuscule set of AdaMix adaptation parameters (as low as 0.8MB compared to 355MB for a RoBERTa-large encoder in the study) needs to be stored and loaded. This drastically reduces VRAM consumption, allows more tasks to share the same hardware, and slashes the storage cost for introducing new AI capabilities by hundreds of times. Such efficiency makes it feasible to deploy highly specialized AI across an organization, fostering innovation and reducing time-to-market for AI-powered solutions.
3 Actionable Steps for AI Practitioners
- Deep Dive into PEFT: If you’re working with large language models, familiarize yourself with existing parameter-efficient fine-tuning techniques like LoRA and adapters. Understanding their mechanics will provide a solid foundation for appreciating AdaMix’s advancements.
- Evaluate AdaMix for Resource-Constrained Environments: For projects where both top-tier performance and strict resource efficiency (memory, storage, compute) are critical, investigate AdaMix. The provided code and models offer an excellent starting point for experimentation.
- Explore Mixture-Based Architectures: Beyond AdaMix, consider how “mixture” concepts (e.g., Mixture-of-Experts) can be applied to other areas of your AI model design to achieve scalability, sparsity, and specialized learning without exploding model size.
Conclusion
AdaMix represents a significant leap forward in the field of large language model fine-tuning. By innovatively combining mixture-of-experts principles with parameter-efficient techniques, and introducing clever mechanisms like stochastic routing and adaptation module merging, it delivers state-of-the-art performance while training only a minuscule fraction (0.1% – 0.2%) of the model’s parameters. This not only solves critical challenges related to storage, computational cost, and memory consumption but also opens new avenues for deploying highly specialized and powerful AI models more widely and economically.
Ready to revolutionize your approach to AI model deployment? Delve deeper into the official research paper and explore the practical implementation of AdaMix.
Frequently Asked Questions (FAQ)
What is AdaMix?
AdaMix is a novel Parameter-Efficient Fine-Tuning (PEFT) method designed to improve the performance of large pre-trained language models (PLMs) while only updating a minuscule fraction (0.1% – 0.2%) of their parameters. It achieves this by using a mixture of adaptation modules and smart routing mechanisms.
How does AdaMix reduce computational cost and memory?
AdaMix significantly reduces computational cost and memory by only fine-tuning a tiny percentage of the PLM’s parameters. During inference, its module merging mechanism ensures that the deployed model has the same number of adaptation parameters as a single adaptation module, minimizing VRAM and storage requirements. It enables multiple tasks to share the same frozen PLM, further cutting costs.
Does AdaMix outperform full model fine-tuning?
Yes, AdaMix is reported to not only outperform existing state-of-the-art PEFT methods but also surpass the performance of full model fine-tuning for various Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, despite training only 0.1% – 0.2% of parameters.
What are the key mechanisms behind AdaMix’s efficiency?
AdaMix leverages stochastic routing to direct input examples to specific adaptation modules, consistency regularization and module sharing for training stability, and an adaptation module merging mechanism to consolidate multiple modules into a single one post-training, thus maintaining low parameter count for deployment.
Where can I access the AdaMix code and models?
The code and models for AdaMix are made available by the authors at https://aka.ms/AdaMix.