LoRA (machine learning)

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large language models and other deep neural networks. Introduced in 2021 by researchers at Microsoft, LoRA enables adaptation of pre-trained models to specific tasks while requiring significantly fewer computational resources and trainable parameters than traditional full model fine-tuning. [1]

Background

The development of increasingly large language models in the late 2010s and early 2020s created substantial computational challenges. GPT-1, released in 2018 with 117 million parameters, cost less than $50,000 to train. [2] GPT-2, released in 2019 with 1.5 billion parameters, required $40,000 to train. [2]

By 2020, GPT-3 scaled to 175 billion parameters, with training costs estimated between $500,000 and $4.6 million. [3] Training consumed approximately 1,287 megawatt-hours of electricity and produced 502 metric tons of carbon emissions. [4] GPT-4, released in 2023, required over $100 million to train and consumed approximately 50 gigawatt-hours of energy using 25,000 Nvidia A100 GPUs running for 90 to 100 days. [5] GPT-5, released in August 2025, required individual training runs costing over $500 million each, with total training costs estimated between $1.25 billion and $2.5 billion. [6] [7] These escalating costs made adapting such models to specific tasks through traditional full fine-tuning prohibitively expensive for most researchers and organizations.

Purpose

LoRA works by decomposing weight update matrices into low-rank representations. Rather than updating all parameters in a neural network during fine-tuning, LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. [1] Concretely, the update ΔW to a weight matrix W is represented as the product BA of two much smaller matrices, and only B and A are trained; their shared inner dimension r, the rank, is chosen to be far smaller than the dimensions of W. This exploits the hypothesis that weight updates during fine-tuning have low "intrinsic rank," meaning the changes can be effectively represented with far fewer parameters than the full weight matrix. [1]
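
The following is a minimal sketch of this idea in PyTorch, not the reference implementation; the class name LoRALinear and the default hyperparameters r and alpha are illustrative, while the alpha/r scaling and the zero initialization of B follow the original paper. [1]

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        d_out, d_in = base.weight.shape
        # A starts with small random values and B starts at zero, so the
        # update B @ A is zero at initialization and training begins from
        # the unmodified base model.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank correction (alpha / r) * x A^T B^T.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Only A and B receive gradients, so a d_out × d_in layer contributes r · (d_out + d_in) trainable parameters instead of d_out · d_in.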

When applied to GPT-3, LoRA reduced trainable parameters by approximately 10,000 times (from 175 billion to roughly 18 million) and GPU memory requirements during training by 3 times (from 1.2 terabytes to 350 gigabytes). [1] [8] The technique applies broadly to any dense layers in deep learning models, though it has been most extensively studied in the context of large language models. [1] After training, LoRA adapter weights can be merged with the base model weights, resulting in no additional inference latency during deployment. [1]
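
The 18 million figure can be checked with back-of-the-envelope arithmetic, assuming (as in the paper's GPT-3 experiments) rank-4 adapters applied only to the query and value projection matrices in each of GPT-3's 96 Transformer layers, with a model dimension of 12,288: [1]

```python
# Rough trainable-parameter count for LoRA on GPT-3 (illustrative arithmetic).
d_model = 12288  # GPT-3 hidden size
layers = 96      # Transformer layers in GPT-3
r = 4            # LoRA rank
matrices = 2     # adapters on the query and value projections only

# Each adapted d_model x d_model matrix gets A (r x d_model) and B (d_model x r).
lora_params = layers * matrices * r * (d_model + d_model)
print(f"{lora_params:,} trainable parameters")   # 18,874,368 -- roughly 18 million
print(f"{175e9 / lora_params:,.0f}x reduction")  # ~9,272x, on the order of 10,000x
```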

Uses

A primary use of LoRA is creating customized versions of large models at dramatically reduced cost. The adapter weights trained through LoRA can be folded back into the original base model, producing a new full-scale specialized model for a far lower cost than retraining the entire model. [1] This allows organizations to create domain-specific versions of models like GPT-3 (175 billion parameters) while only bearing the computational cost of training a small adapter (18 million parameters), rather than the prohibitive expense of full model retraining. Once merged, the resulting model can achieve performance comparable to traditional fine-tuning while requiring a fraction of the resources to create.
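
A sketch of this merge step, reusing the illustrative LoRALinear wrapper from the sketch above: the low-rank product is added into the base weights once, after which the adapter matrices can be discarded and the layer behaves like an ordinary linear layer.

```python
@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold the low-rank update into the frozen base weights (illustrative)."""
    # W' = W + (alpha / r) * B @ A -- a one-time addition, so inference
    # afterwards runs the plain linear layer with no extra latency.
    layer.base.weight += layer.scaling * (layer.B @ layer.A)
    return layer.base
```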

Alternatively, organizations can maintain a single base model alongside multiple small LoRA adapters, each specialized for different tasks or domains. For example, a 175 billion parameter base model could be paired with separate 18 million parameter adapters for customer service, legal analysis, and medical applications. This approach dramatically reduces storage requirements compared to maintaining multiple full-scale fine-tuned models, as each adapter requires less than one percent of the storage space of a complete model. [1]
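
An illustrative size comparison, assuming 16-bit weights (2 bytes per parameter) and the figures above:

```python
# Approximate storage for one base model versus per-task copies (illustrative).
bytes_per_param = 2                            # fp16/bf16 weights
full_model_gb = 175e9 * bytes_per_param / 1e9  # ~350 GB per fine-tuned copy
adapter_mb = 18e6 * bytes_per_param / 1e6      # ~36 MB per LoRA adapter
print(f"full model: {full_model_gb:.0f} GB, adapter: {adapter_mb:.0f} MB")
# Three full fine-tuned models: ~1,050 GB. One base model plus three
# adapters: ~350 GB + ~108 MB.
```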

LoRA also enables dynamic adapter swapping, where different adapters can be loaded and applied to the same base model without reloading the entire model into memory. This allows systems to switch between specialized tasks efficiently. Multiple adapters can also be combined by merging their weight updates, either with each other or with the base model, to create models with blended capabilities. [8]
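
A minimal sketch of swapping and blending, again using the conventions from the sketches above; the function name set_adapters and the (A, B, scaling) adapter format are assumptions for the example, not a standard API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def set_adapters(layer: nn.Linear, base_weight: torch.Tensor,
                 adapters: list[tuple[torch.Tensor, torch.Tensor, float]]) -> None:
    """Re-point a resident linear layer at one or more (A, B, scaling) adapters.

    base_weight is the original pre-trained weight, kept around unmodified
    so adapters can be swapped without reloading the model.
    """
    new_weight = base_weight.clone()
    for A, B, scaling in adapters:
        new_weight += scaling * (B @ A)  # summing several updates blends them
    layer.weight.copy_(new_weight)
```

Passing a single adapter swaps in one specialty; passing several merges their weight updates, as described above. Serving systems can also leave the low-rank term unmerged and compute it alongside the base layer, so that switching tasks only changes which small matrices are multiplied in.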

References

  1. Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu (2022). "LoRA: Low-Rank Adaptation of Large Language Models". International Conference on Learning Representations.
  2. 1 2 "AI Cheat Sheet: Large Language Foundation Model Training Costs". PYMNTS. 2025-02-10. Retrieved 2026-01-22.
  3. "What is the cost of training large language models?". CUDO Compute. 2025-05-12. Retrieved 2026-01-22.
  4. "Optimization could cut the carbon footprint of AI training by up to 75%". University of Michigan. 2023-04-19. Retrieved 2026-01-22.
  5. "The Cost of AI: Breakdown of Investments in Training, Infrastructure and More". Forward Future. 2025-05-05. Retrieved 2026-01-22.
  6. "OpenAI GPT-5 is costing $500 Million per training run and still failing". Fanatical Futurist. 2025-05-30. Retrieved 2026-01-22.
  7. "All You Need to Know About GPT-5 & OpenAI's 2025 Roadmap". Fello AI. 2025-02-13. Retrieved 2026-01-22.
  8. 1 2 "What is LoRA (Low-Rank Adaption)?". IBM. 2024-11-17. Retrieved 2026-01-22.