LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large language models and other deep neural networks. Introduced in 2021 by researchers at Microsoft, LoRA enables adaptation of pre-trained models to specific tasks while requiring significantly fewer computational resources and trainable parameters than traditional full model fine-tuning. [1]
The development of increasingly large language models in the late 2010s and early 2020s created substantial computational challenges. GPT-1, released in 2018 with 117 million parameters, cost less than $50,000 to train. [2] GPT-2, released in 2019 with 1.5 billion parameters, required $40,000 to train. [2]
By 2020, GPT-3 scaled to 175 billion parameters, with training costs estimated between $500,000 and $4.6 million. [3] Training consumed approximately 1,287 megawatt-hours of electricity and produced 502 metric tons of carbon emissions. [4] GPT-4, released in 2023, required over $100 million to train and consumed approximately 50 gigawatt-hours of energy using 25,000 Nvidia A100 GPUs running for 90 to 100 days. [5] GPT-5, released in August 2025, required individual training runs costing over $500 million each, with total training costs estimated between $1.25 billion and $2.5 billion. [6] [7] These escalating costs made adapting such models to specific tasks through traditional fine-tuning prohibitively expensive for most researchers and organizations.
LoRA works by decomposing weight update matrices into lower-rank representations. Rather than updating all parameters in a neural network during fine-tuning, LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. [1] This approach is grounded in linear algebra and exploits the hypothesis that weight updates during fine-tuning have low "intrinsic rank," meaning the changes can be effectively represented with fewer parameters than the full weight matrix. [1]
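Concretely, for a pre-trained weight matrix W of shape d × k, LoRA represents the update as the product of two much smaller matrices, ΔW = BA, with B of shape d × r and A of shape r × k, where the rank r is far smaller than d or k. A minimal sketch of this idea in PyTorch follows; the class and argument names are illustrative, and the rank of 8 is an arbitrary example rather than a setting from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update.

    The effective weight is W + (alpha / r) * (B @ A); only A (r x k)
    and B (d x r) are trained. Names here are illustrative.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # B starts at zero, so the update B @ A is zero at initialization
        # and training begins from the unmodified pre-trained model.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping selected projections of each Transformer block (for example, the query and value projections studied in the paper) with such a layer and training only A and B reproduces the basic setup.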
When applied to GPT-3, LoRA reduced the number of trainable parameters by a factor of roughly 10,000 (from 175 billion to about 18 million) and cut GPU memory requirements during training threefold (from 1.2 terabytes to 350 gigabytes). [1] [8] The technique applies broadly to any dense layers in deep learning models, though it has been most extensively studied in the context of large language models. [1] After training, LoRA adapter weights can be merged with the base model weights, resulting in no additional inference latency during deployment. [1]
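The magnitude of the reduction can be checked with simple arithmetic. The sketch below assumes a square 12,288 × 12,288 projection matrix (GPT-3's hidden size) and an arbitrary rank of 8; the headline 10,000-fold figure is larger still because LoRA is applied to only a small subset of the model's weight matrices:

```python
d = k = 12_288                    # GPT-3's hidden dimension
r = 8                             # illustrative rank, not the paper's exact setting

full_update = d * k               # parameters in a full update of one matrix
lora_update = r * (d + k)         # parameters in B (d x r) plus A (r x k)

print(f"{full_update:,}")           # 150,994,944
print(f"{lora_update:,}")           # 196,608
print(full_update // lora_update)   # 768x fewer per adapted matrix
```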
A primary use of LoRA is creating customized versions of large models at dramatically reduced cost. The adapter weights trained through LoRA can be folded back into the original base model, producing a new full-scale specialized model at a far lower cost than retraining the entire model. [1] This allows organizations to create domain-specific versions of models like GPT-3 (175 billion parameters) while bearing only the computational cost of training a small adapter (18 million parameters), rather than the prohibitive expense of full model retraining. Once merged, the resulting model can achieve performance comparable to traditional fine-tuning while requiring a fraction of the resources to create.
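Merging amounts to a one-time matrix addition: the low-rank product is folded into the frozen weight, after which the adapter tensors can be discarded. A hedged sketch, reusing the illustrative LoRALinear class above:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_adapter(layer: "LoRALinear") -> nn.Linear:
    """Fold the trained low-rank update into the frozen base weight.

    The result is a plain nn.Linear, so the merged model adds no
    inference latency over the original architecture.
    """
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```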
Alternatively, organizations can maintain a single base model alongside multiple small LoRA adapters, each specialized for different tasks or domains. For example, a 175 billion parameter base model could be paired with separate 18 million parameter adapters for customer service, legal analysis, and medical applications. This approach dramatically reduces storage requirements compared to maintaining multiple full-scale fine-tuned models, as each adapter requires less than one percent of the storage space of a complete model. [1]
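The storage savings follow directly from the parameter counts. Assuming 16-bit (2-byte) weights and the figures cited above, three task-specific adapters occupy tens of megabytes where three full fine-tuned copies would occupy more than a terabyte:

```python
BYTES_PER_PARAM = 2                   # 16-bit weights
base_params = 175e9                   # full-scale base model
adapter_params = 18e6                 # one LoRA adapter

three_full_copies = 3 * base_params * BYTES_PER_PARAM / 1e9
one_base_plus_three = (base_params + 3 * adapter_params) * BYTES_PER_PARAM / 1e9

print(f"{three_full_copies:.0f} GB")    # 1050 GB for three fine-tuned copies
print(f"{one_base_plus_three:.1f} GB")  # 350.1 GB for base + three adapters
```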
LoRA also enables dynamic adapter swapping, where different adapters can be loaded and applied to the same base model without reloading the entire model into memory. This allows systems to switch between specialized tasks efficiently. Multiple adapters can also be combined by merging their weight updates, either with each other or with the base model, to create models with blended capabilities. [8]
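A minimal sketch of both operations, again reusing the illustrative LoRALinear class: swapping replaces only the small A and B tensors while the base weights stay resident, and a simple form of combination folds a weighted sum of several adapters' updates into the base weight:

```python
import torch

@torch.no_grad()
def load_adapter(layer: "LoRALinear", A: torch.Tensor, B: torch.Tensor):
    """Swap in another task's adapter; the base model stays in memory."""
    layer.A.copy_(A)
    layer.B.copy_(B)

@torch.no_grad()
def blend_into_base(layer: "LoRALinear", adapters, weights):
    """Merge several adapters' updates into the base weight.

    Each update B_i @ A_i is scaled by a mixing weight w_i and added
    to the frozen weight; a weighted sum of low-rank updates is not
    itself rank-r in general, so it is folded in rather than kept as
    a single (A, B) pair.
    """
    for (A, B), w in zip(adapters, weights):
        layer.base.weight += w * layer.scale * (B @ A)
```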