Small language models, also called compact language models, are artificial intelligence language models designed for natural language processing tasks, including text generation. They are smaller in scale and scope than large language models.
A large language model typically contains hundreds of billions of trainable parameters, with some models exceeding a trillion parameters. This substantial parameter count enables the model to encode vast amounts of information, thereby improving the generalizability and accuracy of its outputs. However, training such models demands enormous computational resources, which makes it infeasible for an individual working with a single computer and graphics processing unit.
Small language models, on the other hand, use far fewer parameters, typically ranging from a few thousand to a few hundred million. This makes them more feasible to train and host in resource-constrained environments such as a single computer or even a mobile device. [1] [2] [3] [4] [5]
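The link between parameter count and hosting feasibility can be illustrated with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the function name `weight_memory_gib` and the precision table are assumptions made for this example, and the estimate covers stored weights alone, not activations, caches, or optimizer state.

```python
# Approximate bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, precision="fp16"):
    """Estimate the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # A model at the upper end of the "few hundred million" range,
    # compared with one holding hundreds of billions of parameters.
    for name, n in [("300M-parameter model", 300_000_000),
                    ("300B-parameter model", 300_000_000_000)]:
        print(f"{name}: {weight_memory_gib(n, 'fp16'):.2f} GiB at fp16")
```

Under these assumptions the smaller model's weights fit comfortably in the memory of a phone or laptop, while the larger model's weights alone exceed the memory of any single consumer graphics processing unit.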
Most contemporary (2020s) small language models use the same architecture as a large language model, but with a smaller parameter count and sometimes lower arithmetic precision. Parameter count is reduced by a combination of knowledge distillation and pruning. Precision can be reduced by quantization. Work on large language models mostly translates to small language models: pruning and quantization are also widely used to speed up large language models.
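As a minimal illustration of two of these techniques, the sketch below applies symmetric 8-bit quantization and magnitude pruning to a plain list of weights. These are simple textbook variants chosen for this example, not the method of any particular model; the function names are assumptions, and production systems typically operate on tensors with per-channel or per-group scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    w = [0.5, -1.27, 0.01, 0.2]
    q, scale = quantize_int8(w)
    print("quantized:", q)
    print("restored: ", dequantize(q, scale))
    print("pruned:   ", magnitude_prune(w, 0.5))
```

Quantization keeps every weight but stores it less precisely, whereas pruning keeps full precision and removes the weights that contribute least by magnitude. The two can be combined.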
While most small language models use autoregressive architectures, alternative approaches such as diffusion language models have emerged. Diffusion models generate text through iterative denoising rather than sequential token prediction, offering advantages in parallel generation and factuality. [6] [7]
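The difference in generation order can be sketched with a toy example. The code below is not a real diffusion model: `toy_next_token` and `toy_denoise` are hypothetical stand-ins that return a fixed sentence, and the second generator follows a masked-token style of iterative refinement, which is only one possible form of text diffusion. The point is only to contrast one model call per token with a fixed number of refinement steps over the whole sequence.

```python
import random

MASK = "<mask>"
# Fixed output used by the hypothetical stand-in models below.
TARGET = ["small", "models", "can", "run", "on", "phones"]


def toy_next_token(prefix):
    """Stand-in for an autoregressive model: predicts one token after the prefix."""
    return TARGET[len(prefix)]


def toy_denoise(tokens):
    """Stand-in for a denoiser: predicts every position in a single call."""
    return list(TARGET[:len(tokens)])


def generate_autoregressive(length):
    """Generate left to right, with one model call per token."""
    tokens, calls = [], 0
    for _ in range(length):
        tokens.append(toy_next_token(tokens))
        calls += 1
    return tokens, calls


def generate_by_unmasking(length, steps, seed=0):
    """Start fully masked and reveal a share of positions at each step."""
    rng = random.Random(seed)
    tokens, calls = [MASK] * length, 0
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        predictions = toy_denoise(tokens)  # one call covers all positions
        calls += 1
        # Ceiling division so that every position is filled by the last step.
        n_reveal = -(-len(masked) // (steps - step))
        for i in rng.sample(masked, n_reveal):
            tokens[i] = predictions[i]
    return tokens, calls


if __name__ == "__main__":
    print(generate_autoregressive(6))
    print(generate_by_unmasking(6, steps=3))
```

In this toy setting both generators produce the same six tokens, but the sequential one needs six model calls and the iterative one needs three, because each refinement step fills several positions at once.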
Some notable models are: [2]
Phi-4 14B is marginally "small" at best, but Microsoft does market it as a small model. [8]
Research has shown that pre-training remains effective even at a small scale. Tiny models demonstrate significant performance improvements when pre-trained, with gains that increase with larger pre-training datasets. Classification accuracy improves when pre-training and test datasets share similar tokens. Shallow architectures can replicate deep model performance through collaborative learning. [9]
Architecture choices affect performance at small scales. Research has found that depth-width ratios matter more than absolute parameter counts. For 70-million-parameter models, using 32 layers with 384 hidden dimensions outperforms the standard 12-layer GPT-2 architecture, while modern architectural improvements such as RMSNorm, rotary position embeddings (RoPE), and grouped-query attention (GQA) provide minimal benefits at this scale. [6]
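The trade-off between depth and width under a fixed budget can be made concrete with a common rule of thumb: a standard transformer block with a fourfold feed-forward expansion holds roughly 12·d² weights for hidden size d. The sketch below uses that approximation, which ignores embeddings, biases, and normalization parameters; the function names are assumptions for this example, and the figures are estimates rather than the cited study's exact counts.

```python
import math


def block_params(num_layers, hidden_size):
    """Approximate non-embedding parameters: 4*d^2 attention + 8*d^2 feed-forward per layer."""
    return 12 * num_layers * hidden_size ** 2


def width_for_budget(param_budget, num_layers):
    """Hidden size that spends the same block budget across `num_layers` layers."""
    return math.sqrt(param_budget / (12 * num_layers))


if __name__ == "__main__":
    deep_narrow = block_params(32, 384)
    print(f"32 layers x 384 hidden: about {deep_narrow:,} block parameters")
    # Width a 12-layer model would need to use the same block budget.
    print(f"equivalent 12-layer width: about {width_for_budget(deep_narrow, 12):.0f}")
```

Under this approximation, the 32-layer, 384-dimension configuration holds about 56.6 million parameters in its transformer blocks, and a 12-layer model would need a hidden size of roughly 627 to spend the same block budget. Embedding parameters come on top of this and depend on the vocabulary size.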
Small language models are increasingly adopted in enterprise settings for their lower inference costs, reduced infrastructure requirements, and suitability for on-premises or private cloud deployments in regulated industries such as finance and healthcare. [10] [11] [12]
Organizations often use small language models for high-volume, low-complexity tasks such as classification, routing, and extraction, while reserving large language models for advanced reasoning or multilingual synthesis. [13] [14]
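This division of labor is often implemented as a routing layer placed in front of the models. The sketch below is a deliberately simple, hypothetical version: the task labels, tier names, and functions are assumptions for illustration, and real deployments may route on confidence scores, cost budgets, or learned classifiers instead of a fixed lookup.

```python
from collections import Counter

# Task types treated as high-volume and low-complexity (assumed labels).
SLM_TASKS = {"classification", "routing", "extraction"}


def choose_model(task_type):
    """Return the model tier for a task: "slm" for simple tasks, otherwise "llm"."""
    return "slm" if task_type in SLM_TASKS else "llm"


def tier_counts(task_types):
    """Count how many requests in a batch would go to each tier."""
    return Counter(choose_model(t) for t in task_types)


if __name__ == "__main__":
    batch = ["classification", "extraction", "advanced reasoning",
             "classification", "multilingual synthesis"]
    print(tier_counts(batch))
```

Because simple requests usually dominate traffic, even a crude rule like this one can send most of the volume to the cheaper model.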