Reasoning language model

Reasoning language models are artificial intelligence systems that combine natural language processing with structured reasoning capabilities. These models are usually constructed from pretrained language models by prompting, supervised finetuning (SFT), and reinforcement learning (RL).

Prompting

A language model is a generative model of a training dataset of texts. Prompting means constructing a text prompt such that, conditioned on it, the language model generates a solution to the task. Prompting can be applied to a pretrained model ("base model"), or to a base model that has undergone SFT, RL, or both. [1]

Chain of thought

Chain of Thought prompting (CoT) prompts the model to answer a question by first generating a "chain of thought", i.e. steps of reasoning that mimic a train of thought. [2] It was published in 2022 by the Brain team of Google on the PaLM-540B model. [3] In CoT prompting, the prompt is of the form "<Input> Let's think step by step", and the model responds with a chain of reasoning steps that ends with an answer.

Tree of Thought prompting generalizes CoT by prompting the model to generate one or more "possible next steps", and then running the model on each of the possible next steps using breadth-first, beam, or some other method of tree search. [4] Graph of Thought generalizes CoT further, so that the reasoning steps form a directed acyclic graph. [5]
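
A minimal sketch of zero-shot CoT prompting, assuming a hypothetical `generate(prompt)` text-completion function:

```python
# Zero-shot chain-of-thought prompting (a sketch; `generate` is a hypothetical
# stand-in for any text-completion call).

def cot_prompt(question: str) -> str:
    # The trailing cue elicits intermediate reasoning steps before the final answer.
    return f"Q: {question}\nA: Let's think step by step."

def answer_with_cot(question: str, generate) -> str:
    # The completion contains the chain of reasoning followed by the answer.
    return generate(cot_prompt(question))
```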

Self-consistency decoding performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts. [6] If the rollouts disagree significantly, a human can be queried for the correct chain of thought. [7]
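
A minimal sketch of self-consistency decoding, assuming hypothetical `generate` (samples one CoT completion at nonzero temperature) and `extract_answer` (parses the final answer) helpers:

```python
from collections import Counter

def self_consistency(question, generate, extract_answer, n=20):
    """Sample n chain-of-thought rollouts and return the most common final answer,
    together with its agreement rate across rollouts."""
    answers = [extract_answer(generate(question)) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n  # a low agreement rate could trigger a query to a human
```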

Retrieval-augmented generation

A language model may answer a query by first retrieving relevant documents from a database. The document retrieval can be via a vector database, summary index, tree index, or keyword table index. [8] Following document retrieval, the LLM generates an output that incorporates information from both the query and the retrieved documents. [9]
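
A minimal sketch of retrieval-augmented generation, assuming hypothetical `embed`, `vector_index.search`, and `generate` stand-ins for an embedding model, a nearest-neighbour index, and an LLM call:

```python
def retrieve_then_generate(query, embed, vector_index, generate, k=4):
    """Answer a query by retrieving the top-k documents from a vector database and
    conditioning the language model on them."""
    docs = vector_index.search(embed(query), k=k)   # top-k most similar documents
    context = "\n\n".join(doc.text for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                          # answer grounded in the retrieved documents
```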

Tool use

Language models can carry out long chains of reasoning by calling external tools, such as numerical routines, program interpreters, and APIs. This can be prompt-engineered by describing the external tools in-context (an example of in-context learning), or finetuned into the model. [10]
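
A minimal sketch of in-context tool use, assuming a hypothetical `generate` call and a toy JSON convention for tool calls (real systems such as Toolformer use their own call formats):

```python
import json
import math

# Toy set of external tools described to the model in the prompt.
TOOLS = {"sqrt": math.sqrt, "log": math.log}

def run_with_tools(prompt, generate):
    """Describe the tools in-context; if the model replies with a tool call,
    execute it and feed the result back for a final answer."""
    tool_doc = 'You may call a tool by replying with JSON {"tool": <name>, "arg": <number>}.'
    reply = generate(tool_doc + "\n" + prompt)
    try:
        call = json.loads(reply)
        result = TOOLS[call["tool"]](call["arg"])
        return generate(f"{prompt}\nTool result: {result}\nFinal answer:")
    except (ValueError, KeyError, TypeError):
        return reply  # the model answered directly, without calling a tool
```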

Supervised finetuning

A base model can be finetuned on a dataset of reasoning tasks with example solutions and reasoning traces. The finetuned model would then be able to generate reasoning traces for a given problem. [11] [12]

As it is expensive to get humans to write reasoning traces for an SFT dataset, researchers have proposed ways to automatically construct SFT datasets. In rejection sampling finetuning (RFT), new reasoning traces are collected via the following loop (a sketch in code appears after the list): [13]

  1. Sample a task prompt
  2. Generate many reasoning traces for the prompt.
  3. Use a verifier to remove reasoning traces with the wrong final answer.
  4. For each remaining trace, extract the set of equations appearing in it. Deduplicate the traces so that each one has a different set of equations. Add those to the dataset.
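
A minimal sketch of this loop, assuming hypothetical `generate`, `verify`, and `extract_equations` helpers for sampling a reasoning trace, checking its final answer, and pulling out the equations it contains:

```python
def rejection_sampling_finetuning_data(tasks, generate, verify, extract_equations, k=16):
    """Build an SFT dataset of reasoning traces by rejection sampling (RFT sketch)."""
    dataset = []
    for task in tasks:                               # 1. sample a task prompt
        seen_equation_sets = set()
        for _ in range(k):                           # 2. generate many reasoning traces
            trace = generate(task)
            if not verify(task, trace):              # 3. drop traces with a wrong final answer
                continue
            equations = frozenset(extract_equations(trace))
            if equations not in seen_equation_sets:  # 4. deduplicate by equation set
                seen_equation_sets.add(equations)
                dataset.append({"prompt": task, "trace": trace})
    return dataset
```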

Reinforcement learning

A pretrained language model can be further trained by RL. In the RL formalism, a generative language model is a policy $\pi$. A prompt $x$ specifying a task to solve is an environmental state, and the response $y$ of the language model to the prompt is an action. The probability that the language model responds to prompt $x$ with response $y$ is $\pi(y \mid x)$.

Training a reasoning language model by RL then consists of constructing a reward model to guide the RL process. Intuitively, the reward model describes how desirable, appropriate, or good the response is for the prompt. For a reasoning language model, the prompt describes a reasoning task, and the reward is high if the response solves the task and low if it does not.

For reasoning language models, the model's response $y$ may be broken down into multiple reasoning steps, in which case it is written as $y = (y_1, y_2, \dots, y_n)$.

Outcome Reward Model

Outcome reward model, or outcome-supervised RM (ORM), [11] is a reward model that computes the reward of a step determined only by the final answer: $r(x, y_i) = \mathrm{ORM}(x, y_n)$. They are also called "verifiers".

For tasks with an answer that is easy to verify, such as math word problems, the outcome reward can simply be binary: 1 if the final answer is correct, and 0 otherwise. [11] If the answer is not easy to verify programmatically, humans can manually label the answers as correct or not, and the labels can then be used to finetune a base model that predicts the human label. [12] For other kinds of tasks, such as creative writing, where task performance is not binary true/false, one can train a reward model by finetuning a base model on human-ranked preference data, as is done in reinforcement learning from human feedback. [14] A base model can also be finetuned to predict, given a partial thinking trace $(x, y_1, \dots, y_i)$, whether the final answer will be correct, and this prediction can be used as a binary reward signal. [11]
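
For the easily verified case, the outcome reward reduces to a comparison against a reference answer. A minimal sketch, assuming a hypothetical `extract_answer` parser:

```python
def outcome_reward(response: str, reference_answer: str, extract_answer) -> float:
    """Binary outcome reward: 1 if the extracted final answer matches the
    reference, 0 otherwise (the 'verifier' signal for math word problems)."""
    return 1.0 if extract_answer(response) == reference_answer else 0.0
```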

The ORM is usually trained via logistic regression, i.e. minimizing cross-entropy loss. [15]

Given a PRM, an ORM can be constructed by multiplying the process rewards of the steps in the reasoning trace, [14] by taking their minimum, [15] or by some other method of aggregating the process rewards.
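
A minimal sketch of these two aggregation rules, assuming the per-step PRM scores lie in [0, 1]:

```python
import math

def orm_from_prm(process_rewards, how="product"):
    """Aggregate the per-step process rewards of one trace into an outcome-level score."""
    if how == "product":
        return math.prod(process_rewards)  # the trace is good only if every step is good
    if how == "min":
        return min(process_rewards)        # the trace is only as good as its weakest step
    raise ValueError(f"unknown aggregation: {how}")
```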

Process Reward Model

Process reward model, or process-supervised RM (PRM), [11] is a reward model that computes the reward of a step determined by the steps so far: $r(x, y_i) = \mathrm{PRM}(x, y_1, \dots, y_i)$.

Given a partial thinking trace $(x, y_1, \dots, y_i)$, a human can be queried as to whether the steps so far are correct, regardless of whether the ultimate answer would be correct. This yields a binary reward signal. As human labels are expensive, a base model can be finetuned to predict the human labels. [11] The PRM is usually trained via logistic regression, i.e. by minimizing cross-entropy loss. [15]

As an example, in a 2023 OpenAI paper, 800K process labels were collected for 75K solution traces. A labeler would be presented with a solution trace and would label each step "positive" if it progressed towards the solution, "neutral" if it was not wrong but did not progress towards the solution, and "negative" if it was a mistake. As soon as a "negative" label was entered, the labeler stopped labeling that thinking trace and began labeling another one. The idea was that, while labeling subsequent reasoning steps could provide even richer supervision signals, simply labeling up to the first error was sufficient for training a competent PRM. [14] [16]

As human labels are expensive, researchers have proposed methods to create PRMs without human labels on the processes. Inspired by Monte Carlo tree search (MCTS), the Math-Shepherd method samples multiple continuations to a final answer, starting at each reasoning step $y_i$, and sets the reward at that step to be $\frac{\#\{\text{correct continuations}\}}{\#\{\text{continuations}\}}$ in the case of "soft estimation", or $\mathbb{1}[\text{at least one continuation is correct}]$ in the case of "hard estimation". This creates process reward labels using only an ORM, which is usually easier or cheaper to construct. After creating these process reward labels, a PRM can be trained on them. [15] Some have tried a fully MCTS approach. [17]
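
A minimal sketch of the Monte Carlo labeling step, assuming hypothetical `generate_continuation` and `verify` helpers (sampling a completion from a step prefix, and checking its final answer with an ORM-style verifier):

```python
def shepherd_step_labels(problem, partial_steps, generate_continuation, verify, n=8):
    """Estimate Math-Shepherd-style process labels for a reasoning-step prefix."""
    outcomes = [verify(problem, generate_continuation(problem, partial_steps))
                for _ in range(n)]
    soft = sum(outcomes) / n               # fraction of rollouts reaching a correct answer
    hard = 1.0 if any(outcomes) else 0.0   # 1 if any rollout reaches a correct answer
    return soft, hard
```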

One can also use an ORM to implicitly construct a PRM, similar to direct preference optimization. [18]

For example, DeepSeek-R1 was trained with simple rule-based outcome rewards, checking answer correctness and output format, rather than a learned reward model, using the GRPO algorithm introduced by DeepSeekMath. [19] [20]

Guided sampling

A trained ORM can be used to select the best response. The policy rolls out multiple responses, and the trained ORM selects the best one. This allows a simple form of test-time compute scaling ("best-of-N"). [12] [21]
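
A minimal sketch of best-of-N sampling, assuming hypothetical `generate` and `orm_score` callables:

```python
def best_of_n(problem, generate, orm_score, n=16):
    """Draw n candidate responses and return the one the trained ORM scores highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda response: orm_score(problem, response))
```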

A trained PRM can also be used to guide reasoning by greedy tree search: the policy model generates several possible next reasoning steps, the PRM selects the best one, and the process repeats. This mirrors the way a trained ORM is used to select a best response. [22] Beam search performs better than greedy search.
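
A minimal sketch of PRM-guided greedy step search, assuming hypothetical `propose_steps`, `prm_score`, and `is_finished` helpers (beam search would keep several prefixes at each iteration instead of one):

```python
def prm_greedy_search(problem, propose_steps, prm_score, is_finished, max_steps=32):
    """Grow a reasoning trace one step at a time, always taking the step the PRM
    scores highest given the steps so far."""
    steps = []
    for _ in range(max_steps):
        candidates = propose_steps(problem, steps)  # several possible next reasoning steps
        best = max(candidates, key=lambda step: prm_score(problem, steps + [step]))
        steps.append(best)
        if is_finished(problem, steps):
            break
    return steps
```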

Lookahead search is another tree search method, in which the policy model generates several possible next reasoning steps and then makes a (partial) rollout for each. If a solution endpoint is reached during the forward simulation, the process halts early. Otherwise, the PRM is used to calculate the total reward for each rollout, and the step whose rollout receives the highest reward is selected. [23]

Self-consistency can be combined with an ORM. The model generates multiple answers, and the answers are clustered so that each cluster contains responses with the same final answer. The ORM computes the reward for each response, the rewards within each cluster are summed, and the answer corresponding to the cluster with the highest summed reward is output. [15]
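
A minimal sketch of ORM-weighted self-consistency, assuming hypothetical `generate`, `extract_answer`, and `orm_score` callables:

```python
from collections import defaultdict

def weighted_self_consistency(problem, generate, extract_answer, orm_score, n=32):
    """Cluster sampled responses by final answer and output the answer whose
    cluster has the largest summed ORM reward."""
    cluster_reward = defaultdict(float)
    for _ in range(n):
        response = generate(problem)
        cluster_reward[extract_answer(response)] += orm_score(problem, response)
    return max(cluster_reward, key=cluster_reward.get)
```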

Applications

Prompt engineering was discovered with GPT-3 as "few-shot learning", [24] which began a period of research into "eliciting" the capabilities of pretrained language models. It was then found that a model could be prompted to perform CoT reasoning, which improves its performance on reasoning tasks.

Benchmarks

The reasoning ability of language models is usually tested on problems that have unambiguous solutions that can be cheaply checked, and that require reasoning when solved by a human. Such problems are usually drawn from mathematics and competitive programming. The answer is usually an array of integers, a multiple-choice letter, or a program that passes unit tests within a limited runtime.

Benchmark scores include pass@n: the probability that at least one of n sampled attempts solves the problem.

The pass@n score can be estimated more accurately by making $N \ge n$ attempts and using the unbiased estimator $1 - \binom{N-c}{n} / \binom{N}{n}$, where $c$ is the number of correct attempts. [29]
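
A minimal sketch of this estimator in code (the function name is hypothetical):

```python
from math import comb

def pass_at_n(N: int, c: int, n: int) -> float:
    """Unbiased estimate of pass@n from N >= n attempts, of which c were correct:
    1 - C(N - c, n) / C(N, n)."""
    if N - c < n:
        return 1.0  # every size-n subset of the attempts contains a correct one
    return 1.0 - comb(N - c, n) / comb(N, n)
```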

References

  1. Qiao, Shuofei; Ou, Yixin; Zhang, Ningyu; Chen, Xiang; Yao, Yunzhi; Deng, Shumin; Tan, Chuanqi; Huang, Fei; Chen, Huajun (2023-09-18), Reasoning with Language Model Prompting: A Survey, arXiv: 2212.09597
  2. Wei, Jason; Wang, Xuezhi; Schuurmans, Dale; Bosma, Maarten; Ichter, Brian; Xia, Fei; Chi, Ed H.; Le, Quoc V.; Zhou, Denny (31 October 2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS 2022). Vol. 35. arXiv: 2201.11903 .
  3. Sharan Narang and Aakanksha Chowdhery (2022-04-04). "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance".
  4. Yao, Shunyu; Yu, Dian; Zhao, Jeffrey; Shafran, Izhak; Griffiths, Thomas L.; Cao, Yuan; Narasimhan, Karthik (2023-05-17). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models". arXiv: 2305.10601 [cs.CL].
  5. Besta, Maciej; Blach, Nils; Kubicek, Ales; Gerstenberger, Robert; Podstawski, Michal; Gianinazzi, Lukas; Gajda, Joanna; Lehmann, Tomasz; Niewiadomski, Hubert; Nyczyk, Piotr; Hoefler, Torsten (2024-03-24). "Graph of Thoughts: Solving Elaborate Problems with Large Language Models". Proceedings of the AAAI Conference on Artificial Intelligence. 38 (16): 17682–17690. doi:10.1609/aaai.v38i16.29720. ISSN   2374-3468.
  6. Wang, Xuezhi; Wei, Jason; Schuurmans, Dale; Le, Quoc; Chi, Ed; Narang, Sharan; Chowdhery, Aakanksha; Zhou, Denny (2022-03-01). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". arXiv: 2203.11171 [cs.CL].
  7. Diao, Shizhe; Wang, Pengcheng; Lin, Yong; Zhang, Tong (2023-02-01). "Active Prompting with Chain-of-Thought for Large Language Models". arXiv: 2302.12246 [cs.CL].
  8. "How Each Index Works - LlamaIndex 🦙 v0.10.17". docs.llamaindex.ai. Retrieved 2024-04-08.
  9. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 9459–9474. arXiv: 2005.11401 .
  10. Schick, Timo; Dwivedi-Yu, Jane; Dessi, Roberto; Raileanu, Roberta; Lomeli, Maria; Hambro, Eric; Zettlemoyer, Luke; Cancedda, Nicola; Scialom, Thomas (2023-12-15). "Toolformer: Language Models Can Teach Themselves to Use Tools". Advances in Neural Information Processing Systems. 36: 68539–68551. arXiv: 2302.04761 .
  11. Uesato, Jonathan; Kushman, Nate; Kumar, Ramana; Song, Francis; Siegel, Noah; Wang, Lisa; Creswell, Antonia; Irving, Geoffrey; Higgins, Irina (2022-11-25), Solving math word problems with process- and outcome-based feedback, arXiv: 2211.14275
  12. Cobbe, Karl; Kosaraju, Vineet; Bavarian, Mohammad; Chen, Mark; Jun, Heewoo; Kaiser, Lukasz; Plappert, Matthias; Tworek, Jerry; Hilton, Jacob (2021-11-18), Training Verifiers to Solve Math Word Problems, arXiv: 2110.14168
  13. Yuan, Zheng; Yuan, Hongyi; Li, Chengpeng; Dong, Guanting; Lu, Keming; Tan, Chuanqi; Zhou, Chang; Zhou, Jingren (2023-09-13), Scaling Relationship on Learning Mathematical Reasoning with Large Language Models, arXiv: 2308.01825
  14. Lightman, Hunter; Kosaraju, Vineet; Burda, Yura; Edwards, Harri; Baker, Bowen; Lee, Teddy; Leike, Jan; Schulman, John; Sutskever, Ilya (2023-05-31), Let's Verify Step by Step, arXiv: 2305.20050
  15. Wang, Peiyi; Li, Lei; Shao, Zhihong; Xu, Runxin; Dai, Damai; Li, Yifei; Chen, Deli; Wu, Yu; Sui, Zhifang (August 2024). Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek (eds.). "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations". Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics: 9426–9439. arXiv: 2312.08935. doi:10.18653/v1/2024.acl-long.510.
  16. openai/prm800k, OpenAI, 2025-01-27, retrieved 2025-01-27
  17. Chen, Guoxin; Liao, Minpeng; Li, Chengxi; Fan, Kai (2024-09-27), AlphaMath Almost Zero: Process Supervision without Process, arXiv: 2405.03553
  18. Yuan, Lifan; Li, Wendi; Chen, Huayu; Cui, Ganqu; Ding, Ning; Zhang, Kaiyan; Zhou, Bowen; Liu, Zhiyuan; Peng, Hao (2024-12-02), Free Process Rewards without Process Labels, arXiv: 2412.01981
  19. DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv: 2501.12948
  20. Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, Y. K. (2024-04-27), DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, arXiv: 2402.03300
  21. Zhang, Di; Wu, Jianbo; Lei, Jingdi; Che, Tong; Li, Jiatong; Xie, Tong; Huang, Xiaoshui; Zhang, Shufei; Pavone, Marco (2024-11-21), LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning, arXiv: 2410.02884
  22. Ma, Qianli; Zhou, Haotian; Liu, Tingkai; Yuan, Jianbo; Liu, Pengfei; You, Yang; Yang, Hongxia (2023-10-16), Let's reward step by step: Step-Level reward model as the Navigators for Reasoning, arXiv: 2310.10080
  23. Snell, Charlie; Lee, Jaehoon; Xu, Kelvin; Kumar, Aviral (2024-08-06), Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, arXiv: 2408.03314
  24. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (2020-12-06). "Language models are few-shot learners". Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS '20. Red Hook, NY, USA: Curran Associates Inc.: 1877–1901. ISBN   978-1-7138-2954-6.
  25. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021-01-12), Measuring Massive Multitask Language Understanding, arXiv: 2009.03300
  26. Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob (2021-11-08), Measuring Mathematical Problem Solving With the MATH Dataset, arXiv: 2103.03874
  27. math-eval (2025-01-26), math-eval/MathEval , retrieved 2025-01-27
  28. Rein, David; Hou, Betty Li; Stickland, Asa Cooper; Petty, Jackson; Pang, Richard Yuanzhe; Dirani, Julien; Michael, Julian; Bowman, Samuel R. (2023-11-20), GPQA: A Graduate-Level Google-Proof Q&A Benchmark, arXiv: 2311.12022
  29. Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas (2021-07-14), Evaluating Large Language Models Trained on Code, arXiv: 2107.03374
  30. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik (2024-11-11), SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, arXiv: 2310.06770
  31. "ARC Prize". ARC Prize. Retrieved 2025-01-27.
  32. "LiveBench". livebench.ai. Retrieved 2025-01-27.
  33. Glazer, Elliot; Erdil, Ege; Besiroglu, Tamay; Chicharro, Diego; Chen, Evan; Gunning, Alex; Olsson, Caroline Falkman; Denain, Jean-Stanislas; Ho, Anson (2024-12-20), FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, arXiv: 2411.04872
  34. DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong (2025-01-22), DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv: 2501.12948