Stochastic parrot

In machine learning, the term stochastic parrot is a metaphor to describe the theory that large language models, though able to generate plausible language, do not understand the meaning of the language they process. [1] [2] The term was coined by Emily M. Bender [2] [3] in the 2021 artificial intelligence research paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. [4]

Origin and definition

The term was first used in the paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell (using the pseudonym "Shmargaret Shmitchell"). [4] They argued that large language models (LLMs) present dangers such as environmental and financial costs, inscrutability leading to unknown dangerous biases, and potential for deception, and that LLMs cannot understand the concepts underlying what they learn. [5]

Etymology

The word "stochastic", from the ancient Greek "stokhastikos" ('based on guesswork'), is a term from probability theory meaning "randomly determined". [6] The word "parrot" refers to parrots' ability to mimic human speech without understanding its meaning. [6]

Purpose

In their paper, Bender et al. argue that LLMs probabilistically link words and sentences together without considering meaning. LLMs are therefore labeled mere "stochastic parrots". [4] According to the machine learning professionals Lindholm, Wahlström, Lindsten, and Schön, the analogy highlights two vital limitations: [1] [7]

Lindholm et al. noted that, with poor quality datasets and other limitations, a learning machine might produce results that are "dangerously wrong". [1]
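
The notion of probabilistically linking words can be made concrete with a toy bigram sampler. This is a minimal sketch only: it is far simpler than an LLM and is not taken from Bender et al. The corpus, the function names `train_bigrams` and `parrot`, and the seed are all illustrative assumptions. The sampler picks each next word at random from the words that followed the current word in its training text, so its output can look locally plausible although nothing in the program represents meaning.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus, chosen only for illustration.
CORPUS = "the parrot repeats the phrase and the parrot repeats the sound"

def train_bigrams(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def parrot(followers, start, length, seed=0):
    """Sample up to `length` words by repeatedly drawing a random observed follower."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        options = followers.get(output[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        output.append(rng.choice(options))
    return output

model = train_bigrams(CORPUS)
sample = parrot(model, "the", 8)
print(" ".join(sample))
```

Every adjacent word pair in the output already occurs in the corpus, which is the sense in which such a sampler only "parrots" its training data; whether LLMs are limited in the same way is the subject of the debate described in this article.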

Google involvement

Gebru was asked by Google to retract the paper or remove the names of Google employees from it. According to Jeff Dean, the paper "didn't meet our bar for publication". In response, Gebru listed conditions to be met, stating that otherwise they could "work on a last date". Dean wrote that one of these conditions was for Google to disclose the reviewers of the paper and their specific feedback, which Google declined. Shortly after, Gebru received an email saying that Google was "accepting her resignation". Her firing sparked a protest by Google employees, who believed the intent was to censor Gebru's criticism. [8]

Subsequent usage

In July 2021, the Alan Turing Institute hosted a keynote and panel discussion on the paper. [9] As of September 2024, the paper has been cited in 4,789 publications. [10] The term has been used in publications in the fields of law, [11] grammar, [12] narrative, [13] and humanities. [14] The authors continue to maintain their concerns about the dangers of chatbots based on large language models, such as GPT-4. [15]

Stochastic parrot is now a neologism used by AI skeptics to refer to machines' lack of understanding of the meaning of their outputs and is sometimes interpreted as a "slur against AI". [6] Its use expanded further when Sam Altman, CEO of OpenAI, used the term ironically when he tweeted, "i am a stochastic parrot and so r u." [6] The term was then designated the 2023 AI-related Word of the Year by the American Dialect Society, even over the words "ChatGPT" and "LLM". [6] [16]

The phrase is often referenced by some researchers to describe LLMs as pattern matchers that can generate plausible human-like text through their vast amount of training data, merely parroting in a stochastic fashion. However, other researchers argue that LLMs are, in fact, at least partially able to understand language. [17]

Debate

Some LLMs, such as ChatGPT, have become capable of interacting with users in convincingly human-like conversations. [17] The development of these new systems has deepened the discussion of the extent to which LLMs understand or are simply "parroting".

Subjective experience

In the mind of a human being, words and language correspond to things one has experienced. [18] For LLMs, words may correspond only to other words and patterns of usage fed into their training data. [19] [20] [4] Proponents of the idea of stochastic parrots thus conclude that LLMs are incapable of actually understanding language. [19] [4]

Hallucinations and mistakes

The tendency of LLMs to pass off fake information as fact is held as support for this view. [18] In what are called hallucinations, LLMs will occasionally synthesize information that matches some pattern, but not reality. [19] [20] [18] That LLMs cannot distinguish fact from fiction leads to the claim that they cannot connect words to a comprehension of the world, as language should do. [19] [18] Further, LLMs often fail to decipher complex or ambiguous grammar cases that rely on understanding the meaning of language. [19] [20] An example, borrowed from Saba, is the prompt: [19]

The wet newspaper that fell down off the table is my favorite newspaper. But now that my favorite newspaper fired the editor I might not like reading it anymore. Can I replace ‘my favorite newspaper’ by ‘the wet newspaper that fell down off the table’ in the second sentence?

LLMs respond to this in the affirmative, not understanding that the meaning of "newspaper" is different in these two contexts; it is first an object and second an institution. [19] Based on these failures, some AI professionals conclude they are no more than stochastic parrots. [19] [18] [4]

Benchmarks and experiments

One argument against the hypothesis that LLMs are stochastic parrots is their results on benchmarks for reasoning, common sense and language understanding. In 2023, some LLMs showed good results on many language understanding tests, such as the Super General Language Understanding Evaluation (SuperGLUE). [20] [21] Such tests, and the smoothness of many LLM responses, have led as many as 51% of AI professionals to believe that LLMs can truly understand language given enough data, according to a 2022 survey. [20]

When experimenting on ChatGPT, one scientist argued that the model was not a stochastic parrot, but had serious reasoning limitations. [17] He found that the model was coherent and informative when attempting to predict future events based on the information in the prompt. [17] ChatGPT was frequently able to parse subtextual information from text prompts as well. However, the model frequently failed when tasked with logic and reasoning, especially when these prompts involved spatial awareness. [17] The model’s varying quality of responses indicates that LLMs may have a form of "understanding" in certain categories of tasks while acting as a stochastic parrot in others. [17]

Interpretability

Another technique for investigating whether LLMs can understand is termed "mechanistic interpretability". The idea is to reverse-engineer a large language model to analyze how it internally processes information.

One example is Othello-GPT, where a small transformer was trained to predict legal Othello moves. It has been found that this model has an internal representation of the Othello board, and that modifying this representation changes the predicted legal Othello moves in the correct way. This supports the idea that LLMs have a "world model", and are not just doing superficial statistics. [22] [23]

In another example, a small transformer was trained on computer programs written in the programming language Karel. Similar to the Othello-GPT example, this model developed an internal representation of Karel program semantics. Modifying this representation results in appropriate changes to the output. Additionally, the model generates correct programs that are, on average, shorter than those in the training set. [24]

Researchers also studied "grokking", a phenomenon where an AI model initially memorizes the training data outputs, and then, after further training, suddenly finds a solution that generalizes to unseen data. [25]

Shortcuts to reasoning

However, when tests created to test people for language comprehension are used to test LLMs, they sometimes result in false positives caused by spurious correlations within text data. [26] Models have shown examples of shortcut learning, in which a system relies on spurious correlations within data instead of using human-like understanding. [27] One such experiment, conducted in 2019, tested Google’s BERT LLM using the argument reasoning comprehension task. BERT was prompted to choose between two statements and find the one most consistent with an argument. Below is an example of one of these prompts: [20] [28]

Argument: Felons should be allowed to vote. A person who stole a car at 17 should not be barred from being a full citizen for life.
Statement A: Grand theft auto is a felony.
Statement B: Grand theft auto is not a felony.

Researchers found that specific words such as "not" hint the model towards the correct answer, allowing near-perfect scores when included but resulting in random selection when hint words were removed. [20] [28] This problem, and the known difficulties in defining intelligence, cause some to argue that all benchmarks that find understanding in LLMs are flawed and that they all allow shortcuts to fake understanding.
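
The shortcut effect can be illustrated with a hypothetical heuristic that never reads the argument and simply prefers whichever statement contains the cue word "not". This is a sketch under stated assumptions: it is not BERT, the items and their labels are invented for illustration rather than drawn from the argument reasoning comprehension task, and the names `cue_heuristic`, `BIASED`, and `BALANCED` are made up. On a set where the cue happens to co-occur with the correct answer the heuristic scores perfectly; once each item is mirrored so that the cue is equally likely on either side, it falls to chance.

```python
# Each item: (statement_a, statement_b, index_of_correct_statement).
# Invented examples: in BIASED the correct statement always contains "not".
BIASED = [
    ("The bridge is open.", "The bridge is not open.", 1),
    ("The store is not closed.", "The store is closed.", 0),
    ("The train is late.", "The train is not late.", 1),
    ("The lamp is not on.", "The lamp is on.", 0),
]

# Mirror every item with the opposite label so the cue no longer predicts the answer.
BALANCED = BIASED + [(a, b, 1 - label) for a, b, label in BIASED]

def cue_heuristic(statement_a, statement_b, cue="not"):
    """Pick the statement containing the cue word, ignoring meaning entirely."""
    a_has = cue in statement_a.lower().split()
    b_has = cue in statement_b.lower().split()
    if b_has and not a_has:
        return 1
    return 0  # cue in A only, or cue uninformative: default to A

def accuracy(items):
    """Fraction of items on which the heuristic picks the labeled statement."""
    correct = sum(cue_heuristic(a, b) == label for a, b, label in items)
    return correct / len(items)

print(accuracy(BIASED))    # 1.0
print(accuracy(BALANCED))  # 0.5
```

A high benchmark score is therefore compatible with a strategy that involves no comprehension at all, which is why balancing or removing such cues is used to check what a model has actually learned.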

References

  1. Lindholm et al. 2022, pp. 322–3.
  2. Uddin, Muhammad Saad (April 20, 2023). "Stochastic Parrots: A Novel Look at Large Language Models and Their Limitations". Towards AI. Retrieved 2023-05-12.
  3. Weil, Elizabeth (March 1, 2023). "You Are Not a Parrot". New York. Retrieved 2023-05-12.
  4. Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret (2021-03-01). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21. New York, NY, USA: Association for Computing Machinery. pp. 610–623. doi:10.1145/3442188.3445922. ISBN 978-1-4503-8309-7. S2CID 232040593.
  5. Hao, Karen (4 December 2020). "We read the paper that forced Timnit Gebru out of Google. Here's what it says". MIT Technology Review. Archived from the original on 6 October 2021. Retrieved 19 January 2022.
  6. Zimmer, Ben. "'Stochastic Parrot': A Name for AI That Sounds a Bit Less Intelligent". Wall Street Journal. Retrieved 2024-04-01.
  7. Uddin, Muhammad Saad (April 20, 2023). "Stochastic Parrots: A Novel Look at Large Language Models and Their Limitations". Towards AI. Retrieved 2023-05-12.
  8. Lyons, Kim (5 December 2020). "Timnit Gebru's actual paper may explain why Google ejected her". The Verge.
  9. Weller (2021).
  10. "Bender: On the Dangers of Stochastic Parrots". Google Scholar. Retrieved 2023-05-12.
  11. Arnaudo, Luca (April 20, 2023). "Artificial Intelligence, Capabilities, Liabilities: Interactions in the Shadows of Regulation, Antitrust – And Family Law". SSRN. doi:10.2139/ssrn.4424363. S2CID 258636427.
  12. Bleackley, Pete; BLOOM (2023). "In the Cage with the Stochastic Parrot". Speculative Grammarian. CXCII (3). Retrieved 2023-05-13.
  13. Gáti, Daniella (2023). "Theorizing Mathematical Narrative through Machine Learning". Journal of Narrative Theory. 53 (1). Project MUSE: 139–165. doi:10.1353/jnt.2023.0003. S2CID 257207529.
  14. Rees, Tobias (2022). "Non-Human Words: On GPT-3 as a Philosophical Laboratory". Daedalus. 151 (2): 168–82. doi:10.1162/daed_a_01908. JSTOR 48662034. S2CID 248377889.
  15. Goldman, Sharon (March 20, 2023). "With GPT-4, dangers of 'Stochastic Parrots' remain, say researchers. No wonder OpenAI CEO is a 'bit scared'". VentureBeat. Retrieved 2023-05-09.
  16. Corbin, Sam (2024-01-15). "Among Linguists, the Word of the Year Is More of a Vibe". The New York Times. ISSN 0362-4331. Retrieved 2024-04-01.
  17. Arkoudas, Konstantine (2023-08-21). "ChatGPT is no Stochastic Parrot. But it also Claims that 1 is Greater than 1". Philosophy & Technology. 36 (3): 54. doi:10.1007/s13347-023-00619-6. ISSN 2210-5441.
  18. Fayyad, Usama M. (2023-05-26). "From Stochastic Parrots to Intelligent Assistants—The Secrets of Data and Human Interventions". IEEE Intelligent Systems. 38 (3): 63–67. doi:10.1109/MIS.2023.3268723. ISSN 1541-1672.
  19. Saba, Walid S. (2023). "Stochastic LLMS do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMS". In Almeida, João Paulo A.; Borbinha, José; Guizzardi, Giancarlo; Link, Sebastian; Zdravkovic, Jelena (eds.). Conceptual Modeling. Lecture Notes in Computer Science. Vol. 14320. Cham: Springer Nature Switzerland. pp. 3–19. arXiv:2309.05918. doi:10.1007/978-3-031-47262-6_1. ISBN 978-3-031-47262-6.
  20. Mitchell, Melanie; Krakauer, David C. (2023-03-28). "The debate over understanding in AI's large language models". Proceedings of the National Academy of Sciences. 120 (13): e2215907120. arXiv:2210.13966. Bibcode:2023PNAS..12015907M. doi:10.1073/pnas.2215907120. ISSN 0027-8424. PMC 10068812. PMID 36943882.
  21. Wang, Alex; Pruksachatkun, Yada; Nangia, Nikita; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2019-05-02). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". arXiv:1905.00537 [cs.CL].
  22. Li, Kenneth; Hopkins, Aspen K.; Bau, David; Viégas, Fernanda; Pfister, Hanspeter; Wattenberg, Martin (2023-02-27), Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, arXiv:2210.13382
  23. Li, Kenneth (2023-01-21). "Large Language Model: world models or surface statistics?". The Gradient. Retrieved 2024-04-04.
  24. Jin, Charles; Rinard, Martin (2023-05-24), Evidence of Meaning in Language Models Trained on Programs, arXiv:2305.11169
  25. Schreiner, Maximilian (2023-08-11). "Grokking in machine learning: When Stochastic Parrots build models". the decoder. Retrieved 2024-05-25.
  26. Choudhury, Sagnik Ray; Rogers, Anna; Augenstein, Isabelle (2022-09-15), Machine Reading, Fast and Slow: When Do Models "Understand" Language?, arXiv:2209.07430
  27. Geirhos, Robert; Jacobsen, Jörn-Henrik; Michaelis, Claudio; Zemel, Richard; Brendel, Wieland; Bethge, Matthias; Wichmann, Felix A. (2020-11-10). "Shortcut learning in deep neural networks". Nature Machine Intelligence. 2 (11): 665–673. arXiv:2004.07780. doi:10.1038/s42256-020-00257-z. ISSN 2522-5839.
  28. Niven, Timothy; Kao, Hung-Yu (2019-09-16), Probing Neural Network Comprehension of Natural Language Arguments, arXiv:1907.07355
