Stochastic parrot

In machine learning, the term stochastic parrot is a metaphor used to describe the theory that large language models, though able to generate plausible language, do not understand the meaning of the language they process. [1] [2] The term was coined by Emily M. Bender [2] [3] in the 2021 artificial intelligence research paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. [4]

Origin and definition

The term was first used in the paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell (using the pseudonym "Shmargaret Shmitchell"). [4] They argued that very large language models (LLMs) present dangers such as environmental and financial costs, inscrutability leading to unknown dangerous biases, and the potential for deception, and that they cannot understand the concepts underlying what they learn. [5] After publishing the paper, and following subsequent related events, Gebru and Mitchell lost their jobs at Google. [6] [7] Their firing sparked a protest by Google employees. [6] [7]

The word “stochastic” derives from the ancient Greek word “stokhastikos” meaning “based on guesswork,” or “randomly determined.” [8] The word "parrot" refers to the idea that LLMs merely repeat words without understanding their meaning. [8]

In their paper, Bender et al. argue that LLMs probabilistically link words and sentences together without considering meaning, and are therefore labeled mere "stochastic parrots". [4]
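
One way to see what "probabilistically linking words together" means in practice is a toy bigram model, which generates text purely by sampling the next word from observed word-to-word counts, with no representation of meaning. The sketch below is only an illustration of that general idea, with an invented miniature corpus; it is not code from the paper and is vastly simpler than an actual LLM.

    import random
    from collections import defaultdict

    # Invented miniature corpus, for illustration only.
    corpus = ("the parrot repeats the words it has heard and "
              "the parrot sounds fluent to the listener").split()

    # Record which words have been observed to follow each word (bigram statistics).
    following = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev].append(nxt)

    def parrot(start="the", length=10):
        # Repeatedly sample one of the words observed to follow the current word.
        # The output can sound fluent, but nothing here models what the words mean.
        words = [start]
        for _ in range(length - 1):
            options = following.get(words[-1])
            if not options:  # no observed continuation: stop
                break
            words.append(random.choice(options))
        return " ".join(words)

    print(parrot())

An actual LLM replaces the bigram table with a neural network conditioned on a long context, but the generation step is still the sampling of a likely next token.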

According to the machine learning professionals Lindholm, Wahlstrom, Lindsten, and Schon, the analogy highlights two vital limitations: LLMs are limited by the data they are trained on, simply repeating the contents of their datasets in a stochastic fashion; and because they only generate output based on that training data, they do not understand whether they are saying something incorrect or inappropriate. [1] [9]

Lindholm et al. noted that, with poor quality datasets and other limitations, a learning machine might produce results that are "dangerously wrong". [1]

Subsequent usage

In July 2021, the Alan Turing Institute hosted a keynote and panel discussion on the paper. [10] As of May 2023, the paper had been cited in 1,529 publications. [11] The term has been used in publications in the fields of law, [12] grammar, [13] narrative, [14] and humanities. [15] The authors continue to maintain their concerns about the dangers of chatbots based on large language models, such as GPT-4. [16]

Stochastic parrot is now a neologism used by AI skeptics to refer to machines' lack of understanding of the meaning of their outputs, and is sometimes interpreted as a "slur against AI." [8] Its use expanded further when Sam Altman, CEO of OpenAI, used the term ironically when he tweeted, "i am a stochastic parrot and so r u." [8] The term was then designated the American Dialect Society's 2023 AI-related Word of the Year, beating out the words "ChatGPT" and "LLM." [8] [17]

Some researchers use the phrase to describe LLMs as pattern matchers that can generate plausible human-like text by drawing on their vast training data, merely parroting in a stochastic fashion. However, other researchers argue that LLMs are, in fact, able to understand language. [18]

Debate

Some LLMs, such as ChatGPT, have become capable of interacting with users in convincingly human-like conversations. [18] The development of these new systems has deepened the discussion of the extent to which LLMs are simply “parroting.”

In the mind of a human being, words and language correspond to things one has experienced. [19] For LLMs, words correspond only to other words and to patterns of usage in their training data. [20] [21] [4] Proponents of the idea of stochastic parrots thus conclude that LLMs are incapable of actually understanding language. [20] [4]

The tendency of LLMs to pass off fabricated information as fact is held up as support. [19] In these so-called hallucinations, LLMs occasionally synthesize information that matches some pattern, but not reality. [20] [21] [19] That LLMs cannot distinguish fact from fiction leads to the claim that they cannot connect words to a comprehension of the world, as language should do. [20] [19] Further, LLMs often fail to decipher complex or ambiguous grammar cases that rely on understanding the meaning of language. [20] [21] An example, borrowed from Saba et al., is the prompt: [20]

The wet newspaper that fell down off the table is my favorite newspaper. But now that my favorite newspaper fired the editor I might not like reading it anymore. Can I replace ‘my favorite newspaper’ by ‘the wet newspaper that fell down off the table’ in the second sentence?

LLMs respond to this in the affirmative, not understanding that the meaning of "newspaper" is different in these two contexts; it is first an object and second an institution. [20] Based on these failures, some AI professionals conclude that LLMs are no more than stochastic parrots. [20] [19] [4]

However, there is support for the claim that LLMs are more than that. LLMs perform well on many tests designed to measure understanding, such as the Super General Language Understanding Evaluation (SuperGLUE). [21] [22] Tests such as these, together with the smoothness of many LLM responses, led as many as 51% of AI professionals to believe, according to a 2022 survey, that LLMs can truly understand language given enough data. [21]

Another technique that has been applied to show this is termed "mechanistic interpretability". The idea is to reverse-engineer a large language model by discovering symbolic algorithms that approximate the inference it performs. One example is Othello-GPT, in which a small transformer is trained to predict legal Othello moves. The model has been found to learn a linear representation of the Othello board, and modifying this representation changes the predicted legal moves in the correct way. [23] [24]
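
The core tool in such studies is often a probe: a simple classifier trained to read a property of interest, such as the contents of a board square, out of the model's hidden activations, so that high probe accuracy suggests the property is explicitly represented. The sketch below illustrates a linear probe in this generic sense; the cached activations and board labels are random placeholders standing in for data that would come from an actual Othello-GPT run, not the real thing.

    import numpy as np

    # Placeholder data (assumptions for illustration):
    #   hidden: (n_positions, d_model) hidden activations from one transformer layer
    #   labels: (n_positions,) state of one board square: 0 = empty, 1 = mine, 2 = yours
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(1000, 512))
    labels = rng.integers(0, 3, size=1000)

    def fit_linear_probe(x, y, n_classes=3, lr=0.1, epochs=300):
        # Multinomial logistic regression trained by gradient descent:
        # predicted class probabilities are softmax(x @ w + b).
        w = np.zeros((x.shape[1], n_classes))
        b = np.zeros(n_classes)
        y_onehot = np.eye(n_classes)[y]
        for _ in range(epochs):
            logits = x @ w + b
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            grad = p - y_onehot                  # gradient of cross-entropy w.r.t. logits
            w -= lr * x.T @ grad / len(x)
            b -= lr * grad.mean(axis=0)
        return w, b

    w, b = fit_linear_probe(hidden, labels)
    accuracy = ((hidden @ w + b).argmax(axis=1) == labels).mean()
    print(f"probe accuracy: {accuracy:.2f}")

On random placeholder data the probe scores near chance; the reported Othello-GPT result is that, on real activations, such a linear probe recovers the board state far above chance, and that editing the probed directions changes the model's move predictions accordingly.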

In another example, a small transformer was trained on Karel programs. Similar to the Othello-GPT example, there is a linear representation of Karel program semantics, and modifying this representation changes the output in the correct way. The model also generates correct programs that are, on average, shorter than those in the training set. [25]

However, when tests created to assess human language comprehension are used to evaluate LLMs, they sometimes produce false positives caused by spurious correlations within the text data. [26] Models have shown examples of shortcut learning, in which a system makes unrelated correlations within the data instead of using human-like understanding. [27] One such experiment tested Google's BERT LLM using the argument reasoning comprehension task, asking it to choose which of two statements is more consistent with a given argument. Below is an example of one of these prompts: [21] [28]

Argument: Felons should be allowed to vote. A person who stole a car at 17 should not be barred from being a full citizen for life.
Statement A: Grand theft auto is a felony.
Statement B: Grand theft auto is not a felony.

Researchers found that specific words such as "not" hint the model towards the correct answer, allowing near-perfect scores when such cue words are included but resulting in random selection when they are removed. [21] [28] This problem, together with the known difficulties of defining intelligence, leads some to argue that all benchmarks that find understanding in LLMs are flawed, in that they all allow shortcuts that fake understanding.
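
The kind of analysis behind that finding can be illustrated by measuring how predictive a single cue word is of the correct answer across a labeled dataset: if choosing whichever statement contains the cue already matches the gold labels well, a model can score highly without comprehending the argument at all. The statement pairs below are invented stand-ins, not the actual argument reasoning comprehension data, and the heuristic is a simplified illustration of the cue statistics reported by Niven and Kao.

    # Invented (statement_a, statement_b, index_of_correct_statement) triples.
    examples = [
        ("Jaywalking is a misdemeanor.", "Jaywalking is not a misdemeanor.", 0),
        ("Late fees are refundable.", "Late fees are not refundable.", 1),
        ("A permit is not required.", "A permit is required.", 0),
    ]

    def cue_predictiveness(examples, cue="not"):
        # How often does "pick the statement containing the cue word" match the label?
        # A rate far from 0.5 on many examples means the cue alone is a usable shortcut.
        covered, hits = 0, 0
        for a, b, gold in examples:
            has_a = cue in a.lower().split()
            has_b = cue in b.lower().split()
            if has_a == has_b:
                continue  # cue in both or neither: no signal from this pair
            covered += 1
            guess = 0 if has_a else 1
            hits += int(guess == gold)
        return covered, hits / covered if covered else float("nan")

    covered, rate = cue_predictiveness(examples)
    print(f"'not' covers {covered} pairs and picks the labeled answer {rate:.0%} of the time")

A benchmark on which such simple cues are highly predictive can be solved by pattern matching alone, which is why removing the cue words caused BERT's performance to fall to chance.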

Without a reliable benchmark, researchers have found it difficult to distinguish models that are merely stochastic parrots from those capable of understanding. When experimenting on ChatGPT-3, one scientist argued that the model was in between true human-like understanding and being a stochastic parrot. [18] He found that the model was coherent and informative when attempting to predict future events based on the information in the prompt. [18] ChatGPT-3 was frequently able to parse subtextual information from text prompts as well. However, the model frequently failed when tasked with logic and reasoning, especially when these prompts involved spatial awareness. [18] The varying quality of the model's responses indicates that LLMs may have a form of “understanding” in certain categories of tasks while acting as a stochastic parrot in others. [18]

Related Research Articles

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.

Google Brain was a deep learning artificial intelligence research team under the umbrella of Google AI, a research division at Google dedicated to artificial intelligence. Formed in 2011, Google Brain combined open-ended machine learning research with information systems and large-scale computing resources. The team created tools such as TensorFlow, which allow neural networks to be used by the public, and ran multiple internal AI research projects. The team aimed to create research opportunities in machine learning and natural language processing. It was merged into former Google sister company DeepMind to form Google DeepMind in April 2023.

Multimodal learning, in the context of machine learning, is a type of deep learning using a combination of various modalities of data, such as text, audio, or images, in order to create a more robust model of the real-world phenomena in question. In contrast, singular modal learning would analyze text or imaging data independently. Multimodal machine learning combines these fundamentally different statistical analyses using specialized modeling strategies and algorithms, resulting in a model that comes closer to representing the real world.

Emily Menon Bender is an American linguist who is a professor at the University of Washington. She specializes in computational linguistics and natural language processing. She is also the director of the University of Washington's Computational Linguistics Laboratory. She has published several papers on the risks of large language models and on ethics in natural language processing.

Artificial intelligence is used in Wikipedia and other Wikimedia projects for the purpose of developing those projects. Human and bot interaction in Wikimedia projects is routine and iterative.

<span class="mw-page-title-main">Timnit Gebru</span> Computer scientist (born 1983)

Timnit Gebru is an Eritrean Ethiopian-born computer scientist who works in the fields of artificial intelligence (AI), algorithmic bias and data mining. She is an advocate for diversity in technology and co-founder of Black in AI, a community of Black researchers working in AI. She is the founder of the Distributed Artificial Intelligence Research Institute (DAIR).

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller, a similar architecture proposed in 1992.

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor, GPT-2, it is a decoder-only transformer-based deep neural network, which supersedes recurrence- and convolution-based architectures with a technique known as "attention". This attention mechanism allows the model to selectively focus on the segments of input text it predicts to be most relevant. GPT-3 uses a 2048-token-long context, float16 (16-bit) precision, and a then-unprecedented 175 billion parameters, requiring 350 GB of storage space since each parameter occupies 2 bytes, and has demonstrated strong "zero-shot" and "few-shot" learning abilities on many tasks.

<span class="mw-page-title-main">DALL-E</span> Image-generating deep-learning model

DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions, called "prompts."

<span class="mw-page-title-main">Margaret Mitchell (scientist)</span> U.S. computer scientist

Margaret Mitchell is a computer scientist who works on algorithmic bias and fairness in machine learning. She is most well known for her work on automatically removing undesired biases concerning demographic groups from machine learning models, as well as more transparent reporting of their intended use.

<span class="mw-page-title-main">Deborah Raji</span> Nigerian-Canadian computer scientist and activist

Inioluwa Deborah Raji is a Nigerian-Canadian computer scientist and activist who works on algorithmic bias, AI accountability, and algorithmic auditing. Raji has previously worked with Joy Buolamwini, Timnit Gebru, and the Algorithmic Justice League on researching gender and racial bias in facial recognition technology. She has also worked with Google’s Ethical AI team and been a research fellow at the Partnership on AI and AI Now Institute at New York University working on how to operationalize ethical considerations in machine learning engineering practice. A current Mozilla fellow, she has been recognized by MIT Technology Review and Forbes as one of the world's top young innovators.

<span class="mw-page-title-main">GPT-1</span> 2018 text-generating language model

Generative Pre-trained Transformer 1 (GPT-1) was the first of OpenAI's large language models following Google's invention of the transformer architecture in 2017. In June 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced that initial model along with the general concept of a generative pre-trained transformer.

Prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform.

A foundation model is a machine learning or deep learning model that is trained on broad data such that it can be applied across a wide range of use cases. Foundation models have transformed artificial intelligence (AI), powering prominent generative AI applications like ChatGPT. The Stanford Institute for Human-Centered Artificial Intelligence's Center for Research on Foundation Models created and popularized the term.

<span class="mw-page-title-main">ChatGPT</span> Chatbot developed by OpenAI

ChatGPT is a chatbot developed by OpenAI and launched on November 30, 2022. Based on a large language model, it enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language. Successive prompts and replies are taken into account at each stage of the conversation as context, a technique known as prompt engineering.

<span class="mw-page-title-main">Hallucination (artificial intelligence)</span> Confident unjustified claim by AI

In the field of artificial intelligence (AI), a hallucination or artificial hallucination is a response generated by AI which contains false or misleading information presented as fact. This term draws a loose analogy with human psychology, where hallucination typically involves false percepts. However, there’s a key difference: AI hallucination is associated with unjustified responses or beliefs rather than perceptual experiences.

Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the paid chatbot product ChatGPT Plus, via OpenAI's API, and via the free chatbot Microsoft Copilot. As a transformer-based model, GPT-4 uses a paradigm where pre-training using both public data and "data licensed from third-party providers" is used to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.

<span class="mw-page-title-main">Generative pre-trained transformer</span> Type of large language model

Generative pre-trained transformers (GPT) are a type of large language model (LLM) and a prominent framework for generative artificial intelligence. They are artificial neural networks that are used in natural language processing tasks. GPTs are based on the transformer architecture, pre-trained on large data sets of unlabelled text, and able to generate novel human-like content. As of 2023, most LLMs have these characteristics and are sometimes referred to broadly as GPTs.

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.

Ashish Vaswani is a computer scientist working in deep learning, who is known for his significant contributions to the field of artificial intelligence (AI) and natural language processing (NLP). He is one of the co-authors of the seminal paper "Attention Is All You Need" which introduced the Transformer model, a novel architecture that uses a self-attention mechanism and has since become foundational to many state-of-the-art models in NLP. Transformer architecture is the core of language models that power applications such as ChatGPT. He was a co-founder of Adept AI Labs and a former staff research scientist at Google Brain.

References

  1. Lindholm et al. 2022, pp. 322–3.
  2. Uddin, Muhammad Saad (April 20, 2023). "Stochastic Parrots: A Novel Look at Large Language Models and Their Limitations". Towards AI. Retrieved 2023-05-12.
  3. Weil, Elizabeth (March 1, 2023). "You Are Not a Parrot". New York. Retrieved 2023-05-12.
  4. Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret (2021-03-01). "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜". Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21. New York, NY, USA: Association for Computing Machinery. pp. 610–623. doi:10.1145/3442188.3445922. ISBN 978-1-4503-8309-7. S2CID 232040593.
  5. Hao, Karen (4 December 2020). "We read the paper that forced Timnit Gebru out of Google. Here's what it says". MIT Technology Review. Archived from the original on 6 October 2021. Retrieved 19 January 2022.
  6. Lyons, Kim (5 December 2020). "Timnit Gebru's actual paper may explain why Google ejected her". The Verge.
  7. Taylor, Paul (2021-02-12). "Stochastic Parrots". London Review of Books. Retrieved 2023-05-09.
  8. Zimmer, Ben. "'Stochastic Parrot': A Name for AI That Sounds a Bit Less Intelligent". WSJ. Retrieved 2024-04-01.
  9. Uddin, Muhammad Saad (April 20, 2023). "Stochastic Parrots: A Novel Look at Large Language Models and Their Limitations". Towards AI. Retrieved 2023-05-12.
  10. Weller (2021).
  11. "Bender: On the Dangers of Stochastic Parrots". Google Scholar. Retrieved 2023-05-12.
  12. Arnaudo, Luca (April 20, 2023). "Artificial Intelligence, Capabilities, Liabilities: Interactions in the Shadows of Regulation, Antitrust – And Family Law". SSRN. doi:10.2139/ssrn.4424363. S2CID 258636427.
  13. Bleackley, Pete; BLOOM (2023). "In the Cage with the Stochastic Parrot". Speculative Grammarian. CXCII (3). Retrieved 2023-05-13.
  14. Gáti, Daniella (2023). "Theorizing Mathematical Narrative through Machine Learning". Journal of Narrative Theory. 53 (1). Project MUSE: 139–165. doi:10.1353/jnt.2023.0003. S2CID 257207529.
  15. Rees, Tobias (2022). "Non-Human Words: On GPT-3 as a Philosophical Laboratory". Daedalus. 151 (2): 168–82. doi:10.1162/daed_a_01908. JSTOR 48662034. S2CID 248377889.
  16. Goldman, Sharon (March 20, 2023). "With GPT-4, dangers of 'Stochastic Parrots' remain, say researchers. No wonder OpenAI CEO is a 'bit scared'". VentureBeat. Retrieved 2023-05-09.
  17. Corbin, Sam (2024-01-15). "Among Linguists, the Word of the Year Is More of a Vibe". The New York Times. ISSN 0362-4331. Retrieved 2024-04-01.
  18. Arkoudas, Konstantine (2023-08-21). "ChatGPT is no Stochastic Parrot. But it also Claims that 1 is Greater than 1". Philosophy & Technology. 36 (3): 54. doi:10.1007/s13347-023-00619-6. ISSN 2210-5441.
  19. Fayyad, Usama M. (2023-05-26). "From Stochastic Parrots to Intelligent Assistants—The Secrets of Data and Human Interventions". IEEE Intelligent Systems. 38 (3): 63–67. doi:10.1109/MIS.2023.3268723. ISSN 1541-1672.
  20. Saba, Walid S. (2023). "Stochastic LLMS do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMS". In Almeida, João Paulo A.; Borbinha, José; Guizzardi, Giancarlo; Link, Sebastian; Zdravkovic, Jelena (eds.). Conceptual Modeling. Lecture Notes in Computer Science. Vol. 14320. Cham: Springer Nature Switzerland. pp. 3–19. arXiv:2309.05918. doi:10.1007/978-3-031-47262-6_1. ISBN 978-3-031-47262-6.
  21. Mitchell, Melanie; Krakauer, David C. (2023-03-28). "The debate over understanding in AI's large language models". Proceedings of the National Academy of Sciences. 120 (13): e2215907120. arXiv:2210.13966. Bibcode:2023PNAS..12015907M. doi:10.1073/pnas.2215907120. ISSN 0027-8424. PMC 10068812. PMID 36943882.
  22. Wang, Alex; Pruksachatkun, Yada; Nangia, Nikita; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2019-05-02). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". arXiv:1905.00537.
  23. Li, Kenneth; Hopkins, Aspen K.; Bau, David; Viégas, Fernanda; Pfister, Hanspeter; Wattenberg, Martin (2023-02-27). "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task". arXiv:2210.13382.
  24. Li, Kenneth (2023-01-21). "Large Language Model: world models or surface statistics?". The Gradient. Retrieved 2024-04-04.
  25. Jin, Charles; Rinard, Martin (2023-05-24). "Evidence of Meaning in Language Models Trained on Programs". arXiv:2305.11169.
  26. Choudhury, Sagnik Ray; Rogers, Anna; Augenstein, Isabelle (2022-09-15). "Machine Reading, Fast and Slow: When Do Models "Understand" Language?". arXiv:2209.07430.
  27. Geirhos, Robert; Jacobsen, Jörn-Henrik; Michaelis, Claudio; Zemel, Richard; Brendel, Wieland; Bethge, Matthias; Wichmann, Felix A. (2020-11-10). "Shortcut learning in deep neural networks". Nature Machine Intelligence. 2 (11): 665–673. arXiv:2004.07780. doi:10.1038/s42256-020-00257-z. ISSN 2522-5839.
  28. Niven, Timothy; Kao, Hung-Yu (2019-09-16). "Probing Neural Network Comprehension of Natural Language Arguments". arXiv:1907.07355.

Works cited

Further reading