Artificial intelligence in Wikimedia projects

Artificial intelligence is used in Wikipedia and other Wikimedia projects to develop, maintain, and improve those projects. [1] [2] Interaction between humans and bots in Wikimedia projects is routine and iterative. [3]

Using artificial intelligence for Wikimedia projects

Various projects seek to improve Wikipedia and Wikimedia projects by using artificial intelligence tools.

ORES

The Objective Revision Evaluation Service (ORES) project is an artificial intelligence service for grading the quality of Wikipedia edits. [4] [5] The Wikimedia Foundation presented the ORES project in November 2015. [6]
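ORES exposes its edit-quality scores through a public web API. The sketch below builds a request URL following ORES's documented v3 endpoint layout; the wiki name, model names, and revision IDs are illustrative values, not specific real edits.

```python
from urllib.parse import urlencode

def ores_score_url(wiki, rev_ids, models=("damaging", "goodfaith")):
    """Build a request URL for the ORES v3 scoring API.

    The endpoint shape follows ORES's documented v3 interface;
    the revision IDs passed in are illustrative.
    """
    base = f"https://ores.wikimedia.org/v3/scores/{wiki}/"
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return f"{base}?{query}"

url = ores_score_url("enwiki", [123456, 789012])
print(url)
```

Fetching that URL with any HTTP client would return JSON with per-model probabilities (e.g. how likely each revision is "damaging"), which tools can use to triage edits for human review.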

Wiki bots

The best-known vandalism-fighting bot is ClueBot NG. The bot was created by Wikipedia users Christopher Breneman and Naomi Amethyst in 2010 (succeeding the original ClueBot, created in 2007; "NG" stands for Next Generation) [7] and uses machine learning and Bayesian statistics to determine whether an edit is vandalism. [8] [9]
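ClueBot NG's actual pipeline combines a neural network with Bayesian classifiers over many edit features and thresholds not shown here. The toy sketch below illustrates only the Bayesian idea: score an edit's words under a "vandalism" and a "constructive" model (with Laplace smoothing and equal priors) and pick the likelier class. All training strings are invented.

```python
from collections import Counter
import math

# Invented training data: edits labeled vandalism vs. constructive.
# ClueBot NG's real features are far richer than bag-of-words.
vandal_edits = ["u suck lol", "page blanked lol", "xxx spam xxx"]
good_edits = ["added citation to journal", "fixed typo in date",
              "expanded history section"]

def word_counts(edits):
    counts = Counter()
    for edit in edits:
        counts.update(edit.split())
    return counts

vandal_counts = word_counts(vandal_edits)
good_counts = word_counts(good_edits)
vocab = set(vandal_counts) | set(good_counts)

def log_likelihood(text, counts):
    # Laplace-smoothed log-likelihood of the edit under one class.
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + len(vocab)))
        for w in text.split()
    )

def is_vandalism(text):
    # Equal class priors assumed, so compare likelihoods directly.
    return log_likelihood(text, vandal_counts) > log_likelihood(text, good_counts)

print(is_vandalism("lol lol spam"))           # True on this toy data
print(is_vandalism("fixed typo in citation"))  # False on this toy data
```

A production classifier would also weigh edit metadata (editor history, size of change, time of day) and tune the decision threshold to keep the false-positive rate on good-faith edits very low.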

Detox

Detox was a project by Google, in collaboration with the Wikimedia Foundation, to research methods that could be used to address users posting unkind comments in Wikimedia community discussions. [10] Among other parts of the Detox project, the Wikimedia Foundation and Jigsaw collaborated to use artificial intelligence for basic research and to develop technical solutions to address the problem. In October 2016 those organizations published "Ex Machina: Personal Attacks Seen at Scale" describing their findings. [11] [12] Various popular media outlets reported on the publication of this paper and described the social context of the research. [13] [14] [15]

Bias reduction

In August 2018, a company called Primer reported attempting to use artificial intelligence to create Wikipedia articles about women as a way to address gender bias on Wikipedia. [16] [17]

[Image: DeepL machine translation of English Wikipedia example]
Machine translation software such as DeepL is used by contributors. More than 40% of Wikipedia's active editors work on the English Wikipedia.

Generative models

Wikipedia articles can be read using AI voice technology.

Text

In 2022, the public release of ChatGPT inspired more experimentation with AI for writing Wikipedia articles, and sparked debate about whether, and to what extent, such large language models are suitable for the purpose, given their tendency to generate plausible-sounding misinformation, including fake references; to generate prose that is not encyclopedic in tone; and to reproduce biases. [23] [24] As of May 2023, a draft Wikipedia policy on ChatGPT and similar large language models (LLMs) recommended that users who are unfamiliar with LLMs avoid using them, due to the aforementioned risks as well as the potential for libel or copyright infringement. [24]

Other media

A WikiProject exists for finding and removing AI-generated text and images, called WikiProject AI Cleanup. [25]

Using Wikimedia projects for artificial intelligence

[Image: Composition of high-quality language datasets: The Pile (left), PaLM (top right), MassiveText (bottom right)]
Datasets of Wikipedia are widely used for training AI models.

Content in Wikimedia projects is useful as a dataset for advancing artificial intelligence research and applications. For instance, the development of Google's Perspective API, which identifies toxic comments in online forums, used a dataset containing hundreds of thousands of Wikipedia talk page comments with human-labelled toxicity levels. [27] Subsets of the Wikipedia corpus are considered the largest well-curated data sets available for AI training. [19] [20]
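Datasets of this kind typically pair each comment with ratings from several human annotators, which must be reconciled into a single label before training. The sketch below shows one common preprocessing step, majority-voting per-comment labels; the records and field layout are invented for illustration and do not reproduce the real corpus schema.

```python
from collections import defaultdict

# Invented records in the spirit of a labelled talk-page corpus:
# (comment_id, annotator_id, is_toxic). Field names are illustrative.
annotations = [
    (1, "a", 1), (1, "b", 1), (1, "c", 0),
    (2, "a", 0), (2, "b", 0), (2, "c", 0),
]

def majority_labels(rows):
    """Collapse multiple annotator votes into one boolean label per comment."""
    votes = defaultdict(list)
    for comment_id, _annotator, label in rows:
        votes[comment_id].append(label)
    # A comment counts as toxic when most annotators flagged it.
    return {cid: sum(v) > len(v) / 2 for cid, v in votes.items()}

print(majority_labels(annotations))  # comment 1 toxic, comment 2 not
```

Research on such corpora also studies annotator disagreement itself, since a bare majority vote discards information about borderline comments.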

A 2012 paper reported that more than 1,000 academic articles, including those using artificial intelligence, examine Wikipedia, reuse information from Wikipedia, use technical extensions linked to Wikipedia, or research communication about Wikipedia. [28] A 2017 paper described Wikipedia as "the mother lode" of human-generated text available for machine learning. [29]

A 2016 research project called "One Hundred Year Study on Artificial Intelligence" named Wikipedia as a key early project for understanding the interplay between artificial intelligence applications and human engagement. [30]

There is concern about the lack of attribution to Wikipedia articles in large language models like ChatGPT. [19] While Wikipedia's licensing lets anyone use its texts, including in modified form, it requires that credit be given, so using its contents in answers generated by AI models without clarifying the sourcing may violate its terms of use. [19]

References

  1. Marr, Bernard (17 August 2018). "The Amazing Ways How Wikipedia Uses Artificial Intelligence". Forbes.
  2. Gertner, Jon (18 July 2023). "Wikipedia's Moment of Truth - Can the online encyclopedia help teach A.I. chatbots to get their facts right — without destroying itself in the process?". The New York Times. Archived from the original on 18 July 2023. Retrieved 19 July 2023.
  3. Piscopo, Alessandro (1 October 2018). "Wikidata: A New Paradigm of Human-Bot Collaboration?". arXiv: 1810.00931 [cs.HC].
  4. Simonite, Tom (1 December 2015). "Software That Can Spot Rookie Mistakes Could Make Wikipedia More Welcoming". MIT Technology Review.
  5. Metz, Cade (1 December 2015). "Wikipedia Deploys AI to Expand Its Ranks of Human Editors". Wired. Archived from the original on 2 Apr 2024.
  6. Halfaker, Aaron; Taraborelli, Dario (30 November 2015). "Artificial intelligence service "ORES" gives Wikipedians X-ray specs to see through bad edits". Wikimedia Foundation.
  7. Hicks, Jesse (18 February 2014). "This machine kills trolls". The Verge . Archived from the original on 27 August 2014. Retrieved 18 February 2014.
  8. Nasaw, Daniel (25 July 2012). "Meet the 'bots' that edit Wikipedia". BBC News. Archived from the original on 16 September 2018. Retrieved 21 July 2018.
  9. Raja, Sumit. "Little about the bot that runs Wikipedia, ClueBot NG". digitfreak.com. Archived from the original on 22 November 2013. Retrieved 11 April 2017.
  10. "Research:Detox". Meta-Wiki. Wikimedia Foundation.
  11. Wulczyn, Ellery; Thain, Nithum; Dixon, Lucas (2017). "Ex Machina: Personal Attacks Seen at Scale". Proceedings of the 26th International Conference on World Wide Web. pp. 1391–1399. arXiv: 1610.08914 . doi:10.1145/3038912.3052591. ISBN   9781450349130. S2CID   6060248.
  12. Jigsaw (7 February 2017). "Algorithms And Insults: Scaling Up Our Understanding Of Harassment On Wikipedia". Medium.
  13. Wakabayashi, Daisuke (23 February 2017). "Google Cousin Develops Technology to Flag Toxic Online Comments". The New York Times.
  14. Smellie, Sarah (17 February 2017). "Inside Wikipedia's Attempt to Use Artificial Intelligence to Combat Harassment". Motherboard. Vice Media.
  15. Gershgorn, Dave (27 February 2017). "Alphabet's hate-fighting AI doesn't understand hate yet". Quartz.
  16. Simonite, Tom (3 August 2018). "Using Artificial Intelligence to Fix Wikipedia's Gender Problem". Wired.
  17. Verger, Rob (7 August 2018). "Artificial intelligence can now help write Wikipedia pages for overlooked scientists". Popular Science.
  18. Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al; Akula, Bapi; Barrault, Loic; Gonzalez, Gabriel Mejia; Hansanti, Prangthip; Hoffman, John; Jarrett, Semarley; Sadagopan, Kaushik Ram; Rowe, Dirk; Spruit, Shannon; Tran, Chau; Andrews, Pierre; Ayan, Necip Fazil; Bhosale, Shruti; Edunov, Sergey; Fan, Angela; Gao, Cynthia; Goswami, Vedanuj; Guzmán, Francisco; Koehn, Philipp; Mourachko, Alexandre; Ropers, Christophe; Saleem, Safiyyah; Schwenk, Holger; Wang, Jeff (June 2024). "Scaling neural machine translation to 200 languages". Nature. 630 (8018): 841–846. Bibcode:2024Natur.630..841N. doi:10.1038/s41586-024-07335-x. ISSN   1476-4687. PMC   11208141 .
  19. "Wikipedia's Moment of Truth". The New York Times. Retrieved 29 November 2024.
  20. Johnson, Isaac; Lescak, Emily (2022). "Considerations for Multilingual Wikipedia Research". arXiv: 2204.02483 [cs.CY].
  21. Mamadouh, Virginie (2020). "Wikipedia: Mirror, Microcosm, and Motor of Global Linguistic Diversity". Handbook of the Changing World Language Map. Springer International Publishing. pp. 3773–3799. doi:10.1007/978-3-030-02438-3_200. ISBN   978-3-030-02438-3. Some versions have expanded dramatically using machine translation through the work of bots or web robots generating articles by translating them automatically from the other Wikipedias, often the English Wikipedia. […] In any event, the English Wikipedia is different from the others because it clearly serves a global audience, while other versions serve more localized audience, even if the Portuguese, Spanish, and French Wikipedias also serves a public spread across different continents
  22. Khincha, Siddharth; Jain, Chelsi; Gupta, Vivek; Kataria, Tushar; Zhang, Shuo (2023). "InfoSync: Information Synchronization across Multilingual Semi-structured Tables". arXiv: 2307.03313 [cs.CL].
  23. Harrison, Stephen (2023-01-12). "Should ChatGPT Be Used to Write Wikipedia Articles?". Slate Magazine. Retrieved 2023-01-13.
  24. Woodcock, Claire (2 May 2023). "AI Is Tearing Wikipedia Apart". Vice.
  25. Maiberg, Emanuel (October 9, 2024). "The Editors Protecting Wikipedia from AI Hoaxes". 404 Media . Retrieved October 9, 2024.
  26. Villalobos, Pablo; Ho, Anson; Sevilla, Jaime; Besiroglu, Tamay; Heim, Lennart; Hobbhahn, Marius (2022). "Will we run out of data? Limits of LLM scaling based on human-generated data". arXiv: 2211.04325 [cs.LG].
  27. "Google's comment-ranking system will be a hit with the alt-right". Engadget. 2017-09-01.
  28. Nielsen, Finn Årup (2012). "Wikipedia Research and Tools: Review and Comments". SSRN Working Paper Series. doi:10.2139/ssrn.2129874. ISSN   1556-5068.
  29. Mehdi, Mohamad; Okoli, Chitu; Mesgari, Mostafa; Nielsen, Finn Årup; Lanamäki, Arto (March 2017). "Excavating the mother lode of human-generated text: A systematic review of research that uses the wikipedia corpus". Information Processing & Management. 53 (2): 505–529. doi:10.1016/j.ipm.2016.07.003. S2CID   217265814.
  30. "AI Research Trends - One Hundred Year Study on Artificial Intelligence (AI100)". ai100.stanford.edu.