Winograd schema challenge

The Winograd schema challenge (WSC) is a test of machine intelligence proposed in 2012 by Hector Levesque, a computer scientist at the University of Toronto. Designed to be an improvement on the Turing test, it is a multiple-choice test that employs questions of a very specific structure: they are instances of what are called Winograd schemas, named after Terry Winograd, professor of computer science at Stanford University. [1]

On the surface, Winograd schema questions simply require the resolution of anaphora: the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a task of natural language processing, but Levesque argues that for Winograd schemas, the task requires the use of knowledge and commonsense reasoning. [2]

The challenge is considered to have been defeated in 2019, when a number of transformer-based language models achieved accuracies of over 90%. [3]

History

The Winograd Schema Challenge was proposed in the spirit of the Turing test. Proposed by Alan Turing in 1950, the Turing test plays a central role in the philosophy of artificial intelligence. Turing proposed that, instead of debating whether a machine can think, the science of AI should be concerned with demonstrating intelligent behavior, which can be tested. But the exact nature of the test Turing proposed has come under scrutiny, especially since an AI chatbot named Eugene Goostman was claimed to have passed it in 2014. One of the major concerns with the Turing test is that a machine could easily pass the test with brute force and/or trickery, rather than true intelligence. [4]

The Winograd schema challenge was proposed in 2012 in part to address the problems revealed by the nature of the programs that performed well on the Turing test. [5]

Turing's original proposal was what he called the imitation game, which involves free-flowing, unrestricted conversations in English between human judges and computer programs over a text-only channel (such as teletype). In general, the machine passes the test if interrogators are not able to tell the difference between it and a human in a five-minute conversation. [4]

Nuance Communications announced in July 2014 that it would sponsor an annual WSC competition, with a prize of $25,000 for the best system that could match human performance. [6] However, the prize is no longer offered.

Weaknesses of the Turing test

The performance of Eugene Goostman exhibited some of the Turing test's problems. Levesque identifies several major issues with the test. [2] [7]

Winograd schemas

The key factor in the WSC is the special format of its questions, which are derived from Winograd schemas. Questions of this form may be tailored to require knowledge and commonsense reasoning in a variety of domains. They must also be carefully written not to betray their answers by selectional restrictions or statistical information about the words in the sentence.

Origin

The first cited example of a Winograd schema (and the reason for their name) is due to Terry Winograd: [8]

The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

The choices of "feared" and "advocated" turn the schema into its two instances:

The city councilmen refused the demonstrators a permit because they feared violence.

The city councilmen refused the demonstrators a permit because they advocated violence.

The schema challenge question is, "Does the pronoun 'they' refer to the city councilmen or the demonstrators?" Switching between the two instances of the schema changes the answer. The answer is immediate for a human reader, but proves difficult to emulate in machines. Levesque [2] argues that knowledge plays a central role in these problems: the answer to this schema has to do with our understanding of the typical relationships between and behavior of councilmen and demonstrators.

Since the original proposal of the Winograd schema challenge, Ernest Davis, a professor at New York University, has compiled a list of over 140 Winograd schemas from various sources as examples of the kinds of questions that should appear on the Winograd schema challenge. [9]

Formal description

A Winograd schema challenge question consists of three parts:

  1. A sentence or brief discourse that contains the following:
    • Two noun phrases of the same semantic class (male, female, inanimate, or group of objects or people),
    • An ambiguous pronoun that may refer to either of the above noun phrases, and
    • A special word and alternate word, such that if the special word is replaced with the alternate word, the natural resolution of the pronoun changes.
  2. A question asking the identity of the ambiguous pronoun, and
  3. Two answer choices corresponding to the noun phrases in question.

A machine will be given the problem in a standardized form which includes the answer choices, thus making it a binary decision problem.
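
The three-part structure above can be sketched as a small data structure. This is an illustrative representation, not an official format; all field names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    """A Winograd schema: a sentence template plus the two word choices
    that flip the natural resolution of its ambiguous pronoun."""
    template: str       # sentence with a "[word]" slot and an ambiguous pronoun
    special: str        # word yielding one resolution of the pronoun
    alternate: str      # word yielding the other resolution
    pronoun: str        # the ambiguous pronoun being resolved
    candidates: tuple   # the two noun-phrase answer choices
    answers: tuple      # correct candidate for (special, alternate) in turn

    def instances(self):
        """Yield the two sentence instances with their correct answers."""
        for word, answer in zip((self.special, self.alternate), self.answers):
            yield self.template.replace("[word]", word), answer

schema = WinogradSchema(
    template="The city councilmen refused the demonstrators a permit "
             "because they [word] violence.",
    special="feared",
    alternate="advocated",
    pronoun="they",
    candidates=("the city councilmen", "the demonstrators"),
    answers=("the city councilmen", "the demonstrators"),
)

for sentence, answer in schema.instances():
    print(f"{sentence}  ->  '{schema.pronoun}' = {answer}")
```

Presenting one instance together with its two candidates is what reduces the task to a binary decision problem.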

Advantages

The Winograd schema challenge has several purported advantages over the Turing test.

Pitfalls

One difficulty with the Winograd schema challenge is the development of the questions. They need to be carefully tailored to ensure that they require commonsense reasoning to solve. For example, Levesque [5] gives the following example of a so-called Winograd schema that is "too easy":

The women stopped taking pills because they were [pregnant/carcinogenic]. Which individuals were [pregnant/carcinogenic]?

The answer to this question can be determined on the basis of selectional restrictions alone: in any situation, pills do not get pregnant but women do, and women cannot be carcinogenic but pills can. The answer can therefore be derived without any reasoning or understanding of the sentences' meaning; all that is needed is data on the selectional restrictions of "pregnant" and "carcinogenic".
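
That shortcut can be made concrete with a minimal sketch. Given co-occurrence statistics (the counts below are invented purely for illustration, not drawn from any real corpus), a program can pick whichever candidate noun plausibly combines with the special word, with no understanding of the sentence at all:

```python
# Toy co-occurrence counts (invented for illustration): how often each
# candidate noun is described by each adjective in some hypothetical corpus.
cooccurrence = {
    ("women", "pregnant"): 250,
    ("pills", "pregnant"): 0,
    ("women", "carcinogenic"): 0,
    ("pills", "carcinogenic"): 40,
}

def resolve_by_selectional_restriction(candidates, adjective):
    """Pick the candidate most often seen with the adjective.

    This exploits selectional restrictions only; no reasoning about
    the meaning of the sentence is involved."""
    return max(candidates, key=lambda c: cooccurrence.get((c, adjective), 0))

# Both instances of the "too easy" schema fall to the same statistics:
print(resolve_by_selectional_restriction(["women", "pills"], "pregnant"))      # women
print(resolve_by_selectional_restriction(["women", "pills"], "carcinogenic"))  # pills
```

This is exactly the kind of shallow statistical attack a well-constructed Winograd schema is supposed to rule out.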

Activity

In 2016 and 2018, Nuance Communications sponsored a competition, offering a grand prize of $25,000 to the top scorer above 90% (for comparison, humans correctly answer 92–96% of WSC questions [10] ). However, nobody came close to winning the prize in 2016, and the 2018 competition was cancelled for lack of prospects; [11] the prize is no longer offered. [12]

The Twelfth International Symposium on the Logical Formalizations of Commonsense Reasoning was held on March 23–25, 2015 at the AAAI Spring Symposium Series at Stanford University, with a special focus on the Winograd schema challenge. The organizing committee included Leora Morgenstern (Leidos), Theodore Patkos (The Foundation for Research & Technology Hellas), and Robert Sloan (University of Illinois at Chicago). [13]

The 2016 Winograd Schema Challenge was run on July 11, 2016 at IJCAI-16. There were four contestants. The first round of the contest was to solve PDPs (pronoun disambiguation problems), adapted from literary sources rather than constructed as pairs of sentences. [14] The highest score achieved was 58% correct, by Quan Liu et al., of the University of Science and Technology of China. [15] Hence, by the rules of that challenge, no prizes were awarded, and the challenge did not proceed to the second round. The organizing committee in 2016 was Leora Morgenstern, Ernest Davis, and Charles Ortiz. [16]

In 2017, a neural association model designed for commonsense knowledge acquisition achieved 70% accuracy on 70 manually selected problems from the original 273 Winograd schema dataset. [17] In June 2018, a score of 63.7% accuracy was achieved on the full dataset using an ensemble of recurrent neural network language models, [18] marking the first use of deep neural networks that learn from independent corpora to acquire commonsense knowledge. In 2019, a score of 90.1% was achieved on the original Winograd schema dataset by fine-tuning the BERT language model with appropriate WSC-like training data, avoiding the need to learn commonsense reasoning. [10] The general language model GPT-3 achieved a score of 88.3% without specific fine-tuning in 2020. [19]

A more challenging, adversarial "Winogrande" dataset of 44,000 problems was designed in 2019. This dataset consists of fill-in-the-blank style sentences, as opposed to the pronoun format of previous datasets. [10]

A version of the Winograd schema challenge is one part of the GLUE (General Language Understanding Evaluation) benchmark collection of challenges in automated natural-language understanding. [20]

References

  1. Ackerman, Evan (29 July 2014). "Can Winograd Schemas Replace Turing Test for Defining Human-level AI". IEEE Spectrum. Retrieved 29 October 2014.
  2. Levesque, H. J. (2014). "On our best behaviour". Artificial Intelligence. 212: 27–35. doi:10.1016/j.artint.2014.03.007.
  3. Kocijan, Vid; Davis, Ernest; Lukasiewicz, Thomas; Marcus, Gary; Morgenstern, Leora (11 July 2023). "The defeat of the Winograd Schema Challenge". Artificial Intelligence. 325: 103971. arXiv: 2201.02387 . doi:10.1016/j.artint.2023.103971. ISSN   0004-3702. S2CID   245827747.
  4. Turing, Alan (October 1950). "Computing Machinery and Intelligence" (PDF). Mind. LIX (236): 433–460. doi:10.1093/mind/LIX.236.433. Retrieved 28 October 2014.
  5. Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012). The Winograd Schema Challenge. Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning.
  6. "Nuance announces the Winograd Schemas Challenge to Advance Artificial Intelligence Innovation". Business Wire. 28 July 2014. Retrieved 9 November 2014.
  7. Michael, Julian (18 May 2015). The Theory of Correlation Formulas and Their Application to Discourse Coherence (Thesis). UT Digital Repository. p. 6. hdl:2152/29979.
  8. Winograd, Terry (January 1972). "Understanding Natural Language" (PDF). Cognitive Psychology. 3 (1): 1–191. doi:10.1016/0010-0285(72)90002-3 . Retrieved 4 November 2014.
  9. Davis, Ernest. "A Collection of Winograd Schemas". cs.nyu.edu. NYU. Retrieved 30 October 2014.
  10. Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin (2019). "WinoGrande: An Adversarial Winograd Schema Challenge at Scale". arXiv:1907.10641 [cs.CL].
  11. Boguslavsky, I.M.; Frolova, T.I.; Iomdin, L.L.; Lazursky, A.V.; Rygaev, I.P.; Timoshenko, S.P. (2019). "Knowledge-based approach to Winograd Schema Challenge" (PDF). Proceedings of the International Conference of Computational Linguistics and Intellectual Technologies. Moscow. The prize could not be awarded to anybody. Most of the participants showed a result close to the random choice or even worse. The second competition scheduled for 2018 was canceled due to the lack of prospective participants.
  12. "Winograd Schema Challenge". CommonsenseReasoning.org. Retrieved 24 January 2020.
  13. "AAAI 2015 Spring Symposia". Association for the Advancement of Artificial Intelligence. Retrieved 1 January 2015.
  14. Davis, Ernest; Morgenstern, Leora; Ortiz, Charles (Fall 2017). "The First Winograd Schema Challenge at IJCAI-16". AI Magazine.
  15. Liu, Quan; Jiang, Hui; Ling, Zhen-Hua; Zhu, Xiaodan; Wei, Si; Hu, Yu (2016). "Commonsense Knowledge Enhanced Embeddings for Solving Pronoun Disambiguation Problems in Winograd Schema Challenge". arXiv: 1611.04146 [cs.AI].
  16. Morgenstern, Leora; Davis, Ernest; Ortiz, Charles L. (March 2016). "Planning, Executing, and Evaluating the Winograd Schema Challenge". AI Magazine. 37 (1): 50–54. doi: 10.1609/aimag.v37i1.2639 . ISSN   0738-4602.
  17. Liu, Quan; Jiang, Hui; Evdokimov, Andrew; Ling, Zhen-Hua; Zhu, Xiaodan; Wei, Si; Hu, Yu (2017). "Cause-Effect Knowledge Acquisition and Neural Association Model for Solving a Set of Winograd Schema Problems". Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. pp. 2344–2350. doi: 10.24963/ijcai.2017/326 . ISBN   9780999241103.
  18. Trinh, Trieu H.; Le, Quoc V. (26 September 2019). "A Simple Method for Commonsense Reasoning". arXiv: 1806.02847 [cs.AI].
  19. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; et al. (2020). "Language Models are Few-Shot Learners". arXiv: 2005.14165 [cs.CL].
  20. "GLUE Benchmark". GlueBenchmark.com. Retrieved 30 July 2019.