Coreference

Last updated December 24, 2023

In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in "Bill said Alice would arrive soon, and she did", the words "Alice" and "she" refer to the same person.^[1]

Coreference is often non-trivial to determine. For example, in "Bill said he would come", the word "he" may or may not refer to Bill. Determining which expressions are coreferential is an important part of analyzing or understanding meaning, and often requires information from the context as well as real-world knowledge, such as the tendency of some names to be associated with particular species ("Rover") or kinds of artifacts ("Titanic"), grammatical gender, or other properties.

Linguists commonly use indices to notate coreference, as in Bill_i said he_i would come. Such expressions are said to be coindexed, indicating that they should be interpreted as coreferential.
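As a rough illustration of how coindexing can be read mechanically, the sketch below groups the words of a sentence by the index written after an underscore. The function name coindexed_groups and the plain-text Word_index convention are assumptions made for this example, not an established notation standard.

```python
import re
from collections import defaultdict

# Matches tokens such as "Bill_i" or "he_i," (word, underscore, index, optional punctuation).
COINDEXED = re.compile(r"([A-Za-z']+)_([A-Za-z0-9]+)[.,;:!?]*")

def coindexed_groups(sentence):
    """Return {index: [(token position, word), ...]} for all coindexed tokens."""
    groups = defaultdict(list)
    for position, token in enumerate(sentence.split()):
        match = COINDEXED.fullmatch(token)
        if match:
            word, index = match.groups()
            groups[index].append((position, word))
    return dict(groups)

groups = coindexed_groups("Bill_i said he_i would come.")
```

Here both Bill and he end up in the group for index i, which is all the notation itself asserts: that the two expressions should be interpreted together.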

When expressions are coreferential, the first to occur is often a full or descriptive form (for example, an entire personal name, perhaps with a title and role), while later occurrences use shorter forms (for example, just a given name, surname, or pronoun). The earlier occurrence is known as the antecedent and the other is called a proform, anaphor, or reference. However, pronouns can sometimes refer forward, as in "When she arrived home, Alice went to sleep." In such cases, the coreference is called cataphoric rather than anaphoric.

Coreference is important for binding phenomena in the field of syntax. The theory of binding explores the syntactic relationship that exists between coreferential expressions in sentences and texts.

Types

When exploring coreference, numerous distinctions can be made, e.g. anaphora, cataphora, split antecedents, coreferring noun phrases, etc.^[2] Several of these more specific phenomena are illustrated here:

Anaphora:
a. The music_i was so loud that it_i couldn't be enjoyed. – The anaphor it follows the expression to which it refers (its antecedent).
b. Our neighbors_i dislike the music. If they_i are angry, the cops will show up soon. – The anaphor they follows the expression to which it refers (its antecedent).

Cataphora:
a. If they_i are angry about the music, the neighbors_i will call the cops. – The cataphor they precedes the expression to which it refers (its postcedent).
b. Despite her_i difficulty, Wilma_i came to understand the point. – The cataphor her precedes the expression to which it refers (its postcedent).

Split antecedents:
a. Carol_i told Bob_i to attend the party. They_i arrived together. – The anaphor they has a split antecedent, referring to both Carol and Bob.
b. When Carol_i helps Bob_i and Bob_i helps Carol_i, they_i can accomplish any task. – The anaphor they has a split antecedent, referring to both Carol and Bob.

Coreferring noun phrases:
a. The project leader_i is refusing to help. The jerk_i thinks only of himself_i. – Coreferring noun phrases, whereby the second noun phrase is a predication over the first.
b. Some of our colleagues_i are going to be supportive. These kinds of people_i will earn our gratitude. – Coreferring noun phrases, whereby the second noun phrase is a predication over the first.

Relation to bound variables

Semanticists and logicians sometimes draw a distinction between coreference and what is known as a bound variable.^[3] Bound variables occur when the antecedent to the proform is an indefinite quantified expression, e.g.^[4]

Every student_i has received his_i grade. – The pronoun his is an example of a bound variable
No student_i was upset with his_i grade. – The pronoun his is an example of a bound variable

Quantified expressions such as every student and no student are not considered referential. These expressions are grammatically singular but do not pick out single referents in the discourse or the real world. Thus, the antecedents of his in these examples are not properly referential, and neither is his. Instead, the pronoun is considered a variable that is bound by its antecedent: its reference varies depending on which of the students in the discourse world is being considered. The existence of bound variables is perhaps more apparent in the following example:

Only Jack_i likes his_i grade. – The pronoun his can be a bound variable.

This sentence is ambiguous. It can mean that Jack likes his grade but everyone else dislikes Jack's grade; or that no one likes their own grade except Jack. In the first meaning, his is coreferential; in the second, it is a bound variable because its reference varies over the set of all students.
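The two readings can be made explicit with a rough first-order paraphrase. The predicate names likes and grade, the constant j for Jack, and the restriction of x to the students under discussion are assumptions introduced here for illustration. The paraphrase also treats "dislikes" as simply "does not like".

```latex
% Coreferential reading: "his" is fixed to Jack for every x.
\[
\mathrm{likes}(j, \mathrm{grade}(j)) \land
\forall x\,[\mathrm{likes}(x, \mathrm{grade}(j)) \rightarrow x = j]
\]

% Bound-variable reading: "his" covaries with x.
\[
\mathrm{likes}(j, \mathrm{grade}(j)) \land
\forall x\,[\mathrm{likes}(x, \mathrm{grade}(x)) \rightarrow x = j]
\]
```

The only difference is the argument of grade inside the quantified part: the constant j in the first formula, the variable x in the second.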

Coindex notation is commonly used for both cases. That is, when two or more expressions are coindexed, it does not signal whether one is dealing with coreference or a bound variable (or as in the last example, whether it depends on interpretation).

Coreference resolution

In computational linguistics, coreference resolution is a well-studied problem in discourse. To derive the correct interpretation of a text, or even to estimate the relative importance of various mentioned subjects, pronouns and other referring expressions must be connected to the right individuals. Algorithms intended to resolve coreferences commonly look first for the nearest preceding individual that is compatible with the referring expression. For example, she might attach to a preceding expression such as the woman or Anne, but is much less likely to attach to Bill. Pronouns such as himself are subject to much stricter constraints. As with many linguistic tasks, there is a tradeoff between precision and recall. Cluster-quality metrics commonly used to evaluate coreference resolution algorithms include the Rand index, the adjusted Rand index, and various mutual information-based methods.
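The nearest-compatible-antecedent heuristic and the Rand index can be sketched in a few lines. This is a toy illustration, not a real resolver: the FEATURES lexicon, the function names, and the restriction to gender and number agreement are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy lexicon of (gender, number) features; not a real resource.
FEATURES = {
    "Anne": ("f", "sg"), "Mary": ("f", "sg"), "Bill": ("m", "sg"),
    "she": ("f", "sg"), "he": ("m", "sg"),
}
PRONOUNS = {"she", "he"}

def resolve_nearest(mentions):
    """Assign a cluster id to each mention, in textual order.

    A pronoun joins the cluster of the nearest preceding mention with the
    same features; every other mention starts a new cluster of its own.
    """
    cluster_ids = []
    for i, mention in enumerate(mentions):
        cluster = i  # default: start a new cluster
        if mention in PRONOUNS:
            for j in range(i - 1, -1, -1):  # scan backwards from the pronoun
                if FEATURES[mentions[j]] == FEATURES[mention]:
                    cluster = cluster_ids[j]
                    break
        cluster_ids.append(cluster)
    return cluster_ids

def rand_index(predicted, gold):
    """Fraction of mention pairs on which two clusterings agree.

    Assumes both lists have the same length and at least two mentions.
    """
    pairs = list(combinations(range(len(gold)), 2))
    agreements = sum(
        (predicted[a] == predicted[b]) == (gold[a] == gold[b]) for a, b in pairs
    )
    return agreements / len(pairs)

# "she" is attached to Mary (nearest), although the gold answer here is Anne.
predicted = resolve_nearest(["Anne", "Mary", "she"])
gold = [0, 1, 0]
score = rand_index(predicted, gold)
```

In this made-up case the heuristic picks the nearer of two compatible candidates and gets it wrong, so only one of the three mention pairs agrees with the gold clustering. Feature agreement narrows the candidates but does not by itself choose among them.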

A particular problem for coreference resolution in English is the pronoun it, which has many uses. It can refer much like he and she, except that it generally refers to inanimate objects (the rules are actually more complex: animals may be any of it, he, or she; ships are traditionally she; hurricanes are usually it despite having gendered names). It can also refer to abstractions rather than beings, e.g. He was paid minimum wage, but didn't seem to mind it. Finally, it also has pleonastic uses, which do not refer to anything specific:

It's raining.
It's really a shame.
It takes a lot of work to succeed.
Sometimes it's the loudest who have the most influence.

Pleonastic uses are not considered referential, and so are not part of coreference.^[5]
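A very crude way to screen for pleonastic it is with surface patterns, sketched below. This is a hand-written rule-based stand-in, not the web-based method of Li et al. (2009). The pattern list and the name looks_pleonastic are assumptions for this example, and patterns this simple will produce both false positives and false negatives on real text.

```python
import re

# Hypothetical hand-written patterns covering only the example sentences above.
PLEONASTIC_PATTERNS = [
    # Weather predicates: "it's raining", "it was snowing"
    re.compile(r"\bit(?:'s| is| was) (?:\w+ )?(?:raining|snowing|hailing)\b"),
    # Evaluative nouns: "it's really a shame"
    re.compile(r"\bit(?:'s| is| was) (?:\w+ )?a (?:shame|pity)\b"),
    # "it takes ... to <verb>"
    re.compile(r"\bit takes\b.+\bto \w+"),
    # Cleft-like frame: "it's the loudest who ..."
    re.compile(r"\bit(?:'s| is| was) the \w+ (?:who|that)\b"),
]

def looks_pleonastic(sentence):
    """Return True if any pattern matches the lower-cased sentence."""
    text = sentence.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(pattern.search(text) for pattern in PLEONASTIC_PATTERNS)
```

A resolver could use a filter of this kind to skip such occurrences of it before searching for an antecedent, though a practical system would need a far more robust classifier.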

Approaches to coreference resolution can broadly be divided into mention-pair, mention-ranking, and entity-based algorithms. Mention-pair algorithms make a binary decision about whether a given pair of mentions belongs to the same entity. Entity-wide constraints such as gender are not considered, which leads to error propagation. For example, the pronouns he and she can each have a high probability of coreference with the teacher, but cannot be coreferent with each other. Mention-ranking algorithms expand on this idea but stipulate that each mention can be coreferent with only one (previous) mention. As a result, each previous mention must be given a score, and the highest-scoring mention (or no mention) is linked. Finally, in entity-based methods, mentions are linked based on information about the whole coreference chain rather than individual mentions. Representing a variable-width chain is more complex and computationally expensive than mention-based methods, which has led to these algorithms being mostly based on neural network architectures.
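The difference between mention-pair and mention-ranking decisions can be sketched with the teacher example. The scores below are made-up numbers, and the function names, the threshold, and the fixed no-antecedent score are assumptions for illustration; real systems learn these scores from data.

```python
# Mentions in textual order: 0 = "the teacher", 1 = "he", 2 = "she".
# Hypothetical pairwise scores, keyed as (antecedent index, mention index).
PAIR_SCORES = {(0, 1): 0.8, (0, 2): 0.7, (1, 2): 0.1}

def mention_pair_clusters(n, scores, threshold=0.5):
    """Link every pair scoring above the threshold, merging clusters transitively."""
    cluster = list(range(n))  # each mention starts in its own cluster
    for (a, b), score in sorted(scores.items()):
        if score > threshold:
            old, new = cluster[b], cluster[a]
            cluster = [new if c == old else c for c in cluster]
    return cluster

def mention_ranking_links(n, scores, no_antecedent_score=0.5):
    """Give each mention at most one antecedent: its highest-scoring candidate.

    A mention is left unlinked (None) if no candidate beats no_antecedent_score.
    """
    links = [None]  # the first mention has no preceding candidates
    for m in range(1, n):
        best = max(range(m), key=lambda a: scores[(a, m)])
        links.append(best if scores[(best, m)] > no_antecedent_score else None)
    return links

pair_result = mention_pair_clusters(3, PAIR_SCORES)     # [0, 0, 0]
ranking_result = mention_ranking_links(3, PAIR_SCORES)  # [None, 0, 0]
```

With these toy numbers, the independent pairwise decisions place he and she in one cluster even though their own pair scores only 0.1. The ranking sketch makes a single choice per mention, but here it still links both pronouns to the teacher, since nothing in either function inspects the cluster as a whole.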

Notes

[1] For definitions of coreference, see for instance Crystal (1997:94) and Radford (2004:332).
[2] These distinctions (anaphora, cataphora, split antecedents, coreferring noun phrases, etc.) are discussed in Jurafsky and Martin (2000:669ff).
[3] For discussions of bound variables, see for instance Portner (2005:102ff.).
[4] See Jurafsky and Martin (2000:701) for an example of a bound variable like the ones given here.
[5] Li et al. (2009) have demonstrated high accuracy in sorting out pleonastic it, and this success promises to improve the accuracy of coreference resolution overall.

References

Crystal, D. 1997. A dictionary of linguistics and phonetics. 4th edition. Cambridge, MA: Blackwell Publishing.
Jurafsky, D. and J. H. Martin 2000. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. New Delhi, India: Pearson Education.
Portner, P. 2005. What is semantics?: Fundamentals of formal semantics. Malden, MA: Blackwell Publishing.
Radford, A. 2004. English syntax: An introduction. Cambridge, UK: Cambridge University Press.
Li, Y., P. Musilek, M. Reformat, and L. Wyard-Scott 2009. Identification of pleonastic it using the web. Journal of Artificial Intelligence Research 34, 339–389.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
