TRACE (psycholinguistics)

TRACE is a connectionist model of speech perception, proposed by James McClelland and Jeffrey Elman in 1986. [1] It is based on a structure called "the TRACE", a dynamic processing structure made up of a network of units, which serves both as the system's working memory and as its perceptual processing mechanism. [2] TRACE was implemented as a working computer program for running perceptual simulations. These simulations are predictions about how the human mind/brain processes speech sounds and words as they are heard in real time.

Inspiration

TRACE was created during the formative period of connectionism, and was included as a chapter in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. [3] The researchers found that certain problems regarding speech perception could be conceptualized in terms of a connectionist interactive activation model. The problems were that

  1. speech is extended in time
  2. the sounds of speech (phonemes) overlap with each other
  3. the articulation of a speech sound is affected by the sounds that come before and after it, and
  4. there is natural variability in speech (e.g. foreign accent) as well as noise in the environment (e.g. busy restaurant).

Each of these makes the speech signal complex and often ambiguous, making it difficult for the human mind/brain to decide which words it is actually hearing. In very simple terms, an interactive activation model addresses this problem by placing different kinds of processing units (phonemes, words) in separate layers, allowing activated units to pass information between layers, and having units within a layer compete with one another until the “winner” is considered “recognized” by the model.
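The within-layer competition described above can be made concrete with a minimal sketch. It is not the published TRACE code: the update rule follows the general form used in interactive activation models, and every name and value below (`ia_update`, `run_layer`, the decay and inhibition constants, the three phoneme inputs) is an assumption chosen only for illustration.

```python
def ia_update(act, net, rest=0.0, a_min=-0.2, a_max=1.0, decay=0.1):
    """One activation update for a single unit (illustrative parameters)."""
    if net > 0:
        delta = net * (a_max - act)   # excitation pushes toward the ceiling
    else:
        delta = net * (act - a_min)   # inhibition pushes toward the floor
    delta -= decay * (act - rest)     # decay pulls back toward the resting level
    return min(a_max, max(a_min, act + delta))


def run_layer(bottom_up, inhibition=0.3, cycles=30):
    """Units in one layer compete: each unit is excited by its own bottom-up
    input and inhibited by the positive activation of its rivals."""
    acts = {name: 0.0 for name in bottom_up}
    for _ in range(cycles):
        positive_total = sum(max(a, 0.0) for a in acts.values())
        new_acts = {}
        for name, a in acts.items():
            rivals = positive_total - max(a, 0.0)
            net = bottom_up[name] - inhibition * rivals
            new_acts[name] = ia_update(a, net)
        acts = new_acts
    return acts


# Hypothetical bottom-up evidence for three phoneme units.
acts = run_layer({"/b/": 0.20, "/p/": 0.15, "/d/": 0.05})
winner = max(acts, key=acts.get)
print(winner, acts)
```

In this toy setting the unit with the strongest bottom-up support ends up most active, while its rivals are held down by lateral inhibition. A threshold or a "most active unit" rule would then be needed to read the outcome as recognition.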

Key findings

"TRACE was the first model that instantiated the activation of multiple word candidates that match any part of the speech input." [4] A simulation of speech perception involves presenting the TRACE computer program with mock speech input, running the program, and generating a result. A successful simulation indicates that the result is found to be meaningfully similar to how people process speech.

Time-course of word recognition

It is generally accepted in psycholinguistics that (1) when the beginning of a word is heard, a set of words that share the same initial sound becomes activated in memory, [5] (2) the activated words compete with each other as more and more of the word is heard, [6] and (3) at some point, due to both the auditory input and the lexical competition, one word is recognized. [1]

For example, a listener hears the beginning of bald, and the words bald, ball, bad and bill become active in memory. Soon after, only bald and ball remain in competition (bad and bill have been eliminated because their vowel sounds don't match the input). Soon after that, bald is recognized. TRACE simulates this process by representing the temporal dimension of speech, allowing words in the lexicon to vary in activation strength, and having words compete during processing. Figure 1 shows a line graph of word activation in a simple TRACE simulation.
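The time course described above can be sketched with a toy simulation. This is a heavily simplified stand-in, not TRACE itself: it has a word layer only, the phoneme symbols (`"O"`, `"a"`, `"I"`) are hypothetical transcriptions, and the match and mismatch weights, decay, and inhibition values are assumptions chosen for illustration.

```python
# Hypothetical phoneme transcriptions for a four-word lexicon.
LEXICON = {
    "bald": ["b", "O", "l", "d"],
    "ball": ["b", "O", "l"],
    "bad":  ["b", "a", "d"],
    "bill": ["b", "I", "l"],
}


def update(act, net, a_min=-0.2, a_max=1.0, decay=0.05):
    """Interactive-activation-style update with a resting level of 0."""
    if net > 0:
        delta = net * (a_max - act)
    else:
        delta = net * (act - a_min)
    return min(a_max, max(a_min, act + delta - decay * act))


def bottom_up(word_phonemes, heard):
    """Evidence for a word given the phonemes heard so far (assumed weights)."""
    score = 0.0
    for pos, ph in enumerate(heard):
        if pos < len(word_phonemes):
            score += 0.1 if word_phonemes[pos] == ph else -0.3
    return score


def simulate(spoken, cycles_per_phoneme=5, inhibition=0.2):
    """Reveal the input one phoneme at a time and let the words compete."""
    acts = {w: 0.0 for w in LEXICON}
    history = []
    for t in range(1, len(spoken) + 1):
        heard = spoken[:t]
        for _ in range(cycles_per_phoneme):
            positive_total = sum(max(a, 0.0) for a in acts.values())
            acts = {
                w: update(a, bottom_up(LEXICON[w], heard)
                          - inhibition * (positive_total - max(a, 0.0)))
                for w, a in acts.items()
            }
        history.append(dict(acts))  # snapshot after each phoneme
    return history


history = simulate(["b", "O", "l", "d"])
for heard_count, snapshot in enumerate(history, start=1):
    print(heard_count, {w: round(a, 3) for w, a in snapshot.items()})
```

After the first phoneme all four candidates are active. After the vowel, bald and ball pull ahead of bad and bill. Only the final /d/ separates bald from ball. In this sketch ball stays fairly active at the end, because the inhibition value is mild.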

Figure 1 - A simple TRACE simulation. Word activation and competition unfold in time. In this simulation, the word "bald" becomes the most active and is therefore considered to be the one that is recognized.

Lexical effect on phoneme perception

If an ambiguous speech sound is spoken that is exactly in between /t/ and /d/, the hearer may have difficulty deciding what it is. But, if that same ambiguous sound is heard at the end of a word like woo/?/ (where ? is the ambiguous sound), then the hearer will more likely perceive the sound as a /d/. This probably occurs because "wood" is a word but "woot" is not. An ambiguous phoneme presented in a lexical context will be perceived as consistent with the surrounding lexical context. This perceptual effect is known as the Ganong effect. [7] TRACE reliably simulates this, and can explain it in relatively simple terms. Essentially, the lexical unit which has become activated by the input (i.e. wood) feeds back activation to the phoneme layer, boosting the activation of its constituent phonemes (i.e. /d/), thus resolving the ambiguity.
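The feedback account above can be illustrated with a small sketch. It is an assumption-laden toy rather than the TRACE implementation: it has two phoneme units and one word unit, a simple leaky-integrator update, and made-up weights. The function name `perceive_final_phoneme` and the `feedback` parameter are hypothetical.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def perceive_final_phoneme(feedback, cycles=20, rate=0.1):
    """Activation of /t/ and /d/ for an ambiguous final sound after "woo".

    feedback: strength of the word-to-phoneme connection (0.0 disables it).
    """
    phon = {"t": 0.0, "d": 0.0}
    acoustic = {"t": 0.5, "d": 0.5}   # the input supports both phonemes equally
    context = 0.5                     # support for "wood" from the preceding "woo"
    wood = 0.0                        # "woot" is not a word, so it has no unit
    for _ in range(cycles):
        wood_net = context + phon["d"]            # bottom-up: phonemes to word
        new_phon = {}
        for p, rival in (("t", "d"), ("d", "t")):
            top_down = feedback * wood if p == "d" else 0.0   # word to phoneme
            net = acoustic[p] + top_down - 0.5 * phon[rival]  # lateral inhibition
            new_phon[p] = clamp(phon[p] + rate * (net - phon[p]))
        wood = clamp(wood + rate * (wood_net - wood))
        phon = new_phon
    return phon


without_feedback = perceive_final_phoneme(feedback=0.0)
with_feedback = perceive_final_phoneme(feedback=0.5)
print(without_feedback)
print(with_feedback)
```

With feedback switched off, the two phoneme units stay tied because the acoustic evidence is identical. With feedback on, the word unit tips the balance toward /d/, which is the direction of the lexical bias described in the text.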

Lexical basis of segmentation

Speakers usually don't leave pauses in between words when speaking, yet listeners seem to have no difficulty hearing speech as a sequence of words. This is known as the segmentation problem, and is one of the oldest problems in the psychology of language. TRACE proposed the following solution, backed up by simulations. When words become activated and recognized, this reveals the location of word boundaries. Stronger word activation leads to greater confidence about word boundaries, which informs the hearer of where to expect the next word to begin. [1]
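The idea that recognizing words reveals boundaries can be shown with a deliberately simple sketch. It swaps TRACE's graded activation dynamics for an exhaustive symbolic search over a lexicon, so it illustrates only the logic of lexically driven segmentation, not the model's mechanism. The function `segment`, the toy lexicon, and the use of letters in place of phonemes are all assumptions.

```python
def segment(stream, lexicon):
    """Return every way to split `stream` into a sequence of lexicon words."""
    if not stream:
        return [[]]                      # one parse of the empty stream: no words
    parses = []
    for end in range(1, len(stream) + 1):
        candidate = stream[:end]
        if candidate in lexicon:         # a recognized word implies a boundary here
            for rest in segment(stream[end:], lexicon):
                parses.append([candidate] + rest)
    return parses


# Toy lexicon; letters stand in for phonemes.
LEXICON = {"bald", "ball", "bad", "eagle", "ego"}

parses = segment("baldeagle", LEXICON)
print(parses)  # [['bald', 'eagle']]
```

Here the only parse places a boundary after "bald", because no other lexicon word accounts for the start of the stream and leaves a parsable remainder. An activation-based model would reach a comparable outcome gradually, with confidence in the boundary growing as the word's activation grows.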

Process

The TRACE model is a connectionist network with an input layer and three processing layers: pseudo-spectra (feature), phoneme and word. Figure 2 shows a schematic diagram of TRACE. There are three types of connectivity: (1) feedforward excitatory connections from input to features, features to phonemes, and phonemes to words; (2) lateral (i.e., within layer) inhibitory connections at the feature, phoneme and word layers; and (3) top-down feedback excitatory connections from words to phonemes.

The input to TRACE works as follows. The user provides a phoneme sequence that is converted into a multi-dimensional feature vector. This is an approximation of acoustic spectra extended in time. The input vector is revealed a little at a time to simulate the temporal nature of speech. As each new chunk of input is presented, this sends activity along the network connections, changing the activation values in the processing layers. Features activate phoneme units, and phonemes activate word units. Parameters govern the strength of the excitatory and inhibitory connections, as well as many other processing details.

There is no specific mechanism that determines when a word or a phoneme has been recognized. If simulations are being compared to reaction time data from a perceptual experiment (e.g. lexical decision), then typically an activation threshold is used. This allows the model's behavior to be interpreted as recognition, and a recognition time to be recorded as the number of processing cycles that have elapsed. For a deeper understanding of TRACE processing dynamics, readers are referred to the original publication [1] and to a TRACE software tool that runs simulations with a graphical user interface.
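The threshold convention described above can be sketched in a few lines. The function name `recognition_cycle`, the threshold value, and the activation trajectory are all made up for illustration. In practice the trajectory would come from a simulation run, and the threshold would be chosen by the researcher.

```python
def recognition_cycle(activations, threshold):
    """Return the first processing cycle (counting from 1) at which the
    activation reaches the threshold, or None if it never does."""
    for cycle, act in enumerate(activations, start=1):
        if act >= threshold:
            return cycle
    return None


# Hypothetical activation of one word unit over six processing cycles.
trajectory = [0.02, 0.10, 0.25, 0.41, 0.58, 0.70]

print(recognition_cycle(trajectory, threshold=0.5))  # 5
print(recognition_cycle(trajectory, threshold=0.9))  # None
```

The cycle count returned here is what would be compared against reaction times. A higher threshold gives later recognition, or none at all, so the choice of threshold is part of the linking hypothesis between the model and the data.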

Figure 2 - Schematic diagram of TRACE architecture.

Criticism

Modularity of mind debate

TRACE’s relevance to the modularity debate was brought to the fore by Norris, McQueen and Cutler’s (2000) report on the Merge model of speech perception. [8] Merge shares a number of features with TRACE, but differs in one key respect: while TRACE permits word units to feed activation back to the phoneme level, Merge restricts its processing to feed-forward connections. In the terms of this debate, TRACE is considered to violate the principle of information encapsulation, central to modularity, when it permits a later stage of processing (words) to send information to an earlier stage (phonemes). Merge advocates for modularity by arguing that the same class of perceptual phenomena that is accounted for in TRACE can be explained in a connectionist architecture that does not include feedback connections. Norris et al. point out that when two theories can explain the same phenomenon, parsimony dictates that the simpler theory is preferable.

Applications

Speech and language therapy

Models of language processing can be used to conceptualize the nature of impairment in persons with speech and language disorders. For example, it has been suggested that language deficits in expressive aphasia may be caused by excessive competition between lexical units, which prevents any word from becoming sufficiently activated. [9] Arguments for this hypothesis hold that mental dysfunction can be explained by slight perturbations of the network model's processing. This emerging line of research incorporates a wide range of theories and models, and TRACE represents just one piece of a growing puzzle.

Distinction from speech recognition software

Psycholinguistic models of speech perception, e.g. TRACE, must be distinguished from computer speech recognition tools. The former are psychological theories about how the human mind/brain processes information. The latter are engineered solutions for converting an acoustic signal into text. Historically, the two fields have had little contact, but this is beginning to change. [10]

Influence

TRACE’s influence in the psychology literature can be assessed by the number of articles that cite it. There are 345 citations of McClelland and Elman (1986) in the PsycINFO database. Figure 3 shows the distribution of those citations over the years since publication. The figure suggests that interest in TRACE grew significantly in 2001, and has remained strong, with about 30 citations per year.

Figure 3 - Annual breakdown of TRACE citations in PsycINFO research database.

References

  1. McClelland, J. L.; Elman, J. L. (1986).
  2. McClelland, James; Elman, Jeffrey (January 1986). "The TRACE Model of Speech Perception". Cognitive Psychology. 18 (1): 1–86. doi:10.1016/0010-0285(86)90015-0. PMID 3753912. S2CID 7428866.
  3. McClelland, J. L.; Rumelhart, D. E.; the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models. Cambridge, Massachusetts: MIT Press.
  4. Weber, Andrea; Scharenborg, Odette (2012-05-01). "Models of spoken-word recognition". Wiley Interdisciplinary Reviews: Cognitive Science. 3 (3): 387–401. doi:10.1002/wcs.1178. hdl:11858/00-001M-0000-0012-29E4-5. ISSN 1939-5086. PMID 26301470.
  5. Marslen-Wilson, W.; Tyler, L. K. (1980). "The temporal structure of spoken language understanding". Cognition. 8 (1): 1–71. CiteSeerX 10.1.1.299.7676. doi:10.1016/0010-0277(80)90015-3. PMID 7363578. S2CID 11708426.
  6. Luce, P. A.; Pisoni, D. B. (1998). "Recognizing spoken words: The neighborhood activation model". Ear and Hearing. 19 (1): 1–36. doi:10.1097/00003446-199802000-00001. PMC 3467695. PMID 9504270.
  7. Ganong, W. F. (1980). "Phonetic categorization in auditory perception". Journal of Experimental Psychology: Human Perception and Performance. 6: 110–125.
  8. Norris, D.; McQueen, J. M.; Cutler, A. (2000). "Merging information in speech recognition: Feedback is never necessary". Behavioral and Brain Sciences. 23 (3): 299–370. doi:10.1017/s0140525x00003241. hdl:11858/00-001M-0000-0013-3790-1. PMID 11301575. S2CID 32291239.
  9. McNellis, Mark G.; Blumstein, Sheila E. (February 2001). "Self-organizing dynamics of lexical access in normals and aphasics". Journal of Cognitive Neuroscience. 13 (2): 151–170.
  10. Scharenborg, O.; Norris, D.; ten Bosch, L.; McQueen, J. M. (2005). "How should a speech recognizer work?". Cognitive Science. 29 (6): 867–918. doi:10.1207/s15516709cog0000_37. hdl:11858/00-001M-0000-0013-1E5D-C. PMID 21702797.