Silent speech interface

A silent speech interface is a device that allows speech communication without the sound produced when people vocalize their speech. As such it is a form of electronic lip reading. It works by having a computer identify the phonemes an individual pronounces from non-auditory sources of information about their speech movements; these are then used to recreate the speech with speech synthesis. [1]
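
The recognition-then-synthesis pipeline just described can be illustrated with a minimal sketch: articulatory feature frames are mapped to phoneme labels, and the label sequence is handed to a synthesizer. The feature values, phoneme templates, nearest-template classifier, and placeholder synthesize function below are hypothetical illustrations under simplifying assumptions, not any published system.

```python
# Minimal sketch of a silent speech interface pipeline:
# non-acoustic articulatory features -> phoneme labels -> speech synthesis.
# All values and the classifier are hypothetical placeholders.
import numpy as np

# Hypothetical "templates": a mean articulatory feature vector per phoneme
# (e.g. tongue height, lip aperture) that would be learned from training data.
PHONEME_TEMPLATES = {
    "p": np.array([0.1, 0.0]),
    "a": np.array([0.9, 0.8]),
    "m": np.array([0.2, 0.1]),
}

def classify_frame(features: np.ndarray) -> str:
    """Assign a feature frame to the nearest phoneme template."""
    return min(PHONEME_TEMPLATES,
               key=lambda p: np.linalg.norm(features - PHONEME_TEMPLATES[p]))

def synthesize(phonemes: list[str]) -> str:
    """Placeholder for a real speech synthesizer (e.g. concatenative or neural TTS)."""
    return "synthesized audio for: " + "-".join(phonemes)

# Example: a short stream of articulatory feature frames.
frames = [np.array([0.12, 0.02]), np.array([0.85, 0.75]), np.array([0.25, 0.12])]
phoneme_sequence = [classify_frame(f) for f in frames]
print(phoneme_sequence)          # ['p', 'a', 'm']
print(synthesize(phoneme_sequence))
```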

Input methods

Silent speech interface systems have been created using ultrasound and optical camera imaging of tongue and lip movements. [2] Electromagnetic articulography devices are another means of tracking tongue and lip movements. [3] Speech movements can also be detected by electromyography of the speech articulator muscles and the larynx. [4] [5] A further source of information is the vocal tract resonance signal transmitted through bone conduction, known as non-audible murmur. [6] Silent speech interfaces have also been created as brain–computer interfaces, using motor cortex activity recorded with intracortical microelectrodes. [7]
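
As one illustration of the electromyography route above, a simple front end might segment a raw signal into overlapping windows and extract a few features per window for a downstream recognizer. This is a minimal sketch under assumed values: the synthetic signal, sampling rate, window sizes, and the choice of RMS and zero-crossing features are not taken from any of the cited systems.

```python
# Illustrative surface-EMG front end: split the signal into overlapping
# analysis windows and compute simple per-window features that a phoneme
# or word recognizer could consume.
import numpy as np

def emg_features(signal: np.ndarray, fs: int = 1000,
                 win_ms: int = 25, step_ms: int = 10) -> np.ndarray:
    """Return one (RMS, zero-crossing count) feature pair per analysis window."""
    win, step = int(fs * win_ms / 1000), int(fs * step_ms / 1000)
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        frame = signal[start:start + win]
        rms = np.sqrt(np.mean(frame ** 2))
        zero_crossings = np.count_nonzero(np.diff(np.sign(frame)))
        feats.append((rms, zero_crossings))
    return np.array(feats)

# Example with a synthetic one-second "recording" sampled at 1 kHz.
rng = np.random.default_rng(0)
fake_emg = rng.normal(scale=0.1, size=1000)
print(emg_features(fake_emg).shape)   # (98, 2): 98 windows, 2 features each
```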

Uses

Such devices are created as aids for people who cannot produce the phonation needed for audible speech, for example after a laryngectomy. [8] Another use is communication when speech is masked by background noise or distorted by a self-contained breathing apparatus. A further practical use is where silent communication is needed, such as when privacy is required in a public place, or when hands-free, silent data transmission is needed during a military or security operation. [2] [9]

In 2002, the Japanese company NTT DoCoMo announced it had created a silent mobile phone using electromyography and imaging of lip movement. The company stated that "the spur to developing such a phone was ridding public places of noise," adding that "the technology is also expected to help people who have permanently lost their voice." [10] The feasibility of using silent speech interfaces for practical communication has since been demonstrated. [11]

In 2019, Arnav Kapur, a researcher at the Massachusetts Institute of Technology, presented a study known as AlterEgo. Its implementation of the silent speech interface enables communication between a user and external devices by sensing the faint neuromuscular signals produced in the speech articulators when words are internally articulated. By leveraging these signals, the AlterEgo system deciphers the user's intended words and translates them into text or commands without the need for audible speech. [12]

In fiction

The decoding of silent speech by a computer plays an important role in Arthur C. Clarke's story and Stanley Kubrick's associated film 2001: A Space Odyssey. In it, HAL 9000, the computer controlling the spaceship Discovery One bound for Jupiter, uncovers a plot by mission astronauts Dave Bowman and Frank Poole to deactivate it, by lip reading their conversations. [13]

In Orson Scott Card's Ender's Game series, the protagonist can converse with an artificial intelligence without making a sound by subvocalizing while wearing a movement sensor in his jaw; he also wears an ear implant through which he hears the AI's replies.

Related Research Articles

Lip reading, also known as speechreading, is a technique of understanding a limited range of speech by visually interpreting the movements of the lips, face and tongue without sound. Estimates of how much speech can be understood this way vary, with some figures as low as 30% of words, because lip reading relies heavily on context, language knowledge, and any residual hearing. Although lip reading is used most extensively by deaf and hard-of-hearing people, most people with normal hearing also process some speech information from sight of the moving mouth.

Cyberware is a relatively new and little-explored field. In science fiction circles, however, the term is commonly used to mean the hardware or machine parts implanted in the human body that act as an interface between the central nervous system and the computers or machinery connected to it.

A brain–computer interface (BCI), sometimes called a brain–machine interface (BMI), is a direct communication link between the brain's electrical activity and an external device, most commonly a computer or robotic limb. BCIs are often aimed at researching, mapping, assisting, augmenting, or repairing human cognitive or sensory-motor functions. They are often conceptualized as a human–machine interface that skips the intermediary of moving body parts such as the hands, although they also raise the possibility of erasing the distinction between brain and machine. BCI implementations range from non-invasive and partially invasive to invasive, based on how physically close the electrodes are to brain tissue.

Electromyography (EMG) is a technique for evaluating and recording the electrical activity produced by skeletal muscles. EMG is performed using an instrument called an electromyograph to produce a record called an electromyogram. An electromyograph detects the electric potential generated by muscle cells when these cells are electrically or neurologically activated. The signals can be analyzed to detect abnormalities, activation level, or recruitment order, or to analyze the biomechanics of human or animal movement. Needle EMG is an electrodiagnostic medicine technique commonly used by neurologists. Surface EMG is a non-medical procedure used to assess muscle activation by several professionals, including physiotherapists, kinesiologists and biomedical engineers. In computer science, EMG is also used in gesture recognition, allowing physical actions to serve as input to a computer as a form of human–computer interaction.
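
One common way to estimate the "activation level" mentioned above is to rectify the raw EMG trace, smooth it into an envelope, and threshold that envelope. The sketch below assumes a Butterworth low-pass filter for the smoothing step and uses an illustrative sampling rate, cutoff, and threshold on synthetic data; it is one plausible approach, not a standard prescribed method.

```python
# Sketch of estimating muscle activation from a raw EMG trace:
# full-wave rectification, low-pass smoothing into an envelope,
# and a simple threshold to mark "active" samples.
import numpy as np
from scipy.signal import butter, filtfilt

def activation_mask(emg: np.ndarray, fs: float = 1000.0,
                    cutoff_hz: float = 5.0, threshold: float = 0.05) -> np.ndarray:
    """Boolean mask: True where the smoothed EMG envelope exceeds the threshold."""
    rectified = np.abs(emg)                       # full-wave rectification
    b, a = butter(4, cutoff_hz, btype="low", fs=fs)
    envelope = filtfilt(b, a, rectified)          # zero-phase smoothing
    return envelope > threshold

# Synthetic example: quiet baseline followed by a burst of activity.
rng = np.random.default_rng(1)
emg = np.concatenate([rng.normal(scale=0.01, size=500),
                      rng.normal(scale=0.2, size=500)])
mask = activation_mask(emg)
print(mask[:500].mean(), mask[500:].mean())   # roughly 0.0 vs close to 1.0
```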

Subvocalization, or silent speech, is the internal speech typically made when reading; it provides the sound of the word as it is read. This is a natural process when reading, and it helps the mind to access meanings to comprehend and remember what is read, potentially reducing cognitive load.

Gesture recognition is an area of research and development in computer science and language technology concerned with the recognition and interpretation of human gestures. A subdiscipline of computer vision, it employs mathematical algorithms to interpret gestures.

Neuroprosthetics is a discipline related to neuroscience and biomedical engineering concerned with developing neural prostheses. Neural prostheses are sometimes contrasted with brain–computer interfaces, which connect the brain to a computer rather than replacing missing biological functionality.

In speech communication, intelligibility is a measure of how comprehensible speech is in given conditions. Intelligibility is affected by the level and quality of the speech signal, the type and level of background noise, reverberation, and, for speech over communication devices, the properties of the communication system. A common standard measurement for the quality of the intelligibility of speech is the Speech Transmission Index (STI). The concept of speech intelligibility is relevant to several fields, including phonetics, human factors, acoustical engineering, and audiometry.
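
The Speech Transmission Index mentioned above can be illustrated with a heavily simplified calculation: a measured modulation-transfer matrix (octave bands by modulation frequencies) is converted to apparent signal-to-noise ratios, clipped, normalized, and combined into a single score between 0 and 1. The input values and the equal band weights below are illustrative assumptions; the published IEC 60268-16 procedure specifies the exact bands, weights, and correction factors.

```python
# Simplified STI-style score from a modulation transfer matrix m
# with shape (octave bands, modulation frequencies).
import numpy as np

def sti_like_score(m: np.ndarray, band_weights: np.ndarray) -> float:
    """Combine a modulation-transfer matrix into one intelligibility-style score."""
    m = np.clip(m, 1e-6, 1 - 1e-6)                # avoid division by zero
    snr = 10 * np.log10(m / (1 - m))              # apparent SNR per cell (dB)
    snr = np.clip(snr, -15, 15)                   # limit to the useful range
    ti = (snr + 15) / 30                          # transmission index in [0, 1]
    band_index = ti.mean(axis=1)                  # average over modulation freqs
    return float(np.dot(band_weights, band_index))

# Example: 7 octave bands x 14 modulation frequencies, equal illustrative weights.
m = np.full((7, 14), 0.8)
weights = np.full(7, 1 / 7)
print(round(sti_like_score(m, weights), 2))       # about 0.7 for m = 0.8
```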

Subvocal recognition (SVR) is the process of taking subvocalization and converting the detected results to a digital output, aural or text-based.

A bioamplifier is an electrophysiological device, a variation of the instrumentation amplifier, used to gather and increase the signal integrity of physiological electrical activity for output to various sources. It may be an independent unit, or integrated into the electrodes.

Electroencephalography (EEG) is a method to record an electrogram of the spontaneous electrical activity of the brain. The biosignals detected by EEG have been shown to represent the postsynaptic potentials of pyramidal neurons in the neocortex and allocortex. It is typically non-invasive, with the EEG electrodes placed along the scalp using the International 10–20 system, or variations of it. Electrocorticography, involving surgical placement of electrodes, is sometimes called "intracranial EEG". Clinical interpretation of EEG recordings is most often performed by visual inspection of the tracing or quantitative EEG analysis.
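
A basic step in the quantitative EEG analysis mentioned above is estimating the power in a frequency band for a channel. The sketch below computes band power from a Welch power spectral density estimate; the synthetic signal, sampling rate, and the choice of the 8-12 Hz alpha band are assumptions for illustration.

```python
# Estimate the power in one EEG frequency band from a single channel.
import numpy as np
from scipy.signal import welch

def band_power(eeg: np.ndarray, fs: float, lo: float, hi: float) -> float:
    """Sum the Welch power spectral density between lo and hi Hz."""
    freqs, psd = welch(eeg, fs=fs, nperseg=min(len(eeg), 1024))
    band = (freqs >= lo) & (freqs <= hi)
    return float(psd[band].sum() * (freqs[1] - freqs[0]))   # rectangle-rule integral

# Synthetic one-channel signal: 10 Hz oscillation plus noise, 4 s at 256 Hz.
fs = 256.0
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(2)
eeg = 20e-6 * np.sin(2 * np.pi * 10 * t) + 5e-6 * rng.normal(size=t.size)
alpha = band_power(eeg, fs, 8.0, 12.0)
total = band_power(eeg, fs, 1.0, 40.0)
# For this synthetic signal, most of the 1-40 Hz power lies in the alpha band.
print(f"alpha fraction of 1-40 Hz power: {alpha / total:.2f}")
```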

Facial electromyography (fEMG) refers to an electromyography (EMG) technique that measures muscle activity by detecting and amplifying the tiny electrical impulses that are generated by muscle fibers when they contract.

Human–computer interaction (HCI) is the study of the design and use of computer technology, focused on the interfaces between people (users) and computers. HCI researchers observe the ways humans interact with computers and design technologies that allow humans to interact with computers in novel ways. A device that allows interaction between a human being and a computer is known as a human–computer interface.

The neurotrophic electrode is an intracortical device designed to read the electrical signals that the brain uses to process information. It consists of a small, hollow glass cone attached to several electrically conductive gold wires. The term neurotrophic means "relating to the nutrition and maintenance of nerve tissue" and the device gets its name from the fact that it is coated with Matrigel and nerve growth factor to encourage the expansion of neurites through its tip. It was invented by neurologist Dr. Philip Kennedy and was successfully implanted for the first time in a human patient in 1996 by neurosurgeon Roy Bakay.

Imagined speech is thinking in the form of sound – "hearing" one's own voice silently to oneself, without the intentional movement of any extremities such as the lips, tongue, or hands. Logically, imagined speech has been possible since the emergence of language; however, the phenomenon is most associated with its investigation through signal processing and detection within electroencephalograph (EEG) data, as well as data obtained using other non-invasive brain–computer interface (BCI) devices.

Auditory feedback (AF) is an aid used by humans to control speech production and singing by helping the individual verify whether the current production of speech or singing accords with their acoustic-auditory intention. This process works through what is known as the auditory feedback loop, a three-part cycle in which individuals first speak, then listen to what they have said, and finally correct it when necessary. From the viewpoint of movement sciences and neurosciences, the acoustic-auditory speech signal can be interpreted as the result of movements of the speech articulators. Auditory feedback can hence be seen as a feedback mechanism controlling skilled actions in the same way that visual feedback controls limb movements.

Frank H. Guenther is an American computational and cognitive neuroscientist whose research focuses on the neural computations underlying speech, including characterization of the neural bases of communication disorders and development of brain–computer interfaces for communication restoration. He is currently a professor of speech, language, and hearing sciences and biomedical engineering at Boston University.

Clinical Electrophysiological Testing is based on techniques derived from electrophysiology used for the clinical diagnosis of patients. There are many processes that occur in the body which produce electrical signals that can be detected. Depending on the location and the source of these signals, distinct methods and techniques have been developed to properly target them.

Data augmentation is a statistical technique which allows maximum likelihood estimation from incomplete data. Data augmentation has important applications in Bayesian analysis, and the technique is widely used in machine learning to reduce overfitting when training machine learning models, achieved by training models on several slightly-modified copies of existing data.
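
The machine-learning sense of data augmentation described above can be sketched for 1-D signals: each training example is copied several times with small, label-preserving modifications so a model sees more variation and overfits less. The jitter scale, shift range, and synthetic example below are illustrative choices, not a prescribed recipe.

```python
# Create noisy, slightly time-shifted variants of one training signal.
import numpy as np

def augment(signal: np.ndarray, copies: int = 3,
            noise_scale: float = 0.05, max_shift: int = 5,
            seed: int = 0) -> list[np.ndarray]:
    """Return several slightly modified copies of a 1-D training signal."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(copies):
        shift = rng.integers(-max_shift, max_shift + 1)   # small time shift
        shifted = np.roll(signal, shift)
        noisy = shifted + rng.normal(scale=noise_scale, size=signal.shape)
        variants.append(noisy)
    return variants

original = np.sin(np.linspace(0, 2 * np.pi, 100))
augmented = augment(original)
print(len(augmented), augmented[0].shape)   # 3 (100,)
```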

AlterEgo is a wearable silent speech input-output device developed at the MIT Media Lab. The device is attached around the head, neck, and jawline and translates the neuromuscular signals produced when the user internally articulates words into text on a computer, without vocalization.

References

  1. Denby B, Schultz T, Honda K, Hueber T, Gilbert J.M., Brumberg J.S. (2010). Silent speech interfaces. Speech Communication 52: 270–287. doi:10.1016/j.specom.2009.08.002
  2. Hueber T, Benaroya E-L, Chollet G, Denby B, Dreyfus G, Stone M. (2010). Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips. Speech Communication 52: 288–300. doi:10.1016/j.specom.2009.11.004
  3. Wang J., Samal A., & Green J. R. (2014). Preliminary test of a real-time, interactive silent speech interface based on electromagnetic articulograph. 5th ACL/ISCA Workshop on Speech and Language Processing for Assistive Technologies, Baltimore, MD, 38–45.
  4. Jorgensen C, Dusan S. (2010). Speech interfaces based upon surface electromyography. Speech Communication 52: 354–366. doi:10.1016/j.specom.2009.11.003
  5. Schultz T, Wand M. (2010). Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication 52: 341–353. doi:10.1016/j.specom.2009.12.002
  6. Hirahara T, Otani M, Shimizu S, Toda T, Nakamura K, Nakajima Y, Shikano K. (2010). Silent-speech enhancement using body-conducted vocal-tract resonance signals. Speech Communication 52: 301–313. doi:10.1016/j.specom.2009.12.001
  7. Brumberg J.S., Nieto-Castanon A, Kennedy P.R., Guenther F.H. (2010). Brain–computer interfaces for speech communication. Speech Communication 52: 367–379. doi:10.1016/j.specom.2010.01.001
  8. Deng Y., Patel R., Heaton J. T., Colby G., Gilmore L. D., Cabrera J., Roy S. H., De Luca C. J., Meltzner G. S. (2009). Disordered speech recognition using acoustic and sEMG signals. In INTERSPEECH-2009, 644–647.
  9. Deng Y., Colby G., Heaton J. T., and Meltzner G. S. (2012). Signal processing advances for the MUTE sEMG-based silent speech recognition system. Military Communications Conference, MILCOM 2012.
  10. Fitzpatrick M. (2002). Lip-reading cellphone silences loudmouths. New Scientist.
  11. Wand M, Schultz T. (2011). Session-independent EMG-based Speech Recognition. Proceedings of the 4th International Conference on Bio-inspired Systems and Signal Processing.
  12. "Project Overview ‹ AlterEgo". MIT Media Lab. Retrieved 2024-05-20.
  13. Clarke, Arthur C. (1972). The Lost Worlds of 2001. London: Sidgwick and Jackson. ISBN 0-283-97903-8.