NETtalk is an artificial neural network. It is the result of research carried out in the mid-1980s by Terrence Sejnowski and Charles Rosenberg. The intent behind NETtalk was to construct a simplified model that might shed light on the complexity of learning human-level cognitive tasks, and to implement it as a connectionist model that could learn to perform a comparable task. The authors trained it in two ways, once with a Boltzmann machine and once with backpropagation. [1]
NETtalk is a program that learns to pronounce written English text by being shown text as input and matching phonetic transcriptions for comparison. [2] [3]
The network was trained on a large set of English words and their corresponding pronunciations, and it could generate pronunciations for unseen words with high accuracy. The success of NETtalk inspired further research in pronunciation generation and speech synthesis, and demonstrated the potential of neural networks for solving complex NLP problems. The output of the network was a stream of phonemes, which was fed into DECtalk to produce audible speech. It achieved popular success, appearing on the Today show. [4] The development process was described in a 1993 interview: it took three months to create the training dataset, but only a few days to train the network. [5]
The network had three layers and 18,629 adjustable weights, large by the standards of 1986. There were concerns that a network of this size would overfit the dataset, but it was trained successfully. The dataset was a 20,000-word subset of the Brown Corpus, with a manually annotated phoneme and stress marker for each letter. [4]
The input layer of the network has 203 units, divided into 7 groups of 29 units each. Each group is a one-hot encoding of one character. There are 29 possible characters: 26 letters, comma, period, and word boundary (whitespace).
The hidden layer has 80 units.
The output layer has 26 units: 21 units encode articulatory features (point of articulation, voicing, vowel height, etc.) of phonemes, and 5 units encode stress and syllable boundaries.
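The following is a minimal sketch of this architecture in Python with NumPy. The layer sizes and character inventory follow the description above; the weight initialization, logistic activations, and all identifiers are illustrative assumptions rather than the original implementation:

```python
import numpy as np

# Dimensions as described above: 7 character groups x 29 one-hot units each.
N_GROUPS, N_CHARS = 7, 29          # 7-letter window, 29 possible characters
N_INPUT = N_GROUPS * N_CHARS       # 203 input units
N_HIDDEN = 80                      # hidden layer
N_OUTPUT = 26                      # 21 articulatory features + 5 stress/boundary units

# Character inventory: 26 letters plus comma, period, and word boundary (space).
ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [",", ".", " "]
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode_window(window: str) -> np.ndarray:
    """One-hot encode a 7-character window into a 203-dimensional vector."""
    assert len(window) == N_GROUPS
    x = np.zeros(N_INPUT)
    for g, ch in enumerate(window):
        x[g * N_CHARS + CHAR_INDEX[ch]] = 1.0
    return x

# Illustrative forward pass with logistic units (initialization is arbitrary).
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUT))
W2 = rng.normal(scale=0.1, size=(N_OUTPUT, N_HIDDEN))
b1, b2 = np.zeros(N_HIDDEN), np.zeros(N_OUTPUT)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(window: str) -> np.ndarray:
    h = sigmoid(W1 @ encode_window(window) + b1)
    return sigmoid(W2 @ h + b2)    # 26 feature activations for the centre letter

print(forward(" hello ").shape)    # (26,)
```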
NETtalk was created to explore the mechanisms of learning to correctly pronounce English text. The authors note that learning to read involves a complex mechanism involving many parts of the human brain. NETtalk does not specifically model the image processing stages and letter recognition of the visual cortex. Rather, it assumes that the letters have been pre-classified and recognized, and these letter sequences comprising words are then shown to the neural network during training and during performance testing. It is NETtalk's task to learn the proper association between a given sequence of letters and its correct pronunciation, based on the context in which the letters appear. In other words, NETtalk learns to use the letters surrounding the currently pronounced phoneme as cues to its intended phonemic mapping.
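For concreteness, a sliding seven-character window over the text, with the letter to be pronounced at its centre, might be produced as follows (a hypothetical sketch; the space padding stands in for the word-boundary character):

```python
def windows(text: str, size: int = 7):
    """Yield (window, centre_letter) pairs; word boundaries are spaces."""
    half = size // 2
    padded = " " * half + text + " " * half
    for i, ch in enumerate(text):
        yield padded[i:i + size], ch

for window, letter in windows("cat"):
    print(repr(window), "->", letter)
# '   cat ' -> c
# '  cat  ' -> a
# ' cat   ' -> t
```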
In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.
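As an illustration of the learning rule, here is a minimal perceptron in Python; the toy data, learning rate, and epoch count are made up for the example:

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Learn weights w and bias b so that sign(w.x + b) matches labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified: update
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data: the class is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.8], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # [ 1.  1. -1. -1.]
```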
In linguistics, mispronunciation is the act of pronouncing a word incorrectly. Languages are pronounced in different ways by different people, depending on factors like the area they grew up in, their level of education, and their social class. Even within groups of the same area and class, people can pronounce words differently.
Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.
Connectionism is an approach to the study of human mental processes and cognition that utilizes mathematical models known as connectionist networks or artificial neural networks. Connectionism has had many "waves" since its beginnings.
A Boltzmann machine, named after Ludwig Boltzmann, is a spin-glass model with an external field, i.e., a Sherrington–Kirkpatrick model, that is a stochastic Ising model. It is a statistical physics technique applied in the context of cognitive science. It is also classified as a Markov random field.
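As a sketch of the underlying model, the energy of a configuration of binary spins s with symmetric weights W and biases theta can be computed as follows (a minimal illustration with arbitrary random parameters, not a full sampling procedure):

```python
import numpy as np

def energy(s, W, theta):
    """Ising-style energy: E(s) = -sum_{i<j} w_ij s_i s_j - sum_i theta_i s_i."""
    return -0.5 * s @ W @ s - theta @ s   # W symmetric with zero diagonal

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                          # symmetrize
np.fill_diagonal(W, 0.0)                   # no self-connections
theta = rng.normal(size=n)
s = rng.choice([-1.0, 1.0], size=n)        # one configuration of spin states
print(energy(s, W, theta))
```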
Terrence Joseph Sejnowski is the Francis Crick Professor at the Salk Institute for Biological Studies where he directs the Computational Neurobiology Laboratory and is the director of the Crick-Jacobs center for theoretical and computational biology. He has performed pioneering research in neural networks and computational neuroscience.
In machine learning, backpropagation is a gradient estimation method commonly used for training neural networks: it computes the gradient of a loss function with respect to the network parameters, which is then used to compute the parameter updates.
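A minimal worked example: one backpropagation step for a single logistic unit under squared-error loss, showing the chain rule in code (all values are illustrative):

```python
import numpy as np

x = np.array([0.5, -0.3])          # input
t = 1.0                            # target
w = np.array([0.2, 0.4])           # weights
b = 0.0                            # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
z = w @ x + b
y = sigmoid(z)
loss = 0.5 * (y - t) ** 2

# Backward pass (chain rule): dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = y - t
dy_dz = y * (1.0 - y)
grad_w = dL_dy * dy_dz * x
grad_b = dL_dy * dy_dz

# Gradient-descent parameter update
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b
print(loss, grad_w, grad_b)
```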
Phonological awareness is an individual's awareness of the phonological structure, or sound structure, of words. Phonological awareness is an important and reliable predictor of later reading ability and has, therefore, been the focus of much research.
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms.
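As a sketch, here is a linear autoencoder trained by plain gradient descent on mean squared reconstruction error; the dimensions, data, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compress 8-dimensional data to a 2-dimensional code and reconstruct it.
D, K = 8, 2
X = rng.normal(size=(100, D))

W_enc = rng.normal(scale=0.1, size=(D, K))    # encoding function
W_dec = rng.normal(scale=0.1, size=(K, D))    # decoding function

lr = 0.01
for _ in range(500):                           # gradient descent on MSE
    Z = X @ W_enc                              # lower-dimensional embeddings
    X_hat = Z @ W_dec                          # reconstruction
    err = X_hat - X
    g_dec = Z.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

print(np.mean((X @ W_enc @ W_dec - X) ** 2))  # reconstruction error
```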
Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide a short-term memory for RNNs that can last thousands of timesteps, hence "long short-term memory". The name is an analogy to long-term memory and short-term memory and their relationship, which cognitive psychologists have studied since the early 20th century.
The CMU Pronouncing Dictionary is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
A time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to (1) classify patterns with shift-invariance, and (2) model context at each layer of the network.
The orthographic depth of an alphabetic orthography indicates the degree to which a written language deviates from simple one-to-one letter–phoneme correspondence. It depends on how easy it is to predict the pronunciation of a word based on its spelling: shallow orthographies are easy to pronounce based on the written word, and deep orthographies are difficult to pronounce based on how they are written.
Deep learning is a subset of machine learning methods that utilize neural networks for representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.
In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
Connectionist temporal classification (CTC) is a type of neural network output and associated scoring function, for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable. It can be used for tasks like on-line handwriting recognition or recognizing phonemes in speech audio. CTC refers to the outputs and scoring, and is independent of the underlying neural network structure. It was introduced in 2006.
Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by neural circuitry. While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron. Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter".