Steve Young (software engineer)

Steve Young
Born: Stephen John Young, 1951 (age 72–73)
Alma mater: University of Cambridge
Thesis: Speech synthesis from concept with applications to speech output from systems (1978)
Doctoral advisor: Frank Fallside
Website: mi.eng.cam.ac.uk/~sjy

Stephen John Young CBE FRS FREng (born 1951) is a British researcher, [1] Professor of Information Engineering at the University of Cambridge and an entrepreneur. He is one of the pioneers of automated speech recognition [2] and statistical spoken dialogue systems. [3] [4] He served as the Senior Pro-Vice-Chancellor of the University of Cambridge from 2009 to 2015, responsible for planning and resources. From 2015 to 2019, he held a joint appointment between his professorship at Cambridge and Apple, where he was a senior member of the Siri development team. [5]

Early life and education

Young was born in Liverpool on 23 January 1951. He studied at the University of Cambridge, completing a BA in Electrical Sciences in 1973 and a PhD in speech recognition in 1978 under the supervision of Professor Frank Fallside in the Engineering Department. He held lectureships at both Manchester and Cambridge before being elected to the Chair of Information Engineering at the University of Cambridge in 1994. [6]

Research and academic career

He is best known as the lead author of the HTK toolkit, [2] a software package for building hidden Markov models of time series, used mainly for speech recognition. Young developed its first version at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED) in 1989, and in 1993 he co-founded the startup Entropic to distribute and maintain the toolkit. After acquiring Entropic, Microsoft responded to the toolkit's growing worldwide popularity by licensing the core HTK software back to CUED so that it could be made available again. The HTK Book, [7] the tutorial for the toolkit, has received more than 7,000 citations. [8]
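
In an HTK-style recogniser, each word or phone is represented by a hidden Markov model, and incoming acoustic observations are scored against those models to find the most likely word sequence. The short Python sketch below illustrates the underlying idea with the forward algorithm on a toy discrete-output HMM; the function name, model and all parameters are invented for illustration and are not taken from HTK itself.

    import numpy as np

    def forward_log_likelihood(log_pi, log_A, log_B, obs):
        """Return log P(obs | model) for a discrete-output HMM.

        log_pi : (S,)   log initial state probabilities
        log_A  : (S, S) log transition probabilities, log_A[i, j] = log P(j | i)
        log_B  : (S, V) log emission probabilities over a discrete alphabet
        obs    : list of observation symbol indices
        """
        alpha = log_pi + log_B[:, obs[0]]                # log alpha_1(s)
        for t in range(1, len(obs)):
            # log-sum-exp over predecessor states, then add the emission term
            alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, obs[t]]
        return np.logaddexp.reduce(alpha)                # log P(obs | model)

    # Toy 2-state model with a 3-symbol output alphabet (illustrative numbers only).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
    obs = [0, 1, 2, 2]

    print(forward_log_likelihood(np.log(pi), np.log(A), np.log(B), obs))

In a real recogniser the emissions are continuous acoustic feature vectors modelled by Gaussian mixtures or neural networks, and decoding uses the Viterbi algorithm rather than the plain forward pass, but the recursion has the same shape.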

In the late 1990s, Young's research interests shifted to the design of statistical spoken dialogue systems. His most notable contribution to the field is the partially observable Markov decision process (POMDP) based dialogue management framework, [3] [9] [10] which includes the Hidden Information State (HIS) dialogue model, [11] the first practical dialogue management model based on the POMDP framework. His research focuses on developing spoken dialogue systems that are robust to the errors introduced by imperfect speech recognisers and that can adapt and scale online through interaction with real users. One notable instance of this approach is the application of Gaussian process based reinforcement learning for rapid policy optimisation. [12] [13] In recent years, Young's research group has applied deep learning techniques to various submodules of statistical dialogue systems, [14] [15] [16] [17] winning multiple best paper awards at major speech and NLP conferences.
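
A POMDP-based dialogue manager never observes the user's goal directly. Instead it maintains a belief state, a probability distribution over possible dialogue states, which it updates after every (possibly misrecognised) user turn, and the dialogue policy selects the next system action from that belief rather than from a single hypothesis. The sketch below shows one such Bayesian belief update on a deliberately tiny state space; the states, observation model and numbers are invented for illustration, and this is not the HIS model itself, which factors and prunes the state space to remain tractable.

    import numpy as np

    states = ["wants_cheap", "wants_expensive"]

    # P(s' | s): the user's goal is assumed to stay fixed between turns.
    T = np.eye(2)

    # P(o | s'): probability of the recognised user act given the true goal.
    # Rows: true goal; columns: observations "heard_cheap", "heard_expensive".
    O = np.array([[0.8, 0.2],
                  [0.3, 0.7]])

    def belief_update(belief, obs_index):
        """b'(s') is proportional to P(o | s') * sum_s P(s' | s) b(s)."""
        predicted = T.T @ belief           # prediction step (identity here)
        updated = O[:, obs_index] * predicted
        return updated / updated.sum()     # normalise to a distribution

    belief = np.array([0.5, 0.5])                  # uniform prior over goals
    belief = belief_update(belief, obs_index=0)    # recogniser reports "cheap"
    print(dict(zip(states, np.round(belief, 3))))  # mass shifts to wants_cheap

Because the recogniser is only right some of the time, the belief moves towards "wants_cheap" without collapsing onto it; accumulating evidence over several turns in this way is what makes the approach robust to recognition errors.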

Entrepreneurship

Apart from his academic and scientific contributions, Young is also a successful entrepreneur who has taken a leading role in three company acquisitions: [18] Entropic, the HTK startup co-founded in 1993 and acquired by Microsoft in 1999; Phonetic Arts, a speech synthesis company acquired by Google in 2010; and VocalIQ, a spoken dialogue systems company acquired by Apple in 2015.

Awards and honours

Young is a Fellow of the Royal Academy of Engineering, [19] the Institution of Engineering and Technology (IET), the Institute of Electrical and Electronics Engineers (IEEE), the RSA and the International Speech Communication Association (ISCA). [5]

He received the IEEE Signal Processing Society Technical Achievement Award in 2004, and the ISCA Medal for Scientific Achievement in 2010. He also received the European Signal Processing Society Individual Technical Achievement Award in 2013, and the IEEE James L Flanagan Speech and Audio Processing Award in 2015. [5]

In 2020, he was elected a Fellow of the Royal Society (FRS). [20]

Young was appointed Commander of the Order of the British Empire (CBE) in the 2022 Birthday Honours for services to software engineering. [21]

References

  1. "Steve Young – Google Scholar Citations". Google Scholar. Retrieved 2 May 2017.
  2. 1 2 "HTK Speech Recognition Toolkit". University of Cambridge.
  3. 1 2 Williams, Jason; Young, Steve (2007). "Partially observable Markov decision processes for spoken dialogue systems" (PDF). Computer Speech and Language. 21 (2): 393–422. doi:10.1016/j.csl.2006.06.008. S2CID   13903063.
  4. Young, Steve; et al. "The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management" (PDF). Computer Speech and Language.
  5. 1 2 3 "Professor Steve Young, Professor of Information Engineering". University of Cambridge.
  6. "Stephen Young, Emmanuel Fellow".
  7. Young, Steve. "The HTK book" (PDF). Cambridge University Engineering Department.
  8. "Google Scholar" . Retrieved 23 December 2020.
  9. Blaise Thompson and Steve Young (2010). "Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems" (PDF). Computer Speech and Language.
  10. Young, Steve (2013). "POMDP-based Statistical Spoken Dialogue Systems: a Review" (PDF). Proc IEEE.
  11. Steve Young; et al. (2010). "The Hidden Information State Model: a practical framework for POMDP-based spoken dialogue management" (PDF). Computer Speech and Language.
  12. Milica Gasic and Steve Young (2014). "Gaussian processes for POMDP-based dialogue manager optimization". IEEE Trans. Audio, Speech and Language Processing.
  13. Pei-Hao Su; et al. (2016). "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems" (PDF). Proc ACL. arXiv:1605.07669.
  14. Lina Rojas-Barahona; et al. (2016). "Exploiting Sentence and Context Representations in Deep Neural Models for Spoken Language Understanding". Proc Coling. pp. 258–267.
  15. Nikola Mrkšić; et al. (2017). "The Neural Belief Tracker: Data-Driven Dialogue State Tracking" (PDF). Proc ACL.
  16. Tsung-Hsien Wen; et al. (2015). "Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems" (PDF). Proc EMNLP. arXiv:1508.01745.
  17. Tsung-Hsien Wen; et al. (2017). "A Network-based End-to-End Trainable Task-oriented Dialogue System". arXiv:1604.04562 [cs.CL].
  18. "Steve Young: Executive Profile & Biography". Bloomberg L.P.
  19. "Stephen Young". Royal Academy of Engineering. Retrieved 23 December 2020.
  20. "Stephen Young". Royal Society. Retrieved 20 September 2020.
  21. "No. 63714". The London Gazette (Supplement). 1 June 2022. p. B11.