Julius (software)

Julius
Original author(s): Lee Akinobu
Developer(s): Kawahara Lab., Kyoto University; Julius project team, Nagoya Institute of Technology
Initial release: 1991
Stable release: 4.6 / 2 September 2020
Repository: github.com/julius-speech
Written in: C
Operating system: Unix (Linux, BSD, etc.), Windows (via Cygwin)
Platform: IA-32, x86-64
Available in: Japanese, English
Type: Speech recognition
License: Free, BSD-style [1] [2]
Website: julius.osdn.jp/en_index.php

Julius is a speech recognition engine, specifically a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder for speech-related researchers and developers. It can perform near real-time decoding on most current personal computers (PCs) in a 60,000-word dictation task, using a word trigram (3-gram) language model and context-dependent hidden Markov models (HMMs). Major search techniques are fully incorporated.

It is also carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted for interoperability with other free modeling toolkits. The main platform is Linux and other Unix workstations, and it also works on Windows. Julius is free and open-source software, released under a revised BSD-style software license.

Julius has been developed as part of a free software toolkit for Japanese LVCSR research since 1997, and the work was continued at the Continuous Speech Recognition Consortium (CSRC), Japan, from 2000 to 2003.

Since rev. 3.4, a grammar-based recognition parser named Julian has been integrated into Julius. Julian is a modified version of Julius that uses a hand-designed finite-state machine (FSM), termed a deterministic finite automaton (DFA) grammar, as its language model. It can be used to build small-vocabulary voice command systems or various spoken dialog system tasks, as sketched below.
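
For illustration, a Julian task grammar is conventionally written as a pair of files: a .grammar file holding the production rules and a .voca file listing the words of each category with their pronunciations, which the bundled mkdfa.pl script compiles into the .dfa and .dict files the recognizer reads. The sketch below uses invented category names, words, and phone symbols; in practice the phone set (and the silence model names) must match the acoustic model:

    # fruit.grammar -- production rules (NS_B/NS_E are start/end silences)
    S    : NS_B SENT NS_E
    SENT : TAKE_V FRUIT_N

    # fruit.voca -- words per category, each with a phoneme string
    % NS_B
    <s>     sil
    % NS_E
    </s>    sil
    % TAKE_V
    take    t ey k
    % FRUIT_N
    apple   ae p ax l
    orange  ao r ax n jh

Running mkdfa.pl fruit would then produce fruit.dfa and fruit.dict for use with the recognizer's grammar options.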

About models

To run, the Julius recognizer needs a language model and an acoustic model for each language.

Julius adopts acoustic models in Hidden Markov Model Toolkit (HTK) ASCII format, a pronunciation dictionary in HTK-like format, and word 3-gram language models in the ARPA standard format: a forward 2-gram and a reverse 3-gram trained from a speech corpus with reversed word order.
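
As a minimal sketch, a Julius configuration file (.jconf) wires these resources together; the file names below are placeholders, and the option names follow the Julius documentation:

    # main.jconf -- minimal dictation setup (file names are placeholders)
    -h      hmmdefs              # acoustic model, HTK ASCII format
    -hlist  tiedlist             # HMM list mapping triphone names to defined models
    -v      dict.htkdic          # pronunciation dictionary, HTK-like format
    -nlr    forward.2gram.arpa   # forward 2-gram (ARPA), first recognition pass
    -nrl    reverse.3gram.arpa   # reverse 3-gram (ARPA), second recognition pass
    -input  mic                  # capture audio from the microphone

The recognizer would then be started with julius -C main.jconf.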

Although Julius is distributed only with Japanese models, the VoxForge project is working to create English acoustic models for use with the Julius speech recognition engine.

In April 2018, thanks to the efforts of the Mozilla Foundation, a 350-hour audio corpus of spoken English was made available. A new open-source English model, ENVR-v5.4, was released along with the Polish PLPL-v7.1 model; both are available from SourceForge. [3]

Related Research Articles

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By definition of being a Markov model, an HMM has an additional requirement that the outcome of Y at time t must be "influenced" exclusively by the outcome of X at t, and that the outcomes of X and Y before t must be conditionally independent of Y at t given X at time t. Estimation of the parameters of an HMM can be performed using maximum likelihood estimation. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate parameters.
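
Written out, with hidden process X, observed process Y, and time index t, the two defining conditions can be stated (in LaTeX notation) as:

    P(Y_t \,|\, X_1, \ldots, X_t,\; Y_1, \ldots, Y_{t-1}) = P(Y_t \,|\, X_t)
    P(X_t \,|\, X_1, \ldots, X_{t-1}) = P(X_t \,|\, X_{t-1})

The first line says an observation depends only on the current hidden state; the second is the Markov property of the hidden chain.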

The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events. It is used especially in the context of Markov information sources and hidden Markov models (HMMs).
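
As a minimal illustration, the C sketch below recovers the Viterbi path for a toy two-state model; the model probabilities and the observation sequence are invented for the example:

    /* Viterbi decoding of a toy 2-state HMM (illustrative values only). */
    #include <stdio.h>

    #define N 2   /* hidden states */
    #define T 4   /* observation length */
    #define M 2   /* observation symbols */

    int main(void) {
        double pi[N]   = {0.6, 0.4};                  /* initial probabilities */
        double a[N][N] = {{0.7, 0.3}, {0.4, 0.6}};    /* transition probabilities */
        double b[N][M] = {{0.9, 0.1}, {0.2, 0.8}};    /* emission probabilities */
        int obs[T]     = {0, 0, 1, 1};                /* observed symbol sequence */

        double v[T][N];   /* v[t][s]: best path probability ending in s at t */
        int back[T][N];   /* backpointers for path recovery */

        for (int s = 0; s < N; s++)
            v[0][s] = pi[s] * b[s][obs[0]];

        for (int t = 1; t < T; t++)
            for (int s = 0; s < N; s++) {
                v[t][s] = 0.0; back[t][s] = 0;
                for (int r = 0; r < N; r++) {
                    double p = v[t-1][r] * a[r][s] * b[s][obs[t]];
                    if (p > v[t][s]) { v[t][s] = p; back[t][s] = r; }
                }
            }

        /* backtrace from the best final state */
        int path[T], best = 0;
        for (int s = 1; s < N; s++)
            if (v[T-1][s] > v[T-1][best]) best = s;
        path[T-1] = best;
        for (int t = T-1; t > 0; t--)
            path[t-1] = back[t][path[t]];

        for (int t = 0; t < T; t++)
            printf("t=%d state=%d\n", t, path[t]);
        return 0;
    }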

Lawrence R. Rabiner is an electrical engineer working in the fields of digital signal processing and speech processing, in particular digital signal processing for automatic speech recognition. He has worked on speech recognition systems for AT&T Corporation.

Statistical machine translation (SMT) was a machine translation approach that superseded the previous rule-based approach, which required explicit description of each and every linguistic rule, was costly, and often did not generalize to other languages. Since 2003, the statistical approach has itself been gradually superseded by deep learning-based neural machine translation.

CMU Sphinx, also called Sphinx for short, is the general term to describe a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers and an acoustic model trainer (SphinxTrain).

VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines.

An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.

As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.

The CMU Pronouncing Dictionary is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research.
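
Each entry maps a word to its ARPABET pronunciation, with digits marking vowel stress; two representative entries look like:

    HELLO  HH AH0 L OW1
    WORLD  W ER1 L D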

Moses is a statistical machine translation engine, developed at the University of Edinburgh, that can be used to train statistical models of text translation from a source language to a target language. Moses then allows new source-language text to be decoded using these models to produce automatic translations in the target language. Training requires a parallel corpus of passages in the two languages, typically manually translated sentence pairs. Moses is free and open-source software, released under the GNU Lesser General Public License (LGPL), and is available as source code and binaries for Windows and Linux. Its development was supported mainly by the EuroMatrix project, with funding from the European Commission.

HTK is a proprietary software toolkit for handling HMMs. It is mainly intended for speech recognition, but has been used in many other pattern recognition applications that employ HMMs, including speech synthesis, character recognition and DNA sequencing.

Time-inhomogeneous hidden Bernoulli model (TI-HBM) is an alternative to the hidden Markov model (HMM) for automatic speech recognition. In contrast to the HMM, the state transition process in the TI-HBM is not a Markov-dependent process but a generalized Bernoulli process. This difference eliminates state-level dynamic programming in the TI-HBM decoding process, so the computational complexity of TI-HBM for probability evaluation and state estimation drops to O(N·T), from the O(N²·T) of HMM decoding, where N is the number of states and T the utterance length. The TI-HBM is able to model acoustic-unit duration by using a built-in parameter named survival probability. The TI-HBM is simpler and faster than the HMM in a phoneme recognition task, while its performance is comparable to the HMM.

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, which is where the system only translates a fixed and finite set of phrases that have been manually entered into the system. Speech translation technology enables speakers of different languages to communicate. It thus is of tremendous value for humankind in terms of science, cross-cultural exchange and global business.

RWTH ASR is a proprietary speech recognition toolkit.

Janus Recognition Toolkit (JRTk), sometimes referred to as Janus, is a general purpose speech recognition toolkit developed and maintained by the Interactive Systems Laboratories at Carnegie Mellon University and Karlsruhe Institute of Technology. It is useful for both research and application development and is part of the JANUS speech-to-speech translation system.

The following outline is provided as an overview of, and topical guide to, natural-language processing.

In signal processing, feature-space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR transforms acoustic features into speaker-adapted features by multiplication with a transformation matrix. In some literature, fMLLR is also known as Constrained Maximum Likelihood Linear Regression (cMLLR).
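
In the usual formulation (a sketch; conventions on where the bias term sits vary across the literature), each feature vector o_t is adapted as:

    \hat{o}_t = A\,o_t + b = W\,\xi_t, \qquad W = [\,A \;\; b\,], \qquad \xi_t = [\,o_t^\top \;\; 1\,]^\top

so estimating the single matrix W adapts all features for a given speaker.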

The following outline is provided as an overview of, and topical guide to, machine learning:

Stephen John Young is a British researcher, Professor of Information Engineering at the University of Cambridge and an entrepreneur. He is one of the pioneers of automated speech recognition and statistical spoken dialogue systems. He served as the Senior Pro-Vice-Chancellor of the University of Cambridge from 2009 to 2015, responsible for planning and resources. From 2015 to 2019, he held a joint appointment between his professorship at Cambridge and Apple, where he was a senior member of the Siri development team.

References

  1. Callaway, Tom (spot) (2012-08-13). "Licensing/Julius". Fedora Wiki. Red Hat. Retrieved 2019-03-24.
  2. "Large Vocabulary Continuous Speech Recognition Engine Julius". Julius development team. Nagoya Institute of Technology. 2014. Archived from the original on 2019-08-03. Retrieved 2019-03-24.
  3. "JuliusModels - Browse Files at SourceForge.net".