Speech recognition software for Linux

Last updated March 15, 2023

As of the early 2000s, several speech recognition (SR) software packages exist for Linux. Some of them are free and open-source software and others are proprietary software. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.

Linux native speech recognition

History

In the late 1990s, a Linux version of ViaVoice, created by IBM, was made available to users for no charge. In 2002, the free software development kit (SDK) was removed by the developer.

Development status

In the early 2000s, there was a push to get a high-quality Linux native speech recognition engine developed. As a result, several projects dedicated to creating Linux speech recognition programs were begun, such as Mycroft, which is similar to Microsoft Cortana, but open-source.

Speech sample crowdsourcing

It is essential to compile a speech corpus to produce acoustic models for speech recognition projects. VoxForge is a free speech corpus and acoustic model repository that was built to collect transcribed speech to be used in speech recognition projects. VoxForge accepts crowdsourced speech samples and corrections of recognized speech sequences. It is licensed under a GNU General Public License (GPL).

Speech recognition concept

The first step is to begin recording an audio stream on a computer. The user has two main processing options:

Discrete speech recognition (DSR) – processes the audio entirely on the local machine. These are self-contained systems in which every stage of SR runs within the user's computer. As of 2018, this was becoming critical for protecting intellectual property (IP) and avoiding unwanted surveillance.
Remote or server-based SR – transmits an audio speech file to a remote server, which converts the file into a text string file. Because of cloud storage schemes and data mining, this method more easily allows surveillance, theft of information, and the insertion of malware.

Remote recognition was formerly used by smartphones because they lacked sufficient performance, working memory, or storage to process speech recognition within the phone. These limits have largely been overcome, although server-based SR on mobile devices remains universal.
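
The two processing options can be contrasted in a minimal Python sketch. It rests on several assumptions that are not part of the article: the recording is a synthetic tone rather than microphone input, recognize_locally is a placeholder rather than a real engine, and the server URL is hypothetical. The remote request is only built, never sent.

```python
import io
import math
import struct
import wave
from urllib.request import Request

SAMPLE_RATE = 16000  # Hz; assumed here as a typical rate for speech audio


def synthetic_recording(seconds=0.5):
    """Return WAV bytes holding a sine tone, standing in for microphone input."""
    n_frames = int(SAMPLE_RATE * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buffer.getvalue()


def recognize_locally(wav_bytes):
    """Local option: the audio never leaves this process.

    Placeholder for an on-device engine; it only measures the recording.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {"mode": "local", "seconds": duration, "text": None}


def build_remote_request(wav_bytes, url="https://sr.example.invalid/recognize"):
    """Remote option: the whole recording becomes the body of an HTTP POST.

    The URL is hypothetical; the request object is returned without being sent.
    """
    return Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


audio = synthetic_recording()
local_result = recognize_locally(audio)
remote_request = build_remote_request(audio)
```

The sketch makes the privacy difference concrete: in the local path the audio bytes stay inside the process, while in the remote path the complete recording is the payload handed to a third party.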

Speech recognition in browser

Discrete speech recognition can be performed within a web browser and works well with supported browsers. Remote SR does not require installing software on a desktop computer or mobile device as it is mainly a server-based system with the inherent security issues noted above.

Remote: The dictation service records an audio track of the user via a web browser.
DSR: Some solutions work on a client only, without sending data to servers.

Free speech recognition engines

The following is a list of projects dedicated to implementing speech recognition in Linux, and major native solutions. These are not end-user applications but programming libraries that may be used to develop end-user applications.

CMU Sphinx is a general term to describe a group of speech recognition systems developed at Carnegie Mellon University.
HTK was the most famous and widely used speech recognition software before Kaldi.
Julius is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.
Kaldi is a toolkit for speech recognition provided under the Apache licence.
Mozilla DeepSpeech is an open-source speech-to-text engine based on Baidu's Deep Speech research paper.^[1]

VoxForge is a free speech corpus and acoustic model repository for open-source speech recognition engines.

Proprietary speech recognition engines

Janus Recognition Toolkit (JRTk) is a closed-source speech recognition toolkit, mainly targeted at Linux, developed by the Interactive Systems Laboratories at Carnegie Mellon University and Karlsruhe Institute of Technology. Commercial and research licenses are available.^[2]

Voice control and keyboard shortcuts

Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for sending operational commands to a computer or appliance. Voice control typically requires a much smaller vocabulary and thus is much easier to implement.

Simple software combined with keyboard shortcuts has the earliest potential for practically accurate voice control in Linux.
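
A small command vocabulary mapped to keyboard shortcuts can be sketched in a few lines of Python. This is an illustration under assumptions, not a method described in the article: the phrases and key chords in COMMANDS are hypothetical, the transcript is assumed to come from some recognition engine, and the shortcut is expressed as arguments for the xdotool utility without being executed.

```python
# Hypothetical vocabulary: spoken phrase -> key chord in xdotool notation.
COMMANDS = {
    "copy": "ctrl+c",
    "paste": "ctrl+v",
    "undo": "ctrl+z",
    "close window": "alt+F4",
}


def normalise(transcript):
    """Lower-case the transcript and collapse runs of whitespace."""
    return " ".join(transcript.lower().split())


def command_for(transcript):
    """Return an xdotool argument list for a known phrase, else None.

    Anything outside the small vocabulary is rejected rather than guessed.
    """
    chord = COMMANDS.get(normalise(transcript))
    if chord is None:
        return None
    return ["xdotool", "key", chord]


known = command_for("  Close   Window ")
unknown = command_for("open the pod bay doors")
```

Because the recognizer only has to tell a handful of phrases apart, an exact table lookup is enough here. To actually send the shortcut, the returned list could be passed to subprocess.run on a system where xdotool is installed.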

Running Windows speech recognition software with Linux

Via compatibility layer

Programs such as Dragon NaturallySpeaking can be used in Linux through Wine, though some problems may arise depending on which version is used.^[3]

Via virtualized Windows

It is also possible to use Windows speech recognition software under Linux through virtualization. With no-cost virtualization software, Windows and NaturallySpeaking can run under Linux. VMware Server and VirtualBox support copy and paste to and from a virtual machine, making dictated text easily transferable between the virtual machine and the host.

References

[1] "A TensorFlow implementation of Baidu's DeepSpeech architecture". Mozilla. 2017-12-05. Retrieved 2017-12-05.
[2] Roedder, Margit (26 January 2018). "KIT – Janus Recognition Toolkit". isl.ira.uka.de.
[3] "WineHQ – Dragon Naturally Speaking". appdb.winehq.org.

External links

Accessibility, SpeechRecognition – Ubuntu Help

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.