Voice font

A voice font is a computer-generated voice that can be made to pronounce text input and controlled by specifying parameters such as speed and pitch. The concept is akin to that of a text font or a MIDI instrument in that the same input can be rendered in several different ways depending on the design of each font. Despite current shortcomings in the underlying technology, screen readers and other devices that enhance the accessibility of text for persons with disabilities can benefit from having more than one voice font, just as users of a traditional word processor benefit from having more than one text font.
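The analogy can be sketched in code. The following is a minimal toy model, not any real synthesizer API: the class name, fields, and voice names are invented for illustration. It shows how the same text input yields a different rendering under each font's parameters, much as one string set in two text fonts looks different.

```python
from dataclasses import dataclass

@dataclass
class VoiceFont:
    """Toy model of a voice font: a named set of speech parameters."""
    name: str
    rate_wpm: int    # speaking speed, in words per minute
    pitch_hz: float  # baseline pitch of the voice

    def render(self, text: str) -> str:
        # A real synthesizer would produce audio; here we just
        # describe how the input would be spoken.
        return f"[{self.name} @ {self.rate_wpm} wpm, {self.pitch_hz:.0f} Hz] {text}"

# The same input, represented two different ways by two fonts.
alex = VoiceFont("Alex", rate_wpm=180, pitch_hz=110.0)
victoria = VoiceFont("Victoria", rate_wpm=150, pitch_hz=220.0)
print(alex.render("Hello"))
print(victoria.render("Hello"))
```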

Shortcomings

The synthesized voice created using a voice font tends to have a slightly unnatural tone. Human voices change with the speaker's mood and many other factors that are not programmed into computerized voices. Voice font software on the Macintosh works around this by providing tags that change components of the voice, such as pitch. The Natural Voices software listed in the sources section allows the user to define acronym pronunciations, speech rate, and other settings. Although, according to that source and the Speech synthesis article, speech synthesis has existed since around 1930, it remains difficult to fool experienced listeners into believing that a synthesized voice is human.
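As an illustration of such tags, Apple's classic Mac OS speech synthesizer accepted embedded commands of roughly the following form (the exact command names and units varied by system version, so treat this as a sketch rather than a reference):

```
[[rate 120]] This sentence is spoken more slowly.
[[pbas 40]] This sentence uses a lower baseline pitch.
```

The synthesizer strips the bracketed commands from the text and applies them to the speech that follows, letting an author vary pitch and speed within a single passage.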

This difficulty may be similar to that of achieving true artificial intelligence capable of passing a Turing test, which likewise requires presenting observers with something indistinguishable from what it is trying to simulate.

Common uses

Like text fonts, different voice fonts provide different experiences and suit different purposes. The simplest use is to choose the clearest voice from a group, or the one whose speed best fits a given setting.

For people who are hard of hearing in the upper range of the hearing spectrum, for example, selecting a voice with a lower pitch delivers deeper, more audible sounds.
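The selection step described above can be sketched as a simple filter. The voice catalogue below is hypothetical, with the names and pitch values invented for illustration; a real system would query the installed voices instead.

```python
# Hypothetical catalogue of installed voice fonts: (name, baseline pitch in Hz).
voices = [
    ("Alex", 110.0),
    ("Victoria", 220.0),
    ("Fred", 90.0),
]

def pick_low_pitch(catalogue, max_hz=120.0):
    """Keep voices whose baseline pitch falls at or below max_hz,
    deepest first, for listeners with high-frequency hearing loss."""
    suitable = [v for v in catalogue if v[1] <= max_hz]
    return sorted(suitable, key=lambda v: v[1])

print(pick_low_pitch(voices))  # deeper voices listed first
```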

Another use for voice fonts is in electronic music. The set of synthetic voices available on Macintosh computers can supply a voice for pieces where the composer prefers not to use a human performer, or where none is available. Male voices can be combined in a choir to provide the tenor and bass parts of a piece, and female voices can fill in the remaining parts of the ensemble, resulting in a choir that consists of speech synthesis rather than human singers.

Certain Macintosh clients of instant messaging services such as AOL Instant Messenger offered the option of reading incoming messages aloud using the system's voice fonts. When the recipient has stepped away from the computer, or has hidden the part of the screen showing the incoming text, the computer reads the message out loud, letting the user continue with other tasks without needing to view the incoming text.

Sources
