Harvard sentences

The Harvard sentences, or Harvard lines, [1] are a collection of 720 sample phrases, divided into lists of 10, used for standardized testing of Voice over IP, cellular, and other telephone systems. They are phonetically balanced sentences, using specific phonemes at the same frequency at which they appear in English.

IEEE Recommended Practice for Speech Quality Measurements [2] sets out seventy-two lists of ten phrases each, described as the "1965 Revised List of Phonetically Balanced Sentences (Harvard Sentences)." They are widely used in research on telecommunications, speech, and acoustics, where standardized and repeatable sequences of speech are needed. The Open Speech Repository [3] provides some freely usable, prerecorded WAV files of Harvard Sentences in American and British English, in male and female voices.

Harvard lines are also used to observe how an actor's mouth moves during speech, which helps in creating more realistic CGI models. [1]

Sample Harvard sentences

The first three lists are as follows: [4]

List 1

  1. The birch canoe slid on the smooth planks.
  2. Glue the sheet to the dark blue background.
  3. It's easy to tell the depth of a well.
  4. These days a chicken leg is a rare dish.
  5. Rice is often served in round bowls.
  6. The juice of lemons makes fine punch.
  7. The box was thrown beside the parked truck.
  8. The hogs were fed chopped corn and garbage.
  9. Four hours of steady work faced us.
  10. A large size in stockings is hard to sell.

List 2

  1. The boy was there when the sun rose.
  2. A rod is used to catch pink salmon.
  3. The source of the huge river is the clear spring.
  4. Kick the ball straight and follow through.
  5. Help the woman get back to her feet.
  6. A pot of tea helps to pass the evening.
  7. Smoky fires lack flame and heat.
  8. The soft cushion broke the man's fall.
  9. The salt breeze came across from the sea.
  10. The girl at the booth sold fifty bonds.

List 3

  1. The small pup gnawed a hole in the sock.
  2. The fish twisted and turned on the bent hook.
  3. Press the pants and sew a button on the vest.
  4. The swan dive was far short of perfect.
  5. The beauty of the view stunned the young boy.
  6. Two blue fish swam in the tank.
  7. Her purse was full of useless trash.
  8. The colt reared and threw the tall rider.
  9. It snowed, rained, and hailed the same morning.
  10. Read verse out loud for pleasure.
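For scripted testing, the three lists above can be held as plain data. The sketch below is illustrative only (the names `HARVARD_LISTS` and `iter_sentences` are hypothetical, not from any standard library); it shows one way to enumerate the sentences in test order, as might be done when driving playback through a channel under test.

```python
# The first three Harvard lists as Python data, plus a helper that
# yields (list_number, position, sentence) tuples in test order.

HARVARD_LISTS = {
    1: [
        "The birch canoe slid on the smooth planks.",
        "Glue the sheet to the dark blue background.",
        "It's easy to tell the depth of a well.",
        "These days a chicken leg is a rare dish.",
        "Rice is often served in round bowls.",
        "The juice of lemons makes fine punch.",
        "The box was thrown beside the parked truck.",
        "The hogs were fed chopped corn and garbage.",
        "Four hours of steady work faced us.",
        "A large size in stockings is hard to sell.",
    ],
    2: [
        "The boy was there when the sun rose.",
        "A rod is used to catch pink salmon.",
        "The source of the huge river is the clear spring.",
        "Kick the ball straight and follow through.",
        "Help the woman get back to her feet.",
        "A pot of tea helps to pass the evening.",
        "Smoky fires lack flame and heat.",
        "The soft cushion broke the man's fall.",
        "The salt breeze came across from the sea.",
        "The girl at the booth sold fifty bonds.",
    ],
    3: [
        "The small pup gnawed a hole in the sock.",
        "The fish twisted and turned on the bent hook.",
        "Press the pants and sew a button on the vest.",
        "The swan dive was far short of perfect.",
        "The beauty of the view stunned the young boy.",
        "Two blue fish swam in the tank.",
        "Her purse was full of useless trash.",
        "The colt reared and threw the tall rider.",
        "It snowed, rained, and hailed the same morning.",
        "Read verse out loud for pleasure.",
    ],
}

def iter_sentences(lists=HARVARD_LISTS):
    """Yield (list_number, position, sentence) for each sentence, in order."""
    for list_no in sorted(lists):
        for pos, sentence in enumerate(lists[list_no], start=1):
            yield list_no, pos, sentence
```

Keeping each list intact (rather than flattening the sentences into one pool) matters because test procedures typically score intelligibility per ten-sentence list.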

References

  1. "Why it's so hard to make CGI skin look real" (at 7m13s), Vox, 3 August 2021. Archived at Ghostarchive.org on 4 May 2022.
  2. "IEEE Recommended Practice for Speech Quality Measurements". IEEE Transactions on Audio and Electroacoustics. 17 (3): 225–246. September 1969. doi:10.1109/TAU.1969.1162058 . Retrieved 2012-01-05.
  3. "The Open Speech Repository" . Retrieved 2012-01-05.
  4. "Harvard Sentences". www.cs.columbia.edu. Archived from the original on 2022-02-24. Retrieved 2022-03-04.