Speech tempo

Speech tempo is a measure of the number of speech units of a given type produced within a given amount of time. Speech tempo is believed to vary within the speech of one person according to contextual and emotional factors, between speakers and also between different languages and dialects. However, there are many problems involved in investigating this variance scientifically.

Problems of definition

While most people seem to believe that they can judge how quickly someone is speaking, subjective judgements are generally not accepted as scientific evidence for statements about speech tempo. John Laver has written that analyzing tempo can be "dangerously open to subjective bias ... listeners' judgements rapidly begin to lose objectivity when the utterance concerned comes either from an unfamiliar accent or ... from an unfamiliar language". [1] Scientific observation depends on accurate segmentation of recorded speech along the time course of an utterance, usually using one of the acoustic analysis software tools available on the internet, such as Audacity or, specifically for speech research, Praat.

Measurements of speech tempo can be strongly affected by pauses and hesitations. For this reason, it is usual to distinguish between speech tempo including pauses and hesitations and speech tempo excluding them. The former is called speaking rate and the latter articulation rate. [2]
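The distinction between the two measures can be illustrated with a minimal sketch. The function name, the choice of syllables as the unit, and the example figures are all assumptions made for illustration; they are not taken from any cited study.

```python
def tempo_rates(n_syllables, total_duration_s, pause_durations_s):
    """Return (speaking_rate, articulation_rate) in syllables per second.

    speaking_rate counts pauses and hesitations as part of the elapsed time;
    articulation_rate excludes them.
    """
    pause_time = sum(pause_durations_s)
    articulation_time = total_duration_s - pause_time
    if total_duration_s <= 0 or articulation_time <= 0:
        raise ValueError("durations must leave a positive amount of speaking time")
    speaking_rate = n_syllables / total_duration_s
    articulation_rate = n_syllables / articulation_time
    return speaking_rate, articulation_rate


# Hypothetical utterance: 40 syllables in 12 s, of which 2 s are pauses.
speaking, articulation = tempo_rates(40, 12.0, [1.5, 0.5])
print(f"speaking rate:     {speaking:.2f} syl/s")
print(f"articulation rate: {articulation:.2f} syl/s")
```

Because the pause time is subtracted from the denominator only, the articulation rate is never lower than the speaking rate for the same utterance.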

Various units of speech have been used as a basis for measurement. The traditional measure of speed in typing and Morse code transmission has been words per minute (wpm). However, in the study of speech the word is not well defined (being primarily a unit of grammar), and speech is not usually temporally stable over a period as long as a minute. Many studies have used the measure of syllables per second, but this is not completely reliable because, although the syllable as a phonological unit of a given language is well-defined, it is not always possible to get agreement on the phonetic syllable. For example, the English word 'particularly' in the form in which it occurs in dictionaries is, phonologically speaking, composed of five syllables /pə.tɪk.jə.lə.li/. Phonetic realizations of the word, however, may be heard as comprising five [pə.tɪk.jə.lə.li], four [pə.tɪk.jə.li], three [pə.tɪk.li] or even two syllables [ptɪk.li], and listeners are likely to have different opinions about the number of syllables heard.

An alternative measure that has been proposed is that of sounds per second. One study found rates varying from an average of 9.4 sounds per second for poetry reading to 13.83 per second for sports commentary. [3] The problem with this approach is that the researcher must be clear as to whether the "sounds" they are counting are phonemes or physically observable phonetic units (sometimes called "phones"). As an example, the utterance 'Don't forget to record it' might in slow, careful speech be pronounced /dəʊnt fəget tə rɪkɔːd ɪt/, with 19 phonemes, each of which is phonetically realized. When the sentence is said at high speed it might be pronounced as [də̃ʊ̃ʔ fɡeʔtrɪkɔːd ɪt], with 16 units. If we are counting only units that can be observed and measured, it is clear that at faster speeds of utterance the number of sounds produced per second does not necessarily increase. [4]
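The arithmetic behind this point can be sketched as follows. The unit counts (19 and 16) are those given for the example utterance; the two durations are hypothetical values chosen for illustration, not measurements.

```python
def units_per_second(n_units, duration_s):
    """Rate of counted units (phonemes or phones) per second."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_units / duration_s


# Unit counts from the example utterance; durations are assumed, not measured.
SLOW_UNITS, SLOW_DURATION_S = 19, 1.9   # careful speech, every phoneme realized
FAST_UNITS, FAST_DURATION_S = 16, 1.6   # rapid speech, some units lost

slow_rate = units_per_second(SLOW_UNITS, SLOW_DURATION_S)
fast_observed_rate = units_per_second(FAST_UNITS, FAST_DURATION_S)
# Counting the 19 underlying phonemes against the shorter duration instead:
fast_phoneme_rate = units_per_second(SLOW_UNITS, FAST_DURATION_S)

print(f"slow, observed units:  {slow_rate:.2f} per second")
print(f"fast, observed units:  {fast_observed_rate:.2f} per second")
print(f"fast, phoneme count:   {fast_phoneme_rate:.2f} per second")
```

With these assumed durations the observed-unit rate is the same for both renditions, even though the utterance is completed sooner, whereas the phoneme-based count suggests a clear increase. The result therefore depends on which kind of "sound" is being counted.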

Within-speaker variability

Speakers vary their speed of speaking according to contextual and physical factors. A typical speaking rate for English is 4 syllables per second, [5] but in different emotional or social contexts the rate may vary, one study reporting a range between 3.3 and 5.9 syllables per second. [6] Another study found significant differences in speaking rate between story-telling and taking part in an interview. [7]

Speech tempo may be regarded as one of the components of prosody. Possibly the most detailed analytical framework for the role of tempo in English prosody is that of David Crystal. [8] His system, which uses terms mostly borrowed from musical usage, allows for simple variation away from normal in tempo, where monosyllables may be pronounced as "clipped", "drawled" or "held" and polysyllabic utterances may be spoken at "allegro", "allegrissimo", "lento" and "lentissimo". Complex variation includes "accelerando" and "rallentando". Crystal claims that "tempo has probably the most highly discrete grammatical function of all prosodic parameters other than pitch ...". He cites from his corpus-based analysis instances of increased tempo in speakers' self-corrections of speech errors, and in the citing of embedded material in the form of titles and names, e.g. "I'm sorry, but we won't be able to start 'So you think you know what's happening' for a few moments" and "This is the 'I'll show you a picture and you tell me what it is' technique" (where the embedded material, marked here with single quotation marks, is spoken at a faster tempo).

Between-language differences

Subjective impressions of tempo differences between different languages and dialects are difficult to substantiate with scientific data. [9] Counting syllables per second will result in differences caused by the different syllable structures found in different languages; many languages have a predominantly CV (consonant+vowel) syllable structure while English syllables may begin with up to 3 consonants and end with up to 4. Consequently, it is likely that a Japanese speaker can produce more syllables in their language per second than an English speaker can in theirs. Counting sounds per second is also problematic for the reason mentioned above, i.e. that the researcher needs to be sure what objects it is that they are counting.
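The effect of syllable structure on a syllable count can be illustrated with a minimal sketch. The syllable shapes and the shared rate of 12 sounds per second are hypothetical values chosen for illustration; they are not measurements of Japanese, English or any other language.

```python
def syllables_per_second(sounds_per_second, syllable_shapes):
    """Convert a sounds-per-second rate into syllables per second,
    given syllable shapes written as strings of C (consonant) and V (vowel)."""
    if not syllable_shapes:
        raise ValueError("at least one syllable shape is required")
    mean_sounds_per_syllable = sum(len(s) for s in syllable_shapes) / len(syllable_shapes)
    return sounds_per_second / mean_sounds_per_syllable


# Hypothetical samples: one with only CV syllables, one with consonant clusters.
cv_sample = ["CV", "CV", "CV", "CV"]
cluster_sample = ["CVC", "CCVC", "CV", "CVC"]

SOUNDS_PER_SECOND = 12.0  # assumed to be identical for both speakers

cv_rate = syllables_per_second(SOUNDS_PER_SECOND, cv_sample)
cluster_rate = syllables_per_second(SOUNDS_PER_SECOND, cluster_sample)
print(f"CV-only sample:      {cv_rate:.1f} syl/s")
print(f"cluster-rich sample: {cluster_rate:.1f} syl/s")
```

Under these assumptions both speakers produce sounds at the same rate, yet the syllable count differs, which is why syllables per second cannot be compared directly across languages with different syllable structures.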

Howard Giles has studied the relationship between perceived tempo and perceived competence of speakers of different accents of English, and found a positive linear relationship between the two (i.e. people who speak faster are perceived as more competent). [10]

Osser and Peng counted sounds per second for Japanese and English and found no significant difference. [11] The study by Kowal et al., referred to above, comparing story-telling with speaking in an interview, looked at English, Finnish, French, German and Spanish. [12] They found no significant differences in rate between the languages, but highly significant differences between the speaking styles. Similarly, Barik found that differences in tempo between French and English were due to speaking style rather than to the language. [13] From the point of view of the perception of tempo differences between languages, Vaane used spoken Dutch, English, French, Spanish and Arabic produced at three different rates and found that untrained and phonetically trained listeners performed equally well at judging the rate of speaking for familiar and unfamiliar languages. [14]

In the absence of reliable evidence to support it, it seems that the widespread view that some languages are spoken more rapidly than others is an illusion. This illusion may well be related to other factors such as differences of rhythm and pausing. In another study, an analysis of speech rate and perception in radio bulletins, the average rate of bulletins varied from 168 (English, BBC) to 210 words per minute (Spanish, RNE). [15]

References

  1. Laver, John (1994). Principles of Phonetics. Cambridge. p. 542.
  2. Laver, John (1994). Principles of Phonetics. Cambridge. p. 158. ISBN 0-521-45655-X.
  3. Fonagy, I.; K. Magdics (1960). "Speed of utterance in phrases of different length". Language and Speech. 3 (4): 179–192. doi:10.1177/002383096000300401. S2CID 147689762.
  4. Roach, P. (1998). Some languages are spoken more quickly than others.
  5. Cruttenden, A. (2014). Gimson's Pronunciation of English. Routledge. p. 54.
  6. Arnfield, S.; Roach; Setter; Greasley; Horton (1995). "Emotional stress and speech tempo variability". Proceedings of the ESCA/NATO Workshop on Speech Under Stress: 13–15.
  7. Kowal, S.; Wiese and O'Connell (1983). "The use of time in storytelling". Language and Speech. 26 (4): 377–392. doi:10.1177/002383098302600405. S2CID 142712380.
  8. Crystal, David (1976). Prosodic Systems and Intonation in English. Cambridge. pp. 152–156.
  9. Roach, P. (1998). Some languages are spoken more quickly than others.
  10. Giles, Howard (1992). "Speech tempo". In W. Bright (ed.), Oxford International Encyclopedia of Linguistics. Oxford.
  11. Osser, H.; Peng, F. (1964). "A cross-cultural study of speech rate". Language and Speech. 7 (2): 120–125. doi:10.1177/002383096400700208. S2CID 147239348.
  12. "Steps to Getting Used to the Speech Rate of Spanish Language". 5 February 2018.
  13. Barik, H.C. (1977). "Cross-linguistic study of temporal characteristics of different types of speech material". Language and Speech. 20 (2): 116–126. doi:10.1177/002383097702000203. PMID 611347. S2CID 45001897.
  14. Vaane, E. (1982). "Subjective estimation of speech rate". Phonetica. 39 (2–3): 136–149. doi:10.1159/000261656. S2CID 143024954.
  15. Rodero, Emma. "A comparative analysis of speech rate and perception in radio bulletins".
