Speech tempo is a measure of the number of speech units of a given type produced within a given amount of time. Speech tempo is believed to vary within the speech of one person according to contextual and emotional factors, between speakers and also between different languages and dialects. However, there are many problems involved in investigating this variance scientifically.
While most people seem to believe that they can judge how quickly someone is speaking, it is generally said that subjective judgements and opinions cannot serve as scientific evidence for statements about speech tempo; John Laver has written that analyzing tempo can be "dangerously open to subjective bias ... listeners' judgements rapidly begin to lose objectivity when the utterance concerned comes either from an unfamiliar accent or ... from an unfamiliar language". [1] Scientific observation depends on accurate segmenting of recorded speech along the time course of an utterance, usually using one of the acoustic analysis software tools available on the internet such as Audacity or, specifically for speech research, Praat.
Measurements of speech tempo can be strongly affected by pauses and hesitations. For this reason, it is usual to distinguish between speech tempo including pauses and hesitations and speech tempo excluding them. The former is called speaking rate and the latter articulation rate. [2]
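The distinction can be made concrete with a small calculation. The following Python sketch is illustrative only: the function names and the sample figures (syllable count, durations, pauses) are assumptions made for the example and are not taken from any of the studies cited here.

```python
def speaking_rate(n_syllables, total_duration_s):
    """Syllables per second over the whole utterance, pauses included."""
    if total_duration_s <= 0:
        raise ValueError("total duration must be positive")
    return n_syllables / total_duration_s


def articulation_rate(n_syllables, total_duration_s, pause_durations_s):
    """Syllables per second over the time spent speaking, pauses excluded."""
    speaking_time = total_duration_s - sum(pause_durations_s)
    if speaking_time <= 0:
        raise ValueError("pauses must be shorter than the total duration")
    return n_syllables / speaking_time


# Hypothetical utterance: 40 syllables in 12 s, with pauses of 1.5 s and 0.5 s.
syllables = 40
duration_s = 12.0
pauses_s = [1.5, 0.5]

print("speaking rate:", round(speaking_rate(syllables, duration_s), 2))
print("articulation rate:", round(articulation_rate(syllables, duration_s, pauses_s), 2))
```

Because pauses are subtracted from the denominator, the articulation rate of an utterance can never be lower than its speaking rate.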
Various units of speech have been used as a basis for measurement. The traditional measure of speed in typing and Morse code transmission has been words per minute (wpm). However, in the study of speech the word is not well defined (being primarily a unit of grammar), and speech is not usually temporally stable over a period as long as a minute. Many studies have used the measure of syllables per second, but this is not completely reliable because, although the syllable as a phonological unit of a given language is well-defined, it is not always possible to get agreement on the phonetic syllable. For example, the English word 'particularly' in the form in which it occurs in dictionaries is, phonologically speaking, composed of five syllables /pə.tɪk.jə.lə.li/. Phonetic realizations of the word, however, may be heard as comprising five [pə.tɪk.jə.lə.li], four [pə.tɪk.jə.li], three [pə.tɪk.li] or even two syllables [ptɪk.li], and listeners are likely to have different opinions about the number of syllables heard.
An alternative measure that has been proposed is that of sounds per second. One study found rates varying from an average of 9.4 sounds per second for poetry reading to 13.83 per second for sports commentary. [3] The problem with this approach is that the researcher must be clear as to whether the "sounds" being counted are phonemes or physically observable phonetic units (sometimes called "phones"). As an example, the utterance 'Don't forget to record it' might in slow, careful speech be pronounced /dəʊnt fəget tə rɪkɔːd ɪt/, with 19 phonemes, each of which is phonetically realized. When the sentence is said at high speed it might be pronounced as [də̃ʊ̃ʔ fɡeʔtrɪkɔːd ɪt], with 16 units. If we are counting only units that can be observed and measured, it is clear that at faster speeds of utterance the number of sounds produced per second does not necessarily increase. [4]
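The arithmetic behind this point can be sketched in Python. The unit counts (19 and 16) come from the example above; the two durations are hypothetical values chosen purely for illustration, since no timings are given for this utterance.

```python
def units_per_second(n_units, duration_s):
    """Rate of speech units (phonemes or phones) per second."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_units / duration_s


PHONEMES = 19      # units in the slow, careful pronunciation (from the text)
FAST_PHONES = 16   # observable units in the fast pronunciation (from the text)

SLOW_DURATION_S = 2.0  # hypothetical duration of the slow rendition
FAST_DURATION_S = 1.7  # hypothetical duration of the fast rendition

slow_rate = units_per_second(PHONEMES, SLOW_DURATION_S)
fast_phoneme_rate = units_per_second(PHONEMES, FAST_DURATION_S)
fast_phone_rate = units_per_second(FAST_PHONES, FAST_DURATION_S)

print("slow speech:", round(slow_rate, 2))
print("fast speech, counting phonemes:", round(fast_phoneme_rate, 2))
print("fast speech, counting phones:", round(fast_phone_rate, 2))
```

With these assumed durations, counting observable phones gives a slightly lower rate for the fast rendition (about 9.4 per second) than for the slow one (9.5 per second), whereas counting underlying phonemes gives about 11.2 per second. The outcome depends entirely on which unit is counted.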
Speakers vary their speed of speaking according to contextual and physical factors. A typical speaking rate for English is 4 syllables per second, [5] but the rate may vary in different emotional or social contexts, with one study reporting a range between 3.3 and 5.9 syllables per second. [6] Another study found significant differences in speaking rate between story-telling and taking part in an interview. [7]
Speech tempo may be regarded as one of the components of prosody. Possibly the most detailed analytical framework for the role of tempo in English prosody is that of David Crystal. [8] His system, which uses terms mostly borrowed from musical usage, allows for simple variation away from normal tempo, where monosyllables may be pronounced as "clipped", "drawled" or "held" and polysyllabic utterances may be spoken at "allegro", "allegrissimo", "lento" and "lentissimo". Complex variation includes "accelerando" and "rallentando". Crystal claims that "tempo has probably the most highly discrete grammatical function of all prosodic parameters other than pitch ...". He cites from his corpus-based analysis instances of increased tempo in speakers' self-corrections of speech errors, and in the citing of embedded material in the form of titles and names, e.g. "I'm sorry, but we won't be able to start 'So you think you know what's happening' for a few moments" and "This is the 'I'll show you a picture and you tell me what it is' technique" (where the embedded material, marked here with single quotation marks, is spoken at a faster tempo).
Subjective impressions of tempo differences between different languages and dialects are difficult to substantiate with scientific data. [9] Counting syllables per second will result in differences caused by the different syllable structures found in different languages; many languages have a predominantly CV (consonant+vowel) syllable structure while English syllables may begin with up to 3 consonants and end with up to 4. Consequently, it is likely that a Japanese speaker can produce more syllables in their language per second than an English speaker can in theirs. Counting sounds per second is also problematic for the reason mentioned above, i.e. that the researcher needs to be sure what objects it is that they are counting.
Howard Giles has studied the relationship between perceived tempo and perceived competence of speakers of different accents of English, and found a positive linear relationship between the two (i.e. people who speak faster are perceived as more competent). [10]
Osser and Peng counted sounds per second for Japanese and English and found no significant difference. [11] The study by Kowal et al., referred to above, comparing story-telling with speaking in an interview, looked at English, Finnish, French, German and Spanish. They found no significant differences in rate between the languages, but highly significant differences between the speaking styles. Similarly, Barik found that differences in tempo between French and English were due to speaking style rather than to the language. [12] From the point of view of the perception of tempo differences between languages, Vaane used spoken Dutch, English, French, Spanish and Arabic produced at three different rates and found that untrained and phonetically trained listeners performed equally well at judging the rate of speaking for familiar and unfamiliar languages. [13]
In the absence of reliable evidence to support it, it seems that the widespread view that some languages are spoken more rapidly than others is an illusion. This illusion may well be related to other factors such as differences of rhythm and pausing. In another study, an analysis of speech rate and perception in radio bulletins, the average rate of bulletins varied from 168 words per minute (English, BBC) to 210 words per minute (Spanish, RNE). [14]