Speaker recognition

Speaker recognition is the identification of a person from characteristics of their voice. [1] It is used to answer the question "Who is speaking?" The term voice recognition [2] [3] [4] [5] [6] can refer to speaker recognition or speech recognition. Speaker verification (also called speaker authentication) contrasts with identification, and speaker recognition differs from speaker diarisation (recognizing when the same speaker is speaking).

Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific voices, or it can be used to authenticate or verify the identity of a speaker as part of a security process. As of 2019, speaker recognition had a history dating back some four decades. It uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy and learned behavioral patterns.

Verification versus identification

There are two major applications of speaker recognition technologies and methodologies. If the speaker claims to be of a certain identity and the voice is used to verify this claim, this is called verification or authentication. On the other hand, identification is the task of determining an unknown speaker's identity. In a sense, speaker verification is a 1:1 match where one speaker's voice is matched to a particular template whereas speaker identification is a 1:N match where the voice is compared against multiple templates.
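The 1:1 versus 1:N distinction can be illustrated with a minimal sketch. It assumes each voice has already been reduced to a short numeric feature vector and that templates are compared by cosine similarity. The function names, the toy vectors and the 0.8 threshold are illustrative assumptions, not part of any particular system.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(utterance, claimed_template, threshold=0.8):
    """1:1 match: accept the identity claim if the score reaches the threshold."""
    return cosine_similarity(utterance, claimed_template) >= threshold

def identify(utterance, templates):
    """1:N match: return the enrolled name whose template scores highest."""
    return max(templates, key=lambda name: cosine_similarity(utterance, templates[name]))

# Toy templates (assumed values, for illustration only).
templates = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.1, 0.8, 0.5],
}
sample = [0.85, 0.15, 0.35]

print(verify(sample, templates["alice"]))  # claim "I am alice" checked against one template
print(identify(sample, templates))         # best match across all templates
```

Verification touches one template, while identification has to score every enrolled template before it can answer.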

From a security perspective, identification is different from verification. Speaker verification is usually employed as a "gatekeeper" in order to provide access to a secure system. These systems operate with the users' knowledge and typically require their cooperation. Speaker identification systems can also be implemented covertly without the user's knowledge to identify talkers in a discussion, alert automated systems of speaker changes, check if a user is already enrolled in a system, etc.

In forensic applications, it is common to first perform a speaker identification process to create a list of "best matches" and then perform a series of verification processes to determine a conclusive match. Comparing the samples from the speaker against the list of best matches helps determine whether they come from the same person, based on the number of similarities or differences. The prosecution and defense use this as evidence to determine if the suspect is actually the offender. [7]
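The two-stage forensic workflow described above might be sketched as follows. This is a simplified illustration under stated assumptions: feature vectors are compared by Euclidean distance, and the names shortlist and confirm, the toy vectors and the 0.1 distance limit are all hypothetical. Real forensic practice weighs evidence far more carefully than a single threshold.

```python
import math

def shortlist(sample, templates, k=2):
    """Identification step: the k enrolled speakers closest to the sample."""
    ranked = sorted(templates, key=lambda name: math.dist(sample, templates[name]))
    return ranked[:k]

def confirm(sample, template, max_distance=0.1):
    """Verification step: a stricter 1:1 check on a single candidate."""
    return math.dist(sample, template) <= max_distance

# Toy reference templates and a questioned recording (assumed values).
templates = {
    "speaker_a": [0.2, 0.7, 0.1],
    "speaker_b": [0.3, 0.6, 0.2],
    "speaker_c": [0.9, 0.1, 0.8],
}
questioned = [0.22, 0.68, 0.12]

candidates = shortlist(questioned, templates, k=2)
matches = [name for name in candidates if confirm(questioned, templates[name])]
print(candidates)
print(matches)
```

The shortlist keeps the expensive or stricter verification step away from candidates that are clearly unrelated.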

Training

One of the earliest training technologies to be commercialized was implemented in Worlds of Wonder's 1987 Julie doll. At that point, speaker independence was still an intended breakthrough, and systems required a training period. A 1987 ad for the doll carried the tagline "Finally, the doll that understands you", despite the fact that it was described as a product "which children could train to respond to their voice." [8] The term voice recognition, even a decade later, referred to speaker independence. [9]

Variants of speaker recognition

Each speaker recognition system has two phases: enrollment and verification. During enrollment, the speaker's voice is recorded and typically a number of features are extracted to form a voice print, template, or model. In the verification phase, a speech sample or "utterance" is compared against a previously created voice print. For identification systems, the utterance is compared against multiple voice prints in order to determine the best match(es), while verification systems compare an utterance against a single voice print. Because verification needs only one comparison rather than one per enrolled voice print, it is faster than identification.
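The enrollment and verification phases can be sketched as a small class. This is a toy illustration: it assumes features have already been extracted as numeric vectors, builds a voice print by simple averaging, and scores by cosine similarity. The class name, method names and the 0.9 threshold are hypothetical.

```python
import math

class SpeakerSystem:
    """Toy voice-print store; the class and method names are hypothetical."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.voice_prints = {}  # speaker name -> averaged feature vector

    def enroll(self, name, feature_vectors):
        """Enrollment: average several recordings' features into one voice print."""
        count = len(feature_vectors)
        self.voice_prints[name] = [sum(column) / count for column in zip(*feature_vectors)]

    def score(self, name, utterance):
        """Cosine similarity between an utterance and a stored voice print."""
        voice_print = self.voice_prints[name]
        dot = sum(x * y for x, y in zip(utterance, voice_print))
        norm_u = math.sqrt(sum(x * x for x in utterance))
        norm_v = math.sqrt(sum(y * y for y in voice_print))
        return dot / (norm_u * norm_v)

    def verify(self, name, utterance):
        """Verification: compare against the single claimed voice print."""
        return self.score(name, utterance) >= self.threshold

system = SpeakerSystem()
system.enroll("carol", [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]])

print(system.verify("carol", [0.9, 0.1, 0.5]))  # utterance close to the voice print
print(system.verify("carol", [0.0, 1.0, 0.1]))  # utterance far from the voice print
```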

Speaker recognition systems fall into two categories: text-dependent and text-independent. [10] Text-dependent recognition requires the text to be the same for both enrollment and verification. [11] In a text-dependent system, prompts can either be common across all speakers (e.g. a common pass phrase) or unique. In addition, shared secrets (e.g. passwords and PINs) or knowledge-based information can be employed to create a multi-factor authentication scenario. Conversely, text-independent systems do not require the use of a specific text. They are most often used for speaker identification as they require very little, if any, cooperation by the speaker. In this case the text used during enrollment and testing can differ. In fact, the enrollment may happen without the user's knowledge, as is the case in many forensic applications. As text-independent technologies do not compare what was said at enrollment and verification, verification applications tend to also employ speech recognition to determine what the user is saying at the point of authentication. In text-independent systems both acoustics and speech analysis techniques are used. [12]
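A multi-factor check of the kind described above could combine a spoken shared secret with a voice score. The sketch below assumes that a separate speech recognizer has already produced recognized_text and that a separate speaker model has produced voice_score. The function name, parameters and threshold are hypothetical.

```python
import hmac

def authenticate(recognized_text, expected_phrase, voice_score, threshold=0.8):
    """Accept only if both the spoken phrase and the voice score pass."""
    # Normalise case and surrounding whitespace, then compare in constant time.
    spoken = recognized_text.strip().lower().encode("utf-8")
    expected = expected_phrase.strip().lower().encode("utf-8")
    phrase_ok = hmac.compare_digest(spoken, expected)
    voice_ok = voice_score >= threshold
    return phrase_ok and voice_ok

phrase = "my voice is my password"
print(authenticate("My voice is my password", phrase, voice_score=0.91))  # both factors pass
print(authenticate("open sesame", phrase, voice_score=0.91))              # wrong phrase
print(authenticate("my voice is my password", phrase, voice_score=0.40))  # wrong voice
```

Requiring both factors means that a replayed or guessed phrase alone, or a similar-sounding voice alone, is not sufficient.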

Technology

Speaker recognition is a pattern recognition problem. The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees. For comparing utterances against voice prints, more basic methods like cosine similarity are traditionally used for their simplicity and performance. Some systems also use "anti-speaker" techniques such as cohort models and world models. Spectral features are predominantly used in representing speaker characteristics. [13] Linear predictive coding (LPC) is a speech coding method used in speaker recognition and speech verification.
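One way to picture a cohort-based "anti-speaker" technique is to normalise the raw cosine score by how the same utterance scores against a set of other speakers. The sketch below is an assumption-laden illustration, similar in spirit to what the literature calls test normalisation. The function names and all vectors are made up for the example.

```python
import math
import statistics

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cohort_normalized_score(utterance, target, cohort):
    """Raw score expressed in standard deviations above the cohort's mean score."""
    raw = cosine_similarity(utterance, target)
    cohort_scores = [cosine_similarity(utterance, other) for other in cohort]
    mean = statistics.mean(cohort_scores)
    spread = statistics.stdev(cohort_scores)  # needs at least two cohort members
    return (raw - mean) / spread

target = [0.9, 0.1, 0.3]
cohort = [[0.1, 0.8, 0.5], [0.4, 0.4, 0.8], [0.2, 0.9, 0.1]]

genuine = cohort_normalized_score([0.85, 0.15, 0.35], target, cohort)
impostor = cohort_normalized_score([0.1, 0.8, 0.5], target, cohort)
print(genuine > 0, impostor < 0)
```

A positive normalised score means the utterance resembles the claimed speaker more than it resembles the cohort, which makes a single decision threshold less sensitive to recording conditions that raise or lower all scores together.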

Ambient noise levels can impede collection of both the initial and subsequent voice samples. Noise reduction algorithms can be employed to improve accuracy, but incorrect application can have the opposite effect. Performance degradation can result from changes in behavioural attributes of the voice and from enrollment using one telephone and verification on another telephone. Integration with two-factor authentication products is expected to increase. Voice changes due to ageing may impact system performance over time. Some systems adapt the speaker models after each successful verification to capture such long-term changes in the voice, though there is debate regarding the overall security impact imposed by automated adaptation.
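Model adaptation after a successful verification can be sketched as a moving average that nudges the stored voice print toward each accepted utterance. This is only one possible scheme. The function name and the adaptation rate of 0.1 are assumptions chosen for illustration.

```python
def adapt_voice_print(voice_print, utterance, verified, rate=0.1):
    """Blend an accepted utterance into the voice print; leave it unchanged otherwise."""
    if not verified:
        return list(voice_print)
    return [(1 - rate) * old + rate * new for old, new in zip(voice_print, utterance)]

stored = [1.0, 0.0]
updated = adapt_voice_print(stored, [0.0, 1.0], verified=True)
unchanged = adapt_voice_print(stored, [0.0, 1.0], verified=False)
print(updated)
print(unchanged)
```

A small rate lets the model follow gradual changes such as ageing. The same mechanism is also the source of the security concern noted above, since any utterance that is wrongly accepted shifts the model toward the impostor.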

Due to the introduction of legislation like the General Data Protection Regulation in the European Union and the California Consumer Privacy Act in the United States, there has been much discussion about the use of speaker recognition in the workplace. In September 2019 Irish speech recognition developer Soapbox Labs warned about the legal implications that may be involved. [14]

Applications

The first international patent was filed in 1983, coming from telecommunication research at CSELT [15] (Italy) by Michele Cavazza and Alberto Ciaramella, as a basis both for future telco services to end customers and for improving noise-reduction techniques across the network.

Between 1996 and 1998, speaker recognition technology was used at the Scobey–Coronach Border Crossing to enable enrolled local residents with nothing to declare to cross the Canada–United States border when the inspection stations were closed for the night. [16] The system was developed for the U.S. Immigration and Naturalization Service by Voice Strategies of Warren, Michigan.

In 2013 Barclays Wealth, the private banking division of Barclays, became the first financial services firm to deploy voice biometrics as the primary means of identifying customers to its call centers. The system used passive speaker recognition to verify the identity of telephone customers within 30 seconds of normal conversation. [17] It was developed by the voice recognition company Nuance, the company behind Apple's Siri technology, which in 2011 had acquired Loquendo, CSELT's speech technology spin-off. 93% of customers rated the system at least "9 out of 10" for speed, ease of use and security. [18]

Speaker recognition may also be used in criminal investigations, such as those of the 2014 executions of, amongst others, James Foley and Steven Sotloff. [19]

In February 2016 UK high-street bank HSBC and its internet-based retail bank First Direct announced that they would offer 15 million customers biometric banking software to access online and phone accounts using their fingerprint or voice. [20]

In 2023 Vice News and The Guardian separately demonstrated they could defeat standard financial speaker-authentication systems using AI-generated voices created from about five minutes of the target's voice samples. [21] [22]

Notes

  1. Poddar, Arnab; Sahidullah, Md; Saha, Goutam (November 27, 2017). "Speaker verification with short utterances: a review of challenges, trends and opportunities". IET Biometrics. 7 (2). Institution of Engineering and Technology (IET): 91–101. doi:10.1049/iet-bmt.2017.0065. ISSN 2047-4938.
  2. Lass, Norman J. (1974). Experimental Phonetics. MSS Information Corporation. pp. 251–258. ISBN 978-0-8422-5149-5.
  3. Van Lancker, Diana; Kreiman, Jody; Emmorey, Karen (1985). "Familiar voice recognition: patterns and parameters Part I: Recognition of backward voices". Journal of Phonetics. 13 (1). Elsevier BV: 19–38. doi:10.1016/s0095-4470(19)30723-5. ISSN 0095-4470.
  4. "VOICE RECOGNITION (noun) definition and synonyms". macmillandictionary.com. January 23, 2010. Archived from the original on March 27, 2023. Retrieved October 13, 2023.
  5. "What is voice recognition? definition and meaning". businessdictionary.com. October 6, 2008. Archived from the original on December 3, 2011.
  6. "The Mailbag LG #114". Linux Gazette. March 28, 2005.
  7. Rose, Phil; Osanai, Takashi; Kinoshita, Yuko (August 6, 2003). "Strength of forensic speaker identification evidence: multispeaker formant- and cepstrum-based segmental discrimination with a Bayesian likelihood ratio as threshold". International Journal of Speech, Language and the Law. 10 (2). Equinox Publishing: 179–202. doi:10.1558/sll.2003.10.2.179. ISSN 1748-8893.
  8. Pinola, Melanie (November 2, 2011). "Speech Recognition Through the Decades: How We Ended Up With Siri". PCWorld.
  9. Rosen, Cheryl (March 3, 1997). "Voice Recognition To Ease Travel Bookings". Business Travel News. The earliest applications of speech recognition software were dictation ... Four months ago, IBM introduced a "continual dictation product" designed to ... debuted at the National Business Travel Association trade show in 1994.
  10. "Speaker Verification: Text-Dependent vs. Text-Independent". Microsoft Research. June 19, 2017. text-dependent and text-independent speaker .. both equal error rate and detection ..
  11. Hébert, Matthieu (2008). "Text-Dependent Speaker Recognition". Springer Handbook of Speech Processing. Springer Handbooks. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 743–762. doi:10.1007/978-3-540-49127-9_37. ISBN 978-3-540-49125-5. ISSN 2522-8692. task .. verification or identification
  12. Myers, Lisa (July 25, 2004). "An Exploration of Voice Biometrics". SANS Institute.
  13. Sahidullah, Md; Kinnunen, Tomi (2016). "Local spectral variability features for speaker verification" (PDF). Digital Signal Processing. 50. Elsevier BV: 1–11. doi:10.1016/j.dsp.2015.10.011. ISSN 1051-2004.
  14. "Speech recognition expert raises concerns around voice technology in the workplace". Independent.ie. September 29, 2019. Retrieved September 30, 2019.
  15. US4752958 A, Michele Cavazza, Alberto Ciaramella, "Device for speaker's verification" https://patents.google.com/patent/US4752958/en
  16. Meyer, Barb (June 12, 1996). "Automated Border Crossing". Television news report. Meyer Television News.
  17. International Banking (December 27, 2013). "Voice Biometric Technology in Banking | Barclays". Wealth.barclays.com. Retrieved February 21, 2016.
  18. Matt Warman (May 8, 2013). "Say goodbye to the pin: voice recognition takes over at Barclays Wealth". Retrieved June 5, 2013.
  19. Ewen MacAskill. "Did 'Jihadi John' kill Steven Sotloff? | Media". The Guardian. Retrieved February 21, 2016.
  20. Julia Kollewe (February 19, 2016). "HSBC rolls out voice and touch ID security for bank customers | Business". The Guardian. Retrieved February 21, 2016.
  21. "How I Broke into a Bank Account with an AI-Generated Voice". February 23, 2023.
  22. Evershed, Nick; Taylor, Josh (March 16, 2023). "AI can fool voice recognition used to verify identity by Centrelink and Australian tax office". The Guardian. Retrieved June 16, 2023.
