Speech analytics

Speech analytics is the process of analyzing recorded calls to gather customer information to improve communication and future interaction. The process is primarily used by customer contact centers to extract information buried in client interactions with an enterprise. [1] Although speech analytics includes elements of automatic speech recognition, it is known for analyzing the topic being discussed, which is weighed against the emotional character of the speech and the amount and locations of speech versus non-speech during the interaction. Speech analytics in contact centers can be used to mine recorded customer interactions to surface the intelligence essential for building effective cost containment and customer service strategies. The technology can pinpoint cost drivers, perform trend analysis, identify strengths and weaknesses in processes and products, and help organizations understand how the marketplace perceives their offerings. [2]

Definition

Speech analytics provides a complete analysis of recorded phone conversations between a company and its customers. [3] It delivers advanced functionality and valuable intelligence from customer calls. This information can be used to discover information relating to strategy, products, processes, operational issues and contact center agent performance. [4] In addition, speech analytics can automatically identify areas in which contact center agents may need additional training or coaching, [5] and can automatically monitor the customer service provided on calls. [6]

The process can isolate the words and phrases used most frequently within a given time period, as well as indicate whether usage is trending up or down. This information is useful for supervisors, analysts, and others in an organization to spot changes in consumer behavior and take action to reduce call volumes and increase customer satisfaction. It allows insight into a customer's thought process, which in turn creates an opportunity for companies to make adjustments. [7]
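As a minimal illustration of this kind of frequency trending (the function and sample transcripts below are invented, not tied to any particular product), transcripts can be bucketed by time period and each term's counts compared across periods:

```python
def term_trends(calls_by_period, terms):
    """Count occurrences of each term per period and report its trend.

    calls_by_period: list of lists of transcript strings, oldest period first.
    terms: lower-case words or phrases to track.
    Returns {term: (counts_per_period, "up" | "down" | "flat")}.
    """
    trends = {}
    for term in terms:
        counts = [sum(t.lower().count(term) for t in period)
                  for period in calls_by_period]
        if counts[-1] > counts[0]:
            direction = "up"
        elif counts[-1] < counts[0]:
            direction = "down"
        else:
            direction = "flat"
        trends[term] = (counts, direction)
    return trends
```

A supervisor could feed this one bucket of transcripts per week and watch, for example, mentions of "billing" rise from one week to the next.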

Usability

Speech analytics applications can spot spoken keywords or phrases, either as real-time alerts on live audio or as a post-processing step on recorded speech. This technique is also known as audio mining. Other uses include categorization of speech in the contact center environment to identify calls from dissatisfied customers. [8]
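A post-processing keyword spotter of this kind can be sketched in a few lines once the audio has been transcribed; the categories and phrase lists below are hypothetical examples, not any product's actual rules:

```python
def categorize_call(transcript, keyword_sets):
    """Assign categories to a call transcript by simple keyword spotting.

    keyword_sets: {category: [phrases]}. A category applies if any of its
    phrases occurs in the lower-cased transcript.
    """
    text = transcript.lower()
    return {cat for cat, phrases in keyword_sets.items()
            if any(p in text for p in phrases)}

# Hypothetical keyword lists for illustration only.
RULES = {
    "dissatisfied": ["cancel my account", "speak to a manager", "unacceptable"],
    "billing": ["invoice", "overcharged", "refund"],
}
```

In practice the phrase lists are tuned by analysts, and matching is done against recognizer output rather than clean text.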

Measures such as precision and recall, commonly used in the field of information retrieval, are typical ways of quantifying the response of a speech analytics search system. [9] Precision measures the proportion of search results that are relevant to the query. Recall measures the proportion of the total number of relevant items that were returned by the search results. Where a standardised test set has been used, these measures allow the search performance of different speech analytics systems to be compared directly.
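Both measures follow directly from a returned result list and the set of items known to be relevant; the call identifiers in the test below are invented:

```python
def precision_recall(returned, relevant):
    """Precision and recall of a search result list against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```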

Making a meaningful comparison of the accuracy of different speech analytics systems can be difficult. The output of LVCSR systems can be scored against reference word-level transcriptions to produce a value for the word error rate (WER), but because phonetic systems use phones as the basic recognition unit, rather than words, comparisons using this measure cannot be made. When speech analytics systems are used to search for spoken words or phrases, what matters to the user is the accuracy of the search results that are returned. Because the impact of individual recognition errors on these search results can vary greatly, measures such as word error rate are not always helpful in determining overall search accuracy from the user perspective.
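The WER mentioned above is the word-level Levenshtein (edit) distance between the reference and hypothesis transcriptions, divided by the length of the reference. A straightforward dynamic-programming sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as the word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note how a single deleted word changes the WER but may not change search results at all, which is why WER alone can misrepresent search accuracy.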

According to the US Government Accountability Office, [10] “data reliability refers to the accuracy and completeness of computer-processed data, given the uses they are intended for.” In speech recognition and analytics, “completeness” is measured by the detection rate; usually, as accuracy goes up, the detection rate goes down. [11]

Technology

Some speech analytics vendors use the recognition "engine" of a third party, while others develop proprietary engines. The technology mainly uses two approaches. The phonetic approach is the fastest for processing, mostly because the size of the grammar is very small, with the phoneme as the basic recognition unit. Most languages have only a few tens of unique phonemes, and the output of this recognition is a stream of phonemes, which can then be searched. Large-vocabulary continuous speech recognition (LVCSR, more commonly known as speech-to-text, full transcription or ASR, automatic speech recognition) uses a set of words (bi-grams, tri-grams etc.) as the basic unit. This approach requires a vocabulary of hundreds of thousands of words to match the audio against. It can surface new business issues, its queries are much faster, and its accuracy is higher than that of the phonetic approach. [12]
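The practical difference between the two index types can be sketched as follows; the toy pronunciation lexicon and ARPAbet-style phoneme stream are assumptions for illustration, not any vendor's format (real systems use a full pronunciation dictionary such as CMUdict and align phonemes rather than substring-matching):

```python
# Toy pronunciation lexicon, assumed for illustration only.
LEXICON = {"refund": "R IY F AH N D", "please": "P L IY Z"}

def phonetic_search(phoneme_stream, query_word):
    """Phonetic approach: convert the query to phonemes and search the
    recognizer's phoneme stream directly."""
    return LEXICON[query_word] in phoneme_stream

def lvcsr_search(transcript_words, query_word):
    """LVCSR approach: search the word-level transcript produced up front."""
    return query_word in transcript_words
```

The phonetic index needs no vocabulary beyond the phoneme set, so out-of-vocabulary terms remain findable; the LVCSR index pays a larger transcription cost once, then answers word queries quickly.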

Extended speech emotion recognition and prediction is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It has been compared with two other sets of classifiers: one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of two basic classifiers, C5.0 and a neural network. The proposed variant achieves better performance than the other two sets. [13]
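The paper's exact classifiers are not reproduced here, but the idea of combining several basic classifiers can be illustrated with a minimal majority-vote combiner over their per-utterance predictions (the classifier names and emotion labels below are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier emotion labels for one utterance.

    predictions: e.g. {"kNN": "angry", "C4.5": "neutral", "SVM": "angry"}.
    Returns the label predicted by the most classifiers; ties are broken
    by the order in which labels were first seen.
    """
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]
```

Ensembles of this kind help because the member classifiers tend to make uncorrelated errors, so the majority is right more often than any single member.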

Growth

Market research indicates that speech analytics is projected to become a billion-dollar industry by 2020, with North America holding the largest market share. [14] The growth rate is attributed to rising requirements for compliance and risk management, as well as increased industry competition through market intelligence. [15] The telecommunications, IT and outsourcing segments are considered to hold the largest market share, with further growth expected from the travel and hospitality segments. [14]

See also

Speech recognition
Accuracy and precision
Optical character recognition
Binary classification
Affective computing
Text mining
Dialer
Statistical classification
Video search engine
Word error rate
Speech perception
Noisy text analytics
Customer intelligence
Audio mining
Precision and recall
Fluency
Time delay neural network
Legal information retrieval
Outline of natural-language processing
Outline of machine learning

References

  1. Coreen Bailor (August 2006). "The Why Factor in Speech Analytics". Destination CRM (Destination: Customer Relationship Management). pp. 32–33. Retrieved 2013-10-30.
  2. "Speech analytics: Why the big data source isn't music to your competitors' ears". Tech Republic. 8 January 2016. Retrieved 30 September 2016.
  3. "Top five benefits of speech analytics for the call center". TechTarget.
  4. "Speech & Text Analytics". Genesys.
  5. "Real Time Voice Analytics". Xdroid.
  6. "Do Speech Analytics Tools Change Agent Behavior?". ICMI.
  7. "Reverse a Pattern of Poor Sales With Speech Analytics". Entrepreneur.
  8. "The Age of Speech Analytics Is Close at Hand". Destination CRM. Retrieved 30 September 2016.
  9. C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Chapter 8.
  10. "Assessing the Reliability of Computer-Processed Data" (PDF). United States General Accounting Office.
  11. "What Does Speech Analytics Software Actually Do? - KnowledgeSpace". Archived from the original on 2018-01-23.
  12. "The Right Technology for your Speech Analytics Project" (PDF). CallMiner. Retrieved 30 September 2016.
  13. S.E. Khoruzhnikov; et al. (2014). "Extended speech emotion recognition and prediction". Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 14 (6): 137.
  14. "Speech Analytics Market Worth 1.60 Billion USD by 2020". PR Newswire.
  15. "Speech Analytics Industry Market Share, Size, Growth & Forecast 2025". MENAFN.