Keyword spotting

Keyword spotting (or more simply, word spotting) is a problem that was historically first defined in the context of speech processing. [1] [2] In speech processing, keyword spotting deals with the identification of keywords in utterances.

Keyword spotting is also defined as a separate, but related, problem in the context of document image processing. [1] In document image processing, keyword spotting is the problem of finding all instances of a query word that exist in a scanned document image, without fully recognizing it.

In speech processing

The first works in keyword spotting appeared in the late 1980s. [2]

A special case of keyword spotting is wake word (also called hot word) detection, used by personal digital assistants such as Alexa or Siri: the dormant device "wakes up" when its name is spoken.
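
As an illustration of how such a detector can run continuously on a device, the hypothetical sketch below scores incoming audio frames with a stand-in classifier, smooths the scores, and triggers once a threshold is crossed. The frame source and the score_frame function are placeholders for illustration, not the implementation of any particular assistant.

    # Hypothetical wake-word detection loop (illustrative only).
    # A real system would read microphone frames and use a trained
    # small-footprint classifier; here both are stubbed out.
    import collections
    import numpy as np

    def score_frame(frame: np.ndarray) -> float:
        """Stand-in for a trained wake-word classifier (returns a score in [0, 1])."""
        return float(np.clip(frame.mean() + 0.5, 0.0, 1.0))

    def detect(frames, window=10, threshold=0.8):
        """Smooth per-frame scores and fire once the windowed average exceeds a threshold."""
        recent = collections.deque(maxlen=window)
        for i, frame in enumerate(frames):
            recent.append(score_frame(frame))
            if len(recent) == window and np.mean(recent) > threshold:
                return i  # frame index at which the device would "wake up"
        return None

    # Simulated audio stream: low-energy noise, then a louder "wake word" segment.
    rng = np.random.default_rng(0)
    stream = [rng.normal(0.0, 0.05, 400) for _ in range(50)] + \
             [rng.normal(0.4, 0.05, 400) for _ in range(20)]
    print("woke at frame:", detect(stream))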

In the United States, the National Security Agency has made use of keyword spotting since at least 2006. [3] This technology allows analysts to search through large volumes of recorded conversations and isolate mentions of suspicious keywords. Recordings can be indexed and analysts can run queries over the database to find conversations of interest. IARPA funded research into keyword spotting in the Babel program.

Some algorithms used for this task include convolutional neural networks [4] and transformer-based models. [5]
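
Below is a minimal, illustrative sketch of the convolutional approach, assuming a fixed-size log-mel spectrogram input and a small closed set of keywords; the layer sizes are placeholders and do not reproduce the architecture of the cited papers.

    # Illustrative small-footprint CNN for keyword spotting (not the exact
    # architecture of any cited paper): input is a log-mel spectrogram
    # (1 channel x 40 mel bands x 101 frames), output is a score per keyword.
    import torch
    import torch.nn as nn

    class KeywordSpottingCNN(nn.Module):
        def __init__(self, n_keywords: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # pool over time and frequency
            )
            self.classifier = nn.Linear(64, n_keywords)

        def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
            x = self.features(spectrogram)
            return self.classifier(x.flatten(1))

    model = KeywordSpottingCNN(n_keywords=10)
    dummy_batch = torch.randn(8, 1, 40, 101)     # batch of 8 spectrograms
    print(model(dummy_batch).shape)              # torch.Size([8, 10])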

In document image processing

Keyword spotting in document image processing can be seen as an instance of the more generic problem of content-based image retrieval (CBIR). Given a query, the goal is to retrieve the most relevant instances of words in a collection of scanned documents. [1] The query may be a text string (query-by-string keyword spotting) or a word image (query-by-example keyword spotting).
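
A minimal sketch of the query-by-example setting follows, under the assumption that each segmented word image has already been mapped to a fixed-length descriptor (for example a PHOC or a neural embedding); the embed function below is a trivial stand-in for such a descriptor, and retrieval is a nearest-neighbour ranking by cosine similarity.

    # Illustrative query-by-example word spotting by nearest-neighbour search
    # over fixed-length word-image descriptors. embed() is a trivial stand-in
    # for a real descriptor such as a PHOC or neural embedding.
    import numpy as np

    def embed(word_image: np.ndarray, dim: int = 64) -> np.ndarray:
        """Stand-in embedding: subsample the pixels and L2-normalise."""
        flat = word_image.ravel()
        idx = np.linspace(0, flat.size - 1, dim).astype(int)
        v = flat[idx].astype(float)
        return v / (np.linalg.norm(v) + 1e-9)

    def rank_by_similarity(query_image, candidate_images):
        """Return candidate indices sorted by cosine similarity to the query."""
        q = embed(query_image)
        sims = [float(q @ embed(c)) for c in candidate_images]
        return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

    rng = np.random.default_rng(1)
    candidates = [rng.random((32, 100)) for _ in range(5)]   # segmented word images
    query = candidates[3] + rng.normal(0, 0.01, (32, 100))   # noisy copy of one word
    print(rank_by_similarity(query, candidates))             # 3 should rank first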

Related Research Articles

Natural language processing

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers, with searchability as a main benefit. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

A language model is a probability distribution over sequences of words. Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences, language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers.
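
As a concrete illustration of the Markov assumption mentioned above, an n-gram language model approximates the probability of a word sequence by conditioning each word only on the previous n-1 words:

    P(w_1, \dots, w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
                       \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})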

Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation, and has more recently been superseded by neural machine translation in many applications.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Time delay neural network

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

Deep learning

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

The following outline is provided as an overview of and topical guide to natural-language processing.

Feature learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Convolutional neural network

In deep learning, a convolutional neural network (CNN) is a class of artificial neural network most commonly applied to analyze visual imagery. CNNs use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. They are specifically designed to process pixel data and are used in image recognition and processing. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.
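
As a small, framework-free illustration of that operation, the sketch below slides a 3x3 kernel over a 2D array to produce a feature map; deep-learning libraries implement the same idea with learned kernels, bias terms, and many channels.

    # Minimal illustration of the convolution (cross-correlation) used in CNN
    # layers: a 3x3 kernel slides over a 2D input, producing a feature map.
    import numpy as np

    def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
    print(conv2d(image, edge_kernel))                # 3x3 feature map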

Transformer (machine learning model)

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).

Bidirectional Encoder Representations from Transformers (BERT) is a family of masked-language models introduced in 2018 by researchers at Google. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in Natural Language Processing (NLP) experiments counting over 150 research publications analyzing and improving the model."

Emotion recognition in conversation (ERC) is a sub-field of emotion recognition that focuses on mining human emotions from conversations or dialogues between two or more interlocutors. The datasets in this field are usually derived from social platforms that provide a large number of freely available, often multimodal, samples. Self- and inter-personal influences play a critical role in identifying some basic emotions, such as fear, anger, joy and surprise. The more fine-grained the emotion labels are, the harder it is to detect the correct emotion. ERC poses a number of challenges, such as conversational-context modeling, speaker-state modeling, the presence of sarcasm in conversation, and emotion shift across consecutive utterances of the same interlocutor.

Attention (machine learning)

In artificial neural networks, attention is a technique that is meant to mimic cognitive attention. The effect enhances some parts of the input data while diminishing other parts — the motivation being that the network should devote more focus to the important parts of the data, even though they may be small. Learning which part of the data is more important than another depends on the context, and this is trained by gradient descent.
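
A widely used concrete form of this idea is scaled dot-product attention; the sketch below is a minimal, illustrative version (single head, no masking or learned projections).

    # Minimal sketch of scaled dot-product attention, one common concrete form
    # of the mechanism described above (illustrative, single head, no masking).
    import numpy as np

    def softmax(x: np.ndarray) -> np.ndarray:
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        """Weight each value vector by how well its key matches the query."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # similarity of queries to keys
        weights = softmax(scores)         # higher weight = more "focus"
        return weights @ V

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
    print(attention(Q, K, V).shape)       # (4, 8): one output vector per query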

A Vision Transformer (ViT) is a transformer that is targeted at vision processing tasks such as image recognition.

References

  1. Giotis, A. P.; Sfikas, G.; Gatos, B.; Nikou, C. (2017). "A survey of document image word spotting techniques". Pattern Recognition. 68: 310–332. Bibcode:2017PatRe..68..310G. doi:10.1016/j.patcog.2017.02.023.
  2. Rohlicek, J.; Russell, W.; Roukos, S.; Gish, H. (1989). "Continuous hidden Markov modeling for speaker-independent word spotting". Proceedings of the 14th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1: 627–630.
  3. Froomkin, Dan. "The Computers Are Listening". The Intercept. Retrieved 20 June 2015.
  4. Sainath, Tara N.; Parada, Carolina (2015). "Convolutional neural networks for small-footprint keyword spotting". Sixteenth Annual Conference of the International Speech Communication Association (Interspeech 2015). arXiv:1711.00333.
  5. Wei, Bo; Yang, Meirong; Zhang, Tao; Tang, Xiao; Huang, Xing; Kim, Kyuhong; Lee, Jaeyun; Cho, Kiho; Park, Sung-Un (2021). "End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention" (PDF). Interspeech 2021.