Visual words, as used in image retrieval systems,[1] refer to small parts of an image that carry some kind of information related to its features (such as color, shape, or texture) or to changes in its pixels, such as the output of filtering or of low-level feature descriptors (e.g., SIFT or SURF).
The approaches of text retrieval systems (or information retrieval, IR, systems[1]), which have been developed over more than 40 years, are based on keywords or terms. The advantage of these approaches is that they are effective and fast: text-search engines are able to quickly find relevant documents among hundreds of millions (for example, by using a vector space model[2]). Text retrieval systems have thus had huge successes, whereas standard image retrieval systems (such as simple search by color or shape) suffer from many limitations. Consequently, researchers have tried to adapt text retrieval techniques to image retrieval. This can be accomplished by adopting a new way of seeing images, understanding them as textual documents: the visual words approach.[3]
Consider that the pixels of an image, the smallest parts of a digital image that cannot be divided further, are like the letters of an alphabet. A set of pixels in an image (a patch, or array of pixels) is then a word. Each word can be processed by a morphological system to extract the term it relates to; as in any natural language, several words can share the same meaning and thus refer to the same term (carry the same information). With this view, researchers can take advantage of text retrieval techniques and apply them to image retrieval systems.
This principle can then be applied to images to find what their words and terms are. The idea is to try to understand an image as a collection of "visual words".
A visual word is a small patch of the image that carries some information in some feature space, such as color changes or texture changes.
In general, visual words (VWs) exist in a feature space of continuous values, implying a huge number of words and therefore a huge language. Since the aim is to apply text retrieval techniques, which depend on natural languages with a limited number of terms and words, the number of visual words must be reduced.
A number of solutions exist for this problem, such as dividing the feature space into ranges that each share common characteristics (and can thus be considered the same word). This solution, however, raises many issues, such as the choice of division strategy and of range size in the feature space. Another solution proposed by researchers is to use a clustering mechanism that classifies and merges words carrying common information into a finite number of terms.
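The clustering step can be sketched with a plain k-means loop over descriptor vectors. This is only a minimal illustration, assuming the local descriptors have already been extracted (e.g., SIFT vectors); the function name and the naive seeding are choices made here, and real systems use optimized implementations, better seeding (such as k-means++), and much larger vocabularies.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20):
    """Cluster local feature descriptors into k visual terms with a plain
    k-means loop; the returned cluster centers form the visual vocabulary."""
    centers = descriptors[:k].astype(float).copy()  # naive seed: first k points
    for _ in range(iters):
        # assign every descriptor to its nearest center
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers
```

Each returned center stands for one visual term: all descriptors that fall nearest to it are treated as occurrences of that term.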
A visual term is the result of clustering in the feature space (a cluster center). More than one patch can carry nearly the same information in feature space, so such patches can all be considered instances of the same term.
Just as a term in text (an infinitive verb, a noun, an article) refers to many common words sharing the same characteristics, a visual term (a cluster center) refers to all the common visual words that share the same information in a feature space.
Lastly, if all images refer to the same set of visual terms, then all images speak the same language (a visual language).
This visual language is a set of visual words and visual terms. The visual terms alone form the "visual vocabulary", which serves as the reference that the retrieval system depends on for retrieving images.
All images will be represented with this visual language as a collection of visual words, or bag of visual words.
A bag of visual words is a collection of visual words that together give information on the meaning of part or all of an image.
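The bag-of-visual-words representation can be sketched as a histogram: each local descriptor of an image is mapped to its nearest visual term, and the counts over the vocabulary describe the image. The function name is illustrative, and a fixed vocabulary of cluster centers is assumed to exist already.

```python
import numpy as np

def bag_of_visual_words(descriptors, vocabulary):
    """Map each local descriptor of one image to its nearest visual term
    and count occurrences, giving the image's bag-of-visual-words histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()  # normalize so images with different patch counts compare
```

The resulting vector plays the same role as a term-frequency vector for a text document, so word order (patch position) is discarded and only term frequencies remain.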
Based on this kind of image representation, text retrieval techniques can be used to design an image retrieval system. However, since all text retrieval systems depend on terms, the system must first convert the user's query image into a set of visual terms, which it then compares with the visual terms of all images in the database.
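The comparison step can be sketched as ranking database images by cosine similarity between visual-term histograms, as a text engine would with term vectors. This is one simple choice of similarity measure among several; the function name is illustrative.

```python
import numpy as np

def rank_images(query_hist, database_hists):
    """Rank database images by cosine similarity between the query's
    visual-term histogram and each stored histogram (best match first)."""
    q = query_hist / np.linalg.norm(query_hist)
    sims = [float(q @ (h / np.linalg.norm(h))) for h in database_hists]
    order = np.argsort(sims)[::-1]  # indices of database images, best first
    return order, sims
```

In practice, term weighting such as tf-idf and inverted-index lookups, borrowed directly from text retrieval, are applied on top of this basic scheme to speed up and sharpen the ranking.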