Multimedia information retrieval

Last updated January 18, 2025

Multimedia information retrieval (MMIR or MIR) is a research discipline of computer science that aims at extracting semantic information from multimedia data sources.^[1] Data sources include directly perceivable media such as audio, image and video; indirectly perceivable sources such as text, semantic descriptions^[2] and biosignals; and non-perceivable sources such as bioinformation and stock prices. The methodology of MMIR can be organized in three groups: feature extraction, merging and filtering, and categorization.

Feature extraction methods

Feature extraction is motivated by the sheer size of multimedia objects as well as their redundancy and, possibly, noisiness.^[1]^: 2 Generally, two possible goals can be achieved by feature extraction:

Summarization of media content. In the audio domain, methods for summarization include, for example, mel-frequency cepstral coefficients, zero-crossing rate and short-time energy. In the visual domain, color histograms^[3] such as the MPEG-7 Scalable Color Descriptor can be used for summarization.
Detection of patterns by auto-correlation and/or cross-correlation. Patterns are recurring media chunks that can be detected either by comparing chunks over the media dimensions (time, space, etc.) or by comparing media chunks to templates (e.g. face templates, phrases). Typical methods include linear predictive coding in the audio/biosignal domain,^[4] texture description in the visual domain and n-grams in text information retrieval.
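
The two goals above can be illustrated with a minimal Python sketch. It computes the zero-crossing rate and short-time energy of one audio frame (summarization) and a normalized autocorrelation (pattern detection). The function names and the toy square-wave signal are assumptions made for illustration; they are not taken from any particular MMIR toolkit.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def autocorrelation(signal, lag):
    """Normalized autocorrelation at the given lag (1.0 at lag 0)."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    denom = sum(x * x for x in centered)
    if denom == 0:
        return 0.0  # constant signal: no structure to correlate
    num = sum(
        centered[i] * centered[i + lag] for i in range(len(centered) - lag)
    )
    return num / denom


# Hypothetical toy input: a square wave with a period of 4 samples.
signal = [1.0, 1.0, -1.0, -1.0] * 8

zcr = zero_crossing_rate(signal)
energy = short_time_energy(signal)
# A recurring chunk shows up as a high autocorrelation at its period.
period_score = autocorrelation(signal, 4)
half_period_score = autocorrelation(signal, 2)
```

In this toy case the autocorrelation is strongly positive at the period (lag 4) and strongly negative at half the period (lag 2), which is how a recurring chunk reveals itself along the time dimension.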

Merging and filtering methods

Multimedia information retrieval implies that multiple channels are employed for the understanding of media content.^[5] Each of these channels is described by media-specific feature transformations. The resulting descriptions have to be merged into one description per media object. Merging can be performed by simple concatenation if the descriptions are of fixed size. Variable-sized descriptions, as they frequently occur in motion description, have to be normalized to a fixed length first.
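
As a rough sketch of this merging step, the following Python code concatenates fixed-size descriptions and first brings variable-sized ones to a fixed length. Linear interpolation is only one possible normalization and is an assumption here, as are the function names and sample values.

```python
def resample(values, target_len):
    """Linearly interpolate a sequence to exactly target_len samples."""
    if not values:
        raise ValueError("cannot resample an empty description")
    if len(values) == 1 or target_len == 1:
        # Degenerate case: repeat the first value.
        return [float(values[0])] * target_len
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def merge_descriptions(fixed_parts, variable_parts, target_len):
    """Concatenate fixed-size parts and length-normalized variable parts."""
    merged = []
    for part in fixed_parts:
        merged.extend(part)
    for part in variable_parts:
        merged.extend(resample(part, target_len))
    return merged


# Hypothetical channels: one fixed-size description, two motion
# trajectories of different lengths.
merged = merge_descriptions(
    fixed_parts=[[0.2, 0.8]],
    variable_parts=[[0, 10], [1, 2, 3, 4, 5]],
    target_len=3,
)
```

Every media object processed this way yields a vector of the same length (here 2 + 3 + 3 = 8 values), which is what the categorization methods discussed later typically expect.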

Frequently used methods for description filtering include factor analysis (e.g. by PCA), singular value decomposition (e.g. as latent semantic indexing in text retrieval) and the extraction and testing of statistical moments. Advanced concepts such as the Kalman filter are used for merging of descriptions.
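
Of the filtering methods named above, the extraction of statistical moments is the simplest to sketch. The Python example below reduces a description of arbitrary length to three values: mean, variance and skewness. The function name and the choice of population (not sample) statistics are assumptions for illustration.

```python
import math


def moments(values):
    """Return (mean, variance, skewness) using population statistics."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    if variance == 0:
        return mean, 0.0, 0.0  # constant input has no skew
    std = math.sqrt(variance)
    skewness = sum(((x - mean) / std) ** 3 for x in values) / n
    return mean, variance, skewness


# Hypothetical descriptions: one symmetric, one with a long right tail.
symmetric = moments([1, 2, 3, 4, 5])
right_tailed = moments([1, 1, 1, 1, 10])
```

A symmetric description has a skewness close to zero, while a long right tail produces a positive skewness, so a few moments can already separate descriptions with different distributions.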

Categorization methods

Generally, all forms of machine learning can be employed for the categorization of multimedia descriptions,^[1]^: 125 though some methods are more frequently used in one area than another. For example, hidden Markov models are state-of-the-art in speech recognition, while dynamic time warping, a semantically related method, is state-of-the-art in gene sequence alignment. The list of applicable classifiers includes the following:

Metric approaches (Cluster analysis, vector space model, Minkowski distances, dynamic alignment)
Nearest Neighbor methods (K-nearest neighbors algorithm, K-means, self-organizing map)
Risk Minimization (Support vector regression, support vector machine, linear discriminant analysis)
Density-based Methods (Bayes nets, Markov processes, mixture models)
Neural Networks (Perceptron, associative memories, spiking nets)
Heuristics (Decision trees, random forests, etc.)
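
To make two entries of this list concrete, here is a minimal sketch of a k-nearest-neighbors classifier that uses a Minkowski distance. The training vectors and the class labels "speech" and "music" are made up for illustration.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_classify(query, training, k=3, p=2):
    """Majority vote among the k training items closest to the query.

    training is a list of (vector, label) pairs.
    """
    neighbors = sorted(
        training, key=lambda item: minkowski(query, item[0], p)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional descriptions with ground-truth labels.
training = [
    ([0.10, 0.20], "speech"),
    ([0.20, 0.10], "speech"),
    ([0.15, 0.15], "speech"),
    ([0.90, 0.80], "music"),
    ([0.80, 0.90], "music"),
    ([0.85, 0.85], "music"),
]

label_a = knn_classify([0.2, 0.2], training)
label_b = knn_classify([0.8, 0.8], training)
```

The parameter p selects the member of the Minkowski family, so the same classifier can be tried with different metrics without changing the voting logic.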

The selection of the best classifier for a given problem (test set with descriptions and class labels, so-called ground truth) can be performed automatically, for example, using the Weka Data Miner.
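
The idea of automatic classifier selection can be sketched without Weka. The Python code below is not Weka's API; it is a simplified stand-in that scores two hypothetical classifiers by leave-one-out accuracy on a tiny labelled set and keeps the better one. All names and data are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(train, query):
    """Label of the single closest training vector."""
    return min(train, key=lambda item: sq_dist(item[0], query))[1]


def nearest_centroid(train, query):
    """Label of the class whose mean vector is closest to the query."""
    groups = {}
    for vec, label in train:
        groups.setdefault(label, []).append(vec)

    def centroid(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    return min(groups, key=lambda lb: sq_dist(centroid(groups[lb]), query))


def leave_one_out_accuracy(classifier, data):
    """Hold out each item once, train on the rest, count correct labels."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if classifier(train, vec) == label:
            correct += 1
    return correct / len(data)


def select_best(classifiers, data):
    """Return the best classifier name and all scores."""
    scores = {
        name: leave_one_out_accuracy(fn, data)
        for name, fn in classifiers.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical ground truth: one-dimensional descriptions with labels.
data = [([0], "A"), ([1], "A"), ([2], "A"), ([10], "A"),
        ([4], "B"), ([5], "B"), ([6], "B")]

best, scores = select_best(
    {"nearest_neighbor": nearest_neighbor,
     "nearest_centroid": nearest_centroid},
    data,
)
```

On this toy set the outlier in class A pulls that class's centroid towards class B, so the nearest-neighbor rule scores higher. A real selection would use far more data and a wider range of classifiers.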

Models of Multimedia Information Retrieval

Spoken Language Audio Retrieval

Spoken Language Audio Retrieval focuses on audio content containing spoken words. It involves the transcription of spoken content into text using Automatic Speech Recognition (ASR) and indexing the transcriptions for text-based search.
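
The indexing half of this pipeline can be sketched as an inverted index over transcripts. The example assumes the ASR step has already produced text; the recording identifiers and transcript strings are invented, and no real ASR library is involved.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts):
    """Map each token to the set of recording ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of recordings containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical ASR output keyed by recording id.
transcripts = {
    "call_017": "I would like to cancel my subscription",
    "podcast_03": "Today we talk about subscription pricing",
    "meeting_12": "Please cancel the Friday meeting",
}
index = build_index(transcripts)
hits = search(index, "cancel subscription")
```

Because the index is built from recognized text, any word the recognizer gets wrong is simply missing from it, which is why ASR errors translate directly into lower retrieval accuracy.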

Key Features:

Techniques: ASR for transcription and text indexing.
Query Types: Text-based queries.
Applications: Searching podcast transcripts. Analyzing customer service call logs. Finding specific phrases in meeting recordings.
Challenges: Errors in ASR can reduce retrieval accuracy. Multilingual and accent variability requires robust systems.

Non-Speech Audio Retrieval

Non-Speech Audio Retrieval handles audio content without spoken words, such as music, environmental sounds, or sound effects. This model relies on extracting audio features like pitch, rhythm, and timbre to identify relevant audio.
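
Once such features are available as vectors, retrieval by audio sample can be sketched as a similarity ranking. The example below uses cosine similarity, which is one common choice and not the only one. The three-value feature vectors and sound names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_by_example(query_vec, library, top_n=2):
    """Return the names of the top_n most similar library items."""
    ranked = sorted(
        library,
        key=lambda name: cosine_similarity(query_vec, library[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical feature vectors, e.g. (pitch, rhythm, timbre) scores.
library = {
    "dog_bark": [0.9, 0.1, 0.3],
    "gunshot": [0.2, 0.9, 0.1],
    "bird_call": [0.8, 0.2, 0.4],
}
results = query_by_example([0.9, 0.1, 0.28], library)
```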

Key Features:

Techniques: Acoustic feature extraction (e.g., spectrograms, MFCCs).
Query Types: Audio samples or textual descriptions.
Applications: Music recommendation systems. Environmental sound detection (e.g., gunshots, animal calls). Sound effect retrieval in media production.
Challenges: Difficulty in bridging the semantic gap between user queries and low-level audio features. Efficient indexing of large datasets.

Graph Retrieval

Graph Retrieval retrieves information represented as graphs, which consist of nodes (entities) and edges (relationships). It is widely used in social networks, knowledge graphs, and bioinformatics.
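
A small sketch can make the node-and-edge representation concrete. The code below stores a labelled graph as an adjacency list and answers a simple pattern query, namely all paths whose edges carry a given sequence of relationship labels. It is a toy stand-in for graph matching, not the query language of any graph database, and all entity and relation names are invented.

```python
def add_edge(graph, source, relation, target):
    """Add a directed, labelled edge to an adjacency-list graph."""
    graph.setdefault(source, []).append((relation, target))
    graph.setdefault(target, [])  # make sure the target node exists


def match_path(graph, relations):
    """Return all node paths whose consecutive edges carry these labels."""
    paths = [[node] for node in graph]
    for relation in relations:
        extended = []
        for path in paths:
            for rel, target in graph[path[-1]]:
                if rel == relation:
                    extended.append(path + [target])
        paths = extended
    return paths


# Hypothetical social graph.
graph = {}
add_edge(graph, "alice", "knows", "bob")
add_edge(graph, "bob", "works_at", "acme")
add_edge(graph, "alice", "works_at", "initech")
add_edge(graph, "carol", "knows", "alice")

# Pattern: someone who knows a person who works at an organization.
matches = match_path(graph, ["knows", "works_at"])
```

Path patterns like this stay cheap because each step only follows outgoing edges. General subgraph matching has to consider many more candidate mappings, which is where the computational cost arises.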

Key Features:

Techniques: Graph matching, adjacency list/matrix storage, and graph databases (e.g., Neo4j).
Query Types: Subgraphs, patterns, or textual queries.
Applications: Social network analysis. Searching knowledge graphs. Molecular structure retrieval.
Challenges: Computationally intensive subgraph matching. Scalability for large, complex graphs.

Imagery Retrieval

Imagery Retrieval retrieves images based on user input, such as textual descriptions or visual samples. It leverages both low-level features and semantic analysis for search.
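
Retrieval from a visual sample can be sketched with a low-level feature alone. The example below builds a coarse RGB color histogram per image and ranks a small collection by histogram intersection with the query. Images are represented as non-empty lists of (r, g, b) tuples; that representation, the bin count and the image names are assumptions for illustration.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalized joint RGB histogram of a non-empty pixel list."""
    hist = [0.0] * (bins_per_channel ** 3)
    width = 256 / bins_per_channel
    for r, g, b in pixels:
        idx = (
            (int(r // width) * bins_per_channel + int(g // width))
            * bins_per_channel
            + int(b // width)
        )
        hist[idx] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms, between 0 and 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_images(query_pixels, database):
    """Return image names ordered from most to least similar."""
    query_hist = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(query_hist, color_histogram(px)), name)
        for name, px in database.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical four-pixel "images".
database = {
    "sunset": [(250, 80, 20)] * 3 + [(200, 40, 10)],
    "forest": [(20, 200, 30)] * 4,
    "ocean": [(10, 40, 220)] * 4,
}
query = [(240, 60, 30)] * 3 + [(30, 180, 40)]
ranking = rank_images(query, database)
```

The ranking reflects color overlap only. Two images with similar colors but unrelated content would score as close matches, which is one face of the semantic gap between low-level features and what the user means.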

Key Features:

Techniques: Content-Based Image Retrieval (CBIR), visual feature extraction, semantic analysis.
Query Types: Text, sketches, or example images.
Applications: Stock image search. E-commerce product matching. Medical imaging analysis.
Challenges: Bridging the semantic gap between user queries and image content. Efficient indexing of large-scale image datasets.

Video Retrieval

Video Retrieval is the process of finding specific video content based on user queries. It involves analyzing both the visual and temporal features of videos.
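
One simple way to combine the visual and temporal dimensions is to keep only frames where the picture changes sharply. The sketch below selects keyframes by thresholding the mean absolute difference to the last selected keyframe. Frames are modelled as flat lists of grayscale values; the threshold and the sample frames are hypothetical.

```python
def frame_difference(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def extract_keyframes(frames, threshold):
    """Return indices of frames that differ strongly from the last keyframe."""
    if not frames:
        return []
    keyframes = [0]  # the first frame always starts a segment
    for i in range(1, len(frames)):
        if frame_difference(frames[keyframes[-1]], frames[i]) > threshold:
            keyframes.append(i)
    return keyframes


# Hypothetical clip: a dark shot, a bright shot, then a mid-gray shot.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 10],
    [200, 200, 190, 210],
    [198, 202, 191, 209],
    [50, 60, 50, 60],
]
keyframes = extract_keyframes(frames, threshold=30)
```

Indexing only the selected frames reduces the amount of visual data to process, at the cost of missing gradual changes that never exceed the threshold between consecutive comparisons.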

Key Features:

Techniques: Keyframe extraction, motion pattern analysis, temporal indexing.
Query Types: Textual descriptions, sample clips, or temporal queries.
Applications: Streaming service recommendations. Surveillance footage analysis. Sports analytics.
Challenges: Managing the large file sizes of video content. Efficient analysis of temporal sequences and multimodal features.

Comparison of Retrieval Models

| Model | Data Type | Query Types | Applications |
|---|---|---|---|
| Spoken Language Audio | Speech recordings | Text queries | Podcasts, meeting logs, call centers |
| Non-Speech Audio | Music, sound effects | Audio samples or text | Music apps, environmental sounds |
| Graph Retrieval | Graph structures | Subgraphs, patterns | Knowledge graphs, bioinformatics |
| Imagery Retrieval | Images | Text, sketches, or images | E-commerce, medical imaging |
| Video Retrieval | Videos (visual + temporal) | Text, clips, or time queries | Surveillance, sports analysis |

Conclusion

Multimedia Information Retrieval plays a crucial role in organizing and accessing vast multimedia data repositories. The variety of retrieval models ensures that users can effectively interact with and extract insights from complex multimedia datasets. Future advancements in artificial intelligence and machine learning are expected to improve the accuracy and scalability of MIR systems.

Related areas

MMIR provides an overview of methods employed in the areas of information retrieval.^[6]^[7] Methods of one area are adapted and employed on other types of media. Multimedia content is merged before the classification is performed. MMIR methods are, therefore, usually reused from other areas such as:

The International Journal of Multimedia Information Retrieval^[8] documents the development of MMIR as a research discipline that is independent of these areas. See also the Handbook of Multimedia Information Retrieval^[9] for a complete overview of this research discipline.

References

[1] H Eidenberger. Fundamental Media Understanding, atpress, 2011, p. 1.
[2] Sikos, L. F. (2016). "RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing: a comprehensive review". Multimedia Tools and Applications. 76 (12): 14437–14460. doi:10.1007/s11042-016-3705-7. S2CID 254832794.
[3] A Del Bimbo. Visual Information Retrieval, Morgan Kaufmann, 1999.
[4] HG Kim, N Moreau, T Sikora. MPEG-7 Audio and Beyond, Wiley, 2005.
[5] MS Lew (Ed.). Principles of Visual Information Retrieval, Springer, 2001.
[6] H Eidenberger. Professional Media Understanding, atpress, 2012.
[7] Raieli, Roberto (2016). "Introducing Multimedia Information Retrieval to libraries". JLIS.it. 7 (3): 9–42. doi:10.4403/jlis.it-11530. S2CID 56652314.
[8] "International Journal of Multimedia Information Retrieval", Springer, 2011, Retrieved 21 October 2011.
[9] H Eidenberger. Handbook of Multimedia Information Retrieval, atpress, 2012.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
