Search by sound

Search by sound is the retrieval of information based on audio input. A handful of applications, mostly for mobile devices, use search by sound. Shazam, SoundHound, Axwave, ACRCloud and others have seen considerable success by using a simple algorithm to match an acoustic fingerprint to a song in a library. These applications take a sample clip of a song, or a user-generated melody, and check a music database to see where the clip matches a song. The song's information is then retrieved and displayed to the user.

These kinds of applications are mainly used to find a song that the user does not already know. Searching by sound is not limited to identifying songs; it is also used for identifying melodies, tunes or advertisements, for sound library management, and for video files.

Acoustic fingerprinting

These apps search by sound by generating an acoustic fingerprint, a digital summary of the sound. A microphone picks up an audio sample, which is then reduced to a simple numeric signature, a code unique to each track. The tracks in the database are fingerprinted by the same method, so when Shazam picks up a sound clip it generates a signature for that clip and then pattern-matches it against the signatures in an extensive music database.
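The fingerprint-and-match idea can be illustrated with a minimal Python sketch. It is not Shazam's actual algorithm: it assumes, purely for illustration, that a fingerprint is the set of pairs of dominant frequency bins in consecutive frames, that clips are aligned to frame boundaries, and that tracks are synthetic sine tones. The names `FRAME`, `dominant_bin`, `fingerprint`, `best_match` and `tone_sequence` are hypothetical.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (illustrative choice, not from any real system)


def dominant_bin(frame):
    """Return the index of the strongest frequency bin in one frame (naive DFT)."""
    n = len(frame)
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2):  # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin


def fingerprint(samples):
    """Summarize a signal as the set of (peak, next_peak) pairs of consecutive frames."""
    peaks = [dominant_bin(samples[i:i + FRAME])
             for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return set(zip(peaks, peaks[1:]))


def best_match(clip, library):
    """Return the library title whose fingerprint shares the most pairs with the clip."""
    clip_fp = fingerprint(clip)
    return max(library, key=lambda title: len(clip_fp & library[title]))


def tone_sequence(bins):
    """Synthesize a toy track: one frame of a pure sine wave per requested bin."""
    return [math.sin(2 * math.pi * b * i / FRAME)
            for b in bins for i in range(FRAME)]


# Hypothetical two-track library; real databases hold millions of fingerprints.
track_a = tone_sequence([3, 7, 5, 9, 4, 8])
track_b = tone_sequence([10, 2, 12, 6, 11, 13])
library = {"Track A": fingerprint(track_a), "Track B": fingerprint(track_b)}

clip = track_a[2 * FRAME:5 * FRAME]  # a short excerpt taken from Track A
print(best_match(clip, library))     # prints "Track A"
```

A production system would also have to cope with background noise, clips that start at arbitrary offsets, and fast lookup over a very large index, none of which this sketch attempts.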

The use of acoustic fingerprints is not limited to music; it extends to other areas of the entertainment business as well. Shazam can also identify television shows with the same technique. Of course, breaking a sound sample down into a unique signature is useless unless there is an extensive database of music with keys to match against the samples. Shazam has over 11 million songs in its database. [1]

Other services, such as Midomi and SoundHound, allow users to add to the music library, which improves the chances of matching a sound sample with its corresponding song.

Query by humming

Midomi and SoundHound both offer query by humming. This is an offshoot of acoustic fingerprinting, but it is still a music retrieval system. The user hums a melody, which serves as the input query, and the system returns a ranked list of the songs closest to that query.
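The ranking step can be sketched in Python under some loud assumptions: the hummed audio has already been transcribed into a list of MIDI pitch numbers, melodies are compared only by their up/down/repeat contour (in the style of the Parsons code, without its leading asterisk), and closeness is measured by edit distance. This is an illustration, not the method Midomi or SoundHound use; the function names and the three-entry database are hypothetical.

```python
def contour(pitches):
    """Encode a pitch sequence as Up/Down/Repeat steps between successive notes."""
    return "".join("U" if b > a else "D" if b < a else "R"
                   for a, b in zip(pitches, pitches[1:]))


def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def rank(query_pitches, database):
    """Return database titles ordered from closest to farthest contour."""
    q = contour(query_pitches)
    return sorted(database,
                  key=lambda title: edit_distance(q, contour(database[title])))


# Hypothetical database: opening notes as MIDI pitch numbers.
database = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "Ascending scale": [60, 62, 64, 65, 67, 69, 71],
}

# A transposed, slightly off-key hum of the first melody.
hummed = [62, 62, 69, 68, 71, 71, 69]
print(rank(hummed, database))
```

Because only the contour is compared, the query still ranks the intended melody first even though it is hummed in a different key and contains one wrong note.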

Related Research Articles

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. Those involved in MIR may have a background in academic musicology, psychoacoustics, psychology, signal processing, informatics, machine learning, optical music recognition, computational intelligence or some combination of these.

Content-based image retrieval: Method of image retrieval

Content-based image retrieval, also known as query by image content and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases. Content-based image retrieval is opposed to traditional concept-based approaches.

Ableton Live: Digital audio workstation

Ableton Live, also known as Live or sometimes colloquially as "Ableton", is a digital audio workstation for macOS and Windows developed by the German company Ableton.

Query by humming (QbH) is a music retrieval system that branches off the original classification systems of title, artist, composer, and genre. It normally applies to songs or other music with a distinct single theme or melody. The system involves taking a user-hummed melody and comparing it to an existing database. The system then returns a ranked list of music closest to the input query.

A fingerprint is a mark made by the pattern of ridges on the pad of a human finger.

Musipedia is a search engine for identifying pieces of music. This can be done by whistling a theme, playing it on a virtual piano keyboard, tapping the rhythm on the computer keyboard, or entering the Parsons code. Anybody can modify the collection of melodies and enter MIDI files, bitmaps with sheet music, lyrics or some text about the piece, or the melodic contours as Parsons code. Certain features on the site may no longer work because they rely on Adobe Flash, which was discontinued in 2020.

Multimedia search enables information search using queries in multiple data types, including text and other multimedia formats. Multimedia search can be implemented through multimodal search interfaces, i.e., interfaces that allow users to submit search queries not only as textual requests but also through other media. We can distinguish two methodologies in multimedia search:

Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.

An audio search engine is a web-based search engine which crawls the web for audio content. The information can consist of web pages, images, audio files, or another type of document. Various techniques exist for research on these engines.

Shazam (music app): Music identification application

Shazam is an application that can identify music based on a short sample played using the microphone on the device. It was created by the British company Shazam Entertainment, based in London, and has been owned by Apple Inc. since 2018. The software is available for Android, macOS, iOS, Wear OS, watchOS and as a Google Chrome extension.

SoundHound: American music and speech recognition company

SoundHound AI, Inc. is a voice AI and speech recognition company founded in 2005. It develops speech recognition, natural language understanding, sound recognition and search technologies. Its featured products include a voice AI developer platform, SoundHound Chat AI, a voice-enabled digital assistant, and music recognition mobile app SoundHound. Key vertical industries include the automotive, IoT devices, restaurant and customer service industries. The company’s headquarters are in Santa Clara, California.

Tunebot is a music search engine developed by the Interactive Audio Lab at Northwestern University. Users can search the database by humming or singing a melody into a microphone, playing the melody on a virtual keyboard, or by typing some of the lyrics. This allows users to finally identify that song that was stuck in their head.

Reverse image search: Content-based image retrieval

Reverse image search is a content-based image retrieval (CBIR) query technique that involves providing the CBIR system with a sample image on which it then bases its search; in terms of information retrieval, the sample image serves as the search query. In particular, reverse image search is characterized by a lack of search terms. This effectively removes the need for a user to guess at keywords or terms that may or may not return a correct result. Reverse image search also allows users to discover content that is related to a specific sample image, to gauge the popularity of an image, and to discover manipulated versions and derivative works.

An acoustic fingerprint is a condensed digital summary, a fingerprint, deterministically generated from an audio signal, that can be used to identify an audio sample or quickly locate similar items in an audio database.

International Society for Music Information Retrieval

The International Society for Music Information Retrieval (ISMIR) is an international forum for research on the organization of music-related data. It started as an informal group steered by an ad hoc committee in 2000 which established a yearly symposium - whence "ISMIR", which meant International Symposium on Music Information Retrieval. It was turned into a conference in 2002 while retaining the acronym. ISMIR was incorporated in Canada on July 4, 2008.

Bing Audio

Bing Audio is a music recognition application created by Microsoft which is installed on Windows Phones running version 7.5 and above, including Windows Phone 8. On Windows Phone 8.1, and in regions where the Microsoft Cortana voice assistant is available, Bing Music is integrated with Cortana and the music search history is a part of Cortana's "Notebook". The service is only designed to recognize recorded songs, not live performances or humming. Xbox Music Pass subscribers can immediately add the songs to their playlists. A unique feature compared to similar services is that Bing Audio continuously listens and analyzes music while most other services can only listen for a fixed amount of time. Bing Research developed a fingerprinting algorithm to identify songs.

Doreso is an automatic content recognition (ACR) company specializing in music discovery and social TV recognition services for the second screen. Its sound-to-sound music search engine allows users to obtain more detailed information about music and songs by singing, humming or recording original music.

Perceptual hashing is the use of a fingerprinting algorithm that produces a snippet, hash, or fingerprint of various forms of multimedia. A perceptual hash is a type of locality-sensitive hash: its outputs are similar when the features of the multimedia are similar. This is in contrast to cryptographic hashing, which relies on the avalanche effect, whereby a small change in the input value creates a drastic change in the output value. Perceptual hash functions are widely used in finding cases of online copyright infringement as well as in digital forensics, because the correlation between hashes allows similar data to be found.

Automatic content recognition (ACR) is a technology used to identify content played on a media device or presented within a media file. Devices with ACR can allow for the collection of content consumption information automatically at the screen level itself, without any user-based input or search efforts. This information may be collected for purposes such as personalized advertising, content recommendations, sale to customer data aggregators and other applications.

References

  1. "That Tune, Named", Slate, retrieved 2020-07-11