Optical music recognition (OMR) is a field of research that investigates how to computationally read musical notation in documents. [1] The goal of OMR is to teach the computer to read and interpret sheet music and produce a machine-readable version of the written music score. Once captured digitally, the music can be saved in commonly used file formats, e.g. MIDI (for playback) and MusicXML (for page layout). In the past it has, misleadingly, also been called "music optical character recognition". Due to significant differences, this term should no longer be used. [2]
Optical music recognition of printed sheet music started in the late 1960s at the Massachusetts Institute of Technology when the first image scanners became affordable for research institutes. [3] [4] [5] Due to the limited memory of early computers, the first attempts were limited to only a few measures of music. In 1984, a Japanese research group from Waseda University developed a specialized robot, called WABOT (WAseda roBOT), which was capable of reading the music sheet in front of it and accompanying a singer on an electric organ. [6] [7]
Early research in OMR was conducted by Ichiro Fujinaga, Nicholas Carter, Kia Ng, David Bainbridge, and Tim Bell. These researchers developed many of the techniques that are still being used today.
The first commercial OMR application, MIDISCAN (now SmartScore), was released in 1991 by Musitek Corporation.
The availability of smartphones with good cameras and sufficient computational power paved the way for mobile solutions in which the user takes a picture with the smartphone and the device processes the image directly.
Optical music recognition relates to other fields of research, including computer vision, document analysis, and music information retrieval. It is relevant for practicing musicians and composers, who could use OMR systems to enter music into the computer and thus ease the process of composing, transcribing, and editing music. In a library, an OMR system could make music scores searchable, [8] and for musicologists it would make it possible to conduct quantitative musicological studies at scale. [9]
Optical music recognition has frequently been compared to optical character recognition (OCR). [2] [10] [11] The biggest difference is that music notation is a featural writing system. This means that while the alphabet consists of well-defined primitives (e.g., stems, noteheads, or flags), it is their configuration, that is, how they are placed and arranged on the staff, that determines the semantics and how the notation should be interpreted.
The second major distinction is that while an OCR system does not go beyond recognizing letters and words, an OMR system is expected to also recover the semantics of music: the user expects the vertical position of a note (a graphical concept) to be translated into a pitch (a musical concept) by applying the rules of music notation. There is no proper equivalent in text recognition. By analogy, recovering the music from an image of a music sheet can be as challenging as recovering the HTML source code from a screenshot of a website.
The third difference lies in the character set. Although writing systems like Chinese have extraordinarily complex character sets, the set of primitives for OMR spans a much greater range of sizes, from tiny elements such as a dot to large elements, such as a brace, that can potentially span an entire page. Some symbols, such as slurs, have a nearly unrestricted appearance: they are defined only as more-or-less smooth curves that may be interrupted anywhere.
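The translation from vertical position to pitch can be illustrated with a minimal sketch. The function name `staff_position_to_pitch`, the position convention (diatonic steps counted from the bottom staff line), and the clef table are assumptions made for this illustration; key signatures, accidentals, and clef changes are deliberately ignored.

```python
# Diatonic step names in ascending order.
STEPS = ["C", "D", "E", "F", "G", "A", "B"]

# Pitch of the bottom staff line for two common clefs
# (a small lookup table assumed for this sketch).
BOTTOM_LINE = {"treble": ("E", 4), "bass": ("G", 2)}


def staff_position_to_pitch(position: int, clef: str = "treble") -> str:
    """Map a vertical staff position to a pitch name.

    position counts diatonic steps from the bottom staff line:
    0 = bottom line, 1 = first space, 2 = second line, and so on.
    Negative values lie below the staff (ledger lines).
    """
    step, octave = BOTTOM_LINE[clef]
    # Absolute diatonic index: seven steps per octave.
    index = octave * 7 + STEPS.index(step) + position
    return f"{STEPS[index % 7]}{index // 7}"


# The same graphical position yields different pitches under different clefs.
print(staff_position_to_pitch(0, "treble"))  # E4
print(staff_position_to_pitch(0, "bass"))    # G2
```

Even this simplified mapping depends on context (the clef) that is located elsewhere on the staff, which hints at why semantic reconstruction is harder than recognizing isolated symbols.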
Finally, music notation involves ubiquitous two-dimensional spatial relationships, whereas text can be read as a one-dimensional stream of information, once the baseline is established.
The process of recognizing music scores is typically broken down into smaller steps that are handled with specialized pattern recognition algorithms.
Many competing approaches have been proposed, most of which share a pipeline architecture in which each step performs a certain operation, such as detecting and removing staff lines, before passing its result on to the next stage. A common problem with this approach is that errors and artifacts introduced in one stage propagate through the system and can heavily affect the overall performance. For example, if the staff line detection stage fails to identify a music staff, subsequent steps will probably ignore that region of the image, leading to missing information in the output.
Optical music recognition is frequently underestimated because the problem seems easy: given a perfect scan of typeset music, the visual recognition can be solved with a sequence of fairly simple algorithms, such as projections and template matching. However, the process gets significantly harder for poor scans or handwritten music, which many systems fail to recognize altogether. And even if all symbols were detected perfectly, it would still be challenging to recover the musical semantics due to ambiguities and frequent violations of the rules of music notation (see the example of Chopin's Nocturne). Donald Byrd and Jakob Simonsen argue that OMR is difficult because modern music notation is extremely complex. [11]
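The error propagation described above can be sketched with a toy pipeline. All stage functions and the dictionary keys `staff_found` and `symbols_on_page` are hypothetical stand-ins for real image processing; the sketch only models how a failure in an early stage removes information that later stages never see.

```python
from typing import Callable, Dict, List

# A page is modelled as a list of regions. The keys "staff_found" and
# "symbols_on_page" are hypothetical placeholders for real image data.
Page = List[Dict]


def detect_staffs(regions: Page) -> Page:
    """Stage 1: keep only the regions in which a staff was detected."""
    return [r for r in regions if r["staff_found"]]


def recognize_symbols(regions: Page) -> Page:
    """Stage 2: 'recognize' symbols, but only in regions that survived stage 1."""
    return [{**r, "recognized": list(r["symbols_on_page"])} for r in regions]


def reconstruct_notation(regions: Page) -> List[str]:
    """Stage 3: flatten the recognized symbols into one output sequence."""
    return [symbol for r in regions for symbol in r["recognized"]]


def run_pipeline(page: Page, stages: List[Callable]) -> List[str]:
    """Feed the output of each stage into the next one."""
    data = page
    for stage in stages:
        data = stage(data)
    return data


STAGES = [detect_staffs, recognize_symbols, reconstruct_notation]

page = [
    {"staff_found": True, "symbols_on_page": ["C4", "E4"]},
    # Simulated staff detection failure: this region is dropped in stage 1.
    {"staff_found": False, "symbols_on_page": ["G4"]},
]

print(run_pipeline(page, STAGES))  # ['C4', 'E4'] -- the G4 is silently lost
```

No later stage can recover the lost note, because the region containing it is never passed on.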
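As an illustration of how simple the projection-based idea is on clean input, the following sketch locates staff lines in a tiny synthetic binary image by counting black pixels per row. The function names, the `min_fill` threshold, and the test image are assumptions made for this example; real scans would additionally require binarization, deskewing, and tolerance for line thickness.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = black (ink), 0 = white


def horizontal_projection(image: Image) -> List[int]:
    """Count the black pixels in every row of the image."""
    return [sum(row) for row in image]


def find_staff_line_rows(image: Image, min_fill: float = 0.8) -> List[int]:
    """Return the rows whose share of black pixels is at least min_fill."""
    width = len(image[0])
    projection = horizontal_projection(image)
    return [y for y, count in enumerate(projection) if count >= min_fill * width]


def make_test_image(width: int = 10, height: int = 11) -> Image:
    """Build a synthetic staff: five full-width lines plus a small ink blob."""
    image = [[0] * width for _ in range(height)]
    for y in (1, 3, 5, 7, 9):
        image[y] = [1] * width
    image[4][3] = image[4][4] = 1  # a two-pixel "notehead" between two lines
    return image


print(find_staff_line_rows(make_test_image()))  # [1, 3, 5, 7, 9]
```

The approach breaks down as soon as the lines are skewed, curved, or interrupted, which is typical of poor scans and handwritten scores.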
Donald Byrd also collected a number of interesting examples [12] as well as extreme examples [13] of music notation that demonstrate the sheer complexity of music notation.
Typical applications for OMR systems include the creation of an audible version of the music score (referred to as replayability). A common way to create such a version is by generating a MIDI file, which can be synthesised into an audio file. MIDI files, though, are not capable of storing engraving information (how the notes were laid out) or enharmonic spelling.
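The loss of enharmonic spelling can be made concrete with a small sketch. The helper `midi_number` is hypothetical, and it assumes the common convention in which middle C (C4) corresponds to MIDI note number 60.

```python
# Semitone offsets of the natural notes within one octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# Accidentals shift the pitch by semitones.
ALTER = {"": 0, "#": 1, "b": -1}


def midi_number(step: str, accidental: str, octave: int) -> int:
    """Convert a spelled pitch to a MIDI note number (assuming C4 = 60)."""
    return 12 * (octave + 1) + SEMITONES[step] + ALTER[accidental]


c_sharp = midi_number("C", "#", 4)
d_flat = midi_number("D", "b", 4)

# Both spellings collapse to the same number, so the written distinction
# between C-sharp and D-flat cannot be recovered from the MIDI data alone.
print(c_sharp, d_flat)  # 61 61
```

Formats intended for notation store the step, alteration, and octave separately, which is why they can preserve the spelling that a note number discards.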
If the music scores are recognized with the goal of human readability (referred to as reprintability), the structured encoding has to be recovered, which includes precise information on the layout and engraving. Suitable formats to store this information include MEI and MusicXML.
Apart from those two applications, it might also be interesting to just extract metadata from the image or enable searching. In contrast to the first two applications, a lower level of comprehension of the music score might be sufficient to perform these tasks.
In 2001, David Bainbridge and Tim Bell published their work on the challenges of OMR, in which they reviewed previous research and extracted a general framework for OMR. [10] Their framework has been used by many systems developed after 2001. It has four distinct stages, with a heavy emphasis on the visual detection of objects. They noticed that the reconstruction of the musical semantics was often omitted from published articles because the operations used were specific to the output format.
In 2012, Ana Rebelo et al. surveyed techniques for optical music recognition. [14] They categorized the published research and refined the OMR pipeline into four stages: preprocessing, music symbol recognition, musical notation reconstruction, and final representation construction. This framework became the de facto standard for OMR and is still being used today (although sometimes with slightly different terminology). For each stage, they give an overview of the techniques used to tackle that problem. This publication was the most cited paper on OMR research as of 2019.
With the advent of deep learning, many computer vision problems have shifted from imperative programming with hand-crafted heuristics and feature engineering towards machine learning. In optical music recognition, the staff processing stage, [15] [16] the music object detection stage, [17] [18] [19] [20] as well as the music notation reconstruction stage [21] have seen successful attempts to solve them with deep learning.
Completely new approaches have also been proposed, including solving OMR end-to-end with sequence-to-sequence models that take an image of a music score and directly produce the recognized music in a simplified format. [22] [23] [24] [25]
For systems that were developed before 2016, staff detection and removal posed a significant obstacle. A scientific competition was organized to improve the state of the art and advance the field. [26] Due to excellent results and modern techniques that made the staff removal stage obsolete, this competition was discontinued.
However, the freely available CVC-MUSCIMA dataset that was developed for this challenge is still highly relevant for OMR research as it contains 1000 high-quality images of handwritten music scores, transcribed by 50 different musicians. It has been further extended into the MUSCIMA++ dataset, which contains detailed annotations for 140 out of 1000 pages.
The Single Interface for Music Score Searching and Analysis project (SIMSSA) [27] is probably the largest project that attempts to teach computers to recognize musical scores and make them accessible. Several sub-projects have already been successfully completed, including the Liber Usualis [28] and Cantus Ultimus. [29]
Towards Richer Online Music Public-domain Archives (TROMPA) is an international research project, sponsored by the European Union, that investigates how to make public-domain digital music resources more accessible. [30]
The development of OMR systems benefits from test datasets of sufficient size and diversity to ensure the system being developed works under various conditions. However, for legal reasons and potential copyright violations, it is challenging to compile and publish such a dataset. The most notable datasets for OMR are referenced and summarized by the OMR Datasets project [31] and include the CVC-MUSCIMA, [32] MUSCIMA++, [33] DeepScores, [34] PrIMuS, [35] HOMUS, [36] and SEILS dataset, [37] as well as the Universal Music Symbol Collection. [38]
French company Newzik took a different approach in the development of its OMR technology Maestria, [39] by using random score generation. Using synthetic data helped with avoiding copyright issues and training the artificial intelligence algorithms on musical cases that rarely occur in actual repertoire, ultimately resulting in (according to claims by the company) more accurate music recognition. [40]
Open-source OMR projects vary significantly in maturity, from well-developed software such as Audiveris to the many projects realized in academia, of which only a few have reached a mature state and been successfully deployed to users. These systems include:
Most of the commercial desktop applications that were developed in the last 20 years have since been shut down due to a lack of commercial success, leaving only a few vendors that are still developing, maintaining, and selling OMR products. Some of these products claim extremely high recognition rates of up to 100% accuracy [49] [50] but fail to disclose how those numbers were obtained, making it nearly impossible to verify them or to compare different OMR systems.
Better cameras and increases in processing power have enabled a range of mobile applications, available on both the Google Play Store and the Apple App Store. Frequently the focus is on sight-playing (see sight-reading), that is, converting the sheet music into sound that is played on the device.