In Codice Ratio

In Codice Ratio is a research project that studies and applies techniques such as optical character recognition and artificial intelligence to digitize the holdings of the Vatican Apostolic Archive, [1] [2] most of which are handwritten. [3] [4]

History

In 2017, In Codice Ratio, a project based at Roma Tre University, began using artificial intelligence and optical character recognition to transcribe documents from the archive. [3] [5] While character-recognition software is adept at reading typed text, the cramped, many-serifed style of medieval handwriting makes it difficult for the software to distinguish individual characters. [6] Even human readers of medieval handwriting often confuse individual letters, let alone a computer program. The team behind In Codice Ratio addressed this problem by developing machine-learning software that could parse this handwriting. Their program eventually achieved 96% accuracy in parsing this type of text. [7]

Related Research Articles

Optical character recognition Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo or from subtitle text superimposed on an image.

Handwriting recognition Ability of a computer to receive and interpret intelligible handwritten input

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.

Vatican Apostolic Archive Archive of the Holy See

The Vatican Apostolic Archive, known until October 2019 as the Vatican Secret Archive, is the central repository in the Vatican City of all acts promulgated by the Holy See.

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document and this kind of semantic labeling is the scope of the logical layout analysis.

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or image classification technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.

Automatic identification and data capture (AIDC) refers to the methods of automatically identifying objects, collecting data about them, and entering them directly into computer systems, without human involvement. Technologies typically considered as part of AIDC include QR codes, bar codes, radio frequency identification (RFID), biometrics, magnetic stripes, optical character recognition (OCR), smart cards, and voice recognition. AIDC is also commonly referred to as "Automatic Identification", "Auto-ID" and "Automatic Data Capture".

Neural network Structure in biology and artificial intelligence

A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled in artificial neural networks as weights between nodes. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1.
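The weighted sum and activation described above can be illustrated with a single artificial neuron. This is only a minimal sketch: the function name `neuron_output` and the choice of a logistic (sigmoid) activation are assumptions made for illustration, and nothing here is taken from the In Codice Ratio software.

```python
import math


def neuron_output(inputs, weights):
    """Return the output of one artificial neuron (illustrative sketch).

    Each input is multiplied by its weight and the products are summed
    (the linear combination). A sigmoid activation, chosen here as an
    assumption, then limits the output to the range between 0 and 1.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")

    # Linear combination: positive weights act as excitatory connections,
    # negative weights as inhibitory ones.
    z = sum(x * w for x, w in zip(inputs, weights))

    # Numerically stable sigmoid, so large |z| does not overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


excitatory = neuron_output([1.0], [2.0])    # pushed above 0.5
inhibitory = neuron_output([1.0], [-2.0])   # pushed below 0.5
```

Replacing the sigmoid with a hyperbolic tangent would give the −1 to 1 output range mentioned above instead.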

SmartScore X2 is a music OCR and scorewriter program, developed, published and distributed by Musitek Corporation based in Ojai, California.

Optical music recognition (OMR) is a field of research that investigates how to computationally read musical notation in documents. The goal of OMR is to teach the computer to read and interpret sheet music and produce a machine-readable version of the written music score. Once captured digitally, the music can be saved in commonly used file formats, e.g. MIDI and MusicXML. In the past it has, misleadingly, also been called "music optical character recognition". Due to significant differences, this term should no longer be used.

In computer science, intelligent character recognition (ICR) is an advanced optical character recognition (OCR) or, more specifically, handwriting recognition system that allows fonts and different styles of handwriting to be learned by a computer during processing to improve accuracy and recognition levels.

ABBYY FineReader PDF is an optical character recognition (OCR) application developed by ABBYY, with support for PDF file editing since v15. The program runs under Microsoft Windows 7 or later, and Apple macOS 10.12 Sierra or later. The first version was released in 1993.

Intelligent Word Recognition, or IWR, is the recognition of unconstrained handwritten words. IWR recognizes entire handwritten words or phrases instead of character-by-character, like its predecessor, optical character recognition (OCR). IWR technology matches handwritten or printed words to a user-defined dictionary, significantly reducing character errors encountered in typical character-based recognition engines.
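The dictionary-matching step described above can be sketched as follows. This is a simplified illustration, not an IWR engine: it assumes a recognizer has already produced a noisy word string, and it substitutes plain string similarity from Python's standard `difflib` module for image-based word recognition. The function name `match_to_dictionary`, the cutoff value, and the sample Latin words are all hypothetical.

```python
import difflib


def match_to_dictionary(raw_word, dictionary, cutoff=0.6):
    """Map a noisy recognized word to the closest dictionary entry.

    Returns the best-matching entry, or None when no entry is at least
    `cutoff` similar (similarity is difflib's ratio, from 0 to 1).
    """
    matches = difflib.get_close_matches(
        raw_word.lower(), dictionary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical user-defined dictionary of expected words.
latin_words = ["registrum", "epistola", "pontifex"]

# A character-level misreading ("v" for "u") is repaired at the word level.
corrected = match_to_dictionary("registrvm", latin_words)
```

Constraining the output to a known vocabulary is what reduces character errors, at the cost of being unable to return words that are missing from the dictionary.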

This comparison of optical character recognition software includes:

Yann LeCun French computer scientist

Yann André LeCun is a French computer scientist working primarily in the fields of machine learning, computer vision, mobile robotics, and computational neuroscience. He is the Silver Professor of the Courant Institute of Mathematical Sciences at New York University, and Vice President, Chief AI Scientist at Meta.

CEDAR-FOX is a software system for forensic comparison of handwriting. It was developed at CEDAR, the Center of Excellence for Document Analysis and Recognition at the University at Buffalo. CEDAR-FOX lets the questioned document examiner step through processing stages such as extracting regions of interest from a scanned document, determining lines and words of text, and recognizing textual elements. The final goal is to compare two samples of writing to determine the log-likelihood ratio under the prosecution and defense hypotheses. It can also be used to compare signature samples. The software, which is protected by a United States patent, can be licensed from Cedartech, Inc.

Handwritten biometric recognition Process of identifying the author of a given text from the handwriting style

Handwritten biometric recognition is the process of identifying the author of a given text from the handwriting style. Handwritten biometric recognition belongs to behavioural biometric systems because it is based on something that the user has learned to do.

MNIST database Database of handwritten digits

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

This is a timeline of optical character recognition.

Outline of machine learning Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

OCR Systems American computing company

OCR Systems, Inc., was an American computer hardware manufacturer and software publisher dedicated to optical character recognition technologies. The company's first product, the System 1000 in 1970, was used by numerous large corporations for bill processing and mail sorting. Following a series of pitfalls in the 1970s and early 1980s, founder Theodore Herzl Levine put the company in the hands of Gregory Boleslavsky and Vadim Brikman, the company's vice presidents and recent immigrants from the Soviet Ukraine, who were able to turn OCR Systems' fortunes around and expand its employee base. The company released the software-based OCR application ReadRight for DOS, later ported to Windows, in the late 1980s. Adobe Inc. bought the company in 1992.

References

  1. Firmani, Donatella; Maiorino, Marco; Merialdo, Paolo; Nieddu, Elena (2018-03-01). "Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1". Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 263–272. arXiv:1803.03200. doi:10.1145/3219819.3219879. ISBN 9781450355520. S2CID 3772349.
  2. "Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio". SIGKDD - KDD 2018. Retrieved 2021-03-25.
  3. Kean, Sam (2018-04-30). "Artificial Intelligence Is Cracking Open the Vatican's Secret Archives". The Atlantic. Retrieved 2021-03-25.
  4. Firmani, Donatella; Merialdo, Paolo; Nieddu, Elena; Scardapane, Simone (December 2017). "In Codice Ratio: OCR of Handwritten Latin Documents using Deep Convolutional Networks".
  5. Firmani, D.; Merialdo, P.; Nieddu, E.; Scardapane, S. (2017). "In codice ratio: OCR of handwritten Latin documents using deep convolutional networks" (PDF). International Workshop on Artificial Intelligence for Cultural Heritage. pp. 9–16.
  6. "AI tackles the Vatican's secrets". MIT Technology Review. 15 March 2018. Retrieved 27 November 2018.
  7. Firmani, Donatella; Merialdo, Paolo; Maiorino, Marco (25 September 2017). "In Codice Ratio: Scalable Transcription of Vatican Registers". ERCIM News. Retrieved 27 November 2018.