Scene text

Scene text is text that appears in an image captured by a camera in an outdoor environment.

An example of scene text: lettering on a railway coach indicating that the coach belongs to the Sleeper category. (Text on a coach.jpg)

The detection and recognition of scene text in camera-captured images are computer vision tasks that became important after smartphones with good cameras became ubiquitous. The text in scene images varies in shape, font, colour and position, and its recognition is sometimes further complicated by non-uniform illumination and poor focus.

To improve scene text recognition, the International Conference on Document Analysis and Recognition (ICDAR) conducts a robust reading competition every two years. The competition was held in 2003 and 2005, [1] [2] [3] and has been held during every ICDAR conference since. [4] [5] [6] The International Association for Pattern Recognition (IAPR) technical committee on reading systems (TC11) maintains a list of relevant datasets. [7]

Text detection

Text detection is the process of detecting the text present in an image and surrounding it with a rectangular bounding box. Text detection can be carried out using image-based or frequency-based techniques.

In image-based techniques, an image is segmented into multiple segments, each a connected component of pixels with similar characteristics. Statistical features of the connected components are used to group them into candidate text regions. Machine learning approaches such as support vector machines and convolutional neural networks are then used to classify the components as text or non-text.
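
As an illustration, the sketch below shows a minimal connected-component stage of an image-based detector using OpenCV. The Otsu binarisation and the geometric filtering thresholds are illustrative assumptions, not values taken from the literature; in practice the surviving components would then be grouped and classified as text or non-text by a trained SVM or CNN.

```python
import cv2

def candidate_text_components(image_path):
    """Find connected components that could plausibly be characters."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarise; scene text may be darker or lighter than its background,
    # so a global Otsu threshold is only a rough first pass.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    candidates = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        aspect = w / float(h)
        fill = area / float(w * h)
        # Illustrative geometric filters: characters tend to have a moderate
        # aspect ratio and are neither tiny specks nor solid filled blobs.
        if area > 10 and 0.1 < aspect < 10 and 0.1 < fill < 0.95:
            candidates.append((x, y, w, h))
    return candidates
```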

In frequency-based techniques, the discrete Fourier transform (DFT) or the discrete wavelet transform (DWT) is used to extract high-frequency coefficients. The assumption is that the text in an image has strong high-frequency content, so keeping only the high-frequency coefficients separates the text from the non-text regions.
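
A minimal sketch of the wavelet variant, using the PyWavelets package, might look as follows. The Haar wavelet and the energy quantile are assumptions chosen for illustration.

```python
import numpy as np
import pywt  # PyWavelets

def high_frequency_mask(gray, wavelet="haar", keep_quantile=0.90):
    """Mark locations whose high-frequency wavelet energy is unusually large."""
    # One-level 2D DWT: cA is the low-frequency approximation, while
    # (cH, cV, cD) hold horizontal, vertical and diagonal detail.
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), wavelet)
    energy = cH**2 + cV**2 + cD**2
    # Keep only the strongest detail coefficients; text strokes produce
    # dense clusters of such coefficients, smooth background does not.
    threshold = np.quantile(energy, keep_quantile)
    return energy > threshold  # boolean mask at half resolution
```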

Word recognition

In word recognition, the text is assumed to have already been detected and located, so a rectangular bounding box containing the text is available; the word inside the box then has to be recognised. The available methods can be broadly classified into top-down and bottom-up approaches.

In the top-down approaches, a set of words from a dictionary is used to identify which word best fits the given image. [8] [9] [10] Since most of these methods do not segment the image, the top-down approach is sometimes referred to as segmentation-free recognition.
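
One simple illustration of the lexicon-driven idea is to score every dictionary word against per-position character probabilities produced by some recogniser and keep the best match. The `char_probs` matrix below is a hypothetical input (for example from a sliding-window character model); the sketch illustrates the general idea, not any of the cited methods.

```python
import numpy as np

def best_lexicon_word(char_probs, lexicon):
    """char_probs: (positions, 26) array of P(letter | position)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    best_word, best_score = None, -np.inf
    for word in lexicon:
        if len(word) != char_probs.shape[0]:
            continue  # simplistic: only compare words of matching length
        # Log-probability of the whole word under the per-position model.
        score = sum(np.log(char_probs[i, alphabet.index(c)] + 1e-9)
                    for i, c in enumerate(word.lower()))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```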

In the bottom-up approaches, the image is segmented into multiple components, and the segmented image is passed through a recognition engine. [11] [12] [13] Either an off-the-shelf optical character recognition (OCR) engine [14] [15] [16] or a custom-trained one is used to recognise the text.
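
A minimal bottom-up sketch using the Tesseract engine through the pytesseract wrapper might look as follows; the Otsu binarisation and the page-segmentation setting are illustrative choices.

```python
import cv2
import pytesseract

def recognise_word(image, box):
    """Crop a detected bounding box and pass it to an off-the-shelf OCR engine."""
    x, y, w, h = box
    word_img = image[y:y + h, x:x + w]
    gray = cv2.cvtColor(word_img, cv2.COLOR_BGR2GRAY)
    # Scene text usually needs binarisation before a document OCR engine.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 8 asks Tesseract to treat the crop as a single word.
    return pytesseract.image_to_string(binary, config="--psm 8").strip()
```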

Related Research Articles

Natural language processing (NLP) is an interdisciplinary subfield of computer science and information retrieval. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. To this end, natural language processing often borrows ideas from theoretical linguistics. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.


Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.


Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most plausible words.


In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
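
As a deliberately simple, concrete example, clustering pixels by colour with k-means assigns every pixel one of k labels; the choice of k = 4 below is an arbitrary assumption.

```python
import cv2
import numpy as np

def kmeans_segmentation(image, k=4):
    """Assign every pixel one of k labels based on colour similarity."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5,
                                    cv2.KMEANS_RANDOM_CENTERS)
    # Pixels that receive the same label share a similar colour.
    return labels.reshape(image.shape[:2])
```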

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and their arrangement in the correct reading order. Detection and labeling of the different zones as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. Text zones, however, play different logical roles inside the document, and this kind of semantic labeling is the scope of logical layout analysis.

Document processing is a field of research and a set of production processes aimed at making an analog document digital. Document processing does not simply aim to photograph or scan a document to obtain a digital image, but also to make it digitally intelligible. This includes extracting the structure of the document or the layout and then the content, which can take the form of text or images. The process can involve traditional computer vision algorithms, convolutional neural networks or manual labor. The problems addressed are related to semantic segmentation, object detection, optical character recognition (OCR), handwritten text recognition (HTR) and, more broadly, transcription, whether automatic or not. The term can also include the phase of digitizing the document using a scanner and the phase of interpreting the document, for example using natural language processing (NLP) or image classification technologies. It is applied in many industrial and scientific fields for the optimization of administrative processes, mail processing and the digitization of analog archives and historical documents.


In image processing, a Gabor filter, named after Dennis Gabor, is a linear filter used for texture analysis. Gabor first proposed it as a 1D filter; it was later generalized to 2D by Gösta Granlund, who added a reference direction. The filter essentially analyzes whether there is any specific frequency content in the image in specific directions in a localized region around the point or region of analysis. Frequency and orientation representations of Gabor filters are claimed by many contemporary vision scientists to be similar to those of the human visual system, and they have been found to be particularly appropriate for texture representation and discrimination. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave.
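
Written out in the parameterisation commonly used in the literature (the symbols below follow that convention; they do not appear elsewhere in this article), the complex 2D Gabor filter in the spatial domain is:

```latex
g(x, y; \lambda, \theta, \psi, \sigma, \gamma)
  = \exp\!\left( -\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma^{2}} \right)
    \exp\!\left( i \left( 2\pi \frac{x'}{\lambda} + \psi \right) \right),
\qquad
x' = x\cos\theta + y\sin\theta, \quad
y' = -x\sin\theta + y\cos\theta
```

Here λ is the wavelength of the sinusoidal factor, θ the orientation of its normal, ψ the phase offset, σ the standard deviation of the Gaussian envelope, and γ the spatial aspect ratio.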

ABBYY FineReader PDF is an optical character recognition (OCR) application developed by ABBYY, with support for PDF file editing since v15. The program runs under Microsoft Windows 7 or later, and Apple macOS 10.12 Sierra or later. The first version was released in 1993.


OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0 with a very modular design using command-line interfaces.

Object recognition is technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, even though the image of an object may vary with viewpoint, size and scale, or translation and rotation. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems, and many approaches to it have been implemented over multiple decades.

Region growing is a simple region-based image segmentation method. It is also classified as a pixel-based image segmentation method since it involves the selection of initial seed points.
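
A minimal region-growing sketch is shown below; the 4-connectivity and the intensity tolerance are illustrative assumptions, and real implementations typically update the region statistics as the region grows.

```python
import numpy as np

def region_grow(gray, seed, tolerance=10):
    """Grow a region from a seed pixel, absorbing similar neighbours."""
    h, w = gray.shape
    region = np.zeros((h, w), dtype=bool)
    seed_value = float(gray[seed])
    stack = [seed]  # seed is a (row, col) pair
    while stack:
        y, x = stack.pop()
        if region[y, x]:
            continue
        region[y, x] = True
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-connectivity
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not region[ny, nx]
                    and abs(float(gray[ny, nx]) - seed_value) <= tolerance):
                stack.append((ny, nx))
    return region
```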

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.
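
To give a feel for the representation, the snippet below parses a minimal, hand-written hOCR-style fragment with BeautifulSoup; the bounding boxes and the confidence value are made up for illustration.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A minimal, hand-written hOCR fragment (illustrative, not real OCR output):
hocr = """
<div class='ocr_page'>
  <span class='ocr_line' title='bbox 5 5 120 40'>
    <span class='ocrx_word' title='bbox 5 5 60 40; x_wconf 96'>Sleeper</span>
  </span>
</div>
"""

soup = BeautifulSoup(hocr, "html.parser")
for word in soup.find_all("span", class_="ocrx_word"):
    # Each word carries its text plus bbox and confidence in the title.
    print(word.get_text(), "-", word["title"])
```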

Analyzed Layout and Text Object (ALTO) is an open XML Schema developed by the EU-funded project called METAe.

Document mosaicing is a process that stitches multiple, overlapping snapshot images of a document together to produce one large, high-resolution composite. The document is slid by hand under a stationary, over-the-desk camera until all parts of the document have passed through the camera's field of view. As the document slides under the camera, its motion is coarsely tracked by the vision system, and snapshots are taken periodically so that successive snapshots overlap by about 50%. The system then finds the overlapping pairs and stitches them together repeatedly until the whole document is reassembled as a single piece.
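
OpenCV's high-level stitcher can composite such overlapping snapshots, as sketched below. This is general-purpose image stitching rather than the tracking-based over-the-desk system described above, and the file names are placeholders.

```python
import cv2

# Overlapping snapshots of the same document (paths are placeholders).
snapshots = [cv2.imread(p) for p in ("doc_part1.jpg", "doc_part2.jpg")]

# The stitcher finds the overlaps and composites the snapshots;
# SCANS mode assumes flat, document-like input rather than a panorama.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, mosaic = stitcher.stitch(snapshots)
if status == cv2.Stitcher_OK:
    cv2.imwrite("mosaic.jpg", mosaic)
```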

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.
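
A minimal background-subtraction sketch using OpenCV's MOG2 model is shown below; the history and variance-threshold values are illustrative, and the video file name is a placeholder.

```python
import cv2

# MOG2 maintains a per-pixel Gaussian-mixture model of the background.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

cap = cv2.VideoCapture("input_video.mp4")  # placeholder file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # 255 marks moving foreground pixels; shadows default to 127.
    foreground_mask = subtractor.apply(frame)
cap.release()
```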

The Medical Intelligence and Language Engineering Laboratory, also known as MILE lab, is a research laboratory at the Indian Institute of Science, Bangalore under the Department of Electrical Engineering. The lab is known for its work on Image processing, online handwriting recognition, Text-To-Speech and Optical character recognition systems, all of which are focused mainly on documents and speech in Indian languages. The lab is headed by A. G. Ramakrishnan.


The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time, where the subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Whereas image segmentation labels pixels according to characteristics they share at a particular time, here the pixels are segmented according to their relative movement over a period of time, i.e. over the video sequence.

Tin Kam Ho is a computer scientist at IBM Research with contributions to machine learning, data mining, and classification. Ho is noted for introducing random decision forests in 1995, and for her pioneering work in ensemble learning and data complexity analysis. She is an IEEE fellow and IAPR fellow.

References

  1. Lucas, S.M. (2005). "ICDAR 2005 text locating competition results". Proc. 8th ICDAR. pp. 80–84, Vol. 1. doi:10.1109/ICDAR.2005.231. ISBN 978-0-7695-2420-7. S2CID 1842569.
  2. ICDAR 2005 Competitions. http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2005_Robust_Reading_Competitions.
  3. Lucas, Simon M.; Panaretos, Alex; Sosa, Luis; Tang, Anthony; Wong, Shirley; Young, Robert; Ashida, Kazuki; Nagai, Hiroki; Okamoto, Masayuki; Yamamoto, Hiroaki; Miyao, Hidetoshi; Zhu, Junmin; Ou, Wuwen; Wolf, Christian; Jolion, Jean-Michel; Todoran, Leon; Worring, Marcel; Lin, Xiaofan (2005). "ICDAR 2003 Robust Reading Competitions: Entries, Results, and Future Directions". International Journal of Document Analysis and Recognition. 7 (2–3): 105–122. CiteSeerX 10.1.1.104.1667. doi:10.1007/s10032-004-0134-3. S2CID 2250003.
  4. ICDAR 2013. http://www.icdar2013.org.
  5. ICDAR 2017. http://u-pat.org/ICDAR2017/
  6. ICDAR 2011 Robust Reading Competition. http://www.cvc.uab.es/icdar2011competition/.
  7. IAPR TC11 Reading Systems-Datasets List. http://www.iapr-tc11.org/mediawiki/index.php?title=Datasets.
  8. Weinman, J.J.; Learned-Miller, E.; Hanson, A.R. (2009). "Scene text recognition using similarity and a lexicon with sparse belief propagation". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (10): 1733–1746. doi:10.1109/TPAMI.2009.38. PMC 3021989. PMID 19696446.
  9. "A. Mishra, K. Alahari, and C. V. Jawahar. Scene Text Recognition using Higher Order Language Priors. In Proc. BMVC, 2012" (PDF).
  10. Novikova, Tatiana; Barinova, Olga; Kohli, Pushmeet; Lempitsky, Victor (2012). "Large-Lexicon Attribute-Consistent Text Recognition in Natural Images". Computer Vision – ECCV 2012. Lecture Notes in Computer Science. Vol. 7577. pp. 752–765. CiteSeerX   10.1.1.296.4807 . doi:10.1007/978-3-642-33783-3_54. ISBN   978-3-642-33782-6.
  11. Kumar, Deepak; Ramakrishnan, A. G. (2012). "Power-law transformation for enhanced recognition of born-digital word images". Proc. 9th SPCOM. pp. 1–5. doi:10.1109/SPCOM.2012.6290009. ISBN 978-1-4673-2014-6. S2CID 13876092.
  12. Kumar, D.; Anil Prasad, M. N.; Ramakrishnan, A. G. (2012). "MAPS: Midline analysis and propagation of segmentation". Proc. 8th ICVGIP. doi:10.1145/2425333.2425348. S2CID 13303734.
  13. Kumar, Deepak; Anil Prasad, M. N.; Ramakrishnan, A. G. (2013). "NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images". In Zanibbi, Richard; Coüasnon, Bertrand (eds.). Document Recognition and Retrieval XX. Vol. 8658. p. 865806. doi:10.1117/12.2008519. S2CID   13848101.
  14. ABBYY FineReader. http://www.abbyy.com/
  15. Nuance Omnipage Reader. http://www.nuance.com/
  16. Tesseract OCR Engine. http://code.google.com/p/tesseract-ocr/