CEDAR-FOX

CEDAR-FOX is a software system for the forensic comparison of handwriting. It was developed at CEDAR, the Center of Excellence for Document Analysis and Recognition at the University at Buffalo. [1] [2] [3] CEDAR-FOX provides capabilities for interaction with the questioned document examiner through processing steps such as extracting regions of interest from a scanned document, determining lines and words of text, and recognizing textual elements. The final goal is to compare two samples of writing to determine the log-likelihood ratio under the prosecution and defense hypotheses. It can also be used to compare signature samples. The software, which is protected by a United States patent, [4] can be licensed from Cedartech, Inc.

Details

Writer verification is the task of determining whether two handwritten samples were written by the same writer. It is used in questioned document examination. By using a set of metrics, CEDAR-FOX can associate a measure of confidence with the conclusion that two documents were written by the same individual or by different individuals. The user can select either the entire document or a specific region of a document for the comparison. The comparison is based on macro features (which measure global characteristics such as slant, connectivity, etc.), micro features (which are based on individual character shapes), and style features (e.g., shapes of character pairs, or bigrams). Two modes of writer verification are available: (i) a questioned document is compared against a single known document, where the comparison relies on statistics describing how much variation a single writer can exhibit, and (ii) a questioned document is compared against multiple known documents, where the system learns the writer's habits from the known documents; at least four known documents are required for this mode. The overall task is split into two parts: document processing with feature extraction, and document comparison, described in the following sections.
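
The two modes differ mainly in where the estimate of within-writer variation comes from. The following minimal sketch (in Python, with hypothetical feature vectors and a made-up population spread; it is not the actual CEDAR-FOX statistics) illustrates the idea: with a single known document the variability estimate must come from population-level statistics, while with several known documents it can be learned from that writer's own samples.

    import numpy as np

    POPULATION_SIGMA = 1.0  # hypothetical population-level within-writer spread

    def verify_single_known(questioned_features, known_features, sigma=POPULATION_SIGMA):
        """Mode (i): one known document; rely on population statistics of variation."""
        d = np.abs(np.asarray(questioned_features, float) - np.asarray(known_features, float))
        return float(np.mean(d / sigma))  # smaller values favour "same writer"

    def verify_multiple_knowns(questioned_features, known_feature_list):
        """Mode (ii): several known documents; learn the writer's habits from them."""
        knowns = np.asarray(known_feature_list, dtype=float)
        if len(knowns) < 4:
            raise ValueError("at least four known documents are required for this mode")
        mean = knowns.mean(axis=0)
        sigma = knowns.std(axis=0) + 1e-9          # the writer's own variability
        d = np.abs(np.asarray(questioned_features, float) - mean)
        return float(np.mean(d / sigma))

In both cases the resulting score can then be converted into a likelihood ratio, as described under Document Comparison below.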

Document processing and feature extraction

CEDAR-FOX performs a variety of operations on documents to prepare them for comparison. These include thresholding, rule-line removal, line segmentation, word segmentation and transcript mapping.

[Image: Analysing the image properties.]

Image Processing

  • Thresholding converts a gray-scale image to binary, separating foreground pixels from background pixels. The thresholding methods used are Otsu's thresholding, adaptive thresholding and texture thresholding (see the sketch after this list).
  • If the document is written on ruled paper, the user can perform a rule-line removal operation. A Hough transform is applied, and the user selects a threshold for it. Setting the threshold too high removes some of the character strokes, so the user has to find an appropriate value (see the sketch after this list).
  • Line segmentation separates each line in the document using bi-variate Gaussian densities. Word segmentation works in a similar way and separates each word within the document.
    [Image: Word segmentation.]
  • Transcript matching is a ground-truth matching step in which the software is provided a text file containing the transcript of the handwritten image. This is useful when different subjects are required to handwrite the same content, which is then matched against the unknown document. The software finds the best word-level alignment between the transcript and the handwritten image; the extracted character images can then be used to compare the similarity between documents.
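
A minimal OpenCV sketch of these pre-processing steps is shown below. It is not the CEDAR-FOX implementation: Otsu thresholding and a probabilistic Hough transform stand in for the thresholding and rule-line removal options, and a simple projection-profile splitter stands in for the bi-variate-Gaussian line segmentation. The file name and all parameter values are illustrative.

    import cv2
    import numpy as np

    gray = cv2.imread("questioned_document.png", cv2.IMREAD_GRAYSCALE)

    # Otsu thresholding: ink becomes 255 (foreground), paper becomes 0.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Rule-line removal with a probabilistic Hough transform. The vote threshold
    # has to be tuned so rule lines are erased without losing character strokes.
    lines = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180, threshold=300,
                            minLineLength=binary.shape[1] // 2, maxLineGap=20)
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            if abs(y2 - y1) <= 2:                  # keep only near-horizontal lines
                cv2.line(binary, (x1, y1), (x2, y2), 0, thickness=3)

    # Simplified line segmentation: group rows whose ink count exceeds a small
    # threshold into text lines (a stand-in for the bi-variate Gaussian model).
    profile = (binary > 0).sum(axis=1)
    text_lines, in_line, start = [], False, 0
    for y, count in enumerate(profile):
        if count > 5 and not in_line:
            in_line, start = True, y
        elif count <= 5 and in_line:
            in_line = False
            text_lines.append((start, y))
    print(f"found {len(text_lines)} text lines")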

System Utilities

CEDAR-FOX has user interfaces for scanning documents directly, for entering results into spreadsheets, and for printing intermediate results. Database access is also available for storing document metadata.

Document Comparison

Many options are available in CEDAR-FOX for document comparison. The verification model consists of four major steps:

Features are split into macro (global) and micro (local) features. Macro features are calculated over the entire document, whereas micro features are calculated on selected characters, bigrams or words. The macro features include gray-scale-based, contour-based, slope-based, stroke-width, slant, height and word-gap measurements. These features are used for the comparison.
The comparison maps document pairs from feature space to distance space. The macro features are real-valued, so the mapping to distance space is the absolute difference between two feature values. Similarity for binary-valued features can be calculated using measures such as the Hamming distance or the Euclidean distance; a correlation similarity measure is recommended as the best choice.
The distributions in distance space are modeled with probability density functions, represented as Gaussian or gamma distributions. The nature of the documents affects the micro features but not the macro features. A likelihood ratio (LR) is calculated, followed by the log-likelihood ratio (LLR).
The LLR is mapped to a nine-point qualitative scale corresponding to the strength of evidence associated with the LLR value, following the nine-point scale from the ASTM terminology: 1 - identified as same, 2 - highly probable did, 3 - probably did, 4 - indications did, 5 - no conclusion, 6 - indications did not, 7 - probably did not, 8 - highly probable did not, 9 - identified as elimination. A sketch of this pipeline is given below.
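
The following sketch (not the actual CEDAR-FOX code) shows the shape of this pipeline in Python: absolute differences and a correlation-type similarity map features to distance space, gamma densities fitted to same-writer and different-writer training distances give the likelihood ratio, and fixed bin edges (purely hypothetical here) map the LLR onto the nine-point opinion scale.

    import numpy as np
    from scipy import stats

    def macro_distance(f_questioned, f_known):
        """Real-valued macro features: element-wise absolute difference."""
        return np.abs(np.asarray(f_questioned, float) - np.asarray(f_known, float))

    def binary_similarity(b_questioned, b_known):
        """Binary micro features: a correlation-type similarity (Pearson on 0/1 vectors)."""
        return float(np.corrcoef(np.asarray(b_questioned, float),
                                 np.asarray(b_known, float))[0, 1])

    def fit_distance_models(same_writer_distances, different_writer_distances):
        """Fit gamma densities to training distances under each hypothesis."""
        p_same = stats.gamma.fit(same_writer_distances, floc=0)
        p_diff = stats.gamma.fit(different_writer_distances, floc=0)
        return p_same, p_diff

    def log_likelihood_ratio(d, p_same, p_diff):
        """LLR = log p(d | same writer) - log p(d | different writer)."""
        return float(stats.gamma.logpdf(d, *p_same) - stats.gamma.logpdf(d, *p_diff))

    def nine_point_opinion(llr, edges=(-4, -3, -2, -1, 1, 2, 3, 4)):
        """Map the LLR onto the nine-point scale; the bin edges are hypothetical."""
        labels = ["identified as elimination", "highly probable did not",
                  "probably did not", "indications did not", "no conclusion",
                  "indications did", "probably did", "highly probable did",
                  "identified as same"]
        return labels[int(np.digitize(llr, edges))]

A positive LLR supports the prosecution hypothesis (same writer), a negative LLR supports the defense hypothesis, and values near zero lead to "no conclusion".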

Searching

CEDAR-FOX has several modalities for searching handwritten documents for the presence of keywords. Word spotting allows the user to select a word image as a query, which is used to find similar word images in a specified document. Another type of search allows the user to type in a word, which is then used to rank all words in the document(s) by how likely they are to match the query.
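
A minimal word-spotting sketch is given below; it is not CEDAR-FOX's matcher. Each segmented word image is reduced to a fixed-size binary descriptor and candidates are ranked by cosine similarity to the query word image. The descriptor size and the use of OpenCV are illustrative choices.

    import cv2
    import numpy as np

    def word_descriptor(word_img, size=(64, 24)):
        """Binarize (Otsu) and resize a gray-scale word image into a flat vector."""
        _, binary = cv2.threshold(word_img, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        return cv2.resize(binary, size).astype(float).ravel()

    def spot_word(query_img, candidate_imgs):
        """Rank candidate word images from most to least similar to the query."""
        q = word_descriptor(query_img)
        sims = []
        for img in candidate_imgs:
            c = word_descriptor(img)
            sims.append(float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9)))
        return sorted(range(len(candidate_imgs)), key=lambda i: -sims[i])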

Handwriting Recognition

CEDAR-FOX has automatic character recognition capability. Word recognition with a pre-specified lexicon is also built in. The user can also manually input character identities if the highest character recognition accuracy is desired for the purpose of writer verification or identification.
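
Lexicon-driven word recognition can be illustrated with a small sketch (the character recognizer itself is assumed to exist; this is not CEDAR-FOX's published interface): the recognizer's raw output string is matched against the pre-specified lexicon by edit distance and the closest entry is returned.

    def edit_distance(a, b):
        """Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def recognize_word(raw_recognizer_output, lexicon):
        """Return the lexicon entry closest to the recognizer's raw string."""
        return min(lexicon, key=lambda word: edit_distance(raw_recognizer_output, word))

For example, recognize_word("handwritlng", ["handwriting", "handwritten", "writing"]) returns "handwriting", since only one substitution separates the raw string from that entry.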

[Image: Comparing handwriting samples.]

Legibility and Readability Analysis

Word-gap comparison and comparison with Palmer metrics are supported.
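
As a small illustration of word-gap comparison (Palmer-metric comparison is not sketched), gaps can be measured between the word bounding boxes on each text line and two documents compared by their gap statistics; the (x, y, width, height) box format used here is a hypothetical convention, not CEDAR-FOX's data model.

    import numpy as np

    def word_gaps(line_boxes):
        """Horizontal gaps between consecutive (x, y, w, h) word boxes on one line."""
        boxes = sorted(line_boxes, key=lambda b: b[0])
        return [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2])
                for i in range(len(boxes) - 1)]

    def compare_word_gaps(doc_a_lines, doc_b_lines):
        """Absolute difference of mean word gap (in pixels) between two documents."""
        gaps_a = [g for line in doc_a_lines for g in word_gaps(line)]
        gaps_b = [g for line in doc_b_lines for g in word_gaps(line)]
        return abs(float(np.mean(gaps_a)) - float(np.mean(gaps_b)))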

Related Research Articles

A signature is a handwritten depiction of someone's name, nickname, or even a simple "X" or other mark that a person writes on documents as a proof of identity and intent. The writer of a signature is a signatory or signer. Similar to a handwritten signature, a signature work describes the work as readily identifying its creator. A signature may be confused with an autograph, which is chiefly an artistic signature. This can lead to confusion when people have both an autograph and signature and as such some people in the public eye keep their signatures private whilst fully publishing their autograph.

Text editor: Computer software used to edit plain text documents

A text editor is a type of computer program that edits plain text. Such programs are sometimes known as "notepad" software. Text editors are provided with operating systems and software development packages, and can be used to change files such as configuration files, documentation files and programming language source code.

Visual Basic for Applications (VBA) is an implementation of Microsoft's event-driven programming language Visual Basic 6.0 built into most desktop Microsoft Office applications. Although based on pre-.NET Visual Basic, which is no longer supported or updated by Microsoft, the VBA implementation in Office continues to be updated to support new Office features. VBA is used for professional and end-user development due to its perceived ease-of-use, Office's vast installed userbase, and extensive legacy in business.

WordStar: Word processor application

WordStar is a word processor application for microcomputers. It was published by MicroPro International and originally written for the CP/M-80 operating system, with later editions added for MS-DOS and other 16-bit PC OSes. Rob Barnaby was the sole author of the early versions of the program.

Optical character recognition: Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

Handwriting recognition: Ability of a computer to receive and interpret intelligible handwritten input

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most possible words.

Handwriting: Writing created by a person with a writing implement

Handwriting is the writing done with a writing instrument, such as a pen or pencil, in the hand. Handwriting includes both block and cursive styles and is separate from formal calligraphy or typeface. Because each person's handwriting is unique and different, it can be used to verify a document's writer. The deterioration of a person's handwriting is also a symptom or result of several different diseases. The inability to produce clear and coherent handwriting is also known as dysgraphia.

Questioned document examination: Examination of documents potentially disputed in a court of law

In forensic science, questioned document examination (QDE) is the examination of documents potentially disputed in a court of law. Its primary purpose is to provide evidence about a suspicious or questionable document using scientific processes and methods. Evidence might include alterations, the chain of possession, damage to the document, forgery, origin, authenticity, or other questions that come up when a document is challenged in court.

Image segmentation: Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document and this kind of semantic labeling is the scope of the logical layout analysis.

In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such measures are in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects. Though, in more broad terms, a similarity function may also satisfy metric axioms.

The word count is the number of words in a document or passage of text. Word counting may be needed when a text is required to stay within certain numbers of words. This may particularly be the case in academia, legal proceedings, journalism and advertising. Word count is commonly used by translators to determine the price of a translation job. Word counts may also be used to calculate measures of readability and to measure typing and reading speeds. When converting character counts to words, a measure of 5 or 6 characters to a word is generally used for English.

Microsoft Office XP: Version of Microsoft Office suite

Microsoft Office XP is an office suite which was officially revealed in July 2000 by Microsoft for the Windows operating system. Office XP was released to manufacturing on March 5, 2001, and was later made available to retail on May 31, 2001, less than five months prior to the release of Windows XP. It is the successor to Office 2000 and the predecessor of Office 2003. A Mac OS X equivalent, Microsoft Office v. X was released on November 19, 2001.

Intelligent character recognition (ICR), also referred to as intelligent OCR, is used to extract handwritten text from images. It is a more sophisticated type of OCR technology that recognizes different handwriting styles and fonts to intelligently interpret data on forms and physical documents.

Windows Speech Recognition: Speech recognition software

Windows Speech Recognition (WSR) is speech recognition software developed by Microsoft for Windows Vista that enables voice commands to control the desktop user interface, dictate text in electronic documents and email, navigate websites, perform keyboard shortcuts, and operate the mouse cursor. It supports custom macros to perform additional or supplementary tasks. The hotkey for this action is Windows logo key+Ctrl+S.

Intelligent Word Recognition, or IWR, is the recognition of unconstrained handwritten words. IWR recognizes entire handwritten words or phrases instead of character-by-character, like its predecessor, optical character recognition (OCR). IWR technology matches handwritten or printed words to a user-defined dictionary, significantly reducing character errors encountered in typical character-based recognition engines.

Time delay neural network

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

Sargur Srihari: Indian academic (1949–2022)

Sargur Narasimhamurthy Srihari was an Indian and American computer scientist and educator who made contributions to the field of pattern recognition. The principal impact of his work has been in handwritten address reading systems and in computer forensics. He was a SUNY Distinguished Professor in the School of Engineering and Applied Sciences at the University at Buffalo, Buffalo, New York, USA.

The Center of Excellence for Document Analysis and Recognition (CEDAR) is a research laboratory at the University at Buffalo, State University of New York. The center was established with funding from the United States Postal Service and the National Institute of Justice. CEDAR was formalized by the United States Postal Service under Postmaster General Anthony Frank in 1991. The primary goal of CEDAR was to conduct research and development of software for the automation of postal sorting equipment. Work at CEDAR, with Sargur Srihari as principal investigator, led to the first handwritten address interpretation system in the world. CEDAR-FOX, the first system for automatic comparison of handwriting for the purpose of forensic analysis, was developed at CEDAR.

Sayre's paradox is a dilemma encountered in the design of automated handwriting recognition systems. A standard statement of the paradox is that a cursively written word cannot be recognized without being segmented and cannot be segmented without being recognized. The paradox was first articulated in a 1973 publication by Kenneth M. Sayre, after whom it was named.

References

  1. S. N. Srihari, C. Huang and H. Srinivasan, "On the Discriminability of the Handwriting of Twins," Journal of Forensic Sciences, vol. 53, no. 2, March 2008, pp. 430-446.
  2. S. N. Srihari, S.-H. Cha, H. Arora and S. Lee, "Individuality of Handwriting," Journal of Forensic Sciences, vol. 47, no. 4, 2002, pp. 856-872.
  3. S. N. Srihari, H. Srinivasan and K. Desai, "Questioned Document Examination using CEDAR-FOX," Journal of Forensic Document Examination, vol. 18, 2007, pp. 1-20.
  4. S. N. Srihari et al., "Method and Apparatus for Analyzing and/or Comparing Handwritten or Biometric Samples," United States Patent No. 7,580,551, Aug. 29, 2009.