File comparison

The KDE diff tool Kompare

Editing documents, program code, or any data always risks introducing errors. By displaying the differences between two or more sets of data, file comparison tools can make computing simpler and more efficient, focusing attention on new data and ignoring what did not change. The result of a comparison is generically known as a diff [1], after the Unix diff utility. There is a range of ways to compare data sources and display the results.

Some widely used file comparison programs are diff, cmp, FileMerge, WinMerge, Beyond Compare, and File Compare.

Because understanding changes is important to writers of code or documents, many text editors and word processors include the functionality necessary to see the changes between different versions of a file or document.

Method types

The most efficient method of finding differences depends on the source data and the nature of the changes. One approach is to find the longest common subsequence between two files, then regard the non-common data as insertions or deletions.
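The longest-common-subsequence approach can be illustrated with a minimal sketch. It assumes each file is already split into a list of lines and uses the textbook quadratic dynamic-programming table; the function names lcs_table and lcs_diff are hypothetical, and production diff tools generally use faster algorithms.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def lcs_diff(a, b):
    """Return (marker, line) pairs: ' ' common, '-' deletion, '+' insertion."""
    table = lcs_table(a, b)
    ops = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append((" ", a[i]))  # line belongs to the common subsequence
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))  # only in the old file: deletion
            i += 1
        else:
            ops.append(("+", b[j]))  # only in the new file: insertion
            j += 1
    # Whatever remains on either side is non-common data.
    ops.extend(("-", line) for line in a[i:])
    ops.extend(("+", line) for line in b[j:])
    return ops


if __name__ == "__main__":
    for marker, line in lcs_diff(["a", "b", "c", "d"], ["a", "c", "d", "e"]):
        print(marker, line)
```

Everything outside the common subsequence is reported either as a deletion from the first file or as an insertion into the second, which is exactly the interpretation described above.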

In 1978, Paul Heckel published an algorithm that identifies most moved blocks of text. [2] This is used in the IBM History Flow tool. [3] Other file comparison programs also find block moves.

Some specialized file comparison tools find the longest increasing subsequence between two files. [4] The rsync protocol uses a rolling hash function to compare two files on two distant computers with low communication overhead.
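The rolling-hash idea can be sketched as follows. This is a simplified checksum in the style of the weak checksum used by rsync, not rsync's actual implementation; the names weak_checksum, roll, and find_matches are hypothetical. The point is that sliding the window by one byte updates the checksum in constant time instead of rescanning the whole block.

```python
MOD = 1 << 16


def weak_checksum(block):
    """Compute the two-part checksum (a, b) of a bytes block from scratch."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte on the left, add in_byte on the right."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def find_matches(data, block):
    """Return every offset in data where block occurs, using the rolling checksum."""
    n = len(block)
    if n == 0 or len(data) < n:
        return []
    target = weak_checksum(block)
    a, b = weak_checksum(data[:n])
    matches = []
    for start in range(len(data) - n + 1):
        # A checksum match is only a candidate; confirm it with a direct
        # comparison here (a remote protocol would use a stronger hash instead).
        if (a, b) == target and data[start:start + n] == block:
            matches.append(start)
        if start + n < len(data):
            a, b = roll(a, b, data[start], data[start + n], n)
    return matches


if __name__ == "__main__":
    print(find_matches(b"the quick brown fox", b"brown"))
```

In a two-computer setting, one side would send only the checksums of its blocks, and the other side would slide this window over its own copy to find which blocks it already has, which keeps the communication overhead low.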

File comparison in word processors is typically at the word level, while comparison in most programming tools is at the line level. Byte or character-level comparison is useful in some specialized applications.
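The effect of the comparison unit can be seen in a short sketch that uses Python's standard difflib module. The helper name changed_units and the sample text are assumptions made for illustration: the same two texts are compared once as lists of lines and once as lists of words.

```python
import difflib


def changed_units(old_units, new_units):
    """Return the non-equal opcodes between two sequences of comparison units."""
    matcher = difflib.SequenceMatcher(a=old_units, b=new_units, autojunk=False)
    return [
        (tag, old_units[i1:i2], new_units[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "the quick brown fox\njumps over the dog\n"
new = "the quick red fox\njumps over the dog\n"

# Line level: the whole first line is reported as changed.
line_changes = changed_units(old.splitlines(), new.splitlines())

# Word level: only the single replaced word is reported.
word_changes = changed_units(old.split(), new.split())

if __name__ == "__main__":
    print(line_changes)
    print(word_changes)
```

A finer unit pinpoints small edits more precisely, while a coarser unit keeps the comparison cheaper and the report shorter for large files.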

Display

The optimal way to display the results of a file comparison depends on many factors, including the type of source data. In program code, line breaks are fixed, so the line provides a clear unit of comparison. This does not work for documents, where adding a single word may cause the following lines to wrap differently without changing their content.

The most popular ways to display changes are either side-by-side or in a consolidated view that highlights insertions and deletions. In either view, the interface may, for the sake of efficiency, fold away portions of the file that did not change (code folding or text folding) and show only the changes.
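A consolidated view that hides unchanged portions can be sketched with the unified diff format from Python's standard difflib module. The wrapper name folded_diff and the sample data are assumptions for illustration; the context parameter controls how many unchanged lines are kept around each change.

```python
import difflib


def folded_diff(old_lines, new_lines, context=1):
    """Return a unified diff that keeps only `context` unchanged lines per side."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="old",
            tofile="new",
            n=context,    # number of unchanged context lines to show
            lineterm="",  # the inputs carry no trailing newlines
        )
    )


old_lines = [f"line {i}" for i in range(1, 21)]
new_lines = list(old_lines)
new_lines[9] = "line ten"  # change a single line in the middle

if __name__ == "__main__":
    print("\n".join(folded_diff(old_lines, new_lines)))
```

Of the twenty lines in each file, only the changed line and one neighbor on each side appear in the result; deleted lines are prefixed with "-" and inserted lines with "+".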

Reasoning

There are various reasons to use comparison tools, and the tools themselves use different approaches. To compare binary files, a tool may use byte-level comparison. To compare text files or computer programs, many tools use a side-by-side visual comparison. [5] This gives the user the chance to choose which changes to keep or reject before merging the files into a new version, [6] or to keep both as-is for later reference through some form of version control.
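Byte-level comparison of binary data can be sketched as below, in the spirit of what a tool such as cmp reports. The function name first_difference is hypothetical, and the sketch assumes ordinary buffered binary streams (such as files opened in "rb" mode), where read returns a full chunk until the end of the data.

```python
import io


def first_difference(stream_a, stream_b, chunk_size=4096):
    """Return the offset of the first differing byte, or None if identical.

    If one stream is a prefix of the other, the offset returned is the
    length of the shorter stream.
    """
    offset = 0
    while True:
        chunk_a = stream_a.read(chunk_size)
        chunk_b = stream_b.read(chunk_size)
        if chunk_a != chunk_b:
            # Locate the exact byte inside the mismatching chunk.
            for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                if x != y:
                    return offset + i
            return offset + min(len(chunk_a), len(chunk_b))
        if not chunk_a:
            return None  # both streams ended without any difference
        offset += len(chunk_a)


if __name__ == "__main__":
    print(first_difference(io.BytesIO(b"abcdef"), io.BytesIO(b"abcxef")))
```

Reading in chunks keeps memory use constant, so the same approach works for files much larger than available memory.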

File comparison is an important and integral part of file synchronization and backup. In backup methodologies, the issue of data corruption is important. There is rarely a warning before corruption occurs, which can make recovery difficult or impossible. Often, the problem is only apparent the next time someone tries to open a file. In this circumstance, a comparison tool can help to isolate when the problem was introduced. [7]

Historical uses

Before file comparison software, machines existed to compare magnetic tapes or punched cards. The IBM 519 Card Reproducer could determine whether decks of punched cards were equivalent. In 1957, John Van Gardner developed a system to compare the checksums of loaded sections of Fortran programs to debug compilation problems on the IBM 704. [8]

References

  1. "diff", The Jargon File
  2. Heckel, Paul (1978), "A Technique for Isolating Differences Between Files" (PDF), Communications of the ACM, 21 (4): 264–268, doi:10.1145/359460.359467, S2CID 207683976, retrieved 2011-12-04
  3. Viégas, Fernanda B.; Wattenberg, Martin; Dave, Kushal (2004), Studying Cooperation and Conflict between Authors with history flow Visualizations (PDF), vol. 6, Vienna: CHI, pp. 575–582, retrieved 2011-12-01
  4. Liwei Ren; Jinsheng Gu; Luosheng Peng (18 April 2006). "Algorithms for block-level code alignment of software binary files". Google Patents. USPTO. Retrieved 10 May 2019.
  5. MacKenzie, David; Eggert, Paul; Stallman, Richard (2003). Comparing and Merging Files with GNU Diff and Patch. Network Theory. ISBN 978-0-9541617-5-0.
  6. "File comparison software: vc-dwim and vc-chlog". www.gnu.org. Retrieved 2023-04-16.
  7. "SystemRescue - System Rescue Homepage". www.system-rescue.org. Retrieved 2023-04-16.
  8. John Van Gardner. "Fortran And The Genesis Of Project Intercept" (PDF). Retrieved 2011-12-06.