This article includes a list of general references, but it lacks sufficient corresponding inline citations .(January 2013) |
Original author(s) | Lee E. McMahon |
---|---|
Developer(s) | AT&T Bell Laboratories, Richard Stallman, David MacKenzie |
Initial release | November 1973 |
Written in | C |
Operating system | Unix, Unix-like, Plan 9, Inferno |
Platform | Cross-platform |
Type | Command |
License | coreutils: GPLv3+ Plan 9: MIT License |
The comm command in the Unix family of computer operating systems is a utility that is used to compare two files for common and distinct lines. comm is specified in the POSIX standard. It has been widely available on Unix-like operating systems since the mid to late 1980s.
Written by Lee E. McMahon, comm first appeared in Version 4 Unix. [1]
The version of comm bundled in GNU coreutils was written by Richard Stallman and David MacKenzie. [2]
comm reads two files as input, regarded as lines of text. comm outputs one file, which contains three columns. The first two columns contain lines unique to the first and second file, respectively. The last column contains lines common to both. This functionally is similar to diff .
Columns are typically distinguished with the <tab> character. If the input files contain lines beginning with the separator character, the output columns can become ambiguous.
For efficiency, standard implementations of comm expect both input files to be sequenced in the same line collation order, sorted lexically. The sort (Unix) command can be used for this purpose.
The comm algorithm makes use of the collating sequence of the current locale. If the lines in the files are not both collated in accordance with the current locale, the result is undefined.
Unlike diff, the return code from comm has no logical significance concerning the relationship of the two files. A return code of 0 indicates success, a return code >0 indicates an error occurred during processing.
$ catfoo applebananaeggplant$ catbar applebananabananazucchini$ commfoobar apple banana bananaeggplant zucchini
This shows that both files have one banana, but only bar has a second banana.
In more detail, the output file has the appearance that follows. Note that the column is interpreted by the number of leading tab characters. \t represents a tab character and \n represents a newline (Escape character#Programming and data formats).
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | \t | \t | a | p | p | l | e | \n | ||
1 | \t | \t | b | a | n | a | n | a | \n | |
2 | \t | b | a | n | a | n | a | \n | ||
3 | e | g | g | p | l | a | n | t | \n | |
4 | \t | z | u | c | c | h | i | n | i | \n |
In general terms, diff is a more powerful utility than comm. The simpler comm is best suited for use in scripts.
The primary distinction between comm and diff is that comm discards information about the order of the lines prior to sorting.
A minor difference between comm and diff is that comm will not try to indicate that a line has "changed" between the two files; lines are either shown in the "from file #1", "from file #2", or "in both" columns. This can be useful if one wishes two lines to be considered different even if they only have subtle differences.
comm has command-line options to suppress any of the three columns. This is useful for scripting.
There is also an option to read one file (but not both) from standard input.
Up to a full line must be buffered from each input file during line comparison, before the next output line is written.
Some implementations read lines with the function readlinebuffer() which does not impose any line length limits if system memory suffices.
Other implementations read lines with the function fgets(). This function requires a fixed buffer. For these implementations, the buffer is often sized according to the POSIX macro LINE_MAX.
ed is a line editor for Unix and Unix-like operating systems. It was one of the first parts of the Unix operating system that was developed, in August 1969. It remains part of the POSIX and Open Group standards for Unix-based operating systems, alongside the more sophisticated full-screen editor vi.
sed is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.
In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other. The utility displays the changes in one of several standard formats, such that both humans or computers can parse the changes, and use them for patching.
In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own, such as devices that use magnetic tape. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.
In computer programming, standard streams are preconnected input and output communication channels between a computer program and its environment when it begins execution. The three input/output (I/O) connections are called standard input (stdin), standard output (stdout) and standard error (stderr). Originally I/O happened via a physically connected system console, but standard streams abstract this. When a command is executed via an interactive shell, the streams are typically connected to the text terminal on which the shell is running, but can be changed with redirection or a pipeline. More generally, a child process inherits the standard streams of its parent process.
xargs is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command.
join
is a command in Unix and Unix-like operating systems that merges the lines of two sorted text files based on the presence of a common field. It is similar to the join operator used in relational databases but operating on text files.
pax is an archiving utility available for various operating systems and defined since 1995. Rather than sort out the incompatible options that have crept up between tar
and cpio
, along with their implementations across various versions of Unix, the IEEE designed a new archive utility pax that could support various archive formats with useful options from both archivers. The pax
command is available on Unix and Unix-like operating systems and on IBM i, and Microsoft Windows NT until Windows 2000.
In computing, cut
is a command line utility on Unix and Unix-like operating systems which is used to extract sections from each line of input — usually from a file. It is currently part of the GNU coreutils package and the BSD Base System.
paste is a Unix command line utility which is used to join files horizontally by outputting lines consisting of the sequentially corresponding lines of each file specified, separated by tabs, to the standard output.
nl is a Unix utility for numbering lines, either from a file or from standard input, reproducing output on standard output.
In Unix-like computer operating systems, a pipeline is a mechanism for inter-process communication using message passing. A pipeline is a set of processes chained together by their standard streams, so that the output text of each process (stdout) is passed directly as input (stdin) to the next one. The second process is started as the first process is still executing, and they are executed concurrently.
In computing, a here document is a file literal or input stream literal: it is a section of a source code file that is treated as if it were a separate file. The term is also used for a form of multiline string literals that use similar syntax, preserving line breaks and other whitespace in the text.
tail is a program available on Unix, Unix-like systems, FreeDOS and MSX-DOS used to display the tail end of a text file or piped data.
In computing, tee
is a command in command-line interpreters (shells) using standard streams which reads standard input and writes it to both standard output and one or more files, effectively duplicating its input. It is primarily used in conjunction with pipes and filters. The command is named after the T-splitter used in plumbing.
In computing, sort is a standard command line program of Unix and Unix-like operating systems, that prints the lines of its input or concatenation of all files listed in its argument list in sorted order. Sorting is done based on one or more sort keys extracted from each line of input. By default, the entire input is taken as sort key. Blank space is the default field separator. The command supports a number of command-line options that can vary by implementation. For instance the "-r
" flag will reverse the sort order.
Toybox is a free and open-source software implementation of over 200 Unix command line utilities such as ls, cp, and mv. The Toybox project was started in 2006, and became a 0BSD licensed BusyBox alternative. Toybox is used for most of Android's command-line tools in all currently supported Android versions, and is also used to build Android on Linux and macOS. All of the tools are tested on Linux, and many of them also work on BSD and macOS.
The POSIX terminal interface is the generalized abstraction, comprising both an application programming interface for programs, and a set of behavioural expectations for users of a terminal, as defined by the POSIX standard and the Single Unix Specification. It is a historical development from the terminal interfaces of BSD version 4 and Seventh Edition Unix.
In computing, process substitution is a form of inter-process communication that allows the input or output of a command to appear as a file. The command is substituted in-line, where a file name would normally occur, by the command shell. This allows programs that normally only accept files to directly read from or write to another program.
cat
is a standard Unix utility that reads files sequentially, writing them to standard output. The name is derived from its function to (con)catenate files . It has been ported to a number of operating systems.