Original author(s) | Douglas McIlroy |
---|---|
Developer(s) | AT&T Bell Laboratories |
Initial release | January 1979 |
Written in | C |
Operating system | Unix, Unix-like, Plan 9 |
Platform | Cross-platform |
Type | Command |
License | coreutils: GPLv3+ Plan 9: MIT License |
join
is a command in Unix and Unix-like operating systems that merges the lines of two sorted text files based on the presence of a common field. It is similar to the join operator used in relational databases but operating on text files.
The join
command takes as input two text files and several options. If no command-line argument is given, this command looks for a pair of lines from the two files having the same first field (a sequence of characters that are different from space), and outputs a line composed of the first field followed by the rest of the two lines.
The program arguments specify which character to be used in place of space to separate the fields of the line, which field to use when looking for matching lines, and whether to output lines that do not match. The output can be stored to another file rather than printed using redirection.
As an example, the two following files list the known fathers and the mothers of some people. Both files have been sorted on the join field — this is a requirement of the program.
george jim kumar gunaware
albert martha george sophie
The join of these two files (with no argument) would produce:
george jim sophie
Indeed, only "george" is common as a first word of both files.
join
is intended to be a relation database operator. It is part of the X/Open Portability Guide since issue 2 of 1987. It was inherited into the first version of POSIX.1 and the Single Unix Specification. [1] [2]
The version of join
bundled in GNU coreutils was written by Mike Haertel. [3] The command is available as a separate package for Microsoft Windows as part of the UnxUtils collection of native Win32 ports of common GNU Unix-like utilities. [4]
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.
uniq
is a utility command on Unix, Plan 9, Inferno, and Unix-like operating systems which, when fed a text file or standard input, outputs the text with adjacent identical lines collapsed to one, unique line of text.
In computing, ls
is a command to list computer files and directories in Unix and Unix-like operating systems. It is specified by POSIX and the Single UNIX Specification.
uname is a computer program in Unix and Unix-like computer operating systems that prints the name, version and other details about the current machine and the operating system running on it.
printf is a C standard library function that formats text and writes it to standard output.
xargs is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command.
tr is a command in Unix, Plan 9, Inferno, and Unix-like operating systems. It is an abbreviation of translate or transliterate, indicating its operation of replacing or removing specific characters in its input data set.
wc
is a command in Unix, Plan 9, Inferno, and Unix-like operating systems. The program reads either standard input or a list of computer files and generates one or more of the following statistics: newline count, word count, and byte count. If a list of files is provided, both individual file and total statistics follow.
In computing, cut
is a command line utility on Unix and Unix-like operating systems which is used to extract sections from each line of input — usually from a file. It is currently part of the GNU coreutils package and the BSD Base System.
cksum
is a command in Unix and Unix-like operating systems that generates a checksum value for a file or stream of data. The cksum command reads each file given in its arguments, or standard input if no arguments are provided, and outputs the file's 32-bit cyclic redundancy check (CRC) checksum and byte count. The CRC output by cksum is different from the CRC-32 used in zip, PNG and zlib.
paste is a Unix command line utility which is used to join files horizontally by outputting lines consisting of the sequentially corresponding lines of each file specified, separated by tabs, to the standard output.
nl is a Unix utility for numbering lines, either from a file or from standard input, reproducing output on standard output.
In computing, cmp
is a command-line utility on Unix and Unix-like operating systems that compares two files of any type and writes the results to the standard output. By default, cmp
is silent if the files are the same; if they differ, the byte and line number at which the first difference occurred is reported. The command is also available in the OS-9 shell.
tail is a program available on Unix, Unix-like systems, FreeDOS and MSX-DOS used to display the tail end of a text file or piped data.
In computing, tee
is a command in command-line interpreters (shells) using standard streams which reads standard input and writes it to both standard output and one or more files, effectively duplicating its input. It is primarily used in conjunction with pipes and filters. The command is named after the T-splitter used in plumbing.
sum is a legacy utility available on some Unix and Unix-like operating systems. This utility outputs a 16-bit checksum of each argument file, as well as the number of blocks they take on disk. Two different checksum algorithms are in use. POSIX abandoned sum
in favor of cksum.
In Unix and Unix-like operating systems, printf is a shell builtin that formats and outputs text like the same-named C function.
The csplit
command in Unix and Unix-like operating systems is a utility that is used to split a file into two or more smaller files determined by context lines.
fold is a Unix command used for making a file with long lines more readable on a limited width computer terminal by performing a line wrap.
cat
is a standard Unix utility that reads files sequentially, writing them to standard output. The name is derived from its function to (con)catenate files . It has been ported to a number of operating systems.