This article is written like a manual or guidebook.(June 2013) |
Original author(s) | AT&T Bell Laboratories |
---|---|
Developer(s) | Various open-source and commercial developers |
Initial release | February 1973 |
Written in | C |
Operating system | Unix, Unix-like, Plan 9, IBM i |
Platform | Cross-platform |
Type | Command |
License | coreutils: GPLv3+ Plan 9: MIT License |
split
is a utility on Unix, Plan 9, and Unix-like operating systems most commonly used to split a computer file into two or more smaller files.
The split
command first appeared in Version 3 Unix [1] and is part of the X/Open Portability Guide since issue 2 of 1987. It was inherited into the first version of POSIX.1 and the Single Unix Specification. [2] The version of split
bundled in GNU coreutils was written by Torbjorn Granlund and Richard Stallman. [3] The split command has also been ported to the IBM i operating system. [4]
The command-syntax is:
split[OPTION][INPUT[PREFIX]]
The default behavior of split
is to generate output files of a fixed size, default 1000 lines. The files are named by appending aa, ab, ac, etc. to output filename. If output filename is not given, the default filename of x is used, for example, xaa, xab, etc. When a hyphen (-) is used instead of input filename, data is derived from standard input. The files are typically rejoined using a utility such as cat.
Additional program options permit a maximum character count (instead of a line count), a maximum line length, how many incrementing characters in generated filenames, and whether to use letters or digits.
Create a file named "myfile.txt
" with exactly 3,000 lines of data:
$ head-3000</dev/urandom>myfile.txt
Now, use the split
command to break this file into pieces (note: unless otherwise specified, split
will break the file into 1,000-line files):
$ splitmyfile.txt $ ls-l -rw-r--r-- 1 root root 761K Jun 16 18:17 myfile.txt-rw-r--r-- 1 root root 242K Jun 16 18:17 xaa-rw-r--r-- 1 root root 263K Jun 16 18:17 xab-rw-r--r-- 1 root root 256K Jun 16 18:17 xac$ wc--linesxa* 1000 xaa 1000 xab 1000 xac 3000 total
As seen above, the split
command has broken the original file (keeping the original intact) into three, equal in number of lines (i.e., 1,000), files: xaa
, xab
, and xac
.
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.
uniq
is a utility command on Unix, Plan 9, Inferno, and Unix-like operating systems which, when fed a text file or standard input, outputs the text with adjacent identical lines collapsed to one, unique line of text.
In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.
A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.
The ln
command is a standard Unix command utility used to create a hard link or a symbolic link (symlink) to an existing file or directory. The use of a hard link allows multiple filenames to be associated with the same file since a hard link points to the inode of a given file, the data of which is stored on disk. On the other hand, symbolic links are special files that refer to other files by name.
A filename or file name is a name used to uniquely identify a computer file in a file system. Different file systems impose different restrictions on filename lengths.
xargs is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command.
dd is a command-line utility for Unix, Plan 9, Inferno, and Unix-like operating systems and beyond, the primary purpose of which is to convert and copy files. On Unix, device drivers for hardware and special device files appear in the file system just like normal files; dd can also read and/or write from/to these files, provided that function is implemented in their respective driver. As a result, dd can be used for tasks such as backing up the boot sector of a hard drive, and obtaining a fixed amount of random data. The dd program can also perform conversions on the data as it is copied, including byte order swapping and conversion to and from the ASCII and EBCDIC text encodings.
wc
is a command in Unix, Plan 9, Inferno, and Unix-like operating systems. The program reads either standard input or a list of computer files and generates one or more of the following statistics: newline count, word count, and byte count. If a list of files is provided, both individual file and total statistics follow.
pax is an archiving utility available for various operating systems and defined since 1995. Rather than sort out the incompatible options that have crept up between tar
and cpio
, along with their implementations across various versions of Unix, the IEEE designed new archive utility pax that could support various archive formats with useful options from both archivers. The pax
command is available on Unix and Unix-like operating systems and on IBM i, and Microsoft Windows NT until Windows 2000.
In computing, cut
is a command line utility on Unix and Unix-like operating systems which is used to extract sections from each line of input — usually from a file. It is currently part of the GNU coreutils package and the BSD Base System.
nl is a Unix utility for numbering lines, either from a file or from standard input, reproducing output on standard output.
mv
is a Unix command that moves one or more files or directories from one place to another. If both filenames are on the same filesystem, this results in a simple file rename; otherwise the file content is copied to the new location and the old file is removed. Using mv
requires the user to have write permission for the directories the file will move between. This is because mv
changes the content of both directories involved in the move. When using the mv
command on files located on the same filesystem, the file's timestamp is not updated.
In Unix-like and some other operating systems, find
is a command-line utility that locates files based on some user-specified criteria and either prints the pathname of each matched object or, if another action is requested, performs that action on each matched object.
In computing, more
is a command to view the contents of a text file one screen at a time. It is available on Unix and Unix-like systems, DOS, Digital Research FlexOS, IBM/Toshiba 4690 OS, IBM OS/2, Microsoft Windows and ReactOS. Programs of this sort are called pagers. more
is a very basic pager, originally allowing only forward navigation through a file, though newer implementations do allow for limited backward movement.
tail is a program available on Unix, Unix-like systems, FreeDOS and MSX-DOS used to display the tail end of a text file or piped data.
In computing, tee
is a command in command-line interpreters (shells) using standard streams which reads standard input and writes it to both standard output and one or more files, effectively duplicating its input. It is primarily used in conjunction with pipes and filters. The command is named after the T-splitter used in plumbing.
In computing, sort is a standard command line program of Unix and Unix-like operating systems, that prints the lines of its input or concatenation of all files listed in its argument list in sorted order. Sorting is done based on one or more sort keys extracted from each line of input. By default, the entire input is taken as sort key. Blank space is the default field separator. The command supports a number of command-line options that can vary by implementation. For instance the "-r
" flag will reverse the sort order.
The csplit
command in Unix and Unix-like operating systems is a utility that is used to split a file into two or more smaller files determined by context lines.
cat
is a standard Unix utility that reads files sequentially, writing them to standard output. The name is derived from its function to (con)catenate files. It has been ported to a number of operating systems.