Split (Unix)

split
Original author(s) AT&T Bell Laboratories
Developer(s) Various open-source and commercial developers
Initial release February 1973
Written in C
Operating system Unix, Unix-like, Plan 9, IBM i
Platform Cross-platform
Type Command
License coreutils: GPLv3+
Plan 9: MIT License

split is a utility on Unix, Plan 9, and Unix-like operating systems most commonly used to split a computer file into two or more smaller files.

History

The split command first appeared in Version 3 Unix [1] and has been part of the X/Open Portability Guide since issue 2 of 1987. It was carried over into the first version of POSIX.1 and the Single Unix Specification. [2] The version of split bundled in GNU coreutils was written by Torbjorn Granlund and Richard Stallman. [3] The split command has also been ported to the IBM i operating system. [4]

Usage

The command syntax is:

split [OPTION] [INPUT [PREFIX]]

By default, split generates output files of a fixed size, 1,000 lines each. The files are named by appending aa, ab, ac, etc. to the output filename prefix. If no prefix is given, the default prefix x is used, producing xaa, xab, and so on. When a hyphen (-) is given in place of the input filename, data is read from standard input. The pieces are typically rejoined using a utility such as cat.
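
The naming and standard-input behavior described above can be tried with a short sketch. The scratch directory, the generated numbers, and the prefix part_ are illustrative choices, not part of split itself.

```shell
#!/bin/sh
# Work in a scratch directory so the generated pieces stay out of the way.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate 2,500 numbered lines and pipe them to split.
# "-" tells split to read standard input; "part_" is the output prefix.
awk 'BEGIN { for (i = 1; i <= 2500; i++) print i }' | split - part_

# With the default of 1,000 lines per file this creates
# part_aa and part_ab (1,000 lines each) and part_ac (500 lines).
ls part_*
wc -l part_*
```

Because the last piece only receives the lines that are left over, it is usually shorter than the others.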

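Additional options set a maximum size in bytes per output file instead of a line count (-b), limit each file to as many whole lines as fit within a given number of bytes (-C in GNU split), set how many suffix characters are used in generated filenames (-a), and select digits rather than letters for the suffix (-d in GNU split).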
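
As a rough illustration of these options, the following sketch uses only the line-count, byte-count, and suffix-length options (-l, -b, and -a). The file name sample.txt and the prefixes are made up for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input: 10 lines of 10 bytes each (9 characters plus a newline),
# 100 bytes in total.
awk 'BEGIN { for (i = 1; i <= 10; i++) printf "line %04d\n", i }' > sample.txt

# -l: at most 4 lines per piece, giving pieces of 4, 4, and 2 lines.
split -l 4 sample.txt by_line_

# -b: at most 30 bytes per piece, ignoring line boundaries,
# giving pieces of 30, 30, 30, and 10 bytes.
split -b 30 sample.txt by_byte_

# -a: three-character suffixes instead of the default two
# (len3_aaa, len3_aab).
split -a 3 -l 5 sample.txt len3_

ls
```

Splitting by bytes can cut a line in the middle, which does not matter if the pieces are later concatenated in order.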

Split file into pieces

Create a file named "myfile.txt" with exactly 3,000 lines of data:

$ head -n 3000 < /dev/urandom > myfile.txt

Now, use the split command to break this file into pieces (note: unless otherwise specified, split will break the file into 1,000-line files):

$ split myfile.txt
$ ls -l
-rw-r--r--  1 root root 761K Jun 16 18:17 myfile.txt
-rw-r--r--  1 root root 242K Jun 16 18:17 xaa
-rw-r--r--  1 root root 263K Jun 16 18:17 xab
-rw-r--r--  1 root root 256K Jun 16 18:17 xac
$ wc --lines xa*
   1000 xaa
   1000 xab
   1000 xac
   3000 total

As seen above, the split command has left the original file intact and produced three files of 1,000 lines each: xaa, xab, and xac.
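
To check that the pieces can be rejoined without loss, they can be concatenated with cat and compared against the original. This is a minimal sketch: the generated text stands in for myfile.txt, and rejoined.txt is a made-up name.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for myfile.txt: 3,000 lines of arbitrary text.
awk 'BEGIN { for (i = 1; i <= 3000; i++) print "record", i }' > myfile.txt

# Default split: creates xaa, xab, and xac.
split myfile.txt

# The shell expands xa* in sorted order, which matches the order
# in which split wrote the pieces.
cat xa* > rejoined.txt

# cmp -s prints nothing and exits with status 0 when the files are identical.
if cmp -s myfile.txt rejoined.txt; then
    result="identical"
else
    result="different"
fi
echo "$result"
```
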

References

  1. split(1). FreeBSD General Commands Manual.
  2. split. Shell and Utilities Reference, The Single UNIX Specification, Version 4, from The Open Group.
  3. "split(1): split file into pieces - Linux man page". linux.die.net.
  4. IBM. "IBM System i Version 7.2 Programming Qshell" (PDF). IBM. Retrieved 2020-09-05.