Sed

Paradigm: scripting
Designed by: Lee E. McMahon
First appeared: 1974
Implementation language: C
Influenced by: ed
Influenced: Perl, AWK

sed ("stream editor") is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, [1] and is available today for most operating systems. [2] sed was based on the scripting features of the interactive editor ed ("editor", 1971) and the earlier qed ("quick editor", 1965–66). It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.

History

First appearing in Version 7 Unix, [3] sed is one of the early Unix commands built for command-line processing of data files. It evolved as the natural successor to the popular grep command. [4] The original motivation was an analogue of grep (g/re/p) for substitution, hence "g/re/s". [3] Foreseeing that further special-purpose programs for each command would also arise, such as g/re/d, McMahon wrote a general-purpose line-oriented stream editor, which became sed. [4] The syntax for sed, notably the use of / for pattern matching and s/// for substitution, originated with ed, the precursor to sed, which was in common use at the time, [4] and the regular expression syntax has influenced other languages, notably ECMAScript and Perl. Later, the more powerful language AWK was developed, and the two functioned as cousins, allowing powerful text processing to be done by shell scripts. sed and AWK are often cited as progenitors of and inspiration for Perl, and influenced Perl's syntax and semantics, notably in the matching and substitution operators.

GNU sed added several new features, including in-place editing of files. Super-sed is an extended version of sed that includes regular expressions compatible with Perl. Another variant of sed is minised, originally reverse-engineered from 4.1BSD sed by Eric S. Raymond and currently maintained by René Rebe. minised was used by the GNU Project until the GNU Project wrote a new version of sed based on the new GNU regular expression library. The current minised contains some extensions to BSD sed but is not as feature-rich as GNU sed. Its advantage is that it is very fast and uses little memory. It is used on embedded systems and is the version of sed provided with Minix. [5]

Mode of operation

sed is a line-oriented text processing utility: it reads text, line by line, from an input stream or file, into an internal buffer called the pattern space. Each line read starts a cycle. To the pattern space, sed applies one or more operations which have been specified via a sed script. sed implements a programming language with about 25 commands that specify the operations on the text. For each input line, after running the script, sed ordinarily outputs the pattern space (the line as modified by the script) and begins the cycle again with the next line. Other end-of-script behaviors are available through sed options and script commands, e.g. d to delete the pattern space, q to quit, N to add the next line to the pattern space immediately, and so on. Thus a sed script corresponds to the body of a loop that iterates through the lines of a stream, where the loop itself and the loop variable (the current line number) are implicit and maintained by sed.

The sed script can either be specified on the command line (-e option) or read from a separate file (-f option). Commands in the sed script may take an optional address, in terms of line numbers or regular expressions. The address determines when the command is run. For example, 2d would only run the d (delete) command on the second input line (printing all lines but the second), while /^ /d would delete all lines beginning with a space. A separate special buffer, the hold space, may be used by a few sed commands to hold and accumulate text between cycles. sed's command language has only two variables (the "hold space" and the "pattern space") and GOTO-like branching functionality; nevertheless, the language is Turing-complete, [6] [7] and esoteric sed scripts exist for games such as Sokoban, Arkanoid, [8] chess, [9] and Tetris. [10]
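The addressing forms described above can be tried on a small sample. The following is a minimal sketch, assuming a POSIX shell and a POSIX-conforming sed; the sample text and the variable names are illustrative only, and the -n option (suppress the default output) is standard but not otherwise discussed here.

```shell
# Illustrative sample: four lines, the third of which begins with a space.
sample=$(printf 'one\ntwo\n three\nfour')

# Line-number address: run d only on the second input line.
no_second=$(printf '%s\n' "$sample" | sed '2d')

# Regular-expression address: run d on every line beginning with a space.
no_indented=$(printf '%s\n' "$sample" | sed '/^ /d')

# Range address: with -n, print only lines 2 through 3.
middle=$(printf '%s\n' "$sample" | sed -n '2,3p')

printf '%s\n---\n%s\n---\n%s\n' "$no_second" "$no_indented" "$middle"
```

A command without an address, such as a bare d, would apply to every input line.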

A main loop executes for each line of the input stream, evaluating the sed script on each line of the input. Lines of a sed script are each a pattern-action pair, indicating what pattern to match and which action to perform, which can be recast as a conditional statement. Because the main loop, working variables (pattern space and hold space), input and output streams, and default actions (copy line to pattern space, print pattern space) are implicit, it is possible to write terse one-liner programs. For example, the sed program given by:

10q

will print the first 10 lines of input, then stop.
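To make the implicit loop visible, the one-liner can be compared with an explicit shell loop. This is only a rough sketch of the idea, not a description of how sed is implemented; the function names numbered_lines and emulate_10q are hypothetical helpers introduced for illustration.

```shell
# Hypothetical helper: emit 25 numbered lines as test input.
numbered_lines() {
    i=1
    while [ "$i" -le 25 ]; do
        printf 'line %d\n' "$i"
        i=$((i + 1))
    done
}

# Explicit loop that roughly mirrors what "sed 10q" does implicitly.
emulate_10q() {
    line_number=0
    while IFS= read -r line; do          # read one line into the "pattern space"
        line_number=$((line_number + 1)) # the implicit loop variable
        printf '%s\n' "$line"            # default action: print the pattern space
        if [ "$line_number" -eq 10 ]; then
            break                        # address 10, command q: quit
        fi
    done
}

with_sed=$(numbered_lines | sed 10q)
with_loop=$(numbered_lines | emulate_10q)
printf '%s\n' "$with_sed"
```

Both variables should hold the same ten lines; everything the loop spells out is supplied by sed itself in the one-liner.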

Usage

Substitution command

The following example shows a typical, and the most common, use of sed: substitution. This usage was indeed the original motivation for sed: [4]

sed 's/regexp/replacement/g' inputFileName > outputFileName

In some versions of sed, the expression must be preceded by -e to indicate that an expression follows. The s stands for substitute, while the g stands for global, which means that all matching occurrences in the line would be replaced. The regular expression (i.e. pattern) to be searched is placed after the first delimiting symbol (slash here) and the replacement follows the second symbol. Slash (/) is the conventional symbol, originating in the character for "search" in ed, but any other could be used to make syntax more readable if it does not occur in the pattern or replacement; this is useful to avoid "leaning toothpick syndrome".

The substitution command, which originates in search-and-replace in ed, implements simple parsing and templating. The regexp provides both pattern matching and saving text via sub-expressions, while the replacement can be either literal text, or a format string containing the characters & for "entire match" or the special escape sequences \1 through \9 for the nth saved sub-expression. For example, sed -r "s/(cat|dog)s?/\1s/g" replaces all occurrences of "cat" or "dog" with "cats" or "dogs", without duplicating an existing "s": (cat|dog) is the 1st (and only) saved sub-expression in the regexp, and \1 in the format string substitutes this into the output.
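The replacement features described above (the & character, numbered sub-expressions, and alternative delimiters) can be illustrated with a few short commands. This is a sketch with made-up sample strings and variable names; the last command assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do, and GNU sed also accepts -r).

```shell
# & stands for the entire match: wrap every run of digits in brackets.
bracketed=$(printf 'order 66 of 100\n' | sed 's/[0-9][0-9]*/[&]/g')

# \1 and \2 refer to saved sub-expressions; basic regular expressions
# delimit them with \( and \). Here the two words are swapped.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# A delimiter other than / avoids escaping the slashes in a path.
moved=$(printf '/usr/local/bin/tool\n' | sed 's|/usr/local|/opt|')

# Extended syntax: unescaped ( ), | and ? as in the cat/dog example.
plural=$(printf 'a cat and two dogs\n' | sed -E 's/(cat|dog)s?/\1s/g')

printf '%s\n' "$bracketed" "$swapped" "$moved" "$plural"
```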

Other sed commands

Besides substitution, other forms of simple processing are possible, using some 25 sed commands. For example, the following uses the d command to filter out lines that only contain spaces, or only contain the end of line character:

sed '/^ *$/d' inputFileName

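This example uses the following regular expression metacharacters (sed supports the full range of regular expressions): the caret (^) matches the beginning of the line, the asterisk (*) matches zero or more occurrences of the preceding character (here, a space), and the dollar sign ($) matches the end of the line.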

Complex sed constructs are possible, allowing it to serve as a simple, but highly specialized, programming language. Flow of control, for example, can be managed by the use of a label (a colon followed by a string) and the branch instruction b, as well as the conditional branch t. An instruction b followed by a valid label name will move processing to the command following that label. The t instruction will only do so if there was a successful substitution since the previous t (or the start of the program, in case of the first t encountered). Additionally, the { instruction starts a subsequence of commands (up to the matching }); in most cases, it will be conditioned by an address pattern.
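A small loop built from a label and the t command, and an unconditional b, might look as follows. This is a sketch rather than a canonical idiom: the label name again and the function name add_commas are arbitrary, and each part of the script is passed with its own -e so that the label ends at the end of its argument, which some sed implementations require.

```shell
# Insert thousands separators by repeating one substitution until it fails.
# ":again" defines a label; "t again" branches back to it only if the
# preceding s command made a substitution.
add_commas() {
    sed -e ':again' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 't again'
}

grouped=$(printf '1234567\n' | add_commas)
short=$(printf '123\n' | add_commas)

# b without a label branches to the end of the script, so lines that
# start with # skip the substitution that follows.
skipped=$(printf '# foo\nfoo\n' | sed -e '/^#/b' -e 's/foo/bar/')

printf '%s\n' "$grouped" "$short" "$skipped"
```

On the input 1234567 the substitution succeeds twice (giving 1234,567 and then 1,234,567) and fails on the third attempt, at which point t no longer branches and the line is printed.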

sed used as a filter

Under Unix, sed is often used as a filter in a pipeline:

$ generateData | sed 's/x/y/g'

That is, a program such as "generateData" generates data, and then sed makes the small change of replacing x with y. For example:

$ echo xyz xyz | sed 's/x/y/g'
yyz yyz

[notes 1]

File-based sed scripts

It is often useful to put several sed commands, one command per line, into a script file such as subst.sed, and then use the -f option to run the commands (such as s/x/y/g) from the file:

sed -f subst.sed inputFileName > outputFileName

Any number of commands may be placed into the script file, and using a script file also avoids problems with shell escaping or substitutions.
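The point about shell escaping can be seen with a pattern that contains an apostrophe, which would need awkward quoting on the command line but can be written literally in a script file. The sketch below assumes the mktemp utility is available; the temporary directory, the sample sentence, and the variable names are illustrative.

```shell
# Create a throwaway directory holding the script file.
workdir=$(mktemp -d)

# The quoted heredoc delimiter keeps the shell from touching the contents.
cat > "$workdir/subst.sed" <<'EOF'
# Lines starting with # are comments.
s/x/y/g
s/don't/do not/
EOF

# Run both commands from the file against one line of input.
scripted=$(printf "xerox don't copy\n" | sed -f "$workdir/subst.sed")

rm -rf "$workdir"
printf '%s\n' "$scripted"
```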

Such a script file may be made directly executable from the command line by prepending it with a "shebang line" containing the sed command and assigning the executable permission to the file. For example, a file subst.sed can be created with contents:

#!/bin/sed -f
s/x/y/g

The file may then be made executable by the current user with the chmod command:

chmod u+x subst.sed

The file may then be executed directly from the command line:

./subst.sed inputFileName > outputFileName

In-place editing

The -i option, introduced in GNU sed, allows in-place editing of files (actually, a temporary output file is created in the background, and then the original file is replaced by the temporary file). For example:

sed -i 's/abc/def/' fileName
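Because in-place editing overwrites the original, it is common to keep a backup by attaching a suffix to the option. The following sketch uses the form -i.bak, which both GNU sed and BSD sed accept; a bare -i behaves differently between the two (BSD sed expects a suffix argument, possibly empty). The temporary directory and file contents are illustrative, and mktemp is assumed to be available.

```shell
# Work in a throwaway directory so no real file is modified.
workdir=$(mktemp -d)
printf 'abc\nxyz abc\n' > "$workdir/fileName"

# Edit the file in place, saving the original as fileName.bak.
sed -i.bak 's/abc/def/' "$workdir/fileName"

edited=$(cat "$workdir/fileName")
backup=$(cat "$workdir/fileName.bak")

rm -rf "$workdir"
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
```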

Examples

Hello, world! example

# convert input text stream to "Hello, world!"
s/.*/Hello, world!/
q

This "Hello, world!" script is in a file (e.g., script.txt) and invoked with sed -f script.txt inputFileName, where "inputFileName" is the input text file. The script changes "inputFileName" line #1 to "Hello, world!" and then quits, printing the result before sed exits. Any input lines past line #1 are not read, and not printed. So the sole output is "Hello, world!".

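The example emphasizes many key characteristics of sed: typical sed programs are short and simple; scripts can carry comments (the line starting with # is a comment); the s (substitute) command does most of the work; simple program control is available through commands such as q (quit); and patterns are written as regular expressions, such as .* (zero or more of any character).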

Other simple examples

Below follow various sed scripts; these can be executed by passing as an argument to sed, or put in a separate file and executed via -f or by making the script itself executable.

To replace any instance of a certain word in a file with "REDACTED", such as an IRC password, and save the result:

$ sed -i "s/yourpassword/REDACTED/g" ./status.chat.log

To delete any line containing the word "yourword" (the address is '/yourword/'):

/yourword/d

To delete all instances of the word "yourword":

s/yourword//g

To delete two words from a file simultaneously:

s/firstword//g
s/secondword//g

To express the previous example on one line, such as when entering at the command line, one may join two commands via the semicolon:

$ sed "s/firstword//g; s/secondword//g" inputFileName

Multiline processing example

In the next example, sed, which usually only works on one line, removes newlines from sentences where the second line starts with one space. Consider the following text:

This is my dog,
 whose name is Frank.
This is my fish,
whose name is George.
This is my goat,
 whose name is Adam.

The sed script below will turn the text above into the following text. Note that the script affects only the input lines that start with a space:

This is my dog, whose name is Frank.
This is my fish,
whose name is George.
This is my goat, whose name is Adam.

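The script is:

N
s/\n / /
P
D

This is explained as: N appends the next input line to the pattern space; s/\n / / finds a newline followed by a space and replaces the pair with a single space; P prints the pattern space up to the first newline; and D deletes the pattern space up to and including the first newline and restarts the script on the remainder (or starts a new cycle if no newline is present).

This can be expressed on a single line via semicolons:

sed 'N;s/\n / /;P;D' inputFileName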

Limitations and alternatives

While simple and limited, sed is sufficiently powerful for a large number of purposes. For more sophisticated processing, more powerful languages such as AWK or Perl are used instead. These are particularly used when a line must be transformed in a way more complicated than regex extraction and template replacement, though arbitrarily complicated transforms are in principle possible by using the hold space.
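As a small illustration of carrying state between cycles in the hold space, the following well-known one-liner prints its input in reverse line order. It is offered as a sketch of the technique, not as a recommended way to reverse a file; the sample input and variable name are illustrative.

```shell
# 1!G  on every line except the first, append the hold space to the pattern space
# h    copy the pattern space (current line plus all earlier ones) to the hold space
# $p   on the last line, print the accumulated result (-n suppresses other output)
reversed=$(printf 'first\nsecond\nthird\n' | sed -n '1!G;h;$p')

printf '%s\n' "$reversed"
```

The whole input accumulates in the hold space, so memory use grows with input size, which is one reason a general-purpose language is often the more practical choice for such tasks.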

Conversely, for simpler operations, specialized Unix utilities such as grep (print lines matching a pattern), head (print the first part of a file), tail (print the last part of a file), and tr (translate or delete characters) are often preferable. For the specific tasks they are designed to carry out, such specialized utilities are usually simpler, clearer, and faster than a more general solution such as sed.
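The overlap with these utilities can be seen by running each next to a sed equivalent. The pairs below are a sketch using illustrative sample data and variable names; each pair is expected to produce identical output.

```shell
sample=$(printf 'alpha\nbeta\ngamma\ndelta')

# grep: print lines matching a pattern.
via_grep=$(printf '%s\n' "$sample" | grep '^[bg]')
sed_grep=$(printf '%s\n' "$sample" | sed -n '/^[bg]/p')

# head: print the first two lines.
via_head=$(printf '%s\n' "$sample" | head -n 2)
sed_head=$(printf '%s\n' "$sample" | sed 2q)

# tail: print the last line.
via_tail=$(printf '%s\n' "$sample" | tail -n 1)
sed_tail=$(printf '%s\n' "$sample" | sed -n '$p')

# tr: translate characters one for one (sed's y command).
via_tr=$(printf '%s\n' "$sample" | tr 'ab' 'AB')
sed_tr=$(printf '%s\n' "$sample" | sed 'y/ab/AB/')

printf '%s\n' "$sed_grep" "$sed_head" "$sed_tail" "$sed_tr"
```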

The ed/sed commands and syntax continue to be used in descendant programs, such as the text editors vi and vim. An analog to ed/sed is sam/ssam, where sam is the Plan 9 editor, and ssam is a stream interface to it, yielding functionality similar to sed.

Notes

  1. In command line use, the quotes around the expression are not required, and are only necessary if the shell would otherwise not interpret the expression as a single word (token). For the script s/x/y/g there is no ambiguity, so generateData | sed s/x/y/g works correctly. However, quotes are usually included for clarity, and are often necessary, notably for whitespace (e.g., 's/x x/y y/'). Most often single quotes are used, to avoid having the shell interpret $ as a shell variable. Double quotes are used, such as "s/$1/$2/g", to allow the shell to substitute for a command line argument or other shell variable.

References

  1. "The sed FAQ, Section 2.1". Archived from the original on 2018-06-27. Retrieved 2013-05-21.
  2. "The sed FAQ, Section 2.2". Archived from the original on 2018-06-27. Retrieved 2013-05-21.
  3. McIlroy, M. D. (1987). A Research Unix reader: annotated excerpts from the Programmer's Manual, 1971–1986 (PDF) (Technical report). CSTR. Bell Labs. 139.
  4. "On the Early History and Impact of Unix". A while later a demand arose for another special-purpose program, gres, for substitution: g/re/s. Lee McMahon undertook to write it, and soon foresaw that there would be no end to the family: g/re/d, g/re/a, etc. As his concept developed it became sed…
  5. Raymond, Eric Steven; Rebe, René (2017-03-03). "tar-mirror/minised: A smaller, cheaper, faster SED implementation". GitHub. Archived from the original on 2018-06-13. Retrieved 2024-05-20.
  6. "Implementation of a Turing Machine as Sed Script". Archived from the original on 2018-02-20. Retrieved 2003-04-24.
  7. "Turing.sed". Archived from the original on 2018-01-16. Retrieved 2003-04-24.
  8. "The $SED Home - gamez".
  9. "bolknote/SedChess". GitHub. Retrieved 2013-08-23.
  10. "Sedtris, a Tetris game written for sed". GitHub. Retrieved 2016-10-03.
