Filter (software)

Last updated

A filter is a computer program or subroutine to process a stream, producing another stream. While a single filter can be used individually, they are frequently strung together to form a pipeline.

Contents

Some operating systems such as Unix are rich with filter programs. Windows 7 and later are also rich with filters, as they include Windows PowerShell. In comparison, however, few filters are built into cmd.exe (the original command-line interface of Windows), most of which have significant enhancements relative to the similar filter commands that were available in MS-DOS. OS X includes filters from its underlying Unix base but also has Automator, which allows filters (known as "Actions") to be strung together to form a pipeline.

Unix

In Unix and Unix-like operating systems, a filter is a program that gets most of its data from its standard input (the main input stream) and writes its main results to its standard output (the main output stream). Auxiliary input may come from command line flags or configuration files, while auxiliary output may go to standard error. The command syntax for getting data from a device or file other than standard input is the input operator (<). Similarly, to send data to a device or file other than standard output is the output operator (>). To append data lines to an existing output file, one can use the append operator (>>). Filters may be strung together into a pipeline with the pipe operator ("|"). This operator signifies that the main output of the command to the left is passed as main input to the command on the right.

The Unix philosophy encourages combining small, discrete tools to accomplish larger tasks. The classic filter in Unix is Ken Thompson's grep , which Doug McIlroy cites as what "ingrained the tools outlook irrevocably" in the operating system, with later tools imitating it. [1] grep at its simplest prints any lines containing a character string to its output. The following is an example:

cut -d : -f 1 /etc/passwd | grep foo 

This finds all registered users that have "foo" as part of their username by using the cut command to take the first field (username) of each line of the Unix system password file and passing them all as input to grep, which searches its input for lines containing the character string "foo" and prints them on its output.

Common Unix filter programs are: cat, cut, grep, head, sort, tail, and uniq. Programs like awk and sed can be used to build quite complex filters because they are fully programmable. Unix filters can also be used by Data scientists to get a quick overview about a file based dataset. [2]

List of Unix filter programs

DOS

Two standard filters from the early days of DOS-based computers are find and sort.

Examples:

find "keyword" < inputfilename > outputfilename sort "keyword" < inputfilename > outputfilename find /v "keyword" < inputfilename | sort > outputfilename

Such filters may be used in batch files (*.bat, *.cmd etc.).

For use in the same command shell environment, there are many more filters available than those built into Windows. Some of these are freeware, some shareware and some are commercial programs. A number of these mimic the function and features of the filters in Unix. Some filtering programs have a graphical user interface (GUI) to enable users to design a customized filter to suit their special data processing and/or data mining requirements.

Windows

Windows Command Prompt inherited MS-DOS commands, improved some and added a few. For example, Windows Server 2003 features six command-line filters for modifying Active Directory that can be chained by piping: DSAdd, DSGet, DSMod, DSMove, DSRm and DSQuery. [3]

Windows PowerShell adds an entire host of filters known as "cmdlets" which can be chained together with a pipe, except a few simple ones, e.g. Clear-Screen. The following example gets a list of files in the C:\Windows folder, gets the size of each and sorts the size in ascending order. It shows how three filters (Get-ChildItem, ForEach-Object and Sort-Object) are chained with pipes.

Get-ChildItemC:\Windows|ForEach-Object{$_.length}|Sort-Object-Ascending

Related Research Articles

<span class="mw-page-title-main">AWK</span> Data-driven programming language made by Alfred Aho, Peter Weinberger and Brian Kernighan

AWK (awk) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. Like sed and grep, it is a filter, and is a standard feature of most Unix-like operating systems.

sed Standard UNIX utility for editing streams of data

sed is a Unix utility that parses and transforms text, using a simple, compact programming language. It was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs, and is available today for most operating systems. sed was based on the scripting features of the interactive editor ed and the earlier qed. It was one of the earliest tools to support regular expressions, and remains in use for text processing, most notably with the substitution command. Popular alternative tools for plaintext string manipulation and "stream editing" include AWK and Perl.

<span class="mw-page-title-main">Shell script</span> Script written for the shell, or command line interpreter, of an operating system

A shell script is a computer program designed to be run by the Unix shell, a command-line interpreter. The various dialects of shell scripts are considered to be scripting languages. Typical operations performed by shell scripts include file manipulation, program execution, and printing text. A script which sets up the environment, runs the program, and does any necessary cleanup or logging, is called a wrapper.

uniq is a utility command on Unix, Plan 9, Inferno, and Unix-like operating systems which, when fed a text file or standard input, outputs the text with adjacent identical lines collapsed to one, unique line of text.

<span class="mw-page-title-main">C shell</span> Unix shell

The C shell is a Unix shell created by Bill Joy while he was a graduate student at University of California, Berkeley in the late 1970s. It has been widely distributed, beginning with the 2BSD release of the Berkeley Software Distribution (BSD) which Joy first distributed in 1978. Other early contributors to the ideas or the code were Michael Ubell, Eric Allman, Mike O'Brien and Jim Kulp.

In computer programming, standard streams are interconnected input and output communication channels between a computer program and its environment when it begins execution. The three input/output (I/O) connections are called standard input (stdin), standard output (stdout) and standard error (stderr). Originally I/O happened via a physically connected system console, but standard streams abstract this. When a command is executed via an interactive shell, the streams are typically connected to the text terminal on which the shell is running, but can be changed with redirection or a pipeline. More generally, a child process inherits the standard streams of its parent process.

In software, an XML pipeline is formed when XML processes, especially XML transformations and XML validations, are connected.

xargs is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command.

<span class="mw-page-title-main">Redirection (computing)</span> Form of interprocess communication

In computing, redirection is a form of interprocess communication, and is a function common to most command-line interpreters, including the various Unix shells that can redirect standard streams to user-specified locations.

wc (Unix) Unix command utility

wc is a command in Unix, Plan 9, Inferno, and Unix-like operating systems. The program reads either standard input or a list of computer files and generates one or more of the following statistics: newline count, word count, and byte count. If a list of files is provided, both individual file and total statistics follow.

In computing, cut is a command line utility on Unix and Unix-like operating systems which is used to extract sections from each line of input — usually from a file. It is currently part of the GNU coreutils package and the BSD Base System.

In computer programming, a one-liner program originally was textual input to the command-line of an operating system shell that performed some function in just one line of input. In the present day, a one-liner can be

<span class="mw-page-title-main">Pipeline (Unix)</span>

In Unix-like computer operating systems, a pipeline is a mechanism for inter-process communication using message passing. A pipeline is a set of processes chained together by their standard streams, so that the output text of each process (stdout) is passed directly as input (stdin) to the next one. The second process is started as the first process is still executing, and they are executed concurrently. The concept of pipelines was championed by Douglas McIlroy at Unix's ancestral home of Bell Labs, during the development of Unix, shaping its toolbox philosophy. It is named by analogy to a physical pipeline. A key feature of these pipelines is their "hiding of internals". This in turn allows for more clarity and simplicity in the system.

In software engineering, a pipeline consists of a chain of processing elements, arranged so that the output of each element is the input of the next; the name is by analogy to a physical pipeline. Usually some amount of buffering is provided between consecutive elements. The information that flows in these pipelines is often a stream of records, bytes, or bits, and the elements of a pipeline may be called filters; this is also called the pipes and filters design pattern. Connecting elements into a pipeline is analogous to function composition.

In Unix-like and some other operating systems, find is a command-line utility that locates files based on some user-specified criteria and either prints the pathname of each matched object or, if another action is requested, performs that action on each matched object.

In computing, tee is a command in command-line interpreters (shells) using standard streams which reads standard input and writes it to both standard output and one or more files, effectively duplicating its input. It is primarily used in conjunction with pipes and filters. The command is named after the T-splitter used in plumbing.

Tacit programming, also called point-free style, is a programming paradigm in which function definitions do not identify the arguments on which they operate. Instead the definitions merely compose other functions, among which are combinators that manipulate the arguments. Tacit programming is of theoretical interest, because the strict use of composition results in programs that are well adapted for equational reasoning. It is also the natural style of certain programming languages, including APL and its derivatives, and concatenative languages such as Forth. The lack of argument naming gives point-free style a reputation of being unnecessarily obscure, hence the epithet "pointless style".

find (Windows)

In computing, find is a command in the command-line interpreters (shells) of a number of operating systems. It is used to search for a specific text string in a file or files. The command sends the specified lines to the standard output device.

<span class="mw-page-title-main">Command-line interface</span> Computer interface that uses text

A command-line interpreter or command-line processor uses a command-line interface (CLI) to receive commands from a user in the form of lines of text. This provides a means of setting parameters for the environment, invoking executables and providing information to them as to what actions they are to perform. In some cases the invocation is conditional based on conditions established by the user or previous executables. Such access was first provided by computer terminals starting in the mid-1960s. This provided an interactive environment not available with punched cards or other input methods.

Strozzi NoSQL is a shell-based relational database management system initialized and developed by Carlo Strozzi that runs under Unix-like operating systems, or others with compatibility layers. Its file name NoSQL merely reflects the fact that it does not express its queries using Structured Query Language; the NoSQL RDBMS is distinct from the circa-2009 general concept of NoSQL databases, which are typically non-relational, unlike the NoSQL RDBMS. Strozzi NoSQL is released under the GNU GPL.

References

  1. McIlroy, M. D. (1987). A Research Unix reader: annotated excerpts from the Programmer's Manual, 1971–1986 (PDF) (Technical report). CSTR. Bell Labs. 139.
  2. Data Analysis with the Unix Shell Archived 2016-01-22 at the Wayback Machine - Bernd Zuther, comSysto GmbH, 2013
  3. Holme, Dan; Thomas, Orin (2004). Managing and maintaining a Microsoft Windows Server 2003 environment : exam 70-290. Redmond, WA: Microsoft Press. pp.  3|17—3|26. ISBN   9780735614376.