Trimming (computer programming)

Last updated

In computer programming, trimming (trim) or stripping (strip) is a string manipulation in which leading and trailing whitespace is removed from a string.

Contents

For example, the string (enclosed by apostrophes)

'  this is a test  ' 

would be changed, after trimming, to

'this is a test' 

Variants

Left or right trimming

The most popular variants of the trim function strip only the beginning or end of the string. Typically named ltrim and rtrim respectively, or in the case of Python: lstrip and rstrip. C# uses TrimStart and TrimEnd, and Common Lisp string-left-trim and string-right-trim. Pascal and Java do not have these variants built-in, although Object Pascal (Delphi) has TrimLeft and TrimRight functions. [1]

Whitespace character list parameterization

Many trim functions have an optional parameter to specify a list of characters to trim, instead of the default whitespace characters. For example, PHP and Python allow this optional parameter, while Pascal and Java do not. With Common Lisp's string-trim function, the parameter (called character-bag) is required. The C++ Boost library defines space characters according to locale, as well as offering variants with a predicate parameter (a functor) to select which characters are trimmed.

Special empty string return value

An uncommon variant of trim returns a special result if no characters remain after the trim operation. For example, Apache Jakarta's StringUtils has a function called stripToNull which returns null in place of an empty string.

Space normalization

Space normalization is a related string manipulation where in addition to removing surrounding whitespace, any sequence of whitespace characters within the string is replaced with a single space. Space normalization is performed by the function named Trim() in spreadsheet applications (including Excel, Calc, Gnumeric, and Google Docs), and by the normalize-space() function in XSLT and XPath,

In-place trimming

While most algorithms return a new (trimmed) string, some alter the original string in-place. Notably, the Boost library allows either in-place trimming or a trimmed copy to be returned.

Definition of whitespace

The characters which are considered whitespace varies between programming languages and implementations. For example, C traditionally only counts space, tab, line feed, and carriage return characters, while languages which support Unicode typically include all Unicode space characters. Some implementations also include ASCII control codes (non-printing characters) along with whitespace characters.

Java's trim method considers ASCII spaces and control codes as whitespace, contrasting with the Java isWhitespace() method, [2] which recognizes all Unicode space characters.

Delphi's Trim function considers characters U+0000 (NULL) through U+0020 (SPACE) to be whitespace.

Non-space blanks

The Braille Patterns Unicode block contains U+2800BRAILLE PATTERN BLANK, a Braille pattern with no dots raised. The Unicode standard explicitly states that it does not act as a space.

The Non-breaking space U+00A0 NO-BREAK SPACE ( ,  ) can also be treated as non-space for trimming purposes.

Usage

Related Research Articles

A "Hello, World!" program is generally a simple computer program which outputs to the screen a message similar to "Hello, World!" while ignoring any user input. A small piece of code in most general-purpose programming languages, this program is used to illustrate a language's basic syntax. A "Hello, World!" program is often the first written by a student of a new programming language, but such a program can also be used as a sanity check to ensure that the computer software intended to compile or run source code is correctly installed, and that its operator understands how to use it.

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computing, serialization is the process of translating a data structure or object state into a format that can be stored or transmitted and reconstructed later. When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward. Serialization of object-oriented objects does not include any of their associated methods with which they were previously linked.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals especially in complicated cases. The exact notation depends on the programming language in question. Nevertheless, there are general guidelines that most modern programming languages follow.

<span class="mw-page-title-main">Newline</span> Special characters in computing signifying the end of a line of text

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

The backtick` is a typographical mark used mainly in computing. It is also known as backquote, grave, or grave accent.

In some programming languages, eval, short for the English evaluate, is a function which evaluates a string as though it were an expression in the language, and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval. The input to eval is not necessarily a string; it may be structured representation of code, such as an abstract syntax tree, or of special type such as code. The analog for a statement is exec, which executes a string as if it were a statement; in some languages, such as Python, both are present, while in other languages only one of either eval or exec is.

In computer programming, a naming convention is a set of rules for choosing the character sequence to be used for identifiers which denote variables, types, functions, and other entities in source code and documentation.

In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE represents a blank space punctuation character in text, used as a word divider in Western scripts.

Braille ASCII is a subset of the ASCII character set which uses 64 of the printable ASCII characters to represent all possible dot combinations in six-dot braille. It was developed around 1969 and, despite originally being known as North American Braille ASCII, it is now used internationally.

In the area of mathematical logic and computer science known as type theory, a unit type is a type that allows only one value. The carrier associated with a unit type can be any singleton set. There is an isomorphism between any two such sets, so it is customary to talk about the unit type and ignore the details of its value. One may also regard the unit type as the type of 0-tuples, i.e. the product of no types.

String functions are used in computer programming languages to manipulate a string or query information about a string.

.properties is a file extension for files mainly used in Java-related technologies to store the configurable parameters of an application. They can also be used for storing strings for Internationalization and localization; these are known as Property Resource Bundles.

The greater-than sign is a mathematical symbol that denotes an inequality between two values. The widely adopted form of two equal-length strokes connecting in an acute angle at the right, >, has been found in documents dated as far back as 1631. In mathematical writing, the greater-than sign is typically placed between two values being compared and signifies that the first number is greater than the second number. Examples of typical usage include 1.5 > 1 and 1 > −2. The less-than sign and greater-than sign always "point" to the smaller number. Since the development of computer programming languages, the greater-than sign and the less-than sign have been repurposed for a range of uses and operations.

In computer programming, ellipsis notation is used to denote ranges, an unspecified number of arguments, or a parent directory. Most programming languages require the ellipsis to be written as a series of periods; a single (Unicode) ellipsis character cannot be used.

In software engineering, the module pattern is a design pattern used to implement the concept of software modules, defined by modular programming, in a programming language with incomplete direct support for the concept.

References

  1. "Trim". Freepascal.org. 2013-02-02. Retrieved 2013-08-24.
  2. "Character (Java 2 Platform SE 5.0)". Java.sun.com. Retrieved 2013-08-24.