Zipping (computer science)

Last updated

In computer science, zipping is a function which maps a tuple of sequences into a sequence of tuples. This name zip derives from the action of a zipper in that it interleaves two formerly disjoint sequences. The inverse function is unzip.

Contents

Example

Given the three words cat, fish and be where |cat| is 3, |fish| is 4 and |be| is 2. Let denote the length of the longest word which is fish; . The zip of cat, fish, be is then 4 tuples of elements:

where # is a symbol not in the original alphabet. In Haskell this truncates to the shortest sequence , where :

zip3"cat""fish""be"-- [('c','f','b'),('a','i','e')]

Definition

Let Σ be an alphabet, # a symbol not in Σ.

Let x1x2... x|x|, y1y2... y|y|, z1z2... z|z|, ... be n words (i.e. finite sequences) of elements of Σ. Let denote the length of the longest word, i.e. the maximum of |x|, |y|, |z|, ... .

The zip of these words is a finite sequence of n-tuples of elements of (Σ ∪ {#}), i.e. an element of :

,

where for any index i > |w|, the wi is #.

The zip of x, y, z, ... is denoted zip(x, y, z, ...) or xyz ⋆ ...

The inverse to zip is sometimes denoted unzip.

A variation of the zip operation is defined by:

where is the minimum length of the input words. It avoids the use of an adjoined element , but destroys information about elements of the input sequences beyond .

In programming languages

Zip functions are often available in programming languages, often referred to as zip. In Lisp-dialects one can simply map the desired function over the desired lists, map is variadic in Lisp so it can take an arbitrary number of lists as argument. An example from Clojure: [1]

;; `nums' contains an infinite list of numbers (0 1 2 3 ...)(def nums(range))(def tens[102030])(def firstname"Alice");; To zip (0 1 2 3 ...) and [10 20 30] into a vector, invoke `map vector' on them; same with list(map vector numstens); ⇒ ([0 10] [1 20] [2 30])(map list numstens); ⇒ ((0 10) (1 20) (2 30))(map str numstens); ⇒ ("010" "120" "230");; `map' truncates to the shortest sequence; note missing \c and \e from "Alice"(map vector numstensfirstname); ⇒ ([0 10 \A] [1 20 \l] [2 30 \i])(map str numstensfirstname); ⇒ ("010A" "120l" "230i");; To unzip, apply `map vector' or `map list'(apply map list (map vector numstensfirstname));; ⇒ ((0 1 2) (10 20 30) (\A \l \i))

In Common Lisp:

(defparameternums'(123))(defparametertens'(102030))(defparameterfirstname"Alice")(mapcar#'listnumstens);; ⇒ ((1 10) (2 20) (3 30))(mapcar#'listnumstens(coercefirstname'list));; ⇒ ((1 10 #\A) (2 20 #\l) (3 30 #\i)) — truncates on shortest list;; Unzips(apply#'mapcar#'list(mapcar#'listnumstens(coercefirstname'list)));; ⇒ ((1 2 3) (10 20 30) (#\A #\l #\i))

Languages such as Python provide a zip() function, older version (Python 2.*) allowed mapping None over lists to get a similar effect. [2] zip() in conjunction with the * operator unzips a list: [2]

>>> nums=[1,2,3]>>> tens=[10,20,30]>>> firstname='Alice'>>> zipped=zip(nums,tens)>>> zipped[(1, 10), (2, 20), (3, 30)]>>> zip(*zipped)# unzip[(1, 2, 3), (10, 20, 30)]>>> zipped2=zip(nums,tens,list(firstname))>>> zipped2# zip, truncates on shortest[(1, 10, 'A'), (2, 20, 'l'), (3, 30, 'i')] >>> zip(*zipped2)# unzip[(1, 2, 3), (10, 20, 30), ('A', 'l', 'i')]>>> # mapping with `None' doesn't truncate; deprecated in Python 3.*>>> map(None,nums,tens,list(firstname))[(1, 10, 'A'), (2, 20, 'l'), (3, 30, 'i'), (None, None, 'c'), (None, None, 'e')]

Haskell has a method of zipping sequences but requires a specific function for each arity (zip for two sequences, zip3 for three etc.), [3] similarly the functions unzip and unzip3 are available for unzipping:

-- nums contains an infinite list of numbers [1, 2, 3, ...] nums=[1..]tens=[10,20,30]firstname="Alice"zipnumstens-- ⇒ [(1,10), (2,20), (3,30)] — zip, truncates infinite listunzip$zipnumstens-- ⇒ ([1,2,3], [10,20,30]) — unzipzip3numstensfirstname-- ⇒ [(1,10,'A'), (2,20,'l'), (3,30,'i')] — zip, truncatesunzip3$zip3numstensfirstname-- ⇒ ([1,2,3], [10,20,30], "Ali") — unzip

Language comparison

List of languages by support of zip:

Zip in various languages
LanguageZipZip 3 listsZip n listsNotes
Chapel zip (iter1iter2)zip (iter1iter2iter3)zip (iter1 ... itern)The shape of each iterator, the rank and the extents in each dimension, must be identical. [4]
Clojure (map list list1list2)
(map vector list1list2)
(map list list1list2list3)
(map vector list1list2list3)
(map list list1listn)
(map vector list1listn)
Stops after the length of the shortest list.
Common Lisp (mapcar#'listlist1list2)(mapcar#'listlist1list2list3)(mapcar#'listlist1...listn)Stops after the length of the shortest list.
D zip(range1, range2)
range1.zip(range2)
zip(range1, range2,range3)
range1.zip(range2, range3)
zip(range1, …, rangeN)
range1.zip(…, rangeN)
The stopping policy defaults to shortest and can be optionally provided as shortest, longest, or requiring the same length. [5] The second form is an example of UFCS.
F# List.zip list1list2
Seq.zip source1source2
Array.zip array1array2
List.zip3 list1list2list3
Seq.zip3 source1source2source3
Array.zip3 array1array2array3
Haskell zip list1list2zip3 list1list2list3zipnlist1listnzipn for n > 3 is available in the module Data.List. Stops after the shortest list ends.
Python zip(list1, list2)zip(list1, list2, list3)zip(list1, …, listn)zip() and map() (3.x) stops after the shortest list ends, whereas map() (2.x) and itertools.zip_longest() (3.x) extends the shorter lists with None items
Ruby list1.zip(list2)list1.zip(list2, list3)list1.zip(list1, .., listn)When the list being executed upon (list1) is shorter than the lists being zipped the resulting list is the length of list1. If list1 is longer nil values are used to fill the missing values [6]
Scala list1.zip(list2)If one of the two collections is longer than the other, its remaining elements are ignored. [7]
Unzip in various languages
LanguageUnzipUnzip 3 tuplesUnzip n tuplesNotes
Clojure (apply map vector ziplist)(apply map vector ziplist)(apply map vector ziplist)
Common Lisp (apply#'mapcar#'listziplist)(apply#'mapcar#'listziplist)(apply#'mapcar#'listziplist)
F# List.unzip list1list2
Seq.unzip source1source2
Array.unzip array1array2
List.unzip3 list1list2list3
Seq.unzip3 source1source2source3
Array.unzip3 array1array2array3
Haskell unzip ziplistunzip3 ziplistunzipnziplistunzipn for n > 3 is available in the module Data.List.
Python zip(*zipvlist)zip(*zipvlist)zip(*zipvlist)

See also

Related Research Articles

<span class="mw-page-title-main">Autocorrelation</span> Correlation of a signal with a time-shifted copy of itself, as a function of shift

Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations of a random variable as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

In mathematics, more specifically in functional analysis, a Banach space is a complete normed vector space. Thus, a Banach space is a vector space with a metric that allows the computation of vector length and distance between vectors and is complete in the sense that a Cauchy sequence of vectors always converges to a well-defined limit that is within the space.

<span class="mw-page-title-main">Basis (linear algebra)</span> Set of vectors used to define coordinates

In mathematics, a set B of vectors in a vector space V is called a basis if every element of V may be written in a unique way as a finite linear combination of elements of B. The coefficients of this linear combination are referred to as components or coordinates of the vector with respect to B. The elements of a basis are called basis vectors.

In mathematics, the Lp spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces. They are sometimes called Lebesgue spaces, named after Henri Lebesgue, although according to the Bourbaki group they were first introduced by Frigyes Riesz.

<span class="mw-page-title-main">Multivariate normal distribution</span> Generalization of the one-dimensional normal distribution to higher dimensions

In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.

In the area of mathematics known as functional analysis, a reflexive space is a locally convex topological vector space for which the canonical evaluation map from into its bidual is a homeomorphism. A normed space is reflexive if and only if this canonical evaluation map is surjective, in which case this evaluation map is an isometric isomorphism and the normed space is a Banach space. Those space for which the canonical evaluation map is surjective are called semi-reflexive spaces.

In probability theory and statistics, a Gaussian process is a stochastic process, such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

In functional analysis and operator theory, a bounded linear operator is a linear transformation between topological vector spaces (TVSs) and that maps bounded subsets of to bounded subsets of If and are normed vector spaces, then is bounded if and only if there exists some such that for all

In abstract algebra and multilinear algebra, a multilinear form on a vector space over a field is a map

In abstract algebra, Hilbert's Theorem 90 (or Satz 90) is an important result on cyclic extensions of fields (or to one of its generalizations) that leads to Kummer theory. In its most basic form, it states that if L/K is an extension of fields with cyclic Galois group G = Gal(L/K) generated by an element and if is an element of L of relative norm 1, that is

In functional analysis and related areas of mathematics, a sequence space is a vector space whose elements are infinite sequences of real or complex numbers. Equivalently, it is a function space whose elements are functions from the natural numbers to the field K of real or complex numbers. The set of all such functions is naturally identified with the set of all possible infinite sequences with elements in K, and can be turned into a vector space under the operations of pointwise addition of functions and pointwise scalar multiplication. All sequence spaces are linear subspaces of this space. Sequence spaces are typically equipped with a norm, or at least the structure of a topological vector space.

In statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimation method which minimizes the mean square error (MSE), which is a common measure of estimator quality, of the fitted values of a dependent variable. In the Bayesian setting, the term MMSE more specifically refers to estimation with quadratic loss function. In such case, the MMSE estimator is given by the posterior mean of the parameter to be estimated. Since the posterior mean is cumbersome to calculate, the form of the MMSE estimator is usually constrained to be within a certain class of functions. Linear MMSE estimators are a popular choice since they are easy to use, easy to calculate, and very versatile. It has given rise to many popular estimators such as the Wiener–Kolmogorov filter and Kalman filter.

In mathematics, nuclear spaces are topological vector spaces that can be viewed as a generalization of finite dimensional Euclidean spaces and share many of their desirable properties. Nuclear spaces are however quite different from Hilbert spaces, another generalization of finite dimensional Euclidean spaces. They were introduced by Alexander Grothendieck.

In universal algebra, a basis is a structure inside of some (universal) algebras, which are called free algebras. It generates all algebra elements from its own elements by the algebra operations in an independent manner. It also represents the endomorphisms of an algebra by certain indexings of algebra elements, which can correspond to the usual matrices when the free algebra is a vector space.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In linear algebra, particularly projective geometry, a semilinear map between vector spaces V and W over a field K is a function that is a linear map "up to a twist", hence semi-linear, where "twist" means "field automorphism of K". Explicitly, it is a function T : VW that is:

Sliced inverse regression is a tool for dimensionality reduction in the field of multivariate statistics.

In mathematics, , the vector space of bounded sequences with the supremum norm, and , the vector space of essentially bounded measurable functions with the essential supremum norm, are two closely related Banach spaces. In fact the former is a special case of the latter. As a Banach space they are the continuous dual of the Banach spaces of absolutely summable sequences, and of absolutely integrable measurable functions. Pointwise multiplication gives them the structure of a Banach algebra, and in fact they are the standard examples of abelian Von Neumann algebras.

This is a glossary of linear algebra.

References

  1. map from ClojureDocs
  2. 1 2 map(function, iterable, ...) from section Built-in Functions from Python v2.7.2 documentation
  3. zip :: [a] -> [b] -> [(a, b)] from Prelude, Basic libraries
  4. "Statements — Chapel Documentation 1.25".
  5. "std.range - D Programming Language".
  6. "Class: Array (Ruby 2.0.0)".
  7. "IterableOps". scala-lang.org.