Iteratee

Last updated

In functional programming, an iteratee is a composable abstraction for incrementally processing sequentially presented chunks of input data in a purely functional fashion. With iteratees, it is possible to lazily transform how a resource will emit data, for example, by converting each chunk of the input to uppercase as they are retrieved or by limiting the data to only the five first chunks without loading the whole input data into memory. Iteratees are also responsible for opening and closing resources, providing predictable resource management.

Contents

On each step, an iteratee is presented with one of three possible types of values: the next chunk of data, a value to indicate no data is available, or a value to indicate the iteration process has finished. It may return one of three possible types of values, to indicate to the caller what should be done next: one that means "stop" (and contains the final return value), one that means "continue" (and specifies how to continue), and one that means "signal an error". The latter types of values in effect represent the possible "states" of an iteratee. An iteratee would typically start in the "continue" state.

Iteratees are used in Haskell and Scala (in the Play Framework [1] and in Scalaz), and are also available for F#. [2] Various slightly different implementations of iteratees exist. For example, in the Play framework, they involve Futures so that asynchronous processing can be performed.

Because iteratees are called by other code which feeds them with data, they are an example of inversion of control. However, unlike many other examples of inversion of control such as SAX XML parsing, the iteratee retains a limited amount of control over the process. It cannot reverse back and look at previous data (unless it stores that data internally), but it can stop the process cleanly without throwing an exception (using exceptions as a means of control flow, rather than to signal an exceptional event, is often frowned upon by programmers [3] ).

Commonly associated abstractions

The following abstractions are not strictly speaking necessary to work with iteratees, but they do make it more convenient.

Enumerators

An Enumerator is a convenient abstraction for feeding data into an iteratee from an arbitrary data source. Typically the enumerator will take care of any necessary resource cleanup associated with the data source. Because the enumerator knows exactly when the iteratee has finished reading data, it will do the resource cleanup (such as closing a file) at exactly the right time – neither too early nor too late. However, it can do this without needing to know about, or being co-located to, the implementation of the iteratee – so enumerators and iteratees form an example of separation of concerns.

Enumeratees

An Enumeratee is a convenient abstraction for transforming the output of either an enumerator or iteratee, and feeding that output to an iteratee. For example, a "map" enumeratee would map a function over each input chunk. [4]

Motivations

Iteratees were created due to problems with existing purely functional solutions to the problem of making input/output composable yet correct. Lazy I/O in Haskell allowed pure functions to operate on data on disk as if it were in memory, without explicitly doing I/O at all after opening the file - a kind of memory-mapped file feature - but because it was impossible in general (due to the Halting problem) for the runtime to know whether the file or other resource was still needed, excessive numbers of files could be left open unnecessarily, resulting in file descriptor exhaustion at the operating system level. Traditional C-style I/O, on the other hand, was too low-level and required the developer to be concerned with low-level details such as the current position in the file, which hindered composability. Iteratees and enumerators combine the high-level functional programming benefits of lazy I/O, with the ability to control resources and low-level details where necessary afforded by C-style I/O. [5]

Examples

Uses

Iteratees are used in the Play framework to push data out to long-running Comet and WebSocket connections to web browsers.

Iteratees may also be used to perform incremental parsing (that is, parsing that does not read all the data into memory at once), for example of JSON. [6]

Iteratees are a very general abstraction and can be used for arbitrary kinds of sequential information processing (or mixed sequential/random-access processing) - and need not involve any I/O at all. This makes it easy to repurpose an iteratee to work on an in-memory dataset instead of data flowing in from the network.

History

In a sense, a distant predecessor of the notion of an enumerator pushing data into a chain of one or more iteratees, was the pipeline concept in operating systems. However, unlike a typical pipeline, iteratees are not separate processes (and hence do not have the overhead of IPC) - or even separate threads, although they can perform work in a similar manner to a chain of worker threads sending messages to each other. This means that iteratees are more lightweight than processes or threads - unlike the situations with separate processes or threads, no extra stacks are needed.

Iteratees and enumerators were invented by Oleg Kiselyov for use in Haskell. [5] Later, they were introduced into Scalaz (in version 5.0; enumeratees were absent and were introduced in Scalaz 7) and into Play Framework 2.0.

Formal semantics

Iteratees have been formally modelled as free monads, allowing equational laws to be validated, and employed to optimise programs using iteratees. [5]

Alternatives

Related Research Articles

In computer science, functional programming is a programming paradigm where programs are constructed by applying and composing functions. It is a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program.

In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is needed and which also avoids repeated evaluations.

In computer programming, an iterator is an object that enables a programmer to traverse a container, particularly lists. Various types of iterators are often provided via a container's interface. Though the interface and semantics of a given iterator are fixed, iterators are often implemented in terms of the structures underlying a container implementation and are often tightly coupled to the container to enable the operational semantics of the iterator. An iterator performs traversal and also gives access to data elements in a container, but does not itself perform iteration.

Programming languages can be grouped by the number and types of paradigms supported.

<span class="mw-page-title-main">OpenMP</span> Open standard for parallelizing

OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In computer science, a generator is a routine that can be used to control the iteration behaviour of a loop. All generators are also iterators. A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.

In functional programming, a monad is a structure that combines program fragments (functions) and wraps their return values in a type with additional computation. In addition to defining a wrapping monadic type, monads define two operators: one to wrap a value in the monad type, and another to compose together functions that output values of the monad type. General-purpose languages use monads to reduce boilerplate code needed for common operations. Functional languages use monads to turn complicated sequences of functions into succinct pipelines that abstract away control flow, and side-effects.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

<span class="mw-page-title-main">Apache Portable Runtime</span> Supporting library for the Apache web server

The Apache Portable Runtime (APR) is a supporting library for the Apache web server. It provides a set of APIs that map to the underlying operating system (OS). Where the OS does not support a particular function, APR will provide an emulation. Thus programmers can use the APR to make a program truly portable across platforms.

<span class="mw-page-title-main">Architecture of Windows NT</span> Overview of the architecture of the Microsoft Windows NT line of operating systems

The architecture of Windows NT, a line of operating systems produced and sold by Microsoft, is a layered design that consists of two main components, user mode and kernel mode. It is a preemptive, reentrant multitasking operating system, which has been designed to work with uniprocessor and symmetrical multiprocessor (SMP)-based computers. To process input/output (I/O) requests, they use packet-driven I/O, which utilizes I/O request packets (IRPs) and asynchronous I/O. Starting with Windows XP, Microsoft began making 64-bit versions of Windows available; before this, there were only 32-bit versions of these operating systems.

In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, as an extension to an existing language, or as an entirely new language.

Automatic parallelization, also auto parallelization, or autoparallelization refers to converting sequential code into multi-threaded and/or vectorized code in order to use multiple processors simultaneously in a shared-memory multiprocessor (SMP) machine. Fully automatic parallelization of sequential programs is a challenge because it requires complex program analysis and the best approach may depend upon parameter values that are not known at compilation time.

Concurrent computing is a form of computing in which several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts.

Thrift is an interface definition language and binary communication protocol used for defining and creating services for numerous programming languages. It was developed at Facebook for "scalable cross-language services development" and as of 2020 is an open source project in the Apache Software Foundation.

This article describes the features in Haskell.

Haskell is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming language features such as type classes, which enable type-safe operator overloading, and monadic IO. It is named after logician Haskell Curry. Haskell's main implementation is the Glasgow Haskell Compiler (GHC).

<span class="mw-page-title-main">XML transformation language</span> Type of programming language

An XML transformation language is a programming language designed specifically to transform an input XML document into an output document which satisfies some specific goal.

<span class="mw-page-title-main">Yesod (web framework)</span>

Yesod is a free and open-source web framework based on Haskell for productive development of type-safe, REST model based, high performance web applications, developed by Michael Snoyman et al.

Underscore.js is a JavaScript library which provides utility functions for common programming tasks. It is comparable to features provided by Prototype.js and the Ruby language, but opts for a functional programming design instead of extending object prototypes. The documentation refers to Underscore.js as "the tie to go along with jQuery's tux, and Backbone.js' suspenders." Underscore.js was created by Jeremy Ashkenas, who is also known for Backbone.js and CoffeeScript.

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

References

  1. "Handling data streams reactively". Play Framework documentation. Retrieved 29 June 2013.
  2. "Github Search Results: Iteratee in FSharpx". GitHub .
  3. "Java theory and practice: The exceptions debate". IBM developerWorks. Retrieved 17 May 2014.
  4. "Enumeratees". Play framework documentation. Retrieved 29 June 2013.
  5. 1 2 3 Kiselyov, O. (2012). "Iteratees". Functional and Logic Programming. Lecture Notes in Computer Science. Vol. 7294. pp. 166–181. doi:10.1007/978-3-642-29822-6_15. ISBN   978-3-642-29821-9.
  6. James Roper (10 December 2012). "Json.scala". play-iteratees-extras. Retrieved 29 June 2013.

Further reading