Thunk

Last updated

In computer programming, a thunk is a subroutine used to inject an additional calculation into another subroutine. Thunks are primarily used to delay a calculation until its result is needed, or to insert operations at the beginning or end of the other subroutine. They have many other applications in compiler code generation and modular programming.

Contents

The term originated as a humorous, incorrect, past participle of "think". That is, a "thunk value" becomes available after its calculation routine is thought through, or executed. [1]

Background

The early years of compiler research saw broad experimentation with different evaluation strategies. A key question was how to compile a subroutine call if the arguments can be arbitrary mathematical expressions rather than constants. One approach, known as "call by value", calculates all of the arguments before the call and then passes the resulting values to the subroutine. In the rival "call by name" approach, the subroutine receives the unevaluated argument expression and must evaluate it.

A simple implementation of "call by name" might substitute the code of an argument expression for each appearance of the corresponding parameter in the subroutine, but this can produce multiple versions of the subroutine and multiple copies of the expression code. As an improvement, the compiler can generate a helper subroutine, called a thunk, that calculates the value of the argument. The address and environment [lower-alpha 1] of this helper subroutine are then passed to the original subroutine in place of the original argument, where it can be called as many times as needed. Peter Ingerman first described thunks in reference to the ALGOL 60 programming language, which supported call-by-name evaluation. [3]

Applications

Functional programming

Although the software industry largely standardized on call-by-value and call-by-reference evaluation, [4] active study of call-by-name continued in the functional programming community. This research produced a series of lazy evaluation programming languages in which some variant of call-by-name is the standard evaluation strategy. Compilers for these languages, such as the Glasgow Haskell Compiler, have relied heavily on thunks, with the added feature that the thunks save their initial result so that they can avoid recalculating it; [5] this is known as memoization or call-by-need.

Functional programming languages have also allowed programmers to explicitly generate thunks. This is done in source code by wrapping an argument expression in an anonymous function that has no parameters of its own. This prevents the expression from being evaluated until a receiving function calls the anonymous function, thereby achieving the same effect as call-by-name. [6] The adoption of anonymous functions into other programming languages has made this capability widely available.

The following is a simple demonstration in JavaScript (ES6):

// 'hypot' is a binary functionconsthypot=(x,y)=>Math.sqrt(x*x+y*y);// 'thunk' is a function that takes no arguments and, when invoked, performs a potentially expensive// operation (computing a square root, in this example) and/or causes some side-effect to occurconstthunk=()=>hypot(3,4);// the thunk can then be passed around without being evaluated...doSomethingWithThunk(thunk);// ...or evaluatedthunk();// === 5

Object-oriented programming

Thunks are useful in object-oriented programming platforms that allow a class to inherit multiple interfaces, leading to situations where the same method might be called via any of several interfaces. The following code illustrates such a situation in C++.

classA{public:virtualintAccess()const{returnvalue_;}private:intvalue_;};classB{public:virtualintAccess()const{returnvalue_;}private:intvalue_;};classC:publicA,publicB{public:intAccess()constoverride{returnbetter_value_;}private:intbetter_value_;};intUse(B*b){returnb->Access();}intmain(){// ...Bsome_b;Use(&some_b);Csome_c;Use(&some_c);}

In this example, the code generated for each of the classes A, B and C will include a dispatch table that can be used to call Access on an object of that type, via a reference that has the same type. Class C will have an additional dispatch table, used to call Access on an object of type C via a reference of type B. The expression b->Access() will use B's own dispatch table or the additional C table, depending on the type of object b refers to. If it refers to an object of type C, the compiler must ensure that C's Access implementation receives an instance address for the entire C object, rather than the inherited B part of that object. [7]

As a direct approach to this pointer adjustment problem, the compiler can include an integer offset in each dispatch table entry. This offset is the difference between the reference's address and the address required by the method implementation. The code generated for each call through these dispatch tables must then retrieve the offset and use it to adjust the instance address before calling the method.

The solution just described has problems similar to the naïve implementation of call-by-name described earlier: the compiler generates several copies of code to calculate an argument (the instance address), while also increasing the dispatch table sizes to hold the offsets. As an alternative, the compiler can generate an adjustor thunk along with C's implementation of Access that adjusts the instance address by the required amount and then calls the method. The thunk can appear in C's dispatch table for B, thereby eliminating the need for callers to adjust the address themselves. [8]

Numerical calculations requiring evaluations at multiple points

Routines for computations such as integration need to calculate an expression at multiple points. Call by name was used for this purpose in languages that didn't support closures or procedure parameters.

Interoperability

Thunks have been widely used to provide interoperability between software modules whose routines cannot call each other directly. This may occur because the routines have different calling conventions, run in different CPU modes or address spaces, or at least one runs in a virtual machine. A compiler (or other tool) can solve this problem by generating a thunk that automates the additional steps needed to call the target routine, whether that is transforming arguments, copying them to another location, or switching the CPU mode. A successful thunk minimizes the extra work the caller must do compared to a normal call.

Much of the literature on interoperability thunks relates to various Wintel platforms, including MS-DOS, OS/2, [9] Windows [10] [11] [12] [13] and .NET, and to the transition from 16-bit to 32-bit memory addressing. As customers have migrated from one platform to another, thunks have been essential to support legacy software written for the older platforms.

The transition from 32-bit to 64-bit code on x86 also uses a form of thunking (WoW64). However, because the x86-64 address space is lager than the one available to 32-bit code, the old "generic thunk" mechanism could not be used to call 64-bit code from 32-bit code. [14] The only case of 32-bit code calling 64-bit code is in the WoW64's thunking of Windows APIs to 32-bit.

Overlays and dynamic linking

On systems that lack automatic virtual memory hardware, thunks can implement a limited form of virtual memory known as overlays. With overlays, a developer divides a program's code into segments that can be loaded and unloaded independently, and identifies the entry points into each segment. A segment that calls into another segment must do so indirectly via a branch table. When a segment is in memory, its branch table entries jump into the segment. When a segment is unloaded, its entries are replaced with "reload thunks" that can reload it on demand. [15]

Similarly, systems that dynamically link modules of a program together at run-time can use thunks to connect the modules. Each module can call the others through a table of thunks that the linker fills in when it loads the module. This way the modules can interact without prior knowledge of where they are located in memory. [16]

See also

Thunk technologies

Notes

  1. A thunk is an early limited type of closure. The environment passed for the thunk is that of the expression, not that of the called routine. [2]

Related Research Articles

In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.

Multiple dispatch or multimethods is a feature of some programming languages in which a function or method can be dynamically dispatched based on the run time (dynamic) type or, in the more general case, some other attribute of more than one of its arguments. This is a generalization of single dispatch polymorphism where a function or method call is dynamically dispatched based on the derived type of the object on which the method has been called. Multiple dispatch routes the dynamic dispatch to the implementing function or method using the combined characteristics of one or more arguments.

Visual Basic .NET (VB.NET) is a multi-paradigm, object-oriented programming language, implemented on the .NET Framework. Microsoft launched VB.NET in 2002 as the successor to its original Visual Basic language. Although the ".NET" portion of the name was dropped in 2005, this article uses "Visual Basic [.NET]" to refer to all Visual Basic languages released since 2002, in order to distinguish between them and the classic Visual Basic. Along with Visual C#, it is one of the two main languages targeting the .NET framework.

x86 Assembly Language is a family of backward-compatible assembly languages, which provide some level of compatibility all the way back to the Intel 8008 introduced in April 1972. x86 assembly languages are used to produce object code for the x86 class of processors. Like all assembly languages, it uses short mnemonics to represent the fundamental instructions that the CPU in a computer can understand and follow. Compilers sometimes produce assembly code as an intermediate step when translating a high level program into machine code. Regarded as a programming language, assembly coding is machine-specific and low level. Assembly languages are more typically used for detailed and time critical applications such as small real-time embedded systems or operating system kernels and device drivers.

In computer programming, a parameter or a formal argument, is a special kind of variable, used in a subroutine to refer to one of the pieces of data provided as input to the subroutine. These pieces of data are the values of the arguments with which the subroutine is going to be called/invoked. An ordered list of parameters is usually included in the definition of a subroutine, so that, each time the subroutine is called, its arguments for that call are evaluated, and the resulting values can be assigned to the corresponding parameters.

The syntax of the C programming language is the set of rules governing writing of software in the language. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In computer systems a loader is the part of an operating system that is responsible for loading programs and libraries. It is one of the essential stages in the process of starting a program, as it places programs into memory and prepares them for execution. Loading a program involves reading the contents of the executable file containing the program instructions into memory, and then carrying out other required preparatory tasks to prepare the executable for running. Once loading is complete, the operating system starts the program by passing control to the loaded program code.

In computer programming, a function object is a construct allowing an object to be invoked or called as if it were an ordinary function, usually with the same syntax. Function objects are often called functors.

Short-circuit evaluation, minimal evaluation, or McCarthy evaluation is the semantics of some Boolean operators in some programming languages in which the second argument is executed or evaluated only if the first argument does not suffice to determine the value of the expression: when the first argument of the AND function evaluates to false, the overall value must be false; and when the first argument of the OR function evaluates to true, the overall value must be true.

In compiler construction, name mangling is a technique used to solve various problems caused by the need to resolve unique names for programming entities in many modern programming languages.

In functional programming, continuation-passing style (CPS) is a style of programming in which control is passed explicitly in the form of a continuation. This is contrasted with direct style, which is the usual style of programming. Gerald Jay Sussman and Guy L. Steele, Jr. coined the phrase in AI Memo 349 (1975), which sets out the first version of the Scheme programming language. John C. Reynolds gives a detailed account of the numerous discoveries of continuations.

A virtual method table (VMT), virtual function table, virtual call table, dispatch table, vtable, or vftable is a mechanism used in a programming language to support dynamic dispatch.

In computer science, a calling convention is an implementation-level (low-level) scheme for how subroutines receive parameters from their caller and how they return a result. Differences in various implementations include where parameters, return values, return addresses and scope links are placed, and how the tasks of preparing for a function call and restoring the environment afterwards are divided between the caller and the callee.

In computer programming, the term hooking covers a range of techniques used to alter or augment the behaviour of an operating system, of applications, or of other software components by intercepting function calls or messages or events passed between software components. Code that handles such intercepted function calls, events or messages is called a hook.

A class in C++ is a user-defined type or data structure declared with keyword class that has data and functions as its members whose access is governed by the three access specifiers private, protected or public. By default access to members of a C++ class is private. The private members are not accessible outside the class; they can be accessed only through methods of the class. The public members form an interface to the class and are accessible outside the class.

C++11 is a version of the standard for the programming language C++. It was approved by International Organization for Standardization (ISO) on 12 August 2011, replacing C++03, superseded by C++14 on 18 August 2014 and later, by C++17. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

This article describes the calling conventions used when programming x86 architecture microprocessors.

Comparison of programming languages is a common topic of discussion among software engineers. Basic instructions of several programming languages are compared here.

The Perl virtual machine is a stack-based process virtual machine implemented as an opcodes interpreter which runs previously compiled programs written in the Perl language. The opcodes interpreter is a part of the Perl interpreter, which also contains a compiler in one executable file, commonly /usr/bin/perl on various Unix-like systems or perl.exe on Microsoft Windows systems.

In computer programming, a subroutine is a sequence of program instructions that performs a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed.

References

  1. Eric Raymond rejects "a couple of onomatopoeic myths circulating about the origin of this term" and cites the inventors of the thunk recalling that the term "was coined after they realized (in the wee hours after hours of discussion) that the type of an argument in Algol-60 could be figured out in advance with a little compile-time thought [...] In other words, it had 'already been thought of'; thus it was christened a thunk, which is 'the past tense of "think" at two in the morning'. See: Raymond, Eric S. (1996). Raymond, Eric S. (ed.). The New Hacker's Dictionary. MIT Press. p. 445. ISBN   9780262680929 . Retrieved 2015-05-25.
  2. E. T. Irons (1961-01-01). "Comments on the Implementation of Recursive Procedures and Blocks in ALGOL". Communications of the ACM. Association for Computing Machinery (ACM). 4 (1): 65–69. doi:10.1145/366062.366084. ISSN   0001-0782.
  3. Ingerman, P. Z. (1961-01-01). "Thunks: a way of compiling procedure statements with some comments on procedure declarations". Communications of the ACM. Association for Computing Machinery (ACM). 4 (1): 55–58. doi:10.1145/366062.366084. ISSN   0001-0782.
  4. Scott, Michael (2009). Programming Language Pragmatics. p. 395.
  5. Marlow, Simon (2013). Parallel and Concurrent Programming in Haskell. p. 10.
  6. Queinnec, Christian (2003). Lisp in Small Pieces. p. 176.
  7. Stroustrup, Bjarne (Fall 1989). "Multiple Inheritance for C++" (PDF). Computing Systems. USENIX. 1 (4). Retrieved 2014-08-04.
  8. Driesen, Karel; Hölzle, Urs (1996). "The Direct Cost of Virtual Function Calls in C++" (PDF). OOPSLA . Retrieved 2011-02-24.Cite journal requires |journal= (help)
  9. Calcote, John (May 1995). "Thunking: Using 16-Bit Libraries in OS/2 2.0". OS/2 Developer Magazine. 7 (3).
  10. King, Adrian (1994). Inside Microsoft Windows 95 (2nd ed.). Redmond, Washington, USA: Microsoft Press. ISBN   1-55615-626-X.
  11. Programmer's Guide to Microsoft Windows 95: Key Topics on Programming for Windows from the Microsoft Windows Development Team. Technical Reference (1st ed.). Redmond, Washington, USA: Microsoft Press. 1995-07-01. ISBN   1-55615-834-3 . Retrieved 2016-05-26.
  12. Hazzah, Karen (1997). Writing Windows VxDs and Device Drivers - Programming Secrets for Virtual Device Drivers (2nd printing, 2nd ed.). Lawrence, Kansas, USA: R&D Books / Miller Freeman, Inc. ISBN   0-87930-438-3.
  13. Kauler, Barry (August 1997). Windows Assembly Language and Systems Programming - 16- and 32-Bit Low-Level Programming for the PC and Windows (2nd ed.). Lawrence, Kansas, USA: R&D Books / Miller Freeman, Inc. ISBN   0-87930-474-X.
  14. "Why can't you thunk between 32-bit and 64-bit Windows?". The Old New Thing. 2008-10-20.
  15. Bright, Walter (1990-07-01). "Virtual Memory For 640K DOS". Dr. Dobb's Journal. Retrieved 2014-03-06.
  16. Levine, John R. (2000) [October 1999]. Linkers and Loaders. The Morgan Kaufmann Series in Software Engineering and Programming (1 ed.). San Francisco, USA: Morgan Kaufmann. ISBN   1-55860-496-0. OCLC   42413382. ISBN   978-1-55860-496-4. Archived from the original on 2012-12-05. Retrieved 2020-01-12. Code: Errata: