Threaded code

In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.

Threaded code has better density than code generated by alternative generation techniques and by alternative calling conventions. In cached architectures, it may execute slightly slower.[ citation needed ] However, a program that is small enough to fit in a computer processor's cache may run faster than a larger program that suffers many cache misses. [1] Small programs may also be faster at thread switching, when other programs have filled the cache.

Threaded code is best known for its use in many compilers of programming languages, such as Forth, many implementations of BASIC, some implementations of COBOL, early versions of B, [2] and other languages for small minicomputers and for amateur radio satellites.[ citation needed ]

History

The common way to make computer programs is to use a compiler to translate source code (written in some symbolic language) to machine code. The resulting executable is typically fast but, because it is specific to a hardware platform, it isn't portable. A different approach is to generate instructions for a virtual machine and to use an interpreter on each hardware platform. The interpreter instantiates the virtual machine environment and executes the instructions. Thus, only the interpreter must be compiled.

Early computers had relatively little memory. For example, many Data General Nova and IBM 1130 systems, and many of the first microcomputers, had only 4 kB of RAM installed. Consequently, a lot of time was spent trying to find ways to reduce a program's size, so it would fit in the available memory.

One solution is to use an interpreter which reads the symbolic language a bit at a time, and calls functions to perform the actions. As the source code is typically much denser than the resulting machine code, this can reduce overall memory use. This was the reason Microsoft BASIC is an interpreter: [note 1] its own code had to share the 4 kB memory of machines like the Altair 8800 with the user's source code. A compiler translates from a source language to machine code, so the compiler, source, and output must all be in memory at the same time. In an interpreter there is no output, so only the interpreter and the source need to fit in memory.

Threaded code is a formatting style for compiled code that minimizes memory use. Instead of writing out every step of an operation at its every occurrence in the program, as was common in macro assemblers for instance, the compiler writes each common bit of code into a subroutine. Thus, each bit exists in only one place in memory (see "Don't repeat yourself"). The top-level application in these programs may consist of nothing but subroutine calls. Many of these subroutines, in turn, also consist of nothing but lower-level subroutine calls.

Mainframes and some early microprocessors such as the RCA 1802 required several instructions to call a subroutine. In the top-level application and in many subroutines, that sequence is constantly repeated, with only the subroutine address changing from one call to the next. This means that a program consisting of many function calls may have considerable amounts of repeated code as well.

To address this, threaded code systems used pseudo-code to represent function calls in a single operator. At run time, a tiny "interpreter" would scan over the top-level code, extract the subroutine's address in memory, and call it. In other systems, this same basic concept is implemented as a branch table, dispatch table, or virtual method table, all of which consist of a table of subroutine addresses.

During the 1970s, hardware designers spent considerable effort to make subroutine calls faster and simpler. On the improved designs, only a single instruction is expended to call a subroutine, so the use of a pseudo-instruction saves no room.[ citation needed ] Additionally, the performance of these calls is almost free of additional overhead. Today, though almost all programming languages focus on isolating code into subroutines, they do so for code clarity and maintainability, not to save space.

Threaded code systems save room by replacing that list of function calls, where only the subroutine address changes from one call to the next, with a list of execution tokens, which are essentially function calls with the call opcode(s) stripped off, leaving behind only a list of addresses. [3] [4] [5] [6] [7]

Over the years, programmers have created many variations on that "interpreter" or "small selector". The particular address in the list of addresses may be extracted using an index, general-purpose register or pointer. The addresses may be direct or indirect, contiguous or non-contiguous (linked by pointers), relative or absolute, resolved at compile time or dynamically built. No single variation is "best" for all situations.

Development

To save space, programmers squeezed the lists of subroutine calls into simple lists of subroutine addresses, and used a small loop to call each subroutine in turn. For example, the following pseudocode uses this technique to add two numbers A and B. In the example, the list is labeled thread and a variable ip (Instruction Pointer) tracks our place within the list. Another variable sp (Stack Pointer) contains an address elsewhere in memory that is available to hold a value temporarily.

start:
  ip = &thread  // points to the address '&pushA', not the textual label 'thread'
top:
  jump *ip++    // follow ip to address in thread, follow that address to subroutine, advance ip
thread:
  &pushA
  &pushB
  &add
  ...
pushA:
  *sp++ = A     // follow sp to available memory, store A there, advance sp to next
  jump top
pushB:
  *sp++ = B
  jump top
add:
  addend1 = *--sp            // Pop the top value off the stack
  addend2 = *--sp            // Pop the second value off the stack
  *sp++ = addend1 + addend2  // Add the two values together and store the result on the top of the stack
  jump top


The calling loop at top is so simple that it can be repeated inline at the end of each subroutine. Control now jumps once, from the end of a subroutine to the start of another, instead of jumping twice via top. For example:

start:
  ip = &thread  // ip points to &pushA (which points to the first instruction of pushA)
  jump *ip++    // send control to first instruction of pushA and advance ip to &pushB
thread:
  &pushA
  &pushB
  &add
  ...
pushA:
  *sp++ = A     // follow sp to available memory, store A there, advance sp to next
  jump *ip++    // send control where ip says to (i.e. to pushB) and advance ip
pushB:
  *sp++ = B
  jump *ip++
add:
  addend1 = *--sp            // Pop the top value off the stack
  addend2 = *--sp            // Pop second value off the stack
  *sp++ = addend1 + addend2  // Add the two values together and store the result on top of the stack
  jump *ip++

This is called direct threaded code (DTC). Although the technique is older, the first widely circulated use of the term "threaded code" is probably James R. Bell's 1973 article "Threaded Code". [8]

In 1970, Charles H. Moore invented a more compact arrangement, indirect threaded code (ITC), for his Forth virtual machine. Moore arrived at this arrangement because Nova minicomputers had an indirection bit in every address, which made ITC easy and fast. Later, he said that he found it so convenient that he propagated it into all later Forth designs. [9]

Today, some Forth compilers generate direct-threaded code while others generate indirect-threaded code. The executables act the same either way.

Threading models

Practically all executable threaded code uses one or another of these methods for invoking subroutines (each method is called a "threading model").

Direct threading

Addresses in the thread are the addresses of machine language. This form is simple, but may have overheads because the thread consists only of machine addresses, so all further parameters must be loaded indirectly from memory. Some Forth systems produce direct-threaded code. On many machines direct-threading is faster than subroutine threading (see reference below).

For example, a stack machine might execute the sequence "push A, push B, add". That might be translated to the following thread and routines, where ip is initialized to the address labeled thread (i.e., the address where &pushA is stored).

#define PUSH(x) (*sp++ = (x))
#define POP() (*--sp)

start:
  ip = &thread  // ip points to &pushA (which points to the first instruction of pushA)
  jump *ip++    // send control to first instruction of pushA and advance ip to &pushB
thread:
  &pushA
  &pushB
  &add
  ...
pushA:
  PUSH(A)
  jump *ip++    // send control where ip says to (i.e. to pushB) and advance ip
pushB:
  PUSH(B)
  jump *ip++
add:
  result = POP() + POP()
  PUSH(result)
  jump *ip++
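For concreteness, the same direct-threaded fragment can be sketched in compilable C using the computed-goto extension of GCC and Clang (&&label and goto *); the names used here (thread_storage, NEXT, the 16-cell stack) are illustrative choices, not part of the original example. This is essentially the trick that portable Forth kernels written in GNU C, such as Gforth, use to obtain threaded code without assembly language.

#include <stdio.h>

int main(void) {
    int stack[16], *sp = stack;          // small operand stack
    int A = 2, B = 3;

    void *thread_storage[4];             // the "thread": a list of code addresses
    void **ip = thread_storage;          // virtual instruction pointer
    thread_storage[0] = &&pushA;
    thread_storage[1] = &&pushB;
    thread_storage[2] = &&add;
    thread_storage[3] = &&done;

    #define NEXT goto **ip++             // the inner interpreter: jump where ip points, advance ip
    NEXT;                                // start executing the thread

pushA: *sp++ = A; NEXT;
pushB: *sp++ = B; NEXT;
add:  { int b = *--sp, a = *--sp; *sp++ = a + b; } NEXT;
done:
    printf("%d\n", sp[-1]);              // prints 5
    return 0;
}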

Alternatively, operands may be included in the thread. This can remove some indirection needed above, but makes the thread larger:

#define PUSH(x) (*sp++ = (x))
#define POP() (*--sp)

start:
  ip = &thread
  jump *ip++
thread:
  &push
  &A  // address where A is stored, not literal A
  &push
  &B
  &add
  ...
push:
  variable_address = *ip++  // must move ip past the operand address, since it is not a subroutine address
  PUSH(*variable_address)   // Read value from variable and push on stack
  jump *ip++
add:
  result = POP() + POP()
  PUSH(result)
  jump *ip++

Indirect threading

Indirect threading uses pointers to locations that in turn point to machine code. The indirect pointer may be followed by operands which are stored in the indirect "block" rather than storing them repeatedly in the thread. Thus, indirect code is often more compact than direct-threaded code. The indirection typically makes it slower, though usually still faster than bytecode interpreters. Where the handler operands include both values and types, the space savings over direct-threaded code may be significant. Older FORTH systems typically produce indirect-threaded code.

For example, if the goal is to execute "push A, push B, add", the following might be used. Here, ip is initialized to address &thread, each code fragment (push, add) is found by double-indirecting through ip and an indirect block; and any operands to the fragment are found in the indirect block following the fragment's address. This requires keeping the current subroutine in ip, unlike all previous examples where it contained the next subroutine to be called.

start:
  ip = &thread  // points to '&i_pushA'
  jump *(*ip)   // follow pointers to 1st instruction of 'push', DO NOT advance ip yet
thread:
  &i_pushA
  &i_pushB
  &i_add
  ...
i_pushA:
  &push
  &A
i_pushB:
  &push
  &B
i_add:
  &add
push:
  *sp++ = *(*ip + 1)  // look 1 past start of indirect block for operand address
  jump *(*++ip)       // advance ip in thread, jump through next indirect block to next subroutine
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  jump *(*++ip)

Subroutine threading

So-called "subroutine-threaded code" (also "call-threaded code") consists of a series of machine-language "call" instructions (or addresses of functions to "call", as opposed to direct threading's use of "jump"). Early compilers for ALGOL, Fortran, Cobol and some Forth systems often produced subroutine-threaded code. The code in many of these systems operated on a last-in-first-out (LIFO) stack of operands, for which compiler theory was well-developed. Most modern processors have special hardware support for subroutine "call" and "return" instructions, so the overhead of one extra machine instruction per dispatch is somewhat diminished.

Anton Ertl, the Gforth compiler's co-creator, stated that "in contrast to popular myths, subroutine threading is usually slower than direct threading". [10] However, Ertl's most recent tests [1] show that subroutine threading is faster than direct threading in 15 out of 25 test cases. More specifically, he found that direct threading is the fastest threading model on Xeon, Opteron, and Athlon processors, indirect threading is fastest on Pentium M processors, and subroutine threading is fastest on Pentium 4, Pentium III, and PPC processors.

As an example of call threading for "push A, push B, add":

thread:
  call pushA
  call pushB
  call add
  ret
pushA:
  *sp++ = A
  ret
pushB:
  *sp++ = B
  ret
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  ret
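In a high-level language, call threading can be approximated with a table of function addresses and a trivial loop that calls each one in turn. The following standard C sketch illustrates the idea; the identifiers (op_pushA, data_stack, the NULL terminator) are illustrative assumptions rather than anything prescribed by the technique.

#include <stdio.h>

static int data_stack[16];
static int *dsp = data_stack;
static int A = 2, B = 3;

static void op_pushA(void) { *dsp++ = A; }
static void op_pushB(void) { *dsp++ = B; }
static void op_add(void)   { int b = *--dsp, a = *--dsp; *dsp++ = a + b; }

typedef void (*word_t)(void);

int main(void) {
    // The "thread": addresses of subroutines, invoked with calls rather than jumps.
    static const word_t thread[] = { op_pushA, op_pushB, op_add, NULL };

    for (const word_t *ip = thread; *ip != NULL; ip++)
        (*ip)();                         // one call per thread entry

    printf("%d\n", dsp[-1]);             // prints 5
    return 0;
}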

Token threading

Token-threaded code implements the thread as a list of indices into a table of operations; the index width is naturally chosen to be as small as possible for density and efficiency. One byte (8 bits) is the natural choice for ease of programming, but smaller sizes such as 4 bits, or larger ones such as 12 or 16 bits, can be used depending on the number of operations supported. As long as the index width is chosen to be narrower than a machine pointer, it will naturally be more compact than the other threading types without much special effort by the programmer. It is usually half to three-fourths the size of other threadings, which are themselves a quarter to an eighth the size of non-threaded code. The table's pointers can be either indirect or direct. Some Forth compilers produce token-threaded code. Some programmers consider the "p-code" generated by some Pascal compilers, as well as the bytecodes used by .NET, Java, BASIC and some C compilers, to be token threading.

A common approach, historically, is bytecode, which typically uses 8-bit opcodes with a stack-based virtual machine. The archetypal bytecode interpreter is known as a "decode and dispatch interpreter" and follows the form:

start:
  vpc = &thread
dispatch:
  addr = decode(&vpc)  // Convert the next bytecode operation to a pointer to machine code that implements it
  // Any inter-instruction operations are performed here (e.g. updating global state, event processing, etc.)
  jump addr

CODE_PTR decode(BYTE_CODE **p) {
  // In a more complex encoding, there may be multiple tables to choose between or control/mode flags
  return table[*(*p)++];
}

thread:  /* Contains bytecode, not machine addresses.  Hence it is more compact. */
  1 /* pushA */
  2 /* pushB */
  0 /* add */

table:
  &add    /* table[0] = address of machine code that implements bytecode 0 */
  &pushA  /* table[1] ... */
  &pushB  /* table[2] ... */

pushA:
  *sp++ = A
  jump dispatch
pushB:
  *sp++ = B
  jump dispatch
add:
  addend1 = *--sp
  addend2 = *--sp
  *sp++ = addend1 + addend2
  jump dispatch

If the virtual machine uses only byte-size instructions, decode() is simply a fetch from thread, but often there are commonly used 1-byte instructions plus some less-common multibyte instructions (see complex instruction set computer), in which case decode() is more complex. The decoding of single byte opcodes can be very simply and efficiently handled by a branch table using the opcode directly as an index.
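A compilable C sketch of such a decode-and-dispatch interpreter might look as follows, with single-byte opcodes indexing a branch table of handlers. The opcode numbering mirrors the pseudocode above; everything else, including the OP_HALT sentinel, is an illustrative assumption. Replacing the function-pointer table with a switch statement gives the other common form of decode-and-dispatch interpreter.

#include <stdio.h>
#include <stdint.h>

enum { OP_ADD = 0, OP_PUSH_A = 1, OP_PUSH_B = 2, OP_HALT = 3 };

static int stack[16], *sp = stack;
static int A = 2, B = 3;

static void do_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }
static void do_pushA(void) { *sp++ = A; }
static void do_pushB(void) { *sp++ = B; }

int main(void) {
    // Bytecode thread: one byte per operation, far denser than a list of addresses.
    static const uint8_t thread[] = { OP_PUSH_A, OP_PUSH_B, OP_ADD, OP_HALT };

    // Branch table indexed directly by the opcode (the "decode" step).
    static void (*const table[])(void) = { do_add, do_pushA, do_pushB };

    for (const uint8_t *vpc = thread; *vpc != OP_HALT; vpc++)
        table[*vpc]();                   // decode and dispatch

    printf("%d\n", sp[-1]);              // prints 5
    return 0;
}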

For instructions where the individual operations are simple, such as "push" and "add", the overhead involved in deciding what to execute is larger than the cost of actually executing it, so such interpreters are often much slower than machine code. However, for more complex ("compound") instructions, the overhead percentage is proportionally less significant.

Token-threaded code can sometimes run faster than the equivalent machine code when that machine code is too large to fit in the physical CPU's L1 instruction cache. The higher code density of threaded code, especially token-threaded code, can allow it to fit entirely in the L1 cache when it otherwise would not, thereby avoiding cache thrashing. However, threaded code consumes both instruction cache (for the implementation of each operation) and data cache (for the bytecode and tables), unlike machine code, which consumes only instruction cache; this means threaded code eats into the budget for the amount of data that can be held for processing by the CPU at any given time. In any case, if the problem being computed involves applying a large number of operations to a small amount of data, then using threaded code may be an ideal optimization. [4]

Huffman threading

Huffman threaded code consists of lists of tokens stored as Huffman codes. A Huffman code is a variable-length string of bits that identifies a unique token. A Huffman-threaded interpreter locates subroutines using an index table or a tree of pointers that can be navigated by the Huffman code. Huffman-threaded code is one of the most compact representations known for a computer program. The index and codes are chosen by measuring the frequency of calls to each subroutine in the code. Frequent calls are given the shortest codes. Operations with approximately equal frequencies are given codes with nearly equal bit-lengths. Most Huffman-threaded systems have been implemented as direct-threaded Forth systems, and used to pack large amounts of slow-running code into small, cheap microcontrollers. Most published [11] uses have been in smart cards, toys, calculators, and watches. The bit-oriented tokenized code used in PBASIC can be seen as a kind of Huffman-threaded code.
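As a rough illustration only (the codes, the tree, and the handler names below are invented for this example and are not taken from any published Huffman-threaded system), a decoder can walk a small tree of pointers one bit at a time until it reaches a handler:

#include <stdio.h>

static int stack[16], *sp = stack;
static int A = 2, B = 3;

static void do_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }
static void do_pushA(void) { *sp++ = A; }
static void do_pushB(void) { *sp++ = B; }

// A decoding tree: internal nodes are followed bit by bit, leaves hold handlers.
typedef struct node {
    const struct node *child[2];
    void (*handler)(void);               // non-NULL only at a leaf
} node;

static const node n_add   = { { 0, 0 }, do_add   };   // code 0  (most frequent)
static const node n_pushA = { { 0, 0 }, do_pushA };   // code 10
static const node n_pushB = { { 0, 0 }, do_pushB };   // code 11
static const node n_1x    = { { &n_pushA, &n_pushB }, 0 };
static const node root    = { { &n_add,   &n_1x    }, 0 };

// Thread for "pushA pushB add": bits 10 11 0, packed MSB-first into one byte.
static const unsigned char thread[] = { 0xB0 };        // 1011 0000

static int next_bit(void) {
    static unsigned pos = 0;
    int bit = (thread[pos / 8] >> (7 - pos % 8)) & 1;
    pos++;
    return bit;
}

int main(void) {
    for (int op = 0; op < 3; op++) {     // decode and run three operations
        const node *n = &root;
        while (n->handler == NULL)
            n = n->child[next_bit()];
        n->handler();
    }
    printf("%d\n", sp[-1]);              // prints 5
    return 0;
}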

Lesser-used threading

An example is string threading, in which operations are identified by strings, usually looked up by a hash table. This was used in Charles H. Moore's earliest Forth implementations and in the University of Illinois's experimental hardware-interpreted computer language. It is also used in Bashforth.
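A minimal C sketch of the idea, with a linear search standing in for the hash table and invented names throughout:

#include <stdio.h>
#include <string.h>

static int stack[16], *sp = stack;
static int A = 2, B = 3;

static void do_pushA(void) { *sp++ = A; }
static void do_pushB(void) { *sp++ = B; }
static void do_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }

// The dictionary maps operation names to routines; a real system would hash the names.
static const struct { const char *name; void (*fn)(void); } dict[] = {
    { "pushA", do_pushA }, { "pushB", do_pushB }, { "add", do_add },
};

static void execute(const char *name) {
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].name, name) == 0) { dict[i].fn(); return; }
}

int main(void) {
    const char *thread[] = { "pushA", "pushB", "add" };  // operations named by strings
    for (size_t i = 0; i < 3; i++)
        execute(thread[i]);
    printf("%d\n", sp[-1]);                              // prints 5
    return 0;
}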

RPL

HP's RPL, first introduced in the HP-18C calculator in 1986, is a type of proprietary hybrid (direct-threaded and indirect-threaded) threaded interpretive language (TIL) [12] that, unlike other TILs, allows embedding of RPL "objects" into the "runstream", i.e. the stream of addresses through which the interpreter pointer advances. An RPL "object" can be thought of as a special data type whose in-memory structure contains an address to an "object prolog" at the start of the object, with data or executable code following. The object prolog determines how the object's body should be executed or processed. Using the "RPL inner loop", [13] which was invented and patented [14] by William C. Wickes in 1986 and published in 1988, execution proceeds as follows: [15]

  1. Dereference the IP (instruction pointer) and store it into O (current object pointer)
  2. Increment the IP by the length of one address pointer
  3. Dereference O and store its address in O_1 (this is the second level of indirection)
  4. Transfer control to next pointer or embedded object by setting the PC (program counter) to O_1 plus one address pointer
  5. Go back to step 1

This can be represented more precisely by:

    O  = [I]
    I  = I + Δ
    PC = [O] + Δ

Here, O is the current object pointer, I is the interpreter pointer, Δ is the length of one address word, and the "[]" operator stands for "dereference".

When control is transferred to an object pointer or an embedded object, execution continues as follows:

PROLOG -> PROLOG (The prolog address at the start of the prolog code points to itself)
    IF O + Δ =/= PC
    THEN GOTO INDIRECT (Test for direct execution)
        O = I - Δ (Correct O to point to start of embedded object)
        I = I + α (Correct I to point after embedded object where α is the length of the object)
    INDIRECT (Rest of prolog)

On HP's Saturn microprocessors that use RPL, there is a third level of indirection made possible by an architectural / programming trick which allows faster execution. [13]

Branches

In all interpreters, a branch simply changes the thread pointer (ip) to a different address in the thread. A conditional jump-if-zero branch, which jumps only if the top-of-stack value is zero, could be implemented as shown below. This example uses the embedded-parameter version of direct threading, so the &thread[123] entry is the destination to jump to if the condition is true; it must therefore be skipped over (ip++) if the branch is not taken.

thread:
  ...
  &brz
  &thread[123]
  ...
brz:
  when_true_ip = *ip++  // Get destination address for branch
  if (*--sp == 0)       // Pop/Consume top of stack and check if it's zero
    ip = when_true_ip
  jump *ip++

Common amenities

Separating the data and return stacks in a machine eliminates a great deal of stack management code, substantially reducing the size of the threaded code. The dual-stack principle originated three times independently: for Burroughs large systems, Forth, and PostScript. It is used in some Java virtual machines.

Three registers are often present in a threaded virtual machine, and another exists for passing data between subroutines ('words'). These are:

  1. ip or i (instruction pointer) of the virtual machine (not to be confused with the program counter of the underlying hardware)
  2. w (work pointer)
  3. rp or r (return stack pointer)
  4. sp or s (parameter stack pointer, used for passing data between words)

Threaded virtual machines, such as implementations of Forth, often have a simple inner interpreter at their heart, consisting of three primitives. Those are:

  1. nest, also called docol
  2. unnest, or semi_s (;s)
  3. next

In an indirect-threaded virtual machine, such as the one given here, the operations are:

next:
  *ip++ -> w   // fetch the address of the next word from the thread into w
  jump **w++   // jump through that word's code field; w is left pointing at its body
nest:
  ip -> *rp++  // save the current thread position on the return stack
  w -> ip      // continue threading through the called word's body
  next
unnest:
  *--rp -> ip  // restore the caller's thread position
  next
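A compilable rendering of these three primitives, again relying on the computed-goto extension of GCC and Clang, might look like the following. The word layout (a code field followed by a body of word addresses) follows the indirect-threaded model described above; the specific names (w_sum, w_exit, and so on) are illustrative, and the separate data and return stacks reflect the dual-stack principle mentioned earlier.

#include <stdio.h>

typedef void *cell;

int main(void) {
    int   dstack[16], *sp = dstack;      // data (parameter) stack
    cell *rstack[16], **rp = rstack;     // return stack
    int A = 2, B = 3;

    // Primitive words consist of a code field only.
    static cell w_pushA[1], w_pushB[1], w_add[1], w_exit[1], w_halt[1];
    // Colon definition "sum": code field (nest) followed by a body of word addresses.
    static cell w_sum[5];
    // Top-level thread: run "sum", then stop.
    static cell thread[2];

    cell *ip, *w;

    w_pushA[0] = &&pushA;
    w_pushB[0] = &&pushB;
    w_add[0]   = &&add;
    w_exit[0]  = &&unnest;
    w_halt[0]  = &&halt;
    w_sum[0] = &&nest;
    w_sum[1] = w_pushA;  w_sum[2] = w_pushB;  w_sum[3] = w_add;  w_sum[4] = w_exit;
    thread[0] = w_sum;   thread[1] = w_halt;

    #define NEXT do { w = (cell *)*ip++; goto **w; } while (0)

    ip = thread;
    NEXT;                                // start the inner interpreter

nest:   *rp++ = ip; ip = w + 1; NEXT;    // enter a colon definition: save ip, thread its body
unnest: ip = *--rp; NEXT;                // leave it: restore the caller's ip
pushA:  *sp++ = A; NEXT;
pushB:  *sp++ = B; NEXT;
add:  { int b = *--sp, a = *--sp; *sp++ = a + b; } NEXT;
halt:
    printf("%d\n", sp[-1]);              // prints 5
    return 0;
}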

Notes

  1. Dartmouth BASIC, upon which Microsoft BASIC is ultimately based, was a compiler that ran on mainframe machines.


References

  1. 1 2 "Speed of various interpreter dispatch techniques V2".
  2. Dennis M. Ritchie, "The Development of the C Language", 1993. Quote: "The B compiler on the PDP-7 did not generate machine instructions, but instead 'threaded code' ..."
  3. David Frech. "muforth readme". section "Simple and tail-recursive native compiler".
  4. 1 2 Steve Heller. "Efficient C/C++ Programming: Smaller, Faster, Better". 2014. Chapter 5: "Do you need an interpreter?" p. 195.
  5. Jean-Paul Tremblay; P. G. Sorenson. "The Theory and Practice of Compiler Writing". 1985. p. 527
  6. "Wireless World: Electronics, Radio, Television, Volume 89". p. 73.
  7. "Byte, Volume 5". 1980. p. 212
  8. Bell, James R. (1973). "Threaded code". Communications of the ACM. 16 (6): 370–372. doi: 10.1145/362248.362270 . S2CID   19042952.
  9. Moore, Charles H., published remarks in Byte Magazine's Forth Issue
  10. Ertl, Anton. "What is Threaded Code?".
  11. Latendresse, Mario; Feeley, Marc. Generation of Fast Interpreters for Huffman-Compressed Bytecode. Elsevier. CiteSeerX   10.1.1.156.2546 .
  12. Loelinger, R. G. (1981) [August 1979]. Written at Dayton, Ohio, USA. Threaded Interpretive Languages: Their Design and Implementation (2nd printing, 1st ed.). Peterborough, New Hampshire, UK: BYTE Books, BYTE Publications Inc. ISBN   0-07038360-X. LCCN   80-19392. ISBN   978-0-07038360-9 . Retrieved 2023-08-03. (xiv+2+251 pages)
  13. 1 2 Busby, Jonathan (2018-09-07). "The RPL inner loop explained". The Museum of HP Calculators. Archived from the original on 2023-08-03. Retrieved 2019-12-27.
  14. Wickes, William C. (1986-05-30). "Data processing system and method for the direct and indirect execution of uniformly structured object types". uspto.gov. Retrieved 2019-12-27.
  15. Wickes, William C. (1988-10-01) [14–18 June 1988]. Forsely, Lawrence P. (ed.). RPL: A Mathematical Control Language. Proceedings of the 1988 Rochester Forth Conference: Programming Environments. Vol. 8. Rochester, New York, USA: Institute for Applied Forth Research, Inc., University of Rochester. ISBN   978-0-91459308-9. OCLC   839704944. (NB. This title is often cited as "RPL: A Mathematics Control Language". An excerpt is available at: )
