Type safety

Last updated

In computer science, type safety and type soundness are the extent to which a programming language discourages or prevents type errors. Type safety is sometimes alternatively considered to be a property of facilities of a computer language; that is, some facilities are type-safe and their usage will not result in type errors, while other facilities in the same language may be type-unsafe and a program using them may encounter type errors. The behaviors classified as type errors by a given programming language are usually those that result from attempts to perform operations on values that are not of the appropriate data type, e.g., adding a string to an integer when there's no definition on how to handle this case. This classification is partly based on opinion.

Contents

Type enforcement can be static, catching potential errors at compile time, or dynamic, associating type information with values at run-time and consulting them as needed to detect imminent errors, or a combination of both. [1] Dynamic type enforcement often allows programs to run that would be invalid under static enforcement.

In the context of static (compile-time) type systems, type safety usually involves (among other things) a guarantee that the eventual value of any expression will be a legitimate member of that expression's static type. The precise requirement is more subtle than this — see, for example, subtyping and polymorphism for complications.

Definitions

Intuitively, type soundness is captured by Robin Milner's pithy statement that

Well-typed programs cannot "go wrong". [2]

In other words, if a type system is sound, then expressions accepted by that type system must evaluate to a value of the appropriate type (rather than produce a value of some other, unrelated type or crash with a type error). Vijay Saraswat provides the following, related definition:

A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data. [3]

However, what precisely it means for a program to be "well typed" or to "go wrong" are properties of its static and dynamic semantics, which are specific to each programming language. Consequently, a precise, formal definition of type soundness depends upon the style of formal semantics used to specify a language. In 1994, Andrew Wright and Matthias Felleisen formulated what has become the standard definition and proof technique for type safety in languages defined by operational semantics, [4] which is closest to the notion of type safety as understood by most programmers. Under this approach, the semantics of a language must have the following two properties to be considered type-sound:

Progress
A well-typed program never gets "stuck": every expression is either already a value or can be reduced towards a value in some well-defined way. In other words, the program never gets into an undefined state where no further transitions are possible.
Preservation (or subject reduction)
After each evaluation step, the type of each expression remains the same (that is, its type is preserved).

A number of other formal treatments of type soundness have also been published in terms of denotational semantics and structural operational semantics. [2] [5] [6]

Relation to other forms of safety

In isolation, type soundness is a relatively weak property, as it essentially just states that the rules of a type system are internally consistent and cannot be subverted. However, in practice, programming languages are designed so that well-typedness also entails other, stronger properties, some of which include:

Type-safe and type-unsafe languages

Type safety is usually a requirement for any toy language (i.e. esoteric language) proposed in academic programming language research. Many languages, on the other hand, are too big for human-generated type safety proofs, as they often require checking thousands of cases. Nevertheless, some languages such as Standard ML, which has rigorously defined semantics, have been proved to meet one definition of type safety. [8] Some other languages such as Haskell are believed[ discuss ] to meet some definition of type safety, provided certain "escape" features are not used (for example Haskell's unsafePerformIO, used to "escape" from the usual restricted environment in which I/O is possible, circumvents the type system and so can be used to break type safety. [9] ) Type punning is another example of such an "escape" feature. Regardless of the properties of the language definition, certain errors may occur at run-time due to bugs in the implementation, or in linked libraries written in other languages; such errors could render a given implementation type unsafe in certain circumstances. An early version of Sun's Java virtual machine was vulnerable to this sort of problem. [3]

Strong and weak typing

Programming languages are often colloquially classified as strongly typed or weakly typed (also loosely typed) to refer to certain aspects of type safety. In 1974, Liskov and Zilles defined a strongly-typed language as one in which "whenever an object is passed from a calling function to a called function, its type must be compatible with the type declared in the called function." [10] In 1977, Jackson wrote, "In a strongly typed language each data area will have a distinct type and each process will state its communication requirements in terms of these types." [11] In contrast, a weakly typed language may produce unpredictable results or may perform implicit type conversion. [12]

Memory management and type safety

Type safety is closely linked to memory safety. For instance, in an implementation of a language that has some type which allows some bit patterns but not others, a dangling pointer memory error allows writing a bit pattern that does not represent a legitimate member of into a dead variable of type , causing a type error when the variable is read. Conversely, if the language is memory-safe, it cannot allow an arbitrary integer to be used as a pointer, hence there must be a separate pointer or reference type.

As a minimal condition, a type-safe language must not allow dangling pointers across allocations of different types. But most languages enforce the proper use of abstract data types defined by programmers even when this is not strictly necessary for memory safety or for the prevention of any kind of catastrophic failure. Allocations are given a type describing its contents, and this type is fixed for the duration of the allocation. This allows type-based alias analysis to infer that allocations of different types are distinct.

Most type-safe languages use garbage collection. Pierce says, "it is extremely difficult to achieve type safety in the presence of an explicit deallocation operation", due to the dangling pointer problem. [13] However Rust is generally considered type-safe and uses a borrow checker to achieve memory safety, instead of garbage collection.

Type safety in object oriented languages

In object oriented languages type safety is usually intrinsic in the fact that a type system is in place. This is expressed in terms of class definitions.

A class essentially defines the structure of the objects derived from it and an API as a contract for handling these objects. Each time a new object is created it will comply with that contract.

Each function that exchanges objects derived from a specific class, or implementing a specific interface, will adhere to that contract: hence in that function the operations permitted on that object will be only those defined by the methods of the class the object implements. This will guarantee that the object integrity will be preserved. [14]

Exceptions to this are object oriented languages that allow dynamic modification of the object structure, or the use of reflection to modify the content of an object to overcome the constraints imposed by the class methods definitions.

Type safety issues in specific languages

Ada

Ada was designed to be suitable for embedded systems, device drivers and other forms of system programming, but also to encourage type-safe programming. To resolve these conflicting goals, Ada confines type-unsafety to a certain set of special constructs whose names usually begin with the string Unchecked_. Unchecked_Deallocation can be effectively banned from a unit of Ada text by applying pragma Pure to this unit. It is expected that programmers will use Unchecked_ constructs very carefully and only when necessary; programs that do not use them are type-safe.

The SPARK programming language is a subset of Ada eliminating all its potential ambiguities and insecurities while at the same time adding statically checked contracts to the language features available. SPARK avoids the issues with dangling pointers by disallowing allocation at run time entirely.

Ada2012 adds statically checked contracts to the language itself (in form of pre-, and post-conditions, as well as type invariants).

C

The C programming language is type-safe in limited contexts; for example, a compile-time error is generated when an attempt is made to convert a pointer to one type of structure to a pointer to another type of structure, unless an explicit cast is used. However, a number of very common operations are non-type-safe; for example, the usual way to print an integer is something like printf("%d", 12), where the %d tells printf at run-time to expect an integer argument. (Something like printf("%s", 12), which tells the function to expect a pointer to a character-string and yet supplies an integer argument, may be accepted by compilers, but will produce undefined results.) This is partially mitigated by some compilers (such as gcc) checking type correspondences between printf arguments and format strings.

In addition, C, like Ada, provides unspecified or undefined explicit conversions; and unlike in Ada, idioms that use these conversions are very common, and have helped to give C a type-unsafe reputation. For example, the standard way to allocate memory on the heap is to invoke a memory allocation function, such as malloc , with an argument indicating how many bytes are required. The function returns an untyped pointer (type void *), which the calling code must explicitly or implicitly cast to the appropriate pointer type. Pre-standardized implementations of C required an explicit cast to do so, therefore the code (struct foo *) malloc(sizeof(struct foo)) became the accepted practice. [15]

C++

Some features of C++ that promote more type-safe code:

C#

C# is type-safe. It has support for untyped pointers, but this must be accessed using the "unsafe" keyword which can be prohibited at the compiler level. It has inherent support for run-time cast validation. Casts can be validated by using the "as" keyword that will return a null reference if the cast is invalid, or by using a C-style cast that will throw an exception if the cast is invalid. See C Sharp conversion operators.

Undue reliance on the object type (from which all other types are derived) runs the risk of defeating the purpose of the C# type system. It is usually better practice to abandon object references in favour of generics, similar to templates in C++ and generics in Java.

Java

The Java language is designed to enforce type safety. Anything in Java happens inside an object and each object is an instance of a class.

To implement the type safety enforcement, each object, before usage, needs to be allocated. Java allows usage of primitive types but only inside properly allocated objects.

Sometimes a part of the type safety is implemented indirectly: e.g. the class BigDecimal represents a floating point number of arbitrary precision, but handles only numbers that can be expressed with a finite representation. The operation BigDecimal.divide() calculates a new object as the division of two numbers expressed as BigDecimal.

In this case if the division has no finite representation, as when one computes e.g. 1/3=0.33333..., the divide() method can raise an exception if no rounding mode is defined for the operation. Hence the library, rather than the language, guarantees that the object respects the contract implicit in the class definition.

Standard ML

Standard ML has rigorously defined semantics and is known to be type-safe. However, some implementations, including Standard ML of New Jersey (SML/NJ), its syntactic variant Mythryl and MLton, provide libraries that offer unsafe operations. These facilities are often used in conjunction with those implementations' foreign function interfaces to interact with non-ML code (such as C libraries) that may require data laid out in specific ways. Another example is the SML/NJ interactive toplevel itself, which must use unsafe operations to execute ML code entered by the user.

Modula-2

Modula-2 is a strongly-typed language with a design philosophy to require any unsafe facilities to be explicitly marked as unsafe. This is achieved by "moving" such facilities into a built-in pseudo-library called SYSTEM from where they must be imported before they can be used. The import thus makes it visible when such facilities are used. Unfortunately, this was not consequently implemented in the original language report and its implementation. [16] There still remained unsafe facilities such as the type cast syntax and variant records (inherited from Pascal) that could be used without prior import. [17] The difficulty in moving these facilities into the SYSTEM pseudo-module was the lack of any identifier for the facility that could then be imported since only identifiers can be imported, but not syntax.

IMPORTSYSTEM;(* allows the use of certain unsafe facilities: *)VARword:SYSTEM.WORD;addr:SYSTEM.ADDRESS;addr:=SYSTEM.ADR(word);(* but type cast syntax can be used without such import *)VARi:INTEGER;n:CARDINAL;n:=CARDINAL(i);(* or *)i:=INTEGER(n);

The ISO Modula-2 standard corrected this for the type cast facility by changing the type cast syntax into a function called CAST which has to be imported from pseudo-module SYSTEM. However, other unsafe facilities such as variant records remained available without any import from pseudo-module SYSTEM. [18]

IMPORTSYSTEM;VARi:INTEGER;n:CARDINAL;i:=SYSTEM.CAST(INTEGER,n);(* Type cast in ISO Modula-2 *)

A recent revision of the language applied the original design philosophy rigorously. First, pseudo-module SYSTEM was renamed to UNSAFE to make the unsafe nature of facilities imported from there more explicit. Then all remaining unsafe facilities where either removed altogether (for example variant records) or moved to pseudo-module UNSAFE. For facilities where there is no identifier that could be imported, enabling identifiers were introduced. In order to enable such a facility, its corresponding enabling identifier must be imported from pseudo-module UNSAFE. No unsafe facilities remain in the language that do not require import from UNSAFE. [17]

IMPORTUNSAFE;VARi:INTEGER;n:CARDINAL;i:=UNSAFE.CAST(INTEGER,n);(* Type cast in Modula-2 Revision 2010 *)FROMUNSAFEIMPORTFFI;(* enabling identifier for foreign function interface facility *)<*FFI="C"*>(* pragma for foreign function interface to C *)

Pascal

Pascal has had a number of type safety requirements, some of which are kept in some compilers. Where a Pascal compiler dictates "strict typing", two variables cannot be assigned to each other unless they are either compatible (such as conversion of integer to real) or assigned to the identical subtype. For example, if you have the following code fragment:

typeTwoTypes=recordI:Integer;Q:Real;end;DualTypes=recordI:Integer;Q:Real;end;varT1,T2:TwoTypes;D1,D2:DualTypes;

Under strict typing, a variable defined as TwoTypes is not compatible with DualTypes (because they are not identical, even though the components of that user defined type are identical) and an assignment of T1 := D2; is illegal. An assignment of T1 := T2; would be legal because the subtypes they are defined to are identical. However, an assignment such as T1.Q := D1.Q; would be legal.

Common Lisp

In general, Common Lisp is a type-safe language. A Common Lisp compiler is responsible for inserting dynamic checks for operations whose type safety cannot be proven statically. However, a programmer may indicate that a program should be compiled with a lower level of dynamic type-checking. [19] A program compiled in such a mode cannot be considered type-safe.

C++ examples

The following examples illustrates how C++ cast operators can break type safety when used incorrectly. The first example shows how basic data types can be incorrectly cast:

#include<iostream>usingnamespacestd;intmain(){intival=5;// integer valuefloatfval=reinterpret_cast<float&>(ival);// reinterpret bit patterncout<<fval<<endl;// output integer as floatreturn0;}

In this example, reinterpret_cast explicitly prevents the compiler from performing a safe conversion from integer to floating-point value. [20] When the program runs it will output a garbage floating-point value. The problem could have been avoided by instead writing float fval = ival;

The next example shows how object references can be incorrectly downcast:

#include<iostream>usingnamespacestd;classParent{public:virtual~Parent(){}// virtual destructor for RTTI};classChild1:publicParent{public:inta;};classChild2:publicParent{public:floatb;};intmain(){Child1c1;c1.a=5;Parent&p=c1;// upcast always safeChild2&c2=static_cast<Child2&>(p);// invalid downcastcout<<c2.b<<endl;// will output garbage datareturn0;}

The two child classes have members of different types. When downcasting a parent class pointer to a child class pointer, then the resulting pointer may not point to a valid object of correct type. In the example, this leads to garbage value being printed. The problem could have been avoided by replacing static_cast with dynamic_cast that throws an exception on invalid casts. [21]

See also

Notes

  1. "What to know before debating type systems | Ovid [blogs.perl.org]". blogs.perl.org. Retrieved 2023-06-27.
  2. 1 2 Milner, Robin (1978), "A Theory of Type Polymorphism in Programming", Journal of Computer and System Sciences, 17 (3): 348–375, doi: 10.1016/0022-0000(78)90014-4 , hdl: 20.500.11820/d16745d7-f113-44f0-a7a3-687c2b709f66
  3. 1 2 Saraswat, Vijay (1997-08-15). "Java is not type-safe" . Retrieved 2008-10-08.
  4. Wright, A. K.; Felleisen, M. (15 November 1994). "A Syntactic Approach to Type Soundness". Information and Computation. 115 (1): 38–94. doi: 10.1006/inco.1994.1093 . ISSN   0890-5401.
  5. Damas, Luis; Milner, Robin (25 January 1982). "Principal type-schemes for functional programs". Proceedings of the 9th ACM SIGPLAN-SIGACT symposium on Principles of programming languages - POPL '82. Association for Computing Machinery. pp. 207–212. doi:10.1145/582153.582176. ISBN   0897910656. S2CID   11319320.
  6. Tofte, Mads (1988). Operational Semantics and Polymorphic Type Inference (Thesis).
  7. Henriksen, Troels; Elsman, Martin (17 June 2021). "Towards size-dependent types for array programming". Proceedings of the 7th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming. Association for Computing Machinery. pp. 1–14. doi:10.1145/3460944.3464310. ISBN   9781450384667. S2CID   235474098.
  8. Standard ML. Smlnj.org. Retrieved on 2013-11-02.
  9. "System.IO.Unsafe". GHC libraries manual: base-3.0.1.0. Archived from the original on 2008-07-05. Retrieved 2008-07-17.
  10. Liskov, B; Zilles, S (1974). "Programming with abstract data types". ACM SIGPLAN Notices. 9 (4): 50–59. CiteSeerX   10.1.1.136.3043 . doi:10.1145/942572.807045.
  11. Jackson, K. (1977). "Parallel processing and modular software construction". Design and Implementation of Programming Languages. Lecture Notes in Computer Science. Vol. 54. pp. 436–443. doi:10.1007/BFb0021435. ISBN   3-540-08360-X.
  12. "CS1130. Transition to OO programming. – Spring 2012 --self-paced version". Cornell University, Department of Computer Science. 2005. Retrieved 2023-09-15.
  13. Pierce, Benjamin C. (2002). Types and programming languages. Cambridge, Mass.: MIT Press. p. 158. ISBN   0-262-16209-1.
  14. Type safety is hence also a matter of good class definition: public methods that modify the internal state of an object shall preserve the object itegrity
  15. Kernighan; Dennis M. Ritchie (March 1988). The C Programming Language (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. p.  116. ISBN   978-0-13-110362-7. In C, the proper method is to declare that malloc returns a pointer to void, then explicitly coerce the pointer into the desired type with a cast.
  16. Niklaus Wirth (1985). Programming in Modula-2. Springer Verlag.
  17. 1 2 "The Separation of Safe and Unsafe Facilities" . Retrieved 24 March 2015.
  18. "ISO Modula-2 Language Reference" . Retrieved 24 March 2015.
  19. "Common Lisp HyperSpec" . Retrieved 26 May 2013.
  20. "reinterpret_cast conversion - cppreference.com". En.cppreference.com. Retrieved 2022-09-21.
  21. "dynamic_cast conversion - cppreference.com". En.cppreference.com. Retrieved 2022-09-21.

Related Research Articles

C is a general-purpose computer programming language. It was created in the 1970s by Dennis Ritchie, and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems, device drivers, and protocol stacks, but its use in application software has been decreasing. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.

<span class="mw-page-title-main">Oberon (programming language)</span> General-purpose programming language

Oberon is a general-purpose programming language first published in 1987 by Niklaus Wirth and the latest member of the Wirthian family of ALGOL-like languages. Oberon was the result of a concentrated effort to increase the power of Modula-2, the direct successor of Pascal, and simultaneously to reduce its complexity. Its principal new feature is the concept of type extension of record types. It permits constructing new data types on the basis of existing ones and to relate them, deviating from the dogma of strictly static typing of data. Type extension is Wirth's way of inheritance reflecting the viewpoint of the parent site. Oberon was developed as part of the implementation of an operating system, also named Oberon at ETH Zurich in Switzerland. The name was inspired both by the Voyager space probe's pictures of the moon of the planet Uranus, named Oberon, and because Oberon is famous as the king of the elfs.

Java and C++ are two prominent object-oriented programming languages. By many language popularity metrics, the two languages have dominated object-oriented and high-performance software development for much of the 21st century, and are often directly compared and contrasted. Java's syntax was based on C/C++.

In computer programming, a type system is a logical system comprising a set of rules that assigns a property called a type to every term. Usually the terms are various language constructs of a computer program, such as variables, expressions, functions, or modules. A type system dictates the operations that can be performed on a term. For variables, the type system determines the allowed values of that term. Type systems formalize and enforce the otherwise implicit categories the programmer uses for algebraic data types, data structures, or other components.

In computer programming, a reference is a value that enables a program to indirectly access a particular datum, such as a variable's value or a record, in the computer's memory or in some other storage device. The reference is said to refer to the datum, and accessing the datum is called dereferencing the reference. A reference is distinct from the datum itself.

<span class="mw-page-title-main">Modula-3</span>

Modula-3 is a programming language conceived as a successor to an upgraded version of Modula-2 known as Modula-2+. While it has been influential in research circles it has not been adopted widely in industry. It was designed by Luca Cardelli, James Donahue, Lucille Glassman, Mick Jordan, Bill Kalsow and Greg Nelson at the Digital Equipment Corporation (DEC) Systems Research Center (SRC) and the Olivetti Research Center (ORC) in the late 1980s.

In computer science, a tagged union, also called a variant, variant record, choice type, discriminated union, disjoint union, sum type, or coproduct, is a data structure used to hold a value that could take on several different, but fixed, types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which type is in use. It can be thought of as a type that has several "cases", each of which should be handled correctly when that type is manipulated. This is critical in defining recursive datatypes, in which some component of a value may have the same type as that value, for example in defining a type for representing trees, where it is necessary to distinguish multi-node subtrees and leaves. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

<span class="mw-page-title-main">Pointer (computer programming)</span> Object which stores memory addresses in a computer program

In computer science, a pointer is an object in many programming languages that stores a memory address. This can be that of another value located in computer memory, or in some cases, that of memory-mapped computer hardware. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture.

In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable, in the language specification to which the computer code adheres. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform.

In computer science, type conversion, type casting, type coercion, and type juggling are different ways of changing an expression from one data type to another. An example would be the conversion of an integer value into a floating point value or its textual representation as a string, and vice versa. Type conversions can take advantage of certain features of type hierarchies or data representations. Two important aspects of a type conversion are whether it happens implicitly (automatically) or explicitly, and whether the underlying data representation is converted from one representation into another, or a given representation is merely reinterpreted as the representation of another data type. In general, both primitive and compound data types can be converted.

This article compares two programming languages: C# with Java. While the focus of this article is mainly the languages and their features, such a comparison will necessarily also consider some features of platforms and libraries. For a more detailed comparison of the platforms, see Comparison of the Java and .NET platforms.

In some programming languages, const is a type qualifier, which indicates that the data is read-only. While this can be used to declare constants, const in the C family of languages differs from similar constructs in other languages in that it is part of the type, and thus has complicated behavior when combined with pointers, references, composite data types, and type-checking. In other languages, the data is not in a single memory location, but copied at compile time for each use. Languages which use it include C, C++, D, JavaScript, Julia, and Rust.

In computer programming, thread-local storage (TLS) is a memory management method that uses static or global memory local to a thread. The concept allows storage of data that appears to be global in a system with separate threads.

<span class="mw-page-title-main">C Sharp (programming language)</span> Programming language

C# is a general-purpose high-level programming language supporting multiple paradigms. C# encompasses static typing, strong typing, lexically scoped, imperative, declarative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines.

C++11 is a version of the ISO/IEC 14882 standard for the C++ programming language. C++11 replaced the prior version of the C++ standard, called C++03, and was later replaced by C++14. The name follows the tradition of naming language versions by the publication year of the specification, though it was formerly named C++0x because it was expected to be published before 2010.

In computer science, a type punning is any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

<span class="mw-page-title-main">Go (programming language)</span> Programming language

Go is a statically typed, compiled high-level programming language designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson. It is syntactically similar to C, but also has memory safety, garbage collection, structural typing, and CSP-style concurrency. It is often referred to as Golang because of its former domain name, golang.org, but its proper name is Go.

In computer programming, one of the many ways that programming languages are colloquially classified is whether the language's type system makes it strongly typed or weakly typed. However, there is no precise technical definition of what the terms mean and different authors disagree about the implied meaning of the terms and the relative rankings of the "strength" of the type systems of mainstream programming languages. For this reason, writers who wish to write unambiguously about type systems often eschew the terms "strong typing" and "weak typing" in favor of specific expressions such as "type safety".

Objective-C is a high-level general-purpose, object-oriented programming language that adds Smalltalk-style messaging to the C programming language. Originally developed by Brad Cox and Tom Love in the early 1980s, it was selected by NeXT for its NeXTSTEP operating system. Due to Apple macOS’s direct lineage from NeXTSTEP, Objective-C was the standard programming language used, supported, and promoted by Apple for developing macOS and iOS applications until the introduction of the Swift programming language in 2014.

<span class="mw-page-title-main">Zig (programming language)</span> A general-purpose programming language, toolchain to build Zig/C/C++ code

Zig is an imperative, general-purpose, statically typed, compiled system programming language designed by Andrew Kelley. It is intended to be a successor to the C programming language, with the intention of being even smaller and simpler to program in while also offering more functionality.

References