Polyglot (computing)

In computing, a polyglot is a computer program or script (or other file) written in a valid form of multiple programming languages or file formats. [1] The name was coined by analogy to multilingualism. A polyglot file is composed by combining syntax from two or more different formats. [2]

When the file formats are to be compiled or interpreted as source code, the file can be said to be a polyglot program, though file formats and source code syntax are both fundamentally streams of bytes, and exploiting this commonality is key to the development of polyglots. [3] Polyglot files have practical applications in compatibility, [4] but can also present a security risk when used to bypass validation or to exploit a vulnerability.

History

Polyglot programs have been crafted as challenges and curios in hacker culture since at least the early 1990s. A notable early example, named simply polyglot, was published on the Usenet group rec.puzzles in 1991. It supported eight languages, though it was itself inspired by even earlier programs. [5] In 2000, a polyglot program was named a winner in the International Obfuscated C Code Contest. [6]

In the 21st century, polyglot programs and files gained attention as a covert channel mechanism for propagation of malware. [3]

Construction

A polyglot is composed by combining syntax from two or more different formats, leveraging syntactic constructs that are either common to the formats or language-specific but carry a different meaning in each language. A file is a valid polyglot if it can be successfully interpreted by multiple interpreting programs. For example, a PDF-Zip polyglot can be both opened as a valid PDF document and decompressed as a valid zip archive. To maintain validity across interpreting programs, one must ensure that constructs specific to one interpreter are not interpreted by another, and vice versa. This is often accomplished by hiding language-specific constructs in segments that the other format interprets as comments or plain text. [1]
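As a minimal sketch of the first approach (constructs common to both formats), the text below is valid both as JSON and as a Python literal. The sample string and the helper name `parse_both` are invented for illustration and do not come from the cited sources.

```python
import ast
import json

# Hypothetical sample text: it uses only syntax shared by JSON and Python
# literals (lists, double-quoted strings, integers, dicts with string keys).
SHARED_TEXT = '[1, 2, {"name": "polyglot", "formats": ["json", "python"]}]'


def parse_both(text):
    """Parse the same text with a JSON parser and a Python literal parser."""
    as_json = json.loads(text)
    as_python = ast.literal_eval(text)
    return as_json, as_python


if __name__ == "__main__":
    as_json, as_python = parse_both(SHARED_TEXT)
    # Both parsers accept the text and build the same data structure.
    print(as_json == as_python)
```

The sketch works only because it avoids format-specific constructs: a JSON `true` or `null`, for instance, would be rejected by the Python literal parser, which is the kind of construct that a real polyglot has to hide from the other interpreter.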

Examples

C, PHP, and Bash

Two commonly used techniques for constructing a polyglot program are to make use of languages that use different characters for comments, and to redefine various tokens as others in different languages. These are demonstrated in this public domain polyglot written in ANSI C, PHP and bash:

    #define a /*
    #<?php
    echo "\010Hello, world!\n";// 2> /dev/null > /dev/null \ ;
    // 2> /dev/null; x=a;
    $x=5; // 2> /dev/null \ ;
    if (($x))
    // 2> /dev/null; then
    return 0;
    // 2> /dev/null; fi
    #define e ?>
    #define b */
    #include <stdio.h>
    #define main() int main(void)
    #define printf printf(
    #define true )
    #define function
    function main()
    {
    printf "Hello, world!\n"true/* 2> /dev/null | grep -v true*/;
    return 0;
    }
    #define c /*
    main
    #*/

Note the following:

  • A hash sign marks a preprocessor statement in C, but is a comment in both bash and PHP.
  • "//" is a comment in both PHP and C and the root directory in bash.
  • Shell redirection is used to eliminate undesirable outputs.
  • Even on commented out lines, the "<?php" and "?>" PHP indicators still have effect.
  • The statement "function main()" is valid in both PHP and bash; C #defines are used to convert it into "int main(void)" at compile time.
  • Comment indicators can be combined to perform various operations.
  • "if (($x))" is a valid statement in both bash and PHP.
  • printf is a bash shell builtin which is identical to the C printf except for its omission of brackets (which the C preprocessor adds if this is compiled with a C compiler).
  • The final three lines are only used by bash, to call the main function. In PHP the main function is defined but not called and in C there is no need to explicitly call the main function.

SNOBOL4, Win32Forth, PureBasicv4.x, and REBOL

The following is written simultaneously in SNOBOL4, Win32Forth, PureBasicv4.x, and REBOL:

    *BUFFER : A.A ; .( Hello, world !) @ To Including?
    Macro SkipThis; OUTPUT = Char(10) "Hello, World !"
    ;OneKeyInput  Input('Char', 1, '[-f2-q1]')  ; Char
    End; SNOBOL4 + PureBASIC + Win32Forth + REBOL = <3
    EndMacro: OpenConsole() : PrintN("Hello, world !")
    Repeat : Until Inkey() :  Macro SomeDummyMacroHere
    REBOL [ Title: "'Hello, World !' in 4 languages"
    CopyLeft: "Developed in 2010 by Society" ]          Print "Hello, world !"
    EndMacro: func [][]  set-modes system/ports/input [binary: true]
    Input  set-modes system/ports/input  [binary: false] NOP:: EndMacro
    ; Wishing to refine it with new language ? Go on !

DOS batch file and Perl

The following file runs as a DOS batch file, then re-runs itself in Perl:

    @rem = ' --PERL--
    @echo off
    perl "%~dpnx0" %*
    goto endofperl
    @rem ';
    #!perl
    print "Hello, world!\n";
    __END__
    :endofperl

This allows creating Perl scripts that can be run on DOS systems with minimal effort. Note that there is no requirement for a file to perform exactly the same function in the different interpreters.

Types

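Polyglot types include: [3]

  • stacks, where multiple files are concatenated with each other
  • parasites, where a secondary file format is hidden within comment fields of a primary file format
  • zippers, where two files are mutually arranged within each other's comments
  • cavities, where a secondary file format is hidden within null-padded areas of the primary file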

Benefits

Polyglot markup

Polyglot markup has been proposed as a useful combination of the benefits of HTML5 and XHTML. [7] Such documents can be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For example, in order for an HTML5 document to meet these criteria, the two requirements are that it must have an HTML5 doctype, and be written in well-formed XHTML. The same document can then be served as either HTML or XHTML, depending on browser support and MIME type.

As expressed by the html-polyglot recommendation, [7] to write a polyglot HTML5 document, the following key points should be observed:

  1. Processing instructions and the XML declaration are both forbidden in polyglot markup
  2. Specifying a document’s character encoding
  3. The DOCTYPE
  4. Namespaces
  5. Element syntax (i.e. End tags are not optional. Use self-closing tags for void elements.)
  6. Element content
  7. Text (i.e. pre and textarea should not start with newline character)
  8. Attributes (i.e. Values must be quoted)
  9. Named entity references (i.e. Only amp, lt, gt, apos, quot)
  10. Comments (i.e. Use <!-- syntax -->)
  11. Scripting and styling polyglot markup

The most basic possible polyglot markup document would therefore look like this: [7]

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
      <head>
        <title>The title element must not be empty.</title>
      </head>
      <body>
      </body>
    </html>

In a polyglot markup document non-void elements (such as script, p, div) cannot be self-closing even if they are empty, as this is not valid HTML. [8] For example, to add an empty textarea to a page, one cannot use <textarea/>, but has to use <textarea></textarea> instead.
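The claim that such a document reads the same under both parsers can be checked roughly with Python's standard library. This is only a sketch: `html.parser` is a tokenizer rather than a full HTML5 tree builder, and the helper names `TagCollector`, `tags_as_html` and `tags_as_xml` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The minimal polyglot document from the text above.
DOCUMENT = (
    '<!DOCTYPE html>'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">'
    '<head><title>The title element must not be empty.</title></head>'
    '<body></body>'
    '</html>'
)


class TagCollector(HTMLParser):
    """Record start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_as_html(text):
    """Element names seen when the text is tokenized as HTML."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


def tags_as_xml(text):
    """Element names seen when the text is parsed as XML."""
    root = ET.fromstring(text)
    # Strip the "{namespace}" prefix that ElementTree adds to tag names.
    return [element.tag.split("}")[-1] for element in root.iter()]


if __name__ == "__main__":
    print(tags_as_html(DOCUMENT) == tags_as_xml(DOCUMENT))
```

Comparing element names is a much weaker check than comparing full DOM trees, but it shows the idea: one byte sequence, two parsers, the same structure.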

Composing formats

The DICOM medical imaging format was designed to allow polyglotting with TIFF files, allowing efficient storage of the same image data in a file that can be interpreted by either DICOM or TIFF viewers. [9]
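One way to see why the two formats can coexist is that their signatures sit at different offsets. The sketch below assumes the commonly documented layouts (a TIFF byte-order mark and magic number at offset 0; a DICOM file with a 128-byte preamble followed by the bytes `DICM`). It checks signatures only, and the function names are hypothetical.

```python
DICOM_PREAMBLE_LENGTH = 128          # assumed: preamble precedes the "DICM" marker
DICOM_MAGIC = b"DICM"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def looks_like_tiff(data):
    """True if the data starts with a TIFF signature."""
    return data[:4] in TIFF_MAGICS


def looks_like_dicom(data):
    """True if the DICOM marker follows the 128-byte preamble."""
    start = DICOM_PREAMBLE_LENGTH
    return data[start:start + 4] == DICOM_MAGIC


def looks_like_dual_personality(data):
    """True if both signatures are present; says nothing about full validity."""
    return looks_like_tiff(data) and looks_like_dicom(data)


if __name__ == "__main__":
    # Synthetic header only: a TIFF signature inside the DICOM preamble area.
    sample = b"II*\x00" + bytes(124) + b"DICM"
    print(looks_like_dual_personality(sample))
```

Because the DICOM preamble is free-form, a TIFF header can occupy it without disturbing the DICOM marker that follows.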

Compatibility

The Python 2 and Python 3 programming languages were not designed to be compatible with each other, but there is sufficient commonality of syntax that a polyglot Python program can be written that runs in both versions. [10]
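A minimal sketch of such a program follows. It relies on `__future__` imports, which both Python 2.6+ and Python 3 accept, to make `print` and `/` behave identically; the function names are invented for illustration.

```python
from __future__ import division, print_function

import sys


def describe_interpreter():
    """Return a short label for the running interpreter's major version."""
    return "Python {0}".format(sys.version_info[0])


def halve(value):
    # With the __future__ import, "/" is true division in both versions.
    return value / 2


if __name__ == "__main__":
    # print() with several arguments works the same way in both versions
    # once print_function is imported.
    print(describe_interpreter(), "computes 7 / 2 =", halve(7))
```

Without the two imports, the same source would still run under both interpreters but could behave differently, since Python 2 truncates integer division and treats `print` as a statement.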

Security implications

A polyglot of two formats may steganographically compose a malicious payload within an ostensibly benign and widely accepted wrapper format, such as a JPEG file that allows arbitrary data in its comment field. A vulnerable JPEG renderer could then be coerced into executing the payload, handing control to the attacker. The mismatch between what the interpreting program expects, and what the file actually contains, is the root cause of the vulnerability. [1]

SQL injection is a simple form of polyglot: a server naively expects user-controlled input to be plain data conforming to a certain constraint, but the user supplies syntax that is interpreted as SQL code.
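The mismatch can be sketched with Python's built-in `sqlite3` module. The table, the sample rows and the function names are hypothetical; the point is only the contrast between splicing input into the query text and passing it as a bound parameter.

```python
import sqlite3


def make_database():
    """Create an in-memory database with two hypothetical users."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    connection.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return connection


def lookup_naive(connection, name):
    # Unsafe: the input is spliced into the SQL text, so it can be read as code.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return connection.execute(query).fetchall()


def lookup_parameterized(connection, name):
    # Safer: the driver passes the input as data, never as SQL syntax.
    return connection.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    db = make_database()
    payload = "nobody' OR '1'='1"  # plain text to the user, SQL to the server
    print(len(lookup_naive(db, payload)))          # every row matches
    print(len(lookup_parameterized(db, payload)))  # no row matches
```

The payload is a polyglot in the loose sense used here: it is an ordinary string in one context and a fragment of a `WHERE` clause in the other.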

Note that in a security context, there is no requirement for a polyglot file to be strictly valid in multiple formats; it is sufficient for the file to trigger unintended behaviour when being interpreted by its primary interpreter.

Highly flexible or extensible file formats have greater scope for polyglotting, and therefore more tightly constrained interpretation offers some mitigation against attacks using polyglot techniques. For example, the PDF file format requires that the magic number %PDF appears at byte offset zero, but many PDF interpreters waive this constraint and accept the file as valid PDF as long as the string appears within the first 1024 bytes. This creates a window of opportunity for polyglot PDF files to smuggle non-PDF content in the header of the file. [3] The PDF format has been described as "diverse and vague", and due to significantly varying behaviour between different PDF parsing engines, it is possible to create a PDF-PDF polyglot that renders as two entirely different documents in two different PDF readers. [11]
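The difference between the strict rule and the lenient behaviour described above can be sketched as two checks. The 1024-byte window is taken from the text; the function names and the sample bytes are assumptions made for illustration, not any particular reader's implementation.

```python
PDF_MAGIC = b"%PDF"
LENIENT_WINDOW = 1024  # window size described in the text above


def is_pdf_strict(data):
    """Accept only files whose magic number sits at byte offset zero."""
    return data.startswith(PDF_MAGIC)


def is_pdf_lenient(data):
    """Accept files whose magic number appears anywhere in the first 1024 bytes."""
    return PDF_MAGIC in data[:LENIENT_WINDOW]


if __name__ == "__main__":
    # Hypothetical file: 100 bytes of foreign content ahead of the PDF header.
    smuggled = b"\x00" * 100 + b"%PDF-1.4\n"
    print(is_pdf_strict(smuggled))   # rejected by the strict check
    print(is_pdf_lenient(smuggled))  # accepted by the lenient check
```

Every byte in front of the magic number that the lenient check tolerates is space in which another format's header can be placed.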

Detecting malware concealed within polyglot files requires more sophisticated analysis than relying on file-type identification utilities such as file. In 2019, an evaluation of commercial anti-malware software determined that several such packages were unable to detect any of the polyglot malware under test. [3] [2]

In 2019, the DICOM medical imaging file format was found to be vulnerable to malware injection using a PE-DICOM polyglot technique. [12] The polyglot nature of the attack, combined with regulatory considerations, led to disinfection complications: because "the malware is essentially fused to legitimate imaging files", "incident response teams and A/V software cannot delete the malware file as it contains protected patient health information". [13]

GIFAR attack

A GIFAR (Graphics Interchange Format Java Archive) is a polyglot file that is simultaneously a valid GIF image and a valid JAR archive. [14] This technique can be used to exploit security vulnerabilities, for example by uploading a GIFAR to a website that allows image uploads (it passes as a valid GIF file) and then causing the Java portion of the GIFAR to be executed as though it were part of the website's intended code, since it is delivered to the browser from the same origin. [15] The issue was patched in JRE 6 Update 11, with a CVE published in December 2008. [16] [17]

GIFARs are possible because GIF images store their header in the beginning of the file, and JAR files (as with any ZIP archive-based format) store their data at the end. [18]
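The layout can be illustrated with Python's `zipfile` module, which locates an archive from its end and so tolerates unrelated bytes in front of it. This is a sketch only: the prefix below carries just a GIF signature and padding rather than a complete image, the archive is a plain ZIP without a JAR manifest, and the helper names are invented.

```python
import io
import zipfile

GIF_SIGNATURE = b"GIF89a"


def build_prefixed_archive(prefix, members):
    """Return prefix bytes followed by a ZIP archive of the given members."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return prefix + archive.getvalue()


def read_member(data, name):
    """Open the combined bytes as a ZIP archive and read one member."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name)


if __name__ == "__main__":
    # Stand-in for image data: a GIF signature plus padding (not a full GIF).
    prefix = GIF_SIGNATURE + b"\x00" * 16
    combined = build_prefixed_archive(prefix, {"hello.txt": b"hi"})
    print(combined.startswith(GIF_SIGNATURE))  # a header check sees a GIF
    print(read_member(combined, "hello.txt"))  # a ZIP reader sees an archive
```

A validator that inspects only the first bytes and a loader that reads from the end can therefore each be satisfied by the same file.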

References

  1. Jonas Magazinius; Billy K. Rios; Andrei Sabelfeld (4 November 2013). "Polyglots". Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security - CCS '13. pp. 753–764. doi:10.1145/2508859.2516685. ISBN 9781450324779. S2CID 16516484.
  2. Bridges, Robert A.; Oesch, Sean; Verma, Miki E.; Iannacone, Michael D.; Huffer, Kelly M. T.; Jewell, Brian; Nichols, Jeff A.; Weber, Brian; Beaver, Justin M.; Smith, Jared M.; Scofield, Daniel; Miles, Craig; Plummer, Thomas; Daniell, Mark; Tall, Anne M. (2023). "Beyond the Hype: An Evaluation of Commercially Available Machine-Learning-Based Malware Detectors". Digital Threats: Research and Practice. 4 (2): 1–22. arXiv:2012.09214. doi:10.1145/3567432. S2CID 247218744.
  3. Koch, Luke; Oesch, Sean; Adkisson, Mary; Erwin, Sam; Weber, Brian; Chaulagain, Amul (2022). "Toward the Detection of Polyglot Files". arXiv:2203.07561 [cs.CR].
  4. "Benefits of polyglot XHTML5". Retrieved 4 September 2022.
  5. "Polyglot: A program in eight languages". Retrieved 6 September 2022.
  6. "15th International Obfuscated C Code Contest (2000)". Retrieved 6 September 2022.
  7. "Polyglot Markup: A robust profile of the HTML5 vocabulary". Retrieved 4 September 2022.
  8. "Polyglot Markup: HTML-Compatible XHTML Documents: 6.4 Void Elements". W3C Editor's Draft, 9 July 2012. Archived 2 October 2012 at the Wayback Machine.
  9. "DICOM-TIFF dual personality files". Retrieved 5 September 2022.
  10. Schofield, Ed. "Cheat Sheet: Writing Python 2-3 compatible code". Retrieved 6 September 2022.
  11. Wolf, Julia (9 February 2011). "OMG WTF PDF". 27th Chaos Communication Congress. Retrieved 6 September 2022.
  12. Desjardins, Benoit; Mirsky, Yisroel; Ortiz, Markel Picado; Glozman, Zeev; Tarbox, Lawrence; Horn, Robert; Horii, Steven C. (April 2020). "DICOM Images Have Been Hacked! Now What?". American Journal of Roentgenology. 214 (4): 727–735. doi:10.2214/AJR.19.21958. PMID 31770023. S2CID 208318324. Retrieved 5 September 2022.
  13. "Ubiquitous Bug Allows HIPAA-Protected Malware to Hide Behind Medical Images". 17 April 2019. Retrieved 5 September 2022.
  14. Byrd, Christopher. "How to Create a GIFAR". Retrieved 6 March 2023.
  15. Eckel, Benjamin (5 August 2008). "The GIFAR Image Vulnerability". Hackaday. Retrieved 6 March 2023.
  16. "CVE-2008-5343". cve.mitre.org. 4 December 2008. Retrieved 20 April 2021.
  17. McMillan, Robert (1 August 2008). "A photo that can steal your online credentials". Infoworld.com. Archived from the original on 18 September 2020.
  18. Rios, Billy (17 December 2008). "Billy (BK) Rios » SUN Fixes GIFARs". Archived from the original on 14 March 2016. Retrieved 20 April 2021.
  19. Fjeldberg, Hans (2008). Polyglot Programming - A Business Perspective (PDF) (M.Sc). Norwegian University of Science and Technology.
  20. Gupta, Tripta (19 December 2018). "Analyzing Polyglot Microservices". Medium. Retrieved 5 August 2019.