Trojan Source

CVE identifier(s):
  • CVE-2021-42574
  • CVE-2021-42694
Date discovered: September 9, 2021
Discoverer: Nicholas Boucher, Ross Anderson
Affected software: Unicode, source code
Website: trojansource.codes

Trojan Source is a software vulnerability that abuses Unicode's bidirectional characters so that source code is displayed in a different order than the one in which it is compiled or executed. [1] The exploit takes advantage of how scripts with different reading directions are encoded and displayed on computers. It was discovered by Nicholas Boucher and Ross Anderson at the University of Cambridge in late 2021. [2]

Background

Unicode is an encoding standard for representing text, symbols, and glyphs. It is the dominant encoding standard on computers; its UTF-8 encoding is used by over 98% of websites as of September 2023. [3] Because Unicode supports many languages, it must also support different ways of writing text: left-to-right scripts such as those used for English and Russian, and right-to-left scripts such as those used for Hebrew and Arabic. Since text may combine more than one writing system, Unicode must be able to mix scripts with different display orders and resolve conflicts between them. For this purpose, Unicode contains bidirectional (Bidi) formatting characters that describe how text is ordered for display. Because these characters are usually invisible, they can be abused so that text is displayed in one order while being stored and processed in another. [4]
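As a small illustration of the directional properties described above, Python's standard `unicodedata` module reports the bidirectional class that Unicode assigns to each character. The sample characters and the `bidi_class` helper are chosen here for illustration only.

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# Illustrative sample characters (an assumption of this sketch).
samples = {
    "Latin a": "a",
    "Cyrillic ya": "\u044f",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "RLI control": "\u2067",
}

classes = {label: bidi_class(ch) for label, ch in samples.items()}
for label, cls in classes.items():
    print(f"{label}: {cls}")
```

Latin and Cyrillic letters report the class `L` (left-to-right), Hebrew letters `R`, Arabic letters `AL`, and the invisible formatting character reports its own class, `RLI`.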

Relevant Unicode bidirectional formatting characters

Abbreviation  Code point  Name                        Description
LRE           U+202A      LEFT-TO-RIGHT EMBEDDING     Try treating following text as left-to-right.
RLE           U+202B      RIGHT-TO-LEFT EMBEDDING     Try treating following text as right-to-left.
LRO           U+202D      LEFT-TO-RIGHT OVERRIDE      Force treating following text as left-to-right.
RLO           U+202E      RIGHT-TO-LEFT OVERRIDE      Force treating following text as right-to-left.
LRI           U+2066      LEFT-TO-RIGHT ISOLATE       Force treating following text as left-to-right without affecting adjacent text.
RLI           U+2067      RIGHT-TO-LEFT ISOLATE       Force treating following text as right-to-left without affecting adjacent text.
FSI           U+2068      FIRST STRONG ISOLATE        Force treating following text in the direction indicated by its first strongly directional character, without affecting adjacent text.
PDF           U+202C      POP DIRECTIONAL FORMATTING  Terminate nearest LRE, RLE, LRO, or RLO.
PDI           U+2069      POP DIRECTIONAL ISOLATE     Terminate nearest LRI, RLI, or FSI.
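The table above can be turned into a simple lookup for locating these characters in a piece of text. The following is a minimal sketch; the names `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical and not taken from any existing tool.

```python
import unicodedata

# The nine Bidi formatting characters listed in the table above.
BIDI_CONTROLS = {
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202d": "LRO",
    "\u202e": "RLO",
    "\u2066": "LRI",
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u202c": "PDF",
    "\u2069": "PDI",
}


def find_bidi_controls(text: str) -> list:
    """Return (index, abbreviation, Unicode name) for each Bidi control in text."""
    return [
        (i, BIDI_CONTROLS[ch], unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# A line that looks harmless on screen but contains an invisible RLI.
line = "'''Add num1 and num2, and \u2067''' ;return"
hits = find_bidi_controls(line)
for index, abbr, name in hits:
    print(f"index {index}: {abbr} ({name})")
```

Because the characters have no visible glyph, reporting the index and name is usually more useful than echoing the line itself.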

Methodology

In the exploit, bidirectional characters are abused to visually reorder text in source code, so that what a reviewer sees differs from the order in which the code is later executed. Bidirectional characters can be inserted in areas of source code that accept free-form text, such as string literals (including documentation strings) and comments.

Vulnerable Python code

Source code with hints:

    def sum(num1, num2):
        '''Add num1 and num2, and [RLI] return; '''
        return num1 + num2

Source code displayed visually:

    def sum(num1, num2):
        '''Add num1 and num2, and return; '''
        return num1 + num2

Source code interpreted:

    def sum(num1, num2):
        '''Add num1 and num2, and ''';return
        return num1 + num2

In the above example, the RLI mark (right-to-left isolate) causes the text that follows it to be displayed in reverse order. In the order the interpreter actually reads it, the triple quote comes first (ending the string), followed by a semicolon (starting a new statement), and finally the premature return (returning None and skipping any code below it). The line break terminates the RLI mark, preventing it from affecting the code below. Because of the Bidi character, some source code editors and IDEs rearrange the code for display without any visual indication that it has been rearranged, so a human code reviewer would not normally detect the change. However, when the code is passed to a compiler or interpreter, the Bidi character may be ignored and the remaining characters processed in their stored order rather than the displayed one. As a result, code that visually appeared to be non-executable, such as part of a comment or string, could be executed. [5] Formatting marks can be combined multiple times to create complex attacks. [6]
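The behaviour described above can be reproduced in a few lines. This sketch builds the function's source text in its stored (logical) order, assuming the RLI sits immediately before the closing triple quote, and then runs it with `exec`; the file name `<trojan-demo>` and the variable names are illustrative only.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE, invisible in most editors

# Stored (logical) order of the source. A Bidi-aware editor would show the
# second line with "return; '''" apparently inside the docstring.
source = (
    "def sum(num1, num2):\n"
    f"    '''Add num1 and num2, and {RLI}''' ;return\n"
    "    return num1 + num2\n"
)

namespace = {}
exec(compile(source, "<trojan-demo>", "exec"), namespace)

result = namespace["sum"](2, 3)
print(result)  # None: the early return runs before the addition

# The invisible character is simply part of the docstring.
docstring = namespace["sum"].__doc__
print(docstring.endswith(RLI))  # True
```

The interpreter treats the RLI as an ordinary character inside the string literal, so the function returns before the visible `return num1 + num2` is ever reached.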

Impact and mitigation

Programming languages that support Unicode strings and follow Unicode's Bidi algorithm are vulnerable to the exploit. These include Java, Go, C, C++, C#, Python, and JavaScript. [7]

While the attack is not strictly an error, many compilers, interpreters, and websites added warnings or mitigations for the exploit. Both GCC and LLVM received requests to deal with it. [8] Shortly after the exploit was published, Marek Polacek submitted a patch to GCC that implemented a warning for potentially unsafe directional characters; this functionality was merged for GCC 12 under the -Wbidi-chars flag. [9] [10] LLVM also merged similar patches. Rust addressed the exploit in version 1.56.1, which by default rejects code that includes the characters. The developers of Rust found no vulnerable packages prior to the fix. [11]
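A warning of this kind can be approximated with a short scanner. The sketch below is not the GCC, LLVM, or Rust implementation; it is a simplified, hypothetical check (`unterminated_bidi_lines`) that flags lines on which an embedding, override, or isolate is opened but not closed before the end of the line.

```python
# Openers closed by PDF (embeddings and overrides) and by PDI (isolates).
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}
PDF = "\u202c"
PDI = "\u2069"


def unterminated_bidi_lines(source: str) -> list:
    """Return 1-based numbers of lines that leave a Bidi control unclosed."""
    flagged = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stack = []  # "E" for embedding/override, "I" for isolate
        for ch in line:
            if ch in EMBEDDING_OPENERS:
                stack.append("E")
            elif ch in ISOLATE_OPENERS:
                stack.append("I")
            elif ch == PDF:
                # PDF closes only an embedding/override at the top of the stack.
                if stack and stack[-1] == "E":
                    stack.pop()
            elif ch == PDI and "I" in stack:
                # PDI closes the nearest isolate and anything opened inside it.
                while stack.pop() != "I":
                    pass
        if stack:
            flagged.append(lineno)
    return flagged


sample = (
    "x = 1\n"
    "s = '\u2067abc'\n"          # isolate left open: flagged
    "t = '\u2067abc\u2069'\n"    # isolate properly closed: not flagged
)
flagged_lines = unterminated_bidi_lines(sample)
print(flagged_lines)  # [2]
```

Flagging only unterminated controls keeps legitimate right-to-left text in strings and comments usable, while a stricter policy could reject every Bidi control outright.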

Red Hat issued an advisory on its website, rating the exploit as "moderate". [12] GitHub published a warning on its blog and updated its website to show a dialog box when Bidi characters are detected in a repository's code. [13]

References

  1. "'Trojan Source' Bug Threatens the Security of All Code – Krebs on Security". November 2021. Archived from the original on 2022-01-14. Retrieved 2022-01-17.
  2. "VU#999008 - Compilers permit Unicode control and homoglyph characters". www.kb.cert.org. Archived from the original on 2022-01-21. Retrieved 2022-01-17.
  3. "Usage Survey of Character Encodings broken down by Ranking". w3techs.com. Archived from the original on 2022-01-21. Retrieved 2022-01-17.
  4. "UAX #9: Unicode Bidirectional Algorithm". www.unicode.org. Archived from the original on 2019-05-02. Retrieved 2022-01-17.
  5. Edge, Jake (2021-11-03). "Trojan Source: tricks (no treats) with Unicode [LWN.net]". lwn.net. Retrieved 2022-03-12.
  6. Stockley, Mark (2021-11-03). "Trojan Source: Hiding malicious code in plain sight". Malwarebytes Labs. Retrieved 2022-03-12.
  7. Tung, Liam. "Programming languages: This sneaky trick could allow attackers to hide 'invisible' vulnerabilities in code". ZDNet. Archived from the original on 2021-12-21. Retrieved 2022-01-21.
  8. "GCC & LLVM Patches Pending To Fend Off Trojan Source Attacks". www.phoronix.com. Archived from the original on 2021-12-01. Retrieved 2022-01-17.
  9. Malcolm, David (2022-01-12). "Prevent Trojan Source attacks with GCC 12". Red Hat Developer. Archived from the original on 2022-01-17. Retrieved 2022-01-17.
  10. "Warning Options (Using the GNU Compiler Collection (GCC))". gcc.gnu.org. Archived from the original on 2018-12-05. Retrieved 2022-01-17.
  11. "Security advisory for rustc (CVE-2021-42574) | Rust Blog". blog.rust-lang.org. Archived from the original on 2021-11-30. Retrieved 2022-01-21.
  12. "RHSB-2021-007 Trojan source attacks (CVE-2021-42574,CVE-2021-42694)". Red Hat Customer Portal. Archived from the original on 2022-01-17. Retrieved 2022-01-21.
  13. "Warning about bidirectional Unicode text | GitHub Changelog". The GitHub Blog. 31 October 2021. Archived from the original on 2022-01-15. Retrieved 2022-01-21.