Code stylometry

Code stylometry (also known as program authorship attribution or source code authorship analysis) is the application of stylometry to computer code to attribute authorship to anonymous binary or source code. It often involves breaking down and examining the distinctive patterns and characteristics of the programming code and then comparing them to computer code whose authorship is known. [1] Unlike software forensics, code stylometry attributes authorship for purposes other than intellectual property infringement, including plagiarism detection, copyright investigation, and authorship verification. [2]

History

In 1989, researchers Paul Oman and Curtis Cook identified the authorship of 18 different Pascal programs written by six authors by using “markers” based on typographic characteristics. [3]

In 1998, researchers Stephen MacDonell, Andrew Gray, and Philip Sallis developed a dictionary-based author attribution system called IDENTIFIED (Integrated Dictionary-based Extraction of Non-language-dependent Token Information for Forensic Identification, Examination, and Discrimination) that determined the authorship of source code in computer programs written in C++. The researchers noted that authorship can be identified using degrees of flexibility in the writing style of the source code. [4]

The IDENTIFIED system attributed authorship by first merging all the relevant files into a single source code file and then running a metrics analysis on it that counted the number of occurrences of each metric. In addition, the system was language-independent because new dictionary files and meta-dictionaries could be created for it. [4]
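The merge-then-count idea can be illustrated with a minimal sketch. It is not the IDENTIFIED implementation: the dictionary entries, the function names `merge_sources` and `count_metrics`, and the use of regular expressions are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical metric dictionary: metric name -> regular expression.
# These entries are illustrative and are not taken from IDENTIFIED.
METRIC_DICTIONARY = {
    "while_loops": r"\bwhile\b",
    "for_loops": r"\bfor\b",
    "line_comments": r"//",
    "block_comments": r"/\*",
}


def merge_sources(sources):
    """Concatenate several source files (given as strings) into one text."""
    return "\n".join(sources)


def count_metrics(text, dictionary=METRIC_DICTIONARY):
    """Count how often each dictionary entry occurs in the merged text."""
    counts = Counter()
    for name, pattern in dictionary.items():
        counts[name] = len(re.findall(pattern, text))
    return counts


# Two toy C++ fragments standing in for an author's files.
files = [
    "for (int i = 0; i < n; i++) { total += i; } // sum",
    "/* wait */ while (busy) { poll(); }",
]
metrics = count_metrics(merge_sources(files))
print(dict(metrics))
```

Swapping in a different dictionary is the only change needed to count metrics for another language, which is the sense in which a dictionary-driven design can be language-independent.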

In 1999, a team of researchers led by Stephen MacDonell tested the performance of three different program authorship discrimination techniques on 351 programs written in C++ by 7 different authors. The researchers compared the effectiveness of a feed-forward neural network (FFNN) trained with a back-propagation algorithm, multiple discriminant analysis (MDA), and case-based reasoning (CBR). At the end of the experiment, both the neural network and the MDA had an accuracy rate of 81.1%, while the CBR reached an accuracy of 88.0%. [5]

In 2005, researchers from the Laboratory of Information and Communication Systems Security at the University of the Aegean introduced a language-independent method of program authorship attribution that used byte-level n-grams to assign a program to an author. The technique scanned the files and built a table of the different n-grams found in the source code together with the number of times each one appeared. The system could operate with a limited number of training examples from each author, although the attribution became more reliable as more source code was available for each author. In an experiment testing their approach, the researchers found that classification using n-grams reached an accuracy rate of up to 100%, although the rate declined drastically if the profile size exceeded 500 and the n-gram size was 3 or less. [3]
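A byte-level n-gram profile of this kind can be sketched in a few lines. The text above does not state how profiles were compared, so the similarity measure used here (the number of n-grams two profiles share) is an assumption, as are the function names `ngram_profile` and `attribute` and the toy samples.

```python
from collections import Counter


def ngram_profile(data: bytes, n: int = 3, size: int = 500) -> set:
    """Return the `size` most frequent byte-level n-grams in `data`."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def attribute(unknown: bytes, known: dict, n: int = 3, size: int = 500) -> str:
    """Pick the author whose profile shares the most n-grams with `unknown`.

    Assumption: similarity is the size of the profile intersection.
    """
    target = ngram_profile(unknown, n, size)
    overlaps = {
        author: len(target & ngram_profile(sample, n, size))
        for author, sample in known.items()
    }
    return max(overlaps, key=overlaps.get)


# Toy training samples: one author writes dense code, the other spaced code.
known = {
    "alice": b"for(i=0;i<n;i++){x+=i;}\nfor(j=0;j<n;j++){y+=j;}\n",
    "bob": b"for (i = 0; i < n; i++)\n{\n    x += i;\n}\n",
}
unknown = b"for(k=0;k<m;k++){z+=k;}\n"
guess = attribute(unknown, known)
print(guess)
```

Because the profile is built from raw bytes, nothing in the sketch depends on the programming language of the samples; `n` and `size` correspond to the n-gram size and profile size discussed above.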

In 2011, researchers from the University of Wisconsin created a program authorship attribution system that identified a programmer based on the binary code of a program instead of the source code. The researchers utilized machine learning and training code to determine which characteristics of the code would be helpful in describing the programming style. In an experiment testing the approach on a set of programs written by 10 different authors, the system achieved an accuracy rate of 81%. When tested using a set of programs written by almost 200 different authors, the system performed with an accuracy rate of 51%. [6]

In 2015, a team of postdoctoral researchers from Princeton University, Drexel University, the University of Maryland, and the University of Goettingen, together with researchers from the U.S. Army Research Laboratory, developed a program authorship attribution system that could determine the author of a program from a sample pool of programs written by 1,600 coders with 94 percent accuracy. The methodology consisted of four steps: [7]

  1. Disassembly - The program is disassembled to obtain information on its characteristics.
  2. Decompilation - The program is converted into a variant of C-like pseudocode through decompilation to obtain abstract syntax trees.
  3. Dimensionality reduction - The most relevant and useful features for author identification are selected.
  4. Classification - A random-forest classifier attributes the authorship of the program.

This approach analyzed various characteristics of the code, such as blank space, the use of tabs and spaces, and the names of variables, and then used an evaluation method called syntax tree analysis, which translated the sample code into tree-like diagrams that displayed the structural decisions involved in writing the code. The design of these diagrams prioritized the order of the commands and the depths of the functions that were nested in the code. [8]
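The kinds of layout and syntax tree features described above can be illustrated with a short sketch. It uses Python's standard `ast` module on Python source as a stand-in for the C-like code analyzed in the research; the feature names and the functions `max_depth` and `syntax_features` are hypothetical and are not the researchers' feature set.

```python
import ast
from collections import Counter


def max_depth(node, depth=0):
    """Return the depth of the deepest node below `node` in the syntax tree."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)


def syntax_features(source: str) -> dict:
    """Collect a few layout and syntax tree features from Python source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    return {
        # Structural features taken from the abstract syntax tree.
        "max_tree_depth": max_depth(tree),
        "node_type_counts": Counter(type(n).__name__ for n in ast.walk(tree)),
        # Layout features: tabs versus spaces and blank lines.
        "tab_indented_lines": sum(line.startswith("\t") for line in lines),
        "space_indented_lines": sum(line.startswith(" ") for line in lines),
        "blank_lines": sum(not line.strip() for line in lines),
    }


sample = (
    "def f(xs):\n"
    "    for x in xs:\n"
    "        if x:\n"
    "            return x\n"
    "\n"
    "    return None\n"
)
features = syntax_features(sample)
print(features["max_tree_depth"], features["node_type_counts"]["Return"])
```

In a full attribution system, a feature dictionary like this would be flattened into a numeric vector per program and passed to a classifier; that step is omitted here.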

The 2014 Sony Pictures hacking attack

U.S. intelligence officials determined that the 2014 cyber attack on Sony Pictures was sponsored by North Korea after evaluating the software, techniques, and network sources involved. The attribution was made after cybersecurity experts noticed similarities between the code used in the attack and malicious software known as Shamoon, which was used in the 2013 attacks against South Korean banks and broadcasting companies by North Korea. [9]

References

  1. Claburn, Thomas (March 16, 2018). "FYI: AI tools can unmask anonymous coders from their binary executables". The Register. Retrieved August 2, 2018.
  2. De-anonymizing Programmers via Code Stylometry. August 12, 2015. ISBN 9781939133113. Retrieved August 2, 2018.
  3. Frantzeskou, Georgia; Stamatatos, Efstathios; Gritzalis, Stefanos (October 2005). "Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-Level Information". E-business and Telecommunication Networks. Communications in Computer and Information Science. Vol. 3. pp. 283–290. doi:10.1007/978-3-540-75993-5_14. ISBN 978-3-540-75992-8 via ResearchGate.
  4. Gray, Andrew; MacDonell, Stephen; Sallis, Philip (January 1998). "IDENTIFIED (Integrated Dictionary-based Extraction of Non-language-dependent Token Information for Forensic Identification, Examination, and Discrimination): A dictionary-based system for extracting source code metrics for software forensics". Proceedings. 1998 International Conference Software Engineering: Education and Practice (Cat. No.98EX220). pp. 252–259. doi:10.1109/SEEP.1998.707658. hdl:10292/3472. ISBN 978-0-8186-8828-7. S2CID 53463447 via ResearchGate.
  5. MacDonell, Stephen; Gray, Andrew; MacLennan, Grant; Sallis, Philip (February 1999). "Software forensics for discriminating between program authors using case-based reasoning, feedforward neural networks and multiple discriminant analysis". Neural Information Processing. 1. ISSN   1177-455X via ResearchGate.
  6. Rosenblum, Nathan; Zhu, Xiaojin; Miller, Barton (September 2011). "Who wrote this code? Identifying the authors of program binaries". Proceedings of the 16th European Conference on Research in Computer Security. ESORICS'11: 172–189. ISBN 978-3-642-23821-5 via ACM Digital Library.
  7. Brayboy, Joyce (January 15, 2016). "Malicious coders will lose anonymity as identity-finding research matures". U.S. Army. Retrieved August 2, 2018.
  8. Greenstadt, Rachel (February 27, 2015). "Dusting for Cyber Fingerprints: Coding Style Identifies Anonymous Programmers". Forensic Magazine. Retrieved August 2, 2018.
  9. Brunnstrom, David; Finkle, Jim (December 18, 2014). "U.S. considers 'proportional' response to Sony hacking attack". Reuters. Retrieved August 2, 2018.