Code stylometry (also known as program authorship attribution or source code authorship analysis) is the application of stylometry to computer code to attribute authorship to anonymous binary or source code. It often involves breaking down and examining the distinctive patterns and characteristics of the programming code and then comparing them to computer code whose authorship is known. Unlike software forensics, code stylometry attributes authorship for purposes other than intellectual property infringement, including plagiarism detection, copyright investigation, and authorship verification.
History
In 1989, researchers Paul Oman and Curtis Cook identified the authorship of 18 different Pascal programs written by six authors by using “markers” based on typographic characteristics.
In 1998, researchers Stephen MacDonell, Andrew Gray, and Philip Sallis developed a dictionary-based author attribution system called IDENTIFIED (Integrated Dictionary-based Extraction of Non-language-dependent Token Information for Forensic Identification, Examination, and Discrimination) that determined the authorship of source code in computer programs written in C++. The researchers noted that authorship can be identified using degrees of flexibility in the writing style of the source code, such as:
* The way the algorithm in the source code solves the given problem
* The way the source code is laid out (spacing, indentation, bordering characteristics, standard headings, etc.)
* The way the algorithm is implemented in the source code
The IDENTIFIED system attributed authorship by first merging all the relevant files to produce a single source code file and then subjecting it to a metrics analysis by counting the number of occurrences for each metric. In addition, the system was language-independent due to its ability to create new dictionary files and meta-dictionaries.
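The merge-then-count idea can be illustrated with a minimal Python sketch. It is not the IDENTIFIED implementation: the dictionary entries, the metric names, and the helper functions `merge_sources` and `count_metrics` are hypothetical and chosen only to show how a swappable dictionary keeps the counting step independent of the programming language.

```python
import re

# Hypothetical dictionary: metric name -> regular expression.
# These entries are illustrative and are not taken from IDENTIFIED.
METRIC_DICTIONARY = {
    "while_loops": r"\bwhile\b",
    "for_loops": r"\bfor\b",
    "line_comments": r"//",
    "open_braces": r"\{",
}


def merge_sources(sources):
    """Concatenate the contents of several source files into one string."""
    return "\n".join(sources)


def count_metrics(code, dictionary=METRIC_DICTIONARY):
    """Count how many times each dictionary pattern occurs in the code."""
    return {name: len(re.findall(pattern, code))
            for name, pattern in dictionary.items()}


if __name__ == "__main__":
    files = [
        "for (int i = 0; i < n; i++) { // loop\n}",
        "while (x) { x--; }",
    ]
    print(count_metrics(merge_sources(files)))
```

Supporting another language in this sketch would mean passing a different dictionary to `count_metrics`, leaving the counting code unchanged.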
In 1999, a team of researchers led by Stephen MacDonell tested the performance of three different program authorship discrimination techniques on 351 programs written in C++ by 7 different authors. The researchers compared the effectiveness of a feed-forward neural network (FFNN) trained with a back-propagation algorithm, multiple discriminant analysis (MDA), and case-based reasoning (CBR). In the experiment, the neural network and MDA each reached an accuracy rate of 81.1%, while CBR reached 88.0%.
In 2005, researchers from the Laboratory of Information and Communication Systems Security at the University of the Aegean introduced a language-independent method of program authorship attribution in which they used byte-level n-grams to classify a program to an author. This technique scanned the files and then created a table of the different n-grams found in the source code and the number of times each appeared. The system could operate with a limited number of training examples from each author, although attribution became more reliable as more source code programs were available for each author. In an experiment testing their approach, the researchers found that classification using n-grams reached an accuracy rate of up to 100%, although the rate declined drastically if the profile size exceeded 500 and the n-gram size was 3 or less.
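A table of byte-level n-gram counts can be sketched in a few lines of Python. The comparison rule used here (counting how many of the most frequent n-grams two profiles share) is an assumption made for illustration, since the description above does not specify how profiles were compared. The function names and the default values of `n` and `profile_size` are likewise hypothetical.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 3, profile_size: int = 500) -> set:
    """Return the set of the `profile_size` most frequent byte n-grams."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def attribute(sample: bytes, author_sources: dict,
              n: int = 3, profile_size: int = 500) -> str:
    """Pick the author whose profile shares the most n-grams with the sample.

    Assumption: similarity is the size of the profile intersection.
    """
    sample_profile = byte_ngram_profile(sample, n, profile_size)
    scores = {
        author: len(sample_profile & byte_ngram_profile(code, n, profile_size))
        for author, code in author_sources.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    known = {
        "alice": b"for(int i=0;i<n;i++){sum+=i;}",
        "bob": b"while ( index < count )\n{\n    total = total + index;\n}",
    }
    print(attribute(b"for(int j=0;j<m;j++){acc+=j;}", known))
```

Because the profile is built from raw bytes, the sketch never parses the program, which is what makes this kind of approach independent of the programming language.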
In 2011, researchers from the University of Wisconsin created a program authorship attribution system that identified a programmer based on the binary code of a program instead of the source code. The researchers utilized machine learning and training code to determine which characteristics of the code would be helpful in describing the programming style. In an experiment testing the approach on a set of programs written by 10 different authors, the system achieved an accuracy rate of 81%. When tested using a set of programs written by almost 200 different authors, the system performed with an accuracy rate of 51%.
In 2015, a team of postdoctoral researchers from Princeton University, Drexel University, the University of Maryland, and the University of Goettingen, as well as researchers from the U.S. Army Research Laboratory, developed a program authorship attribution system that could determine the author of a program from a sample pool of programs written by 1,600 coders with 94 percent accuracy. The methodology consisted of four steps:
# Disassembly - The program is disassembled to obtain information on its characteristics.
# Decompilation - The program is converted into a variant of C-like pseudocode through decompilation to obtain abstract syntax trees.
# Dimensionality reduction - The most relevant and useful features for author identification are selected.
# Classification - A random-forest classifier attributes the authorship of the program.
This approach analyzed various characteristics of the code, such as blank space, the use of tabs and spaces, and the names of variables, and then used a method of evaluation called syntax tree analysis, which translated the sample code into tree-like diagrams that displayed the structural decisions involved in writing the code. The design of these diagrams prioritized the order of the commands and the depths of the functions that were nested in the code.
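The kinds of features described above can be illustrated with a short Python sketch. It substitutes Python's standard `ast` module for the C-like code the researchers analyzed, and the feature names and functions are hypothetical, so it should be read as an illustration of layout and syntax-tree features in general, not as the researchers' feature set.

```python
import ast
from collections import Counter


def layout_features(source: str) -> dict:
    """Count blank lines and lines indented with tabs or with spaces."""
    lines = source.splitlines()
    return {
        "blank_lines": sum(1 for line in lines if not line.strip()),
        "tab_indented_lines": sum(1 for line in lines if line.startswith("\t")),
        "space_indented_lines": sum(1 for line in lines if line.startswith(" ")),
    }


def max_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the depth of the deepest node below `node` in the syntax tree."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)


def syntactic_features(source: str) -> dict:
    """Summarize the syntax tree: nesting depth, node types, variable names."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "max_depth": max_depth(tree),
        "node_type_counts": Counter(type(node).__name__ for node in nodes),
        "identifiers": sorted({n.id for n in nodes if isinstance(n, ast.Name)}),
    }


if __name__ == "__main__":
    sample = "def f(x):\n    if x:\n        return x\n\n    return 0\n"
    print(layout_features(sample))
    print(syntactic_features(sample))
```

In a full pipeline, dictionaries like these would be turned into numeric vectors before feature selection and classification. Those later steps are omitted here.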
The 2014 Sony Pictures hacking attack
U.S. intelligence officials were able to determine that the 2014 cyber attack on Sony Pictures was sponsored by North Korea after evaluating the software, techniques, and network sources. The attribution was made after cybersecurity experts noticed similarities between the code used in the attack and malicious software known as Shamoon, which was used in the 2013 attacks against South Korean banks and broadcasting companies by North Korea.