The notion of a self-reproducing computer program can be traced back to initial theories about the operation of complex automata.

John von Neumann John von Neumann (; hu, Neumann János Lajos, ; December 28, 1903 – February 8, 1957) was a Hungarian-American mathematician, physicist, computer scientist, engineer and polymath. He was regarded as having perhaps the widest cove ...

showed that in theory a program could reproduce itself. This constituted a plausibility result in

computability theory Computability theory, also known as recursion theory, is a branch of mathematical logic, computer science, and the theory of computation that originated in the 1930s with the study of computable functions and Turing degrees. The field has since e ...

Fred Cohen Frederick B. Cohen (born 1956) is an American computer scientist and best known as the inventor of computer virus defense techniques. He gave the definition of "computer virus". Cohen is best known for his pioneering work on computer viruses, t ...

experimented with computer viruses and confirmed Neumann's postulate and investigated other properties of malware such as detectability and self-obfuscation using rudimentary encryption. His 1988 Doctoral dissertation was on the subject of computer viruses. Cohen's faculty advisor,

Leonard Adleman Leonard Adleman (born December 31, 1945) is an American computer scientist. He is one of the creators of the RSA encryption algorithm, for which he received the 2002 Turing Award, often called the Nobel prize of Computer science. He is also kno ...

, presented a rigorous proof that, in the general case, algorithmic determination of the presence of a virus is undecidable. This problem must not be mistaken for that of determination within a broad class of programs that a virus is not present. This problem differs in that it does not require the ability to recognize all viruses. Adleman's proof is perhaps the deepest result in malware

to date and it relies on

Cantor's diagonal argument In set theory, Cantor's diagonal argument, also called the diagonalisation argument, the diagonal slash argument, the anti-diagonal argument, the diagonal method, and Cantor's diagonalization proof, was published in 1891 by Georg Cantor as a m ...

as well as the

halting problem In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever. Alan Turing proved in 1936 that a g ...

. Ironically, it was later shown by Young and Yung that Adleman's work in

cryptography Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...

is ideal in constructing a virus that is highly resistant to reverse-engineering by presenting the notion of a cryptovirus. A cryptovirus is a virus that contains and uses a public key and randomly generated symmetric cipher

initialization vector In cryptography, an initialization vector (IV) or starting variable (SV) is an input to a cryptographic primitive being used to provide the initial state. The IV is typically required to be random or pseudorandom, but sometimes an IV only needs to ...

(IV) and

session key A session key is a single-use symmetric key used for encrypting all messages in one communication session. A closely related term is content encryption key (CEK), traffic encryption key (TEK), or multicast key which refers to any key used for enc ...

(SK). In the cryptoviral extortion attack, the virus hybrid encrypts

plaintext In cryptography, plaintext usually means unencrypted information pending input into cryptographic algorithms, usually encryption algorithms. This usually refers to data that is transmitted or stored unencrypted. Overview With the advent of comp ...

data on the victim's machine using the randomly generated IV and SK. The IV+SK are then encrypted using the virus writer's

public key Public-key cryptography, or asymmetric cryptography, is the field of cryptographic systems that use pairs of related keys. Each key pair consists of a public key and a corresponding private key. Key pairs are generated with cryptographic alg ...

. In theory the victim must negotiate with the virus writer to get the IV+SK back in order to decrypt the

ciphertext In cryptography, ciphertext or cyphertext is the result of encryption performed on plaintext using an algorithm, called a cipher. Ciphertext is also known as encrypted or encoded information because it contains a form of the original plaintext ...

(assuming there are no backups). Analysis of the virus reveals the public key, not the IV and SK needed for decryption, or the private key needed to recover the IV and SK. This result was the first to show that

computational complexity theory In theoretical computer science and mathematics, computational complexity theory focuses on classifying computational problems according to their resource usage, and relating these classes to each other. A computational problem is a task solved by ...

can be used to devise malware that is robust against reverse-engineering. A growing area of computer virus research is to mathematically model the infection behavior of worms using models such as Lotka–Volterra equations, which has been applied in the study of biological virus. Various virus propagation scenarios have been studied by researchers such as propagation of computer virus, fighting virus with virus like predator codes, effectiveness of patching etc. Behavioral malware detection has been researched more recently. Most approaches to behavioral detection are based on analysis of

system call In computing, a system call (commonly abbreviated to syscall) is the programmatic way in which a computer program requests a service from the operating system on which it is executed. This may include hardware-related services (for example, acc ...

dependencies. The executed binary code is traced using

strace strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of proce ...

or more precise taint analysis to compute data-flow dependencies among

s. The result is a

directed graph In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph that is made up of a set of vertices connected by directed edges, often called arcs. Definition In formal terms, a directed graph is an ordered pa ...

G=(V,E)

such that nodes are

s, and edges represent dependencies. For example,

(s,t)\in E

if a result returned by system call

s

(either directly as a result or indirectly through output parameters) is later used as a parameter of system call

t

. The origins of the idea to use system calls to analyze software can be found in the work of Forrest et al. Christodorescu et al. point out that malware authors cannot easily reorder system calls without changing the semantics of the program, which makes system call dependency graphs suitable for malware detection. They compute a difference between malware and goodware system call dependency graphs and use the resulting graphs for detection, achieving high detection rates. Kolbitsch et al. pre-compute symbolic expressions and evaluate them on the syscall parameters observed at runtime. They detect dependencies by observing whether the result obtained by evaluation matches the parameter values observed at runtime. Malware is detected by comparing the dependency graphs of the training and test sets. Fredrikson et al. describe an approach that uncovers distinguishing features in malware system call dependency graphs. They extract significant behaviors using concept analysis and leap mining. Babic et al. recently proposed a novel approach for both malware detection and classification based on grammar inference of

tree automata A tree automaton is a type of state machine. Tree automata deal with tree structures, rather than the strings of more conventional state machines. The following article deals with branching tree automata, which correspond to regular languages ...

. Their approach infers an

automaton An automaton (; plural: automata or automatons) is a relatively self-operating machine, or control mechanism designed to automatically follow a sequence of operations, or respond to predetermined instructions.Automaton – Definition and More ...

from dependency graphs, and they show how such an automaton could be used for detection and classification of malware. Research in combining static and dynamic malware analysis techniques is also currently being conducted in an effort to minimize the shortcomings of both. Studies by researchers such as Islam et al.R. Islam, R. Tian, L. M. Batten, and S. Versteeg: Classification of malware based on integrated staic and dynamic features, Journal of Network Computer Applications, 2013, p. 646-656. are working to integrate static and dynamic techniques in order to better analyze and classify malware and malware variants.

References

Malware

See also

References