In computer programming, undefined behavior (UB) is the result of executing a program whose behavior is prescribed to be unpredictable in the language specification to which the computer code adheres. This is different from unspecified behavior, for which the language specification does not prescribe a result, and implementation-defined behavior that defers to the documentation of another component of the platform (such as the ABI or the translator documentation).
In the C community, undefined behavior may be humorously referred to as "nasal demons", after a comp.std.c post that explained undefined behavior as allowing the compiler to do anything it chooses, even "to make demons fly out of your nose".
Overview
Some programming languages allow a program to operate differently or even have a different control flow than the source code, as long as it exhibits the same user-visible side effects, if undefined behavior never happens during program execution. Undefined behavior is the name of a list of conditions that the program must not meet.
In the early versions of C, undefined behavior's primary advantage was the production of performant compilers for a wide variety of machines: a specific construct could be mapped to a machine-specific feature, and the compiler did not have to generate additional code for the runtime to adapt the side effects to match semantics imposed by the language. The program source code was written with prior knowledge of the specific compiler and of the platforms that it would support.
However, progressive standardization of the platforms has made this less of an advantage, especially in newer versions of C. Now, the cases for undefined behavior typically represent unambiguous bugs in the code, for example indexing an array outside of its bounds. By definition, the runtime can assume that undefined behavior never happens; therefore, some invalid conditions do not need to be checked against. For a compiler, this also means that various program transformations become valid, or their proofs of correctness are simplified; this allows for various kinds of optimizations whose correctness depends on the assumption that the program state never meets any such condition. The compiler can also remove explicit checks that may have been in the source code, without notifying the programmer; for example, detecting undefined behavior by testing whether it happened is not guaranteed to work, by definition. This makes it hard or impossible to program a portable fail-safe option (non-portable solutions are possible for some constructs).
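As a hedged illustration of the check removal described above, the following sketch contrasts two ways of guarding a signed addition. The function names add_checked_after and add_checked_before are hypothetical and chosen only for this example. The first tests the result after the addition, which is the kind of check a compiler may delete; the second tests the operands before the addition, so no overflowing operation is ever evaluated.

```c
#include <limits.h>
#include <stdbool.h>

/* Fragile pattern: the test runs after the addition, so any overflow (and
   thus the undefined behavior) has already happened. A compiler may assume
   that a + b cannot wrap and remove the branch. */
bool add_checked_after(int a, int b, int *out)
{
    int sum = a + b;
    if (b > 0 && sum < a)
        return false;
    *out = sum;
    return true;
}

/* Pre-check pattern: compare against INT_MAX / INT_MIN before adding, so
   the addition is only performed when its result is representable. */
bool add_checked_before(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}
```

The pre-check works for signed addition specifically; it is a sketch of one construct, not a general fail-safe for every form of undefined behavior.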
Current compiler development usually evaluates and compares compiler performance with benchmarks designed around micro-optimizations, even on platforms that are mostly used on the general-purpose desktop and laptop market (such as amd64). Therefore, undefined behavior provides ample room for compiler performance improvement, as a source code statement that invokes it is allowed to be mapped to anything at runtime.
For C and C++, the compiler is allowed to give a compile-time diagnostic in these cases, but is not required to: the implementation will be considered correct whatever it does in such cases, analogous to don't-care terms in digital logic. It is the responsibility of the programmer to write code that never invokes undefined behavior. Compilers nowadays have flags that enable run-time diagnostics of undefined behavior; for example, -fsanitize=undefined enables the "undefined behavior sanitizer" (UBSan) in gcc 4.9 and in clang. However, this flag is not the default, and enabling it is a choice of the person who builds the code.
Under some circumstances there can be specific restrictions on undefined behavior. For example, the instruction set specifications of a CPU might leave the behavior of some forms of an instruction undefined, but if the CPU supports memory protection then the specification will probably include a blanket rule stating that no user-accessible instruction may cause a hole in the operating system's security; so an actual CPU would be permitted to corrupt user registers in response to such an instruction, but would not be allowed to, for example, switch into supervisor mode.
The runtime platform can also provide some restrictions or guarantees on undefined behavior, if the toolchain or the runtime explicitly document that specific constructs found in the source code are mapped to specific well-defined mechanisms available at runtime. For example, an interpreter may document a particular behavior for some operations that are undefined in the language specification, while other interpreters or compilers for the same language may not. A compiler produces executable code for a specific ABI, filling the semantic gap in ways that depend on the compiler version: the documentation for that compiler version and the ABI specification can provide restrictions on undefined behavior. Relying on these implementation details makes the software non-portable, but portability may not be a concern if the software is not supposed to be used outside of a specific runtime.
Undefined behavior can result in a program crash or even in failures that are harder to detect and make the program look like it is working normally, such as silent loss of data and production of incorrect results.
Benefits
Documenting an operation as undefined behavior allows compilers to assume that this operation will never happen in a conforming program. This gives the compiler more information about the code and this information can lead to more optimization opportunities.
An example for the C language:
int foo(unsigned char x)
The value of x cannot be negative and, given that signed integer overflow is undefined behavior in C, the compiler can assume that value < 2147483600 will always be false. Thus the if statement, including the call to the function bar, can be ignored by the compiler since the test expression in the if has no side effects and its condition will never be satisfied. The code is therefore semantically equivalent to:
int foo(unsigned char x)
Had the compiler been forced to assume that signed integer overflow has wraparound behavior, then the transformation above would not have been legal.
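A minimal sketch of the two forms discussed above, written out under stated assumptions: a 32-bit int, an 8-bit char, and a hypothetical bar() that only counts its calls. The bodies are reconstructed from the description in the text (a variable named value, the constant 2147483600, an if that calls bar) and should be read as an illustration rather than as the definitive example. The name foo_simplified is introduced here only to hold the second form.

```c
static int bar_calls = 0;               /* hypothetical counter for illustration */
static void bar(void) { bar_calls++; }  /* hypothetical stand-in for bar() */

/* Form with the test: the condition can only hold after signed overflow. */
int foo(unsigned char x)
{
    int value = 2147483600;   /* assumes 32-bit int */
    value += x;               /* overflows, hence undefined, when x > 47 */
    if (value < 2147483600)
        bar();
    return value;
}

/* Form the compiler may treat it as: the test and the call are gone. */
int foo_simplified(unsigned char x)
{
    int value = 2147483600;
    value += x;
    return value;
}
```

For every argument that does not overflow (0 through 47 under these assumptions), both functions return the same value and bar() is never called, which is why the transformation preserves the defined behavior.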
Such optimizations become hard to spot by humans when the code is more complex and other optimizations, like inlining, take place. For example, another function may call the above function:
void run_tasks(unsigned char *ptrx)
The compiler is free to optimize away the while-loop here by applying value range analysis: by inspecting foo(), it knows that the initial value pointed to by ptrx cannot possibly exceed 47 (as any more would trigger undefined behavior in foo()); therefore, the initial check of *ptrx > 60 will always be false in a conforming program. Going further, since the result z is now never used and foo() has no side effects, the compiler can optimize run_tasks() to be an empty function that returns immediately. The disappearance of the while-loop may be especially surprising if foo() is defined in a separately compiled object file.
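The reasoning above can be sketched as follows, assuming a 32-bit int. The body of run_tasks is reconstructed from the names mentioned in the text (z, ptrx, the test *ptrx > 60); run_one_task and the tasks_run counter are hypothetical helpers added so that the sketch is self-contained, and foo is shown in its simplified form.

```c
static int tasks_run = 0;   /* hypothetical counter for illustration */

/* Simplified foo(): defined only when x <= 47 (assumes 32-bit int). */
int foo(unsigned char x)
{
    int value = 2147483600;
    value += x;
    return value;
}

/* Hypothetical task: records the call and consumes one unit of work. */
static void run_one_task(unsigned char *ptrx, int z)
{
    (void)z;
    tasks_run++;
    (*ptrx)--;
}

void run_tasks(unsigned char *ptrx)
{
    int z = foo(*ptrx);      /* a defined call implies *ptrx <= 47 */
    while (*ptrx > 60) {     /* so the compiler may treat this as false */
        run_one_task(ptrx, z);
    }
}
```

In any execution without undefined behavior the loop body never runs, so removing the loop changes nothing that a conforming program could observe.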
Another benefit from allowing signed integer overflow to be undefined is that it makes it possible to store and manipulate a variable's value in a processor register that is larger than the size of the variable in the source code. For example, if the type of a variable as specified in the source code is narrower than the native register width (such as int on a 64-bit machine, a common scenario), then the compiler can safely use a signed 64-bit integer for the variable in the machine code it produces, without changing the defined behavior of the code. If a program depended on the behavior of a 32-bit integer overflow, then a compiler would have to insert additional logic when compiling for a 64-bit machine, because the overflow behavior of most machine instructions depends on the register width.
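A small sketch of code where this freedom may matter, assuming a target with 32-bit int and 64-bit registers. The function name sum_through is hypothetical. Whether a given compiler actually widens the index is an implementation choice; the sketch only shows a loop for which the language permits it.

```c
/* Sums a[0] through a[last], inclusive. */
long long sum_through(const int *a, int last)
{
    long long total = 0;
    /* If last were INT_MAX, i++ would eventually overflow. Because that is
       undefined, the compiler may assume it never happens and keep i in a
       wider register without emulating 32-bit wraparound. */
    for (int i = 0; i <= last; i++)
        total += a[i];
    return total;
}
```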
Undefined behavior also allows more compile-time checks by both compilers and static program analysis.
Risks
The C and C++ standards have several forms of undefined behavior throughout, which offer increased liberty in compiler implementations and compile-time checks at the expense of undefined run-time behavior if present. In particular, the ISO standard for C has an appendix listing common sources of undefined behavior. Moreover, compilers are not required to diagnose code that relies on undefined behavior. Hence, it is common for programmers, even experienced ones, to rely on undefined behavior either by mistake, or simply because they are not well-versed in the rules of the language, which can span hundreds of pages. This can result in bugs that are exposed when a different compiler, or different settings, are used. Testing or fuzzing with dynamic undefined behavior checks enabled, e.g., the Clang sanitizers, can help to catch undefined behavior not diagnosed by the compiler or static analyzers.
Undefined behavior can lead to security vulnerabilities in software. For example, buffer overflows and other security vulnerabilities in the major web browsers are due to undefined behavior. The Year 2038 problem is another example due to signed integer overflow. When GCC's developers changed their compiler in 2008 such that it omitted certain overflow checks that relied on undefined behavior, CERT issued a warning against the newer versions of the compiler. Linux Weekly News pointed out that the same behavior was observed in PathScale C, Microsoft Visual C++ 2005 and several other compilers; the warning was later amended to warn about various compilers.
Examples in C and C++
The major forms of undefined behavior in C can be broadly classified as: spatial memory safety violations, temporal memory safety violations, integer overflow, strict aliasing violations, alignment violations, unsequenced modifications, data races, and loops that neither perform I/O nor terminate.
In C the use of any automatic variable before it has been initialized yields undefined behavior, as does integer division by zero, signed integer overflow, indexing an array outside of its defined bounds (see buffer overflow), or null pointer dereferencing. In general, any instance of undefined behavior leaves the abstract execution machine in an unknown state, and causes the behavior of the entire program to be undefined.
Attempting to modify a string literal causes undefined behavior (ISO/IEC 14882:2003(E): Programming Languages - C++, §2.13.4 String literals [lex.string], para. 2):
char *p = "wikipedia"; // valid C, deprecated in C++98/C++03, ill-formed as of C++11
p[0] = 'W'; // undefined behavior
Integer division by zero results in undefined behavior (ISO/IEC 14882:2003(E): Programming Languages - C++, §5.6 Multiplicative operators [expr.mul], para. 4):
int x = 1;
return x / 0; // undefined behavior
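One way to avoid the undefined division is to test the divisor first. The sketch below uses a hypothetical helper named div_checked; it also rejects INT_MIN / -1, a case not mentioned above but one where the quotient is not representable in int and the division is likewise undefined.

```c
#include <limits.h>
#include <stdbool.h>

/* Writes num / den to *out and returns true when the division is defined;
   returns false (leaving *out untouched) otherwise. */
bool div_checked(int num, int den, int *out)
{
    if (den == 0)
        return false;                     /* division by zero */
    if (num == INT_MIN && den == -1)
        return false;                     /* quotient would overflow */
    *out = num / den;
    return true;
}
```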
Certain pointer operations may result in undefined behavior (ISO/IEC 14882:2003(E): Programming Languages - C++, §5.7 Additive operators [expr.add], para. 5):
int arr[4] = {0, 1, 2, 3};
int *p = arr + 5; // undefined behavior for indexing out of bounds
p = NULL;
int a = *p; // undefined behavior for dereferencing a null pointer
In C and C++, the relational comparison of pointers to objects (for less-than or greater-than comparison) is only strictly defined if the pointers point to members of the same object, or elements of the same array (ISO/IEC 14882:2003(E): Programming Languages - C++, §5.9 Relational operators [expr.rel], para. 2). Example:
int main(void)
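For contrast, a hedged sketch of the defined case: both pointers below always point into the same array, so comparing them with < is well defined. The function name reverse_ints is hypothetical and used only for this illustration.

```c
#include <stddef.h>

/* Reverses the n elements of a in place using two pointers that walk
   toward each other. lo and hi never leave the array a[0..n-1], so the
   relational comparison lo < hi is defined. */
void reverse_ints(int *a, size_t n)
{
    if (n == 0)
        return;
    int *lo = a;
    int *hi = a + n - 1;
    while (lo < hi) {
        int tmp = *lo;
        *lo = *hi;
        *hi = tmp;
        lo++;
        hi--;
    }
}
```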
Reaching the end of a value-returning function (other than main()) without a return statement results in undefined behavior if the value of the function call is used by the caller (ISO/IEC 9899:2007(E): Programming Languages - C, §6.9 External definitions, para. 1):
int f()
{
} /* undefined behavior if the value of the function call is used */
Modifying an object between two sequence points more than once produces undefined behavior. There are considerable changes in what causes undefined behavior in relation to sequence points as of C++11.
Modern compilers can emit warnings when they encounter multiple unsequenced modifications to the same object. The following example will cause undefined behavior in both C and C++.
int f(int i)
When modifying an object between two sequence points, reading the value of the object for any other purpose than determining the value to be stored is also undefined behavior (ISO/IEC 9899:1999(E): Programming Languages - C, §6.5 Expressions, para. 2):
a[i] = i++; // undefined behavior
printf("%d %d\n", ++n, power(2, n)); // also undefined behavior
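A hedged sketch of how such an expression can be made well defined: split the read and the modification into separate statements, so that a sequence point separates them. The helper name store_and_advance is hypothetical, and the ordering it implements (store first, then increment) is only one of the meanings a programmer might have intended.

```c
/* Stores the current index value at a[*i], then advances the index.
   The two statements are separate full expressions, so the read of *i
   and its modification are sequenced. */
void store_and_advance(int *a, int *i)
{
    a[*i] = *i;
    *i += 1;
}
```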
In C/C++ bitwise shifting a value by a number of bits which is either a negative number or is greater than or equal to the total number of bits in this value results in undefined behavior. The safest way (regardless of compiler vendor) is to always keep the number of bits to shift (the right operand of the << and >> bitwise operators) within the range [0, sizeof(value)*CHAR_BIT - 1] (where value is the left operand).
int num = -1;
unsigned int val = 1 << num; //shifting by a negative number - undefined behavior
num = 32; //or whatever number greater than 31
val = 1 << num; //the literal '1' is typed as a 32-bit integer - in this case shifting by more than 31 bits is undefined behavior
num = 64; //or whatever number greater than 63
unsigned long long val2 = 1ULL << num; //the literal '1ULL' is typed as a 64-bit integer - in this case shifting by more than 63 bits is undefined behavior
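The range rule above can be enforced with a small guard. The sketch below uses a hypothetical helper named shl_checked and assumes that unsigned int has no padding bits, so that sizeof value * CHAR_BIT equals its width (true on mainstream platforms, but an assumption nonetheless).

```c
#include <limits.h>
#include <stdbool.h>

/* Shifts value left by count bits when count is within
   [0, sizeof value * CHAR_BIT - 1]; otherwise reports failure. */
bool shl_checked(unsigned int value, int count, unsigned int *out)
{
    const int width = (int)(sizeof value * CHAR_BIT);
    if (count < 0 || count >= width)
        return false;
    *out = value << count;   /* unsigned shift: bits shifted out are discarded */
    return true;
}
```

An unsigned left operand is used deliberately, since shifting bits out of an unsigned value is defined, so only the shift count needs checking.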
Examples in Rust
While undefined behavior is never present in safe Rust, it is possible to invoke undefined behavior in unsafe Rust in many ways.
For example, creating an invalid reference (a reference which does not refer to a valid value) invokes immediate undefined behavior:
fn main()
Note that it is not necessary to use the reference; undefined behavior is invoked merely from the creation of such a reference.
See also
* Compiler
* Halt and Catch Fire
* Unspecified behavior
Further reading
* Expert C Programming.
* UB Canaries (April 2015), John Regehr (University of Utah, USA)
* Undefined Behavior in 2017 (July 2017), Pascal Cuoq (TrustInSoft, France) and John Regehr (University of Utah, USA)
External links
Corrected version of the C99 standard Look at section 6.10.6 for #pragma