HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, bootstrapping is the technique for producing a self-compiling compiler – that is, a
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
(or
assembler Assembler may refer to: Arts and media * Nobukazu Takemura, avant-garde electronic musician, stage name Assembler * Assemblers, a fictional race in the ''Star Wars'' universe * Assemblers, an alternative name of the superhero group Champions of ...
) written in the source
programming language A programming language is a system of notation for writing computer programs. Most programming languages are text-based formal languages, but they may also be graphical. They are a kind of computer language. The description of a programming ...
that it intends to compile. An initial core version of the compiler (the ''bootstrap compiler'') is generated in a different language (which could be assembly language); successive expanded versions of the compiler are developed using this minimal subset of the language. The problem of compiling a self-compiling compiler has been called the chicken-or-egg problem in compiler design, and bootstrapping is a solution to this problem. Many compilers for many programming languages are bootstrapped, including compilers for
BASIC BASIC (Beginners' All-purpose Symbolic Instruction Code) is a family of general-purpose, high-level programming languages designed for ease of use. The original version was created by John G. Kemeny and Thomas E. Kurtz at Dartmouth College ...
,
ALGOL ALGOL (; short for "Algorithmic Language") is a family of imperative computer programming languages originally developed in 1958. ALGOL heavily influenced many other languages and was the standard method for algorithm description used by the ...
, C, C#, D,
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Frenc ...
,
PL/I PL/I (Programming Language One, pronounced and sometimes written PL/1) is a procedural, imperative computer programming language developed and published by IBM. It is designed for scientific, engineering, business and system programming. I ...
,
Haskell Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...
,
Modula-2 Modula-2 is a structured, procedural programming language developed between 1977 and 1985/8 by Niklaus Wirth at ETH Zurich. It was created as the language for the operating system and application software of the Lilith personal workstation. It ...
,
Oberon Oberon () is a king of the fairies in medieval and Renaissance literature. He is best known as a character in William Shakespeare's play ''A Midsummer Night's Dream'', in which he is King of the Fairies and spouse of Titania, Queen of the Fairi ...
,
OCaml OCaml ( , formerly Objective Caml) is a general-purpose, multi-paradigm programming language Programming paradigms are a way to classify programming languages based on their features. Languages can be classified into multiple paradigms. ...
,
Common Lisp Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ''ANSI INCITS 226-1994 (S20018)'' (formerly ''X3.226-1994 (R1999)''). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived fr ...
,
Scheme A scheme is a systematic plan for the implementation of a certain idea. Scheme or schemer may refer to: Arts and entertainment * ''The Scheme'' (TV series), a BBC Scotland documentary series * The Scheme (band), an English pop band * ''The Schem ...
, Go,
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
,
Elixir ELIXIR (the European life-sciences Infrastructure for biological Information) is an initiative that will allow life science laboratories across Europe to share and store their research data as part of an organised network. Its goal is to bring t ...
,
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO( ...
, Python, Scala, Nim, Eiffel,
TypeScript TypeScript is a free and open source programming language developed and maintained by Microsoft. It is a strict syntactical superset of JavaScript and adds optional static typing to the language. It is designed for the development of large app ...
,
Vala Vala or VALA may refer to: Religion and mythology * Vala (Vedic), a demon or a stone cavern in the Hindu scriptures * Völva, also spelled Vala, a priestess in Norse mythology and Norse paganism Fiction * Vala (Middle-earth), an angelic being in ...
, Zig and more.


Process

A typical bootstrap process works in three or four stages: * Stage 0: preparing an environment for the ''bootstrap compiler'' to work with. This is where the source language and output language of the bootstrap compiler are chosen. In the case of a "
bare machine In computer science, bare machine (or bare metal) refers to a computer executing instructions directly on logic hardware without an intervening operating system. Modern operating systems evolved through various stages, from elementary to the pre ...
" (one where no compiler for any language exist) the source and output are written as binary
machine code In computer programming, machine code is any low-level programming language, consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Each instruction causes the CPU to perform a ve ...
, or may be created by
cross compiling A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is running. For example, a compiler that runs on a PC but generates code that runs on an Android smartphone is a cross ...
on some other machine than the target. Otherwise, the bootstrap compiler is to be written in one of the programming languages which does exist on the target machine, and that compiler will generate something which can execute on the target, including a
high-level programming language In computer science, a high-level programming language is a programming language with strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language ''elements'', be easier to u ...
, an
assembly language In computer programming, assembly language (or assembler language, or symbolic machine code), often referred to simply as Assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence b ...
, an
object file An object file is a computer file containing object code, that is, machine code output of an assembler or compiler. The object code is usually relocatable, and not usually directly executable. There are various formats for object files, and the ...
, or even machine code. * Stage 1: the bootstrap compiler is produced. This compiler is enough to translate its own source into a program which can be executed on the target machine. At this point, all further development is done using the language defined by the bootstrap compiler, and stage 2 begins. * Stage 2: a full compiler is produced by the bootstrap compiler. This is typically done in stages as needed, e.g. compiler for version X of the language will be able to compile features from version X+1, but that compiler does not actually use those features. Once this compiler has been tested and can compile itself, now version X+1 features may be used by subsequent releases of the compiler. * Stage 3: a full compiler is produced by the stage 2 full compiler. If more features are to be added, work resumes at stage 2, with the current stage 3 full compiler replacing the bootstrap compiler. The full compiler is built twice in order to compare the outputs of the two stages. If they are different, either the bootstrap or the full compiler contains a bug.


Advantages

Bootstrapping a compiler has the following advantages:"Compiler Construction and Bootstrapping" by P.D.Terry 2000
HTML

PDF
.
* It is a non-trivial test of the language being compiled, and as such is a form of
dogfooding Eating your own dog food or "dogfooding" is the practice of using one's own products or services. This can be a way for an organization to test its products in real-world usage using product management techniques. Hence dogfooding can act as qual ...
. * Compiler developers and bug reporters only need to know the language being compiled. * Compiler development can be performed in the higher-level language being compiled. * Improvements to the compiler's back-end improve not only general-purpose programs but also the compiler itself. * It is a comprehensive consistency check as it should be able to reproduce its own object code. Note that some of these points assume that the language runtime is also written in the same language.


Methods

If one needs to compile a compiler for language X written in language X, there is the issue of how the first compiler can be compiled. The different methods that are used in practice include: * Implementing an interpreter or
compiler In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
for language X in language Y.
Niklaus Wirth Niklaus Emil Wirth (born 15 February 1934) is a Swiss computer scientist. He has designed several programming languages, including Pascal, and pioneered several classic topics in software engineering. In 1984, he won the Turing Award, generally ...
reported that he wrote the first
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Frenc ...
compiler in Fortran. * Another interpreter or compiler for X has already been written in another language Y; this is how
Scheme A scheme is a systematic plan for the implementation of a certain idea. Scheme or schemer may refer to: Arts and entertainment * ''The Scheme'' (TV series), a BBC Scotland documentary series * The Scheme (band), an English pop band * ''The Schem ...
is often bootstrapped. * Earlier versions of the compiler were written in a subset of X for which there existed some other compiler; this is how some supersets of
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mo ...
,
Haskell Haskell () is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research and industrial applications, Haskell has pioneered a number of programming lan ...
, and the initial
Free Pascal Free Pascal Compiler (FPC) is a compiler for the closely related programming-language dialects Pascal and Object Pascal. It is free software released under the GNU General Public License, witexception clausesthat allow static linking against its ...
compiler are bootstrapped. * A compiler supporting non-standard language extensions or optional language features can be written without using those extensions and features, to enable it being compiled with another compiler supporting the same base language but a different set of extensions and features. The main parts of the
C++ C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
compiler
clang Clang is a compiler front end for the C, C++, Objective-C, and Objective-C++ programming languages, as well as the OpenMP, OpenCL, RenderScript, CUDA, and HIP frameworks. It acts as a drop-in replacement for the GNU Compiler Collection ...
were written in a subset of C++ that can be compiled by both g++ and
Microsoft Visual C++ Microsoft Visual C++ (MSVC) is a compiler for the C, C++ and C++/CX programming languages by Microsoft. MSVC is proprietary software; it was originally a standalone product but later became a part of Visual Studio and made available in both tri ...
. Advanced features are written with some GCC extensions. * The compiler for X is cross compiled from another architecture where there exists a compiler for X; this is how compilers for C are usually ported to other platforms. Also this is the method used for
Free Pascal Free Pascal Compiler (FPC) is a compiler for the closely related programming-language dialects Pascal and Object Pascal. It is free software released under the GNU General Public License, witexception clausesthat allow static linking against its ...
after the initial bootstrap. * Writing the compiler in X; then hand-compiling it from source (most likely in a non-optimized way) and running that on the code to get an optimized compiler.
Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
used this for his
WEB Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...
literate programming Literate programming is a programming paradigm introduced in 1984 by Donald Knuth in which a computer program is given as an explanation of its logic in a natural language, such as English, interspersed (embedded) with snippets of macros an ...
system. Methods for distributing compilers in source code include providing a portable
bytecode Bytecode (also called portable code or p-code) is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references (norma ...
version of the compiler, so as to ''bootstrap'' the process of compiling the compiler with itself. The
T-diagram In computing, tombstone diagrams (or T-diagrams) consist of a set of “puzzle pieces” representing compilers and other related language processing programs. They are used to illustrate and reason about transformations from a source language ...
is a
notation In linguistics and semiotics, a notation is a system of graphics or symbols, characters and abbreviated expressions, used (for example) in artistic and scientific disciplines to represent technical facts and quantities by convention. Therefore, ...
used to explain these compiler bootstrap techniques. In some cases, the most convenient way to get a complicated compiler running on a system that has little or no software on it involves a series of ever more sophisticated assemblers and compilers.


History

Assemblers were the first language tools to bootstrap themselves. The first high-level language to provide such a bootstrap was
NELIAC The Navy Electronics Laboratory International ALGOL Compiler (NELIAC) is a dialect and compiler implementation of the programming language ALGOL 58, developed by the Navy Electronics Laboratory (NEL) in 1958. It was designed for numeric and logi ...
in 1958. The first widely used languages to do so were
Burroughs B5000 The Burroughs Large Systems Group produced a family of large 48-bit mainframes using stack machine instruction sets with dense syllables.E.g., 12-bit syllables for B5000, 8-bit syllables for B6500 The first machine in the family was the B5000 i ...
Algol in 1961 and
LISP A lisp is a speech impairment in which a person misarticulates sibilants (, , , , , , , ). These misarticulations often result in unclear speech. Types * A frontal lisp occurs when the tongue is placed anterior to the target. Interdental lispin ...
in 1962. Hart and Levin wrote a LISP compiler in LISP at MIT in 1962, testing it inside an existing LISP interpreter. Once they had improved the compiler to the point where it could compile its own source code, it was self-hosting. This technique is only possible when an interpreter already exists for the very same language that is to be compiled. It borrows directly from the notion of running a program on itself as input, which is also used in various proofs in
theoretical computer science computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumscribe the ...
, such as the variation of the proof that the
halting problem In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever. Alan Turing proved in 1936 that a ...
is undecidable that uses
Rice's Theorem In computability theory, Rice's theorem states that all non-trivial semantic properties of programs are undecidable. A semantic property is one about the program's behavior (for instance, does the program terminate for all inputs), unlike a synta ...
.


Current efforts

Due to security concerns regarding the Trusting Trust Attack and various attacks against binary trustworthiness, multiple projects are working to reduce the effort for not only bootstrapping from source but also allowing everyone to verify that source and executable correspond. These include the Bootstrappable builds project and the Reproducible builds project.


See also

* Self-hosting *
Self-interpreter In computer science, an interpreter is a computer program that directly execution (computers), executes instructions written in a Programming language, programming or scripting language, without requiring them previously to have been Compiler, co ...
* Indirect self-modification * Tombstone diagram * Metacompiler


References

{{DEFAULTSORT:Bootstrapping (Compilers) Compilers Compiler construction Compiler theory