In computing, a polyglot is a computer program or script (or other file) written in a valid form of multiple programming languages or file formats. The name was coined by analogy to multilingualism. A polyglot file is composed by combining syntax from two or more different formats. When the file formats are to be compiled or interpreted as source code, the file can be said to be a polyglot program, though file formats and source code syntax are both fundamentally streams of bytes, and exploiting this commonality is key to the development of polyglots. Polyglot files have practical applications in compatibility, but can also present a security risk when used to bypass validation or to exploit a vulnerability.


History

Polyglot programs have been crafted as challenges and curios in hacker culture since at least the early 1990s. A notable early example, named simply polyglot, was published on the Usenet group rec.puzzles in 1991, supporting eight languages, though this was inspired by even earlier programs. In 2000, a polyglot program was named a winner in the International Obfuscated C Code Contest. In the 21st century, polyglot programs and files gained attention as a covert channel mechanism for propagation of malware. Polyglot files have practical applications in compatibility.


Construction

A polyglot is composed by combining syntax from two or more different formats, leveraging syntactic constructs that are either common between the formats, or language-specific but carry a different meaning in each language. A file is a valid polyglot if it can be successfully interpreted by multiple interpreting programs. For example, a PDF-ZIP polyglot might be opened as a valid PDF document and also decompressed as a valid ZIP archive. To maintain validity across interpreting programs, one must ensure that constructs specific to one interpreter are not interpreted by another, and vice versa. This is often accomplished by hiding language-specific constructs in segments interpreted as comments or plain text of the other format.
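The hiding technique described above can be illustrated with a small sketch, which is not taken from the article's sources. The file below is meant to be valid both as a Python program and as a POSIX shell script: Python reads the first four lines as a single triple-quoted string literal, while a shell reads the first line as the no-op command `:` and exits before reaching the Python code. The variable name `message` and the greetings are hypothetical.

```python
''':'
echo "hello from sh"
exit 0
'''
# Everything above is one string literal to Python (plain text to be ignored),
# while a POSIX shell has already run its two commands and exited.
message = "hello from python"
print(message)
```

Saved under a hypothetical name such as `poly.txt`, the same bytes could be run with either `python poly.txt` or `sh poly.txt`; each interpreter only ever sees the part written for it.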


Examples


C, PHP, and Bash

Two commonly used techniques for constructing a polyglot program are to make use of languages that use different characters for comments, and to redefine various tokens as others in different languages. These are demonstrated in this public domain polyglot written in ANSI C, PHP and bash:

    #define a /*
    #<?php
    echo "\010Hello, world!\n";// 2> /dev/null > /dev/null \ ;
    // 2> /dev/null; x=a;
    $x=5; // 2> /dev/null \ ;
    if (($x))
    // 2> /dev/null; then
    return 0;
    // 2> /dev/null; fi
    #define e ?>
    #define b */
    #include <stdio.h>
    #define main() int main(void)
    #define printf printf(
    #define true )
    #define function
    function main()
    {
    printf "Hello, world!\n"true/* 2> /dev/null | grep -v true*/;
    return 0;
    }
    #define c /*
    main
    #*/

Note the following:
* A hash sign marks a preprocessor statement in C, but is a comment in both bash and PHP.
* "//" is a comment in both PHP and C and the root directory in bash.
* Shell redirection is used to eliminate undesirable outputs.
* Even on commented out lines, the "<?php" and "?>" PHP indicators still have effect.
* The statement "function main()" is valid in both PHP and bash; C #defines are used to convert it into "int main(void)" at compile time.
* Comment indicators can be combined to perform various operations.
* "if (($x))" is a valid statement in both bash and PHP.
* printf is a bash shell builtin which is identical to the C printf except for its omission of brackets (which the C preprocessor adds if this is compiled with a C compiler).
* The final three lines are only used by bash, to call the main function. In PHP the main function is defined but not called, and in C there is no need to explicitly call the main function.


SNOBOL4, Win32Forth, PureBasic v4.x, and REBOL

The following is written simultaneously in SNOBOL4, Win32Forth, PureBasic v4.x, and REBOL:

    *BUFFER : A.A ; .( Hello, world !) @ To Including?
    Macro SkipThis; OUTPUT = Char(10) "Hello, World !"
    ;OneKeyInput
    Input('Char', 1, '[-f2-q1]') ; Char
    End; SNOBOL4 + PureBASIC + Win32Forth + REBOL = <3
    EndMacro: OpenConsole() : PrintN("Hello, world !")
    Repeat : Until Inkey() : Macro SomeDummyMacroHere
    REBOL [ Title: "'Hello, World !' in 4 languages"
    CopyLeft: "Developed in 2010 by Society" ] Print "Hello, world !"
    EndMacro: func [][] set-modes system/ports/input [binary: true]
    Input set-modes system/ports/input [binary: false]
    NOP:: EndMacro ; Wishing to refine it with new language ? Go on !


DOS batch file and Perl

The following file runs as a DOS batch file, then re-runs itself in Perl:

    @rem = ' --PERL--
    @echo off
    perl "%~dpnx0" %*
    goto endofperl
    @rem ';
    #!perl
    print "Hello, world!\n";
    __END__
    :endofperl

This allows creating Perl scripts that can be run on DOS systems with minimal effort. Note that there is no requirement for a file to perform exactly the same function in the different interpreters.


Types

Polyglot types include:
* ''stacks'', where multiple files are concatenated with each other
* ''parasites'', where a secondary file format is hidden within comment fields in a primary file format
* ''zippers'', where two files are mutually arranged within each other's comments
* ''cavities'', where a secondary file format is hidden within null-padded areas of the primary file.
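As a rough illustration of the parasite layout, the sketch below stores secondary bytes in the archive comment field of a ZIP file using Python's standard `zipfile` module. The function names, member names and payload are hypothetical, and the sketch shows only where the bytes end up; whether a second interpreter would accept them depends entirely on that interpreter.

```python
import io
import zipfile


def embed_in_zip_comment(members, secondary):
    """Build a ZIP archive whose comment field carries `secondary` (bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
        # The archive comment is limited to 65535 bytes by the ZIP format.
        zf.comment = secondary
    return buf.getvalue()


def read_zip_comment(data):
    """Return the archive comment of the ZIP held in `data`."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.comment


# Hypothetical payload: a shell script riding inside the primary ZIP file.
payload = b"#!/bin/sh\necho hidden\n"
parasite = embed_in_zip_comment({"a.txt": "primary content"}, payload)
```

The primary archive still lists and extracts `a.txt` as usual, while the comment bytes sit untouched at the very end of the file.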


Benefits


Polyglot markup

Polyglot markup has been proposed as a useful combination of the benefits of HTML5 and XHTML. Such documents can be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For example, in order for an HTML5 document to meet these criteria, the two requirements are that it must have an HTML5 doctype, and be written in well-formed XHTML. The same document can then be served as either HTML or XHTML, depending on browser support and MIME type.

As expressed by the ''html-polyglot recommendation'', to write a polyglot HTML5 document, the following key points should be observed:
# Processing instructions and the XML declaration are both forbidden in polyglot markup
# Specifying a document's character encoding
# The DOCTYPE
# Namespaces
# Element syntax (i.e. end tags are not optional; use self-closing tags for void elements)
# Element content
# Text (i.e. pre and textarea should not start with a newline character)
# Attributes (i.e. values must be quoted)
# Named entity references (i.e. only amp, lt, gt, apos, quot)
# Comments (i.e. use <!-- syntax -->)
# Scripting and styling polyglot markup

The most basic possible polyglot markup document therefore consists of the HTML5 doctype, an html root element that declares the XHTML namespace, a head containing a title, and a body. The title element must not be empty.

In a polyglot markup document, non-void elements (such as script, p, div) cannot be self-closing even if they are empty, as this is not valid HTML. For example, to add an empty textarea to a page, one cannot use <textarea/>, but has to use <textarea></textarea> instead.


Composing formats

The DICOM medical imaging format was designed to allow polyglotting with TIFF files, allowing efficient storage of the same image data in a file that can be interpreted by either DICOM or TIFF viewers.
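A minimal sketch of why the two formats can coexist: a DICOM file starts with a 128-byte preamble followed by the marker `DICM`, while a TIFF file is identified by an 8-byte header at offset zero, which fits inside that preamble. The function names and the default IFD offset below are hypothetical, and the code checks signatures only; it does not build or validate a complete image file.

```python
import struct

DICOM_PREAMBLE_LEN = 128  # bytes that precede the "DICM" marker


def looks_like_dicom(data):
    """True if the DICM marker follows a 128-byte preamble."""
    return data[DICOM_PREAMBLE_LEN:DICOM_PREAMBLE_LEN + 4] == b"DICM"


def looks_like_tiff(data):
    """True if data starts with a TIFF byte-order mark and the magic 42."""
    if len(data) < 8:
        return False
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        return False
    (magic,) = struct.unpack(order + "H", data[2:4])
    return magic == 42


def make_header_stub(ifd_offset=132):
    """Return 132 bytes carrying both signatures (illustrative stub only)."""
    # Little-endian TIFF header: "II", magic 42, offset of the first IFD.
    tiff_header = b"II" + struct.pack("<HI", 42, ifd_offset)
    preamble = tiff_header.ljust(DICOM_PREAMBLE_LEN, b"\x00")
    return preamble + b"DICM"


stub = make_header_stub()
print(looks_like_dicom(stub), looks_like_tiff(stub))
```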


Compatibility

The Python 2 and Python 3 programming languages were not designed to be compatible with each other, but there is sufficient commonality of syntax that a polyglot Python program can be written that runs in both versions.
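A minimal sketch of such a program, restricted to syntax that both versions accept (Python 2.7 is assumed on the Python 2 side). The names `text_type` and `greeting` are hypothetical; the pattern of probing for a name that exists in only one version is a common way to bridge the differences.

```python
import sys

try:
    text_type = unicode  # this name exists only in Python 2
except NameError:
    text_type = str      # Python 3: str is already a Unicode string


def greeting(name):
    # str.format with explicit indices behaves the same in 2.7 and 3.x.
    return "Hello, {0}, from Python {1}".format(name, sys.version_info[0])


# A single parenthesised argument is valid as a Python 2 print statement
# and as a Python 3 print() call.
print(greeting("world"))
```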


Security implications

A polyglot of two formats may steganographically compose a malicious payload within an ostensibly benign and widely accepted wrapper format, such as a JPEG file that allows arbitrary data in its comment field. A vulnerable JPEG renderer could then be coerced into executing the payload, handing control to the attacker. The mismatch between what the interpreting program expects and what the file actually contains is the root cause of the vulnerability.

SQL injection is a trivial form of polyglot, where a server naively expects user-controlled input to conform to a certain constraint, but the user supplies syntax which is interpreted as SQL code. Note that in a security context, there is no requirement for a polyglot file to be strictly valid in multiple formats; it is sufficient for the file to trigger unintended behaviour when being interpreted by its primary interpreter.

Highly flexible or extensible file formats have greater scope for polyglotting, and therefore more tightly constrained interpretation offers some mitigation against attacks using polyglot techniques. For example, the PDF file format requires that the magic number %PDF appears at byte offset zero, but many PDF interpreters waive this constraint and accept the file as valid PDF as long as the string appears within the first 1024 bytes. This creates a window of opportunity for polyglot PDF files to smuggle non-PDF content in the header of the file. The PDF format has been described as "diverse and vague", and due to significantly varying behaviour between different PDF parsing engines, it is possible to create a PDF-PDF polyglot that renders as two entirely different documents in two different PDF readers.

Detecting malware concealed within polyglot files requires more sophisticated analysis than relying on file-type identification utilities such as file. In 2019, an evaluation of commercial anti-malware software determined that several such packages were unable to detect any of the polyglot malware under test.

In 2019, the DICOM medical imaging file format was found to be vulnerable to malware injection using a PE-DICOM polyglot technique. The polyglot nature of the attack, combined with regulatory considerations, led to disinfection complications: because "the malware is essentially fused to legitimate imaging files", "incident response teams and A/V software cannot delete the malware file as it contains protected patient health information".
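The PDF header leniency described above can be sketched as two checks: a strict one that requires the magic number at offset zero, and a lenient one that accepts it anywhere in the first 1024 bytes. The function names are hypothetical and the code models only the header rule, not any particular PDF reader.

```python
PDF_MAGIC = b"%PDF"
LENIENT_WINDOW = 1024  # bytes searched by the lenient check


def is_pdf_strict(data):
    """Accept only files whose magic number sits at byte offset zero."""
    return data.startswith(PDF_MAGIC)


def is_pdf_lenient(data):
    """Accept files whose magic number lies wholly within the first 1024 bytes."""
    return PDF_MAGIC in data[:LENIENT_WINDOW]


def smuggled_prefix(data):
    """Return the bytes that precede the magic number, or b'' if there are none."""
    pos = data.find(PDF_MAGIC, 0, LENIENT_WINDOW)
    return data[:pos] if pos > 0 else b""


# Hypothetical sample: a GIF signature placed in front of a PDF header.
sample = b"GIF89a" + b"\x00" * 10 + b"%PDF-1.7\n"
print(is_pdf_strict(sample), is_pdf_lenient(sample))
```

A filter built on the strict check would reject the sample, while a reader behaving like the lenient check would still open it; that gap is the window the text refers to.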


GIFAR attack

A Graphics Interchange Format Java Archive (GIFAR) is a polyglot file that is simultaneously in the GIF and JAR file formats. This technique can be used to exploit security vulnerabilities, for example through uploading a GIFAR to a website that allows image uploading (as it is a valid GIF file), and then causing the Java portion of the GIFAR to be executed as though it were part of the website's intended code, being delivered to the browser from the same origin. Java was patched in JRE 6 Update 11, with a CVE published in December 2008. GIFARs are possible because GIF images store their header at the beginning of the file, and JAR files (as with any ZIP archive-based format) store their data at the end.
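The layout that makes this possible can be sketched by concatenating GIF bytes with a ZIP archive and confirming that a ZIP reader, which locates its directory from the end of the file, still opens the result. This is an illustration of the file structure only: the GIF bytes are a stub rather than a complete image, a plain ZIP stands in for a JAR, and the member name is hypothetical.

```python
import io
import zipfile

# Stand-in for a real image: GIF signature, a 1x1 logical screen descriptor
# and the trailer byte. An assumption for illustration, not a full picture.
GIF_STUB = b"GIF89a" + b"\x01\x00\x01\x00\x00\x00\x00" + b";"


def build_archive(members):
    """Return the bytes of a ZIP archive holding the given members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


def build_gif_zip_stack(members):
    """Concatenate the GIF stub and a ZIP archive into one byte string."""
    return GIF_STUB + build_archive(members)


stacked = build_gif_zip_stack({"hello.txt": "hi"})
print(stacked[:6], zipfile.is_zipfile(io.BytesIO(stacked)))
```

A check that looks only at the leading bytes sees a GIF signature, while a ZIP-based loader reading from the end finds a complete archive, which is the mismatch the attack relied on.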


Related terminology

* Polyglot programming, referring to the practice of building systems using multiple programming languages, but not necessarily in the same file.
* Polyglot persistence is similar, but about databases.


See also

* Quine (computing)


External links


* CSE HTML Validator for Windows with polyglot markup support
* Benefits of polyglot XHTML5
* A polyglot in 451 different languages
* A polyglot in 16 different languages
* A polyglot in 8 different languages (written in COBOL, Pascal, Fortran, C, PostScript, Unix shell, Intel x86 machine language and Perl 5)
* A polyglot in 7 different languages (written in C, Pascal, PostScript, TeX, Bash, Perl and Befunge98)
* A polyglot in 6 different languages (written in Perl, C, Unix shell, Brainfuck, Whitespace and Befunge)
* List of generic polyglots
* A PDF-MP3 polyglot, being a PDF document which is also an MP3 audio version of its content
* PoC||GTFO, a security publication published as polyglot PDF documents