File Format
   HOME

TheInfoList



OR:

A file format is a
standard Standard may refer to: Symbols * Colours, standards and guidons, kinds of military signs * Standard (emblem), a type of a large symbol or emblem used for identification Norms, conventions or requirements * Standard (metrology), an object th ...
way that information is encoded for storage in a
computer file A computer file is a computer resource for recording data in a computer storage device, primarily identified by its file name. Just as words can be written to paper, so can data be written to a computer file. Files can be shared with and trans ...
. It specifies how
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s are used to encode information in a digital storage medium. File formats may be either
proprietary {{Short pages monitor One way to incorporate file type metadata, often associated with
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, and ot ...
and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginnings of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
representation of either or , depending upon the standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string (which is not case sensitive), or an appropriate
document type definition A document type definition (DTD) is a set of ''markup declarations'' that define a ''document type'' for an SGML-family markup language ( GML, SGML, XML, HTML). A DTD defines the valid building blocks of an XML document. It defines the document ...
that starts with , or, for
XHTML Extensible HyperText Markup Language (XHTML) is part of the family of XML markup languages. It mirrors or extends versions of the widely used HyperText Markup Language (HTML), the language in which Web pages are formulated. While HTML, prior ...
, the
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
identifier, which begins with . The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML. The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type. So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific
command interpreter A command-line interpreter or command-line processor uses a command-line interface (CLI) to receive commands from a user in the form of lines of text. This provides a means of setting parameters for the environment, invoking executables and pro ...
and options to be passed to the command interpreter. Another operating system using magic numbers is
AmigaOS AmigaOS is a family of proprietary native operating systems of the Amiga and AmigaOne personal computers. It was developed first by Commodore International and introduced with the launch of the first Amiga, the Amiga 1000, in 1985. Early version ...
, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the
FourCC A FourCC ("four-character code") is a sequence of four bytes (typically ASCII) used to uniquely identify data formats. It originated from the OSType or ResType metadata system used in classic Mac OS and was adopted for the Amiga/Electronic Arts I ...
method, originating in
OSType A FourCC ("four-character code") is a sequence of four bytes (typically ASCII) used to uniquely identify data formats. It originated from the OSType or ResType metadata system used in classic Mac OS and was adopted for the Amiga/Electronic Arts I ...
on Macintosh, later adapted by
Interchange File Format Interchange File Format (IFF), is a generic container file format originally introduced by Electronic Arts in 1985 (in cooperation with Commodore) in order to facilitate transfer of data between software produced by different companies. IFF fil ...
(IFF) and derivatives.


External metadata

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself. This approach keeps the metadata separate from both the main data and the name, but is also less
portable Portable may refer to: General * Portable building, a manufactured structure that is built off site and moved in upon completion of site and utility work * Portable classroom, a temporary building installed on the grounds of a school to provide a ...
than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions— for instance, for compatibility with
MS-DOS MS-DOS ( ; acronym for Microsoft Disk Operating System, also known as Microsoft DOS) is an operating system for x86-based personal computers mostly developed by Microsoft. Collectively, MS-DOS, its rebranding as IBM PC DOS, and a few ope ...
's three character limit— most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata. Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension ). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by
FTP The File Transfer Protocol (FTP) is a standard communication protocol used for the transfer of computer files from a server to a client on a computer network. FTP is built on a client–server model architecture using separate control and data ...
transmissions or sent by email as an attachment. At the destination, the single file received has to be unzipped by a compatible utility to be useful. The problems of handling metadata are solved this way using zip files or archive files.


Mac OS type-codes

The
Mac OS Two major famlies of Mac operating systems were developed by Apple Inc. In 1984, Apple debuted the operating system that is now known as the "Classic" Mac OS with its release of the original Macintosh System Software. The system, rebranded "M ...
'
Hierarchical File System Hierarchical File System (HFS) is a proprietary file system developed by Apple Inc. for use in computer systems running Mac OS. Originally designed for use on floppy and hard disks, it can also be found on read-only media such as CD-ROMs. HFS ...
stores codes for '' creator'' and '' type'' as part of the directory entry for each file. These codes are referred to as OSTypes. These codes could be any 4-byte sequence but were often selected so that the ASCII representation formed a sequence of meaningful characters, such as an abbreviation of the application's name or the developer's initials. For instance a
HyperCard HyperCard is a software application and development kit for Apple Macintosh and Apple IIGS computers. It is among the first successful hypermedia systems predating the World Wide Web. HyperCard combines a flat-file database with a graphical, fl ...
"stack" file has a ''creator'' of (from Hypercard's previous name, "WildCard") and a ''type'' of . The
BBEdit BBEdit is a proprietary text editor made by Bare Bones Software, originally developed for Macintosh System Software 6, and currently supporting macOS. History The first version of BBEdit was created as a "bare bones" text editor to serve as a "p ...
text editor has a creator code of referring to its original programmer, Rich Siegel. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of , but each open in a different program, due to having differing creator codes. This feature was intended so that, for example, human-readable plain-text files could be opened in a general-purpose text editor, while programming or HTML code files would open in a specialized editor or IDE. However, this feature was often the source of user confusion, as which program would launch when the files were double-clicked was often unpredictable.
RISC OS RISC OS is a computer operating system originally designed by Acorn Computers Ltd in Cambridge, England. First released in 1987, it was designed to run on the ARM chipset, which Acorn had designed concurrently for use in its new line of Archim ...
uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions—e.g. the
hexadecimal In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, hexa ...
number is "aliased" to , representing a
PostScript PostScript (PS) is a page description language in the electronic publishing and desktop publishing realm. It is a dynamically typed, concatenative programming language. It was created at Adobe Systems by John Warnock, Charles Geschke, Doug Br ...
file.


Mac OS X uniform type identifiers (UTIs)

A Uniform Type Identifier (UTI) is a method used in
macOS macOS (; previously OS X and originally Mac OS X) is a Unix operating system developed and marketed by Apple Inc. since 2001. It is the primary operating system for Apple's Mac computers. Within the market of desktop and lapt ...
for uniquely identifying "typed" classes of entities, such as file formats. It was developed by
Apple An apple is an edible fruit produced by an apple tree (''Malus domestica''). Apple fruit tree, trees are agriculture, cultivated worldwide and are the most widely grown species in the genus ''Malus''. The tree originated in Central Asia, wh ...
as a replacement for OSType (type & creator codes). The UTI is a
Core Foundation Core Foundation (also called CF) is a C application programming interface (API) written by Apple for its operating systems, and is a mix of low-level routines and wrapper functions. Most Core Foundation routines follow a certain naming conventi ...
string, which uses a reverse-DNS string. Some common and standard types use a domain called (e.g. for a
Portable Network Graphics Portable Network Graphics (PNG, officially pronounced , colloquially pronounced ) is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange F ...
image), while other domains can be used for third-party types (e.g. for
Portable Document Format Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, conforms to a supertype of , which itself conforms to a supertype of . A UTI can exist in multiple hierarchies, which provides great flexibility. In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including: * Pasteboard data * Folders (directories) * Translatable types (as handled by the Translation Manager) * Bundles * Frameworks * Streaming data * Aliases and symlinks


OS/2 extended attributes

The HPFS, FAT12, and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under
OS/2 OS/2 (Operating System/2) is a series of computer operating systems, initially created by Microsoft and IBM under the leadership of IBM software designer Ed Iacobucci. As a result of a feud between the two companies over how to position OS/2 ...
). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types. The
NTFS New Technology File System (NTFS) is a proprietary journaling file system developed by Microsoft. Starting with Windows NT 3.1, it is the default file system of the Windows NT family. It superseded File Allocation Table (FAT) as the preferred fil ...
filesystem also allows storage of OS/2 extended attributes, as one of the file ''forks'', but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.


POSIX extended attributes

On Unix and
Unix-like A Unix-like (sometimes referred to as UN*X or *nix) operating system is one that behaves in a manner similar to a Unix system, although not necessarily conforming to or being certified to any version of the Single UNIX Specification. A Unix-li ...
systems, the
ext2 The ext2 or second extended file system is a file system for the Linux kernel. It was initially designed by French software developer Rémy Card as a replacement for the extended file system (ext). Having been designed according to the same pr ...
,
ext3 ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It used to be the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extend ...
,
ext4 ext4 (fourth extended filesystem) is a journaling file system for Linux, developed as the successor to ext3. ext4 was initially a series of backward-compatible extensions to ext3, many of them originally developed by Cluster File Systems for ...
,
ReiserFS ReiserFS is a general-purpose, journaling file system initially designed and implemented by a team at Namesys led by Hans Reiser and licensed under GPLv2. Introduced in version 2.4.1 of the Linux kernel, it was the first journaling file sys ...
version 3,
XFS XFS is a high-performance 64-bit journaling file system created by Silicon Graphics, Inc (SGI) in 1993. It was the default file system in SGI's IRIX operating system starting with its version 5.3. XFS was ported to the Linux kernel in 2001; as ...
, JFS, FFS, and
HFS+ HFS Plus or HFS+ (also known as Mac OS Extended or HFS Extended) is a journaling file system developed by Apple Inc. It replaced the Hierarchical File System (HFS) as the primary file system of Apple computers with the 1998 release of Mac OS 8.1 ...
filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name.


PRONOM unique identifiers (PUIDs)

The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique, and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as
Uniform Resource Identifier A Uniform Resource Identifier (URI) is a unique sequence of characters that identifies a logical or physical resource used by web technologies. URIs may be used to identify anything, including real-world objects, such as people and places, conc ...
s using the namespace. Although not yet widely used outside of the UK government and some
digital preservation In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods an ...
programs, the PUID scheme does provide greater granularity than most alternative schemes.


MIME types

MIME Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message ...
types are widely used in many
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. These consist of a standardised system of identifiers (managed by
IANA The Internet Assigned Numbers Authority (IANA) is a standards organization that oversees global IP address allocation, autonomous system number allocation, root zone management in the Domain Name System (DNS), media types, and other Interne ...
) consisting of a ''type'' and a ''sub-type'', separated by a
slash Slash may refer to: * Slash (punctuation), the "/" character Arts and entertainment Fictional characters * Slash (Marvel Comics) * Slash (''Teenage Mutant Ninja Turtles'') Music * Harry Slash & The Slashtones, an American rock band * Nash ...
—for instance, or . These were originally intended as a way of identifying what type of file was attached to an
e-mail Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic ( digital) version of, or counterpart to, mail, at a time when "mail" meant ...
, independent of the source and target operating systems. MIME types identify files on
BeOS BeOS is an operating system for personal computers first developed by Be Inc. in 1990. It was first written to run on BeBox hardware. BeOS was positioned as a multimedia platform that could be used by a substantial population of desktop users a ...
,
AmigaOS 4.0 AmigaOS 4 (abbreviated as OS4 or AOS4) is a line of Amiga operating systems which runs on PowerPC microprocessors. It is mainly based on AmigaOS 3.1 source code developed by Commodore, and partially on version 3.9 developed by Haage & Partn ...
and
MorphOS MorphOS is an AmigaOS-like computer operating system (OS). It is a mixed Proprietary software, proprietary and Open-source software, open source OS produced for the Pegasos PowerPC (PPC) processor based computer, PowerUP accelerator equipped Amig ...
, as well as store unique application signatures for application launching. In AmigaOS and MorphOS, the Mime type system works in parallel with Amiga specific Datatype system. There are problems with the MIME types though; several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.


File format identifiers (FFIDs)

File format identifiers are another, not widely used way to identify file formats according to their origin and their file category. It was created for the Description Explorer suite of software. It is composed of several digits of the form . The first part indicates the organization origin/maintainer (this number represents a value in a company/standards organization database), and the 2 following digits categorize the type of file in
hexadecimal In mathematics and computing, the hexadecimal (also base-16 or simply hex) numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, hexa ...
. The final part is composed of the usual filename extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID of where indicates an image file, is the standard number and indicates the
International Organization for Standardization The International Organization for Standardization (ISO ) is an international standard development organization composed of representatives from the national standards organizations of member countries. Membership requirements are given in Ar ...
(ISO).


File content based format identification

Another but less popular way to identify the file format is to examine the file contents for distinguishable patterns among file types. The contents of a file are a sequence of bytes and a byte has 256 unique permutations (0–255). Thus, counting the occurrence of byte patterns that is often referred to as byte frequency distribution gives distinguishable patterns to identify file types. There are many content-based file type identification schemes that use a byte frequency distribution to build the representative models for file type and use any statistical and data mining techniques to identify file types


File structure

There are several types of ways to structure data in a file. The most usual ones are described below.


Unstructured formats (raw memory dumps)

Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file. This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a
Pascal Pascal, Pascal's or PASCAL may refer to: People and fictional characters * Pascal (given name), including a list of people with the name * Pascal (surname), including a list of people and fictional characters with the name ** Blaise Pascal, Fren ...
string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple. The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.


Chunk-based formats

In this kind of file structure, each piece of data is embedded in a container that somehow identifies the data. The container's scope can be identified by start- and end-markers of some kind, by an explicit length field somewhere, or by fixed requirements of the file format's definition. Throughout the 1970s, many programs used formats of this general kind. For example, word-processors such as
troff troff (), short for "typesetter roff", is the major component of a document processing system developed by Bell Labs for the Unix operating system. troff and the related nroff were both developed from the original roff. While nroff was inten ...
,
Script Script may refer to: Writing systems * Script, a distinctive writing system, based on a repertoire of specific elements or symbols, or that repertoire * Script (styles of handwriting) ** Script typeface, a typeface with characteristics of handw ...
, and
Scribe A scribe is a person who serves as a professional copyist, especially one who made copies of manuscripts before the invention of automatic printing. The profession of the scribe, previously widespread across cultures, lost most of its promi ...
, and database export files such as CSV.
Electronic Arts Electronic Arts Inc. (EA) is an American video game company headquartered in Redwood City, California. Founded in May 1982 by Apple employee Trip Hawkins, the company was a pioneer of the early home computer game industry and promoted the d ...
and
Commodore Commodore may refer to: Ranks * Commodore (rank), a naval rank ** Commodore (Royal Navy), in the United Kingdom ** Commodore (United States) ** Commodore (Canada) ** Commodore (Finland) ** Commodore (Germany) or ''Kommodore'' * Air commodore ...
-
Amiga Amiga is a family of personal computers introduced by Commodore in 1985. The original model is one of a number of mid-1980s computers with 16- or 32-bit processors, 256 KB or more of RAM, mouse-based GUIs, and significantly improved graphi ...
also used this type of file format in 1985, with their IFF (Interchange File Format) file format. A container is sometimes called a ''"chunk"'', although "chunk" may also imply that each piece is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements. The information that identifies a particular "chunk" may be called many different things, often terms including "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not the same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its associated data as such a key). With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Depending on the actual meaning of the skipped data, this may or may not be useful (
CSS Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML or XML (including XML dialects such as SVG, MathML or XHTML). CSS is a cornerstone techno ...
explicitly defines such behavior). This concept has been used again and again by
RIFF A riff is a repeated chord progression or refrain in music (also known as an ostinato figure in classical music); it is a pattern, or melody, often played by the rhythm section instruments or solo instrument, that forms the basis or accompani ...
(Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (
Distinguished Encoding Rules X.690 is an ITU-T standard specifying several ASN.1 encoding formats: * Basic Encoding Rules (BER) * Canonical Encoding Rules (CER) * Distinguished Encoding Rules (DER) The Basic Encoding Rules (BER) were the original rules laid out by the ASN.1 s ...
) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF). Indeed, any data format must somehow identify the significance of its component parts, and embedded boundary-markers are an obvious way to do so: * MIME headers do this with a colon-separated label at the start of each logical line. MIME headers cannot contain other MIME headers, though the data content of some headers has sub-parts that can be extracted by other conventions. * CSV and similar files often do this using a header records with field names, and with commas to mark the field boundaries. Like MIME, CSV has no provision for structures with more than one level. *
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. T ...
and its kin can be loosely considered a kind of chunk-based format, since data elements are identified by markup that is akin to chunk identifiers. However, it has formal advantages such as schemas and
validation Validation may refer to: * Data validation, in computer science, ensuring that data inserted into an application satisfies defined formats and other input criteria * Forecast verification, validating and verifying prognostic output from a numerica ...
, as well as the ability to represent more complex structures such as
trees In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are u ...
, DAGs, and
chart A chart (sometimes known as a graph) is a graphical representation for data visualization, in which "the data is represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart". A chart can represent tabu ...
s. If XML is considered a "chunk" format, then
SGML The Standard Generalized Markup Language (SGML; ISO 8879:1986) is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates": * Declarative: Markup should des ...
and its predecessor
IBM GML Generalized Markup Language (GML) is a set of macros that implement intent-based (procedural) markup tags for the IBM text formatter, SCRIPT. SCRIPT/VS is the main component of IBM's Document Composition Facility (DCF). A ''starter set'' of ...
are among the earliest examples of such formats. *
JSON JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
is similar to XML without schemas, cross-references, or a definition for the meaning of repeated field-names, and is often convenient for programmers. *
YAML YAML ( and ) (''see '') is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted. YAML targets many of the same communications applications as Exte ...
is similar to JSON, but use indentation to separate data chunks and aim to be more human-readable than JSON or XML. *
Protocol Buffers Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data. The method involves an int ...
are in turn similar to JSON, notably replacing boundary-markers in the data with field numbers, which are mapped to/from names by some external mechanism.


Directory-based formats

This is another extensible format, that closely resembles a file system ( OLE Documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are
disk image A disk image, in computing, is a computer file containing the contents and structure of a disk volume or of an entire data storage device, such as a hard disk drive, tape drive, floppy disk, optical disc, or USB flash drive. A disk image is us ...
s,
executable In computing, executable code, an executable file, or an executable program, sometimes simply referred to as an executable or binary, causes a computer "to perform indicated tasks according to encoded instruction (computer science), instructi ...
s, OLE documents
TIFF Tag Image File Format, abbreviated TIFF or TIF, is an image file format for storing raster graphics images, popular among graphic artists, the publishing industry, and photographers. TIFF is widely supported by scanning, faxing, word processin ...
,
libraries A library is a collection of materials, books or media that are accessible for use and not just for display purposes. A library provides physical (hard copies) or digital access (soft copies) materials, and may be a physical location or a vir ...
. Some file formats like ODT and DOCX, being
PKZIP PKZIP is a file archiving computer program, notable for introducing the popular ZIP file format. PKZIP was first introduced for MS-DOS on the IBM-PC compatible platform in 1989. Since then versions have been released for a number of other ...
-based, are both chunked and carry a directory. The structure of a directory-based file format lends itself to modifications more easily than unstructured or chunk-based formats. The nature of this type of format allows users to carefully construct files that causes reader software to do things the authors of the format never intended to happen. An example of this is the
zip bomb In computing, a zip bomb, also known as a decompression bomb or zip of death, is a malicious archive file designed to crash or render useless the program or system reading it. It is often employed to disable antivirus software, in order to creat ...
. Directory-based file formats also use values that point at other areas in the file but if some later data value points back at data that was read earlier, it can result in an infinite loop for any reader software that assumes the input file is valid and blindly follows the loop.


See also

*
Audio file format An audio file format is a file format for storing digital audio data on a computer system. The bit layout of the audio data (excluding metadata) is called the audio coding format and can be uncompressed, or compressed to reduce the file size, o ...
*
Chemical file format A chemical file format is a type of data file which is used specifically to depicting molecular data. One of the most widely used is the chemical table file format, which is similar to ''Structure Data Format'' (SDF) files. They are text files ...
*
Comparison of executable file formats This is a comparison of binary executable file formats which, once loaded by a suitable executable loader, can be directly executed by the CPU rather than being interpreted by software. In addition to the binary application code, the executables ma ...
*
Digital container format A container format (informally, sometimes called a wrapper) or metafile is a file format that allows multiple data streams to be embedded into a single file, usually along with metadata for identifying and further detailing those streams. Notab ...
*
Document file format A document file format is a text or binary file format for storing documents on a storage media, especially for use by computers. There currently exist a multitude of incompatible document file formats. Examples of XML-based open standards are D ...
* DROID file format identification utility *
File (command) The file command is a standard program of Unix and Unix-like operating systems for recognizing the type of data contained in a computer file. History The original version of file originated in Unix Research Version 4 in 1973. System V brought ...
, a file type identification utility *
File conversion Data conversion is the conversion of computer data from one format to another. Throughout a computer environment, data is encoded in a variety of ways. For example, computer hardware is built on the basis of certain standards, which requires th ...
* Future proofing * Graphics file format summary *
Image file formats An Image file format is a file format for a digital image. There are many formats that can be used, such as JPEG, PNG, and GIF. Most formats up until 2022 were for storing 2D images, not 3D ones. The data stored in an image file format may be c ...
*
List of archive formats This is a list of file formats used by archivers and compressors used to create archive files. Archiving only Compression only Archiving and compression Data recovery Comparison Containers and compression Notes While the original ...
*
List of file formats This is a list of file formats used by computers, organized by type. Filename extension it is usually noted in parentheses if they differ from the file format name or abbreviation. Many operating systems do not limit filenames to one extension s ...
* List of file signatures, or "magic numbers" *
List of filename extensions (alphabetical) This alphabetical list of filename extensions contains extensions of notable file formats used by multiple notable applications or services. 0–9 A–E F–L M–R S–Z See also * List of file formats References

{{DEFAULTS ...
*
List of free file formats An open file format is a file format for storing digital data, defined by an openly published specification usually maintained by a standards organization, and which can be used and implemented by anyone. Open file format is licensed with open ...
*
List of motion and gesture file formats The question of gesture and Motion (physics), motion takes more and more importance with the development of gesture controllers, haptic technology, haptic systems, motion capture systems, etc., on the one hand, and with the need of allowing virtual ...
*
Magic number (programming) In computer programming, a magic number is any of the following: * A unique value with unexplained meaning or multiple occurrences which could (preferably) be replaced with a named constant * A constant numerical or text value used to identify a ...
*
Object file An object file is a computer file containing object code, that is, machine code output of an assembler or compiler. The object code is usually relocatable, and not usually directly executable. There are various formats for object files, and the ...
*
Video file format A video file format is a type of file format for storing digital video data on a computer system. Video is almost always stored using lossy compression to reduce the file size. A video file normally consists of a container (e.g. in the Matroska ...
*
Windows file types This is a list of file formats used by computers, organized by type. Filename extension it is usually noted in parentheses if they differ from the file format name or abbreviation. Many operating systems do not limit filenames to one extension ...
*
Filename extension A filename extension, file name extension or file extension is a suffix to the name of a computer file (e.g., .txt, .docx, .md). The extension indicates a characteristic of the file contents or its intended use. A filename extension is typically d ...


References

:* :* :*


External links

* * ("The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data") {{DEFAULTSORT:File format