A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or open.
Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for different types of multimedia, including any combination of audio and video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, scalable vector graphics, and the source code of computer software, are text files with defined syntaxes that allow them to be used for specific purposes.
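As a minimal sketch of what "encoded in one of various character encoding schemes" means in practice, the Python snippet below stores the same characters under two encodings and compares the resulting bytes. The sample string and the helper name `encoded_bytes` are illustrative assumptions.

```python
def encoded_bytes(text, encodings=("utf-8", "latin-1")):
    """Return a mapping from encoding name to the bytes that would be stored."""
    return {name: text.encode(name) for name in encodings}


sample = "café"  # illustrative string with one non-ASCII character
stored = encoded_bytes(sample)

# The same characters occupy different byte sequences on disk.
for name, raw in stored.items():
    print(name, len(raw), raw.hex(" "))

# Decoding with the wrong scheme produces different characters, not an error.
misread = stored["utf-8"].decode("latin-1")
print(misread)
```

A reader therefore needs to know, or guess, the encoding before the bytes of a text file can be interpreted as characters.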
Specifications
File formats often have a published specification describing the encoding method and enabling testing of a program's intended functionality. Not all formats have freely available specification documents, partly because some developers view their specification documents as trade secrets, and partly because other developers never author a formal specification document, leaving the format to be defined in practice by how existing programs use it.
If the developer of a format does not publish free specifications, another developer looking to use that kind of file must either reverse engineer the file to find out how to read it or acquire the specification document from the format's developers for a fee and by signing a non-disclosure agreement. The latter approach is possible only when a formal specification document exists. Both strategies require significant time, money, or both; therefore, file formats with publicly available specifications tend to be supported by more programs.
Patents
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats encode data using patented algorithms. For example, prior to 2004, using compression with the GIF file format required the use of a patented algorithm, and though the patent owner did not initially enforce their patent, they later began collecting royalty fees. This resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the GIF patent expired in the US in mid-2003, and worldwide in mid-2004.
Identifying file type
Different operating systems have traditionally taken different approaches to determining a particular file's format, with each approach having its own advantages and disadvantages. Most modern operating systems and individual applications need to use all of the following approaches to read "foreign" file formats, if not work with them completely.
Filename extension
One popular method used by many operating systems, including Windows, macOS, CP/M, DOS, VMS, and VM/CMS, is to determine the format of a file based on the end of its name, more specifically the letters following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original FAT file system, file names were limited to an eight-character identifier and a three-character extension, known as an 8.3 filename. There are a limited number of three-letter extensions, which can cause a given extension to be used by more than one program. Many formats still use three-character extensions even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse both the operating system and users.
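The extension-based approach can be sketched in a few lines of Python. The table `EXTENSION_FORMATS` and the function name `guess_format_by_extension` are hypothetical; real systems consult much larger registries. The sketch only looks at the letters after the final period and never opens the file.

```python
import os.path

# Hypothetical extension-to-format table, for illustration only.
EXTENSION_FORMATS = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".png": "PNG image",
    ".txt": "plain text",
}


def guess_format_by_extension(filename):
    """Return a format label based only on the text after the final period."""
    _stem, extension = os.path.splitext(filename)
    # Extensions are compared case-insensitively in this sketch.
    return EXTENSION_FORMATS.get(extension.lower(), "unknown")


print(guess_format_by_extension("index.HTML"))
print(guess_format_by_extension("README"))
```

Because the lookup depends only on the name, renaming a file changes the answer even though the contents are untouched.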
One artifact of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it: an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was often confusing to less technical users, who could accidentally make a file unusable (or "lose" it) by renaming it incorrectly.
This led most versions of Windows and Mac OS to hide the extension when listing files. This prevents the user from accidentally changing the file type, and allows expert users to turn this feature off and display the extensions.
Hiding the extension, however, can create the appearance of two or more identical filenames in the same folder. For example, a company logo may be needed both in .eps format (for publishing) and .png format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.eps" and "CompanyLogo.png". On the other hand, hiding the extensions would make both appear as "CompanyLogo", which can lead to confusion.
Hiding extensions can also pose a security risk. For example, a malicious user could create an executable program with an innocent name such as "Holiday photo.jpg.exe". The ".exe" would be hidden and an unsuspecting user would see "Holiday photo.jpg", which would appear to be a JPEG image, usually unable to harm the machine. However, the operating system would still see the ".exe" extension and run the program, which would then be able to cause harm to the computer. The same is true of files with only one extension: as it is not shown to the user, no information about the file can be deduced without explicitly investigating the file. To further trick users, it is possible to store an icon inside the program, in which case some operating systems' icon assignment for the executable file (.exe) would be overridden with an icon commonly used to represent JPEG images, making the program look like an image. Extensions can also be spoofed: some Microsoft Word macro viruses create a Word file in template format and save it with a .doc extension. Since Word generally ignores extensions and looks at the format of the file, these would open as templates, execute, and spread the virus. This represents a practical problem for Windows systems where extension-hiding is turned on by default.
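The double-extension trick described above can be made concrete with a small Python sketch. The function names, the sample filename, and the `RISKY_EXTENSIONS` set are assumptions chosen for illustration; they do not describe how any particular operating system implements extension hiding.

```python
import os.path

# Hypothetical set of extensions that this sketch treats as executable.
RISKY_EXTENSIONS = {".exe"}


def displayed_name(filename, hide_extension=True):
    """Mimic a file listing that hides only the final extension."""
    stem, _extension = os.path.splitext(filename)
    return stem if hide_extension else filename


def looks_disguised(filename):
    """Flag names whose hidden final extension is executable while the
    visible remainder still ends in another extension."""
    stem, extension = os.path.splitext(filename)
    _inner_stem, inner_extension = os.path.splitext(stem)
    return extension.lower() in RISKY_EXTENSIONS and inner_extension != ""


name = "Holiday photo.jpg.exe"  # illustrative filename
print(displayed_name(name))
print(looks_disguised(name))
```

A check of this kind is only a heuristic: it catches the naming pattern, not the icon substitution or format spoofing also mentioned above.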
Internal metadata
A second way to identify a file format is to use information regarding the format stored inside the file itself, either information meant for this purpose or binary strings that happen to always be in specific locations in files of some formats. Since the easiest place to locate them is at the beginning, such an area is usually called a "file header" when it is greater than a few bytes, or a "magic number" if it is just a few bytes long.
File header
The metadata contained in a file header is usually stored at the start of the file, but might be present in other areas too, often including the end, depending on the file format or the type of data contained. Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule. Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as a text editor or a hexadecimal editor.
As well as identifying the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and color space, and optionally authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used (Exif), and so on. Such metadata may be used by software reading or interpreting the file during the loading process and afterwards.
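As one concrete illustration of header metadata, the following Python sketch reads the pixel dimensions from the start of a PNG stream: an 8-byte signature followed by an IHDR chunk whose first two fields are the width and height as big-endian 32-bit integers. The function name `read_png_size` and the synthetic sample bytes are assumptions; the sample contains only the header fields (no checksum or image data) and is not a complete, valid PNG file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(data):
    """Return (width, height) from the leading bytes of a PNG stream."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    # Each chunk starts with a 4-byte length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Synthetic header for illustration: signature, chunk length, chunk type,
# width, height, bit depth, color type, compression, filter, interlace.
sample = PNG_SIGNATURE + struct.pack(
    ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
)
print(read_png_size(sample))
```

Only the first 24 bytes are inspected, which is why a program can report an image's size without loading the whole file.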
File headers may be used by an operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the directory information. For instance, when a graphical file manager has to display the contents of a folder, it must read the headers of many files before it can display the appropriate icons, but these will be located in different places on the storage medium, thus taking longer to access. A folder containing many files with complex metadata such as thumbnail information may require considerable time before it can be displayed.
If a header is binary hard-coded such that the header itself needs complex interpretation in order to be recognized, especially for metadata content protection's sake, there is a risk that the file format can be misinterpreted. It may even have been badly written at the source. This can result in corrupt metadata which, in extremely bad cases, might even render the file unreadable.
More complex examples of file headers are those used for wrapper (or container) file formats.
Magic number
One way to incorporate file type metadata, often associated with Unix and its derivatives, is to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginnings of files, but since any binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification.
GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE html, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.
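The checks just described can be sketched as a small content sniffer in Python. The function name `sniff_format`, the returned labels, and the 1024-byte window are assumptions made for this example; real detectors test against far larger databases of signatures.

```python
def sniff_format(data):
    """Guess a format from leading bytes instead of the file name."""
    # GIF files start with one of two fixed ASCII signatures.
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "GIF image"
    # Text formats are looser: skip leading whitespace, ignore letter case.
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<html", b"<!doctype html")):
        return "HTML document"
    if head.startswith(b"<?xml"):
        return "XML or XHTML document"
    return "unknown"


print(sniff_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_format(b"\n\n  <HTML><body>hi</body></HTML>"))
print(sniff_format(b"<!-- a comment first --><html></html>"))
```

The last call returns "unknown" even though the input is usable HTML, which shows why plain-text formats are harder to recognize by fixed signatures.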
The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, file name and metadata-based methods need to check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where file types do not lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if the file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand, a valid magic number does not guarantee that the file is not corrupt or is of a correct type.
So-called shebang lines in script files are a special case of magic numbers. There, the magic number consists of human-readable text within the file that identifies a specific interpreter and options to be passed to it.
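A minimal Python sketch of reading a shebang line follows. The function name `parse_shebang` is hypothetical, and the sketch keeps everything after the interpreter path as a single option string, which is a simplifying assumption; operating systems differ in how they split that remainder.

```python
def parse_shebang(first_line):
    """Return (interpreter, options) for a shebang line, or None otherwise."""
    # The two characters "#!" act as the magic number.
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options


print(parse_shebang("#!/bin/sh -e\n"))
print(parse_shebang("print('no shebang here')\n"))
```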
Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in the Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives.
External metadata
A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself.
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either filename extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together, along with metadata about each file and the folders/directories they came from, all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but is now transmissible as a single file across operating systems by FTP transmissions or sent by email as an attachment. At the destination, the single file received has to be unzipped by a compatible utility to be useful.
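The way an archive carries per-file metadata inside a single file can be sketched with Python's standard `zipfile` module. The helper names `pack` and `list_metadata`, the sample path, and the timestamp are illustrative assumptions; the archive is built in memory so the example does not touch the disk.

```python
import io
import zipfile


def pack(files):
    """Build a zip archive in memory.

    `files` maps an archive path to a (content bytes, modification time) pair,
    where the time is a (year, month, day, hour, minute, second) tuple.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path, (content, modified) in files.items():
            info = zipfile.ZipInfo(filename=path, date_time=modified)
            info.compress_type = zipfile.ZIP_DEFLATED  # compress each member
            archive.writestr(info, content)
    return buffer.getvalue()


def list_metadata(archive_bytes):
    """Return (path, modification time, uncompressed size) for each member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [(i.filename, i.date_time, i.file_size) for i in archive.infolist()]


packed = pack({"docs/readme.txt": (b"hello", (2004, 6, 1, 12, 30, 0))})
print(list_metadata(packed))
```

The folder path, timestamp, and size travel inside the archive itself, so they survive a transfer between systems whose file systems record metadata differently.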
Mac OS type-codes
The Mac OS' Hierarchical File System stores codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes. These codes could be any 4-byte sequence but were often selected so that the ASCII representation formed a sequence of meaningful characters, such as an abbreviation of the application's name or the developer's initials. For instance a HyperCard "stack" file has a creator of WILD (from Hypercard's previous name, "WildCard") and a type of STAK. The BBEdit text editor has a creator code of R*ch referring to its original programmer, Rich Siegel. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of TEXT, but each open in a different program, due to having differing creator codes. This feature was intended so that, for example, human-readable plain-text files could be opened in a general-purpose text editor, while programming or HTML code files would open in a specialized editor or IDE. However, this feature was often the source of user confusion, as which program would launch when the files were double-clicked was often unpredictable.
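Because an OSType is simply four bytes, it can be treated either as a short ASCII string or as a 32-bit number. The Python sketch below converts between the two views; the function names are hypothetical, the sketch restricts itself to ASCII codes, and big-endian byte order is assumed here for illustration.

```python
import struct


def ostype_to_int(code):
    """Pack a four-character ASCII code into a 32-bit unsigned integer."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four bytes")
    return struct.unpack(">I", raw)[0]


def int_to_ostype(value):
    """Recover the four-character code from its 32-bit integer form."""
    return struct.pack(">I", value).decode("ascii")


print(hex(ostype_to_int("TEXT")))
print(int_to_ostype(ostype_to_int("WILD")))
```

Storing the type and the creator as two such codes lets several files share a type (the format) while differing in creator (the program that opens them).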
RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions; e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
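A table lookup of this kind can be sketched in Python as follows. The table `FILETYPE_NAMES` is a hypothetical one-entry excerpt containing only the PostScript example (hexadecimal FF5, alias PoScript); the function name and the fallback of printing the number as three hexadecimal digits are assumptions of this sketch.

```python
# Hypothetical one-entry excerpt of a filetype table.
FILETYPE_NAMES = {0xFF5: "PoScript"}


def filetype_name(number):
    """Return the alias for a 12-bit filetype, or its hex digits if unknown."""
    if not 0 <= number <= 0xFFF:
        raise ValueError("filetype must fit in 12 bits")
    return FILETYPE_NAMES.get(number, format(number, "03X"))


print(filetype_name(0xFF5))
print(filetype_name(0x012))
```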
macOS uniform type identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in macOS for uniquely identifying "typed" classes of entities, such as file formats. It was developed by Apple as a replacement for OSType (type & creator codes).
The UTI is a Core Foundation string, which uses a reverse-DNS string. Some common and standard types use a domain called public (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
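A conformance hierarchy can be modelled as a graph in which each identifier lists the supertypes it conforms to. The Python sketch below is a toy model, not Apple's API: the `CONFORMS_TO` table is a hypothetical excerpt covering only the PNG chain (public.png, public.image, public.data), and `conforms_to` is an illustrative helper name. Each entry is a list so that one identifier may belong to several hierarchies.

```python
# Hypothetical excerpt of a conformance table: identifier -> direct supertypes.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}


def conforms_to(uti, ancestor):
    """Return True if `uti` equals `ancestor` or conforms to it transitively."""
    if uti == ancestor:
        return True
    return any(
        conforms_to(parent, ancestor) for parent in CONFORMS_TO.get(uti, [])
    )


print(conforms_to("public.png", "public.data"))
print(conforms_to("public.data", "public.png"))
```

With such a relation, a program that accepts any image can ask whether a file's type conforms to the image supertype rather than enumerating every image format.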
In addition to file formats, UTIs can also be used for other entities which can exist in macOS, including:
* Pasteboard data
* Folders (directories)
* Translatable types (as handled by the Translation Manager)
* Bundles
* Frameworks
* Streaming data
* Aliases and symlinks
VSAM Catalog
In IBM OS/VS through z/OS, the VSAM catalog (prior to ICF catalogs) and the VSAM Volume Record in the VSAM Volume Data Set (VVDS) (with ICF catalogs) identify the type of VSAM dataset.
VTOC
In IBM OS/360 through z/OS, a format 1 or 7 Data Set Control Block (DSCB) in the Volume Table of Contents (VTOC) identifies the Dataset Organization (DSORG) of the dataset described by it.
OS/2 extended attributes
The HPFS, FAT12, and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). For example, the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.
The NTFS filesystem also allows storage of OS/2 extended attributes, as one of the file ''forks'', but this feature exists only to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be parsed entirely by the applications themselves.
POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ext4, ReiserFS version 3, XFS, JFS, FFS, and HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique and a value can be accessed through its related name.
PRONOM unique identifiers (PUIDs)
The
PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique, and unambiguous identifiers for file formats, which has been developed by
The National Archives of the UK as part of its
PRONOM technical registry service. PUIDs can be expressed as
Uniform Resource Identifier
s using the namespace. Although not yet widely used outside of the UK government and some
digital preservation
programs, the PUID scheme does provide greater granularity than most alternative schemes.
MIME types
MIME
types are widely used in many
Internet
-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. They form a standardised system of identifiers (managed by
IANA
) consisting of a ''type'' and a ''sub-type'', separated by a
slash—for instance, or . These were originally intended as a way of identifying what type of file was attached to an
e-mail
, independent of the source and target operating systems. MIME types identify files on BeOS, AmigaOS 4.0 and MorphOS, and also store unique application signatures for application launching. In AmigaOS and MorphOS, the MIME type system works in parallel with the Amiga-specific Datatype system.
MIME types have practical problems, though: several organizations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
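As a minimal sketch of the type/sub-type structure described above, the following Python snippet guesses a MIME type from a filename with the standard-library `mimetypes` module and splits the result at the slash. The helper name `split_mime_type` and the sample filename are illustrative assumptions, and the guess depends on the type registry available on the local system.

```python
import mimetypes


def split_mime_type(mime_type):
    """Split 'type/sub-type' into its two parts, ignoring any parameters."""
    essence = mime_type.split(";", 1)[0].strip()
    main_type, _, sub_type = essence.partition("/")
    if not main_type or not sub_type:
        raise ValueError(f"not a valid MIME type: {mime_type!r}")
    return main_type.lower(), sub_type.lower()


# guess_type() returns (type, encoding); type is None for unknown extensions.
guessed, _encoding = mimetypes.guess_type("photo.png")
if guessed is not None:
    print(split_mime_type(guessed))
```

Because unregistered types still follow the same two-part syntax, a parser like this accepts them; whether a receiver knows what to do with them is a separate question.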
File format identifiers (FFIDs)
File format identifiers are another, not widely used way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An FFID is composed of several digits of the form . The first part indicates the organization origin/maintainer (this number represents a value in a company/standards organization database), and the following two digits categorize the type of file in
hexadecimal
. The final part is composed of the usual filename extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID of where indicates an image file, is the standard number and indicates the
International Organization for Standardization
(ISO).
File content based format identification
Another, less popular way to identify a file format is to examine the file contents for patterns that distinguish file types. The contents of a file are a sequence of bytes, and each byte can take one of 256 values (0–255). Counting how often each byte value occurs, often referred to as the byte frequency distribution, therefore gives distinguishable patterns for identifying file types. Many content-based file type identification schemes use a byte frequency distribution to build representative models of each file type and then apply statistical and data mining techniques to identify file types.
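The idea can be illustrated with a toy Python sketch. It is not one of the published schemes: the two "models" are built from tiny made-up samples, and the nearest-model rule using a sum of absolute differences is an assumption chosen for brevity.

```python
from collections import Counter


def byte_frequency_distribution(data):
    """Return 256 relative frequencies, one per possible byte value."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [counts.get(value, 0) / total for value in range(256)]


def distance(a, b):
    """Sum of absolute differences between two distributions (smaller is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(sample, models):
    """Pick the model whose distribution is closest to the sample's."""
    bfd = byte_frequency_distribution(sample)
    return min(models, key=lambda name: distance(bfd, models[name]))


# Toy models built from tiny hypothetical training samples.
models = {
    "text": byte_frequency_distribution(b"plain English text with spaces and letters"),
    "binary": byte_frequency_distribution(bytes(range(256))),
}
print(classify(b"another short line of text", models))
```

A realistic classifier would train each model on many files per type and would likely use a more robust statistical comparison.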
File structure
There are several ways to structure data in a file. The most usual ones are described below.
Unstructured formats (raw memory dumps)
Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of file is very difficult. It also creates files that might be specific to one platform or programming language (for example, a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
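The platform dependence of a raw dump can be shown with a short Python sketch. The record layout (a 16-bit id followed by a 32-bit size) is hypothetical, and byte order stands in here for the wider set of platform and language differences mentioned above.

```python
import struct

# Hypothetical record: a 16-bit id followed by a 32-bit size,
# dumped with no header, tags, or version information.
LITTLE_ENDIAN_LAYOUT = "<HI"
BIG_ENDIAN_LAYOUT = ">HI"

record = (7, 1024)
dump = struct.pack(LITTLE_ENDIAN_LAYOUT, *record)

# A reader that assumes the same layout recovers the record.
print(struct.unpack(LITTLE_ENDIAN_LAYOUT, dump))

# A reader built for another platform's layout silently gets different
# numbers, because nothing in the file says how the bytes were laid out.
print(struct.unpack(BIG_ENDIAN_LAYOUT, dump))
```

Adding a field to the record would change its size, so older readers would misread every record that follows; this is the extension problem that the structured formats below try to avoid.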
Chunk-based formats
In this kind of file structure, each piece of data is embedded in a container that somehow identifies the data. The container's scope can be identified by start- and end-markers of some kind, by an explicit length field somewhere, or by fixed requirements of the file format's definition.
Throughout the 1970s, many programs used formats of this general kind, for example word processors such as troff, Script, and Scribe, and database export files such as CSV. Electronic Arts and Commodore-Amiga also used this type of file format in 1985, with their IFF (Interchange File Format).
A container is sometimes called a ''"chunk"'', although "chunk" may also imply that each piece is small, and/or that chunks do not contain other chunks; many formats do not impose those requirements.
The information that identifies a particular "chunk" may be called many different things, often terms including "field name", "identifier", "label", or "tag". The identifiers are often human-readable, and classify parts of the data: for example, as a "surname", "address", "rectangle", "font name", etc. These are not the same thing as identifiers in the sense of a database key or serial number (although an identifier may well identify its associated data as such a key).
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand. Depending on the actual meaning of the skipped data, this may or may not be useful (CSS explicitly defines such behavior).
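The skipping behavior can be sketched in Python for a hypothetical chunk layout: a 4-byte identifier, a 4-byte big-endian length, then the payload. The layout is loosely modeled on IFF-style chunks but omits details such as padding, and the identifiers `NAME`, `BODY`, and `XTRA` are invented for the example.

```python
import struct


def iter_chunks(data):
    """Yield (identifier, payload) pairs from a buffer of length-prefixed chunks."""
    offset = 0
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError("truncated chunk header")
        ident, length = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        end = start + length
        if end > len(data):
            raise ValueError("chunk length exceeds file size")
        yield ident, data[start:end]
        offset = end


def make_chunk(ident, payload):
    """Build one chunk: identifier, payload length, payload."""
    return struct.pack(">4sI", ident, len(payload)) + payload


KNOWN = {b"NAME", b"BODY"}

blob = (
    make_chunk(b"NAME", b"demo")
    + make_chunk(b"XTRA", b"\x00\x01")
    + make_chunk(b"BODY", b"hello")
)

# Unknown identifiers are skipped: the length field says how far to jump.
understood = {ident: payload for ident, payload in iter_chunks(blob) if ident in KNOWN}
print(understood)
```

The explicit length field is what makes skipping safe here; a format that relies on end-markers instead has to scan for the marker.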
This concept has been used again and again by RIFF (Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF).
Indeed, any data format must identify the significance of its component parts, and embedded boundary-markers are an obvious way to do so:
* MIME headers do this with a colon-separated label at the start of each logical line. MIME headers cannot contain other MIME headers, though the data content of some headers has sub-parts that can be extracted by other conventions.
* CSV and similar files often do this using a header record with field names, and with commas to mark the field boundaries. Like MIME, CSV has no provision for structures with more than one level.
* XML and its kin can be loosely considered a kind of chunk-based format, since data elements are identified by markup that is akin to chunk identifiers. However, it has formal advantages such as schemas and validation, as well as the ability to represent more complex structures such as trees, DAGs, and charts. If XML is considered a "chunk" format, then SGML and its predecessor IBM GML are among the earliest examples of such formats.
* JSON is similar to XML without schemas, cross-references, or a definition for the meaning of repeated field-names, and is often convenient for programmers.
* YAML is similar to JSON, but uses indentation to separate data chunks and aims to be more human-readable than JSON or XML.
* Protocol Buffers are in turn similar to JSON, notably replacing boundary-markers in the data with field numbers, which are mapped to/from names by some external mechanism.
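To make the contrast between two of these formats concrete, the Python sketch below reads a small CSV file, whose header record names the fields once, and re-encodes the rows as JSON, where every value carries its own label. The sample data is invented; only the standard-library `csv`, `io`, and `json` modules are used.

```python
import csv
import io
import json

csv_text = "surname,address\nSmith,1 Main St\nJones,2 High St\n"

# DictReader treats the first record as the header that names each field.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON repeats the label next to every value and can nest, which CSV cannot.
as_json = json.dumps(rows)
print(as_json)
print(json.loads(as_json) == rows)
```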
Directory-based formats
This is another extensible format that closely resembles a file system (OLE Documents are actual filesystems), where the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images, executables, OLE documents, TIFF, and libraries.
Some file formats like ODT and DOCX, being PKZIP-based, are both chunked and carry a directory.
The structure of a directory-based file format lends itself to modifications more easily than unstructured or chunk-based formats. The nature of this type of format allows users to carefully construct files that cause reader software to do things the authors of the format never intended to happen. An example of this is the zip bomb. Directory-based file formats also use values that point at other areas in the file, but if some later data value points back at data that was read earlier, it can result in an infinite loop for any reader software that assumes the input file is valid and blindly follows the loop.
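One common defence against such loops is to remember which offsets have already been visited. The Python sketch below assumes a hypothetical format in which each 8-byte entry holds the offset of the next entry and a value; the `ENTRY` layout and the `END` marker are invented for illustration and do not correspond to any real file format.

```python
import struct

ENTRY = struct.Struct("<II")  # hypothetical entry: (offset of next entry, value)
END = 0xFFFFFFFF              # hypothetical "no next entry" marker


def read_entries(data, first_offset=0):
    """Follow the chain of entries, refusing to visit any offset twice."""
    values = []
    seen = set()
    offset = first_offset
    while offset != END:
        if offset in seen:
            raise ValueError(f"directory loop detected at offset {offset}")
        if offset + ENTRY.size > len(data):
            raise ValueError(f"entry at offset {offset} lies outside the file")
        seen.add(offset)
        offset, value = ENTRY.unpack_from(data, offset)
        values.append(value)
    return values


well_formed = ENTRY.pack(8, 10) + ENTRY.pack(END, 20)
malicious = ENTRY.pack(8, 10) + ENTRY.pack(0, 20)  # second entry points back to the first

print(read_entries(well_formed))
try:
    read_entries(malicious)
except ValueError as exc:
    print(exc)
```

The bounds check matters as much as the loop check: a reader that trusts offsets without validating them can be steered outside the file just as easily as into a cycle.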
See also
* Audio file format
* Chemical file format
* Comparison of executable file formats
* Digital container format
* Document file format
* DROID file format identification utility
* File (command), a file type identification utility
* File conversion
* Future proofing
* Graphics file format summary
* Image file formats
* List of archive formats
* List of file formats
* List of file signatures, or "magic numbers"
* List of filename extensions (alphabetical)
* List of free file formats
* List of motion and gesture file formats
* Magic number (programming)
* Object file
* Video file format
* Windows file types
* Filename extension
External links
* File Format Descriptions alphabetical list at the Library of Congress
* ("The file formats you use have a direct impact on your ability to open those files at a later date and on the ability of other people to access those data")