Text-based Format
   HOME

TheInfoList



OR:

A text file (sometimes spelled textfile; an old alternative name is flat file) is a kind of
computer file A computer file is a System resource, resource for recording Data (computing), data on a Computer data storage, computer storage device, primarily identified by its filename. Just as words can be written on paper, so too can data be written to a ...
that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as
CP/M CP/M, originally standing for Control Program/Monitor and later Control Program for Microcomputers, is a mass-market operating system created in 1974 for Intel 8080/Intel 8085, 85-based microcomputers by Gary Kildall of Digital Research, Dig ...
, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an
end-of-file In computing, end-of-file (EOF) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream. Details In the C standard library, the character-reading functi ...
(EOF) marker, as padding after the last line in a text file. In modern operating systems such as
DOS DOS (, ) is a family of disk-based operating systems for IBM PC compatible computers. The DOS family primarily consists of IBM PC DOS and a rebranded version, Microsoft's MS-DOS, both of which were introduced in 1981. Later compatible syste ...
,
Microsoft Windows Windows is a Product lining, product line of Proprietary software, proprietary graphical user interface, graphical operating systems developed and marketed by Microsoft. It is grouped into families and subfamilies that cater to particular sec ...
and
Unix-like A Unix-like (sometimes referred to as UN*X, *nix or *NIX) operating system is one that behaves in a manner similar to a Unix system, although not necessarily conforming to or being certified to any version of the Single UNIX Specification. A Uni ...
systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Some operating systems, such as
Multics Multics ("MULTiplexed Information and Computing Service") is an influential early time-sharing operating system based on the concept of a single-level memory.Dennis M. Ritchie, "The Evolution of the Unix Time-sharing System", Communications of t ...
, Unix-like systems, CP/M, DOS, the
classic Mac OS Mac OS (originally System Software; retronym: Classic Mac OS) is the series of operating systems developed for the Mac (computer), Macintosh family of personal computers by Apple Computer, Inc. from 1984 to 2001, starting with System 1 and end ...
, and Windows, store text files as a sequence of bytes, with an end-of-line
delimiter A delimiter is a sequence of one or more Character (computing), characters for specifying the boundary between separate, independent regions in plain text, Expression (mathematics), mathematical expressions or other Data stream, data streams. An ...
at the end of each line. Other operating systems, such as
OpenVMS OpenVMS, often referred to as just VMS, is a multi-user, multiprocessing and virtual memory-based operating system. It is designed to support time-sharing, batch processing, transaction processing and workstation applications. Customers using Op ...
and OS/360 and its successors, have
record-oriented filesystem In computer science, a record-oriented filesystem is a file system where data is stored as collections of records. This is in contrast to a byte-oriented filesystem, where the data is treated as an unformatted stream of bytes. There are seve ...
s, in which text files are stored as a sequence either of fixed-length records or of variable-length records with a record-length value in the record header. "Text file" refers to a type of container, while
plain text In computing, plain text is a loose term for data (e.g. file contents) that represent only characters of readable material but not its graphical representation nor other objects ( floating-point numbers, images, etc.). It may also include a lim ...
refers to a type of content. At a generic level of description, there are two kinds of computer files: text files and
binary file A binary file is a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document files ...
s.


Data storage

Because of their simplicity, text files are commonly used for storage of information. They avoid some of the problems encountered with other file formats, such as
endianness file:Gullivers_travels.jpg, ''Gulliver's Travels'' by Jonathan Swift, the novel from which the term was coined In computing, endianness is the order in which bytes within a word (data type), word of digital data are transmitted over a data comm ...
, padding bytes, or differences in the number of bytes in a
machine word In computing, a word is any Central processing unit, processor design's natural unit of data. A word is a fixed-sized Data (computing), datum handled as a unit by the instruction set or the hardware of the processor. The number of bits or digits ...
. Further, when
data corruption Data corruption refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, which introduce unintended changes to the original data. Computer, transmission, and storage systems use a number of meas ...
occurs in a text file, it is often easier to recover and continue processing the remaining contents. A disadvantage of text files is that they usually have a low
entropy Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...
, meaning that the information occupies more storage than is strictly necessary. A simple text file may need no additional
metadata Metadata (or metainformation) is "data that provides information about other data", but not the content of the data itself, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive ...
(other than knowledge of its
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up a c ...
) to assist the reader in interpretation. A text file may contain no data at all, which is a case of
zero-byte file A zero-byte file or zero-length file is a computer file containing no data; that is, it has a length or size of zero bytes. Creation There are many ways that could manually create a zero-byte file, for example, saving empty content in a text ed ...
.


Encoding

The ASCII character set is the most common compatible subset of character sets for English-language text files, and is generally assumed to be the default file format in many situations. It covers American English, but for the British
pound sign The pound sign () is the currency symbol, symbol for the pound unit of account, unit of Pound sterling, sterling – the currency of the United Kingdom and its associated Crown Dependencies and British Overseas Territories and previously of Kin ...
, the
euro sign The euro sign () is the currency sign used for the euro, the official currency of the eurozone. The design was presented to the public by the European Commission on 12 December 1996. It consists of a stylized letter E (or epsilon), crossed by ...
, or characters used outside English, a richer character set must be used. In many systems, this is chosen based on the default locale setting on the computer it is read on. Prior to UTF-8, this was traditionally single-byte encodings (such as
ISO-8859-1 ISO/IEC 8859-1:1998, ''Information technology—8-bit computing, 8-bit single-byte coded graphic character (computing), character sets—Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character enc ...
through ISO-8859-16) for European languages and wide character encodings for Asian languages. Because encodings necessarily have only a limited repertoire of characters, often very small, many are only usable to represent text in a limited subset of human languages.
Unicode Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
is an attempt to create a common standard for representing all known languages, and most known character sets are subsets of the very large Unicode character set. Although there are multiple character encodings available for Unicode, the most common is
UTF-8 UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from ''Unicode Transformation Format 8-bit''. Almost every webpage is transmitted as UTF-8. UTF-8 supports all 1,112,0 ...
, which has the advantage of being backwards-compatible with ASCII; that is, every ASCII text file is also a UTF-8 text file with identical meaning. UTF-8 also has the advantage that it is easily auto-detectable. Thus, a common operating mode of UTF-8 capable software, when opening files of unknown encoding, is to try UTF-8 first and fall back to a locale dependent legacy encoding when it definitely is not UTF-8.


Formats

On most operating systems, the name ''text file'' refers to a file format that allows only
plain text In computing, plain text is a loose term for data (e.g. file contents) that represent only characters of readable material but not its graphical representation nor other objects ( floating-point numbers, images, etc.). It may also include a lim ...
content with very little formatting (e.g., no bold or '' italic'' types). Such files can be viewed and edited on
text terminal A computer terminal is an electronic or electromechanical hardware device that can be used for entering data into, and transcribing data from, a computer or a computing system. Most early computers only had a front panel to input or display ...
s or in simple
text editor A text editor is a type of computer program that edits plain text. An example of such program is "notepad" software (e.g. Windows Notepad). Text editors are provided with operating systems and software development packages, and can be used to c ...
s. Text files usually have the
MIME A mime artist, or simply mime (from Greek language, Greek , , "imitator, actor"), is a person who uses ''mime'' (also called ''pantomime'' outside of Britain), the acting out of a story through body motions without the use of speech, as a the ...
type text/plain, usually with additional information indicating an encoding.


Microsoft Windows text files

DOS and
Microsoft Windows Windows is a Product lining, product line of Proprietary software, proprietary graphical user interface, graphical operating systems developed and marketed by Microsoft. It is grouped into families and subfamilies that cater to particular sec ...
use a common text file format, with each line of text separated by a two-character combination:
carriage return A carriage return, sometimes known as a cartridge return and often shortened to CR, or return, is a control character or mechanism used to reset a device's position to the beginning of a line of text. It is closely associated with the line feed ...
(CR) and
line feed A newline (frequently called line ending, end of line (EOL), next line (NEL) or line break) is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or ...
(LF). It is common for the last line of text ''not'' to be terminated with a CR-LF marker, and many text editors (including
Notepad A notebook (also known as a notepad, writing pad, drawing pad, or legal pad) is a book or stack of paper pages that are often Ruled paper, ruled and used for purposes such as note-taking, Diary, journaling or other writing, drawing, or scrapbooki ...
) do not automatically insert one on the last line. On Microsoft Windows operating systems, a file is regarded as a text file if the suffix of the name of the file (the "
filename extension A filename extension, file name extension or file extension is a suffix to the name of a computer file (for example, .txt, .mp3, .exe) that indicates a characteristic of the file contents or its intended use. A filename extension is typically d ...
") is .txt. However, many other suffixes are used for text files with specific purposes. For example, source code for computer programs is usually kept in text files that have file name suffixes indicating the
programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...
in which the source is written. Most Microsoft Windows text files use ANSI, OEM, Unicode or UTF-8 encoding. What Microsoft Windows terminology calls "ANSI encodings" are usually single-byte
ISO/IEC 8859 ISO/IEC 8859 is a joint International Organization for Standardization, ISO and International Electrotechnical Commission, IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC ...
encodings (i.e. ANSI in the Microsoft Notepad menus is really "System Code Page", non-Unicode, legacy encoding), except for in locales such as Chinese, Japanese and Korean that require double-byte character sets. ANSI encodings were traditionally used as default system locales within Microsoft Windows, before the transition to Unicode. By contrast, OEM encodings, also known as DOS code pages, were defined by
IBM International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American Multinational corporation, multinational technology company headquartered in Armonk, New York, and present in over 175 countries. It is ...
for use in the original
IBM PC The IBM Personal Computer (model 5150, commonly known as the IBM PC) is the first microcomputer released in the List of IBM Personal Computer models, IBM PC model line and the basis for the IBM PC compatible ''de facto'' standard. Released on ...
text mode display system. They typically include graphical and line-drawing characters common in DOS applications. "Unicode"-encoded Microsoft Windows text files contain text in
UTF-16 UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length as code points are encoded with one or two ''code units''. UTF-16 arose from an earli ...
Unicode Transformation Format. Such files normally begin with
byte order mark The byte-order mark (BOM) is a particular usage of the special Unicode character code, , whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text: * the byte order, or endianness, ...
(BOM), which communicates the endianness of the file content. Although UTF-8 does not suffer from endianness problems, many Microsoft Windows programs (i.e. Notepad) prepend the contents of UTF-8-encoded files with BOM, to differentiate UTF-8 encoding from other 8-bit encodings.


Unix text files

On
Unix-like A Unix-like (sometimes referred to as UN*X, *nix or *NIX) operating system is one that behaves in a manner similar to a Unix system, although not necessarily conforming to or being certified to any version of the Single UNIX Specification. A Uni ...
operating systems, text files format is precisely described:
POSIX The Portable Operating System Interface (POSIX; ) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. POSIX defines application programming interfaces (APIs), along with comm ...
defines a text file as a file that contains characters organized into zero or more lines, where lines are sequences of zero or more non-newline characters plus a terminating newline character, normally LF. Additionally, POSIX defines a as a text file whose characters are printable or space or backspace according to regional rules. This excludes most control characters, which are not printable.


Apple Macintosh text files

Prior to the advent of
macOS macOS, previously OS X and originally Mac OS X, is a Unix, Unix-based operating system developed and marketed by Apple Inc., Apple since 2001. It is the current operating system for Apple's Mac (computer), Mac computers. With ...
, the
classic Mac OS Mac OS (originally System Software; retronym: Classic Mac OS) is the series of operating systems developed for the Mac (computer), Macintosh family of personal computers by Apple Computer, Inc. from 1984 to 2001, starting with System 1 and end ...
system regarded the content of a file (the data fork) to be a text file when its
resource fork A resource fork is a fork of a file on Apple's classic Mac OS operating system that is used to store structured data. It is one of the two forks of a file, along with the data fork, which stores data that the operating system treats as unstruct ...
indicated that the type of the file was "TEXT". Lines of classic Mac OS text files are terminated with CR characters. Being a Unix-like system, macOS uses Unix format for text files. Uniform Type Identifier (UTI) used for text files in macOS is "public.plain-text"; additional, more specific UTIs are: "public.utf8-plain-text" for utf-8-encoded text, "public.utf16-external-plain-text" and "public.utf16-plain-text" for utf-16-encoded text and "com.apple.traditional-mac-plain-text" for classic Mac OS text files.


Rendering

When opened by a text editor, human-readable content is presented to the user. This often consists of the file's plain text visible to the user. Depending on the application, control codes may be rendered either as literal instructions acted upon by the editor, or as visible
escape character In computing and telecommunications, an escape character is a character that invokes an alternative interpretation on the following characters in a character sequence. An escape character is a particular case of metacharacters. Generally, the ...
s that can be edited as plain text. Though there may be plain text in a text file, control characters within the file (especially the end-of-file character) can render the plain text unseen by a particular method.


Related concepts

The use of
lightweight markup languages A lightweight markup language (LML), also termed a simple or humane markup language, is a markup language with simple, unobtrusive syntax. It is designed to be easy to write using any generic text editor and easy to read in its raw form. Lightw ...
such as
TeX Tex, TeX, TEX, may refer to: People and fictional characters * Tex (nickname), a list of people and fictional characters with the nickname * Tex Earnhardt (1930–2020), U.S. businessman * Joe Tex (1933–1982), stage name of American soul singer ...
,
markdown Markdown is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as an easy-to-read markup language. Markdown is widely used for blogging and instant messaging, and also used ...
and
wikitext A wiki ( ) is a form of hypertext publication on the internet which is collaboratively edited and managed by its audience directly through a web browser. A typical wiki contains multiple pages that can either be edited by the public or li ...
can be regarded as an extension of plain text files, as marked-up text is still wholly or partially human-readable in spite of containing machine-interpretable annotations. Early uses of
HTML Hypertext Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. It defines the content and structure of web content. It is often assisted by technologies such as Cascading Style Sheets ( ...
could also be regarded in this way, although the HTML of modern websites is largely unreadable by humans. Other file formats such as
enriched text Enriched text is a formatted text format for email, defined by the IETF in RFC 1896 and associated with the text/enriched MIME type which is defined in RFC 1563. Format It is "intended to facilitate the wider interoperation of simple enriche ...
and CSV can also be regarded as human-interpretable to some degree.


See also

*
ASCII ASCII ( ), an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English language focused) printable character, printable and 33 control character, control c ...
*
EBCDIC Extended Binary Coded Decimal Interchange Code (EBCDIC; ) is an eight- bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems. It descended from the code used with punched cards and the corresponding si ...
*
Filename extension A filename extension, file name extension or file extension is a suffix to the name of a computer file (for example, .txt, .mp3, .exe) that indicates a characteristic of the file contents or its intended use. A filename extension is typically d ...
*
List of file formats This is a list of file formats used by computers, organized by type. Filename extension is usually noted in parentheses if they differ from the file format's name or abbreviation. Many operating systems do not limit filenames to one extension s ...
*
Newline A newline (frequently called line ending, end of line (EOL), next line (NEL) or line break) is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or ...
*
Syntax highlighting Syntax highlighting is a feature of text editors that is used for programming language, programming, scripting language, scripting, or markup language, markup languages, such as HTML. The feature displays text, especially source code, in differe ...
*
Text-based protocol In computing, text-based user interfaces (TUI) (alternately terminal user interfaces, to reflect a dependence upon the properties of computer terminals and not just text), is a retronym describing a type of user interface (UI) common as an ear ...
*
Text editor A text editor is a type of computer program that edits plain text. An example of such program is "notepad" software (e.g. Windows Notepad). Text editors are provided with operating systems and software development packages, and can be used to c ...
*
Unicode Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...


Notes and references


External links

* Power of Plain Text on C2 wiki {{Computer files * Computer data