The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations. The data are presented in columns separated by spaces or tabs. The format was developed during the Human Genome Project and then adopted by other sequencing projects. As a result of this increasingly wide use, it had already become a de facto standard in bioinformatics before a formal specification was written.
One advantage of this format is that it manipulates coordinates instead of nucleotide sequences, which reduces the computing power and time needed when comparing all or part of genomes. In addition, its simplicity makes it easy to manipulate and read (or parse) coordinates or annotations using word processing software and scripting languages such as Python, Ruby or Perl, or more specialized tools such as BEDTools.
History
The end of the 20th century saw the emergence of the first projects to sequence complete genomes. Among these projects, the Human Genome Project was the most ambitious at the time, aiming to sequence for the first time a genome of several gigabases. This required the sequencing centres to carry out major methodological development in order to automate the processing of sequences and their analyses. Thus, many formats were created, such as FASTQ, GFF or BED. However, no official specifications were published at the time, which affected some formats such as FASTQ when sequencing projects multiplied at the beginning of the 21st century.
The wide use of BED within genome browsers has made it possible to define the format in a relatively stable way, as the same description is relied on by many tools.
Format
Initially the BED format did not have any official specification. Instead, the description provided by the UCSC Genome Browser has been widely used as a reference. A formal BED specification was published in 2021 under the auspices of the Global Alliance for Genomics and Health.
Description
A BED file consists of a minimum of three columns, to which nine optional columns can be added for a total of twelve columns. The first three columns contain the name of the chromosome or scaffold, the start coordinate, and the end coordinate of the sequence considered. The next nine columns contain annotations related to that sequence. The columns must be separated by spaces or tabs, the latter being recommended for reasons of compatibility between programs.
Each row of a file must have the same number of columns. The order of the columns must be respected: if a higher-numbered column is used, all the columns before it must also be filled in.
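The column rules above can be illustrated with a minimal Python sketch. The function names `parse_bed_line` and `read_bed` are hypothetical, and the checks on negative or reversed coordinates are an assumption added for illustration rather than something stated in this description. Splitting on any whitespace accepts both tabs and spaces, but it would also break a field that itself contains a space.

```python
def parse_bed_line(line):
    """Split one BED data line into (chrom, start, end, extra_columns)."""
    # Tabs are recommended, but split() with no argument also accepts spaces.
    fields = line.split()
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3 to 12 columns, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Assumption for this sketch: coordinates are non-negative and ordered.
    if start < 0 or end < start:
        raise ValueError(f"invalid interval: {start}-{end}")
    return chrom, start, end, fields[3:]


def read_bed(lines):
    """Parse data lines and check that every row has the same column count."""
    records = []
    expected = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = parse_bed_line(line)
        width = 3 + len(record[3])
        if expected is None:
            expected = width  # the first data row fixes the column count
        elif width != expected:
            raise ValueError(
                f"line {number}: {width} columns, expected {expected}"
            )
        records.append(record)
    return records
```

Because optional columns are kept as an ordered list, a file that uses column 6 necessarily carries columns 4 and 5 as well, which mirrors the ordering rule.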
Header
A BED file can optionally contain a header. However, there is no official description of the format of the header. It may contain one or more lines, and each line is introduced by a different word or symbol depending on whether its role is functional or simply descriptive. Thus, a header line can begin with one of these words or symbols:
* "browser": functional header used by the UCSC Genome Browser to set options related to it,
* "track": functional header used by genome browsers to specify display options related to it,
* "#": descriptive header to add comments such as the name of each column.
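One way to separate header lines from data lines, based only on the three prefixes listed above, is sketched here in Python. The helper names `is_header_line` and `split_header` are hypothetical. The sketch compares the first whole word with "browser" or "track", on the assumption that a sequence named, for example, "track1" should still count as data.

```python
HEADER_WORDS = ("browser", "track")


def is_header_line(line):
    """Return True if the line starts with 'browser', 'track' or '#'."""
    stripped = line.strip()
    if stripped.startswith("#"):
        return True
    # Compare the first whole word so that a name like "track1" is data.
    words = stripped.split(maxsplit=1)
    return bool(words) and words[0] in HEADER_WORDS


def split_header(lines):
    """Separate header lines from data lines, dropping blank lines."""
    header, data = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if is_header_line(stripped):
            header.append(stripped)
        else:
            data.append(stripped)
    return header, data
```

Since the header has no official format, a real reader may need to recognise other conventions as well.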
Coordinate system
Unlike the coordinate system used by other standards such as GFF, the system used by the BED format is zero-based for the start coordinate and one-based for the end coordinate. Thus, the nucleotide with the coordinate 1 in a genome will have a value of 0 in column 2 and a value of 1 in column 3.
A thousand-base BED interval with the following start and end:
chr7 0 1000
would convert to the following 1-based "human" genome coordinates, as used by a genome browser such as UCSC:
chr7 1 1000
This choice is justified by the way the lengths of genomic regions are calculated: the length is obtained by simply subtracting the start coordinate (column 2) from the end coordinate (column 3): $x_{\text{end}} - x_{\text{start}}$. When the coordinate system uses 1 to designate the first position, the calculation becomes slightly more complex: $x_{\text{end}} - x_{\text{start}} + 1$. This slight difference can have a relatively large impact on computation time when data sets with several thousand to hundreds of thousands of lines are used.
Alternatively, we can view both coordinates as zero-based, where the end position is non-inclusive. In other words, the zero-based end position denotes the index of the first position after the feature. For the example above, the zero-based end position of 1000 marks the first position after the feature including positions 0 through 999.
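The conversion and length rules described in this section can be written as a few small Python functions. This is only a sketch, and the function names are hypothetical. It treats a BED interval as zero-based with a non-inclusive end, and a "human" interval as one-based with an inclusive end.

```python
def bed_length(start, end):
    """Length of a BED interval: zero-based start, non-inclusive end."""
    return end - start


def one_based_length(start, end):
    """Length of a one-based interval whose end position is included."""
    return end - start + 1


def bed_to_one_based(start, end):
    """Convert BED coordinates to one-based, inclusive coordinates."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert one-based, inclusive coordinates to BED coordinates."""
    return start - 1, end
```

With these definitions, the thousand-base interval `chr7 0 1000` converts to positions 1 through 1000, and both length functions return 1000 for their respective representations.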
Examples
Here is a minimal example:
chr7 127471196 127472363
chr7 127472363 127473530
chr7 127473530 127474697
Here is a typical example with nine columns from the UCSC Genome Browser. The first three lines are settings for the UCSC Genome Browser and are unrelated to the data specified in BED format:
browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
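A data line from the nine-column example can be mapped onto named fields as in the Python sketch below. The field names follow the column names used in the UCSC Genome Browser description (name, score, strand, thickStart, thickEnd, itemRgb). They are not defined in the text above, so they are an assumption here, as are the `Bed9` class and the `parse_bed9` function. The sketch also assumes that column 9 always holds three comma-separated values.

```python
from typing import NamedTuple, Tuple


class Bed9(NamedTuple):
    """One nine-column BED record (field names assumed from UCSC usage)."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: Tuple[int, int, int]


def parse_bed9(line):
    """Parse a nine-column BED data line into a Bed9 record."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    chrom, start, end, name, score, strand, thick_start, thick_end, rgb = fields
    # Assumption: the colour is written as "R,G,B", as in the example rows.
    red, green, blue = (int(value) for value in rgb.split(","))
    return Bed9(chrom, int(start), int(end), name, int(score), strand,
                int(thick_start), int(thick_end), (red, green, blue))
```

In the example, rows on the "+" strand carry the colour 255,0,0 and rows on the "-" strand carry 0,0,255, which is what the `item_rgb` tuple would hold after parsing.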
File extension
There is currently no standard file extension for BED files, but the ".bed" extension is the most frequently used. The number of columns is sometimes noted in the file extension, for example: ".bed3", ".bed4", ".bed6", ".bed12".
Usage
The use of BED files has spread rapidly with the emergence of new sequencing techniques and the manipulation of larger and larger sequence files. Comparing genomic sequences, or even entire genomes, by comparing the sequences themselves can quickly require significant computational resources and become time-consuming. Handling BED files makes this work more efficient by using coordinates to extract sequences of interest from sequencing sets or to directly compare and manipulate two sets of coordinates.
To perform these tasks, various programs can be used to manipulate BED files, including but not limited to the following:
* Genome browsers: allow the visualization and extraction of sequences from currently sequenced mammalian genomes using BED files (e.g. the Manage Custom Tracks function in the UCSC Genome Browser).
* Galaxy: web-based platform.
* Command-line tools:
** BEDTools: program allowing the manipulation of coordinate sets and the extraction of sequences from a BED file.
** BEDOPS: a suite of tools for fast boolean operations on BED files.
** BedTk: a faster alternative to BEDTools for a limited and specialized subset of operations.
** covtobed: a tool to convert a BAM file into a BED coverage track.
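To illustrate what directly comparing two sets of coordinates involves, here is a minimal Python sketch that reports the regions shared by two lists of BED intervals. The functions `overlap` and `intersect` are hypothetical and do not reproduce how any of the tools above are implemented. The all-against-all loop is only practical for small inputs, and dedicated tools are the better choice for real data sets.

```python
def overlap(a, b):
    """Return the shared (chrom, start, end) of two BED intervals, or None."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return None
    start, end = max(start_a, start_b), min(end_a, end_b)
    # The end is non-inclusive, so intervals that merely touch do not overlap.
    return (chrom_a, start, end) if start < end else None


def intersect(set_a, set_b):
    """Naive all-against-all comparison of two lists of BED intervals."""
    hits = []
    for a in set_a:
        for b in set_b:
            shared = overlap(a, b)
            if shared is not None:
                hits.append(shared)
    return hits
```

Only coordinates are compared, and no nucleotide sequence is read at any point, which is the efficiency argument made in this section.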
.genome Files
BEDTools also uses .genome files to determine chromosome sizes and to ensure that padding operations do not extend past chromosome boundaries. A genome file is a two-column, tab-separated file with a one-line header, formatted as follows:
chrom size
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
...
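The role of a genome file in padding can be sketched in Python as follows. The functions `read_genome_file` and `pad_interval` are hypothetical and are not part of BEDTools. The sketch assumes the one-line header described above is always present, and it skips any row that does not have exactly two columns.

```python
def read_genome_file(lines):
    """Read chromosome sizes from the lines of a .genome file."""
    sizes = {}
    rows = iter(lines)
    next(rows, None)  # skip the one-line header ("chrom size")
    for line in rows:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or incomplete rows
        sizes[fields[0]] = int(fields[1])
    return sizes


def pad_interval(chrom, start, end, padding, sizes):
    """Extend a BED interval on both sides, clamped to the chromosome."""
    padded_start = max(0, start - padding)
    padded_end = min(sizes[chrom], end + padding)
    return chrom, padded_start, padded_end
```

Clamping the start at 0 and the end at the chromosome size keeps a padded interval inside the chromosome, since BED end coordinates are non-inclusive and may equal the chromosome length.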