Pileup Format

Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. This format facilitates visual display of SNP/indel calling and alignment. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, and became widely known through its implementation within the SAMtools software suite.


Format


Example


The columns

Each line consists of 5 (or optionally 6) tab-separated columns:
1. Sequence identifier
2. Position in sequence (starting from 1)
3. Reference nucleotide at that position
4. Number of aligned reads covering that position (depth of coverage)
5. Bases at that position from aligned reads
6. Phred quality of those bases, represented in ASCII with an offset of 33 (optional)
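
The column layout above can be sketched in Python as follows. This is a minimal illustration, not a reference parser: the names `PileupRecord` and `parse_pileup_line` are hypothetical, and the sample line in the usage note is invented for the example.

```python
from typing import NamedTuple, Optional


class PileupRecord(NamedTuple):
    """One pileup line, split into the columns described above."""
    seq_id: str           # column 1: sequence identifier
    pos: int              # column 2: 1-based position
    ref_base: str         # column 3: reference nucleotide
    depth: int            # column 4: number of covering reads
    bases: str            # column 5: bases string
    quals: Optional[str]  # column 6: base qualities, if present


def parse_pileup_line(line: str) -> PileupRecord:
    """Split a tab-separated pileup line into a PileupRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    seq_id, pos, ref_base, depth, bases = fields[:5]
    quals = fields[5] if len(fields) == 6 else None
    return PileupRecord(seq_id, int(pos), ref_base, int(depth), bases, quals)
```

For instance, `parse_pileup_line("seq1\t100\tA\t3\t.,G\tII5")` yields a record at position 100 with depth 3, and a five-column line yields the same structure with `quals` set to `None`.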


Column 5: The bases string

* . (dot) means a base that matched the reference on the forward strand
* , (comma) means a base that matched the reference on the reverse strand
* < / > (less-/greater-than sign) denotes a reference skip. This occurs, for example, if a base in the reference genome is intronic and a read maps to two flanking exons. If quality scores are given in a sixth column, they refer to the quality of the read and not the specific base.
* AGTCN (upper case) denotes a base that did not match the reference on the forward strand
* agtcn (lower case) denotes a base that did not match the reference on the reverse strand
* A sequence matching the regular expression \+[0-9]+[ACGTNacgtn]+ denotes an insertion of one or more bases starting from the next position. For example, +2AG means insertion of AG in the forward strand
* A sequence matching the regular expression -[0-9]+[ACGTNacgtn]+ denotes a deletion of one or more bases starting from the next position. For example, -2ct means deletion of CT in the reverse strand
* ^ (caret) marks the start of a read segment, and the ASCII value of the character following the caret minus 33 gives the mapping quality
* $ (dollar) marks the end of a read segment
* * (asterisk) is a placeholder for a deleted base in a multiple-basepair deletion that was mentioned in a previous line by the -[0-9]+[ACGTNacgtn]+ notation
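
The symbols listed above can be read off the bases string with a small left-to-right scanner. The sketch below is one possible approach in Python; `tokenize_bases` and its token labels are hypothetical names chosen for illustration. It assumes that the number after `+` or `-` gives the exact length of the inserted or deleted sequence, so that any base letters beyond that length are treated as separate base calls.

```python
import re
from typing import List, Tuple

# Sign and length prefix of an insertion or deletion, e.g. "+2" or "-12".
_INDEL_PREFIX = re.compile(r"([+-])([0-9]+)")


def tokenize_bases(bases: str) -> List[Tuple[str, str]]:
    """Split a column-5 string into (kind, value) tokens."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # The character after the caret encodes mapping quality (ASCII - 33).
            if i + 1 >= len(bases):
                raise ValueError("'^' at end of string")
            tokens.append(("read_start", str(ord(bases[i + 1]) - 33)))
            i += 2
        elif ch == "$":
            tokens.append(("read_end", ""))
            i += 1
        elif ch in "+-":
            m = _INDEL_PREFIX.match(bases, i)
            if m is None:
                raise ValueError(f"malformed indel at offset {i}")
            n = int(m.group(2))
            seq = bases[m.end():m.end() + n]
            if len(seq) != n:
                raise ValueError(f"truncated indel at offset {i}")
            tokens.append(("insertion" if ch == "+" else "deletion", seq))
            i = m.end() + n
        elif ch in ".,":
            tokens.append(("match", ch))
            i += 1
        elif ch in "<>":
            tokens.append(("ref_skip", ch))
            i += 1
        elif ch == "*":
            tokens.append(("deleted", ch))
            i += 1
        elif ch in "ACGTNacgtn":
            tokens.append(("mismatch", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

Consuming exactly as many letters as the length prefix states keeps a string such as `+2AGt` unambiguous: it is read as an insertion of AG followed by a reverse-strand mismatch `t`, whereas a greedy regular-expression match would absorb the `t` into the insertion.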


Column 6: The base quality string

This is an optional column. If present, the ASCII value of each character minus 33 gives the Phred quality of the corresponding base in column 5. This is similar to quality encoding in the FASTQ format.
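
As a small illustration of this encoding, the Python sketch below decodes a quality string into integer Phred scores. The function names are hypothetical. The conversion to an error probability uses the standard Phred definition, Q = -10 log10(P), which is not stated in the text above.

```python
from typing import List


def decode_phred(quals: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its Phred score (ASCII value - offset)."""
    return [ord(c) - offset for c in quals]


def phred_to_error_prob(q: int) -> float:
    """Error probability implied by a Phred score, using Q = -10 * log10(P)."""
    return 10 ** (-q / 10)
```

For example, `decode_phred("I5!")` gives `[40, 20, 0]`, since the characters `I`, `5` and `!` have ASCII values 73, 53 and 33.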


File extension

There is no standard file extension for a Pileup file, but .msf (multiple sequence file), .pup and .pileup are used.


See also

* Variant Call Format
* FASTQ format
* List of file formats for molecular biology


External links


* SAMtools pileup description
* bioruby-pileup_iterator (A Ruby pileup parser)