SAMtools

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large (tens of gigabytes is common), so compression is used to save space.

SAM files are human-readable text files, and BAM files are their binary equivalent, whilst CRAM files are a restructured, column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file without having to uncompress the whole file. Additionally, since the SAM/BAM format is somewhat complex (containing reads, references, alignments, quality information, and user-specified annotations), SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

As third-party projects were trying to use code from SAMtools despite it not being designed to be embedded in that way, the decision was taken in August 2014 to split the SAMtools package into a stand-alone software library with a well-defined API (HTSlib), a project for variant calling and manipulation of variant data (BCFtools), and the stand-alone SAMtools package for working with sequence alignment data.
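Because SAM is tab-delimited text, its records can be inspected with ordinary Unix tools. The following is a minimal sketch rather than anything taken from the SAMtools documentation: the file name ''example.sam'' and the two alignment records are invented for illustration, and the column meanings (read name in column 1, FLAG in 2, reference name in 3, position in 4, CIGAR in 6) follow the SAM specification.

```shell
# Build a tiny, made-up SAM file: one header line plus two alignment records.
# printf is used so that the mandatory tab separators are explicit.
sam_file="example.sam"   # hypothetical file name
{
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t16\tchr1\t12\t60\t4M\t*\t0\t0\tTTGA\tIIII\n'
} > "$sam_file"

# Header lines start with '@'; every other line is one alignment.
# Print read name, reference name, 1-based position and CIGAR string.
summary=$(awk -F '\t' '!/^@/ { print $1, $3, $4, $6 }' "$sam_file")
printf '%s\n' "$summary"
```

A BAM file holds the same records in compressed binary form, so it cannot be read this way without first converting it to SAM text.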


Usage and commands

Like many Unix commands, SAMtools commands follow a stream model, where data runs through each command as if carried on a conveyor belt. This allows combining multiple commands into a data processing pipeline. Although the final output can be very complex, only a limited number of simple commands are needed to produce it. If not specified, the standard streams (stdin, stdout, and stderr) are assumed. Data sent to stdout are printed to the screen by default but are easily redirected to a file using the normal Unix redirectors (> and >>), or to another command via a pipe (|).
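The stream model can be sketched with a small filter that reads SAM text on stdin and writes to stdout, so it can sit at the end of a pipeline. The helper name count_reads_per_reference is hypothetical and uses only standard Unix tools; the demonstration records are made up and truncated to their first four columns.

```shell
# count_reads_per_reference: read SAM text on stdin and write one
# "count reference" line per reference name to stdout.
# Steps: drop '@' header lines, keep column 3 (reference name),
# sort, count duplicates, then normalise the spacing of `uniq -c`.
count_reads_per_reference() {
  grep -v '^@' |
    cut -f 3 |
    sort |
    uniq -c |
    awk '{ print $1, $2 }'
}

# Typical use at the end of a pipeline (needs samtools and a real BAM file):
#   samtools view -h sample.bam | count_reads_per_reference > counts.txt

# Self-contained demonstration with three made-up, truncated records:
counts=$(printf 'r1\t0\tchr1\t5\nr2\t0\tchr2\t9\nr3\t0\tchr1\t7\n' |
  count_reads_per_reference)
printf '%s\n' "$counts"
```

Because the function only touches stdin and stdout, its result can be redirected with > or piped onward just like the output of any SAMtools command.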


SAMtools commands

SAMtools provides the following commands, each invoked as "samtools ''command''".
; view : The view command filters SAM or BAM formatted data. Using options and arguments, it determines which data to select (possibly all of it) and passes only that data through. Input is usually a SAM or BAM file specified as an argument, but can be SAM or BAM data piped from any other command. Possible uses include extracting a subset of data into a new file, converting between BAM and SAM formats, and simply looking at the raw file contents. The order of extracted reads is preserved.
; sort : The sort command orders the reads in a BAM file by their position in the reference, as determined by their alignment. The sort key is the reference element together with the coordinate in that element to which the first matched base of the read aligns. In current (1.x) releases the sorted output is written to stdout unless an output file is named with the -o option; in the legacy 0.1.x releases it was written to a new file by default and -o directed it to stdout. As sorting is memory intensive and BAM files can be large, the -m option limits the amount of memory used; when the limit is reached, sorted chunks are written to multiple files, which are then merged to produce a complete sorted BAM file.
; index : The index command creates a new index file that allows fast look-up of data in a (sorted) SAM or BAM file. Like an index on a database, the generated index file allows programs that can read it to work more efficiently with the data in the associated files.
; tview : The tview command starts an interactive ASCII-based viewer that can be used to visualize how reads are aligned to specified small regions of the reference genome. Compared to a graphics-based viewer like IGV, it has few features. Within the view, it is possible to jump to different positions along reference elements (using 'g') and display help information ('?').
; mpileup : The mpileup command produces a pileup format (or BCF) file giving, for each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file(s). This can be used for SNP calling, for example.
; flagstat : The flagstat command prints summary counts of the alignments in a file, grouped by categories encoded in the FLAG field (for example mapped, paired and duplicate reads).
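The coordinate sort key described for the sort command (reference element, then leftmost aligned position) can be illustrated on SAM text with the standard sort utility. This is only a rough stand-in under stated assumptions: the helper name sort_sam_by_coordinate is hypothetical, references are ordered alphabetically here rather than in the order given by the file header, and the demonstration records are made up and truncated to four columns.

```shell
# sort_sam_by_coordinate: read SAM text on stdin and write it to stdout with
# header lines first, then alignments ordered by reference name (column 3)
# and numerically by leftmost position (column 4).
sort_sam_by_coordinate() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # buffer stdin so it can be read twice
  grep '^@' "$tmp"                   # header lines, unchanged
  grep -v '^@' "$tmp" |
    sort -t "$(printf '\t')" -k 3,3 -k 4,4n
  rm -f "$tmp"
}

# Demonstration: three made-up records given out of order.
sorted=$(printf 'r1\t0\tchr2\t5\nr2\t0\tchr1\t30\nr3\t0\tchr1\t7\n' |
  sort_sam_by_coordinate | cut -f 1)
printf '%s\n' "$sorted"
```

The read at chr1:7 comes first, then chr1:30, then the chr2 read, which is the ordering an index needs in order to look up regions quickly.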


Examples

; view
: samtools view ''sample.bam'' > ''sample.sam''
Convert a BAM file into a SAM file.
: samtools view -bS ''sample.sam'' > ''sample.bam''
Convert a SAM file into a BAM file. The -b option compresses or leaves compressed input data.
: samtools view ''sample_sorted.bam'' "chr1:10-13"
Extract all the reads aligned to the specified range, that is, those aligned to the reference element named ''chr1'' that cover its 10th, 11th, 12th or 13th base. The results are printed to stdout in SAM format, without the header. An index of the input file, as created by ''samtools index'', is required for extracting reads according to their mapping position in the reference genome.
: samtools view -h -b ''sample_sorted.bam'' "chr1:10-13" > ''tiny_sorted.bam''
Extract the same reads as above, but instead of displaying them, write them to a new BAM file, ''tiny_sorted.bam''. The -b option makes the output compressed and the -h option causes the SAM headers to be output as well. These headers include a description of the reference that the reads in ''sample_sorted.bam'' were aligned to and will be needed if the ''tiny_sorted.bam'' file is to be used with some of the more advanced SAMtools commands. The order of extracted reads is preserved.
; tview
: samtools tview ''sample_sorted.bam''
Start an interactive viewer to visualize a small region of the reference, the reads aligned to it, and mismatches. Within the view, jump to a new location by typing g followed by a location, given as a reference element name, a colon and a position. If the reference element name and following colon are replaced with =, the current reference element is used; for example, after a goto to a position on ''chr1'', entering = followed by a position 200 higher moves the viewer 200 base pairs down ''chr1''. Typing ? brings up help information for scroll movement, colors, views, and so on.
: samtools tview -p chrM:1 ''sample_chrM.bam'' ''UCSC_hg38.fa''
Set start position and compare.
: samtools tview -d T -p chrY:10,000,000 ''sample_chrY.bam'' ''UCSC_hg38.fa'' >> ''save.txt''
: samtools tview -d H -p chrY:10,000,000 ''sample_chrY.bam'' ''UCSC_hg38.fa'' >> ''save.html''
Save the screen as .txt or .html.
; sort
: samtools sort -o ''sorted_out'' ''unsorted_in.bam''
Read the specified ''unsorted_in.bam'' as input, sort it by aligned read position, and write it out to ''sorted_out''. The output type can be SAM, BAM, or CRAM, and is determined automatically from the file extension of ''sorted_out''.
: samtools sort -m 5000000 ''unsorted_in.bam'' ''sorted_out''
Read the specified ''unsorted_in.bam'' as input, sort it in blocks of up to 5,000,000 bytes (about 5 MB, since a value of -m without a suffix is a byte count) and write output to a series of BAM files named ''sorted_out.0000.bam'', ''sorted_out.0001.bam'', etc., where all reads in file 0 come before any read in file 1, and so on. This is the older invocation style, in which the output name prefix is given as the second argument.
; index
: samtools index ''sorted.bam''
Create an index file, ''sorted.bam.bai'', for the ''sorted.bam'' file.
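The individual examples above can be chained into one conversion, sorting, indexing and extraction workflow. The sketch below is an assumption-laden illustration, not an official recipe: the helper names run and sam_to_region_bam and the DRY_RUN variable are hypothetical, intermediate file names are derived from the input name, and the -o option of view and sort is used in place of shell redirection. With DRY_RUN=1 the commands are only printed, so the demonstration does not need samtools to be installed.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sam_to_region_bam SAM REGION OUT
# Convert SAM to BAM, coordinate-sort it, index the sorted file, then write
# the reads overlapping REGION (with header) to OUT. Stops at the first failure.
sam_to_region_bam() {
  in_sam=$1; region=$2; out_bam=$3
  base=${in_sam%.sam}
  run samtools view -b -o "$base.bam" "$in_sam" &&
    run samtools sort -o "${base}_sorted.bam" "$base.bam" &&
    run samtools index "${base}_sorted.bam" &&
    run samtools view -h -b -o "$out_bam" "${base}_sorted.bam" "$region"
}

# Dry run: print the four commands without executing them.
plan=$(DRY_RUN=1; sam_to_region_bam sample.sam chr1:10-13 tiny_sorted.bam)
printf '%s\n' "$plan"
```

Calling sam_to_region_bam without DRY_RUN set would run the same four commands for real, which requires samtools on the PATH and an existing ''sample.sam''.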


See also

* DNA sequencing
* Pileup format



External links


* Home page for the SAMtools project
* Wiki page at SeqAnswers for the SAMtools software (stub as of 2012-02-26; broken link)
* Mathematical notes on SAMtools algorithms from its primary author
* Short, somewhat specialized tutorial on SAMtools from EMBL (broken link)