SAM (file Format)
   HOME
*





SAM (file Format)
Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker ''et al''. It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome S ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Heng Li
Heng Li is a Chinese bioinformatics scientist. He is an associate professor at the department of Biomedical Informatics of Harvard Medical School and the department of Data Science of Dana-Farber Cancer Institute. He was previously a research scientist working at the Broad Institute in Cambridge, Massachusetts with David Reich and David Altshuler. Li's work has made several important contributions in the field of next generation sequencing. Education Li majored in physics at Nanjing University during 1997–2001. He received his PhD from the Institute of Theoretical Physics at the Chinese Academy of Sciences in 2006. His thesis, titled "Constructing the TreeFam database", was supervised by Wei-Mou Zheng. Research Li was involved in a number of projects while working at the Beijing Genomics Institute from 2002 to 2006. These included studying rice finishing, silkworm sequencing, and genetic variation in chickens. From 2006 to 2009, Li worked on a postdoctoral research fellowship ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Genome Analysis Toolkit (GATK)
In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as regulatory sequences (see non-coding DNA), and often a substantial fraction of 'junk' DNA with no evident function. Almost all eukaryotes have mitochondria and a small mitochondrial genome. Algae and plants also contain chloroplasts with a chloroplast genome. The study of the genome is called genomics. The genomes of many organisms have been sequenced and various regions have been annotated. The International Human Genome Project reported the sequence of the genome for ''Homo sapiens'' in 200The Human Genome Project although the initial "finished" sequence was missing 8% of the genome consisting mostly of repetitive sequences. With advancements in technology that could handle sequencing ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


GVF Format
GVF may refer to: * Greta Van Fleet, an American rock band * Global Village Foundation, an American charity * Golin language Golin (also Gollum, Gumine) is a Papuan language of Papua New Guinea. Phonology Vowels Diphthongs that occur are . The consonants can also be syllabic. Consonant are treated as single consonants by Bunn & Bunn (1970),* but as combinatio ..., native to Papua New Guinea * Grapevine virus F, a plant virus species in the genus Vitivirus * Gradient vector flow, a computer vision method {{disambiguation ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


FASTA Format
In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near universal standard in the field of bioinformatics. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like the R programming language, Python, Ruby, Haskell, and Perl. Original format & overview The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me—where VN is the Version Number). In the original format, a sequence was represented as a series of lines, each of whic ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  




Unique Molecular Identifier
Unique molecular identifiers (UMIs), or molecular barcodes (MBC) are short sequences or molecular "tags" added to DNA fragments in some next generation sequencing library preparation protocols to identify the input DNA molecule. These tags are added before PCR amplification, and can be used to reduce errors and quantitative bias introduced by the amplification. Applications include variant calling in ctDNA, gene expression in single-cell RNA-seq (scRNA-seq) and haplotyping via linked reads. See also * Batch effect * Multiplex (assay) In the biological sciences, a multiplex assay is a type of immunoassay that uses magnetic beads to simultaneously measure multiple analytes in a single experiment. A multiplex assay is a derivative of an ELISA using beads for binding the capture a ... References DNA sequencing {{genetics-stub ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Phred Quality Score
A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred (software), Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences. Definition Phred quality scores Q are logarithmically related to the base-calling error probabilities P and defined as Q = -10 \ \log_ P. This relation can be also be written as P = 10^. For example, if Phred assigns a quali ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

FASTQ Format
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA formatted sequence and its quality data, but has recently become the ''de facto'' standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer. Format A FASTQ file has four line-separated fields per sequence: * Field 1 begins with a '@' character and is followed by a sequence identifier and an ''optional'' description (like a FASTA title line). * Field 2 is the raw sequence letters. * Field 3 begins with a '+' character and is ''optionally'' followed by the same sequence identifier (and any description) again. * Field 4 encodes the quality values for the sequence in Field 2, and must contain the same num ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

CIGAR String
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data. Interpretation If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a par ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


SAMtools
SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large (10s of Gigabytes is common), so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Binary Alignment Map
Binary Alignment Map (BAM) is the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of the Sequence Alignment Map-files. BAM is the compressed binary representation of SAM (Sequence Alignment Map), a compact and index-able representation of nucleotide sequence alignments. The goal of indexing is to retrieve alignments that overlap a specific location quickly without having to go through all of them. Before indexing, BAM must be sorted by reference ID and then leftmost coordinate. BAM is in compressed BGZF format. The structure of BAM files include a header section and an alignment section: * Header—The sample name, sample length, and alignment method are all included in this section. The alignments section contains alignments that are linked to specific information in the header section. * Alignments—The read name, read sequence, read quality, alignment information, and custom tags are all included in this file. The chr ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Wellcome Sanger Institute
The Wellcome Sanger Institute, previously known as The Sanger Centre and Wellcome Trust Sanger Institute, is a non-profit British genomics and genetics research institute, primarily funded by the Wellcome Trust. It is located on the Wellcome Genome Campus by the village of Hinxton, outside Cambridge. It shares this location with the European Bioinformatics Institute. It was established in 1992 and named after double Nobel Laureate Frederick Sanger. It was conceived as a large scale DNA sequencing centre to participate in the Human Genome Project, and went on to make the largest single contribution to the gold standard sequence of the human genome. From its inception the institute established and has maintained a policy of data sharing, and does much of its research in collaboration. Since 2000, the institute expanded its mission to understand "the role of genetics in health and disease". The institute now employs around 900 people and engages in five main areas of research: Canc ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Broad Institute
The Eli and Edythe L. Broad Institute of MIT and Harvard (IPA: , pronunciation respelling: ), often referred to as the Broad Institute, is a biomedical and genomic research center located in Cambridge, Massachusetts, Cambridge, Massachusetts, United States. The institute is independently governed and supported as a 501(c)(3) nonprofit research organization under the name Broad Institute Inc., and it partners with the Massachusetts Institute of Technology, Harvard University, and the five Harvard teaching hospitals. History The Broad Institute evolved from a decade of research collaborations among MIT and Harvard scientists. One cornerstone was the Center for Genome Research of Whitehead Institute at MIT. Founded in 1982, the Whitehead became a major center for genomics and the Human Genome Project. As early as 1995, scientists at the Whitehead started pilot projects in genomic medicine, forming an unofficial collaborative network among young scientists interested in genomic a ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]