N50, L50, And Related Statistics

In computational biology, N50 and L50 are statistics of a set of contig or scaffold lengths. The ''N50'' is similar to a mean or median of lengths, but gives greater weight to the longer contigs. It is widely used in genome assembly, especially in reference to contig lengths within a draft assembly. There are also the related U50, UL50, UG50, UG50%, N90, NG50, and D50 statistics.

For viral and microbial datasets, a newer metric called U50 has been proposed to provide a better assessment of assembly output. The ''U50'' identifies unique, target-specific contigs by using a reference genome as a baseline, aiming to circumvent some limitations inherent to the ''N50'' metric. Because it analyzes only the unique, non-overlapping contigs, the ''U50'' allows a more accurate measure of assembly performance. Most viral and microbial sequencing runs have high background noise (i.e., host and other non-target sequences), which skews the ''N50'' value; the ''U50'' corrects for this.


Definition


N50

The N50 statistic describes assembly quality in terms of contiguity. Given a set of contigs, the ''N50'' is defined as the sequence length of the shortest contig at 50% of the total assembly length. It can be thought of as the point of half of the mass of the distribution; the number of bases from all contigs longer than the ''N50'' will be close to the number of bases from all contigs shorter than the ''N50''.

For example, consider 9 contigs with the lengths 2, 3, 4, 5, 6, 7, 8, 9, and 10; their sum is 54, half of the sum is 27, and the size of the genome also happens to be 54. Adding the contigs from longest to shortest, 10 + 9 + 8 = 27 reaches 50% of the assembly. Thus the N50 is 8, which is the size of the contig which, along with the larger contigs, contains half of the sequence of the genome.

Note: When comparing N50 values from different assemblies, the assemblies must be the same size for the comparison to be meaningful. N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
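The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `n50` is an assumption made for this example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating bases until
    # at least half of the total assembly length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contigs))  # 8, because 10 + 9 + 8 = 27 = 54 / 2
```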


L50

Given a set of contigs, each with its own length, the ''L50'' is defined as the smallest number of contigs whose lengths sum to at least half of the genome size. In the example above, the L50 is 3.
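A short sketch of the L50 count follows. It assumes, as in the nine-contig example, that the genome size equals the sum of the contig lengths; the function name `l50` is hypothetical.

```python
def l50(lengths):
    """Return the smallest number of contigs covering half the total length."""
    half_total = sum(lengths) / 2
    running = 0
    # Count contigs from the longest down until half the total is reached.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half_total:
            return count
    raise ValueError("lengths must not be empty")


print(l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 3 (the contigs 10, 9 and 8)
```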


N90

The N90 statistic is less than or equal to the ''N50'' statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs.
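N50 and N90 differ only in the fraction of the assembly that must be covered, so one possible sketch is a single function with the percentage as a parameter. The name `nx` and its `percent` argument are assumptions for illustration; integer arithmetic is used to avoid floating-point rounding at the threshold.

```python
def nx(lengths, percent):
    """Return the Nx statistic, e.g. percent=50 for N50, percent=90 for N90."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Equivalent to running >= total * percent / 100, without floats.
        if running * 100 >= percent * total:
            return length


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nx(contigs, 50))  # 8
print(nx(contigs, 90))  # 4, since 10+9+8+7+6+5+4 = 49 >= 48.6
```

For the same set of contigs, `nx(contigs, 90)` can never exceed `nx(contigs, 50)`, because covering 90% of the assembly requires descending at least as far down the sorted list.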


NG50

Note that ''N50'' is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from assemblies of significantly different lengths are usually not informative, even if for the same genome. To address this, the authors of the Assemblathon competition came up with a new measure called ''NG50''. The NG50 statistic is the same as ''N50'' except that it is 50% of the known or estimated genome size that must be of the NG50 length or longer. This allows for meaningful comparisons between different assemblies. In the typical case that the assembly size is not more than the genome size, the NG50 statistic will not be more than the N50 statistic.
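The only change from N50 is the threshold: half of the genome size instead of half of the assembly size. The sketch below is an illustration under that reading; the function name `ng50`, the choice to return `None` when the assembly covers less than half of the genome, and the genome sizes of 70 and 200 are all assumptions made for this example.

```python
def ng50(lengths, genome_size):
    """Return the NG50, or None if the contigs cover < 50% of the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare against half of the genome size, not the assembly size.
        if 2 * running >= genome_size:
            return length
    return None


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # assembly size 54
print(ng50(contigs, 54))   # 8, identical to the N50 when sizes match
print(ng50(contigs, 70))   # 6, lower than the N50 for a larger genome
print(ng50(contigs, 200))  # None, the assembly is too small
```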


D50

The D50 statistic (also termed D50 test) is similar to the ''N50'' statistic in definition, though it is generally not used to describe genome assemblies. The ''D50'' statistic is the lowest value ''d'' for which the sum of the ''d'' largest lengths is at least 50% of the sum of all of the lengths.


U50

''U50'' is the length of the smallest contig such that 50% of the sum of all unique, target-specific contigs is contained in contigs of size U50 or larger.


UL50

''UL50'' is the number of contigs whose summed length yields the U50, analogous to the way the L50 relates to the N50.


UG50

''UG50'' is the length of the smallest contig such that 50% of the reference genome is contained in unique, target-specific contigs of size UG50 or larger.


UG50%

''UG50%'' is the estimated percent coverage length of the UG50 in direct relation to the length of the reference genome. The calculation is 100 × (UG50 / length of the reference genome). The ''UG50%'', as a percentage-based metric, can be used to compare assembly results from different samples or studies.
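The four U-statistics can be sketched together once the unique, target-specific contigs are known. The example below assumes that step has already been done (identifying such contigs requires mapping to the reference genome, which is not modeled here) and starts from a plain list of their lengths. The helper name, the contig lengths, and the reference length of 1200 are invented for illustration.

```python
def length_and_count_at(lengths, target):
    """Return (contig length, contig count) at which the cumulative sum,
    taken from the longest contig down, first reaches `target` bases.
    Returns (None, None) if the target is never reached."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None


# Hypothetical lengths of unique, target-specific contigs.
unique_lengths = [400, 300, 200, 100]
reference_length = 1200  # hypothetical reference genome length

u50, ul50 = length_and_count_at(unique_lengths, sum(unique_lengths) / 2)
ug50, _ = length_and_count_at(unique_lengths, reference_length / 2)
ug50_percent = None if ug50 is None else 100 * ug50 / reference_length

print(u50, ul50)      # 300 2
print(ug50)           # 300
print(ug50_percent)   # 25.0
```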


Examples

Consider two fictional, highly simplified genome assemblies, A and B, that are derived from two different species. Assembly A contains six contigs of lengths 80 kbp, 70 kbp, 50 kbp, 40 kbp, 30 kbp, and 20 kbp. The sum size of assembly A is 290 kbp, the N50 contig length is 70 kbp because 80 + 70 is greater than 50% of 290, and the L50 contig count is 2 contigs.

The contig lengths of assembly B are the same as those of assembly A, except for the presence of two additional contigs with lengths of 10 kbp and 5 kbp. The size of assembly B is 305 kbp, the N50 contig length drops to 50 kbp because 80 + 70 + 50 is greater than 50% of 305, and the L50 contig count is 3 contigs. This example illustrates that one can sometimes increase the N50 length simply by removing some of the shortest contigs or scaffolds from an assembly.

If the estimated or known size of the genome from the fictional species A is 500 kbp then the ''NG50'' contig length is 30 kbp because 80 + 70 + 50 + 40 + 30 is greater than 50% of 500. In contrast, if the estimated or known size of the genome from species B is 350 kbp then it has an NG50 contig length of 50 kbp because 80 + 70 + 50 is greater than 50% of 350.
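The figures for assemblies A and B can be checked with a short script. The helper `stat_at` is a hypothetical name; it returns both the contig length and the contig count at a given threshold, so the N50, L50, and NG50 all come from the same loop.

```python
def stat_at(lengths, target):
    """Return (length, count) where the cumulative sum of contigs, longest
    first, first reaches `target`; (None, None) if it never does."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None


assembly_a = [80, 70, 50, 40, 30, 20]   # kbp, total 290
assembly_b = assembly_a + [10, 5]       # kbp, total 305

# N50 and L50 use half of the assembly size as the threshold.
print(stat_at(assembly_a, sum(assembly_a) / 2))  # (70, 2)
print(stat_at(assembly_b, sum(assembly_b) / 2))  # (50, 3)

# NG50 uses half of the genome size (500 kbp and 350 kbp in the text).
print(stat_at(assembly_a, 500 / 2)[0])  # 30
print(stat_at(assembly_b, 350 / 2)[0])  # 50
```

Dropping the 10 kbp and 5 kbp contigs from B reproduces assembly A and raises the N50 from 50 kbp to 70 kbp, which is the trimming effect the example describes.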


Alternate computation

''N50'' can be found mathematically for a list ''L'' of positive integers as follows:

1. Create another list ''L' '', which is identical to ''L'', except that every element ''n'' in ''L'' has been replaced with ''n'' copies of itself.
2. The median of ''L' '' is the ''N50'' of ''L''. (The 10% quantile of ''L' '' is the ''N90'' statistic.)

For example: If ''L'' = (2, 2, 2, 3, 3, 4, 8, 8), then ''L' '' consists of six 2's, six 3's, four 4's, and sixteen 8's. That is, ''L' '' has twice as many 2s as ''L''; it has three times as many 3s as ''L''; it has four times as many 4s; etc. The median of the 32-element set ''L' '' is the average of the 16th smallest element, 4, and 17th smallest element, 8, so the ''N50'' is 6. We can see that the sum of all values in the list ''L'' that are smaller than or equal to the ''N50'' of 6 is 16 = 2+2+2+3+3+4 and the sum of all values in the list ''L'' that are larger than or equal to 6 is also 16 = 8+8. For comparison with the ''N50'' of 6, note that the mean of the list ''L'' is 4 while the median is 3.

To recapitulate in a more visual way, we have:

Values of the list ''L'' = (2, 2, 2, 3, 3, 4, 8, 8)
Values of the new list ''L' '' = (2  2  2  2  2  2  3  3  3  3  3  3  4  4  4  4  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8)
Ranks of ''L' '' values =          1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
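The expansion procedure translates directly into Python with the standard `statistics` module. This is a sketch for small lists only: the expanded list has as many elements as the total number of bases, so it would be impractical for real assemblies. The function name `n50_by_expansion` is an assumption for this example.

```python
from statistics import median, median_high


def n50_by_expansion(lengths):
    """N50 as the median of a list in which each length n appears n times."""
    expanded = [n for n in lengths for _ in range(n)]
    return median(expanded)


L = [2, 2, 2, 3, 3, 4, 8, 8]
expanded = [n for n in L for _ in range(n)]

print(len(expanded))         # 32
print(n50_by_expansion(L))   # 6.0, the average of the 16th and 17th values
print(median_high(expanded)) # 8, the upper of the two middle values
```

When the expanded list has an even number of elements and the two middle values differ, as here, the plain median averages them and yields a value (6) that is not the length of any contig. Taking the upper middle value instead gives 8 for this list, which is what the cumulative definition (8 + 8 = 16, half of 32) would return.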


References


* Arachne wiki at Broad Institute
* Earl, D.; Bradnam, K.; St. John, J.; et al. (2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods". Genome Research 21 (12): 2224–2241. doi:10.1101/gr.126599.111. PMID 21926179. PMC 3227110.
* L50-vs-N50 blog post (07-Oct-2015)


See also

* Herfindahl–Hirschman Index


External links


* contig_info: a tool to estimate standard descriptive statistics from contig sequences, e.g. N(G)50, N(G)75, N(G)90, L(G)50, L(G)75, L(G)90
* auN