K-mers

In bioinformatics, ''k''-mers are substrings of length ''k'' contained within a biological sequence. They are used primarily in computational genomics and sequence analysis, in which ''k''-mers are composed of nucleotides (''i.e.'' A, T, G, and C). There, ''k''-mers are used to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term ''k''-mer refers to all of a sequence's substrings of length ''k'', such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have L − k + 1 ''k''-mers, and there are n^k total possible ''k''-mers, where n is the number of possible monomers (e.g. four in the case of DNA).


Introduction

''k''-mers are simply substrings of length ''k''. For example, all the possible ''k''-mers of a DNA sequence are shown below:

[Figure: the possible ''k''-mers of an example DNA sequence]

A method of visualizing ''k''-mers, the ''k''-mer spectrum, shows the multiplicity of each ''k''-mer in a sequence versus the number of ''k''-mers with that multiplicity. The number of modes in a ''k''-mer spectrum for a species' genome varies, with most species having a unimodal distribution. However, all mammals have a multimodal distribution. The number of modes within a ''k''-mer spectrum can vary between regions of genomes as well: humans have unimodal ''k''-mer spectra in 5' UTRs and exons but multimodal spectra in 3' UTRs and introns.
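The ''k''-mer spectrum described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequence are hypothetical, and real genomes would need a dedicated ''k''-mer counter rather than an in-memory dictionary.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count how many times each k-mer occurs in seq (its multiplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_spectrum(seq, k):
    """Map each multiplicity to the number of distinct k-mers that have it."""
    return Counter(kmer_counts(seq, k).values())


# Toy sequence chosen for illustration only.
toy = "ATGTGTGA"
counts = kmer_counts(toy, 2)      # AT:1, TG:3, GT:2, GA:1
spectrum = kmer_spectrum(toy, 2)  # two 2-mers occur once, one twice, one three times
```

Plotting multiplicity on the x-axis against the spectrum values on the y-axis gives the histogram whose modes are discussed above.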


Forces Affecting DNA ''k''-mer Frequency

The frequency of ''k''-mer usage is affected by numerous forces, working at multiple levels, which are often in conflict. ''k''-mers for higher values of ''k'' are also affected by the forces acting on lower values of ''k''. For example, if the 1-mer A does not occur in a sequence, none of the 2-mers containing A (AA, AT, AG, AC, TA, GA, and CA) will occur either, thereby linking the effects of the different forces.


''k'' = 1

When ''k'' = 1, there are four DNA ''k''-mers, ''i.e.'', A, T, G, and C. At the molecular level, there are three hydrogen bonds between G and C, whereas there are only two between A and T. GC base pairs, as a result of the extra hydrogen bond (and stronger stacking interactions), are more thermally stable than AT base pairs. Mammals and birds have a higher ratio of Gs and Cs to As and Ts (GC-content), which led to the hypothesis that thermal stability was a driving factor of GC-content variation. However, while promising, this hypothesis did not hold up under scrutiny: analysis among a variety of prokaryotes showed no evidence of GC-content correlating with temperature as the thermal adaptation hypothesis would predict. Indeed, if natural selection were the driving force behind GC-content variation, single-nucleotide changes, which are often silent, would have to alter the fitness of an organism.

Rather, current evidence suggests that GC-biased gene conversion (gBGC) is a driving factor behind variation in GC-content. gBGC is a process that occurs during recombination which replaces As and Ts with Gs and Cs. This process, though distinct from natural selection, can nevertheless act like a selective pressure, biasing the genome towards the fixation of GC replacements. gBGC can therefore be seen as an "impostor" of natural selection. As would be expected, GC-content is greater at sites experiencing greater recombination. Furthermore, organisms with higher rates of recombination exhibit higher GC-content, in keeping with the gBGC hypothesis's predicted effects.

Interestingly, gBGC does not appear to be limited to eukaryotes. Asexual organisms such as bacteria and archaea also experience recombination by means of gene conversion, a process of homologous sequence replacement resulting in multiple identical sequences throughout the genome. That recombination is able to drive up GC-content in all domains of life suggests that gBGC is universally conserved. Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined. The exact mechanism and the evolutionary advantage or disadvantage of gBGC are currently unknown.
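GC-content, the 1-mer statistic discussed above, is simple to compute. The following is a minimal sketch; the function names and the windowing parameters are hypothetical, and the sequence is assumed to contain only A, C, G, and T.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """GC-content of successive windows, to expose regional variation."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

For instance, `gc_content("AGAT")` is 0.25, and a windowed profile along a chromosome is one way to look for the regional differences in GC-content mentioned above.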


''k'' = 2

Despite the comparatively large body of literature discussing GC-content biases, relatively little has been written about dinucleotide biases. What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably. This constancy is an important insight: if dinucleotide bias were subject to pressures resulting from translation, then there would be differing patterns of dinucleotide bias in coding and noncoding regions, driven by some dinucleotides' reduced translational efficiency. Because there are not, it can be inferred that the forces modulating dinucleotide bias are independent of translation. Further evidence against translational pressures affecting dinucleotide bias is the fact that the dinucleotide biases of viruses, which rely heavily on translational efficiency, are shaped by their viral family more than by their hosts, whose translational machinery the viruses hijack.

Counter to gBGC's increasing GC-content is CG suppression, which reduces the frequency of CG 2-mers due to deamination of methylated CG dinucleotides, resulting in substitutions of CGs with TGs, thereby reducing the GC-content. This interaction highlights the interrelationship between the forces affecting ''k''-mers for varying values of ''k''.

One interesting fact about dinucleotide bias is that it can serve as a "distance" measurement between phylogenetically similar genomes. Pairs of closely related organisms have more similar dinucleotide biases than pairs of more distantly related organisms.
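The text above does not say how dinucleotide bias is quantified. One commonly used measure, assumed here purely for illustration, is the odds ratio of a dinucleotide's observed frequency to the product of its two nucleotide frequencies; a value below 1 for CG would indicate CG suppression. The distance function (mean absolute difference of the 16 ratios) is likewise an assumed choice, and all names are hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds_ratios(seq):
    """Odds ratio f(xy) / (f(x) * f(y)) for each of the 16 dinucleotides."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        # Arbitrary choice for this sketch: report 0.0 if a base is absent.
        ratios[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return ratios


def bias_distance(seq_a, seq_b):
    """Mean absolute difference between two sequences' odds ratios."""
    ra = dinucleotide_odds_ratios(seq_a)
    rb = dinucleotide_odds_ratios(seq_b)
    return sum(abs(ra[d] - rb[d]) for d in ra) / len(ra)
```

Under these assumptions, a small `bias_distance` between two genomes would correspond to the similar dinucleotide biases expected of closely related organisms.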


''k'' = 3

There are twenty natural amino acids that are used to build the proteins that DNA encodes. However, there are only four nucleotides. Therefore, there cannot be a one-to-one correspondence between nucleotides and amino acids. Similarly, there are 16 2-mers, which is also not enough to unambiguously represent every amino acid. However, there are 64 distinct 3-mers in DNA, which is enough to uniquely represent each amino acid. These non-overlapping 3-mers are called codons. While each codon only maps to one amino acid, each amino acid can be represented by multiple codons. Thus, the same amino acid sequence can have multiple DNA representations. Interestingly, the codons for an amino acid are not used in equal proportions. This is called codon-usage bias (CUB).

When ''k'' = 3, a distinction must be made between true 3-mer frequency and CUB. For example, the sequence ATGGCA has four 3-mer words within it (ATG, TGG, GGC, and GCA) while only containing two codons (ATG and GCA). However, CUB is a major driving factor of 3-mer usage bias (accounting for up to ⅓ of it, since ⅓ of the ''k''-mers in a coding region are codons) and will be the main focus of this section.

The exact cause of variation between the frequencies of various codons is not fully understood. It is known that codon preference is correlated with tRNA abundances, with codons matching more abundant tRNAs being correspondingly more frequent, and that more highly expressed proteins exhibit greater CUB. This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation.
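The distinction between overlapping 3-mers and in-frame codons can be made concrete with the ATGGCA example from the text. The helper names and the `frame` parameter are hypothetical and only serve this sketch.

```python
def three_mers(seq):
    """All overlapping 3-mers of seq."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def codons(seq, frame=0):
    """Non-overlapping 3-mers of seq, read from the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


words = three_mers("ATGGCA")   # four overlapping 3-mers
in_frame = codons("ATGGCA")    # two codons in frame 0
```

Counting the output of `codons` over a coding region gives codon usage, whereas counting the output of `three_mers` gives the true 3-mer frequency.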


''k'' = 4

Similar to the effect seen in dinucleotide bias, the tetranucleotide biases of phylogenetically similar organisms are more similar to each other than those of less closely related organisms. The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level.


Applications

The frequency of a set of ''k''-mers in a species' genome, in a genomic region, or in a class of sequences can be used as a "signature" of the underlying sequence. Comparing these frequencies is computationally easier than sequence alignment and is an important method in alignment-free sequence analysis. It can also be used as a first-stage analysis before an alignment.
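As a rough sketch of how such a signature might be compared without alignment, the code below turns each sequence into a vector of normalized ''k''-mer frequencies and takes the Euclidean distance between two vectors. The choice of Euclidean distance, the function names, and the restriction to the alphabet A, C, G, T are assumptions made for illustration, not something the text prescribes.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_signature(seq, k):
    """Normalized frequency of every possible DNA k-mer, in a fixed order."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def signature_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-mer signatures of two sequences."""
    sig_a, sig_b = kmer_signature(seq_a, k), kmer_signature(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)))
```

Each signature has 4^k entries regardless of sequence length, which is what makes the comparison cheaper than aligning the sequences themselves.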


Sequence assembly

In sequence assembly, ''k''-mers are used during the construction of De Bruijn graphs. To create a De Bruijn graph, the ''k''-mer of length L stored in each edge must overlap the string stored in another edge by L − 1 characters in order to create a vertex. Reads generated by next-generation sequencing typically have different lengths. For example, Illumina's sequencing technology produces reads of 100-mers. However, the problem with the sequencing is that only a small fraction of all the possible 100-mers present in the genome is actually generated. This is due to read errors but, more importantly, to simple coverage holes that occur during sequencing. This small fraction of the possible ''k''-mers violates the key assumption of De Bruijn graphs that every ''k''-mer read must overlap its adjoining ''k''-mer in the genome by k − 1 characters (which cannot occur when not all the possible ''k''-mers are present).

The solution to this problem is to break these ''k''-mer-sized reads into smaller ''k''-mers, such that the resulting smaller ''k''-mers represent all the possible ''k''-mers of that smaller size that are present in the genome. Furthermore, splitting the reads into smaller ''k''-mers also helps alleviate the problem of different initial read lengths.

[Figure: five reads from a genome, split into 4-mers]

In this example, the five reads do not account for all the possible 7-mers of the genome, and as such, a De Bruijn graph cannot be created. But when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a De Bruijn graph.

Beyond being used directly for sequence assembly, ''k''-mers can also be used to detect genome mis-assembly by identifying overrepresented ''k''-mers, which suggest the presence of repeated DNA sequences that have been combined. In addition, ''k''-mers are used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of metagenomics.
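The idea of splitting reads into smaller ''k''-mers and walking the resulting De Bruijn graph can be sketched as follows. This is a toy illustration under strong assumptions: the reads are error-free, come from a single strand, and yield a graph with exactly one Eulerian path. The reads, the genome, and all function names are made up for this sketch; real assemblers handle sequencing errors, reverse complements, and repeats.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Distinct k-mers found in the reads; each one is an edge of the graph."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def build_graph(kmers):
    """Connect each k-mer's (k-1)-prefix vertex to its (k-1)-suffix vertex."""
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, if such a vertex exists.
    start = next(
        (v for v in sorted(nodes) if len(graph.get(v, [])) - in_deg[v] == 1),
        min(nodes),
    )
    remaining = {v: list(ts) for v, ts in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the De Bruijn graph."""
    path = eulerian_path(build_graph(de_bruijn_edges(reads, k)))
    return path[0] + "".join(v[-1] for v in path[1:])


# Hypothetical reads from the made-up genome ATGGCGTGCA.
reads = ["ATGGCG", "GCGTGC", "GTGCA"]
genome = assemble(reads, 4)
```

With k = 6 these reads would contribute only two 6-mers that overlap by three characters rather than five, so no path through them exists; splitting into 4-mers recovers all seven 4-mers of the genome.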


Choice of ''k''-mer size

The choice of the ''k''-mer size has many different effects on the sequence assembly. These effects vary greatly between lower sized and larger sized ''k''-mers. Therefore, an understanding of the different ''k''-mer sizes must be achieved in order to choose a suitable size that balances the effects. The effects of the sizes are outlined below.


Lower ''k''-mer sizes

* A lower ''k''-mer size will decrease the number of edges stored in the graph and, as such, will help decrease the amount of space required to store the DNA sequence.
* Having smaller sizes will increase the chance that all the ''k''-mers overlap, and thus that the subsequences required to construct the De Bruijn graph are present.
* However, with smaller ''k''-mers there is also a risk of having many vertices in the graph leading into a single ''k''-mer. This makes the reconstruction of the genome more difficult, as there is a higher level of path ambiguity due to the larger number of vertices that need to be traversed.
* Information is lost as the ''k''-mers become smaller.
** ''E.g.'' The probability of observing AGTCGTAGATGCTG by chance is lower than that of observing ACGT, and as such the longer ''k''-mer holds a greater amount of information (refer to entropy (information theory) for more information).
* Smaller ''k''-mers also have the problem of not being able to resolve areas in the DNA where small microsatellites or repeats occur. This is because smaller ''k''-mers will tend to sit entirely within the repeat region, and it is therefore hard to determine the amount of repetition that has actually taken place.
** ''E.g.'' For the subsequence ATGTGTGTGTGTGTACG, the number of repetitions of TG will be lost if a ''k''-mer size less than 16 is chosen. This is because most of the ''k''-mers will sit in the repeated region and may just be discarded as repeats of the same ''k''-mer instead of recording the number of repeats.


Higher ''k''-mer sizes

* Having larger ''k''-mers will increase the number of edges in the graph, which in turn will increase the amount of memory needed to store the DNA sequence.
* By increasing the size of the ''k''-mers, the number of vertices will also decrease. This will help with the construction of the genome, as there will be fewer paths to traverse in the graph.
* Larger ''k''-mers also run a higher risk of not having outward vertices from every ''k''-mer. This is because a larger ''k''-mer is less likely to overlap with another ''k''-mer by k − 1 characters. This can lead to discontinuities in the reads and, as such, to a larger number of smaller contigs.
* Larger ''k''-mer sizes help alleviate the problem of small repeat regions. This is because the ''k''-mer will contain a balance of the repeat region and the adjoining DNA sequences (given that it is of a large enough size), which can help to resolve the amount of repetition in that particular area.
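The trade-off described in the two lists above can be explored on the repeat-containing example sequence ATGTGTGTGTGTGTACG by counting, for several values of ''k'', how many ''k''-mers the sequence yields and how many of them are distinct. Using the distinct count as a rough proxy for how much of the repeat collapses onto the same edges is an assumption of this sketch, and the function name is hypothetical.

```python
def kmer_summary(seq, k):
    """Return (total number of k-mers, number of distinct k-mers) in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(kmers), len(set(kmers))


repeat_region = "ATGTGTGTGTGTGTACG"
for k in (4, 8, 16):
    total, distinct = kmer_summary(repeat_region, k)
    print(f"k={k}: {total} k-mers, {distinct} distinct")
```

At k = 4 most positions fall inside the TG repeat and map to the same two 4-mers (TGTG and GTGT), whereas at k = 16 every ''k''-mer includes flanking sequence and is distinct.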


Genetics and Genomics

With respect to disease, dinucleotide bias has been applied to the detection of genetic islands associated with pathogenicity. Prior work has also shown that tetranucleotide biases are able to effectively detect horizontal gene transfer in both prokaryotes and eukaryotes.

Another application of ''k''-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of ''Erwinia'' with moderate success. Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA. Because GC base pairs are more thermally stable, sequences with higher GC-content exhibit a higher Tm. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as a factor in determining species boundaries as part of the phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community.

Other applications within genetics and genomics include:
* RNA isoform quantification from RNA-seq data
* Classification of human mitochondrial haplogroup
* Detection of recombination sites in genomes
* Estimation of genome size using ''k''-mer frequency vs ''k''-mer depth
* Characterization of CpG islands by flanking regions
* ''De novo'' detection of repeated sequence such as transposable element
* DNA barcoding of species
* Characterization of protein-binding sequence motifs
* Identification of mutation or polymorphism using next-generation sequencing data


Metagenomics

''k''-mer frequency and spectrum variation are heavily used in metagenomics for both analysis and binning. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism (or operational taxonomic unit), which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (''k'' = 4) frequencies. Other tools that similarly rely on ''k''-mer frequency for metagenomic binning are CompostBin (''k'' = 6), PCAHIER, PhyloPythia (5 ≤ ''k'' ≤ 6), CLARK (''k'' ≥ 20), and TACOA (2 ≤ ''k'' ≤ 6). Recent developments have also applied deep learning to metagenomic binning using ''k''-mers.

Other applications within metagenomics include:
* Recovery of reading frames from raw reads
* Estimation of species abundance in metagenomic samples
* Determination of which species are present in samples
* Identification of biomarkers for diseases from samples


Biotechnology 

Modifying ''k''-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.

With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis. In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates. Similarly, codon-pair optimization, a combination of dinucleotide and codon optimization, has also been successfully used to increase expression.

The most studied application of ''k''-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode dengue virus, the virus that causes dengue fever, such that its codon-pair bias differed more from mammalian codon-usage preference than that of the wild type. Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened pathogenicity while eliciting a strong immune response. This approach has also been used effectively to create an influenza vaccine as well as a vaccine for Marek's disease herpesvirus (MDV). Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.

Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias. By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.

GC-content, due to its effect on DNA melting point, is used to predict annealing temperature in PCR, another important biotechnology tool.
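The text does not say which formula links GC-content to melting temperature. One simple and widely quoted approximation for short oligonucleotides is the Wallace rule, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C); it is used here only as an assumed example, and the function name is hypothetical. More accurate predictions rely on nearest-neighbor thermodynamic models.

```python
def wallace_tm(primer):
    """Approximate melting temperature (deg C) of a short primer (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    # Each G or C contributes twice as much as each A or T.
    return 2 * at + 4 * gc
```

A common rule of thumb, stated here as an assumption rather than a protocol, is to choose an annealing temperature a few degrees below the estimated Tm of the primers.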


Implementation


Pseudocode

Determining the possible ''k''-mers of a read can be done by stepping through the string one position at a time and taking out each substring of length ''k''. The pseudocode to achieve this is as follows:

    procedure k-mers(string seq, integer k) is
        L ← length(seq)
        arr ← new array of L − k + 1 empty strings

        ''// iterate over the number of k-mers in seq,''
        ''// storing the nth k-mer in the output array''
        for n ← 0 to L − k + 1 exclusive do
            arr[n] ← subsequence of seq from letter n inclusive to letter n + k exclusive

        return arr
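The pseudocode above translates almost directly into Python. The following is one possible rendering, with a hypothetical function name and an added check that ''k'' is positive.

```python
def kmers(seq, k):
    """Return the L - k + 1 k-mers of seq in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # range() is empty when k > len(seq), so the result is then an empty list.
    return [seq[n:n + k] for n in range(len(seq) - k + 1)]
```

For the sequence AGAT from the introduction, `kmers("AGAT", 2)` gives AG, GA, and AT, matching the L − k + 1 = 3 2-mers expected for a sequence of length 4.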


In Bioinformatics Pipelines

Because the number of possible ''k''-mers grows exponentially with ''k'', counting ''k''-mers for large values of ''k'' (usually >10) is a computationally difficult task. While simple implementations such as the above pseudocode work for small values of ''k'', they need to be adapted for high-throughput applications or when ''k'' is large. To solve this problem, various tools have been developed:
* Jellyfish uses a multithreaded, lock-free hash table for ''k''-mer counting and has Python, Ruby, and Perl bindings
* KMC is a tool for ''k''-mer counting that uses a multidisk architecture for optimized speed
* Gerbil uses a hash table approach but with added support for GPU acceleration
* K-mer Analysis Toolkit (KAT) uses a modified version of Jellyfish to analyze ''k''-mer counts


See also

* Oligonucleotide
* Genomic signature


References

* Some of the content in this article was copied from K-mer at the PLOS wiki, which is available under a Creative Commons Attribution 2.5 Generic (CC BY 2.5) license.


External links


bioRxiv: k-mer

arXiv: k-mer