In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.
The emergence and need for the analysis of different types of data generated through biological research has given rise to the field of bioinformatics. Molecular sequence and structure data of DNA, RNA, and proteins, gene expression profiles or microarray data, and metabolic pathway data are some of the major types of data being analysed in bioinformatics. Among them, sequence data are increasing at an exponential rate due to the advent of next-generation sequencing technologies. Since the origin of bioinformatics,
sequence analysis has remained the major area of research, with a wide range of applications in database searching, genome annotation, comparative genomics, molecular phylogeny and gene prediction. The pioneering approaches for sequence analysis were based on sequence alignment, either global or local, pairwise or multiple sequence alignment. Alignment-based approaches generally give excellent results when the sequences under study are closely related and can be reliably aligned, but when the sequences are divergent, a reliable alignment cannot be obtained and hence the applications of sequence alignment are limited. Another limitation of alignment-based approaches is their computational complexity; they are time-consuming and thus limited when dealing with large-scale sequence data. The advent of next-generation sequencing technologies has resulted in the generation of voluminous sequencing data. The size of this sequence data poses challenges for alignment-based algorithms in assembly, annotation and comparative studies.
Alignment-free methods
Alignment-free methods can broadly be classified into six categories: a) methods based on ''k''-mer/word frequency, b) methods based on the length of common substrings, c) methods based on the number of (spaced) word matches, d) methods based on ''micro-alignments'', e) methods based on information theory and f) methods based on graphical representation. Alignment-free approaches have been used in sequence similarity searches,
clustering and classification of sequences,
and more recently in phylogenetics
(Figure 1).
Such molecular phylogeny analyses employing alignment-free approaches are said to be part of ''next-generation phylogenomics''.
A number of review articles provide in-depth review of alignment-free methods in sequence analysis.
The ''AFproject'' is an international collaboration to benchmark and compare software tools for alignment-free sequence comparison.
Methods based on ''k''-mer/word frequency
The popular methods based on ''k''-mer/word frequencies include feature frequency profile (FFP), composition vector (CV), return time distribution (RTD), frequency chaos game representation (FCGR), and spaced words.
Feature frequency profile (FFP)
The FFP-based method starts by calculating the count of each possible ''k''-mer (the number of possible ''k''-mers for a nucleotide sequence is 4^''k'', while that for a protein sequence is 20^''k'') in the sequences. Each ''k''-mer count in each sequence is then normalized by dividing it by the total count of all ''k''-mers in that sequence. This converts each sequence into its feature frequency profile. The pairwise distance between two sequences is then calculated using the Jensen–Shannon (JS) divergence between their respective FFPs. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc.
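The FFP pipeline described above can be sketched in Python. This is a minimal illustration for nucleotide sequences; the function names are for illustration only.

```python
import math
from itertools import product

def ffp(seq, k=2, alphabet="ACGT"):
    # Count every possible k-mer, then normalize counts to frequencies.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {m: 0 for m in kmers}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def js_divergence(p, q):
    # Jensen-Shannon divergence between two frequency profiles (in bits).
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

d = js_divergence(ffp("ACGTACGTAC", k=2), ffp("TTGCAATTGC", k=2))
```

The JS divergence is symmetric and bounded between 0 (identical profiles) and 1 bit (disjoint profiles), which makes it convenient as input for distance-based tree building.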
Composition vector (CV)
In this method, the frequency of appearance of each possible ''k''-mer in a given sequence is calculated. The characteristic next step of this method is the subtraction of the random background of these frequencies using a Markov model, to reduce the influence of random neutral mutations and highlight the role of selective evolution. The normalized frequencies are put in a fixed order to form the composition vector (CV) of a given sequence. A cosine distance function is then used to compute the pairwise distance between CVs of sequences. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc. This method can be extended, through resort to efficient pattern matching algorithms, to include in the computation of the composition vectors: (i) all ''k''-mers for any value of ''k'', (ii) all substrings of any length up to an arbitrarily set maximum ''k'' value, (iii) all maximal substrings, where a substring is maximal if extending it by any character would cause a decrease in its occurrence count.
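A minimal sketch of the composition-vector computation, assuming a (''k''−2)-order Markov background model (one common choice); helper names are illustrative:

```python
import math

def word_freqs(seq, k):
    # Relative frequencies of all length-k substrings present in seq.
    counts = {}
    n = len(seq) - k + 1
    for i in range(n):
        w = seq[i:i + k]
        counts[w] = counts.get(w, 0) + 1
    return {w: c / n for w, c in counts.items()}

def composition_vector(seq, k=3):
    # Subtract the Markov background: the expected frequency of a1..ak is
    # approximated by f(a1..a_{k-1}) * f(a2..ak) / f(a2..a_{k-1}).
    fk = word_freqs(seq, k)
    fk1 = word_freqs(seq, k - 1)
    fk2 = word_freqs(seq, k - 2)
    vec = {}
    for w, f in fk.items():
        denom = fk2.get(w[1:-1], 0)
        q = fk1.get(w[:-1], 0) * fk1.get(w[1:], 0) / denom if denom else 0
        vec[w] = (f - q) / q if q else 0.0
    return vec

def cosine_distance(u, v):
    # 1 - cosine similarity between two sparse vectors.
    keys = set(u) | set(v)
    dot = sum(u.get(x, 0.0) * v.get(x, 0.0) for x in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0
```

Published CV implementations sometimes rescale the cosine to (1 − cos)/2 so distances fall in [0, 1]; the plain 1 − cos form is used here for simplicity.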
Return time distribution (RTD)
The RTD-based method does not calculate the count of ''k''-mers in sequences; instead, it computes the time required for the reappearance of ''k''-mers. Here time refers to the number of residues between successive appearances of a particular ''k''-mer. Thus the occurrence of each ''k''-mer in a sequence is recorded in the form of an RTD, which is then summarised using two statistical parameters, the mean (''μ'') and standard deviation (''σ''). Thus each sequence is represented as a numeric vector of size 2·4^''k'' containing ''μ'' and ''σ'' of the 4^''k'' RTDs. The pairwise distance between sequences is calculated using the Euclidean distance measure. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc. A recent approach, Pattern Extraction through Entropy Retrieval (PEER), detects the ''k''-mer length directly and summarises the occurrence intervals using entropy.
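The RTD vector construction can be sketched as follows (illustrative function names; the DNA alphabet is assumed):

```python
import math
from itertools import product

def rtd_vector(seq, k=2, alphabet="ACGT"):
    # For each possible k-mer, record the gaps between its successive
    # occurrences, then summarize each gap list by mean and std deviation.
    vec = []
    for kmer in ("".join(p) for p in product(alphabet, repeat=k)):
        positions = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        if gaps:
            mu = sum(gaps) / len(gaps)
            sigma = math.sqrt(sum((g - mu) ** 2 for g in gaps) / len(gaps))
        else:
            mu = sigma = 0.0
        vec.extend([mu, sigma])
    return vec  # length 2 * 4**k

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A ''k''-mer that never recurs contributes (0, 0) here; real implementations may handle that case differently.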
Frequency chaos game representation (FCGR)
The FCGR methods evolved from the chaos game representation (CGR) technique, which provides a scale-independent representation of genomic sequences. A CGR can be divided by grid lines, where each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence. Such a representation is termed a frequency chaos game representation (FCGR), into which each sequence is converted. The pairwise distance between the FCGRs of sequences can be calculated using the Pearson distance, the Hamming distance or the Euclidean distance.
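A compact sketch of CGR/FCGR construction. The corner assignment below follows one common convention (others are possible); each cell of the 2^''k'' × 2^''k'' grid corresponds to one ''k''-mer, so the grid is a rearrangement of the ''k''-mer counts.

```python
def fcgr(seq, k=3):
    # Chaos game: each nucleotide pulls the current point halfway toward
    # its corner of the unit square; a 2^k x 2^k grid of visit counts is
    # the frequency CGR.
    corners = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x = y = 0.5
    for i, base in enumerate(seq):
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:  # start counting once k symbols have been consumed
            grid[min(int(y * size), size - 1)][min(int(x * size), size - 1)] += 1
    return grid
```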
Spaced-word frequencies
While most alignment-free algorithms compare the word composition of sequences, Spaced Words uses a pattern of match and don't-care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't-care positions are ignored. Instead of comparing the frequencies of contiguous words in the input sequences, this approach compares the frequencies of the spaced words according to the pre-defined pattern. The pre-defined pattern can be selected by analysing the variance of the number of matches, the probability of the first occurrence under several models, or the Pearson correlation coefficient between the expected word frequency and the true alignment distance.
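Extracting spaced-word frequencies can be sketched as follows, using `1` for match positions and `0` for don't-care positions in the pattern (an illustrative encoding):

```python
from collections import Counter

def spaced_words(seq, pattern="1101"):
    # A spaced word keeps only the characters at the '1' (match) positions
    # of the pattern; characters at '0' (don't-care) positions are ignored.
    L = len(pattern)
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    words = (
        "".join(seq[i + j] for j in match_pos)
        for i in range(len(seq) - L + 1)
    )
    return Counter(words)

counts = spaced_words("ACGTACGT", "1101")
```

For example, with pattern `1101` the window `ACGT` yields the spaced word `ACT`, since the third character falls on a don't-care position.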
Methods based on length of common substrings
The methods in this category employ the similarity and differences of substrings in a pair of sequences. These algorithms were mostly used for string processing in computer science.
Average common substring (ACS)
In this approach, for a chosen pair of sequences (A and B of lengths ''n'' and ''m'' respectively), the longest substring starting at some position of one sequence (A) that exactly matches the other sequence (B) at any position is identified. In this way, the lengths of the longest substrings starting at different positions in sequence A and having exact matches at some positions in sequence B are calculated. All these lengths are averaged to derive a measure ''L(A, B)''. Intuitively, the larger ''L(A, B)'' is, the more similar the two sequences are. To account for differences in sequence length, ''L(A, B)'' is normalized, i.e. ''L(A, B)''/log(''m''). This gives the similarity measure between the sequences.
In order to derive a distance measure, the inverse of the similarity measure is taken, and a correction term is subtracted from it to ensure that ''d(A, A)'' is zero. Thus
: ''d(A, B)'' = log(''m'')/''L(A, B)'' − log(''n'')/''L(A, A)''
This measure is not symmetric, so one computes the average (''d(A, B)'' + ''d(B, A)'')/2, which gives the final ACS measure between the two strings (A and B). The subsequence/substring search can be performed efficiently using suffix trees.
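The ACS computation can be sketched naively in Python. This is a quadratic-time illustration (real implementations use suffix trees for linear time), and the distance formula follows the inverse-plus-correction construction described above:

```python
import math

def avg_common_substring(a, b):
    # L(A,B): for each position i of A, the length of the longest substring
    # of A starting at i that occurs somewhere in B, averaged over all i.
    total = 0
    for i in range(len(a)):
        lo = 0
        while i + lo < len(a) and a[i:i + lo + 1] in b:
            lo += 1
        total += lo
    return total / len(a)

def acs_distance(a, b):
    # d(A,B) = log(m)/L(A,B) - log(n)/L(A,A), symmetrized by averaging,
    # so that the distance of a sequence to itself is zero.
    def d(x, y):
        return math.log(len(y)) / avg_common_substring(x, y) - \
               math.log(len(x)) / avg_common_substring(x, x)
    return (d(a, b) + d(b, a)) / 2
```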
''k''-mismatch average common substring approach (kmacs)
This approach is a generalization of the ACS approach. To define the distance between two DNA or protein sequences, kmacs estimates, for each position ''i'' of the first sequence, the longest substring starting at ''i'' and matching a substring of the second sequence with up to ''k'' mismatches. It defines the average of these values as a measure of similarity between the sequences and turns this into a symmetric distance measure. Kmacs does not compute exact ''k''-mismatch substrings, since this would be computationally too costly, but approximates such substrings.
Mutation distances (Kr)
This approach is closely related to ACS; it calculates the number of substitutions per site between two DNA sequences using shortest absent substrings.
Length distribution of k-mismatch common substrings
This approach uses the program kmacs to calculate longest common substrings with up to ''k'' mismatches for a pair of DNA sequences. The phylogenetic distance between the sequences can then be estimated from a local maximum in the length distribution of the ''k''-mismatch common substrings.
Methods based on the number of (spaced) word matches
These approaches are variants of statistics that count the number of ''k''-mer matches between two sequences. They improve on the simple count statistic by taking the background distribution of the compared sequences into account.
MASH
This is an extremely fast method that uses the MinHash bottom-sketch strategy to estimate the Jaccard index of the multi-sets of ''k''-mers of two input sequences. That is, it estimates the ratio of ''k''-mer matches to the total number of ''k''-mers of the sequences. This can be used, in turn, to estimate the evolutionary distance between the compared sequences, measured as the number of substitutions per sequence position since the sequences evolved from their last common ancestor.
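The bottom-sketch idea can be sketched as follows. This is illustrative only: the real MASH program uses MurmurHash, canonical (strand-neutral) ''k''-mers, and converts the Jaccard estimate into a mutation distance.

```python
import hashlib

def kmer_hashes(seq, k):
    # Hash every k-mer to an integer; the set discards duplicates.
    return {int(hashlib.sha1(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)}

def minhash_jaccard(a, b, k=4, s=100):
    # Bottom-s sketch: the s smallest hashes of the union act as a random
    # sample of k-mers; the fraction present in both sets estimates the
    # Jaccard index without comparing the full k-mer sets.
    ha, hb = kmer_hashes(a, k), kmer_hashes(b, k)
    merged = sorted(ha | hb)[:s]
    shared = sum(1 for h in merged if h in ha and h in hb)
    return shared / len(merged)
```

MASH then maps the Jaccard estimate ''j'' to a distance via ''d'' = −(1/''k'')·ln(2''j''/(1 + ''j'')).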
Slope-Tree
This approach calculates a distance value between two protein sequences based on the decay of the number of ''k''-mer matches as ''k'' increases.
Slope-SpaM
This method calculates the number of ''k''-mer or spaced-word matches (''SpaM'') for different values of the word length ''k'' or of the number of match positions in the underlying pattern, respectively. The slope of an affine-linear function that depends on ''k'' is calculated to estimate the Jukes–Cantor distance between the input sequences.
Skmer
''Skmer'' calculates distances between species from unassembled sequencing reads. Similar to ''MASH'', it uses the Jaccard index on the sets of ''k''-mers from the input sequences. In contrast to ''MASH'', the program remains accurate for low sequencing coverage, so it can be used for genome skimming.
Methods based on micro-alignments
Strictly speaking, these methods are not ''alignment-free''. They use simple gap-free ''micro-alignments'' in which sequences are required to match at certain pre-defined positions. The nucleotides aligned at the remaining positions of the ''micro-alignments'', where mismatches are allowed, are then used for phylogeny inference.
Co-phylog
This method searches for so-called ''structures'', defined as pairs of ''k''-mer matches between two DNA sequences that are one position apart in both sequences. The two ''k''-mer matches are called the ''context''; the position between them is called the ''object''. Co-phylog then defines the distance between two sequences as the fraction of such ''structures'' for which the two nucleotides in the ''object'' differ. The approach can be applied to unassembled sequencing reads.
andi
andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches. Such word matches can be found efficiently using suffix arrays. The gap-free alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences. The resulting distance estimates are accurate for up to around 0.6 substitutions per position.
Filtered Spaced-Word Matches (FSWM)
FSWM uses a pre-defined binary pattern ''P'' representing so-called ''match positions'' and ''don't-care positions''. For a pair of input DNA sequences, it then searches for ''spaced-word matches'' w.r.t. ''P'', i.e. for local gap-free alignments with matching nucleotides at the ''match positions'' of ''P'' and possible mismatches at the ''don't-care positions''. Spurious low-scoring spaced-word matches are discarded; evolutionary distances between the input sequences are estimated based on the nucleotides aligned to each other at the ''don't-care positions'' of the remaining, homologous spaced-word matches. FSWM has been adapted to estimate distances based on unassembled NGS reads; this version of the program is called ''Read-SpaM''.
Prot-SpaM
Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences.
Multi-SpaM
Multi-SpaM (Multiple Spaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison. Given a binary pattern ''P'' of ''match positions'' and ''don't-care positions'', the program searches for ''P''-blocks, i.e. local gap-free four-way alignments with matching nucleotides at the ''match positions'' of ''P'' and possible mismatches at the ''don't-care positions''. Such four-way alignments are randomly sampled from a set of input genome sequences. For each ''P''-block, an unrooted tree topology is calculated using ''RAxML''. The program ''Quartet MaxCut'' is then used to calculate a supertree from these trees.
Methods based on information theory
Information theory has provided successful methods for alignment-free sequence analysis and comparison. Existing applications of information theory include global and local characterization of DNA, RNA and proteins, from estimating genome entropy to motif and region classification. It also holds promise in gene mapping, next-generation sequencing analysis and metagenomics.
Base–base correlation (BBC)
Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation:
: <math>T_{ij}(K) = \sum_{\ell=1}^{K} P_{ij}(\ell)\,\log_2\frac{P_{ij}(\ell)}{P_i P_j}</math>
Here ''P<sub>i</sub>'' and ''P<sub>j</sub>'' denote the probabilities of bases ''i'' and ''j'' in the genome, and ''P<sub>ij</sub>(ℓ)'' indicates the joint probability of bases ''i'' and ''j'' occurring at distance ''ℓ'' in the genome. The parameter ''K'' indicates the maximum distance considered between the bases ''i'' and ''j''. The variation in the values of the 16 parameters reflects variation in genome content and length.
Information correlation and partial information correlation (IC-PIC)
The IC-PIC (information correlation and partial information correlation) based method employs the base correlation property of the DNA sequence. IC and PIC are calculated using the following formulas:
:
:
The final vector is obtained as follows:
:
which defines the range of distance between bases.
The pairwise distance between sequences is calculated using the Euclidean distance measure. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc.
Compression
Examples are effective approximations to Kolmogorov complexity, for example Lempel–Ziv complexity. In general, compression-based methods use the mutual information between the sequences. This is expressed in terms of conditional Kolmogorov complexity, that is, the length of the shortest self-delimiting program required to generate a string given prior knowledge of the other string. This measure has a relation to measuring ''k''-words in a sequence, as they can easily be used to generate the sequence. It is sometimes a computationally intensive method. The theoretical basis for the Kolmogorov complexity approach was laid by Bennett, Gács, Li, Vitányi, and Zurek (1998), who proposed the information distance. The Kolmogorov complexity being incomputable, it is approximated by compression algorithms: the better they compress, the better the approximation. Li, Badger, Chen, Kwong, Kearney, and Zhang (2001) used a non-optimal but normalized form of this approach; the optimal normalized form was introduced by Li, Chen, Li, Ma, and Vitányi (2003) and developed more extensively, with proofs, by Cilibrasi and Vitányi (2005).
Otu and Sayood (2003) used the Lempel–Ziv complexity method to construct five different distance measures for phylogenetic tree construction.
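A minimal sketch of the normalized compression distance (NCD), the normalized form studied by Cilibrasi and Vitányi, here using zlib as an off-the-shelf compressor:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Approximate Kolmogorov complexity K(s) by the compressed length C(s):
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    c = lambda s: len(zlib.compress(s, 9))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 indicate that one sequence compresses well given the other; because real compressors are imperfect, results can slightly exceed 1.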
Context modeling compression
In context modeling compression, the next-symbol predictions of one or more statistical models are combined or compete to yield a prediction based on events recorded in the past. The algorithmic information content derived from each symbol prediction can be used to compute algorithmic information profiles in time proportional to the length of the sequence. The process has been applied to DNA sequence analysis.
Methods based on graphical representation
Iterated maps
The use of iterated maps for sequence analysis was first introduced by HJ Jeffrey in 1990, when he proposed applying the chaos game to map genomic sequences into a unit square. That report coined the term Chaos Game Representation (CGR) for the procedure. Only three years later, this approach was dismissed as a projection of a Markov transition table by N Goldman. This objection was overruled by the end of that decade, when the opposite was found to be the case: CGR bijectively maps Markov transitions into a fractal, order-free (degree-free) representation. The realization that iterated maps provide a bijective map between the symbolic space and numeric space led to the identification of a variety of alignment-free approaches to sequence comparison and characterization. These developments were reviewed in late 2013 by JS Almeida. A number of web apps, such as https://github.com/usm/usm.github.com/wiki, are available to demonstrate how to encode and compare arbitrary symbolic sequences in a manner that takes full advantage of modern MapReduce distribution developed for cloud computing.
Comparison of alignment based and alignment-free methods
Applications of alignment-free methods
* Genomic rearrangements
* Molecular phylogenetics
* Metagenomics
* Next generation sequence data analysis
* Epigenomics
* Barcoding of species
* Population genetics
* Horizontal gene transfer
* Sero/genotyping of viruses
* Allergenicity prediction
* SNP discovery
* Recombination detection
* Viral classification
* Archaeal taxonomic identification
List of web servers/software for alignment-free methods
See also
*Sequence analysis
*Multiple sequence alignment
*Phylogenomics
*Bioinformatics
*Metagenomics
*Next-generation sequencing
*Population genetics
*SNPs
*Recombination detection program
*Genome skimming
References
{{Reflist}}
Bioinformatics
Computational biology