bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...

, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of

approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching i ...

that allows for substitutions. They are a straightforward modification to the earliest

heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...

-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search.,

alignment Alignment may refer to: Archaeology * Alignment (archaeology), a co-linear arrangement of features or structures with external landmarks * Stone alignment, a linear arrangement of upright, parallel megalithic standing stones Biology * Structu ...

assembly Assembly may refer to: Organisations and meetings * Deliberative assembly, a gathering of members who use parliamentary procedure for making decisions * General assembly, an official meeting of the members of an organization or of their representa ...

, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and

dash The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the endash , generally longer than the hyphen b ...

es or

asterisk The asterisk ( ), from Late Latin , from Ancient Greek , ''asteriskos'', "little star", is a typographical symbol. It is so called because it resembles a conventional image of a heraldic star. Computer scientists and mathematicians often voc ...

s for irrelevant positions.

Principle

Due to a number of functional and

evolution Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...

ary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated on the order of 0.6% (or around 20 million base pairs). Identification of highly similar regions in the

genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...

may indicate functional importance, as mutations in these areas that would result in cessation of function or loss of regulatory ability would be evolutionary unfavorable. More observed differences between two sequences may arise as a result of stochastic sequencing errors. Similarly, when performing assembly of a previously characterized genome, an attempt is made to align the newly sequenced DNA fragments to the existing genome sequence. In both cases, it is useful to be able to directly compare nucleic acid sequences. Since the sequences are not expected to be exactly identical, however, it is beneficial to focus on smaller subsequences that are more likely to be locally identical. Spaced seeds allow for even more permissive local matching by allowing certain base pairs (defined by the pattern of the specific spaced seed) to mismatch without penalty, thus allowing algorithms that use the general "hit-extend" strategy of alignment to explore additional potential matches that would be otherwise ignored.

Example

As a simple example, consider the following DNA sequences:

CTAAGTCACG
CTAACACACG
1111001111

Upon visual inspection, it's easy to see that there is a mismatch between the two sequences at the fifth and six base positions (in bold, above). However, the sequences still share 80% sequence similarity. The mismatches may be due to a real (biological) change or a sequencing error. In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified. But a spaced seed of

1111001111

could be used to effectively zero-weighting the mismatch sites, treating the sequences as same for the purposes of hit identification. In reality, of course, we don't know the relative positioning of the "true" mismatches, so there can be different spaced seed patterns depending on where the mismatches are anticipated.

History

The concept of spaced seeds has been explored in literature under different names. One of the early uses was in sequence homology where the FLASH algorithm from 1993 referred to it as "non-contiguous sub-sequences of tokens" that were generated from all combinations of positions within a sequence window. In 1995, a similar concept was used in approximate string matching where "gapped

tuple In mathematics, a tuple is a finite ordered list (sequence) of elements. An -tuple is a sequence (or ordered list) of elements, where is a non-negative integer. There is only one 0-tuple, referred to as ''the empty tuple''. An -tuple is defi ...

s" of positions in a sequence were explored to identify common substrings between a large text and a query. The term "shape" was used in a 2001 paper to describe gapped q-grams where it refers to a set of relevant positions in a substring and soon after in 2002,

PatternHunter PatternHunter is a commercially available homology search instrument software that uses sequence alignment techniques. It was initially developed in the year 2002 by three scientists: Bin Ma, John Tramp and Ming Li. These scientists were driven by t ...

introduced "spaced model" which was proposed as an improvement upon the consecutive seeds used in BLAST, which was ultimately adopted by newer versions of it. Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunter

Spaced versus Non-Spaced Models

Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches. Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit. In a spaced model (such as PatternHunter), the matches are not necessarily consecutive. More formally, the difference in a spaced seed model as compared to a non-spaced model is the relative positioning (otherwise known as weight,

k

) of the matched bases. In a non-spaced model, the length of the seed model,

M

and the weight,

k

of the seeds are the same, as they must be consecutive while in a spaced model, the weight is not necessarily equal to the length of the seed model, since match positions may be non-consecutive. Therefore, a spaced seed model may be longer than a non-spaced seed model but have the same weight. For example, a non-spaced seed

11111

has the same weight as a spaced model

1101101

, but their lengths differ. The predicted number of hits can be calculated from PatternHunter using the following

lemma Lemma may refer to: Language and linguistics * Lemma (morphology), the canonical, dictionary or citation form of a word * Lemma (psycholinguistics), a mental abstraction of a word about to be uttered Science and mathematics * Lemma (botany), a ...

(L-M+1)*p^k

Where

L

is the length of the sequence the model is compared to,

M

is the length of the seed model,

p

is the probability of a match and

k

is the weight of the seed used.

Applications

Homology Search

The type of seed model used for sequence alignment can affect the processing time and memory usage when doing large-scale homology searches – two considerations that have been central in the development of modern homology search algorithms. It may also affect the sensitivity. Using spaced seed models has been demonstrated to allow for faster homology searches as seen with PatternHunter wherein homology searches were twenty times faster and used less memory than BLASTn (using a non-spaced model).

Sequence Alignment

Most aligners first find candidate locations (or seeds) in the target sequence and then inspect those more closely to verify the alignment. Ideally, this first step would find all relevant locations in the target so sensitivity is prioritized but due to computational intensity, many popular algorithms (such as the earlier implementations of BLAST and FASTA) use heuristics to "short-cut" exploring all locations, ultimately missing many but running relatively quickly. One possible way to increase sensitivity, as done in the SHRiMP2 algorithm, is to use spaced seeds to allow for small differences between the query and the candidate locations so that somewhat more locations are identified as candidates. SHRiMP2 specifically uses multiple spaced seeds for this and requires multiple matches, increasing sensitivity as it allows different possible combinations of differences while maintaining speed comparable to original methods.

Sequence Assembly

A variation of spaced seeds with a single contiguous gap has been used in ''de novo'' sequence assembly. In this instance, the design has an equal number of ones at either end of the sequence with a run of zeroes in between. The reasoning behind this design is that in assemblers that utilize

De Bruijn graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multiple ...

s, increasing k-mer size inflates memory usage, as k-mers are more likely to be unique. However, the most important parts of a k-mer are its ends, as they are what are used to extend sequences in a graph. Thus, to circumvent the problem with memory usage, the less-important middle part (covered by the gap) is ignored. This approach has the additional advantage, as in other uses of spaced seeds, of taking into account any sequencing errors that may have occurred in the gap area. It has been noted that increasing the length of the gap also increases the uniqueness of k-mers in both ''

E. coli ''Escherichia coli'' (),Wells, J. C. (2000) Longman Pronunciation Dictionary. Harlow ngland Pearson Education Ltd. also known as ''E. coli'' (), is a Gram-negative, facultative anaerobic, rod-shaped, coliform bacterium of the genus ''Escher ...

'' and ''

H. sapiens Humans (''Homo sapiens'') are the most abundant and widespread species of primate, characterized by bipedalism and exceptional cognitive skills due to a large and complex brain. This has enabled the development of advanced tools, culture, ...

'' genomes.

Metagenomics

A metagenomics study will commonly start with the high-throughput sequencing of a mixture of distinct species (e.g. from the human gut), yielding a set of sequences but with unknown origins. As such, one common goal is to identify which genome each sequence is

phylogenetically In biology, phylogenetics (; from Greek φυλή/ φῦλον [] "tribe, clan, race", and wikt:γενετικός, γενετικός [] "origin, source, birth") is the study of the evolutionary history and relationships among or within groups o ...

most similar to. One approach could be to take k-mers from each sequence and see which, in a set of genomes, it has most sequence similarity with. Spaced seeds have been successfully utilized for this purpose by finding how many k-mers are found in each genome (hit number) and the total number of positions these k-mers cover (coverage).

Multiple Spaced Seeds

An improvement to (single) spaced seeds was first demonstrated by PatternHunter in 2002 where a set of spaced seeds were used, in which a "hit" was called whenever one of the set matched. PatternHunter II, in 2003, demonstrated that this approach could offer higher sensitivity than BLAST while maintaining similar speed. Identifying an optimal set of spaced seeds is an

NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...

problem, however, even finding a "good" set of spaced seeds remains difficult, although several attempts have been made to computationally identify them. Since the speed of the algorithm must decrease with an increasing number of spaced seeds, it makes sense to only consider multiple seeds when all offer some useful contribution. There is ongoing research on how to quickly calculate good multiple spaced seeds, as previous homology search software calculated and

hard-coded Hard coding (also hard-coding or hardcoding) is the software development practice of embedding data directly into the source code of a program or other executable object, as opposed to obtaining the data from external sources or generating it at ...

their seeds – it would be advantageous to be able to calculate purpose-driven multiple spaced seeds instead.

References

{{reflist Genomics String matching algorithms