Nucleic acid structure prediction is a computational method to determine ''secondary'' and ''tertiary''
nucleic acid structure
Nucleic acid structure refers to the structure of nucleic acids such as DNA and RNA. Chemically speaking, DNA and RNA are very similar. Nucleic acid structure is often divided into four different levels: primary, secondary, tertiary, and quatern ...
from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling (when the structure of a homologous sequence is known).
The problem of predicting nucleic acid secondary structure is dependent mainly on
base pair
A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...
ing and
base stacking interactions; many molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of nucleic acid molecules, such as
transfer RNA
Transfer RNA (abbreviated tRNA and formerly referred to as sRNA, for soluble RNA) is an adaptor molecule composed of RNA, typically 76 to 90 nucleotides in length (in eukaryotes), that serves as the physical link between the mRNA and the amino ac ...
(tRNA) or
microRNA
MicroRNA (miRNA) are small, single-stranded, non-coding RNA molecules containing 21 to 23 nucleotides. Found in plants, animals and some viruses, miRNAs are involved in RNA silencing and post-transcriptional regulation of gene expression. miRN ...
(miRNA), is observed. Many secondary structure prediction methods rely on variations of
dynamic programming
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
I ...
and therefore are unable to efficiently identify
pseudoknot
__NOTOC__
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. The pseudoknot was first recognized in the turnip yellow ...
s.
While the methods are similar, there are slight differences in the approaches to RNA and DNA structure prediction. ''
In vivo
Studies that are ''in vivo'' (Latin for "within the living"; often not italicized in English) are those in which the effects of various biological entities are tested on whole, living organisms or cells, usually animals, including humans, and ...
'', DNA structures are more likely to be duplexes with full
complementarity between two strands, while RNA structures are more likely to fold into complex secondary and tertiary structures such as in the
ribosome
Ribosomes ( ) are macromolecular machines, found within all cells, that perform biological protein synthesis (mRNA translation). Ribosomes link amino acids together in the order specified by the codons of messenger RNA (mRNA) molecules to ...
,
spliceosome
A spliceosome is a large ribonucleoprotein (RNP) complex found primarily within the nucleus of eukaryotic cells. The spliceosome is assembled from small nuclear RNAs (snRNA) and numerous proteins. Small nuclear RNA (snRNA) molecules bind to specifi ...
, or
transfer RNA
Transfer RNA (abbreviated tRNA and formerly referred to as sRNA, for soluble RNA) is an adaptor molecule composed of RNA, typically 76 to 90 nucleotides in length (in eukaryotes), that serves as the physical link between the mRNA and the amino ac ...
. This is partly because the extra oxygen in RNA increases the propensity for
hydrogen bond
In chemistry, a hydrogen bond (or H-bond) is a primarily electrostatic force of attraction between a hydrogen (H) atom which is covalently bound to a more electronegative "donor" atom or group (Dn), and another electronegative atom bearing a ...
ing in the nucleic acid backbone. The
energy parameters are also different for the two nucleic acids. The structure prediction methods can follow a completely theoretical approach, or a hybrid one incorporating experimental data.
Single sequence structure prediction
A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given only a nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the
secondary structure
Protein secondary structure is the three dimensional conformational isomerism, form of ''local segments'' of proteins. The two most common Protein structure#Secondary structure, secondary structural elements are alpha helix, alpha helices and beta ...
or intra-molecular
base pair
A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...
ing interactions of the molecule. This is shown by the high conservation of base pairings across diverse species.
The most stable structure
Secondary structure of small RNA molecules is largely determined by strong, local interactions such as
hydrogen bond
In chemistry, a hydrogen bond (or H-bond) is a primarily electrostatic force of attraction between a hydrogen (H) atom which is covalently bound to a more electronegative "donor" atom or group (Dn), and another electronegative atom bearing a ...
s and
base stacking. Summing the free energy for such interactions should provide an approximation for the stability of a given structure. To predict the folding free energy of a given secondary structure, an empirical
nearest-neighbor model is used. In the nearest neighbor model the free energy change for each motif depends on the sequence of the motif and of its closest base-pairs.
The model and parameters of minimal energy for Watson–Crick pairs, GU pairs and loop regions were derived from empirical calorimetric experiments, the most up-to-date parameters were published in 2004,
although most software packages use the prior set assembled in 1999.
The simplest way to find the lowest free energy structure would be to generate all possible structures and calculate the free energy for it, but the number of possible structures for a sequence increases exponentially with the length of RNA:
number of secondary structures = (1,8)N, N- number of nucleotides
.
For longer molecules, the number of possible secondary structures is huge: a sequence of 100 nucleotides has more than 10
25 possible secondary structures.
Dynamic programming algorithms
Most popular methods for predicting RNA and DNA's secondary structure involve
dynamic programming
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
I ...
.
One of the early attempts at predicting RNA secondary structure was made by
Ruth Nussinov
Ruth Nussinov (Hebrew: פרופסור רות נוסינוב) is an Israeli-American biologist who works as a Professor in the Department of Human Genetics, School of Medicine at Tel Aviv University and is the Senior Principal Scientist and Princi ...
and co-workers who developed a dynamic programming-based algorithm that maximized the length and number of a series of "blocks" (polynucleotide chains).
Each "block" required at least two nucleotides, which reduced the algorithm's storage requirements over single base-matching approaches.
Nussinov et al. later published an adapted approach with improved performance that increased the RNA size limit to ~1,000 bases by folding increasingly sized subsections while storing the results of prior folds, now known as the
Nussinov algorithm.
In 1981, Michael Zuker and Patrick Stiegler proposed a refined approach with performance comparable to Nussinov et al.'s solution but with the additional ability to find also find "suboptimal" secondary structures.
Dynamic programming algorithms provide a means to implicitly check all variants of possible RNA secondary structures without explicitly generating the structures. First, the lowest conformational free energy is determined for each possible sequence fragment starting with the shortest fragments and then for longer fragments. For longer fragments, recursion on the optimal free energy changes determined for shorter sequences speeds the determination of the lowest folding free energy. Once the lowest free energy of the complete sequence is calculated, the exact structure of RNA molecule is determined.
Dynamic programming algorithms are commonly used to detect
base pair
A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...
ing patterns that are "well-nested", that is, form
hydrogen bond
In chemistry, a hydrogen bond (or H-bond) is a primarily electrostatic force of attraction between a hydrogen (H) atom which is covalently bound to a more electronegative "donor" atom or group (Dn), and another electronegative atom bearing a ...
s only to bases that do not overlap one another in sequence position. Secondary structures that fall into this category include
double helices,
stem-loop
Stem-loop intramolecular base pairing is a pattern that can occur in single-stranded RNA. The structure is also known as a hairpin or hairpin loop. It occurs when two regions of the same strand, usually complementary in nucleotide sequence when ...
s, and variants of the "cloverleaf" pattern found in
transfer RNA
Transfer RNA (abbreviated tRNA and formerly referred to as sRNA, for soluble RNA) is an adaptor molecule composed of RNA, typically 76 to 90 nucleotides in length (in eukaryotes), that serves as the physical link between the mRNA and the amino ac ...
molecules. These methods rely on pre-calculated parameters which estimate the
free energy associated with certain types of base-pairing interactions, including
Watson-Crick and
Hoogsteen base pair
A Hoogsteen base pair is a variation of base-pairing in nucleic acids such as the A•T pair. In this manner, two nucleobases, one on each strand, can be held together by hydrogen bonds in the major groove. A Hoogsteen base pair applies the N7 ...
s. Depending on the complexity of the method, single base pairs may be considered, and short two- or three-base segments, to incorporate the effects of base stacking. This method cannot identify
pseudoknot
__NOTOC__
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. The pseudoknot was first recognized in the turnip yellow ...
s, which are not well nested, without substantial algorithmic modifications that are computationally very costly.
Suboptimal structures
The accuracy of RNA secondary structure prediction from one sequence by free energy minimization is limited by several factors:
# The free energy value's list in nearest neighbor model is incomplete
# Not all known RNA folds in such a way as to conform with the thermodynamic minimum.
# Some RNA sequences have more than one biologically active conformation (i.e.,
riboswitch
In molecular biology, a riboswitch is a regulatory segment of a messenger RNA molecule that binds a small molecule, resulting in a change in production of the proteins encoded by the mRNA. Thus, an mRNA that contains a riboswitch is directly invo ...
es)
For this reason, the ability to predict structures which have similar low free energy can provide significant information. Such structures are termed
suboptimal structures. MFOLD is one program that generates suboptimal structures.
Predicting pseudoknots
One of the issues when predicting RNA secondary structure is that the standard free energy minimization and statistical sampling methods can not find
pseudoknot
__NOTOC__
A pseudoknot is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem. The pseudoknot was first recognized in the turnip yellow ...
s.
The major problem is that the usual dynamic programing algorithms, when predicting secondary structure, consider only the interactions between the closest nucleotides, while pseudoknotted structures are formed due to interactions between distant nucleotides. Rivas and Eddy published a dynamic programming algorithm for predicting pseudoknots.
However, this dynamic programming algorithm is very slow. The standard dynamic programming algorithm for free energy minimization scales O(N
3) in time (N is the number of nucleotides in the sequence), while the Rivas and Eddy algorithm scales O(N
6) in time. This has prompted several researchers to implement versions of the algorithm that restrict classes of pseudoknots, resulting in performance gains. For example, pknotsRG tool includes only the class of simple recursive pseudoknots and scales O(N4) in time.
Other approaches for RNA secondary structure prediction
Another approach for RNA secondary structure determination is to sample structures from the
Boltzmann
Ludwig Eduard Boltzmann (; 20 February 1844 – 5 September 1906) was an Austrian physicist and philosopher. His greatest achievements were the development of statistical mechanics, and the statistical explanation of the second law of thermodyn ...
ensemble,
as exemplified by the program SFOLD. The program generates a statistical sample of all possible RNA secondary structures. The algorithm samples secondary structures according to the
Boltzmann distribution
In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution Translated by J.B. Sykes and M.J. Kearsley. See section 28) is a probability distribution or probability measure that gives the probability t ...
. The sampling method offers an appealing solution to the problem of uncertainties in folding.
Comparative secondary structure prediction
Sequence covariation methods rely on the existence of a data set composed of multiple
homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in
evolution
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...
; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be
NP-complete
In computational complexity theory, a problem is NP-complete when:
# it is a problem for which the correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by tryi ...
.
In general, the problem of alignment and consensus structure prediction are closely related. Three different approaches to the prediction of consensus structures can be distinguished:
# Folding of alignment
# Simultaneous sequence alignment and folding
# Alignment of predicted structures
Align then fold
A practical
heuristic
A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
approach is to use
multiple sequence alignment
Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...
tools to produce an alignment of several RNA sequences, to find consensus sequence and then fold it. The quality of the alignment determines the accuracy of the consensus structure model. Consensus sequences are folded using various approaches similarly as in individual structure prediction problem. The thermodynamic folding approach is exemplified by RNAalifold program.
The different approaches are exemplified by Pfold and ILM programs. Pfold program implements a
SCFGs.
ILM (iterated loop matching) unlike the other algorithms for folding of alignments, can return pseudoknoted structures. It uses combination of
thermodynamics
Thermodynamics is a branch of physics that deals with heat, work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed by the four laws of the ...
and
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such ...
content scores.
Align and fold
Evolution
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...
frequently preserves functional RNA structure better than RNA sequence.
Hence, a common biological problem is to infer a common structure for two or more highly diverged but homologous RNA sequences. In practice, sequence alignments become unsuitable and do not help to improve the accuracy of structure prediction, when sequence similarity of two sequences is less than 50%.
Structure-based alignment programs improves the performance of these alignments and most of them are variants of the Sankoff algorithm.
Basically, Sankoff algorithm is a merger of sequence alignment and Nussinov
(maximal-pairing) folding dynamic programming method.
Sankoff algorithm itself is a theoretical exercise because it requires extreme computational resources (O
(n3m) in time, and O
(n2m) in space, where n is the sequence length and m is the number of sequences). Some notable attempts at implementing restricted versions of Sankoff's algorithm are Foldalign,
Dynalign,
PMmulti/PMcomp,
Stemloc,
and Murlet.
In these implementations the maximal length of alignment or variants of possible consensus structures are restricted. For example, Foldalign focuses on local alignments and restricts the possible length of the sequences alignment.
Fold then align
A less widely used approach is to fold the sequences using single sequence structure prediction methods and align the resulting structures using tree-based metrics.
The fundamental weakness with this approach is that single sequence predictions are often inaccurate, thus all further analyses are affected.
Tertiary structure prediction
Once secondary structure of RNA is known, the next challenge is to predict
tertiary structure
Protein tertiary structure is the three dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains may int ...
. The biggest problem is to determine the structure of regions between double stranded helical regions. Also RNA molecules often contain posttranscriptionally modified nucleosides, which because of new possible non-canonical interactions, cause a lot of troubles for tertiary structure prediction.
The three-dimensional structure prediction methods can use comparative modeling which starts from a related known structure known as the template. The alternative strategy is de novo modeling of RNA secondary structure which uses physics-based principles such as molecular dynamics or random sampling of the conformational landscape followed by screening with a statistical potential for scoring. These methods either use an all-atom representation of the nucleic acid structure or a coarse-grained representation. The low-resolution structures generated by many of these modeling methods are then subjected to high-resolution refinement.
See also
*
RNA
Ribonucleic acid (RNA) is a polymeric molecule essential in various biological roles in coding, decoding, regulation and expression of genes. RNA and deoxyribonucleic acid ( DNA) are nucleic acids. Along with lipids, proteins, and carbohydra ...
*
RNA structure
Nucleic acid structure refers to the structure of nucleic acids such as DNA and RNA. Chemically speaking, DNA and RNA are very similar. Nucleic acid structure is often divided into four different levels: primary, secondary, tertiary, and quatern ...
*
Non-coding RNA
A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. Abundant and functionally important types of non-c ...
*
List of RNA structure prediction software
This list of RNA structure prediction software is a compilation of software tools and web portals used for RNA structure prediction.
Single sequence secondary structure prediction.
Single sequence tertiary structure prediction
Comparative me ...
*
Comparison of nucleic acid simulation software
This is a list of notable computer programs that are used for nucleic acid
Nucleic acids are biopolymers, macromolecules, essential to all known forms of life. They are composed of nucleotides, which are the monomers made of three components: ...
*
Comparison of software for molecular mechanics modeling
This is a list of computer programs that are predominantly used for molecular mechanics calculations.
See also
*Car–Parrinello molecular dynamics
*Comparison of force-field implementations
*Comparison of nucleic acid simulation software
* ...
References
Further reading
*
*
*
*
*
*
*
*
*
*
*
* Tuzet, H. & Perriquet, O., 2004. CARNAC: folding families of related RNAs. Nucleic Acids Research, 32(Web Server issue), W142-145.
*
*
*
ModeRNA: A program for comparative RNA modeling
{{Biomolecular structure
RNA