Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families. Unlike proteins, ncRNAs often have similar secondary structure without sharing much similarity in the primary sequence. Rfam divides ncRNAs into families based on evolution from a common ancestor. Producing multiple sequence alignments (MSA) of these families can provide insight into their structure and function, similar to the case of protein families. These MSAs become more useful with the addition of secondary structure information. Rfam researchers also contribute to Wikipedia's RNA WikiProject.
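The point about shared structure without shared sequence can be illustrated with a small sketch. The two sequences and the dot-bracket structure below are invented toy data, not Rfam entries, and the helper names are hypothetical. The sketch only checks that both sequences can form every base pair of the same hairpin even though they agree at no aligned position.

```python
# Toy illustration (invented data): two RNAs with 0% sequence identity
# that are both compatible with the same hairpin structure.

# Watson-Crick pairs plus the G-U wobble pair.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def base_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def identity(a, b):
    """Fraction of aligned positions where two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fits(seq, structure):
    """True if every paired position in the structure holds an allowed pair."""
    return all((seq[i], seq[j]) in ALLOWED_PAIRS
               for i, j in base_pairs(structure))


structure = "((((....))))"
seq_a = "GGCAUUCGUGCC"
seq_b = "CAUGGAAACAUG"

print(identity(seq_a, seq_b))                         # 0.0
print(fits(seq_a, structure), fits(seq_b, structure)) # True True
```

A sequence-only comparison would find nothing in common between these two molecules, which is why family models that also score the structure are useful for ncRNAs.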

Uses

The Rfam database can be used for a variety of functions. For each ncRNA family, the interface allows users to: view and download multiple sequence alignments; read annotation; and examine species distribution of family members. There are also links provided to literature references and other RNA databases. Rfam also provides links to Wikipedia so that entries can be created or edited by users. The interface at the Rfam website allows users to search ncRNAs by keyword, family name, or genome as well as to search by ncRNA sequence or EMBL accession number. The database information is also available for download and local installation, and can be used with the INFERNAL software package. The INFERNAL package can also be used with Rfam to annotate sequences (including complete genomes) for homologues to known ncRNAs.
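As a rough sketch of the annotation use case, the snippet below assembles a command line for INFERNAL's `cmscan` program. The file names `Rfam.cm`, `genome.fa` and `hits.tbl` and the helper `build_cmscan_command` are assumptions made for illustration; `--cut_ga` (use each family's curated gathering threshold) and `--tblout` (write a tabular hit list) are taken from the Infernal user guide, which should be consulted for the options recommended for a given Rfam release. The command is only built and printed here, not executed.

```python
import shlex


def build_cmscan_command(cm_db, fasta, tblout):
    """Build (but do not run) a cmscan command line.

    Assumes cm_db is a covariance model file that has already been
    indexed with Infernal's cmpress, which cmscan requires.
    """
    return [
        "cmscan",
        "--cut_ga",          # apply family-specific gathering thresholds
        "--tblout", tblout,  # tabular summary of hits
        cm_db,               # covariance model database (hypothetical path)
        fasta,               # sequences to annotate (hypothetical path)
    ]


cmd = build_cmscan_command("Rfam.cm", "genome.fa", "hits.tbl")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where INFERNAL is installed.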

Methods

In the database, the information of the secondary structure and the primary sequence, represented by the MSA, is combined in statistical models called profile stochastic context-free grammars (SCFGs), also known as covariance models. These are analogous to hidden Markov models used for protein family annotation in the Pfam database. Each family in the database is represented by two multiple sequence alignments in Stockholm format and an SCFG. The first MSA is the "seed" alignment. It is a hand-curated alignment that contains representative members of the ncRNA family and is annotated with structural information. This seed alignment is used to create the SCFG, which is used with the Rfam software INFERNAL to identify additional family members and add them to the alignment. A family-specific threshold value is chosen to avoid false positives. Until release 12, Rfam used an initial BLAST filtering step because profile SCFGs were too computationally expensive. However, the latest versions of INFERNAL are fast enough that the BLAST step is no longer necessary. The second MSA is the "full" alignment, which is created as a result of a search using the covariance model against the sequence database. All detected homologs are aligned to the model, giving the automatically produced full alignment.
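To make the seed alignment concrete, the following minimal sketch reads a Stockholm-format alignment and extracts the base-paired columns from its `#=GC SS_cons` consensus structure line. The alignment is an invented toy, and the parser is a simplification: it handles only `#=GF` and `#=GC` markup and plain bracket pairs, and ignores the letter codes used for pseudoknots. It is not the code that Rfam or INFERNAL use.

```python
# Invented toy alignment in Stockholm format (not an Rfam family).
TOY_SEED = """# STOCKHOLM 1.0
#=GF ID toy_hairpin
seq1/1-12         GGCAUUCGUGCC
seq2/1-12         CAUGGAAACAUG
#=GC SS_cons      <<<<....>>>>
//
"""

OPEN_TO_CLOSE = {"<": ">", "(": ")", "[": "]", "{": "}"}


def parse_stockholm(text):
    """Return (file annotations, sequences, consensus structure)."""
    gf, seqs, ss_cons = {}, {}, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GF"):
            _, key, value = line.split(None, 2)
            gf[key] = value
        elif line.startswith("#=GC"):
            _, key, value = line.split(None, 2)
            if key == "SS_cons":
                ss_cons += value  # concatenate blocks of interleaved files
        elif not line.startswith("#"):
            name, chunk = line.split(None, 1)
            seqs[name] = seqs.get(name, "") + chunk.strip()
    return gf, seqs, ss_cons


def paired_columns(ss_cons):
    """Return sorted (i, j) alignment columns that the structure pairs."""
    closers = {c: o for o, c in OPEN_TO_CLOSE.items()}
    stacks = {o: [] for o in OPEN_TO_CLOSE}
    pairs = []
    for i, ch in enumerate(ss_cons):
        if ch in stacks:
            stacks[ch].append(i)
        elif ch in closers:
            pairs.append((stacks[closers[ch]].pop(), i))
    return sorted(pairs)


gf, seqs, ss_cons = parse_stockholm(TOY_SEED)
for i, j in paired_columns(ss_cons):
    # Show which nucleotide pair each sequence places in the paired columns.
    observed = [s[i] + s[j] for s in seqs.values()]
    print(i, j, observed)
```

Columns in which the observed pairs differ between sequences while remaining complementary (for example `GC` in one row and `CG` in another) show the covariation that a covariance model is built to score.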

History

Version 1.0 of Rfam was launched in 2003 and contained 25 ncRNA families and annotated about 50 000 ncRNA genes. In 2005, version 6.1 was released and contained 379 families annotating over 280 000 genes. In August 2012, version 11.0 contained 2208 RNA families, while the current version (14.6, released in July 2021) annotates 4070 families.

Major releases and publications

* 2003 - Rfam: an RNA family database.
* 2005 - Rfam: annotating non-coding RNAs in complete genomes.
* 2008 - The RNA WikiProject: community annotation of RNA families.
* 2008 - Rfam: updates to the RNA families database.
* 2011 - Rfam: Wikipedia, clans and the “decimal” release.
* 2012 - Rfam 11.0: 10 years of RNA families.
* 2014 - Rfam 12.0: updates to the RNA families database.
* 2017 - Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.
* 2020 - Rfam 14: expanded coverage of metagenomic, viral and microRNA families.

Problems

#The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats. Distinguishing these non-functional copies from functional ncRNA is a formidable challenge.
#Introns are not modeled by covariance models.

External links

Rfam website at the European Bioinformatics Institute

INFERNAL software package

miRBase