The Conserved Domain Database (CDD) is a database of well-annotated

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

models and derived database search models, for ancient domains and full-length proteins.

Philosophy

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications coincide rather often, as a matter of fact, and what is found as an independently folding unit of a polypeptide chain also carries specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. In

molecular evolution Molecular evolution is the process of change in the sequence composition of cellular molecules such as DNA, RNA, and proteins across generations. The field of molecular evolution uses principles of evolutionary biology and population genet ...

such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. CDD defines conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis. The goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models: 3-dimensional structures and conserved core motifs, conserved features/sites, phylogenetic organization, links to electronic literature resources.

Content

CDD content includes NCBI manually curated domain models and domain models imported from a number of external source databases (

Pfam Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 35.0, was released in November 2021 and contains 19,632 families. Uses ...

, SMART, COG, PRK,

TIGRFAMs TIGRFAMs is a database of protein families designed to support manual and automated genome annotation. Each entry includes a multiple sequence alignment and hidden Markov model (HMM) built from the alignment. Sequences that score above the defined c ...

). What is unique about NCBI-curated domains is that they use 3D-structure information to explicitly define domain boundaries, align blocks, amend alignment details, and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies.

Searching the database

The collection is also part of NCBI's

Entrez The Entrez (pronounced ''ɒnˈtreɪ'') Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information ...

query and retrieval system, crosslinked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via * , or at* , that allows the computation and download of annotation for large sets of protein queries.

References

External links

* {{cite web , url = https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml , title = Conserved Domains Database (CDD) and Resource Group , publisher = United States National Center for Biotechnology Information Biological databases Protein structure Protein domains