CATH

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.


Hierarchical organization

Experimentally determined three-dimensional protein structures are obtained from the Protein Data Bank and split into their constituent polypeptide chains, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation. The domains are then classified within the CATH structural hierarchy: at the Class (C) level, domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure; at the Architecture (A) level, information on the arrangement of secondary structure in three-dimensional space is used for assignment; at the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used; and assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution, i.e. they are homologous.

Additional sequence data for domains with no experimentally determined structures are provided by CATH's sister resource, Gene3D, and are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH hidden Markov models (HMMs) to predict domain sequence boundaries and make homologous superfamily assignments.
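The four-level hierarchy described above lends itself to a simple dotted identifier, one integer per level (C.A.T.H). The following is a minimal Python sketch of how such an identifier could be parsed and compared. It is not CATH's own software: the `CathCode` class, the `shared_depth` helper and the example codes are illustrative, and the mapping of class numbers 1 to 4 onto the four class descriptions is an assumption about CATH's numbering rather than something stated in this article.

```python
from dataclasses import dataclass

# Assumed numbering of the Class (C) level; the descriptions are taken from
# the text above, the integer keys are an assumption for illustration.
CLASS_NAMES = {
    1: "all alpha",
    2: "all beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


@dataclass(frozen=True)
class CathCode:
    """A hypothetical container for a four-level C.A.T.H identifier."""

    c: int  # Class
    a: int  # Architecture
    t: int  # Topology/fold
    h: int  # Homologous superfamily

    @classmethod
    def parse(cls, code: str) -> "CathCode":
        """Parse a string such as '3.40.50.300' into its four levels."""
        parts = code.strip().split(".")
        if len(parts) != 4 or not all(p.isdigit() for p in parts):
            raise ValueError(f"expected four dot-separated integers, got {code!r}")
        return cls(*(int(p) for p in parts))

    def as_tuple(self) -> tuple:
        return (self.c, self.a, self.t, self.h)

    def level(self, depth: int) -> str:
        """Return the identifier truncated to `depth` levels (1 = C, 4 = H)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(n) for n in self.as_tuple()[:depth])

    @property
    def class_name(self) -> str:
        return CLASS_NAMES.get(self.c, "unknown class")


def shared_depth(x: CathCode, y: CathCode) -> int:
    """Count how many leading levels two codes have in common (0 to 4)."""
    depth = 0
    for p, q in zip(x.as_tuple(), y.as_tuple()):
        if p != q:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    first = CathCode.parse("3.40.50.300")
    second = CathCode.parse("3.40.50.720")
    print(first.class_name)             # description of the Class level
    print(first.level(3))               # identifier down to Topology/fold
    print(shared_depth(first, second))  # levels the two domains share
```

Under this scheme, two domains that share three leading levels have the same class, architecture and topology but belong to different homologous superfamilies.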


Releases

The CATH team aims to provide official releases of the CATH classification every 12 months. This release process is important because it allows for internal validation, extra annotations and analysis. However, it can mean that there is a delay between new structures appearing in the PDB and the latest official CATH release. To address this, CATH-B provides a limited amount of information on the very latest domain annotations (e.g. domain boundaries and superfamily classifications).

The latest release of CATH-Gene3D (v4.3) was made in December 2020 and consists of:

* 500,238 structural protein domain entries
* 151 million non-structural protein domain entries
* 5,481 homologous superfamily entries
* 212,872 functional family entries


Open source software

CATH is an open source software project whose developers build and maintain a number of open source tools. CATH maintains a to-do list on GitHub so that external users can create and keep track of issues relating to the CATH protein structure classification.

