The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of

protein domain In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...

s. It was created in the mid-1990s by Professor

Christine Orengo Christine Anne Orengo is a Professor of Bioinformatics at University College London (UCL)Christine Orengo's known for her work on protein structure, particularly the CATH database. Orengo serves as president of the International Society for Co ...

and colleagues including

Janet Thornton Dame Janet Maureen Thornton, (born 23 May 1949) is a senior scientist and director emeritus at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL). She is one of the world's leading researcher ...

and David Jones, and continues to be developed by the Orengo group at

University College London , mottoeng = Let all come who by merit deserve the most reward , established = , type = Public research university , endowment = £143 million (2020) , budget = ...

. CATH shares many broad features with the

SCOP A ( or ) was a poet as represented in Old English poetry. The scop is the Old English counterpart of the Old Norse ', with the important difference that "skald" was applied to historical persons, and scop is used, for the most part, to designa ...

resource, however there are also many areas in which the detailed classification differs greatly.

Hierarchical organization

Experimentally-determined protein three-dimensional structures are obtained from the

Protein Data Bank The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cry ...

and split into their consecutive

polypeptide chains Peptides (, ) are short chains of amino acids linked by peptide bonds. Long chains of amino acids are called proteins. Chains of fewer than twenty amino acids are called oligopeptides, and include dipeptides, tripeptides, and tetrapeptides. A p ...

, where applicable. Protein domains are identified within these chains using a mixture of automatic methods and manual curation. The domains are then classified within the CATH structural hierarchy: at the Class (C) level, domains are assigned according to their

secondary structure Protein secondary structure is the three dimensional conformational isomerism, form of ''local segments'' of proteins. The two most common Protein structure#Secondary structure, secondary structural elements are alpha helix, alpha helices and beta ...

content, i.e. all

alpha Alpha (uppercase , lowercase ; grc, ἄλφα, ''álpha'', or ell, άλφα, álfa) is the first letter of the Greek alphabet. In the system of Greek numerals, it has a value of one. Alpha is derived from the Phoenician letter aleph , whic ...

, all

beta Beta (, ; uppercase , lowercase , or cursive ; grc, βῆτα, bē̂ta or ell, βήτα, víta) is the second letter of the Greek alphabet. In the system of Greek numerals, it has a value of 2. In Modern Greek, it represents the voiced labiod ...

, a mixture of alpha and beta, or little secondary structure; at the Architecture (A) level, information on the secondary structure arrangement in three-dimensional space is used for assignment; at the Topology/fold (T) level, information on how the secondary structure elements are connected and arranged is used; assignments are made to the Homologous superfamily (H) level if there is good evidence that the domains are related by evolution i.e. they are homologous. Additional sequence data for domains with no experimentally determined structures are provided by CATH's sister resource, Gene3D, which are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments.

Releases

The CATH team aim to provide official releases of the CATH classification every 12 months. This release process is important because it allows for the provision of internal validation, extra annotations and analysis. However, it can mean that there is a time delay between new structures appearing in the PDB and the latest official CATH release, In order to address this issue: CATH-B provides a limited amount of information to the very latest domain annotations (e.g. domain boundaries and superfamily classifications). The latest release of CATH-Gene3D (v4.3) was released in December 2020 and consists of: * 500,238 structural protein domain entries * 151 mln non-structural protein domain entries * 5,481 homologous superfamily entries * 212,872 functional family entries

Open source software

CATH is an

open source software Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose. Open ...

project, with developers developing and maintaining a number of open source tools. CATH maintains a todo list on

GitHub GitHub, Inc. () is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous ...

to allow external users to create and keep track of issues relating to the CATH protein structure classification.

References

{{Use dmy dates, date=April 2017 Protein structure Protein folds Biological databases Protein classification Protein superfamilies