Chemical Registration System

A chemical database is a database specifically designed to store chemical information. This information is about chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.


Types of chemical databases


Bioactivity database

Bioactivity databases correlate structures or other chemical information to bioactivity results taken from bioassays in literature, patents, and screening programs.


Chemical structures

Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper (2D structural formulae). While these are ideal visual representations for the chemist, they are unsuitable for computational use and especially for search and storage. Small molecules (also called ligands in drug design applications) are usually represented using lists of atoms and their connections. Large molecules such as proteins are, however, more compactly represented using the sequences of their amino acid building blocks. Large chemical databases for structures are expected to handle the storage and searching of information on millions of molecules, taking terabytes of physical memory.


Literature database

Chemical literature databases correlate structures or other chemical information to relevant references such as academic papers or patents. This type of database includes STN, Scifinder, and Reaxys. Links to literature are also included in many databases that focus on chemical characterization.


Crystallographic database

Crystallographic databases store X-ray crystal structure data. Common examples include Protein Data Bank and Cambridge Structural Database.


NMR spectra database

NMR spectra databases correlate chemical structure with NMR data. These databases often include other characterization data such as FTIR and mass spectrometry.


Reactions database

Most chemical databases store information on stable molecules, but reaction databases also store intermediates and temporarily created unstable molecules. Reaction databases contain information about products, educts, and reaction mechanisms.


Thermophysical database

Thermophysical data are information about:
* phase equilibria, including vapor–liquid equilibrium, solubility of gases in liquids, liquids in solids (SLE), heats of mixing, vaporization, and fusion
* caloric data like heat capacity, heat of formation and combustion
* transport properties like viscosity and thermal conductivity


Chemical structure representation

There are two principal techniques for representing chemical structures in digital databases:
* As connection tables / adjacency matrices / lists with additional information on bond (edges) and atom attributes (nodes), such as MDL Molfile, PDB, CML
* As a linear string notation based on depth-first or breadth-first traversal, such as SMILES/SMARTS, SLN, WLN, InChI

These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
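The connection-table idea can be sketched in a few lines. This is a minimal illustration only, not any particular file format: the variable and function names are made up for this example, hydrogens are left implicit, and the SMILES string is included purely for comparison with the linear notation.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
atoms = ["C", "C", "O"]          # node attributes: element symbols
bonds = [(0, 1, 1), (1, 2, 1)]   # edges: (atom index, atom index, bond order)
smiles = "CCO"                   # the same molecule as a linear string notation


def adjacency_list(n_atoms, bonds):
    """Turn a bond list into an adjacency list keyed by atom index."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b, order in bonds:
        neighbors[a].append((b, order))
        neighbors[b].append((a, order))
    return neighbors


def adjacency_matrix(n_atoms, bonds):
    """Same information as a symmetric matrix of bond orders (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = matrix[b][a] = order
    return matrix


ethanol_neighbors = adjacency_list(len(atoms), bonds)
ethanol_matrix = adjacency_matrix(len(atoms), bonds)
```

The list form grows with the number of bonds, while the matrix grows with the square of the number of atoms, which is one reason list-style tables tend to be preferred for storage.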


Search


Substructure

Chemists can search databases using parts of structures, parts of their IUPAC names, or constraints on properties. Chemical databases differ from other general-purpose databases particularly in their support for substructure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The algorithms for searching are computationally intensive, often of O(n^3) or O(n^4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom searching (ABAS), in which a mapping of the search substructure atoms and bonds onto the target molecule is sought. ABAS usually makes use of the Ullmann algorithm or variations of it (e.g. SMSD).

Speedups are achieved by time amortization, that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves creation of bitstrings representing the presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bitstrings used for these applications are also called structural keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store these structural keys and fingerprints can be reduced by 'folding', which is achieved by combining parts of the key using bitwise operations and thereby reducing the overall length.
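Screening and folding as described above can be sketched with integers used as bitstrings. Everything here is an assumption made for illustration: the eight-entry fragment dictionary, the fragment lists assigned to each molecule, and the function names are hypothetical, and real systems use far longer keys. Molecules that pass the screen are only candidates and would still need ABAS to confirm a match.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENT_BITS = {"C-C": 0, "C-O": 1, "C=O": 2, "C-N": 3,
                 "c:c": 4, "O-H": 5, "C-Cl": 6, "N-H": 7}
KEY_LENGTH = len(FRAGMENT_BITS)


def structural_key(fragments):
    """Set one bit for every dictionary fragment present in the molecule."""
    key = 0
    for fragment in fragments:
        key |= 1 << FRAGMENT_BITS[fragment]
    return key


def passes_screen(query_key, target_key):
    """A target can only contain the query if every query bit is also set in it."""
    return (query_key & target_key) == query_key


def fold(key, length):
    """Halve a key of `length` bits by OR-ing its upper half onto its lower half."""
    half = length // 2
    return (key >> half) | (key & ((1 << half) - 1))


# Precomputed once, when molecules are loaded into the database.
database = {
    "ethanol": structural_key(["C-C", "C-O", "O-H"]),
    "acetic acid": structural_key(["C-C", "C-O", "C=O", "O-H"]),
    "ethylamine": structural_key(["C-C", "C-N", "N-H"]),
    "phenol": structural_key(["c:c", "C-O", "O-H"]),
}

# Query substructure: a hydroxyl group attached to carbon.
query = structural_key(["C-O", "O-H"])
candidates = [name for name, key in database.items() if passes_screen(query, key)]
```

Folding never turns a passing molecule into a failing one, because OR-ing halves preserves the subset relation between bit sets; it can, however, let more non-matching molecules through the screen, so shorter keys trade memory for extra ABAS work.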


Conformation

Search by matching the 3D conformation of molecules or by specifying spatial constraints is another feature that is of particular use in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, for instance BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles, to name a few.


Giga Search

Databases of synthesizable and virtual chemicals are getting larger each year, so the ability to mine them efficiently is critical for drug discovery projects. MolSoft's MolCart Giga Search (http://www.molsoft.com/giga-search.html) is presented as the first method designed for substructure search of billions of chemicals.


Descriptors

All properties of molecules beyond their structure can be split up into either physico-chemical or pharmacological attributes, also called descriptors. On top of that, there exist various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, (partial) charge, solubility, etc. can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly using involved multivariate statistics or experimental (screening, bioassay) results. All of those descriptors can, for reasons of computational effort, be stored along with the molecule's representation, and usually are.
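As a small illustration of a descriptor computed directly from structure and then stored with the record, the sketch below derives a molecular weight from an elemental composition. The record layout and function name are assumptions for this example; the atomic weights are standard values rounded to three decimals.

```python
# Standard atomic weights (g/mol), rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Sum atomic weights over an element -> atom count mapping."""
    return sum(ATOMIC_WEIGHT[element] * count
               for element, count in composition.items())


# One hypothetical database record: the structure plus a descriptor that is
# computed once and stored, so that searches need not recompute it.
record = {
    "name": "ethanol",
    "smiles": "CCO",
    "composition": {"C": 2, "H": 6, "O": 1},
}
record["molecular_weight"] = round(molecular_weight(record["composition"]), 3)
```

Storing the value makes a property constraint such as "molecular weight below 100" a plain numeric comparison at query time.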


Similarity

There is no single definition of molecular similarity; the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. For instance, two molecules might be considered more similar to each other than to others if the difference in their molecular weights is smaller. A variety of other measures can be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean and non-Euclidean measures depending on whether the triangle inequality holds. Maximum common subgraph (MCS) based substructure search (as a similarity or distance measure) is also very common. MCS is also used for screening drug-like compounds by finding molecules that share a common subgraph (substructure).

Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.

In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox), which can in turn be semiautomatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
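Similarity as an inverse of distance in descriptor space can be sketched as follows. The choice of descriptors (molecular weight and heavy-atom count), the 1/(1 + d) mapping, and all names are assumptions for illustration; in practice descriptors would normally be scaled first so that no single one dominates the distance.

```python
import math

# (molecular weight in g/mol, number of heavy atoms); illustrative descriptor choice.
DESCRIPTORS = {
    "ethanol": (46.07, 3),
    "1-propanol": (60.10, 4),
    "benzene": (78.11, 6),
}


def euclidean_distance(a, b):
    """Straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


def most_similar(name):
    """Return the other database entry closest to `name` in descriptor space."""
    query = DESCRIPTORS[name]
    others = (other for other in DESCRIPTORS if other != name)
    return max(others, key=lambda other: similarity(query, DESCRIPTORS[other]))


nearest_to_ethanol = most_similar("ethanol")
```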


Registration systems

Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases. Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems, such as the CAS system, make use of algorithms to generate unique hash codes to achieve the same objective.

A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system. Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.

An example is the Chemical Abstracts Service (CAS) registration system. See also CAS registry number.
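The uniqueness rule can be illustrated with a toy registry. This is a sketch under loud assumptions: the `Registry` class, the `REG-` identifier format, and the stereo labels are invented for this example, and the brute-force canonical key stands in for a real canonicalization algorithm (it tries every atom renumbering, so it is only usable for a handful of atoms). It is not how CAS or any named product works.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Smallest (atom labels, bond list) pair over all atom renumberings.

    Brute force: the same molecule yields the same key however it was numbered.
    """
    best = None
    for order in permutations(range(len(atoms))):
        # order[new_index] == old_index; invert it to relabel the bonds.
        new_index = {old: new for new, old in enumerate(order)}
        labels = tuple(atoms[old] for old in order)
        edges = tuple(sorted(
            (min(new_index[a], new_index[b]), max(new_index[a], new_index[b]), bond_order)
            for a, b, bond_order in bonds
        ))
        candidate = (labels, edges)
        if best is None or candidate < best:
            best = candidate
    return best


class Registry:
    """Hypothetical registry: one identifier per (structure, stereo status)."""

    def __init__(self):
        self._ids = {}

    def register(self, atoms, bonds, stereo="not applicable"):
        key = (canonical_key(atoms, bonds), stereo)
        if key not in self._ids:
            self._ids[key] = f"REG-{len(self._ids) + 1:04d}"
        return self._ids[key]


registry = Registry()
chain = [(0, 1, 1), (1, 2, 1)]
id_ethanol = registry.register(["C", "C", "O"], chain)
id_ethanol_renumbered = registry.register(["O", "C", "C"], chain)
id_dimethyl_ether = registry.register(["C", "O", "C"], chain)

# 2-butanol heavy atoms; the stereo status is part of the record's identity.
butanol_atoms = ["C", "C", "C", "C", "O"]
butanol_bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (1, 4, 1)]
id_racemic = registry.register(butanol_atoms, butanol_bonds, stereo="racemic")
id_unknown = registry.register(butanol_atoms, butanol_bonds, stereo="unknown")
```

Registering the same structure under a different atom numbering returns the existing identifier, while the racemic and unknown-stereo entries receive separate records, mirroring the distinction drawn above.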


List of Chemical Cartridges

* Accord
* Direct
* J Chem
* CambridgeSoft
* Bingo
* Pinpoint


List of Chemical Registration Systems

* ChemReg
* Register
* RegMol
* Compound-Registration
* Ensemble


Web-based


Tools

The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.

There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle and PostgreSQL based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column, could be

SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')

Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC. Work is under way to establish a unique IUPAC standard (see InChI).
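The way a chemical search condition can be exposed to SQL through a user-defined function can be sketched with Python's built-in sqlite3 module, reusing the CHEMTABLE and SMILESCOL names from the query above. This is not a chemical cartridge: `SMILES_CONTAINS` is a hypothetical function name, and its body is a plain substring test that stands in for real substructure matching, which would have to parse the SMILES and perform an atom-by-atom search.

```python
import sqlite3


def smiles_contains(smiles, pattern):
    """Placeholder: plain substring test, NOT a real substructure match."""
    return 1 if pattern in smiles else 0


connection = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
connection.create_function("SMILES_CONTAINS", 2, smiles_contains)

connection.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
connection.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("benzene", "c1ccccc1"), ("phenol", "Oc1ccccc1"), ("ethanol", "CCO")],
)

rows = connection.execute(
    "SELECT NAME FROM CHEMTABLE WHERE SMILES_CONTAINS(SMILESCOL, ?) ORDER BY NAME",
    ("c1ccccc1",),
).fetchall()
phenyl_hits = [name for (name,) in rows]
```

A substring test misses the same ring written with a different atom order or ring-closure digit, which is exactly the gap that canonical representations and true substructure search are meant to close.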


See also

* Biological database
* Beilstein database and Dortmund Data Bank
* BindingDB
* ChEBI
* ChEMBL
* Chemisches Zentralblatt Structural Database
* ChemSpider
* Collaborative Drug Discovery
* Comparative Toxicogenomics Database
* Computational Chemistry List
* DrugBank
* List of chemical databases
* List of software for molecular mechanics modeling
* LOLI Database
* NMR spectra database
* PubChem
* SPRESI database
* Colocalization Benchmark Source

