The simplified molecular-input line-entry system (SMILES) is a
specification in the form of a line notation for describing the
structure of chemical species using short ASCII strings.
The original SMILES specification was initiated by David Weininger at
the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the
1980s. Acknowledged for their parts in the early
development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo
and Corwin Hansch (Pomona College) for supporting the work, and Arthur
Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River
Software, Renton, WA) for assistance in programming the system."
The SMILES string obtained for a given structure depends on the choice of the bonds broken to open cycles, of the starting atom used for the depth-first traversal, and of the order in which branches are listed when encountered.
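The effect of the starting atom can be illustrated with a minimal sketch. It assumes an acyclic molecule with single bonds only, given as a hypothetical list of element symbols and a list of bonded index pairs; the name write_smiles and this data layout are illustrative and do not come from any SMILES toolkit. Ring closures, bond orders and hydrogens are ignored.

```python
def write_smiles(atoms, bonds, start):
    """Depth-first SMILES writer for an acyclic, single-bonded molecule.

    atoms: list of organic-subset element symbols (assumed input format)
    bonds: list of (i, j) index pairs
    start: index of the atom at which the traversal begins
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        text = atoms[atom]
        # Every child except the last is written as a parenthesized branch.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        # The last child continues the main chain without parentheses.
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


# Ethanol: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
for start in range(3):
    print(start, write_smiles(ethanol_atoms, ethanol_bonds, start))
```

Each starting atom gives a different but equally valid string for the same molecule, which is why a separate canonicalization step is needed when a unique string is wanted.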
Description

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which:

are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, and
have no formal charge, and
have the number of hydrogens attached implied by the SMILES valence model (typically their normal valence, but for N and P it is 3 or 5, and for S it is 2, 4 or 6), and
are the normal isotopes, and
are not chiral centers.
All other elements must be enclosed in brackets, and have charges and
hydrogens shown explicitly. For instance, the SMILES for water may be
written as either O or [OH2]. Hydrogen may also be written as a
separate atom; water may also be written as [H]O[H].
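The implied-hydrogen rule can be sketched in a few lines of Python. The table uses the valences given above for N, P and S. The values for the remaining organic-subset elements are their usual normal valences and are an assumption of this sketch, as are the names VALENCES and implicit_hydrogens. Aromatic atoms and charged atoms are not handled.

```python
# Allowed valences for unbracketed organic-subset atoms.
# N, P and S follow the text; the other entries are assumed normal valences.
VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element, bond_order_sum):
    """Hydrogens implied on an unbracketed atom.

    bond_order_sum is the total order of the bonds written explicitly
    to the atom. The smallest allowed valence that accommodates those
    bonds is filled up with hydrogens.
    """
    for valence in VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # already at or above the highest listed valence


print(implicit_hydrogens("O", 0))  # the single atom in the SMILES "O"
print(implicit_hydrogens("C", 1))  # each carbon end of "CC"
```

Under this model the one-atom string O carries two implied hydrogens, which is why it describes water.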
When brackets are used, the symbol H is added if the atom in brackets
is bonded to one or more hydrogens, followed by the number of hydrogen
atoms if greater than 1, then by the sign '+' for a positive charge or
by '-' for a negative charge. For example, [NH4+] for ammonium. If
there is more than one charge, it is normally written as a digit;
however, it is also possible to repeat the sign as many times as the
ion has charges: one may write either [Ti+4] or [Ti++++] for
titanium(IV) (Ti4+). Thus, the hydroxide anion is represented by [OH-],
the hydronium cation is [OH3+] and the cobalt(III) cation (Co3+) is
either [Co+3] or [Co+++].
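The bracket-atom layout described above (element symbol, optional H count, optional charge) can be read with a single regular expression. This is a minimal sketch using only Python's standard re module. The pattern name BRACKET_ATOM and the function parse_bracket_atom are hypothetical, and the pattern covers only the fields discussed here plus an optional isotope and chirality mark.

```python
import re

# [isotope? symbol chirality? H-count? charge?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chirality>@{1,2})?"
    r"(?P<hydrogens>H\d*)?"
    r"(?P<charge>[+-]\d+|\++|-+)?\]"
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen_count, formal_charge) for one bracket atom."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")

    h = match.group("hydrogens")
    # "H" alone means one hydrogen; "H4" means four.
    hydrogens = 0 if h is None else int(h[1:] or 1)

    c = match.group("charge")
    if c is None:
        charge = 0
    elif c[-1].isdigit():
        charge = int(c)  # digit form, e.g. "+4"
    else:
        charge = len(c) if c[0] == "+" else -len(c)  # repeated-sign form

    return match.group("symbol"), hydrogens, charge


for atom in ["[NH4+]", "[OH-]", "[OH3+]", "[Ti+4]", "[Ti++++]", "[Co+++]"]:
    print(atom, parse_bracket_atom(atom))
```

The digit form and the repeated-sign form of a charge parse to the same value, so [Ti+4] and [Ti++++] are treated as equivalent.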
Bonds

A bond is represented using one of the symbols '.' '-' '=' '#' '$' ':'
'/' or '\'.
Bonds between aliphatic atoms are assumed to be single unless
specified otherwise and are implied by adjacency in the SMILES string.
Although single bonds may be written as "-", this is usually omitted.
For example, the SMILES for ethanol may be written as C-C-O, CC-O or
C-CO, but is usually written CCO.
Double, triple, and quadruple bonds are represented by the symbols
'=', '#', and '$' respectively, as illustrated by the SMILES O=C=O
(carbon dioxide), C#N (hydrogen cyanide) and [Ga-]$[As+] (gallium
arsenide).
An additional type of bond is a "non-bond", indicated with ".", to
indicate that two parts are not bonded together. For example, aqueous
sodium chloride may be written as [Na+].[Cl-] to show the
dissociation.
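As a rough illustration of the bond symbols, the sketch below maps each symbol to a bond order and splits a string into its disconnected parts at the "." non-bond. The names BOND_ORDER, components and explicit_bond_orders are hypothetical. The scan skips the contents of square brackets, where '-' and '+' are charge signs rather than bonds, and it does not handle the directional symbols '/' and '\'.

```python
# Bond order for each explicit bond symbol; ':' is the aromatic
# "one and a half" bond.
BOND_ORDER = {"-": 1, "=": 2, "#": 3, "$": 4, ":": 1.5}


def components(smiles):
    """Split a SMILES string at '.' into its non-bonded parts."""
    return smiles.split(".")


def explicit_bond_orders(smiles):
    """List the orders of the bonds written explicitly in the string."""
    orders = []
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket and ch in BOND_ORDER:
            orders.append(BOND_ORDER[ch])
    return orders


print(components("[Na+].[Cl-]"))
print(explicit_bond_orders("O=C=O"))
print(explicit_bond_orders("CCO"))  # single bonds are implied, none explicit
```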
An aromatic "one and a half" bond may be indicated with ':'; see § Aromaticity below.

Aromaticity

Aromatic rings such as benzene may be written in one of three forms:

In Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1,
Using the aromatic bond symbol ":", e.g. C:1:C:C:C:C:C1, or
Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms 'b', 'c', 'n', 'o', 'p' and 's', respectively.
In the last case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1. When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol "-" is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the form c1ccccc1c2ccccc2.) The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.
Visualization of 3-cyanoanisole as COc(c1)cccc1C#N.
Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom.

Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.

Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are:

If ring numbers are reused, they are paired according to their order of appearance in the SMILES string. Some adjustments may be required to preserve the correct pairing.
If stereochemistry is specified, adjustments must be made; see Stereochemistry § Notes below.
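The pairing of ring numbers by order of appearance can be sketched as follows. This is a simplified tokenizer written for illustration, not a full SMILES parser: the names TOKEN and ring_closures are hypothetical, only organic-subset and bracket atoms are recognized, and two-digit labels are assumed to be written with a leading '%'. Digits inside square brackets are not ring labels and are skipped as part of the bracket atom.

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])"  # bracket or organic-subset atom
    r"|(?P<ring>%\d\d|\d)"                             # ring-closure label
    r"|(?P<other>[-=#$:/\\.()])"                       # bond symbols, dot, branches
)


def ring_closures(smiles):
    """Return (opening_atom, closing_atom) index pairs for each ring bond."""
    open_labels = {}  # label -> index of the atom that opened it
    pairs = []
    atom_index = -1
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        if match.lastgroup == "atom":
            atom_index += 1
        elif match.lastgroup == "ring":
            label = int(match.group().lstrip("%"))
            if label in open_labels:
                # Second appearance closes the ring and frees the label.
                pairs.append((open_labels.pop(label), atom_index))
            else:
                open_labels[label] = atom_index
        pos = match.end()
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return pairs


print(ring_closures("c1ccccc1-c2ccccc2"))
print(ring_closures("c1ccccc1-c1ccccc1"))  # label 1 reused after being closed
```

Both biphenyl strings give the same pairs, because a label becomes available again as soon as its ring has been closed.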
The one form of branch which does not require parentheses is the ring-closing bond. Choosing ring-closing bonds appropriately can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C.

Stereochemistry
SMILES permits, but does not require, specification of stereoisomers. Configuration around double bonds is specified using the characters "/" and "\" to show directional single bonds adjacent to a double bond. For example, F/C=C/F is one representation of trans-1,2-difluoroethylene, in which the fluorine atoms are on opposite sides of the double bond (as shown in the figure), whereas F/C=C\F is one possible representation of cis-1,2-difluoroethylene, in which the fluorine atoms are on the same side of the double bond. Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written C/C=C/C=C/C.
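For the simplest case, the rule reduces to comparing the two direction symbols. The sketch below assumes a string of the exact shape substituent, direction symbol, C=C, direction symbol, substituent, with one substituent written before the double bond and one after it. The names SIMPLE_ALKENE and alkene_geometry are hypothetical, and branched or conjugated forms are deliberately rejected rather than guessed at.

```python
import re

# substituent, '/' or '\', C=C, '/' or '\', substituent
SIMPLE_ALKENE = re.compile(
    r"(?P<left>[A-Za-z]+)(?P<first>[/\\])C=C(?P<second>[/\\])(?P<right>[A-Za-z]+)"
)


def alkene_geometry(smiles):
    """Return 'trans' or 'cis' for a simple X?C=C?Y string."""
    match = SIMPLE_ALKENE.fullmatch(smiles)
    if match is None:
        raise ValueError("only the simple X/C=C/Y form is handled in this sketch")
    # Matching symbols put the substituents on opposite sides (trans);
    # differing symbols put them on the same side (cis).
    return "trans" if match.group("first") == match.group("second") else "cis"


for text in ["F/C=C/F", "F\\C=C\\F", "F/C=C\\F", "F\\C=C/F"]:
    print(text, alkene_geometry(text))
```

Flipping both symbols at once leaves the result unchanged, which reflects the statement that the first symbol of a group is arbitrary.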
Beta-carotene, with the eleven double bonds highlighted.
As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C.

Configuration at tetrahedral carbon is specified by @ or @@. Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @, respectively (the @ symbol itself is a counter-clockwise spiral).
For example, consider the amino acid alanine. One of its SMILES forms
is NC(C)C(=O)O, more fully written as N[CH](C)C(=O)O. L-alanine, the
more common enantiomer, is written as N[C@@H](C)C(=O)O. Looking from
the N-C bond, the hydrogen (H), methyl (C), and carboxylate (C(=O)O)
groups appear clockwise. D-alanine can be written as N[C@H](C)C(=O)O.
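When the neighbors of a tetrahedral center are listed in a different order, the @ or @@ mark has to be adjusted. One standard way to reason about this is permutation parity: an even reordering of the four neighbors keeps the mark, and an odd one swaps it. The sketch below illustrates that reasoning. The function names permutation_parity and retag are hypothetical, and the neighbor labels are plain strings chosen for the alanine example.

```python
def permutation_parity(old_order, new_order):
    """Return 0 if new_order is an even permutation of old_order, else 1."""
    positions = [old_order.index(item) for item in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2


def retag(tag, old_order, new_order):
    """Chirality mark that describes the same center in the new neighbor order."""
    if permutation_parity(old_order, new_order) == 0:
        return tag
    return "@" if tag == "@@" else "@@"


# L-alanine as N[C@@H](C)C(=O)O: neighbors appear as N, H, CH3, COOH.
original = ["N", "H", "CH3", "COOH"]

# Written from the methyl group with the amine as a branch: C[C@?H](N)C(=O)O
print(retag("@@", original, ["CH3", "H", "N", "COOH"]))

# Written from the methyl group with the acid as a branch: C[C@?H](C(=O)O)N
print(retag("@@", original, ["CH3", "H", "COOH", "N"]))
```

The first reordering is odd, so the same enantiomer is written C[C@H](N)C(=O)O; the second is even, so it is written C[C@@H](C(=O)O)N.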
Molecule             Structure      SMILES formula
Dinitrogen           N≡N            N#N
Copper(II) sulfate   Cu²⁺ SO₄²⁻     [Cu+2].[O-]S(=O)(=O)[O-]
A pheromone of the Californian scale insect
2S,5R-Chalcogran: a pheromone of the bark beetle Pityogenes chalcographus
To illustrate a molecule with more than 9 rings, consider Cephalostatin-1, a steroidic trisdecacyclic pyrazine with the empirical formula C54H74N2O10 isolated from the Indian Ocean hemichordate Cephalodiscus gilchristi:
Starting with the left-most methyl group in the figure:

CC(C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO

Note that '%' appears in front of the index of ring closure labels above 9; see § Rings above.

Other examples of SMILES

The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.

Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism. SMIRKS is a line notation for specifying reaction transforms.

Conversion

SMILES can be converted back to 2-dimensional representations using Structure Diagram Generation algorithms (Helson, 1999). This conversion is not always unambiguous. Conversion to 3-dimensional representation is achieved by energy minimization approaches. There are many downloadable and web-based conversion utilities.

See also
SMILES arbitrary target specification (SMARTS), a language for the
specification of substructural queries
SYBYL Line Notation (another line notation)
Molecular Query Language – query language allowing also numerical
properties, e.g. physicochemical values or distances
Chemistry Development Kit
Weininger 1988
Weininger, Weininger & Weininger 1989
Weininger 1990
Swanson, Richard Pommier (2004). "The Entrance of Informatics into Combinatorial Chemistry". In Rayward, W. [Warden] Boyd; Bowden, Mary Ellen. The History and Heritage of Scientific and Technological Information Systems: Proceedings of the 2002 Conference of the American Society of Information Science and Technology and the Chemical Heritage Foundation. Medford, NJ: Information Today. p. 205. ISBN 1-57387-229-6. https://wayback.archive-it.org/2118/20100925010036/http://126.96.36.199/pubs/asist2002/17-swanson.pdf
Weininger, Dave. "Acknowledgements on Daylight Tutorial smiles-etc page". Retrieved 24 June 2013.
Anderson, Veith & Weininger 1987
"SMILES Tutorial: What is SMILES?". U.S. Environmental Protection Agency. Retrieved 2012-09-23.
Hutchison D, Kanade T, Kittler J, Klienberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Rangan CP, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Raschid L, Neglur G, Grossman RL, Liu B (2005). "Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples". In Ludäscher B. Data Integration in the Life Sciences. Lecture Notes in Computer Science. 3615. Berlin: Springer. pp. 145–157. doi:10.1007/11530084_13. ISBN 978-3-540-27967-9. Retrieved 2013-02-12.
Byers, JA; Birgersson, G; Löfqvist, J; Appelgren, M; Bergström, G (Mar 1990). "Isolation of pheromone synergists of bark beetle, Pityogenes chalcographus, from complex insect-plant odors by fractionation and subtractive-combination bioassay" (PDF). Journal of Chemical Ecology. 16 (3): 861–76. doi:10.1007/BF01016496. PMID 24263601.
National Center for Biotechnology Information (NCBI). PubChem Compound. (accessed May 12, 2012) PubChem Compound CID=183413 (Cephalostatin-1)
Anderson E, Veith GD, Weininger D (1987). SMILES: A line notation and computerized interpreter for chemical structures. Duluth, MN: U.S. EPA, Environmental Research Laboratory-Duluth. Report No. EPA/600/M-87/021.
Helson HE (1999). "Structure Diagram Generation". In Lipkowitz KB, Boyd DB. Rev. Comput. Chem. 13. New York: Wiley-VCH. pp. 313–398. doi:10.1002/9780470125908.ch6.
Weininger D (February 1988). "SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules". Journal of Chemical Information and Modeling. 28 (1): 31–6. doi:10.1021/ci00057a005.
Weininger D, Weininger A, Weininger JL (May 1989). "SMILES. 2. Algorithm for generation of unique SMILES notation". Journal of Chemical Information and Modeling. 29 (2): 97–101. doi:10.1021/ci00062a008.
Weininger D (August 1990). "SMILES. 3. DEPICT. Graphical depiction of chemical structures". Journal of Chemical Information and Modeling. 30 (3): 237–43. doi:10.1021/ci00067a005.
"SMILES – A Simplified Chemical Language"
The OpenSMILES home page
"SMARTS – SMILES Extension"
Daylight SMILES tutorial
Parsing SMILES
SMILES related software utilities
NCI/CADD Chemical Identifier Resolver – resolves or generates SMILES
from chemical names, CAS Registry Numbers, InChI/InChIKey and many
other chemical structure file formats
NCI/CADD Online SMILES Translator and Structure