Robert M. Hanson
Department of Chemistry
St. Olaf College
5/19/2010
An adaptation of SMILES and SMARTS for 3D molecular atom search and selection, with support for multi-dimensional biomolecular sequence information such as nucleic acid base pairing and cysteine cross-linking. The org.jmol.smiles package provides extensive functionality for selecting atoms within a three-dimensional model based on SMILES and SMARTS strings. This package may be used independently of Jmol -- see JmolSmilesApplet.java and JmeToJmol.htm.
This document presents general considerations, gives a detailed specification of the syntax, and defines the term "aromatic".
format
$ load 1d66.pdb;calculate hbonds
108 hydrogen bonds
$ print {*}.find("SMILES")
//* Jmol BIOSMILES 12.0.RC17_dev 2010-06-05 17:08 1 *//
//* chain D dna *//
~C:1C:2G:3G:4A:5G:6G:7A:8C:9A:%10G:%11T:%12C:%13C:%14T:%15C:%16C:%17G:%18G:%19.
//* chain E dna *//
~C:%19C:%18G:%17G:%16A:%15G:%14G:%13A:%12C:%11T:%10G:9T:8C:7C:6T:5C:4C:3G:2G:1.
//* chain A protein *//
~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERL.
//* chain B protein *//
~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERL.
[Cd][Cd].
[O]
//* chain 9 rna *//
~UUAG:%(792)G:%(793)C:%(794)G:%(795)G:%(796)C:%(797)CAC
AG:%(798)C:%(799)G:%(800)G:%(801)U:%(802)G:%(803)G:%(804)G:%(805)GUUGCCUC:%(806)
C:%(807)C:%(808)G:%(809)U:%(810)ACCC:%(811)AUCCCG:%(811)AACA:%(810)C:%(809)
G:%(808)G:%(807)AAG:%(806)AU:%(812)AA:%(812)GC:%(805)C:%(804)C:%(803)A:%(802)
C:%(801)C:%(800)AG:%(799)C:%(798)GUUC:%(813)C:%(814)G:%(815)G:%(816)G:%(817)
GAGUAC:%(818)U:%(819)G:%(820)G:%(821)A:%(822)G:%(823)UG:%(824)C:%(825)GCG
AG:%(825)C:%(824)C:%(823)U:%(822)C:%(821)U:%(820)G:%(819)G:%(818)GAAAC:%(817)
C:%(816)C:%(815)G:%(814)G:%(813)UUCG:%(797)C:%(796)C:%(795)G:%(794)C:%(793)
C:%(792)ACC.
All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").
Var x = '$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]'
// select aromatic atoms attached to CH3 or NH2
select within(SMARTS,@x)

Note that these variables are any string whatsoever, not just atom sets. The syntax is simply:

Var x = '$R1="[CH3,NH2]";$R2="[$($R1),OH]"; {a}[$R1]'
// select aromatic atoms attached to CH3, NH2, or OH
select within(SMARTS,@x)

Var x = '$R1="[CH3,NH2]";$R2="[OH]"; {a}[$([$R1]),$([$R2])]'
// aromatic attached to CH3, NH2, or OH
select within(SMARTS,@x)

Note that $(...) need not be within [...], and wherever it is, it always means "just the first atom".
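The variable mechanism can be illustrated with a minimal Python sketch. It is not the org.jmol.smiles implementation; it assumes only what the grammar notes state, namely that definitions of the form $label="value"; come first and that references of the bracketed form [$label] are then replaced textually. The function name expand_variables is hypothetical, and references inside definition values are not expanded here.

```python
import re

# One definition: $label="value" followed by optional comment text and ";".
# The label may not start with "(" and may not contain "=" or "$".
_DEF = re.compile(r'\$([^=$(][^=$]*)="([^"]*)"[^;]*;')

def expand_variables(pattern):
    """Strip leading variable definitions and expand [$label] references."""
    pattern = "".join(pattern.split())  # white space is removed before parsing
    definitions = {}
    pos = 0
    while True:
        match = _DEF.match(pattern, pos)
        if match is None:
            break
        definitions[match.group(1)] = match.group(2)
        pos = match.end()
    body = pattern[pos:]
    # Assumption: a reference is the literal text "[$label]", brackets included.
    for label, value in definitions.items():
        body = body.replace("[$" + label + "]", value)
    return body

print(expand_variables('$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]'))
```

Under these assumptions the first example above reduces to {a}[CH3,NH2] before the SMARTS string itself is parsed.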
[Element] | capitalized - standard notation Na, Si, etc. -- specific non-aromatic atom |
[element] | uncapitalized - specific aromatic atom (as for standard notation, no limitations) |
* | any atom |
A | any non-aromatic atom |
a | any aromatic atom |
# | atomic number |
(integer) | mass number -- Note, however, that [H1] is [*H1], "any atom with one attached hydrogen", not unlabeled hydrogen, [1H]. |
D | degree - total number of connections |
H | exact hydrogen count |
h | "implicit" hydrogen count (atoms are not in structure) |
R | in the specified number of rings |
r | in ring of a given size |
v | valence (total bond order) |
X | calculated connectivity, including implicit hydrogens |
x | number of ring bonds |
@ | stereochemistry |
d | non-hydrogen degree -- number of non-hydrogen connections |
= | Jmol atom index, for example: [=23] |
[number]? | mass number or undefined (so, for example, [C12?] means any carbon that isn't explicitly C13 or C14) |
[RES.ATOMNAME] | residue and atom name, for example [ALA.CA]. "0" for atomName indicates the "lead" atom. |
~...~... | BIOSEQUENCE using single-letter or [RES] codes. |
%(n) | ring branching where n may be larger than 99. |
[*.ATOMNAME], [RESIDUE.*], [*.*] | Wildcards for residues and atom names |
[RES.ATOMNAME]+[RES.ATOMNAME] | atoms in adjacent residues, for example [ALA.CA]+[GLY.N] |
[RES.ATOMNAME]:[RES.ATOMNAME] | atoms in cross-linked residues, for example [CYS.CA]:[CYS.CA] |
~...~... | BIOSEQUENCE notation using single-letter or [RES] codes, including logic: select search("~A:[C,T]") |
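As an illustration of the [RES.ATOMNAME] notation and its wildcards, here is a small Python sketch of how a single residue/atom specification might be tested against one atom. The function name and its arguments are hypothetical, name comparison is assumed to be case-insensitive, and an empty field is treated like "*" following the note in the grammar; the actual matching code in Jmol may differ.

```python
def bio_atom_matches(spec, residue, atom_name, is_lead_atom=False):
    """Test a spec such as 'ALA.CA', '*.CA', 'ALA.*' or 'CYS.0' against an atom."""
    res_pattern, dot, atom_pattern = spec.partition(".")
    if not dot:
        raise ValueError("expected RES.ATOMNAME, got " + repr(spec))
    # An empty field is assumed to behave like the wildcard "*".
    residue_ok = res_pattern in ("", "*") or res_pattern.upper() == residue.upper()
    if atom_pattern in ("", "*"):
        atom_ok = True
    elif atom_pattern == "0":
        atom_ok = is_lead_atom  # "0" stands for the lead atom of the group
    else:
        atom_ok = atom_pattern.upper() == atom_name.upper()
    return residue_ok and atom_ok

print(bio_atom_matches("*.CA", "GLY", "CA"))
```

The caller is expected to know whether an atom is the lead atom of its group (CA for proteins, P for nucleic acids, O for carbohydrates), so that decision is passed in rather than computed.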
# note: prior to parsing, all white space is removed

[smiles] == [node][connections]
[connections] == { [connection] | NULL }
[connection] == { [branch] | [bond] [node] } [connections]
[branch] == "(" { [smiles] | [bond] [smiles] } ")"
[node] == { [atomExpression] | [ringPointer] }
[atomExpression] == { [unbracketedAtomType] | "[" [bracketedExpression] "]" }
[unbracketedAtomType] == [atomType] & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc" | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }

# note: Brackets are required for these elements: [Na], [Ca], etc.
#       These elements Xy are instead interpreted as "X" "y", a single-letter
#       element followed by an aromatic atom.

[atomType] == { [validElementSymbol] | [aromaticType] }
[validElementSymbol] == (see Elements.java; including Xx and only through element 109)
[aromaticType] == { [validElementSymbol].toLowerCase() }
[bracketedExpression] == { "[" [atomPrimitives] "]" }
[atomPrimitives] == { [atom] | [atom] [atomModifiers] }
[atom] == { [isotope] [atomType] | [atomType] }
[isotope] == [digits]

# note -- isotope mass must come before the element symbol.

[digits] == { [digit] | [digit] [digits] }
[digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
[atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
[atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
[charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
[plusSet] == { "+" | "+" [plusSet] }
[minusSet] == { "-" | "-" [minusSet] }
[stereochemistry] == { "@"    # anticlockwise
                     | "@@"   # clockwise
                     | "@" [stereochemistryDescriptor]
                     | "@@" [stereochemistryDescriptor] }
[stereochemistryDescriptor] == [stereoClass] [stereoOrder]
[stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
[stereoOrder] == [digits]
[ringPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")" }

# note: all ringPointers must have a second matching ringPointer
#       and must be preceded by an atomExpression for the
#       first occurrence and either an atomExpression or a bond
#       for the second occurrence
# note: Jmol BIOSMARTS extends the possible number of rings to > 100 by
#       allowing %(n)

[bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }

# note: Jmol will not match two totally independent molecular pieces. For example,
#       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
#       structure that has "ring" bond notation:
#       CC1CCC.C1CC is a valid structure.
# note: BIOSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches
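The three forms of [ringPointer] can be read with a few lines of code. The following Python sketch is only an illustration of the grammar rule; the function name read_ring_pointer and its return convention are assumptions, not part of the package.

```python
_DIGITS = "0123456789"

def _all_digits(text):
    return text != "" and all(ch in _DIGITS for ch in text)

def read_ring_pointer(text, pos):
    """Return (ring_number, next_pos) if a ring pointer starts at pos, else None."""
    first = text[pos:pos + 1]
    if first != "" and first in _DIGITS:          # single digit: 1..9, 0
        return int(first), pos + 1
    if first != "%":
        return None
    if text[pos + 1:pos + 2] == "(":              # %(n), n may exceed 99
        end = text.find(")", pos + 2)
        digits = text[pos + 2:end] if end > 0 else ""
        if not _all_digits(digits):
            raise ValueError("malformed %(n) ring pointer at " + str(pos))
        return int(digits), end + 1
    digits = text[pos + 1:pos + 3]                # %nn, exactly two digits
    if len(digits) == 2 and _all_digits(digits):
        return int(digits), pos + 3
    raise ValueError("malformed %nn ring pointer at " + str(pos))

print(read_ring_pointer("C:%(792)G", 2))
```

A full parser would additionally check that every ring number is used exactly twice, as the note in the grammar requires; that bookkeeping is left out of this sketch.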
# note: prior to parsing, all white space is removed

[smartDef] == [variableDefs] [smarts] | [smarts]
[variableDefs] == [variableDef] | [variableDef] [variableDefs]
[variableDef] == "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
[label] == [any characters other than "=" and "$", and not starting with "("]
[comments] == [any characters other than ";"]

# note: Variable definitions must be parsed first.
#       After that, all variable references [$XXXX] are replaced

[smarts] == { [biosequence] | [node] [connections] }
[biosequence] == "~" [node] [connections]

# note: In a biosequence, "atom types" are standard 1-letter-code group names
#       or bracketed residues [xxx]. The "~" must be the first character
#       in a component and must be repeated for each component (separated by ".")

[connections] == { [connection] | NULL }
[connection] == { [branch] | [bondExpression] [node] } [connections]
[branch] == "(" { [smarts] | [bondExpression] [smarts] } ")"

# note: Default bonding for a branch is single for SMILES or base-paired (:) for BIOSEQUENCE

[node] == { [atomExpression] | [ringPointer] }
[atomExpression] == { [unbracketedAtomType] | "[" [bracketedExpression] "]" | [nestedExpression] }
[nestedExpression] == "$(" + [atomExpression] + ")"
[unbracketedAtomType] == [atomType] & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc" | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }

# note: Brackets are required for these elements: [Na], [Ca], etc.
#       These elements Xy are instead interpreted as "X" "y", a single-letter
#       element followed by an aromatic atom.
# note: in a biosequence, all atom types are 1-letter code group names

[atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
[validElementSymbol] == (see Elements.java; including Xx and only through element 109)
[aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
[bracketedExpression] == { [atomOrSet] | [atomOrSet] ";" [atomAndSet] }
[atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
[atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet] | "!" [atomPrimitive] | "!" [atomPrimitive] "&" [atomAndSet] }
[atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }

# note -- if & is not used, certain combinations of primitiveDescriptors
#         are not allowed. Specifically, combinations that together
#         form the symbol for an element will be read as the element (Ar, Rh, etc.)
#         when NOT followed by a digit and no element has already been defined
#         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],
#         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
#         Also, "!" may not be used with implied "&".
#         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.

[primitive] == { [isotope] | [atomType] | [charge] | [stereochemistry] | [a_Prop] | [A_Prop] | [D_Prop] | [H_Prop] | [h_Prop] | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop] | [x_Prop] | [nestedExpression] }
[isotope] == [digits] | [digits] "?"

# note -- isotope mass may come before or after element symbol,
#         EXCEPT "H1" which must be parsed as "an atom with a single H"

[digits] == { [digit] | [digit] [digits] }
[digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
[charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
[plusSet] == { "+" | "+" [plusSet] }
[minusSet] == { "-" | "-" [minusSet] }
[stereochemistry] == { "@"    # anticlockwise
                     | "@@"   # clockwise
                     | "@" [stereochemistryDescriptor]
                     | "@@" [stereochemistryDescriptor] }
[stereochemistryDescriptor] == [stereoClass] [stereoOrder]
[stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
[stereoOrder] == [digits]

# note -- "?" here (unspecified) is not relevant in 3D-SEARCH

[A_Prop] == "#" [digits]             # elemental atomic number
[a_Prop] == "=" [digits]             # atom index (starts with 0)
[D_Prop] == { "D" [digits] | "D" }   # degree -- total number of connections
                                     # excludes implicit H atoms; default 1
[d_Prop] == { "d" [digits] | "d" }   # degree -- non-hydrogen connections
                                     # default 1
[H_Prop] == { "H" [digits] | "H" }   # exact hydrogen count
                                     # excludes implicit H atoms
[h_Prop] == { "h" [digits] | "h" }   # implicit hydrogens -- "h" indicates "at least one"
                                     # (see note below)
[R_Prop] == { "R" [digits] | "R" }   # ring membership; e.g. "R2" indicates "in two rings"
                                     # "R" indicates "in a ring"
                                     # "!R" or "R0" indicates "not in any ring"
[r_Prop] == { "r" [digits] | "r" }   # in ring of size [digits]; "r" indicates "in a ring"
[v_Prop] == { "v" [digits] | "v" }   # valence -- total bond order (counting double as 2, e.g.)
[X_Prop] == { "X" [digits] | "X" }   # connectivity -- total number of connections
                                     # includes implicit H atoms
[x_Prop] == { "x" [digits] | "x" }   # ring connectivity -- total ring connections

[bioAtom] == { [residueName] "." [atomName] }

# note: BIOSEQUENCE (only) also allows just [residueName], an abbreviation for [residueName] ".0"

[residueName] == { "*" | [oneDigitGroupName] | [groupName] | null }

# note: null same as "*"

[oneDigitGroupName] == (see JmolConstants.predefinedGroup1Names)
[groupName] == (any PDB residue name)
[atomName] == { "*" | "0" | (any atom name) | null }

# note: null same as "*"; "0" refers to "lead atom" -- CA for proteins, P for nucleic, or O for carbohydrate

[ringPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }

# note: All ringPointers must have a second matching ringPointer
#       and must be preceded by an atomExpression for the
#       first occurrence and either an atomExpression or a bondExpression
#       for the second occurrence.

[bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] }
[bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
[bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet] | "!" [bondPrimitive] | "!" [bondPrimitive] "&" [bondAndSet] }
[bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }
[bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | NULL }

# note: Not all bondExpressions are valid. Stereochemistry should not
#       be mixed with the others, as it always represents a single bond.
#       In addition, "." ("no bond") cannot be mixed with any bond type.
#       Nothing would be retrieved by "-&=", as a bond cannot be both single
#       and double. However, "-@" is potentially very useful -- "ring single-bonds"
#       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
# note: Jmol will not match two totally independent molecular pieces. For example,
#       Jmol will not match [Na+].[Cl-]
# note: "+" indicates "adjacent biomolecular groups in a chain"
# note: a BIOSEQUENCE ends with "." or the end of the string. A new BIOSEQUENCE
#       can continue with "~" immediately following this "."
# note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
#       new component.
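The [unbracketedAtomType] rule, under which "Na" outside brackets reads as "N" followed by the aromatic wildcard "a", can be sketched as follows. This Python fragment is illustrative only: the symbol sets are a deliberately abbreviated, hypothetical stand-in for the list in Elements.java, and the function handles bare runs of atom symbols, not bonds, branches, or ring pointers.

```python
# Hypothetical, abbreviated symbol tables; the real list is in Elements.java.
TWO_LETTER = {"Cl", "Br", "Si", "Se", "Na", "Ca", "Sc", "Ac", "Ba", "Pa"}
ONE_LETTER = {"H", "B", "C", "N", "O", "F", "P", "S", "I"}
# Two-letter symbols that must be bracketed, per the grammar above.
NEEDS_BRACKETS = {"Ac", "Ba", "Ca", "Na", "Pa", "Sc"}

def read_unbracketed_atoms(text):
    """Split a run of unbracketed atom symbols into tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        two = text[pos:pos + 2]
        symbol = two.capitalize()
        # Accept "Cl" or its all-lowercase aromatic form "cl", unless excluded.
        if (two in (symbol, symbol.lower()) and symbol in TWO_LETTER
                and symbol not in NEEDS_BRACKETS):
            tokens.append(two)
            pos += 2
            continue
        one = text[pos]
        if one in "*Aa" or one.upper() in ONE_LETTER:
            tokens.append(one)
            pos += 1
            continue
        raise ValueError("unexpected character " + repr(one))
    return tokens

print(read_unbracketed_atoms("Na"))   # read as N followed by aromatic "a"
```

Trying the two-letter symbol first and falling back to a single letter is one simple way to realize the exclusion list; the parser in the package may organize this differently.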
We define "aromatic" here strictly in terms of geometry - a flat ring with trigonal planar geometry for all atoms in the ring. Bond order is not considered, because many of the model formats that can be loaded into Jmol (PDB, GAUSSIAN, etc.) do not specify a bonding scheme.
Given a ring of N atoms...
         1
       /   \
      2     6 -- 6a
      |     |
5a -- 5     4
       \   /
         3

with arbitrary order and up to N substituents...
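The geometric idea can be illustrated with a short Python sketch. It is not the algorithm used in the package, whose details are not reproduced here; it simply checks that the local plane normals at all ring atoms agree and that each substituent lies in that plane. The function name, the input layout (ring atoms in bonded order, one optional substituent per ring atom), and the 10-degree tolerance are all assumptions made for the example.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _unit(a):
    length = math.sqrt(_dot(a, a))  # assumes a non-zero vector
    return (a[0] / length, a[1] / length, a[2] / length)

def is_flat_trigonal_ring(ring, substituents, tolerance_deg=10.0):
    """ring: (x, y, z) tuples in bonded order; substituents: one (x, y, z) or None per ring atom."""
    cos_tol = math.cos(math.radians(tolerance_deg))
    sin_tol = math.sin(math.radians(tolerance_deg))
    count = len(ring)
    reference = None
    for i, atom in enumerate(ring):
        # Normal of the plane through this atom and its two ring neighbors.
        normal = _unit(_cross(_sub(ring[i - 1], atom), _sub(ring[(i + 1) % count], atom)))
        if reference is None:
            reference = normal
        elif abs(_dot(normal, reference)) < cos_tol:
            return False  # local planes disagree: the ring is puckered
        sub = substituents[i]
        if sub is not None and abs(_dot(_unit(_sub(sub, atom)), normal)) > sin_tol:
            return False  # substituent is out of plane: atom is not trigonal planar
    return True

def _hexagon(radius, heights):
    return [(radius * math.cos(math.radians(60 * k)), radius * math.sin(math.radians(60 * k)), heights[k])
            for k in range(6)]

benzene = _hexagon(1.39, [0.0] * 6)               # idealized planar ring
hydrogens = _hexagon(2.48, [0.0] * 6)             # one in-plane substituent per atom
chair = _hexagon(1.45, [0.25, -0.25] * 3)         # idealized chair-like puckering

print(is_flat_trigonal_ring(benzene, hydrogens))
print(is_flat_trigonal_ring(chair, [None] * 6))
```

Because only coordinates are consulted, a test of this kind works equally for files that carry no bond orders, which is the motivation given above for a purely geometric definition.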
-- Bob Hanson last updated 6/6/2010