Jmol 3D-SEARCH BIOSMILES/BIOSMARTS

Robert M. Hanson
Department of Chemistry
St. Olaf College
5/19/2010

An adaptation of SMILES and SMARTS for 3D molecular atom search and selection that also covers multi-dimensional biomolecular sequence information, such as nucleic acid base pairing and cysteine cross-linking. The org.jmol.smiles package provides extensive functionality for selecting atoms within a three-dimensional model based on SMILES and SMARTS strings. This package may be used independently of Jmol -- see JmolSmilesApplet.java and JmeToJmol.htm.

Besides presenting general considerations, this document gives a detailed specification of the syntax and defines the term "aromatic".

General Considerations

format

atom selection BIOSMILES/BIOSMARTS aromaticity

Comparison to Daylight SMILES

All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").

Comparison to Daylight SMARTS

primitives implicit hydrogen count

Detailed Jmol 3D-SEARCH SMILES Specification

 
      # note: prior to parsing, all white space is removed
       
   [smiles] == [node][connections] 
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bond] [node] } [connections]
   [branch] == "(" { [smiles] | [bond] [smiles] } ")" 
   [node] == { [atomExpression] | [ringPointer] }

   [atomExpression] == { [unbracketedAtomType] 
                             | "[" [bracketedExpression] "]" }
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      
   [atomType] == { [validElementSymbol] | [aromaticType] }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == { [atomPrimitives] }
      # note: the enclosing "[" and "]" are supplied by [atomExpression] above
   
   [atomPrimitives] == { [atom] | [atom] [atomModifiers] }
   [atom] == { [isotope] [atomType] | [atomType] } 
   [isotope] == [digits]
       # note -- isotope mass must come before the element symbol. 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
   [atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
   
   [ringPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")"}
      # note: all ringPointers must have a second matching ringPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bond
      #       for the second occurrence
      # note: Jmol BIOSMARTS extends the possible number of rings to > 100 by 
      #       allowing %(n)

   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
      #       structure that has "ring" bond notation:
      #       CC1CCC.C1CC   is a valid structure.
      # note: BIOSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches

Detailed Jmol 3D-BIOSMARTS Specification

 
      # note: prior to parsing, all white space is removed

   [smartDef] == [variableDefs] [smarts] | [smarts]
   [variableDefs] == [variableDef] | [variableDef] [variableDefs]
   [variableDef] ==  "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
   [label] == [any characters other than "=" and "$", and not starting with "("]
   [comments] == [any characters other than ";"]
   
      # note: Variable definitions must be parsed first. 
      #       After that, all variable references [$XXXX] are replaced
      
   [smarts] == { [biosequence] | [node] [connections] } 
   [biosequence] == "~" [node] [connections] 
      # note: In a biosequence, "atom types" are standard 1-letter-code group names
      #       or bracketed residues [xxx]. The "~" must be the first character
      #       in a component and must be repeated for each component (separated by ".")
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bondExpression] [node] } [connections]
   [branch] == "(" { [smarts] | [bondExpression] [smarts] } ")" 
      # note: Default bonding for a branch is single for SMILES or base-paired (:) for BIOSEQUENCE
   [node] == { [atomExpression] | [ringPointer] }

   [atomExpression] == { [unbracketedAtomType] 
                             | "[" [bracketedExpression] "]" 
                             | [nestedExpression] }
   
   [nestedExpression] == "$(" + [atomExpression] + ")"

   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      # note: in a biosequence, all atom types are 1-letter code group names
      
   [atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == { [atomOrSet] | [atomOrSet] ";" [atomAndSet] } 
   
   [atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
   [atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet]
                              | "!" [atomPrimitive] 
                              | "!" [atomPrimitive] "&" [atomAndSet] }
   [atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }
       # note -- if & is not used, certain combinations of primitive descriptors
       #         are not allowed. Specifically, combinations that together
       #         form the symbol for an element will be read as the element (Ar, Rh, etc.)
       #         when NOT followed by a digit and no element has already been defined 
       #         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],  
       #         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
       #         Also, "!" may not be used with implied "&". 
       #         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.
   [atomPrimitive] == { [isotope] | [atomType] | [charge] | [stereochemistry]
                              | [a_Prop] | [A_Prop] | [D_Prop] | [d_Prop] | [H_Prop] | [h_Prop] 
                              | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop]
                              | [x_Prop] | [nestedExpression] }
   [isotope] == [digits] | [digits] "?"
       # note -- isotope mass may come before or after element symbol, 
       #         EXCEPT "H1" which must be parsed as "an atom with a single H" 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
       # note -- "?" here (unspecified) is not relevant in 3D-SEARCH 
   
   [A_Prop] == "#" [digits]           # elemental atomic number
   [a_Prop] == "=" [digits]           # atom index (starts with 0)
   [D_Prop] == { "D" [digits] | "D" } # degree -- total number of connections 
                                      #   excludes implicit H atoms; default 1
   [d_Prop] == { "d" [digits] | "d" } # degree -- non-hydrogen connections
                                      #   default 1 
   [H_Prop] == { "H" [digits] | "H" } # exact hydrogen count 
                                      #   excludes implicit H atoms
   [h_Prop] == { "h" [digits] | "h" } # implicit hydrogens -- "h" indicates "at least one"
                                      #   (see note below)
   [R_Prop] == { "R" [digits] | "R" } # ring membership; e.g. "R2" indicates "in two rings"
                                      #   "R" indicates "in a ring" 
                                      #   "!R" or "R0" indicates "not in any ring"
   [r_Prop] == { "r" [digits] | "r" } # in ring of size [digits]; "r" indicates "in a ring"
   [v_Prop] == { "v" [digits] | "v" } # valence -- total bond order (counting double as 2, e.g.)
   [X_Prop] == { "X" [digits] | "X" } # connectivity -- total number of connections
                                      #   includes implicit H atoms
   [x_Prop] == { "x" [digits] | "x" } # ring connectivity -- total ring connections
   
   [bioAtom] == { [residueName] "." [atomName] }
      # note: BIOSEQUENCE (only) also allows just [residueName], an abbreviation for [residueName] ".0"
   [residueName] == { "*" | [oneDigitGroupName] | [groupName] | null }
      # note: null same as "*"
   [oneDigitGroupName] == (see JmolConstants.predefinedGroup1Names)
   [groupName] == (any PDB residue name)
   [atomName] == { "*" | "0" | (any atom name) | null }
      # note: null same as "*"; "0" refers to the "lead atom" -- CA for proteins, P for nucleic acids, or O for carbohydrates
   [ringPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }
      # note: All ringPointers must have a second matching ringPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bondExpression
      #       for the second occurrence.

   [bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] } 
   
   [bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
   [bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet]
                              | "!" [bondPrimitive] 
                              | "!" [bondPrimitive] "&" [bondAndSet] }
   [bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }       
   [bondPrimitive] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | NULL }
      # note: Not all bondExpressions are valid. Stereochemistry should not 
      #       be mixed with the others, as it represents a single bond always.
      #       In addition, "." ("no bond") cannot be mixed with any bond type.
      #       Nothing would be retrieved by "-&=", as a bond cannot be both single
      #       and double. However, "-@" is potentially very useful -- "ring single-bonds"
      #       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]
      # note: "+" indicates "adjacent biomolecular groups in a chain"
      # note: a BIOSEQUENCE ends with "." or the end of the string. A new BIOSEQUENCE
      #       can continue with "~" immediately following this "." 
      # note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
      #       new component.
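
The bond-expression examples in the notes above ("-&=", "-@", "=&!@") can be checked against a single bond with a small evaluator. The Python sketch below is an illustration under stated assumptions, not Jmol's implementation: the name bond_matches is hypothetical, a bond is reduced to just its order and a ring flag, and only the primitives "-", "=", "#", "~" (any bond) and "@" (ring bond) are handled. It follows the grammar's precedence: ";" binds most loosely, then ",", then "&", with adjacent primitives treated as an implied "&".

```python
def bond_matches(expr, order, in_ring):
    """Evaluate a simplified bondExpression against one bond.

    order   -- integer bond order (1, 2, or 3)
    in_ring -- True if the bond is part of a ring
    Hypothetical helper; unsupported primitives raise KeyError.
    """
    tests = {
        "-": order == 1,
        "=": order == 2,
        "#": order == 3,
        "~": True,       # any bond
        "@": in_ring,    # ring bond
    }

    def primitive_run(run):
        # A run is either "!" plus one primitive, or primitives joined
        # by an implied "&" (for example "-@").
        negate = run.startswith("!")
        symbols = run[1:] if negate else run
        result = all(tests[s] for s in symbols)
        return not result if negate else result

    return all(
        any(
            all(primitive_run(run) for run in and_set.split("&"))
            for and_set in or_set.split(",")
        )
        for or_set in expr.split(";")
    )
```

With this sketch, bond_matches("-&=", order, in_ring) is False for every bond, bond_matches("-@", 1, True) is True, and bond_matches("=&!@", 2, True) is False because that double bond lies in a ring.
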
      

Jmol 3D-SEARCH Definition of "aromatic"

We define "aromatic" here strictly in terms of geometry - a flat ring with trigonal planar geometry for all atoms in the ring. No consideration of bond order is used, because for the sorts of models that can be loaded into Jmol, many do not assume a bonding scheme (PDB, GAUSSIAN, etc.).

Given a ring of N atoms...

                  1
                /   \
               2     6 -- 6a
               |     |
         5a -- 5     4
                \   /
                  3  
with arbitrary order and up to N substituents...
  1. Check to see if all ring atoms have no more than 3 connections. Note: An alternative definition might include "and no substituent is explicitly double-bonded to its ring atom," as in quinone. Here we opt to allow the atoms of quinone to be called "aromatic."
  2. Select a cutoff value close to zero. We use 0.01 here.
  3. Generate a set of normals as follows:
    1. For each ring atom, construct the normal associated with the plane formed by that ring atom and its two nearest ring-atom neighbors.
    2. For each ring atom with a substituent, construct a normal associated with the plane formed by its connecting substituent atom and the two nearest ring-atom neighbors.
    3. If this is the first normal, assign vMean to it.
    4. If this is not the first normal, check vNorm.dot.vMean. If this value is less than zero, scale vNorm by -1.
    5. Add vNorm to vMean.
  4. Calculate the standard deviation of the dot products of the individual vNorms with the normalized vMean.
  5. The ring is deemed flat if this standard deviation is less than the selected cutoff value.
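
Steps 2 through 5 above can be sketched in Python as follows. This is a minimal illustration, not the org.jmol.smiles code, and it rests on several assumptions: the function names are hypothetical, the ring is passed as a list of (x, y, z) tuples in ring order so that the "two nearest ring-atom neighbors" are the previous and next atoms in that list, each normal is scaled to unit length before use, and the sample standard deviation is used. Step 1 (no more than 3 connections per ring atom) is assumed to have been checked by the caller, so each ring atom has at most one substituent.

```python
import math
import statistics


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _unit(v):
    length = math.sqrt(_dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def plane_normal(p, q, r):
    """Unit normal of the plane through three non-collinear points."""
    a, b = _sub(q, p), _sub(r, p)
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    return _unit(cross)


def ring_is_flat(ring, substituents=None, cutoff=0.01):
    """ring: list of (x, y, z) in ring order.
    substituents: optional dict {ring index: (x, y, z)} (assumed layout).
    """
    substituents = substituents or {}
    normals, v_mean = [], None
    for i, atom in enumerate(ring):
        before, after = ring[i - 1], ring[(i + 1) % len(ring)]
        # Step 3.1: plane of the ring atom and its two ring neighbors
        planes = [(before, atom, after)]
        # Step 3.2: plane of the substituent and the same two neighbors
        if i in substituents:
            planes.append((before, substituents[i], after))
        for plane in planes:
            v_norm = plane_normal(*plane)
            if v_mean is None:
                v_mean = v_norm                      # step 3.3
            else:
                if _dot(v_norm, v_mean) < 0:         # step 3.4
                    v_norm = (-v_norm[0], -v_norm[1], -v_norm[2])
                v_mean = (v_mean[0] + v_norm[0],     # step 3.5
                          v_mean[1] + v_norm[1],
                          v_mean[2] + v_norm[2])
            normals.append(v_norm)
    v_mean = _unit(v_mean)
    # Steps 4 and 5: spread of the dot products against the cutoff
    dots = [_dot(v, v_mean) for v in normals]
    return statistics.stdev(dots) < cutoff
```

A regular planar hexagon passes this test, while the same hexagon with one atom or one substituent lifted well out of the plane fails it. Collinear input points give a zero-length normal and would raise ZeroDivisionError in this sketch.
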

-- Bob Hanson last updated 6/6/2010