compbio.data.sequence
Class SequenceUtil

java.lang.Object
  extended by compbio.data.sequence.SequenceUtil

public final class SequenceUtil
extends java.lang.Object

Utility class for operations on sequences

Since:
1.0
Version:
2.0 June 2011
Author:
Peter Troshin

Field Summary
static java.util.regex.Pattern AA
          Valid Amino acids
static java.util.regex.Pattern AMBIGUOUS_AA
          Same as AA pattern but with two additional letters - XU
static java.util.regex.Pattern AMBIGUOUS_NUCLEOTIDE
          Ambiguous nucleotide
static java.util.regex.Pattern DIGIT
          A digit
static java.util.regex.Pattern NON_AA
          Inversion of the AA pattern
static java.util.regex.Pattern NON_NUCLEOTIDE
          Non nucleotide
static java.util.regex.Pattern NONWORD
          Non word
static java.util.regex.Pattern NUCLEOTIDE
          Nucleotides a, t, g, c, u
static java.util.regex.Pattern WHITE_SPACE
          A whitespace character: [\t\n\x0B\f\r]
 
Method Summary
static java.lang.String cleanProteinSequence(java.lang.String sequence)
          Remove all non AA chars from the sequence
static java.lang.String cleanSequence(java.lang.String sequence)
          Removes all whitespace chars in the sequence string
static void closeSilently(java.util.logging.Logger log, java.io.Closeable stream)
          Closes the Closeable and logs the exception, if any
static java.lang.String deepCleanSequence(java.lang.String sequence)
          Removes all special characters and digits as well as whitespace chars from the sequence
static boolean isAmbiguosProtein(java.lang.String sequence)
          Checks whether the sequence conforms to an ambiguous protein sequence
static boolean isNonAmbNucleotideSequence(java.lang.String sequence)
          Ambiguous DNA chars : AGTCRYMKSWHBVDN // differs from protein in only one (!) - B char
static boolean isNucleotideSequence(FastaSequence s)
           
static boolean isProteinSequence(java.lang.String sequence)
           
static java.util.List<FastaSequence> openInputStream(java.lang.String inFilePath)
          Reads and parses Fasta or Clustal formatted file into a list of FastaSequence objects
static java.util.HashSet<Score> readAAConResults(java.io.InputStream results)
          Read AACon result with no alignment files.
static java.util.HashMap<java.lang.String,java.util.Set<Score>> readDisembl(java.io.InputStream input)
          Reads Disembl output from the input stream into a map of sequence name to a set of Score objects
static java.util.List<FastaSequence> readFasta(java.io.InputStream inStream)
          Reads fasta sequences from inStream into the list of FastaSequence objects
static java.util.HashMap<java.lang.String,java.util.Set<Score>> readGlobPlot(java.io.InputStream input)
          Reads GlobPlot output from the input stream into a map of sequence name to a set of Score objects
static java.util.Map<java.lang.String,Score> readIUPred(java.io.File result)
          Read IUPred output
static java.util.Map<java.lang.String,Score> readJRonn(java.io.File result)
           
static java.util.Map<java.lang.String,Score> readJRonn(java.io.InputStream inStream)
          Reader for JRonn horizontal file format
static void writeFasta(java.io.OutputStream os, java.util.List<FastaSequence> sequences)
          Writes the FastaSequence objects to the output stream; each sequence takes one line only
static void writeFasta(java.io.OutputStream outstream, java.util.List<FastaSequence> sequences, int width)
          Writes a list of FastaSequence objects to outstream, formatting each sequence so that it contains at most width chars on each line
static void writeFastaKeepTheStream(java.io.OutputStream outstream, java.util.List<FastaSequence> sequences, int width)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WHITE_SPACE

public static final java.util.regex.Pattern WHITE_SPACE
A whitespace character: [\t\n\x0B\f\r]


DIGIT

public static final java.util.regex.Pattern DIGIT
A digit


NONWORD

public static final java.util.regex.Pattern NONWORD
Non word


AA

public static final java.util.regex.Pattern AA
Valid Amino acids


NON_AA

public static final java.util.regex.Pattern NON_AA
Inversion of the AA pattern


AMBIGUOUS_AA

public static final java.util.regex.Pattern AMBIGUOUS_AA
Same as AA pattern but with two additional letters - XU


NUCLEOTIDE

public static final java.util.regex.Pattern NUCLEOTIDE
Nucleotides a, t, g, c, u


AMBIGUOUS_NUCLEOTIDE

public static final java.util.regex.Pattern AMBIGUOUS_NUCLEOTIDE
Ambiguous nucleotide


NON_NUCLEOTIDE

public static final java.util.regex.Pattern NON_NUCLEOTIDE
Non nucleotide

Method Detail

isNucleotideSequence

public static boolean isNucleotideSequence(FastaSequence s)
Returns:
true if the sequence contains only the letters a, c, t, g, u
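
A check of this kind can be sketched as follows. This is a minimal illustration, not the class's own code: the expression behind NUCLEOTIDE is not shown on this page, so the character class below is an assumption, the class and method names are hypothetical, and the sketch takes a String where the real method takes a FastaSequence.

```java
import java.util.regex.Pattern;

final class NucleotideCheckSketch {
    // Assumed definition: one or more of a, t, g, c, u in either case.
    private static final Pattern NUCLEOTIDE =
            Pattern.compile("[atgcu]+", Pattern.CASE_INSENSITIVE);

    /** Returns true if every character of the sequence is a, t, g, c or u. */
    static boolean isNucleotide(String sequence) {
        // matches() requires the whole input to match, not just a substring.
        return NUCLEOTIDE.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNucleotide("ACGTUacgtu"));
        System.out.println(isNucleotide("ACGTN"));
    }
}
```

Using matches() rather than find() is what makes a single foreign letter anywhere in the input fail the check.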

isNonAmbNucleotideSequence

public static boolean isNonAmbNucleotideSequence(java.lang.String sequence)
Ambiguous DNA chars : AGTCRYMKSWHBVDN // differs from protein in only one (!) - B char


cleanSequence

public static java.lang.String cleanSequence(java.lang.String sequence)
Removes all whitespace chars in the sequence string

Parameters:
sequence -
Returns:
cleaned up sequence

deepCleanSequence

public static java.lang.String deepCleanSequence(java.lang.String sequence)
Removes all special characters and digits as well as whitespace chars from the sequence

Parameters:
sequence -
Returns:
cleaned up sequence
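
The difference between cleanSequence and deepCleanSequence can be illustrated with a short sketch. The pattern definitions below are assumptions modelled on the WHITE_SPACE, DIGIT and NONWORD field descriptions, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class CleanSequenceSketch {
    // Assumed definitions mirroring the WHITE_SPACE, DIGIT and NONWORD fields.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\s");
    private static final Pattern DIGIT = Pattern.compile("\\d");
    private static final Pattern NONWORD = Pattern.compile("\\W");

    /** Strips whitespace only. */
    static String clean(String sequence) {
        return WHITE_SPACE.matcher(sequence).replaceAll("");
    }

    /** Strips whitespace, then digits, then any remaining non-word characters. */
    static String deepClean(String sequence) {
        String s = clean(sequence);
        s = DIGIT.matcher(s).replaceAll("");
        // Note: \W leaves underscores in place, because '_' is a word character.
        return NONWORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("AC GT\n AC\t"));
        System.out.println(deepClean("1 ACG-T*\n10 AC."));
    }
}
```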

cleanProteinSequence

public static java.lang.String cleanProteinSequence(java.lang.String sequence)
Remove all non AA chars from the sequence

Parameters:
sequence - the sequence to clean
Returns:
cleaned sequence

isProteinSequence

public static boolean isProteinSequence(java.lang.String sequence)
Parameters:
sequence -
Returns:
true if the sequence is a protein sequence, false otherwise

isAmbiguosProtein

public static boolean isAmbiguosProtein(java.lang.String sequence)
Checks whether the sequence conforms to an ambiguous protein sequence

Parameters:
sequence -
Returns:
true only if the sequence is an ambiguous protein sequence; false otherwise, e.g. if the sequence is a non-ambiguous protein or DNA
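
How the protein helpers might relate to the AA, NON_AA and AMBIGUOUS_AA patterns is sketched below. Everything here is an assumption made for illustration: AA is taken to be the 20 standard one-letter amino acid codes, nucleotide-only input is rejected first (since A, C, G and T are also amino acid codes), and the class and method names are hypothetical. The actual logic of SequenceUtil may differ.

```java
import java.util.regex.Pattern;

final class ProteinCheckSketch {
    // Assumption: the 20 standard amino acid codes, listed in both cases.
    private static final String AA_LETTERS =
            "ARNDCQEGHILKMFPSTWYV" + "arndcqeghilkmfpstwyv";
    private static final Pattern AA = Pattern.compile("[" + AA_LETTERS + "]+");
    // Inversion of AA: any run of characters outside the AA alphabet.
    private static final Pattern NON_AA = Pattern.compile("[^" + AA_LETTERS + "]+");
    // AA plus the two additional letters X and U.
    private static final Pattern AMBIGUOUS_AA =
            Pattern.compile("[" + AA_LETTERS + "XUxu]+");
    private static final Pattern NUCLEOTIDE = Pattern.compile("[atgcuATGCU]+");

    /** Removes every character that is not in the assumed AA alphabet. */
    static String cleanProtein(String sequence) {
        return NON_AA.matcher(sequence).replaceAll("");
    }

    /** True for a non-ambiguous protein; nucleotide-only input is rejected. */
    static boolean isProtein(String sequence) {
        return !NUCLEOTIDE.matcher(sequence).matches()
                && AA.matcher(sequence).matches();
    }

    /** True only when X or U is needed to explain the sequence. */
    static boolean isAmbiguousProtein(String sequence) {
        return !NUCLEOTIDE.matcher(sequence).matches()
                && !AA.matcher(sequence).matches()
                && AMBIGUOUS_AA.matcher(sequence).matches();
    }
}
```

In this sketch "MKXLA" is an ambiguous protein, "MKVLA" is a non-ambiguous one, and "ACGT" is neither because it is treated as DNA.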

writeFasta

public static void writeFasta(java.io.OutputStream outstream,
                              java.util.List<FastaSequence> sequences,
                              int width)
                       throws java.io.IOException
Writes a list of FastaSequence objects to outstream, formatting each sequence so that it contains at most width chars on each line

Parameters:
outstream -
sequences -
width - the maximum number of characters to write in one line
Throws:
java.io.IOException
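
The width-based wrapping can be sketched as below. This is an illustration under stated assumptions rather than the library's code: it writes one record from a plain name and sequence String instead of a List of FastaSequence objects, it leaves the stream open, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class WriteFastaSketch {
    /** Writes a ">" header line, then the sequence in lines of at most width chars. */
    static void writeRecord(OutputStream out, String name, String sequence, int width)
            throws IOException {
        if (width < 1) {
            throw new IllegalArgumentException("width must be at least 1");
        }
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(name).append('\n');
        for (int i = 0; i < sequence.length(); i += width) {
            // The last chunk may be shorter than width.
            sb.append(sequence, i, Math.min(i + width, sequence.length())).append('\n');
        }
        out.write(sb.toString().getBytes(StandardCharsets.UTF_8));
        out.flush(); // the stream is not closed here; that is left to the caller
    }

    /** Convenience wrapper that renders one record to a String. */
    static String render(String name, String sequence, int width) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeRecord(buffer, name, sequence, width);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For example, render("seq1", "ACGTAC", 4) produces the header line followed by "ACGT" and "AC" on separate lines.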

writeFastaKeepTheStream

public static void writeFastaKeepTheStream(java.io.OutputStream outstream,
                                           java.util.List<FastaSequence> sequences,
                                           int width)
                                    throws java.io.IOException
Throws:
java.io.IOException

readFasta

public static java.util.List<FastaSequence> readFasta(java.io.InputStream inStream)
                                               throws java.io.IOException
Reads fasta sequences from inStream into the list of FastaSequence objects

Parameters:
inStream - the stream to read the sequences from
Returns:
list of FastaSequence objects
Throws:
java.io.IOException
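
Reading FASTA records from a stream can be sketched as follows. The sketch returns a map of name to sequence instead of a List of FastaSequence objects, assumes UTF-8 input, and uses hypothetical class and method names; it is not the library's parser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class ReadFastaSketch {
    /** Parses FASTA text into name -> sequence, preserving input order. */
    static Map<String, String> readFasta(InputStream in) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String name = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (name != null) {
                    result.put(name, seq.toString());
                }
                name = line.substring(1).trim();
                seq.setLength(0);
            } else if (name != null) {
                seq.append(line); // sequence lines are concatenated
            }
        }
        if (name != null) {
            result.put(name, seq.toString());
        }
        return result;
    }

    /** Convenience wrapper for parsing FASTA text held in a String. */
    static Map<String, String> parse(String text) {
        try {
            return readFasta(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```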

writeFasta

public static void writeFasta(java.io.OutputStream os,
                              java.util.List<FastaSequence> sequences)
                       throws java.io.IOException
Writes the FastaSequence objects to the output stream; each sequence takes one line only

Parameters:
os -
sequences -
Throws:
java.io.IOException

readIUPred

public static java.util.Map<java.lang.String,Score> readIUPred(java.io.File result)
                                                        throws java.io.IOException,
                                                               UnknownFileFormatException
Read IUPred output

Parameters:
result -
Returns:
Map key->sequence name, value->Score
Throws:
java.io.IOException
UnknownFileFormatException

readJRonn

public static java.util.Map<java.lang.String,Score> readJRonn(java.io.File result)
                                                       throws java.io.IOException,
                                                              UnknownFileFormatException
Throws:
java.io.IOException
UnknownFileFormatException

readJRonn

public static java.util.Map<java.lang.String,Score> readJRonn(java.io.InputStream inStream)
                                                       throws java.io.IOException,
                                                              UnknownFileFormatException
Reader for JRonn horizontal file format
 >Foobar M G D T T A G
 0.48 0.42 0.42 0.48 0.52 0.53 0.54

 Where all values are tab delimited

Parameters:
inStream - the InputStream connected to the JRonn output file
Returns:
Map key=sequence name value=Score
Throws:
java.io.IOException - is thrown if the inStream has problems accessing the data
UnknownFileFormatException - is thrown if the inStream represents an unknown source of data, i.e. not a JRonn output
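
A parser for a horizontal layout of this kind can be sketched as below. It assumes each record is a header line holding the name followed by the residues, then a second line holding one score per residue, all tab-delimited. The sketch returns plain float arrays instead of Score objects, throws IllegalArgumentException where the real method declares UnknownFileFormatException, and uses hypothetical class and method names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JRonnReaderSketch {
    /** Parses header/score line pairs into name -> scores. */
    static Map<String, float[]> parse(String text) {
        Map<String, float[]> scores = new LinkedHashMap<>();
        String name = null;
        for (String raw : text.split("\r?\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t");
            if (line.startsWith(">")) {
                // First column is ">name"; the remaining columns are residues.
                name = cols[0].substring(1).trim();
            } else if (name != null) {
                float[] values = new float[cols.length];
                for (int i = 0; i < cols.length; i++) {
                    values[i] = Float.parseFloat(cols[i].trim());
                }
                scores.put(name, values);
                name = null; // the next score line needs its own header
            } else {
                throw new IllegalArgumentException(
                        "Score line without a preceding header: " + line);
            }
        }
        return scores;
    }
}
```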

closeSilently

public static final void closeSilently(java.util.logging.Logger log,
                                       java.io.Closeable stream)
Closes the Closeable and logs the exception, if any

Parameters:
log -
stream -
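
The close-and-log idiom can be sketched as follows. This is an illustration only: unlike the void method documented here, the sketch returns a boolean so the outcome can be observed, and the null handling and log level are assumptions.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CloseSilentlySketch {
    /**
     * Closes the stream; an IOException is logged rather than propagated.
     * Returns true when the stream closed cleanly or was null.
     */
    static boolean closeSilently(Logger log, Closeable stream) {
        if (stream == null) {
            return true;
        }
        try {
            stream.close();
            return true;
        } catch (IOException e) {
            if (log != null) {
                log.log(Level.WARNING, "Failed to close stream: " + e.getMessage(), e);
            }
            return false;
        }
    }
}
```

A helper like this is typically called from a finally block, where a failure to close should not mask an earlier exception.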

readDisembl

public static java.util.HashMap<java.lang.String,java.util.Set<Score>> readDisembl(java.io.InputStream input)
                                                                            throws java.io.IOException,
                                                                                   UnknownFileFormatException
 > Foobar_dundeefriends
 # COILS 34-41, 50-58, 83-91, 118-127, 160-169, 191-220, 243-252, 287-343
 # REM465 355-368
 # HOTLOOPS 190-204
 # RESIDUE COILS REM465 HOTLOOPS
 M 0.86010 0.88512 0.37094
 T 0.79983 0.85864 0.44331
 >Next Sequence name

Parameters:
input - the InputStream
Returns:
Map key=sequence name, value=set of score
Throws:
java.io.IOException
UnknownFileFormatException
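
Parsing output of this shape can be sketched as below. The sketch assumes one ">" header per sequence, "#" lines that can be skipped, and residue rows of a letter followed by three whitespace-separated scores in the order COILS, REM465, HOTLOOPS. It returns nested maps of float lists instead of Score objects, throws IllegalArgumentException in place of UnknownFileFormatException, and uses hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DisemblReaderSketch {
    // Column order taken from the "# RESIDUE COILS REM465 HOTLOOPS" line.
    private static final String[] METHODS = {"COILS", "REM465", "HOTLOOPS"};

    /** Parses the text into sequence name -> method name -> per-residue scores. */
    static Map<String, Map<String, List<Float>>> parse(String text) {
        Map<String, Map<String, List<Float>>> result = new LinkedHashMap<>();
        Map<String, List<Float>> current = null;
        for (String raw : text.split("\r?\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // range summaries and the column header are skipped here
            }
            if (line.startsWith(">")) {
                current = new LinkedHashMap<>();
                for (String method : METHODS) {
                    current.put(method, new ArrayList<Float>());
                }
                result.put(line.substring(1).trim(), current);
                continue;
            }
            String[] cols = line.split("\\s+");
            if (current == null || cols.length != METHODS.length + 1) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            // cols[0] is the residue letter; the scores follow in METHODS order.
            for (int i = 0; i < METHODS.length; i++) {
                current.get(METHODS[i]).add(Float.parseFloat(cols[i + 1]));
            }
        }
        return result;
    }
}
```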

readGlobPlot

public static java.util.HashMap<java.lang.String,java.util.Set<Score>> readGlobPlot(java.io.InputStream input)
                                                                             throws java.io.IOException,
                                                                                    UnknownFileFormatException
 > Foobar_dundeefriends
 # COILS 34-41, 50-58, 83-91, 118-127, 160-169, 191-220, 243-252, 287-343
 # REM465 355-368
 # HOTLOOPS 190-204
 # RESIDUE COILS REM465 HOTLOOPS
 M 0.86010 0.88512 0.37094
 T 0.79983 0.85864 0.44331
 >Next Sequence name

Parameters:
input -
Returns:
Map key=sequence name, value=set of score
Throws:
java.io.IOException
UnknownFileFormatException

readAAConResults

public static java.util.HashSet<Score> readAAConResults(java.io.InputStream results)
Read AACon result with no alignment files. This method leaves incoming InputStream open!

Parameters:
results - output file of AAConservation
Returns:
the set of Score objects read from the results

openInputStream

public static java.util.List<FastaSequence> openInputStream(java.lang.String inFilePath)
                                                     throws java.io.IOException,
                                                            UnknownFileFormatException
Reads and parses Fasta or Clustal formatted file into a list of FastaSequence objects

Parameters:
inFilePath - the path to the input file
Returns:
the List of FastaSequence objects
Throws:
java.io.IOException - if the file denoted by inFilePath cannot be read
UnknownFileFormatException - if inFilePath points to a file whose format cannot be recognised
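
One way a reader could tell the two supported formats apart is to inspect the first non-blank line: FASTA records start with ">" and Clustal alignments start with a "CLUSTAL" header. The sketch below shows that idea only; how openInputStream actually detects the format is not described on this page, and the class, enum and method names are hypothetical.

```java
import java.util.Locale;

final class FormatSniffSketch {
    enum Format { FASTA, CLUSTAL, UNKNOWN }

    /** Guesses the format from the first non-blank line of the content. */
    static Format detect(String content) {
        for (String raw : content.split("\r?\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                return Format.FASTA;
            }
            if (line.toUpperCase(Locale.ROOT).startsWith("CLUSTAL")) {
                return Format.CLUSTAL;
            }
            return Format.UNKNOWN; // first real line matched neither format
        }
        return Format.UNKNOWN; // empty or blank-only content
    }
}
```

A caller would map the UNKNOWN result to an error, in the same spirit as the UnknownFileFormatException declared above.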