compbio.data.sequence
Class SequenceUtil

java.lang.Object
  extended by compbio.data.sequence.SequenceUtil

public final class SequenceUtil
extends java.lang.Object

Utility class for operations on sequences

Author:
pvtroshin Date September 2009

Field Summary
static java.util.regex.Pattern AA
          Valid Amino acids
static java.util.regex.Pattern AMBIGUOUS_AA
          Same as AA pattern but with two additional letters - XU
static java.util.regex.Pattern AMBIGUOUS_NUCLEOTIDE
          Ambiguous nucleotide
static java.util.regex.Pattern DIGIT
          A digit
static java.util.regex.Pattern NON_AA
          Inversion of the AA pattern
static java.util.regex.Pattern NON_NUCLEOTIDE
          Non nucleotide
static java.util.regex.Pattern NONWORD
          Non word
static java.util.regex.Pattern NUCLEOTIDE
          Nucleotides a, t, g, c, u
static java.util.regex.Pattern WHITE_SPACE
          A whitespace character: [\t\n\x0B\f\r]
 
Method Summary
static java.lang.String cleanSequence(java.lang.String sequence)
          Removes all whitespace chars in the sequence string
static java.lang.String deepCleanSequence(java.lang.String sequence)
          Removes all special characters and digits as well as whitespace chars from the sequence
static boolean isAmbiguosProtein(java.lang.String sequence)
          Checks whether the sequence conforms to an ambiguous protein sequence
static boolean isNonAmbNucleotideSequence(java.lang.String sequence)
          Ambiguous DNA characters: AGTCRYMKSWHBVDN. This set differs from the protein alphabet in only one character, B.
static boolean isNucleotideSequence(FastaSequence s)
           
static boolean isProteinSequence(java.lang.String sequence)
           
static java.util.List<FastaSequence> readFasta(java.io.InputStream inStream)
          Reads fasta sequences from inStream into the list of FastaSequence objects
static void writeFasta(java.io.OutputStream os, java.util.List<FastaSequence> sequences)
          Writes FastaSequence objects to the output stream; each sequence takes one line only
static void writeFasta(java.io.OutputStream outstream, java.util.List<FastaSequence> sequences, int width)
          Writes a list of FastaSequence objects to the outstream, formatting each sequence so that it contains at most width chars on each line
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WHITE_SPACE

public static final java.util.regex.Pattern WHITE_SPACE
A whitespace character: [\t\n\x0B\f\r]


DIGIT

public static final java.util.regex.Pattern DIGIT
A digit


NONWORD

public static final java.util.regex.Pattern NONWORD
Non word


AA

public static final java.util.regex.Pattern AA
Valid Amino acids


NON_AA

public static final java.util.regex.Pattern NON_AA
Inversion of the AA pattern


AMBIGUOUS_AA

public static final java.util.regex.Pattern AMBIGUOUS_AA
Same as AA pattern but with two additional letters - XU


NUCLEOTIDE

public static final java.util.regex.Pattern NUCLEOTIDE
Nucleotides a, t, g, c, u


AMBIGUOUS_NUCLEOTIDE

public static final java.util.regex.Pattern AMBIGUOUS_NUCLEOTIDE
Ambiguous nucleotide


NON_NUCLEOTIDE

public static final java.util.regex.Pattern NON_NUCLEOTIDE
Non nucleotide

Method Detail

isNucleotideSequence

public static boolean isNucleotideSequence(FastaSequence s)
Returns:
true if the sequence contains only the letters a, c, t, g, u
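
A minimal sketch of this kind of check, using only java.util.regex. The class name NucleotideCheckSketch, the case-insensitive [acgtu]+ pattern, and the String parameter (instead of FastaSequence) are assumptions made for illustration; the actual NUCLEOTIDE pattern and implementation may differ.

```java
import java.util.regex.Pattern;

class NucleotideCheckSketch {
    // Assumed stand-in for SequenceUtil.NUCLEOTIDE: one or more of a, c, g, t, u in either case.
    static final Pattern NUCLEOTIDES_ONLY =
            Pattern.compile("[acgtu]+", Pattern.CASE_INSENSITIVE);

    // Returns true only when every character of the sequence is a, c, g, t or u.
    // With this pattern an empty string yields false.
    static boolean isNucleotideSequence(String sequence) {
        return NUCLEOTIDES_ONLY.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNucleotideSequence("ACGTUacgtu")); // prints true
        System.out.println(isNucleotideSequence("ACGTN"));      // prints false
    }
}
```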

isNonAmbNucleotideSequence

public static boolean isNonAmbNucleotideSequence(java.lang.String sequence)
Ambiguous DNA characters: AGTCRYMKSWHBVDN. This set differs from the protein alphabet in only one character, B.


cleanSequence

public static java.lang.String cleanSequence(java.lang.String sequence)
Removes all whitespace chars in the sequence string

Parameters:
sequence -
Returns:
cleaned up sequence
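
The whitespace removal described above could be sketched as follows. The class name CleanSequenceSketch and the \s pattern are assumptions for illustration; the actual WHITE_SPACE pattern and the body of cleanSequence are not shown in this documentation.

```java
import java.util.regex.Pattern;

class CleanSequenceSketch {
    // Assumed stand-in for SequenceUtil.WHITE_SPACE; the real pattern may differ.
    static final Pattern WHITE_SPACE = Pattern.compile("\\s");

    // Removes every whitespace character from the sequence string.
    static String cleanSequence(String sequence) {
        return WHITE_SPACE.matcher(sequence).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(cleanSequence("ACGT \tACGT\nAC GT")); // prints ACGTACGTACGT
    }
}
```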

deepCleanSequence

public static java.lang.String deepCleanSequence(java.lang.String sequence)
Removes all special characters and digits as well as whitespace chars from the sequence

Parameters:
sequence -
Returns:
cleaned up sequence
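
One way such a deep clean could work is to apply the whitespace, digit and non-word patterns in turn. The sketch below assumes the patterns \s, \d and \W and a hypothetical class name DeepCleanSketch; the real method may combine its patterns differently. Note that under these assumptions an underscore survives, because \W treats it as a word character.

```java
import java.util.regex.Pattern;

class DeepCleanSketch {
    // Assumed stand-ins for SequenceUtil.WHITE_SPACE, DIGIT and NONWORD.
    static final Pattern WHITE_SPACE = Pattern.compile("\\s");
    static final Pattern DIGIT = Pattern.compile("\\d");
    static final Pattern NONWORD = Pattern.compile("\\W");

    // Strips whitespace first, then digits, then any remaining non-word characters.
    static String deepCleanSequence(String sequence) {
        String cleaned = WHITE_SPACE.matcher(sequence).replaceAll("");
        cleaned = DIGIT.matcher(cleaned).replaceAll("");
        return NONWORD.matcher(cleaned).replaceAll("");
    }

    public static void main(String[] args) {
        // Position numbers, gaps and stop markers are dropped.
        System.out.println(deepCleanSequence("1 MKV-LA* 61\n")); // prints MKVLA
    }
}
```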

isProteinSequence

public static boolean isProteinSequence(java.lang.String sequence)
Parameters:
sequence -
Returns:
true if the sequence is a protein sequence, false otherwise

isAmbiguosProtein

public static boolean isAmbiguosProtein(java.lang.String sequence)
Checks whether the sequence conforms to an ambiguous protein sequence

Parameters:
sequence -
Returns:
true only if the sequence is an ambiguous protein sequence; false otherwise, e.g. if the sequence is a non-ambiguous protein or DNA
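
The distinction between strict and ambiguous protein sequences can be sketched with two patterns. Everything here is an assumption for illustration: the 20-letter amino acid alphabet, the class name AmbiguousProteinSketch, and the rule "matches the extended alphabet (plus X and U) but not the strict one". The actual AA and AMBIGUOUS_AA patterns are not shown in this documentation.

```java
import java.util.regex.Pattern;

class AmbiguousProteinSketch {
    // Assumed strict alphabet: the 20 standard amino acid one-letter codes.
    static final Pattern AA =
            Pattern.compile("[ACDEFGHIKLMNPQRSTVWY]+", Pattern.CASE_INSENSITIVE);
    // Same alphabet with the two additional letters X and U.
    static final Pattern AMBIGUOUS_AA =
            Pattern.compile("[ACDEFGHIKLMNPQRSTVWYXU]+", Pattern.CASE_INSENSITIVE);

    // True only when the sequence needs the extended alphabet to match.
    static boolean isAmbiguousProtein(String sequence) {
        return AMBIGUOUS_AA.matcher(sequence).matches()
                && !AA.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAmbiguousProtein("MKXLU")); // prints true
        System.out.println(isAmbiguousProtein("MKVLA")); // prints false
        System.out.println(isAmbiguousProtein("ACGT"));  // prints false
    }
}
```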

writeFasta

public static void writeFasta(java.io.OutputStream outstream,
                              java.util.List<FastaSequence> sequences,
                              int width)
                       throws java.io.IOException
Writes a list of FastaSequence objects to the outstream, formatting each sequence so that it contains at most width chars on each line

Parameters:
outstream -
sequences -
width - the maximum number of characters to write on one line
Throws:
java.io.IOException
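
The line wrapping controlled by width could be sketched as below. FastaSequence is not available here, so the sketch takes an id and a sequence as plain strings; the class name WrapFastaSketch, the ">" header line and the "\n" line terminator are assumptions, not the documented output format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class WrapFastaSketch {
    // Builds one FASTA record with at most `width` sequence characters per line.
    static String format(String id, String sequence, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(id).append('\n');
        for (int i = 0; i < sequence.length(); i += width) {
            int end = Math.min(i + width, sequence.length());
            sb.append(sequence, i, end).append('\n');
        }
        return sb.toString();
    }

    // Writes the record to the stream; the caller keeps ownership of the stream.
    static void writeRecord(OutputStream out, String id, String sequence, int width)
            throws IOException {
        out.write(format(id, sequence, width).getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeRecord(buf, "seq1", "ACGTACGTAC", 4);
        // Prints the header, then ACGT, ACGT and AC on separate lines.
        System.out.print(new String(buf.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Passing a width at least as large as the sequence length gives one line per sequence, which mirrors the behaviour described for the two-argument writeFasta.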

readFasta

public static java.util.List<FastaSequence> readFasta(java.io.InputStream inStream)
                                               throws java.io.IOException
Reads fasta sequences from inStream into the list of FastaSequence objects

Parameters:
inStream - the stream to read the sequences from
Returns:
list of FastaSequence objects
Throws:
java.io.IOException
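
A minimal sketch of reading FASTA records from a stream, under stated assumptions: records start with a ">" header line, sequence lines are concatenated, and blank lines are skipped. Each record is returned as a hypothetical {id, sequence} string pair rather than a FastaSequence, and the class name ReadFastaSketch is illustrative only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReadFastaSketch {
    // Parses the stream into {id, sequence} pairs, in input order.
    static List<String[]> readFasta(InputStream in) throws IOException {
        List<String[]> records = new ArrayList<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String id = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                if (id != null) {
                    records.add(new String[] {id, seq.toString()});
                }
                id = line.substring(1).trim();
                seq.setLength(0);
            } else if (id != null) {
                seq.append(line); // lines before the first header are ignored
            }
        }
        if (id != null) {
            records.add(new String[] {id, seq.toString()});
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ">seq1\nACGT\nACGT\n>seq2\nMKV\n".getBytes(StandardCharsets.UTF_8);
        for (String[] r : readFasta(new ByteArrayInputStream(data))) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```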

writeFasta

public static void writeFasta(java.io.OutputStream os,
                              java.util.List<FastaSequence> sequences)
                       throws java.io.IOException
Writes FastaSequence objects to the output stream; each sequence takes one line only

Parameters:
os -
sequences -
Throws:
java.io.IOException