6 # Copyright (C) 2000-2002 Washington University School of Medicine
7 # and Howard Hughes Medical Institute
11 # Author: Christian M. Zmasek
12 # zmasek@genetics.wustl.edu
13 # http://www.genetics.wustl.edu/eddy/people/zmasek/
15 # Last modified 09/06/03
20 # Available at: http://www.genetics.wustl.edu/eddy/forester/
21 # RIO webserver: http://www.rio.wustl.edu/
24 # Zmasek C.M. and Eddy S.R. (2002)
25 # RIO: Analyzing proteomes by automated phylogenomics using
26 # resampled inference of orthologs.
27 # BMC Bioinformatics 3:14
28 # http://www.biomedcentral.com/1471-2105/3/14/
30 # It is highly recommended that you read this paper before
31 # installing and/or using RIO. (Included in the RIO
32 # distribution as PDF: "RIO.pdf".)
35 # Before rio.pl can be used, some variables in rio_module.pm need to be set,
36 # as described in RIO_INSTALL.
38 # Usage: rio.pl <Mode: 1, 2, 3, or 4> <tagged arguments, single letter arguments>
44 # % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E
46 # % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E
48 # % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E
50 # % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E
52 # % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase b=/path/to/my/pfam/Seed/aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out5 p I C E
58 # 1: RIO analysis based on precalculated pairwise distances
59 # alignment does not contain query sequence
61 # 2: RIO analysis based on precalculated pairwise distances
62 # alignment does contain query sequence
64 # 3: RIO analysis based on Pfam alignments,
65 # alignment does not contain query sequence
67 # 4: RIO analysis based on Pfam alignments,
68 # alignment does contain query sequence
75 # No "G=", "H=", "F=", "T=", "a=", "b=", "s", "f" in modes 1 and 2.
78 # A=<String> Pfam alignment name (mandatory). This specifies the alignment
79 # against which the RIO analysis is to be performed.
80 # In modes 1 and 2: Pfam model (alignment) name
81 # (e.g. "A=aconitase").
82 # In modes 3 and 4: Pfam alignment path/name
83 # (e.g. "A=/path/to/your/pfam/Full/aconitase").
85 # Q=<String> Path/name of file containing the query sequence
86 # (in FASTA format or raw sequence) (mandatory in modes 1 and 3).
88 # N=<String> Query name (mandatory). This must include the SWISS-PROT code
89 # for the species of the query after a "_" (e.g. "N=QUERY_HAEIN").
90 # If the query sequence is already in the alignment (modes 2 and 4)
91 # the complete name needs to be specified -- including "/xxx-xxx".
93 # O=<String> Output file path/name (mandatory).
95 # T=<char> Model for pairwaise distance calculation:
96 # J=JTT, B=BLOSUM 62, M=mtREV24, V=VT, W=WAG, P=PAM.
97 # BLOSUM 62 is default.
98 # (Not in modes 1 and 2; these modes use $MATRIX_FOR_PWD instead.)
100 # In modes 1 and 3, a HMM is needed to align the query sequence to
101 # the alignment and either one of the following options must be
103 # H=<String> HMM name: This uses hmmfetch to retrieve a HMM from
105 # F=<String> HMM file: This directly reads the HMM from a file.
107 # S=<String> Species tree file path/name (in NHX format) (optional).
108 # If not specified, $SPECIES_TREE_FILE_DEFAULT is used.
110 # G=<String> Species names file (optional). Only sequences associated with
111 # species found in this file are used.
112 # In the species names file, individual species names must be
113 # separated by newlines and lines starting with "#" are ignored.
114 # While only sequences associated with species found in the species
115 # tree ("S=") are used for the actual RIO analysis, this allows to
116 # remove sequences prior to tree calculation (which is the most
117 # time consuming step).
119 # P=<int> Sort priority (default is 12):
121 # 1 : Ortholog, Super ortholog
122 # 2 : Super ortholog, Ortholog
123 # 3 : Ortholog, Distance
124 # 4 : Distance, Ortholog
125 # 5 : Ortholog, Super ortholog, Distance
126 # 6 : Ortholog, Distance, Super ortholog
127 # 7 : Super ortholog, Ortholog, Distance
128 # 8 : Super ortholog, Distance, Ortholog
129 # 9 : Distance, Ortholog, Super ortholog
130 # 10 : Distance, Super ortholog, Ortholog
131 # 11 : Ortholog, Subtree neighbor, Distance
132 # 12 : Ortholog, Subtree neighbor, Super ortholog, Distance (default)
133 # 13 : Ortholog, Super ortholog, Subtree neighbor, Distance
134 # 14 : Subtree neighbor, Ortholog, Super ortholog, Distance
135 # 15 : Subtree neighbor, Distance, Ortholog, Super ortholog
136 # 16 : Ortholog, Distance, Subtree neighbor, Super ortholog
137 # 17 : Ortholog, Subtree neighbor, Distance, Super ortholog
139 # a=<int> Bootstraps for tree construction (not in modes 1 and 2).
142 # L=<int> Threshold for orthologies for output. Default is 0.
143 # v=<int> Threshold for ultra-paralogies for output. Default is 50.
145 # U=<int> Threshold for orthologies for distance calculation. Default is 60.
147 # X=<int> In case of more than one putative orthologs:
148 # number of sd the distance query - LCA has to differ
149 # from the mean to generate a warning. Default is 2.
151 # Y=<int> In case of no putative orthologs:
152 # number of sd the distance query - root has to differ
153 # from mean to generate a warning. Default is 2.
155 # Z=<double> In case of one putative ortholog:
156 # threshold for factor between the two distances to their
157 # LCA (larger/smaller) to generate a warning. Default is 2.
159 # B=<int> Threshold for subtree-neighborings. Default is 0.
161 # b=<String> Build HMM from seed alignment with "hmmbuild -s" (optional).
162 # This is to prevent from finding multiple domains per sequence
163 # (i.e. prevents "cutting" the query sequence). Give path/name to
166 # j=<String> Name for temporary directory (optional).
168 # y=<int> Seed for random number generator. Default is 41.
170 # I Create and save a rooted, with duplication vs speciation,
171 # and orthology information annotated gene tree.
172 # If precalculated distances are used (modes 1 and 2): this gene
173 # tree is a NJ tree calculated based on the non-bootstrap resampled
174 # (original) pairwise distances.
175 # If precalculated distances are not used (modes 3 and 4): this gene
176 # is a consenus tree with ML branch length values and is also
177 # annotated with bootstrap values for each node.
179 # Options for output:
180 # p Output ultra-paralogs.
181 # D Description from SWISS-PROT and TrEMBL.
182 # C Complete description from SWISS-PROT and TrEMBL.
183 # E 118 character output instead of 78 character output.
185 # K Keep intermediate files (they will go into the same directory
186 # as the output file, their names are the same as of the output
187 # file, with various suffixes added).
189 # s Ignore non SWISS-PROT sequences (i.e. sequences from TrEMBL)
190 # in the Pfam alignment.
192 # f Try to ignore TrEMBL "fragments" (sequences with "fragment" in
193 # their description).
195 # + Parallel, use machines listed in file $NODE_LIST.
197 # x RIO used as web server -- HTML output.
204 # 09/06/03: Removal of minor bug. Only create consenus tree with ML branch length
205 # values if "I" option used (in modes 3 or 4) -- the problem/bug was that
206 # this tree was always created whether "I" was used or not.
213 use lib $FindBin::Bin;
220 my $VERSION = "5.010";
222 my $E_VALUE_THRESHOLD = 0.01; # For HMMSEARCH.
223 my $SORT_DEFAULT = 12;
224 my $THRESHOLD_ORTHOLOGS_DEFAULT = 0;
225 my $THRESHOLD_SN_DEFAULT = 0;
226 my $THRESHOLD_ORTHOLOGS_DEFAULT_DC = 60;
227 my $T_ULTRA_PARALOGS_DEFAULT = 50;
228 my $WARN_NO_ORTHOS_DEFAULT = 2;
229 my $WARN_MORE_THAN_ONE_ORTHO_DEFAULT = 2;
230 my $WARN_ONE_ORTHO_DEFAULT = 2;
231 my $MIN_NUMBER_OF_SEQS_IN_ALN = 4;
232 my $BOOSTRAPS_FOR_MAKETREE_DEFAULT = 100;
233 my $SEED_FOR_MAKETREE_DEFAULT = 41;
234 my $MATRIX_DEFAULT = 2; # 2=BLOSUM62
236 my $DO_RIO_TEMP_OUTFILE = "DoRIO_OUTFILE";
237 my $TEMP_HMM_FILE = "HMMFILE";
239 my $DEFAULT_OPTIONS_FOR_MAKETREE = "XR";
247 my $species_tree_file = "";
249 my $outfile_annot_nhx_tree = "";
251 my $multiple_trees_file = "";
252 my $distance_matrix_file = "";
253 my $maketree_out_tree_file = "";
254 my $seed_aln_for_hmmbuild = "";
260 my $species_names_file = " "; # Must be " ".
261 my $options_for_makeTree = "";
264 # multiple choice options:
266 my $sort = $SORT_DEFAULT;
267 my $matrix_n = $MATRIX_DEFAULT; # 0=JTT 1=PAM 2=BLOSUM62 3=mtREV24 5=VT 6=WAG
274 my $complete_description = 0;
277 my $non_sp = 1; # 0 to remove non SP seqs.
280 my $output_ultraparalogs = 0;
286 my $warn_no_orthos = $WARN_NO_ORTHOS_DEFAULT;
287 my $warn_more_than_one_ortho = $WARN_MORE_THAN_ONE_ORTHO_DEFAULT;
288 my $warn_one_ortho = $WARN_ONE_ORTHO_DEFAULT;
289 my $boostraps_for_makeTree = $BOOSTRAPS_FOR_MAKETREE_DEFAULT;
290 my $seed_for_makeTree = $SEED_FOR_MAKETREE_DEFAULT;
291 my $t_orthologs = $THRESHOLD_ORTHOLOGS_DEFAULT;
292 my $t_sn = $THRESHOLD_SN_DEFAULT;
293 my $t_orthologs_dc = $THRESHOLD_ORTHOLOGS_DEFAULT_DC;
294 my $t_ultra_paralogs = $T_ULTRA_PARALOGS_DEFAULT;
297 # internal variables:
298 my $print_header_for_orthologies = 0;
299 my $print_header_for_s_paralogs = 0;
300 my $length_of_alignment = 0;
301 my $length_of_orig_alignment = 0;
306 my $number_of_seqs_in_aln = 0;
308 my $saw_distance_values = 0;
309 my $saw_ultra_paralogs = 0;
311 my $ext_nodes_in_trees_analyzed = 0;
313 my $time_tree_calc = 0;
314 my $time_tree_calcT = 0;
317 my $time_dqopuzzle = 0;
318 my $time_dqopuzzleT = 0;
319 my $time_addingdists = 0;
320 my $time_addingdistsT = 0;
323 my $larger_blocks = 0;
324 my $printed_ultra_paralogs = 0;
326 my $dorio_outfile = "";
327 my $options_for_DoRIO = "";
331 my $subtree_neighbors = 0;
333 my $s_para_name = "";
335 my $sort_priority = "";
336 my $return_line = "";
338 my $command_line = "";
339 my $command_line_for_hmmbuild = "";
340 my $current_dir = "";
341 my @complete_names = ();
343 my %Species_names_hash = ();
344 my %AC_DE = (); # AC => DE from "ACDEOS" TrEMBL file.
345 my %SP_AC_DE = (); # ID => DE from "ACIDOS" SWISS-PROT file.
346 my %names_in_pwd_file = ();
349 my $start_date = `date`;
354 # This analyzes the options:
355 # --------------------------
363 $command_line = "$0 ";
364 for ( $j = 0; $j < @ARGV; ++$j ) {
365 $command_line .= "$ARGV[ $j ] ";
368 &analyzeCommandLine( @ARGV );
370 if ( $species_tree_file eq "" ) {
371 $species_tree_file = $SPECIES_TREE_FILE_DEFAULT;
376 $options_for_makeTree = $DEFAULT_OPTIONS_FOR_MAKETREE;
377 $options_for_makeTree .= "S".$seed_for_makeTree;
380 if ( $mode == 1 || $mode == 2 ) {
383 $hmm_file = $RIO_HMM_DIRECTORY.$alignment.$SUFFIX_HMM;
384 $bsp_file = $RIO_BSP_DIRECTORY.$alignment.$SUFFIX_BOOT_STRP_POS;
385 &userErrorCheckForTextFileExistence( $hmm_file );
386 &userErrorCheckForTextFileExistence( $bsp_file );
389 $pwd_file = $RIO_PWD_DIRECTORY.$alignment.$SUFFIX_PWD;
390 $nbd_file = $RIO_NBD_DIRECTORY.$alignment.$SUFFIX_PWD_NOT_BOOTS;
391 $alignment = $RIO_ALN_DIRECTORY.$alignment.$ALIGN_FILE_SUFFIX;
392 &userErrorCheckForTextFileExistence( $pwd_file );
393 &userErrorCheckForTextFileExistence( $nbd_file );
394 &userErrorCheckForTextFileExistence( $alignment );
398 $options_for_makeTree .= "F";
400 elsif ( $mode == 3 || $mode == 4 ) {
401 if ( $safe_nhx == 1 ) {
402 $options_for_makeTree .= "U";
405 $options_for_makeTree .= "#";
407 $options_for_makeTree .= "D"; # To calc. and keep pairwise distances.
408 $options_for_makeTree .= "B".$boostraps_for_makeTree;
412 if ( $output_HTML == 1 ) {
414 $complete_description = 1;
419 if ( $mode == 1 || $mode == 3 || $mode == 4 ) {
422 $matrix_n = $MATRIX_FOR_PWD;
425 if ( $matrix_n == 0 ) {
426 $options_for_makeTree .= "J";
427 $matrix = "JTT (Jones et al. 1992)";
429 elsif ( $matrix_n == 1 ) { # PAM is makeTree's default.
430 $matrix = "PAM (Dayhoff et al. 1978)";
432 elsif ( $matrix_n == 2 ) {
433 $options_for_makeTree .= "L";
434 $matrix = "BLOSUM 62 (Henikoff-Henikoff 92)";
436 elsif ( $matrix_n == 3 ) {
437 $options_for_makeTree .= "M";
438 $matrix = "mtREV24 (Adachi-Hasegawa 1996)";
440 elsif ( $matrix_n == 5 ) {
441 $options_for_makeTree .= "T";
442 $matrix = "VT (Mueller-Vingron 2000)";
444 elsif ( $matrix_n == 6 ) {
445 $options_for_makeTree .= "W";
446 $matrix = "WAG (Whelan-Goldman 2000)";
449 &dieWithUnexpectedError( "Failed sanity check" );
454 # This creates the temp directory:
455 # --------------------------------
461 if ( $temp_dir eq "" ) {
462 $temp_dir = $TEMP_DIR_DEFAULT.$time.$ii;
465 $temp_dir = $temp_dir.$ii;
468 while ( -e $temp_dir ) {
470 $temp_dir = $TEMP_DIR_DEFAULT.$time.$ii;
473 mkdir( $temp_dir, 0700 ) || &dieWithUnexpectedError( "Could not create \"$temp_dir\"" );
475 unless ( ( -e $temp_dir ) && ( -d $temp_dir ) ) {
476 &dieWithUnexpectedError( "\"$temp_dir\" does not exist, or is not a directory" );
481 # The analysis starts here:
482 # -------------------------
484 $dorio_outfile = $temp_dir."/".$DO_RIO_TEMP_OUTFILE;
486 $output_dir = dirname( $outfile );
488 unless ( ( -e $output_dir ) && ( -d $output_dir ) ) {
489 &userError( "Outfile directory (\"$output_dir\") does not exist,\n or is not a directory." );
492 if ( $mode == 1 || $mode == 3 ) {
493 $query_name = substr( $query_name, 0, $LENGTH_OF_NAME - 10 );
500 if ( $mode == 1 || $mode == 3 ) {
502 # Prepares the query file:
503 # ------------------------
504 $query_name = &seqFile2CleanedUpFastaFile( $seqX_file,
505 "$temp_dir/QUERY_SEQ",
507 if ( $query_name eq "" ) {
508 &userError( "Query file \"$seqX_file\") does not appear to contain a valid name\n and/or \"-N\" option has not been used." );
514 if ( $hmm_file eq "" ) {
515 $hmm_file = $temp_dir."/".$TEMP_HMM_FILE;
516 if ( $hmm_name ne "" ) {
517 &executeHmmfetch( $PFAM_HMM_DB, $hmm_name, $hmm_file );
519 elsif ( $seed_aln_for_hmmbuild ne "" ) {
520 $command_line_for_hmmbuild = &executeHmmbuild( $seed_aln_for_hmmbuild, $hmm_file );
530 # This might remove non SWISS PROT seqs, TreMBL fragments,
531 # and seqs from species not in $species_names_file from the alignment:
532 # --------------------------------------------------------------------
533 if ( $mode == 3 || $mode == 4 ) {
534 #if ( $do_not_removeSeqsFromPfamAlign != 1 ) {
537 &removeSeqsFromPfamAlign( $alignment,
545 &removeSeqsFromPfamAlign( $alignment,
557 # If necessary, this aligns the query to the pfam alignment
558 # using hmmsearch, p7extract.pl, multifetch.pl, and hmmalign
559 # from the HMMER package:
560 # ----------------------------------------------------------
561 if ( $mode == 1 || $mode == 3 ) {
564 $f = &alignWithHmmalign( $alignment,
565 $temp_dir."/QUERY_SEQ",
567 $temp_dir."/HMMALIGNOUT",
574 $f = &alignWithHmmalign( $temp_dir."/ALIGN2",
575 $temp_dir."/QUERY_SEQ",
577 $temp_dir."/HMMALIGNOUT",
582 if ( $alignment =~ /.+\/(.+)/ ) {
585 if ( $alignment =~ /(.+)\..+/ ) {
589 if ( $output_HTML == 1 ) {
590 &exitWithWarning( "query sequence does not contain sufficient similarity to the \"$alignment\" domain", 1 );
593 &exitWithWarning( "Query sequence does not contain sufficient similarity to the \"$alignment\" domain" );
598 # In case query contains more than one of the same domain:
600 @complete_names = &getCompleteName( $temp_dir."/HMMALIGNOUT", $query_name );
602 if ( @complete_names < 1 ) {
603 &dieWithUnexpectedError( "Could not find \"$query_name in $temp_dir"."/HMMALIGNOUT\"" );
606 elsif ( $mode == 2 || $mode == 4 ) {
607 # Here, this is just for checking:
609 @complete_names = &getCompleteName( $alignment, $query_name );
611 elsif ( $mode == 4 ) {
612 @complete_names = &getCompleteName( $temp_dir."/ALIGN2", $query_name );
614 if ( @complete_names < 1 ) {
615 &dieWithUnexpectedError( "Could not find \"$query_name in $temp_dir"."/HMMALIGNOUT\"" );
617 @complete_names = ();
618 $complete_names[ 0 ] = $query_name;
621 if ( $parallel == 1 ) {
624 $processors = scalar( @nodelist );
625 if ( $processors < 2 ) {
628 if ( $processors > $BOOTSTRAPS ) {
629 $processors = $BOOTSTRAPS;
632 $block_size = int $BOOTSTRAPS / $processors;
633 $larger_blocks = $BOOTSTRAPS - ( $block_size * $processors ); # number of blocks which have a size of
640 # This opens the output file:
641 # ---------------------------
642 if ( $output_HTML != 1 ) {
643 open( OUT, ">$outfile" ) || &dieWithUnexpectedError( "Cannot create file \"$outfile\"" );
646 # This starts printing to the output file:
647 # ----------------------------------------
652 # This loop goes through the different domains of the query
653 # which aligned to the alignment (in modes 2 and 4 this can
654 # obviously be only one):
655 # -----------------------------------------------------------
656 for ( $jj = 0; $jj < @complete_names; ++$jj ) {
659 # Moves the query to the last line(s) of the alignment.
660 # Removes other querie domains $complete_names[i]-- for which i != $jj
661 # --------------------------------------------------------------------
663 &moveToLast( $complete_names[ $jj ],
664 $temp_dir."/HMMALIGNOUT",
665 $temp_dir."/MOVETOLASTOUT",
670 if ( $mode == 1 || $mode == 3 ) {
672 @temp_array = &pfam2phylipMatchOnly( $temp_dir."/MOVETOLASTOUT",
673 $temp_dir."/ALIGN2_PHYLIP_MO",
677 @temp_array = &pfam2phylipMatchOnly( $temp_dir."/HMMALIGNOUT",
681 $length_of_alignment = $temp_array[ 0 ];
682 $length_of_orig_alignment = $temp_array[ 1 ];
683 $number_of_seqs_in_aln = $temp_array[ 2 ];
685 elsif ( $mode == 2 || $mode == 4 ) {
687 $query_name = $complete_names[ 0 ];
690 if ( !&startsWithSWISS_PROTname( $query_name ) ) {
691 # Query is not a SWISS-PROT sequence.
692 $query_name = &getCompleteNameForTrEMBLquerySeq( $temp_dir."/ALIGN2",
696 $number_of_seqs_in_aln = &countSeqsInPfamAlign( $temp_dir."/ALIGN2" );
699 if ( !&startsWithSWISS_PROTname( $query_name ) ) {
700 # Query is not a SWISS-PROT sequence.
701 $query_name = &getCompleteNameForTrEMBLquerySeq( $alignment,
704 $number_of_seqs_in_aln = &countSeqsInPfamAlign( $alignment );
711 if ( $number_of_seqs_in_aln < $MIN_NUMBER_OF_SEQS_IN_ALN ) {
713 if ( $output_HTML == 1 ) {
714 &exitWithWarning( "Removal of sequences resulted in an alignment with less than $MIN_NUMBER_OF_SEQS_IN_ALN sequences ($number_of_seqs_in_aln)", 1 );
717 &exitWithWarning( "Removal of sequences resulted in an alignment with less than $MIN_NUMBER_OF_SEQS_IN_ALN sequences ($number_of_seqs_in_aln)" );
724 unlink( $temp_dir."/ALIGN2_BOOTSTRAPPED" );
726 if ( $parallel == 1 ) {
727 &executeBootstrap_cz( $BOOTSTRAPS,
729 $temp_dir."/ALIGN2_PHYLIP_MO",
730 $temp_dir."/ALIGN2_BOOTSTRAPPED",
736 &executeBootstrap_cz( $BOOTSTRAPS,
738 $temp_dir."/ALIGN2_PHYLIP_MO",
739 $temp_dir."/ALIGN2_BOOTSTRAPPED" );
744 $current_dir = `pwd`;
745 $current_dir =~ s/\s//;
747 chdir ( $temp_dir ) || &dieWithUnexpectedError( "Could not chdir to \"$temp_dir\"" );
750 if ( $parallel == 1 ) {
753 my $all_finished = 0;
755 system( $RIO_SLAVE_DRIVER,
758 $temp_dir."/ALIGN2_BOOTSTRAPPED",
760 $complete_names[ $jj ],
765 && &dieWithUnexpectedError( "Could not execute \"$RIO_SLAVE_DRIVER\"" );
767 while ( $all_finished != 1 ) {
768 for ( $number = 0; $number < $processors; $number++ ) {
769 unless ( -e "FINISHED_$number" ) {
779 "MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX."0",
780 "MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX )
781 && &dieWithUnexpectedError( "$!" );
783 for ( $number = 1; $number < $processors; $number++ ) {
784 system( "cat MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX$number >> MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX" )
785 && &dieWithUnexpectedError( "$!" );
786 if ( unlink( "MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX$number" ) != 1 ) {
787 &dieWithUnexpectedError( "Could not delete \"MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX$number" );
791 # Sanity check: Counts ";" in "MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX".
792 if ( `grep -c ';' MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX` != $BOOTSTRAPS ) {
793 &dieWithUnexpectedError( "\"MAKETREEOUT$MULTIPLE_TREES_FILE_SUFFIX\" does not contain $BOOTSTRAPS \";\"" );
796 for ( $number = 0; $number < $processors; $number++ ) {
797 if ( unlink( "FINISHED_$number" ) != 1 ) {
798 &dieWithUnexpectedError( "Could not delete \"FINISHED_$number\"" );
802 &executeConsense( "MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX );
803 unlink( "outfile", "intree" );
805 system( "mv", "outtree", "MAKETREEOUT.nhx" )
806 && &dieWithUnexpectedError( "$!" );
811 $time_dqopuzzle = time; #time
812 &executePuzzleDQObootstrapped( "ALIGN2_BOOTSTRAPPED", $matrix_n );
813 $time_dqopuzzle = time - $time_dqopuzzle; #time
814 $time_dqopuzzleT += $time_dqopuzzle; #time
816 system( "mv", "ALIGN2_BOOTSTRAPPED.dist", "DISTs_TO_QUERY" )
817 && &dieWithUnexpectedError( "$!" );
821 &executePuzzleDQO( "ALIGN2_PHYLIP_MO", $matrix_n );
823 unlink( "ALIGN2_PHYLIP_MO" );
825 system( "mv", "ALIGN2_PHYLIP_MO.dist", "DIST_TO_QUERY" )
826 && &dieWithUnexpectedError( "$!" );
828 if ( $parallel != 1 ) {
829 $time_addingdists = time;
830 &addDistsToQueryToPWDfile( $pwd_file,
831 $temp_dir."/DISTs_TO_QUERY",
832 $temp_dir."/PWD_INC_QUERY",
833 $complete_names[ $jj ] );
836 $time_addingdists = time - $time_addingdists;
837 $time_addingdistsT += $time_addingdists;
839 &addDistsToQueryToPWDfile( $nbd_file,
840 $temp_dir."/DIST_TO_QUERY",
841 $temp_dir."/NBD_INC_QUERY",
842 $complete_names[ $jj ] );
847 $current_dir = `pwd`;
848 $current_dir =~ s/\s//;
850 || &dieWithUnexpectedError( "Could not chdir to \"$temp_dir\"" );
855 if ( $parallel != 1 ) {
856 unlink( $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX );
859 $time_tree_calc = time;
861 # This calculates the trees
862 # -------------------------
864 if ( $mode == 1 || $mode == 2 ) {
868 &executeNeighbor( $temp_dir."/NBD_INC_QUERY",
875 system( "mv", "outtree", "NBD_NJ_TREE" )
876 && &dieWithUnexpectedError( "$!" );
877 if ( $parallel != 1 ) {
878 &executeMakeTree( $options_for_makeTree,
879 $temp_dir."/PWD_INC_QUERY",
880 $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX,
881 $temp_dir."/maketree_tempdir" );
886 &executeNeighbor( $nbd_file,
893 system( "mv", "outtree", "NBD_NJ_TREE" )
894 && &dieWithUnexpectedError( "$!" );
896 &executeMakeTree( $options_for_makeTree,
898 $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX,
899 $temp_dir."/maketree_tempdir" );
903 chdir( $current_dir )
904 || &dieWithUnexpectedError( "Could not chdir to \"$current_dir\"" );
908 elsif ( $mode == 3 || $mode == 4 ) {
909 &executeMakeTree( $options_for_makeTree,
911 $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX,
912 $temp_dir."/maketree_tempdir" );
914 unlink( $temp_dir."/MAKETREEOUT".$ALIGN_FILE_SUFFIX );
918 $time_tree_calc = time - $time_tree_calc;
919 $time_tree_calcT += $time_tree_calc;
923 system( "cp", $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX, $outfile.$TREE_FILE_SUFFIX );
924 system( "cp", $temp_dir."/MAKETREEOUT".$LOG_FILE_SUFFIX, $outfile.$LOG_FILE_SUFFIX );
925 system( "cp", $temp_dir."/MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX, $outfile.$MULTIPLE_TREES_FILE_SUFFIX );
926 if ( $mode == 1 || $mode == 2 ) {
927 system( "cp", $temp_dir."/NBD_NJ_TREE", $outfile."-NJ".$TREE_FILE_SUFFIX );
932 unlink( $temp_dir."/ALIGN2" );
934 $multiple_trees_file = $temp_dir."/MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX;
935 $maketree_out_tree_file = $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX;
936 $distance_matrix_file = $temp_dir."/MAKETREEOUT".$SUFFIX_PWD_NOT_BOOTS;
939 if ( $mode == 1 || $mode == 3 ) {
940 $query_name = $complete_names[ $jj ];
943 $options_for_DoRIO = "";
945 # This will result in saving of the annotated consenus tree:
946 # ----------------------------------------------------------
947 if ( $safe_nhx == 1 ) {
948 my $number = $jj + 1;
949 if ( @complete_names > 1 ) {
950 $outfile_annot_nhx_tree = $outfile.$ADDITION_FOR_RIO_ANNOT_TREE."-".$number.$TREE_FILE_SUFFIX;
953 $outfile_annot_nhx_tree = $outfile.$ADDITION_FOR_RIO_ANNOT_TREE.$TREE_FILE_SUFFIX;
960 if ( $mode == 3 || $mode == 4 ) {
961 $options_for_DoRIO .= " D=".$distance_matrix_file;
963 elsif ( $mode == 1 ) {
964 $options_for_DoRIO .= " d=".$temp_dir."/DIST_TO_QUERY";
966 elsif ( $mode == 2 ) {
967 $options_for_DoRIO .= " D=".$nbd_file;
970 $options_for_DoRIO .= " M=".$multiple_trees_file;
971 $options_for_DoRIO .= " 'N=".$query_name."'";
972 $options_for_DoRIO .= " S=".$species_tree_file;
973 $options_for_DoRIO .= " O=".$dorio_outfile;
974 $options_for_DoRIO .= " P=".$sort;
975 $options_for_DoRIO .= " L=".$t_orthologs;
976 $options_for_DoRIO .= " B=".$t_sn;
977 $options_for_DoRIO .= " U=".$t_orthologs_dc;
978 $options_for_DoRIO .= " X=".$warn_more_than_one_ortho;
979 $options_for_DoRIO .= " Y=".$warn_no_orthos;
980 $options_for_DoRIO .= " Z=".$warn_one_ortho;
982 if ( $mode == 1 || $mode == 2 ) {
983 $options_for_DoRIO .= " T=".$temp_dir."/NBD_NJ_TREE";
984 $options_for_DoRIO .= " t=".$maketree_out_tree_file;
986 elsif ( $mode == 3 || $mode == 4 ) {
987 if ( $safe_nhx == 1 ) { # Added 09/04/03.
988 $options_for_DoRIO .= " T=".$maketree_out_tree_file;
992 if ( $safe_nhx == 1 ) {
993 $options_for_DoRIO .= " I";
995 if ( $output_ultraparalogs == 1 ) {
996 $options_for_DoRIO .= " p";
997 $options_for_DoRIO .= " v=".$t_ultra_paralogs;
1002 &executeDoRIO( $options_for_DoRIO );
1004 $time_rio = time - $time_rio;
1005 $time_rioT += $time_rio;
1007 unless ( ( -s $dorio_outfile ) && ( -f $dorio_outfile ) && ( -T $dorio_outfile ) ) {
1010 &dieWithUnexpectedError( "failure during execution of RIO (no output generated)" );
1013 if ( $safe_nhx == 1 ) {
1015 $temp_dir."/".$DO_RIO_TEMP_OUTFILE.$ADDITION_FOR_RIO_ANNOT_TREE.$TREE_FILE_SUFFIX,
1016 $outfile_annot_nhx_tree )
1017 && &dieWithUnexpectedError( "$!" );
1021 open( IN, "$dorio_outfile" )
1022 || &dieWithUnexpectedError( "Cannot open file \"$dorio_outfile\"" );
1024 $saw_distance_values = 0;
1025 $saw_ultra_paralogs = 0;
1026 $printed_ultra_paralogs = 0;
1027 $print_header_for_orthologies = 1;
1028 $print_header_for_s_paralogs = 1;
1033 # This generates the report
1034 # -------------------------
1036 W: while ( $return_line = <IN> ) {
1038 if ( $return_line =~ /distance values:/i ) {
1039 $saw_distance_values = 1;
1040 &printTitleForDistanceValues();
1042 elsif ( $return_line =~ /ultra paralogs/i ) {
1043 $saw_ultra_paralogs = 1;
1045 elsif ( $return_line =~ /^mean bootstrap/i ) {
1046 &printMeanBootstraps();
1048 elsif ( $return_line =~ /sort priority\s*:\s*(.+)/i ) {
1049 $sort_priority = $1;
1051 elsif ( $return_line =~ /ext nodes\s*:\s*(.+)/i ) {
1052 $ext_nodes_in_trees_analyzed = $1 - 1; # One seq is query.
1054 elsif ( $return_line =~ /bootstraps\s*:\s*(\S+)/i ) {
1055 if ( $jj == @complete_names - 1 ) {
1057 if ( $output_HTML == 1 ) {
1064 elsif ( $saw_distance_values != 1
1065 && $saw_ultra_paralogs != 1
1066 && $return_line =~ /(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*(\S*)/ ) {
1069 $subtree_neighbors = $3;
1073 if ( $print_header_for_orthologies == 1 ) {
1074 &printHeaderForOrthologies();
1075 $print_header_for_orthologies = 0;
1077 &printOrthologies();
1079 elsif ( $saw_distance_values != 1
1080 && $saw_ultra_paralogs != 1
1081 && $return_line =~ /^\s*-\s*$/ ) {
1086 if ( $print_header_for_orthologies == 1 ) {
1087 &printHeaderForOrthologies();
1088 $print_header_for_orthologies = 0;
1090 &printOrthologies();
1092 elsif ( $output_ultraparalogs == 1
1093 && $saw_ultra_paralogs == 1
1094 && $return_line =~ /(\S+)\s+(\S+)\s+(\S+)/ ) {
1098 if ( $print_header_for_s_paralogs == 1 ) {
1099 &printHeaderForSparalogs();
1100 $print_header_for_s_paralogs = 0;
1102 &printUltraParlogs();
1103 $printed_ultra_paralogs = 1;
1105 elsif ( $output_ultraparalogs == 1
1106 && $saw_ultra_paralogs == 1
1107 && $return_line =~ /^\s*-\s*$/ ) {
1108 &printNoUltraParalogs();
1110 elsif ( $return_line =~ /Bootstraps/ ) {
1111 $saw_distance_values = 0;
1113 elsif ( $saw_distance_values == 1 && $saw_ultra_paralogs != 1 ) {
1114 &printDistanceValues();
1120 } # End of for loop going through possible
1121 # multiple matches to the same alignment/model.
1123 if ( $output_HTML != 1 ) {
1129 if ( $output_HTML != 1 ) {
1130 print( "\n\nrio.pl successfully terminated.\nOutput written to: $outfile\n\n" );
1143 # ===========================================================
1145 # -----------------------------------------------------------
1150 # -----------------------------------------------------------
1151 # Parallization related
1152 # -----------------------------------------------------------
1156 # Last modified: 02/02/02
1157 sub readInNodesList {
1159 &testForTextFilePresence( $NODE_LIST );
1161 open( NIN, "$NODE_LIST" ) || &dieWithUnexpectedError( "Cannot open file \"$NODE_LIST\"" );
1164 if ( $_ =~ /(\S+)/ ) {
1165 push( @nodelist, $1 );
1174 # Last modified: 02/02/02
1176 my @temp_node_list = ();
1177 my $p = Net::Ping->new( "tcp", 2 ); # or "udp"
1180 foreach $n ( @nodelist ) {
1181 if ( defined( $p->ping( $n ) ) ) {
1182 push( @temp_node_list, $n );
1186 @nodelist = @temp_node_list;
1194 # -----------------------------------------------------------
1196 # -----------------------------------------------------------
1199 # Last modified: 03/07/01
1202 if ( $output_HTML != 1 ) {
1203 print OUT "RIO - Resampled Inference of Orthologs\n";
1204 print OUT "Version: $VERSION\n";
1205 print OUT "------------------------------------------------------------------------------\n";
1207 print OUT "Pfam alignment file : $alignment\n";
1209 print OUT "Pfam alignment description : ".&getDescriptionFromPfam( $alignment )."\n";
1211 if ( $mode == 1 || $mode == 2 ) {
1212 print OUT "Bootstrapped pairwise distances file : $pwd_file\n";
1213 print OUT "Not bootstrapped pairwise distances file: $nbd_file\n";
1214 print OUT "Bootstrap positions file : $bsp_file\n";
1216 if ( $mode == 1 || $mode == 3 ) {
1217 if ( $seed_aln_for_hmmbuild ne "" ) {
1218 print OUT "HMM : built based on $seed_aln_for_hmmbuild\n";
1220 elsif ( $hmm_name ne "" ) {
1221 print OUT "HMM : $hmm_name\n";
1224 print OUT "HMM : $hmm_file\n";
1226 print OUT "Query file : $seqX_file\n";
1228 print OUT "==============================================================================\n\n";
1236 # Last modified: 03/07/01
1237 sub printHeaderForOrthologies {
1239 if ( $output_HTML != 1 ) {
1241 print OUT "\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n";
1244 print OUT "Query : $query_name\n\n";
1246 if ( @complete_names > 1 ) {
1247 my $size = @complete_names;
1248 my $number = $jj + 1;
1249 print OUT "More than one region of the query were aligned to the profile HMM.\n";
1250 print OUT "This is for domain #$number out of $size.\n\n";
1253 print OUT "Number (in %) of observed orthologies (o), \"subtree-neighborings\" (n),\n";
1254 print OUT "and super-orthologies (s) to query in bootstrapped trees, evolutionary\n";
1255 print OUT "distance to query (as average number of aa replacements per residue):\n\n";
1256 if ( $long_output != 1 ) {
1257 print OUT "Sequence Description o[%] n[%] s[%] distance\n";
1258 print OUT "-------- ----------- ---- ---- ---- --------\n";
1261 print OUT "Sequence Description o[%] n[%] s[%] distance\n";
1262 print OUT "-------- ----------- ---- ---- ---- --------\n";
1268 print "<P>   </P>\n";
1269 print "<HR NOSHADE COLOR=\"#CCCCCC\">\n";
1270 print "<P>   </P>\n";
1273 if ( @complete_names > 1 ) {
1274 my $size = @complete_names;
1275 my $number = $jj + 1;
1276 print "<P>More than one region of the query were aligned to the profile HMM. \n";
1277 print "This is for domain #$number out of $size.</P>\n";
1279 print "<P>Query : $query_name</P>\n";
1280 print "<H4 class = \"title\">Orthologies, subtree-neighborings, super-orthologies</H4>\n";
1282 print "<P>Number (in %) of observed orthologies (o), \"subtree-neighborings\" (n), \n";
1283 print "and super-orthologies (s) to query in bootstrapped trees, evolutionary \n";
1284 print "distance to query (as average number of aa replacements per residue):</P>\n";
1285 if ( $ortho_name ne "-" ) {
1286 print "<TABLE BORDER=\"0\" CELLPADDING=\"1\">\n";
1288 print "<TR VALIGN=\"TOP\"> <TD NOWRAP> <B>Sequence</B> </TD><TD NOWRAP> <B>Description</B> </TD><TD NOWRAP ALIGN=\"RIGHT\"> <B>o[%]</B> </TD><TD NOWRAP ALIGN=\"RIGHT\"> <B>n[%]</B> </TD><TD NOWRAP ALIGN=\"RIGHT\"> <B>s[%]</B> </TD><TD NOWRAP>   <B>distance</B> </TD> </TR>\n";
1292 } ## printHeaderForOrthologies
1296 # Last modified: 10/15/01
1297 sub printHeaderForSparalogs {
1299 if ( $output_HTML != 1 ) {
1300 print OUT "\nUltra-paralogs\n";
1301 print OUT "--------------\n";
1302 print OUT "Number (in %) of observed ultra-paralogies (up) to query\n";
1303 print OUT "in bootstrapped trees, evolutionary distance to query (as average number\n";
1304 print OUT "of aa replacements per residue):\n\n";
1305 if ( $long_output != 1 ) {
1306 print OUT "Sequence Description up[%] distance\n";
1307 print OUT "-------- ----------- ----- --------\n";
1310 print OUT "Sequence Description up[%] distance\n";
1311 print OUT "-------- ----------- ----- --------\n";
1315 print "<H4 class = \"title\">Ultra-paralogs</H4>\n";
1316 print "<P>Number (in %) of observed ultra-paralogies (up) to query \n";
1317 print "in bootstrapped trees, evolutionary distance to query (as average number \n";
1318 print "of aa replacements per residue):</P>\n";
1319 print "<TABLE BORDER=\"0\" CELLPADDING=\"1\">\n";
1320 print "<TR VALIGN=\"TOP\"> <TD NOWRAP> <B>Sequence</B> </TD><TD NOWRAP> <B>Description</B> </TD><TD NOWRAP ALIGN=\"RIGHT\"> <B>up[%]</B> </TD><TD NOWRAP>   <B>distance</B> </TD> </TR>\n";
1324 } ## printHeaderForSparalogs
1328 # Last modified: 03/07/01
1329 sub printOrthologies {
1333 $orthos = &roundToInt( $orthos );
1334 $s_orthos = &roundToInt( $s_orthos );
1337 $subtree_neighbors = &roundToInt( $subtree_neighbors );
1340 if ( ( $description == 1 || $complete_description == 1 )
1341 && $ortho_name ne "-" ) {
1343 if ( $non_sp != 1 ) {
1344 if ( &startsWithSWISS_PROTname( $ortho_name ) ) {
1345 $descp = &getDescriptionFromSWISSPROT_ACDEOSfile( $SWISSPROT_ACDEOS_FILE, $ortho_name );
1352 if ( &startsWithSWISS_PROTname( $ortho_name ) ) {
1353 $descp = &getDescriptionFromSWISSPROT_ACDEOSfile( $SWISSPROT_ACDEOS_FILE, $ortho_name );
1356 $descp = &getDescriptionFromTrEMBL_ACDEOSfile( $TREMBL_ACDEOS_FILE, $ortho_name );
1360 if ( $output_HTML != 1 ) {
1361 if ( $long_output == 1 ) {
1362 @cut = &cutDescription( $descp, 73 );
1365 @cut = &cutDescription( $descp, 33 );
1370 if ( $descp eq "" ) {
1374 if ( $output_HTML != 1 ) {
1376 if ( $ortho_name eq "-" ) {
1377 print OUT "\nNO ORTHOLOGS in alignment with the current thresholds for output\n";
1379 elsif ( $dist ne "-" ) {
1380 if ( $long_output == 1 ) {
1381 print OUT sprintf "%-24.24s%-74.74s%5s%5s%5s%10.6f", $ortho_name,$descp,$orthos,$subtree_neighbors,$s_orthos,$dist;
1384 print OUT sprintf "%-24.24s%-34.34s%5s%5s%5s%10.6f", $ortho_name,$descp,$orthos,$subtree_neighbors,$s_orthos,$dist;
1388 if ( $long_output == 1 ) {
1389 print OUT sprintf "%-24.24s%-74.74s%5s%5s%5s%10.10s", $ortho_name,$descp,$orthos,$subtree_neighbors,$s_orthos,$dist;
1392 print OUT sprintf "%-24.24s%-34.34s%5s%5s%5s%10.10s", $ortho_name,$descp,$orthos,$subtree_neighbors,$s_orthos,$dist;
1395 if ( $complete_description == 1 ) {
1396 for ( $i = 1; $i < @cut; ++$i ) {
1398 if ( $long_output == 1 ) {
1399 print OUT sprintf " %-74.74s", $cut[ $i ];
1402 print OUT sprintf " %-34.34s", $cut[ $i ];
1409 if ( $ortho_name eq "-" ) {
1410 print "<H4 class = \"warnings\">NO ORTHOLOGS in alignment with the current thresholds for output</H4>\n";
1413 $ortho_name = &replaceNameWithLinkToExpasy( $ortho_name );
1414 print "<TR VALIGN=\"TOP\"> <TD NOWRAP> $ortho_name </TD><TD> $descp </TD><TD NOWRAP ALIGN=\"RIGHT\"> $orthos </TD><TD NOWRAP ALIGN=\"RIGHT\"> $subtree_neighbors </TD><TD NOWRAP ALIGN=\"RIGHT\"> $s_orthos </TD><TD NOWRAP>   $dist </TD> </TR>\n";
1418 } ## printOrthologies
1422 sub replaceNameWithLinkToExpasy {
1425 if ( $name =~ /(.+)_(.+)\/(.+)/ ) {
1429 if ( length( $desc ) <= 4 ) {
1430 $name = "<A HREF=\"".$EXPASY_SPROT_SEARCH_DE.$desc."_".$spec."\" TARGET=\"_blank\">".$desc."_".$spec."</A>\/".$numbers;
1433 $name = "<A HREF=\"".$EXPASY_SPROT_SEARCH_AC.$desc."\" TARGET=\"_blank\">".$desc."</A>_".$spec."\/".$numbers;
1439 } ## replaceNameWithLinkToExpasy
1444 # Last modified: 10/15/01
1445 sub printUltraParlogs {
1449 $s_paras = &roundToInt( $s_paras );
1451 if ( ( $description == 1 || $complete_description == 1 )
1452 && $s_para_name ne "-" ) {
1454 if ( $non_sp != 1 ) {
1455 if ( &startsWithSWISS_PROTname( $s_para_name ) ) {
1456 $descp = &getDescriptionFromSWISSPROT_ACDEOSfile( $SWISSPROT_ACDEOS_FILE, $s_para_name );
1463 if ( &startsWithSWISS_PROTname( $s_para_name ) ) {
1464 $descp = &getDescriptionFromSWISSPROT_ACDEOSfile( $SWISSPROT_ACDEOS_FILE, $s_para_name );
1467 $descp = &getDescriptionFromTrEMBL_ACDEOSfile( $TREMBL_ACDEOS_FILE, $s_para_name );
1471 if ( $output_HTML != 1 ) {
1472 if ( $long_output == 1 ) {
1473 @cut = &cutDescription( $descp, 73 );
1476 @cut = &cutDescription( $descp, 33 );
1481 if ( $descp eq "" ) {
1485 if ( $output_HTML != 1 ) {
1487 if ( $dist ne "-" ) {
1488 if ( $long_output == 1 ) {
1489 print OUT sprintf "%-24.24s%-74.74s%5s%10.6f", $s_para_name,$descp,$s_paras,$dist;
1492 print OUT sprintf "%-24.24s%-34.34s%5s%10.6f", $s_para_name,$descp,$s_paras,$dist;
1496 if ( $long_output == 1 ) {
1497 print OUT sprintf "%-24.24s%-74.74s%5s%10.10s", $s_para_name,$descp,$s_paras,$dist;
1500 print OUT sprintf "%-24.24s%-34.34s%5s%10.10s", $s_para_name,$descp,$s_paras,$dist;
1503 if ( $complete_description == 1 ) {
1504 for ( $i = 1; $i < @cut; ++$i ) {
1506 if ( $long_output == 1 ) {
1507 print OUT sprintf " %-74.74s", $cut[ $i ];
1510 print OUT sprintf " %-34.34s", $cut[ $i ];
1518 $s_para_name = &replaceNameWithLinkToExpasy( $s_para_name );
1519 print "<TR VALIGN=\"TOP\"> <TD NOWRAP> $s_para_name </TD><TD> $descp </TD><TD NOWRAP ALIGN=\"RIGHT\"> $s_paras </TD><TD NOWRAP>   $dist </TD> </TR>\n";
1522 } ## printUltraParlogs
1526 sub printNoUltraParalogs {
1527 if ( $output_HTML != 1 ) {
1528 print OUT "\nUltra-paralogs\n";
1529 print OUT "--------------\n";
1530 print OUT "\nNO ULTRA-PARALOGS in alignment with the current threshold of $t_ultra_paralogs%\n";
1533 print "<H4 class = \"title\">Ultra-paralogs</H4>\n";
1534 print "<H4 class = \"warnings\">NO ULTRA-PARALOGS in alignment with the current threshold of $t_ultra_paralogs%</H4>\n";
1536 } ## printNoUltraParalogs
1540 # Called by method "printOrthologies".
1541 # Last modified: 02/27/01
1542 sub cutDescription {
1548 while ( ( length( $line ) ) > $size ) {
1549 $cut[ $i++ ] = substr( $line, 0, $size );
1550 $line = substr( $line, $size );
1552 $cut[ $i++ ] = $line;
1559 # Last modified: 02/27/01
1560 sub printTitleForDistanceValues {
1561 if ( $output_HTML != 1 ) {
1562 if ( $mode == 1 || $mode == 2 ) {
1563 print OUT "\n\nDistance values (based on NJ tree of original alignment)\n";
1564 print OUT "--------------------------------------------------------\n";
1566 elsif ( $mode == 3 || $mode == 4 ) {
1567 print OUT "\n\nDistance values (based on ML branch length values on consensus tree)\n";
1568 print OUT "--------------------------------------------------------------------\n";
1572 print "<H4 class = \"title\">Distance values (based on NJ tree of original alignment)</H4>\n";
1575 } ## printTitleForDistanceValues
1580 # Last modified: 02/27/01
1581 sub printDistanceValues {
1582 if ( $output_HTML != 1 ) {
1583 print OUT "$return_line";
1586 chomp( $return_line );
1587 if ( $return_line =~ /WARNING/ ) {
1588 $return_line =~ s/\+\/-/ ± /;
1589 $return_line =~ s/\*/ × /;
1590 print "<H4 class = \"warnings\">$return_line</H4>\n";
1592 elsif ( $return_line =~ /lca\s+is/i ) {
1593 print "<P class = \"nomargins\">$return_line</P>\n";
1595 elsif ( $return_line =~ /orthologous/i ) {
1596 print "<P class = \"nomargins\">$return_line</P>\n";
1598 elsif ( $return_line =~ /distance\s+of\s+query/i ) {
1599 print "<TABLE BORDER=\"0\" CELLPADDING=\"1\">\n";
1601 if ( $return_line =~ /(.+)=(.+)/ ) {
1602 print "<TR VALIGN=\"TOP\"><TD>$1</TD><TD> = $2</TD></TR>\n";
1604 if ( $return_line =~ /sum\s+/i || $return_line =~ /distance\s+of\s+ortholog\s+to\s+LCA/i ) {
1608 } ## printDistanceValues
1613 # Last modified: 02/27/01
1614 sub printMeanBootstraps {
1615 if ( $output_HTML != 1 ) {
1616 print OUT "\n\n$return_line";
1619 chomp( $return_line );
1620 $return_line =~ s/\+\/-/ ± /;
1622 print "<P>$return_line</P>\n";
1624 } ## printMeanBootstraps
1629 # Last modified: 02/12/02
1632 if ( $output_HTML != 1 ) {
1633 print OUT "\n\n\n==============================================================================\n";
1634 if ( $number_of_seqs_in_aln >= $MIN_NUMBER_OF_SEQS_IN_ALN ) {
1635 print OUT "RIO options\n";
1636 print OUT "-----------\n";
1637 print OUT "Mode : ";
1639 print OUT "precalc. pwd files with alignment not containing query (1)\n";
1641 elsif ( $mode == 2 ) {
1642 print OUT "precalc. pwd files with alignment containing query (2)\n";
1644 elsif ( $mode == 3 ) {
1645 print OUT "alignment not containing query (3)\n";
1647 elsif ( $mode == 4 ) {
1648 print OUT "alignment containing query (4)\n";
1650 print OUT "Bootstraps : $bootstraps\n";
1651 print OUT "Species tree : $species_tree_file\n";
1652 if ( $safe_nhx == 1 ) {
1653 if ( $mode == 3 || $mode == 4 ) {
1654 if ( @complete_names > 1 ) {
1655 $outfile_annot_nhx_tree =~ s/-\d+\.nhx/-X.nhx/;
1656 print OUT "Saved annotated consensus trees (ML branch lengths) : $outfile_annot_nhx_tree\n";
1659 print OUT "Saved annotated consensus tree (ML branch lengths) : $outfile_annot_nhx_tree\n";
1662 elsif ( $mode == 1 || $mode == 2 ) {
1663 if ( @complete_names > 1 ) {
1664 $outfile_annot_nhx_tree =~ s/-\d+\.nhx/-X.nhx/;
1665 print OUT "Saved annotated NJ trees (based on original alignment) : $outfile_annot_nhx_tree\n";
1668 print OUT "Saved annotated NJ tree (based on original alignment) : $outfile_annot_nhx_tree\n";
1672 print OUT "Threshold for output for orthologies (L=) : $t_orthologs\n";
1673 print OUT "Threshold for output for subtree-neighborings (B=) : $t_sn\n";
1674 print OUT "Threshold for distance calc for orthologies (U=) : $t_orthologs_dc\n";
1676 print OUT "When to generate warnings:\n";
1677 print OUT "More than one ortholog: diff. in standard deviations (X=): $warn_more_than_one_ortho\n";
1678 print OUT "No orthologs : diff. in standard deviations (Y=): $warn_no_orthos\n";
1679 print OUT "One ortholog : factor (Z=): $warn_one_ortho\n";
1680 if ( $output_ultraparalogs == 1 ) {
1681 print OUT "Output ultra-paralogs (p)\n";
1682 print OUT "Threshold for ultra-paralogies (v=) : $t_ultra_paralogs\n";
1684 print OUT "Sort priority: $sort_priority\n";
1687 print OUT "\nOptions for the calculation of the phylgenetic trees\n";
1688 print OUT "----------------------------------------------------\n";
1690 print OUT "Model for pairwise distance calculations : $matrix\n";
1692 elsif ( $mode == 3 || $mode == 4 ) {
1693 print OUT "Model for pairwise dist and ML branch length calc. : $matrix\n";
1695 if ( $mode == 1 || $mode == 3 || $mode == 4 ) {
1696 print OUT "Columns in alignment used for tree calc : $length_of_alignment\n";
1697 print OUT "Columns in original alignment : $length_of_orig_alignment\n";
1699 print OUT "Sequences in alignment used for trees (incl query) : $number_of_seqs_in_aln\n";
1701 if ( $mode == 3 || $mode == 4 ) {
1702 print OUT "Removed non-SWISS-PROT sequences : ";
1703 if ( $non_sp == 1 ) {
1709 if ( $non_sp == 1 ) {
1710 print OUT "Removed \"TrEMBL fragments\" : ";
1711 if ( $no_frags == 1 ) {
1719 if ( $mode == 1 || $mode == 2 ) {
1720 print OUT "Prgrm to calc. branch lengths for distance values : PHYLIP NEIGHBOR (NJ)\n";
1722 elsif ( $mode == 3 || $mode == 4 ) {
1723 print OUT "Prgrm to calc branch lengths for distance values : TREE-PUZZLE\n";
1725 if ( $seed_aln_for_hmmbuild ne "" ) {
1726 print OUT "HMM was built with hmmbuild using options : $command_line_for_hmmbuild\n";
1728 if ( ( $mode == 3 || $mode == 4 ) && $species_names_file =~ /\S/ ) {
1729 print OUT "File listing species used for tree calculation (G=): $species_names_file\n";
1731 print OUT "Seed for random number generator : $seed_for_makeTree\n";
1732 print OUT "Options for makeTree : $options_for_makeTree\n";
1734 $time_total = time - $time_total;
1736 print OUT "\nTime and date\n";
1737 print OUT "-------------\n";
1739 print OUT "Time requirement dqo puzzle : $time_dqopuzzleT s\n";
1742 print OUT "Time requirement for tree calculation: $time_tree_calcT s\n";
1743 print OUT "Time requirement for SDI and RIO : $time_rioT s\n";
1744 print OUT "Total time requirement : $time_total s\n";
1745 print OUT "Date started : $start_date";
1746 print OUT ( "Date finished : ".`date` );
1748 print OUT "\nCommand line\n";
1749 print OUT "------------\n";
1750 print OUT "$command_line\n";
1751 if ( $parallel == 1 ) {
1752 print OUT "\nProcessors used: @nodelist\n";
1756 if ( $printed_ultra_paralogs == 1 ) {
1759 if ( $species_tree_file =~ /.+\/(.+)/ ) {
1760 $species_tree_file = $1;
1762 print "<H4 class = \"title\">Options</H4>\n";
1763 print "<TABLE BORDER=\"0\" CELLPADDING=\"1\">\n";
1764 print "<TR VALIGN=\"TOP\"><TD> Bootstraps: </TD><TD> $bootstraps </TD></TR>\n";
1765 print "<TR VALIGN=\"TOP\"><TD> Species tree: </TD><TD> $species_tree_file </TD></TR>\n";
1766 print "<TR VALIGN=\"TOP\"><TD> Threshold for output for orthologies: </TD><TD> $t_orthologs </TD></TR>\n";
1767 print "<TR VALIGN=\"TOP\"><TD> Threshold for output for subtree-neighborings: </TD><TD> $t_sn </TD></TR>\n";
1768 print "<TR VALIGN=\"TOP\"><TD> Threshold for distance calc for orthologies: </TD><TD> $t_orthologs_dc </TD></TR>\n";
1769 print "<TR VALIGN=\"TOP\"><TD> When to generate warnings </TD><TD> </TD></TR>\n";
1770 print "<TR VALIGN=\"TOP\"><TD> More than one ortholog [diff in standard deviations]: </TD><TD> $warn_more_than_one_ortho </TD></TR>\n";
1771 print "<TR VALIGN=\"TOP\"><TD> No orthologs [diff in standard deviations]: </TD><TD> $warn_no_orthos </TD></TR>\n";
1772 print "<TR VALIGN=\"TOP\"><TD> One ortholog [factor]: </TD><TD> $warn_one_ortho </TD></TR>\n";
1773 if ( $output_ultraparalogs == 1 ) {
1774 print "<TR VALIGN=\"TOP\"><TD> Output ultra-paralogs </TD><TD> </TD></TR>\n";
1775 print "<TR VALIGN=\"TOP\"><TD> Threshold for ultra-paralogies: </TD><TD> $t_ultra_paralogs </TD></TR>\n";
1777 print "<TR VALIGN=\"TOP\"><TD> Sort priority: </TD><TD> $sort_priority </TD></TR>\n";
1778 print "<TR VALIGN=\"TOP\"><TD> Model for pairwise distance calculations:</TD><TD> $matrix </TD></TR>\n";
1779 print "<TR VALIGN=\"TOP\"><TD> Columns in alignment used for tree calc: </TD><TD> $length_of_alignment </TD></TR>\n";
1780 print "<TR VALIGN=\"TOP\"><TD> Columns in original alignment: </TD><TD> $length_of_orig_alignment </TD></TR>\n";
1781 print "<TR VALIGN=\"TOP\"><TD> Sequences in alignment used for trees (incl query): </TD><TD> $number_of_seqs_in_aln </TD></TR>\n";
1782 print "<TR VALIGN=\"TOP\"><TD> Seed for random number generator: </TD><TD> $seed_for_makeTree </TD></TR>\n";
1785 $time_total = time - $time_total;
1787 print "<P>   </P>\n";
1788 print "<TABLE BORDER=\"0\" CELLPADDING=\"1\">\n";
1789 print "<TR VALIGN=\"TOP\"><TD> Time requirement: </TD><TD> $time_total s</TD></TR>\n";
1790 print "<TR VALIGN=\"TOP\"><TD> Date started: </TD><TD> $start_date </TD></TR>\n";
1791 print ( "<TR VALIGN=\"TOP\"><TD> Date finished: </TD><TD> ".`date`." </TD></TR>\n" );
1792 if ( $parallel == 1 ) {
1793 print "<TR VALIGN=\"TOP\"><TD> Number of processors used: </TD><TD> ".scalar( @nodelist )." </TD></TR>\n";
1809 # -----------------------------------------------------------
1810 # Execution of other programs
1811 # -----------------------------------------------------------
1820 # Returns the options used.
1821 # Last modified: 05/11/01
1822 sub executeHmmbuild {
1825 my $outfile = $_[ 1 ];
1828 &testForTextFilePresence( $seed );
1830 $options = getHmmbuildOptionsFromPfam( $seed );
1837 $options =~ s/-o\s+\S+//;
1838 $options =~ s/(\s|^)[^-]\S+/ /g;
1840 if ( $options =~ /--prior/ ) {
1841 my $basename = basename( $seed );
1842 $basename .= ".PRIOR";
1843 $options =~ s/--prior/--prior $PRIOR_FILE_DIR$basename/;
1846 # Remove for versions of HMMER lower than 2.2.
1847 if ( $options =~ /--informat\s+\S+/ ) {
1848 $options =~ s/--informat\s+\S+/--informat SELEX/;
1851 $options = "--informat SELEX ".$options;
1854 system( "$HMMBUILD $options $outfile $seed" )
1855 && &dieWithUnexpectedError( "Could not execute \"$HMMBUILD $options $outfile $seed\"" );
1858 } ## executeHmmbuild.
1865 # Last modified: 02/26/01
1866 sub getHmmbuildOptionsFromPfam {
1868 my $infile = $_[ 0 ];
1869 my $return_line = "";
1872 &testForTextFilePresence( $infile );
1874 open( GHO, $infile ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
1875 while ( $return_line = <GHO> ) {
1876 if ( $return_line =~ /^\s*#.*hmmbuild\s+(.+)\s*$/ ) {
1885 } ## getHmmbuildOptionsFromPfam
1890 # Purpose. Aligns a FASTA file to a Pfam alignment using an HMM profile.
1892 # 1. Pfam flat file name
1893 # 2. Name of FASTA file to append
1894 # 3. HMM profile file name
1895 # 4. outputfile name
1896 # 5. 1 use --mapali, --withali otherwise (in hmmalign)
1897 # Returns 1 if successful, -1 if no alignment was made because
1898 # E value of HMMSEARCH output was larger than $E_VALUE_THRESHOLD.
1899 # Last modified: 07/11/01
1900 sub alignWithHmmalign {
1901 my $alignment = $_[ 0 ];
1902 my $query = $_[ 1 ];
1904 my $outfile = $_[ 3 ];
1905 my $use_mapali = $_[ 4 ];
1907 my $ali = "--withali";
1909 if ( $use_mapali == 1 ) {
1913 &testForTextFilePresence( $alignment );
1914 &testForTextFilePresence( $query );
1915 &testForTextFilePresence( $hmm );
1917 system( "$HMMSEARCH $hmm $query > $temp_dir/HMMSEARCHOUT" )
1918 && &dieWithUnexpectedError( "Could not execute \"$HMMSEARCH $hmm $query > $temp_dir/HMMSEARCHOUT\"" );
1922 $E = &getEvalue( "$temp_dir/HMMSEARCHOUT" );
1924 &dieWithUnexpectedError( "No E-value found in \"$temp_dir/HMMSEARCHOUT\"" );
1926 elsif ( $E > $E_VALUE_THRESHOLD ) {
1927 unlink( "$temp_dir/HMMSEARCHOUT" );
1931 system( "$P7EXTRACT -d $temp_dir/HMMSEARCHOUT > $temp_dir/GDF" )
1932 && &dieWithUnexpectedError( "Could not execute \"$P7EXTRACT -d $temp_dir/HMMSEARCHOUT > $temp_dir/GDF\"" );
1935 system( "$MULTIFETCH -d -g $query $temp_dir/GDF > $temp_dir/MULTIFETCHOUT" )
1936 && &dieWithUnexpectedError( "Could not execute \"$MULTIFETCH -d -g $query $temp_dir/GDF > $temp_dir/MULTIFETCHOUT\"" );
1938 # Checks if score was too low to have made a reasonable alignment.
1939 unless ( -s "$temp_dir/MULTIFETCHOUT" ) {
1940 unlink( "$temp_dir/HMMSEARCHOUT", "$temp_dir/GDF", "$temp_dir/MULTIFETCHOUT" );
1944 system( "$HMMALIGN -o $outfile $ali $alignment $hmm $temp_dir/MULTIFETCHOUT >/dev/null 2>&1" )
1945 && &dieWithUnexpectedError( "Could not execute \"$HMMALIGN -o $outfile $ali $alignment $hmm $temp_dir/MULTIFETCHOUT\"" );
1947 if ( unlink( "$temp_dir/HMMSEARCHOUT", "$temp_dir/GDF","$temp_dir/MULTIFETCHOUT" ) != 3 ) {
1948 &dieWithUnexpectedError( "Could not delete (a) file(s)" );
1952 } ## alignWithHmmalign
1957 # Gets the E value for complete sequences (score includes all domains)
1958 # from a HMMSEARCH output file.
1959 # One argument: the HMMSEARCH output file name
1960 # Returns the E value, 2000 if no E value found
1961 # Last modified: 07/11/01
1964 my $infile = $_[ 0 ];
1965 my $return_line = "";
1969 &testForTextFilePresence( $infile );
1971 open( E, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
1972 while ( $return_line = <E> ) {
1974 # "Sequence Description Score E-value N"
1975 if ( $return_line =~ /Sequence.+Description.+Score.+E.+value.+N/ ) {
1978 # "QUERY_HUMAN 657.4 1.3e-198 1"
1979 elsif ( $flag == 1 && $return_line =~ /\s+(\S+)\s+\d+\s*$/ ) {
1993 # Four/Five arguments:
1994 # 1. Number of bootstraps
1995 # 2. bsp (bootstrap positions) file
1996 # 3. Infile (alignment)
1997 # 4. Outfile (bootstrapped according to bsp file)
1998 # 5. Number of processors
1999 # Last modified: 01/30/02
2000 sub executeBootstrap_cz {
2001 my $boots = $_[ 0 ];
2002 my $bsp_file = $_[ 1 ];
2003 my $infile = $_[ 2 ];
2004 my $outfile = $_[ 3 ];
2005 my $processors = $_[ 4 ];
2007 if ( defined( $processors ) && ( $processors > 1 ) ) {
2008 system( "$BOOTSTRAP_CZ $boots $infile $bsp_file $outfile $processors" )
2009 && &dieWithUnexpectedError( "Could not execute \"$BOOTSTRAP_CZ $boots $infile $bsp_file $outfile $processors\"" );
2013 system( "$BOOTSTRAP_CZ $boots $infile $bsp_file $outfile" )
2014 && &dieWithUnexpectedError( "Could not execute \"$BOOTSTRAP_CZ $boots $infile $bsp_file $outfile\"" );
2017 } ## executeBootstrap_cz
2024 # options for DoRIO.main.
2025 # Last modified: 02/26/01
2028 my $options = $_[ 0 ];
2030 system( "$DORIO $options >/dev/null 2>&1" )
2031 && &dieWithUnexpectedError( "Could not execute \"$DORIO $options\"" );
2047 # -----------------------------------------------------------
2048 # These deal with the alignment
2049 # -----------------------------------------------------------
2054 # Counts sequences from a Pfam flat file or
2055 # in a PHYLIP interleaved aligment.
2056 # One arguments: Pfam flat file name.
2057 # Returns the number of sequences.
2058 # Last modified: 07/10/01
2059 sub countSeqsInPfamAlign {
2060 my $infile = $_[ 0 ];
2061 my $return_line = "";
2062 my $saw_sequence_line = 0;
2063 my $number_of_seqs = 0;
2065 &testForTextFilePresence( $infile );
2067 open( C, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2068 while ( $return_line = <C> ) {
2070 if ( $saw_sequence_line == 1
2071 && !&containsPfamNamedSequence( $return_line )
2072 && !&isPfamCommentLine( $return_line ) ) {
2075 if ( &isPfamSequenceLine( $return_line )
2076 && $return_line !~ /^\s*\d+\s+\d+/ ) {
2077 if ( $saw_sequence_line == 0 ) {
2078 $saw_sequence_line = 1;
2084 return $number_of_seqs;
2086 } ## countSeqsInPfamAlign
2091 # This gets the complete name(s) of a sequence from a Pfam alignment.
2092 # I.e. it adds "/xxx-xxx".
2094 # 1. Infile (alignment)
2096 # Returns a String-array of all the complete names found.
2097 # Last modified: 03/04/01
2098 sub getCompleteName {
2100 my $infile = $_[ 0 ];
2101 my $query_name = $_[ 1 ];
2102 my $return_line = "";
2103 my @complete_names = ();
2104 my $complete_name = "";
2107 &testForTextFilePresence( $infile );
2109 $query_name =~ s/\/.*//;
2111 open( INGCN, $infile ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2112 while ( $return_line = <INGCN> ) {
2113 if ( $return_line =~ /^\s*$query_name(\S+)\s+.+/ ) {
2114 $complete_name = $query_name.$1;
2115 if ( $i > 0 && $complete_names[ 0 ] eq $complete_name ) {
2116 # Now, we saw of all of them.
2119 $complete_names[ $i++ ] = $complete_name;
2124 return @complete_names;
2125 } ## getCompleteName
2130 # Removes sequences from a Pfam flat file.
2131 # It can remove all sequences not from species listed in a species names file.
2132 # It can remove all sequences which do not have a SWISS-PROT name (XXXX_XXXXX)
2133 # It can remove all sequences which are "TrEMBL" fragments.
2135 # 1. Pfam flat file name
2137 # 3. Name of the query - not to be removed
2138 # (use " " to not use this functionality)
2139 # 4. species names file (will be ignored if " ")
2140 # 5. 1 to NOT remove non-SWISS_PROT seqs.
2141 # 6. 1 to remove TrEMBL seqs with "(FRAGMENT)" in their DE line.
2142 # (Only used if non SWISS_PROT seqswill not be removed)
2143 # Returns the number of sequences in the resulting alignment.
2144 # If a query name is given, it returns -1 if query is not found in alignment,
2145 # -10 if the name is not unique.
2146 # Last modified: 05/11/01
2147 sub removeSeqsFromPfamAlign {
2148 my $infile = $_[ 0 ];
2149 my $outfile = $_[ 1 ];
2150 my $query = $_[ 2 ];
2151 my $species_names_file = $_[ 3 ];
2152 my $keep_non_sp = $_[ 4 ];
2153 my $remove_frags = $_[ 5 ];
2154 my $return_line = "";
2157 my $saw_sequence_line = 0;
2158 my $number_of_seqs = 0;
2160 my $query_given = 0;
2161 my $species_names_file_given = 0;
2162 my $length_of_name = 0;
2163 my %AC_OS = (); # AC -> species name (TrEMBL)
2164 my %AC_DE = (); # AC -> description (TrEMBL)
2169 &testForTextFilePresence( $infile );
2171 if ( $query =~ /\S/ ) {
2174 if ( $species_names_file =~ /\S/ ) {
2175 $species_names_file_given = 1;
2176 &readSpeciesNamesFile( $species_names_file );
2179 if ( $keep_non_sp == 1
2180 || ( $query_given == 1 && !&startsWithSWISS_PROTname( $query ) ) ) {
2182 &testForTextFilePresence( $TREMBL_ACDEOS_FILE );
2184 # Fill up hash $AC_OS and $AC_DE.
2185 open( HH, "$TREMBL_ACDEOS_FILE" ) || &dieWithUnexpectedError( "Cannot open file \"$TREMBL_ACDEOS_FILE\"" );
2186 while ( $return_line = <HH> ) {
2187 if ( $return_line =~ /(\S+);([^;]*);(\S+)/ ) {
2189 if ( $remove_frags == 1 ) {
2197 open( OUT_RNSP, ">$outfile" ) || &dieWithUnexpectedError( "Cannot create file \"$outfile\"" );
2198 open( IN_RNSP, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2199 while ( $return_line = <IN_RNSP> ) {
2201 if ( $saw_sequence_line == 1
2202 && !&containsPfamNamedSequence( $return_line )
2203 && !&isPfamCommentLine( $return_line ) ) {
2204 # This is just for counting purposes.
2205 $saw_sequence_line = 2;
2207 if ( &isPfamSequenceLine( $return_line ) ) {
2208 if ( $saw_sequence_line == 0 ) {
2209 $saw_sequence_line = 1;
2211 $return_line =~ /(\S+)\s+(\S+)/;
2214 if ( $query_given == 1 && $name eq $query ) {
2217 if ( ( $query_given == 1 && $name ne $query )
2218 || $query_given != 1 ) {
2219 if ( !&startsWithSWISS_PROTname( $name ) ) {
2220 if ( $keep_non_sp != 1 ) {
2226 unless( exists( $AC_OS{ $AC } ) ) {
2227 #ACs not present in "ACDEOS" file.
2230 $OS = $AC_OS{ $AC };
2231 if ( !$OS || $OS eq "" ) {
2232 &dieWithUnexpectedError( "species for \"$AC\" not found" );
2234 if ( $species_names_file_given == 1 ) {
2235 unless( exists( $Species_names_hash{ $OS } ) ) {
2239 if ( $remove_frags == 1 ) {
2240 $DE = $AC_DE{ $AC };
2241 if ( $DE && $DE =~ /\(FRAGMENT\)/ ) {
2245 $name =~ s/\//_$OS\//;
2249 if ( $species_names_file_given == 1 ) {
2250 if ( $name =~ /_([A-Z0-9]{1,5})/ ) {
2251 unless( exists( $Species_names_hash{ $1 } ) ) {
2255 # remove everything whose species cannot be determined.
2262 elsif ( $query_given == 1 && $name eq $query
2263 && !&startsWithSWISS_PROTname( $query ) ) {
2264 # Adding species to non SWISS-PROT query
2267 unless( exists( $AC_OS{ $AC } ) ) {
2268 #ACs not present in "ACDEOS" file.
2269 &userError( "Could not establish species of query.\n Check file \"$TREMBL_ACDEOS_FILE\"." );
2271 $OS = $AC_OS{ $AC };
2272 if ( !$OS || $OS eq "" ) {
2273 &dieWithUnexpectedError( "species for \"$AC\" not found" );
2275 $name =~ s/\//_$OS\//;
2278 $length_of_name = length( $name );
2280 if ( $length_of_name > ( $LENGTH_OF_NAME - 1 ) ) {
2281 &userError( "Name \"$name\" is too long." );
2284 for ( my $j = 0; $j <= ( $LENGTH_OF_NAME - $length_of_name - 1 ); ++$j ) {
2288 $return_line = $name.$seq."\n";
2291 print OUT_RNSP $return_line;
2292 if ( $saw_sequence_line == 1 ) {
2298 if ( $query_given == 1 ) {
2299 if ( $saw_query < 1 ) {
2302 elsif ( $saw_query > 1 ) {
2306 return $number_of_seqs;
2308 } ## removeSeqsFromPfamAlign
2315 # "Returns" a Hash of Strings (=keys) containing all the names found in PWD file
2316 # Last modified: 05/29/01
2317 sub getNamesFromPWDFile {
2318 my $infile = $_[ 0 ];
2319 my $return_line = "";
2321 my $saw_dist_line = 0;
2323 &testForTextFilePresence( $infile );
2325 open( GN_IN, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2327 while ( $return_line = <GN_IN> ) {
2328 if ( $saw_dist_line == 1 && $return_line =~ /^\s*(\d+)\s*$/ ) {
2330 &dieWithUnexpectedError( "Failed sanity check" );
2334 elsif ( $return_line =~ /^\s*(\S+)\s+\S+/ ) {
2335 $names_in_pwd_file{ $1 } = 0;
2342 } ## getNamesFromPWDFile
2347 # Moves sequences which start with query name (argument 1)
2348 # to the last positions in pfam alignment sepecified by argument 2.
2349 # Removes seqs present in argument 4, unless for query name.
2352 # 2. Infile (alignment)
2353 # 3. Outfile (=infile with query seq moved to the bottom)
2354 # 4. Array of seq names to remove, unless for query name
2355 # Last modified: 06/25/01
2357 my $query = $_[ 0 ];
2358 my $infile = $_[ 1 ];
2359 my $outfile = $_[ 2 ];
2360 my @to_remove = @{ $_[ 3 ] }; # @{} tells Perl that this is a list.
2361 my $return_line = "";
2362 my $query_line = "";
2365 &testForTextFilePresence( $infile );
2367 open( MTL_IN, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2368 open( MTL_OUT, ">$outfile" ) || &dieWithUnexpectedError( "Cannot create file \"$outfile\"" );
2370 W: while ( $return_line = <MTL_IN> ) {
2371 if ( &isPfamCommentLine( $return_line )
2372 && ( !isRFline( $return_line ) || $mode != 1 ) ) {
2375 if ( @to_remove > 1 ) {
2376 foreach $n ( @to_remove ) {
2377 if ( $n ne $query && $return_line =~ /^\s*$n\s+/ ) {
2382 if ( $return_line =~ /^\s*$query\s+/ ) {
2383 $query_line = $return_line;
2385 elsif ( $query_line ne ""
2386 && ( $return_line !~ /\S+/ || isRFline( $return_line ) ) ) {
2387 print MTL_OUT $query_line;
2388 print MTL_OUT $return_line;
2392 print MTL_OUT $return_line;
2395 if ( $query_line ne "" ) {
2396 print MTL_OUT $query_line;
2415 # -----------------------------------------------------------
2417 # -----------------------------------------------------------
2422 # This gets the complete name of a TrEMBL sequence from a Pfam alignment.
2423 # I.e. it adds the species between "_" and "/XXX-XXX".
2425 # 1. Infile (alignment)
2427 # Returns the complete name found.
2428 # Last modified: 04/25/01
2429 sub getCompleteNameForTrEMBLquerySeq {
2431 my $infile = $_[ 0 ];
2432 my $query_name = $_[ 1 ];
2433 my $return_line = "";
2434 my $complete_name = "";
2435 my $before_slash = "";
2436 my $after_slash = "";
2438 &testForTextFilePresence( $infile );
2440 $query_name =~ /(.+)\/.+/;
2443 $query_name =~ /.+\/(.+)/;
2446 open( INGCN, $infile ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2447 while ( $return_line = <INGCN> ) {
2448 if ( $return_line =~ /^\s*($before_slash.+\/$after_slash)/ ) {
2449 $complete_name = $1;
2454 if ( $complete_name eq "" ) {
2455 &userError( "Could not find \"$query_name\" in \"$alignment\"." );
2457 return $complete_name;
2458 } ## getCompleteNameForTrEMBLquerySeq
2465 # Last modified: 02/26/01
2466 sub getDescriptionFromPfam {
2468 my $infile = $_[ 0 ];
2469 my $return_line = "";
2472 &testForTextFilePresence( $infile );
2474 open( INGDPF, $infile ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2475 while ( $return_line = <INGDPF> ) {
2476 if ( $return_line =~ /^\s*#=DE\s+(.+)/ ) {
2485 } ## getDescriptionFromPfam
2489 # Reads in (SWISS-PROT) species names from a file.
2490 # Names must be separated by newlines.
2491 # Lines beginning with "#" are ignored.
2492 # A possible "=" and everything after is ignored.
2493 # One argument: species-names-file name
2494 # Last modified: 04/24/01
2495 sub readSpeciesNamesFile {
2496 my $infile = $_[ 0 ];
2497 my $return_line = "";
2500 &testForTextFilePresence( $infile );
2502 open( IN_RSNF, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2503 while ( $return_line = <IN_RSNF> ) {
2504 if ( $return_line !~ /^\s*#/ && $return_line =~ /(\S+)/ ) {
2506 $species =~ s/=.+//;
2507 $Species_names_hash{ $species } = "";
2513 } ## readSpeciesNamesFile
2518 # This reads a raw sequence file or a FASTA sequence file
2519 # and saves it as a "cleaned up" FASTA sequence file.
2520 # If no > line is in the file, it creates one with new sequence name.
2521 # If a > line is in the file, it modifes it:
2522 # white space -> _, ";" ":" "," or "|" -> "~", deletes everything after ( or [;
2523 # length is limited to 40 characters.
2524 # Error if $new_seq_name is "" and no > line in the file.
2525 # Two/three arguments:
2528 # 3. new sequence name for > line(will be ignored if "")
2529 # If new sequence name is "":
2530 # returns the contents of the ">" line after modification.
2531 # If new sequence name is specified:
2532 # return new sequence name.
2533 # Last modified: 03/04/01
2534 sub seqFile2CleanedUpFastaFile {
2535 my $infile = $_[ 0 ];
2536 my $outfile = $_[ 1 ];
2537 my $new_seq_name = $_[ 2 ];
2538 my $return_line = "";
2540 my $saw_desc_line = 0;
2542 &testForTextFilePresence( $infile );
2544 open( IN_CUFF, "$infile" ) || &dieWithUnexpectedError( "Cannot open file \"$infile\"" );
2545 open( OUT_CUFF, ">$outfile" ) || &dieWithUnexpectedError( "Cannot create file \"$outfile\"" );
2547 while ( $return_line = <IN_CUFF> ) {
2548 if ( $return_line =~ /\w/ && $return_line !~ /^\s*#/ ) {
2549 if ( $return_line =~ /^\s*>/ ) {
2550 if ( $new_seq_name eq "" && $return_line !~ /_/ ) {
2551 &userError( "Description line of query file appears not to\n contain any species information. Use \"N=\" option." );
2553 elsif ( $new_seq_name eq "" ) {
2554 $return_line =~ s/^\s*>\s*(.*?)\s*/>$1/; # Removes spaces before and after >.
2555 $return_line = substr( $return_line, 0, $LENGTH_OF_NAME - 1 );
2556 $return_line =~ s/[\(\[].*//; # Removes "(" or "[" and everything after.
2557 $return_line =~ s/\s+$//; # Removes spaces at end.
2558 $return_line =~ s/\s+/_/g; # Replaces all white spaces with "_".
2559 $return_line =~ s/[;:,\|]/~/g; # Replaces all ";", ":", ",", or "|" with "~".
2560 $return_line =~ />\s*(\S+)/;
2562 $return_line .= "\n";
2565 $return_line = ">".$new_seq_name."\n";
2566 $mod_desc = $new_seq_name;
2571 if ( $saw_desc_line != 1 ) {
2572 if ( $new_seq_name ne "" ) {
2573 print OUT_CUFF ( ">".$new_seq_name."\n" );
2574 $mod_desc = $new_seq_name;
2577 &userError( "Query file is not a FASTA file\n and option \"N=\" has not been used." );
2581 $return_line =~ s/[^a-zA-Z\r\n\f]//g; # Removes non-letters from sequence.
2584 if ( $return_line =~ /\w/ ) {
2585 print OUT_CUFF $return_line;
2593 } ## seqFile2CleanedUpFastaFile
2598 # Purpose. Gets description for TrEMBL seqs,
2599 # from a file which contains the AC, DE, and OS
2600 # and which has to be generated from a TrEMBL flat file db
2601 # using "extractTrembl.pl".
2602 # The same file is used in method "addSpeciesToNonSPseqs".
2604 # 1. "ACDEOS" file (AC, DE, OS from TrEMBL db)
2605 # 2. AC ("_species/..." is removed)
2606 # Format: AC;DE;OS\n
2607 # Last modified: 02/14/02
2608 sub getDescriptionFromTrEMBL_ACDEOSfile {
2609 my $ACDEOS = $_[ 0 ];
2613 # Fill up (huge) hash, if not already done.
2615 &testForTextFilePresence( $ACDEOS );
2616 open( ACDEOS, "$ACDEOS" ) || &dieWithUnexpectedError( "Cannot open file \"$ACDEOS\"" );
2617 while ( $return_line = <ACDEOS> ) {
2618 if ( $return_line =~ /(\S+);([^;]+);/ ) {
2627 unless( exists( $AC_DE{ $AC } ) ) {
2628 #AC not present in "ACDEOS" file.
2632 $DE = $AC_DE{ $AC };
2634 if ( !$DE || $DE eq "" ) {
2640 } ## getDescriptionFromTrEMBL_ACDEOSfile
2644 # Purpose. Gets description for SP seqs,
2645 # from a file which contains the AC, DE, and OS
2646 # and which has to be generated from a sprot.dat flat file db
2647 # using "extractSWISS-PROT.pl".
2649 # 1. "ACDEOS" file (AC, DE, OS from SWISS-PROT db)
2650 # 2. SWISS-PROT AC (XXXX_XXXX)
2651 # Format: AC;DE;OS\n
2652 # Last modified: 02/12/02
2653 sub getDescriptionFromSWISSPROT_ACDEOSfile {
2654 my $SPACDEOS = $_[ 0 ];
2658 # Fill up (huge) hash, if not already done.
2659 unless ( %SP_AC_DE ) {
2660 &testForTextFilePresence( $SPACDEOS );
2661 open( ACDEOS, "$SPACDEOS" ) || &dieWithUnexpectedError( "Cannot open file \"$SPACDEOS\"" );
2662 while ( $return_line = <ACDEOS> ) {
2663 if ( $return_line =~ /(\S+);([^;]+);/ ) {
2664 $SP_AC_DE{ $1 } = $2;
2672 unless( exists( $SP_AC_DE{ $AC } ) ) {
2673 #AC not present in "ACDEOS" file.
2677 $DE = $SP_AC_DE{ $AC };
2679 if ( !$DE || $DE eq "" ) {
2685 } ## getDescriptionFromSWISSPROT_ACDEOSfile
2695 # -----------------------------------------------------------
2697 # -----------------------------------------------------------
2702 # Numeric value to be rounded to int.
2703 # Last modified: 10/17/01
2706 unless ( $x eq "-" ) {
2707 $x = int ( $x + 0.5 );
2715 # Last modified: 03/10/01
2716 sub cleanUpTempDir {
2717 unlink( $temp_dir."/MAKETREEOUT".$TREE_FILE_SUFFIX, $temp_dir."/MAKETREEOUT".$LOG_FILE_SUFFIX,
2718 $temp_dir."/MAKETREEOUT".$ALIGN_FILE_SUFFIX, $temp_dir."/MAKETREEOUT".$MULTIPLE_TREES_FILE_SUFFIX,
2719 $temp_dir."/MAKETREEOUT".$SUFFIX_PWD_NOT_BOOTS, $temp_dir."/".$DO_RIO_TEMP_OUTFILE,
2720 $temp_dir."/ALIGN1", $temp_dir."/ALIGN2", $temp_dir."/QUERY_SEQ", $temp_dir."/NBD_NJ_TREE",
2721 $temp_dir."/ALIGN2_BOOTSTRAPPED", $temp_dir."/ALIGN2_PROCESSED", $temp_dir."/DIST_TO_QUERY",
2722 $temp_dir."/DISTs_TO_QUERY", $temp_dir."/HMMALIGNOUT", $temp_dir."/NBD_INC_QUERY", $temp_dir."/PWD_INC_QUERY",
2723 $temp_dir."/HMMFILE", $temp_dir."/MOVETOLASTOUT" );
2738 # -----------------------------------------------------------
2739 # Command line and arguments, Errors
2740 # -----------------------------------------------------------
2746 # Last modified: 03/08/01
2747 sub analyzeCommandLine {
2755 $mode = shift( @_ );
2757 if ( $mode != 1 && $mode != 2 && $mode != 3 && $mode != 4 ) {
2758 &errorInCommandLine( "Mode can only be: 1, 2, 3, or 4." );
2762 foreach $args ( @_ ) {
2766 $char = substr( $args, 0, 1 );
2769 if ( length( $args ) > 1 ) {
2770 $arg = substr( $args, 2 );
2773 if ( $char =~ /A/ ) {
2774 if ( $alignment ne "" ) {
2775 &errorInCommandLine( "Entered same argument twice." );
2777 if ( $mode == 3 || $mode == 4 ) {
2778 &userErrorCheckForTextFileExistence( $arg );
2782 elsif ( $char =~ /B/ ) {
2783 if ( $t_sn != $THRESHOLD_SN_DEFAULT ) {
2784 &errorInCommandLine( "Entered same argument twice." );
2788 elsif ( $char =~ /C/ ) {
2789 if ( $description == 1 || $complete_description == 1 ) {
2790 &errorInCommandLine( "Entered same argument twice or conflicting arguments: \"D\" and \"C\"." );
2792 $complete_description = 1;
2794 elsif ( $char =~ /D/ ) {
2795 if ( $description == 1 || $complete_description == 1 ) {
2796 &errorInCommandLine( "Entered same argument twice or conflicting arguments: \"D\" and \"C\"." );
2800 elsif ( $char =~ /E/ ) {
2801 if ( $long_output != 0 ) {
2802 &errorInCommandLine( "Entered same argument twice." );
2806 elsif ( $char =~ /F/ ) {
2807 if ( $hmm_file ne "" || $hmm_name ne "" || $seed_aln_for_hmmbuild ne "") {
2808 &errorInCommandLine( "Entered same argument twice or conflicting arguments: \"F=\", \"H=\" and \"b=\"." );
2810 if ( $mode == 1 || $mode == 2 ) {
2811 &errorInCommandLine( "Can not use \"F=\" in modes 1 or 2." );
2813 &userErrorCheckForTextFileExistence( $arg );
2816 elsif ( $char =~ /G/ ) {
2817 if ( $species_names_file ne " " ) {
2818 &errorInCommandLine( "Entered same argument twice." );
2820 &userErrorCheckForTextFileExistence( $arg );
2821 $species_names_file = $arg;
2823 elsif ( $char =~ /H/ ) {
2824 if ( $hmm_name ne "" || $hmm_file ne "" || $seed_aln_for_hmmbuild ne "" ) {
2825 &errorInCommandLine( "Entered same argument twice or conflicting arguments: \"F=\", \"H=\" and \"b=\"." );
2827 if ( $mode == 1 || $mode == 2 ) {
2828 &errorInCommandLine( "Can not use \"H=\" in modes 1 or 2." );
2832 elsif ( $char =~ /I/ ) {
2833 if ( $safe_nhx != 0 ) {
2834 &errorInCommandLine( "Entered same argument twice." );
2838 elsif ( $char =~ /K/ ) {
2840 &errorInCommandLine( "Entered same argument twice." );
2844 elsif ( $char =~ /L/ ) {
2845 if ( $t_orthologs != $THRESHOLD_ORTHOLOGS_DEFAULT ) {
2846 &errorInCommandLine( "Entered same argument twice." );
2848 $t_orthologs = $arg;
2850 elsif ( $char =~ /N/ ) {
2851 if ( $query_name ne "" ) {
2852 &errorInCommandLine( "Entered same argument twice." );
2856 elsif ( $char =~ /O/ ) {
2857 if ( $outfile ne "" ) {
2858 &errorInCommandLine( "Entered same argument twice." );
2862 elsif ( $char =~ /P/ ) {
2863 if ( $sort != $SORT_DEFAULT ) {
2864 &errorInCommandLine( "Entered same argument twice." );
2868 elsif ( $char =~ /Q/ ) {
2869 if ( $seqX_file ne "" ) {
2870 &errorInCommandLine( "Entered same argument twice." );
2872 &userErrorCheckForTextFileExistence( $arg );
2875 elsif ( $char =~ /S/ ) {
2876 if ( $species_tree_file ne "" ) {
2877 &errorInCommandLine( "Entered same argument twice." );
2879 &userErrorCheckForTextFileExistence( $arg );
2880 $species_tree_file = $arg;
2882 elsif ( $char =~ /T/ ) {
2883 if ( $mode == 1 || $mode == 2 ) {
2884 &errorInCommandLine( "Matrix cannot be changed in modes 1 and 2 (is dictated by \"\$MATRIX_FOR_PWD\" for mode 1)." );
2886 if ( $arg eq "J" ) {
2889 elsif ( $arg eq "P" ) {
2892 elsif ( $arg eq "B" ) {
2895 elsif ( $arg eq "M" ) {
2898 elsif ( $arg eq "V" ) {
2901 elsif ( $arg eq "W" ) {
2905 &errorInCommandLine( "Use T=J for JTT, P for PAM, B for BLOSUM62, M for mtREV24, V for VT, W for WAG." );
2908 elsif ( $char =~ /U/ ) {
2909 if ( $t_orthologs_dc != $THRESHOLD_ORTHOLOGS_DEFAULT_DC ) {
2910 &errorInCommandLine( "Entered same argument twice." );
2912 $t_orthologs_dc = $arg;
2914 elsif ( $char =~ /X/ ) {
2915 if ( $warn_more_than_one_ortho
2916 != $WARN_MORE_THAN_ONE_ORTHO_DEFAULT ) {
2917 &errorInCommandLine( "Entered same argument twice." );
2919 $warn_more_than_one_ortho = $arg;
2921 elsif ( $char =~ /Y/ ) {
2922 if ( $warn_no_orthos != $WARN_NO_ORTHOS_DEFAULT ) {
2923 &errorInCommandLine( "Entered same argument twice." );
2925 $warn_no_orthos = $arg;
2927 elsif ( $char =~ /Z/ ) {
2928 if ( $warn_one_ortho != $WARN_ONE_ORTHO_DEFAULT ) {
2929 &errorInCommandLine( "Entered same argument twice." );
2931 $warn_one_ortho = $arg;
2933 elsif ( $char =~ /a/ ) {
2934 if ( $boostraps_for_makeTree != $BOOSTRAPS_FOR_MAKETREE_DEFAULT ) {
2935 &errorInCommandLine( "Entered same argument twice." );
2937 if ( $mode == 1 || $mode == 2 ) {
2938 &errorInCommandLine( "Modes 1 and 2: Cannot change bootstrap value. Do not use \"a=\"." );
2940 $boostraps_for_makeTree = $arg;
2941 if ( $boostraps_for_makeTree < 10 ) {
2942 &errorInCommandLine( "Bootsraps cannot be smaller than 10." );
2945 elsif ( $char =~ /b/ ) {
2946 if ( $hmm_name ne "" || $hmm_file ne "" || $seed_aln_for_hmmbuild ne "" ) {
2947 &errorInCommandLine( "Entered same argument twice or conflicting arguments: \"F=\", \"H=\" and \"b=\"." );
2949 if ( $mode == 1 || $mode == 2 ) {
2950 &errorInCommandLine( "Can not use \"b=\" in modes 1 or 2." );
2952 &userErrorCheckForTextFileExistence( $arg );
2953 $seed_aln_for_hmmbuild = $arg;
2955 elsif ( $char =~ /f/ ) {
2956 if ( $no_frags ne 0 ) {
2957 &errorInCommandLine( "Entered same argument twice." );
2961 elsif ( $char =~ /j/ ) {
2962 if ( $temp_dir ne "" ) {
2963 &errorInCommandLine( "Entered same argument twice." );
2967 elsif ( $char =~ /p/ ) {
2968 if ( $output_ultraparalogs != 0 ) {
2969 &errorInCommandLine( "Entered same argument twice." );
2971 $output_ultraparalogs = 1;
2973 elsif ( $char =~ /s/ ) {
2974 if ( $non_sp != 1 ) {
2975 &errorInCommandLine( "Entered same argument twice." );
2979 elsif ( $char =~ /v/ ) {
2980 $t_ultra_paralogs = $arg;
2982 elsif ( $char =~ /x/ ) {
2983 if ( $output_HTML == 1 ) {
2984 &errorInCommandLine( "Entered same argument twice." );
2988 elsif ( $char =~ /y/ ) {
2989 if ( $seed_for_makeTree != $SEED_FOR_MAKETREE_DEFAULT ) {
2990 &errorInCommandLine( "Entered same argument twice." );
2992 $seed_for_makeTree = $arg;
2994 elsif ( $char =~ /\+/ ) {
2995 if ( $parallel != 0 ) {
2996 &errorInCommandLine( "Entered same argument twice." );
3001 &errorInCommandLine( "Unknown option: \"$args\"." );
3004 } ## analyzeCommandLine
3009 # Last modified: 03/08/01
3010 sub CheckArguments {
3012 if ( $outfile eq "" ) {
3013 &errorInCommandLine( "Outfile not specified. Use \"O=\"." );
3015 if ( $alignment eq "" ) {
3016 &errorInCommandLine( "Need to specify a Pfam alignment file. Use \"A=\"." );
3018 if ( -e $outfile ) {
3019 &userError( "\"$outfile\" already exists." );
3022 if ( $sort < 0 || $sort > 17 ) {
3023 &errorInCommandLine( "Sort priority (\"P=\") must be between 0 and 15." );
3026 if ( $parallel == 1 && $mode != 1 ) {
3027 &errorInCommandLine( "Parallelization only implemented for mode 1." );
3030 if ( $mode == 1 || $mode == 2 ) {
3032 if ( $species_names_file =~ /\S/ ) {
3033 &errorInCommandLine( "Modes 1 and 2: Cannot use species names file. Do not use \"G=\"." );
3035 if ( $non_sp == 0 ) {
3036 &errorInCommandLine( "Can not use \"s\" in modes 1 or 2." );
3038 if ( $no_frags == 1 ) {
3039 &errorInCommandLine( "Can not use \"f\" in modes 1 or 2." );
3043 if ( $mode == 1 || $mode == 3 ) {
3044 if ( $seqX_file eq "" ) {
3045 &errorInCommandLine( "Modes 1 and 3: Need to specify a query file. Use \"Q=\"." );
3050 if ( $hmm_name eq "" && $hmm_file eq "" && $seed_aln_for_hmmbuild eq "" ) {
3051 &errorInCommandLine( "Mode 3: Need to specify either a HMM name (\"H=\"), a HMM file (\"F=\") or build a HMM (\"b=\")." );
3056 if ( $hmm_name ne "" || $hmm_file ne "" || $seed_aln_for_hmmbuild ne "" ) {
3057 &errorInCommandLine( "Mode 1: Must not specify a HMM name (\"H=\"), a HMM file (\"F=\") or build a HMM (\"b=\")." );
3061 if ( $mode == 2 || $mode == 4 ) {
3062 if ( $seqX_file ne "" ) {
3063 &errorInCommandLine( "Modes 2 and 4: Must not specify a query file. Do not use \"Q=\".\n" );
3065 if ( $query_name eq "" ) {
3066 &errorInCommandLine( "Modes 2 and 4: Must specify a query name. Use \"N=\"." );
3068 if ( $hmm_name ne "" || $hmm_file ne "" || $seed_aln_for_hmmbuild ne "" ) {
3069 &errorInCommandLine( "Modes 2 and 4: Cannot specify a HMM name (\"H=\"), a HMM file (\"F=\") or build a HMM (\"b=\")." );
3074 if ( $non_sp != 1 && $no_frags == 1 ) {
3075 &errorInCommandLine( "\"Fragments\" are assumed to be only found in non SWISS-PROT seqs.\n Do not use \"f\" together with \"s\"." );
3078 if ( $output_HTML == 1 ) {
3080 &errorInCommandLine( "Output in HTML (for web server) only for mode 1." );
3084 if ( $output_ultraparalogs == 0 && $t_ultra_paralogs != $T_ULTRA_PARALOGS_DEFAULT ) {
3085 &errorInCommandLine( "Use \"p\" to output ultra paralogs (cannot use \"v=\" without \"p\")." );
3088 if ( $non_sp == 1 && ( $mode == 3 || $mode == 4 ) ) {
3089 unless ( ( -s $TREMBL_ACDEOS_FILE ) && ( -f $TREMBL_ACDEOS_FILE ) && ( -T $TREMBL_ACDEOS_FILE ) ) {
3090 my $message = "AC, DE, and OS-file not found.\n";
3091 $message .= " If non SWISS-PROT sequences are not to be removed from the\n";
3092 $message .= " Pfam alignment (\"s\" option), variable \"\$TREMBL_ACDEOS_FILE\" needs\n";
3093 $message .= " to point to a file containing AC, DE, and OS from TrEMBL. Such a\n";
3094 $message .= " file can be generated with \"extractTrembl.pl\".\n";
3095 $message .= " Currently, \"TREMBL_ACDEOS_FILE\" points to:\n";
3096 $message .= " $TREMBL_ACDEOS_FILE";
3097 &userError( $message );
3101 unless ( ( -s $species_tree_file ) && ( -f $species_tree_file ) && ( -T $species_tree_file ) ) {
3102 my $message = "Species tree file not found.\n";
3103 $message .= " A valid species tree must be specified.\n";
3104 $message .= " Either, use \"S=\" option, or set variable\n";
3105 $message .= " \"\$SPECIES_TREE_FILE_DEFAULT\".\n";
3106 $message .= " Currently, this program looks for a species tree at:\n";
3107 $message .= " $species_tree_file";
3108 &userError( $message );
3111 if ( $hmm_name ne "" ) {
3112 unless ( ( -s $PFAM_HMM_DB ) && ( -f $PFAM_HMM_DB ) ) {
3113 my $message = "HMMER model db file not found.\n";
3114 $message .= " If \"H=\" option is used, a valid HMMER model db needs\n";
3115 $message .= " to be specified with variable \"\$PFAM_HMM_DB\".\n";
3116 $message .= " Currently, \"\$PFAM_HMM_DB\" points to:\n";
3117 $message .= " $PFAM_HMM_DB";
3118 &userError( $message );
3125 # Last modfied: 06/25/01
3126 sub userErrorCheckForTextFileExistence {
3128 unless ( ( -s $file ) && ( -f $file ) && ( -T $file ) ) {
3129 &userError( "\"$file\" does not exist or is not a plain text file." );
3131 } ## checkForFileExistence
3135 # One argument: the error message.
3136 # Last modified: 04/26/01
3137 sub errorInCommandLine {
3139 my $error = $_[ 0 ];
3142 print " rio.pl version: $VERSION\n";
3145 print " Error in command line:\n";
3146 if ( $error ne "" ) {
3150 print " Type \"rio.pl\" (no arguments) for more information.\n";
3153 } ## errorInCommandLine
3158 # One argument: the error message.
3159 # Last modified: 04/26/01
3162 my $error = $_[ 0 ];
3165 print " rio.pl version: $VERSION\n";
3169 if ( $error ne "" ) {
3173 print " Type \"rio.pl\" (no arguments) for more information.\n";
3184 # Last modified: 04/26/01
3188 print " rio.pl version: $VERSION\n";
3189 print " ------\n\n";
3192 Copyright (C) 2000-2002 Washington University School of Medicine
3193 and Howard Hughes Medical Institute
3197 Author: Christian M. Zmasek
3198 zmasek\@genetics.wustl.edu
3199 http://www.genetics.wustl.edu/eddy/people/zmasek/
3201 Last modified 05/26/02
3203 Available at : http://www.genetics.wustl.edu/eddy/forester/
3204 RIO webserver: http://www.rio.wustl.edu/
3207 Zmasek C.M. and Eddy S.R. (2002)
3208 RIO: Analyzing proteomes by automated phylogenomics using
3209 resampled inference of orthologs.
3210 BMC Bioinformatics 3:14
3211 http://www.biomedcentral.com/1471-2105/3/14/
3213 It is highly recommended that you read this paper before
3214 installing and/or using RIO. (Included in the RIO
3215 distribution as PDF: "RIO.pdf".)
3218 Before rio.pl can be used, some variables in rio_module.pm need to be set,
3219 as described in RIO_INSTALL.
3223 Usage: rio.pl <Mode: 1, 2, 3, or 4> <tagged arguments, single letter arguments>
3230 % RIO1.1/perl/rio.pl 1 A=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out1 p I C E
3232 % RIO1.1/perl/rio.pl 2 A=aconitase N=LEU2_LACLA/5-449 O=out2 p I C E
3234 % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase H=aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out3 p I C E
3236 % RIO1.1/perl/rio.pl 4 A=/path/to/my/pfam/Full/aconitase N=LEU2_LACLA/5-449 O=out4 p I C E
3238 % RIO1.1/perl/rio.pl 3 A=/path/to/my/pfam/Full/aconitase b=/path/to/my/pfam/Seed/aconitase Q=RIO1.1/LEU2_HAEIN N=QUERY_HAEIN O=out5 p I C E
3245 1: RIO analysis based on precalculated pairwise distances
3246 alignment does not contain query sequence
3248 2: RIO analysis based on precalculated pairwise distances
3249 alignment does contain query sequence
3251 3: RIO analysis based on Pfam alignments,
3252 alignment does not contain query sequence
3254 4: RIO analysis based on Pfam alignments,
3255 alignment does contain query sequence
3262 No "G=", "H=", "F=", "T=", "a=", "b=", "s", "f" in modes 1 and 2.
3265 A=<String> Pfam alignment name (mandatory). This specifies the alignment
3266 against which the RIO analysis is to be performed.
3267 In modes 1 and 2: Pfam model (alignment) name
3268 (e.g. "A=aconitase").
3269 In modes 3 and 4: Pfam alignment path/name
3270 (e.g. "A=/path/to/your/pfam/Full/aconitase").
3272 Q=<String> Path/name of file containing the query sequence
3273 (in FASTA format or raw sequence) (mandatory in modes 1 and 3).
3275 N=<String> Query name (mandatory). This must include the SWISS-PROT code
3276 for the species of the query after a "_" (e.g. "N=QUERY_HAEIN").
3277 If the query sequence is already in the alignment (modes 2 and 4)
3278 the complete name needs to be specified -- including "/xxx-xxx".
3280 O=<String> Output file path/name (mandatory).
3282 T=<char> Model for pairwaise distance calculation:
3283 J=JTT, B=BLOSUM 62, M=mtREV24, V=VT, W=WAG, P=PAM.
3284 BLOSUM 62 is default.
3285 (Not in modes 1 and 2; these modes use \$MATRIX_FOR_PWD instead.)
3287 In modes 1 and 3, a HMM is needed to align the query sequence to
3288 the alignment and either one of the following options must be
3290 H=<String> HMM name: This uses hmmfetch to retrieve a HMM from
3292 F=<String> HMM file: This directly reads the HMM from a file.
3294 S=<String> Species tree file path/name (in NHX format) (optional).
3295 If not specified, \$SPECIES_TREE_FILE_DEFAULT is used.
3297 G=<String> Species names file (optional). Only sequences associated with
3298 species found in this file are used.
3299 In the species names file, individual species names must be
3300 separated by newlines and lines starting with "#" are ignored.
3301 While only sequences associated with species found in the species
3302 tree ("S=") are used for the actual RIO analysis, this allows to
3303 remove sequences prior to tree calculation (which is the most
3304 time consuming step).
3306 P=<int> Sort priority (default is 12):
3308 1 : Ortholog, Super ortholog
3309 2 : Super ortholog, Ortholog
3310 3 : Ortholog, Distance
3311 4 : Distance, Ortholog
3312 5 : Ortholog, Super ortholog, Distance
3313 6 : Ortholog, Distance, Super ortholog
3314 7 : Super ortholog, Ortholog, Distance
3315 8 : Super ortholog, Distance, Ortholog
3316 9 : Distance, Ortholog, Super ortholog
3317 10 : Distance, Super ortholog, Ortholog
3318 11 : Ortholog, Subtree neighbor, Distance
3319 12 : Ortholog, Subtree neighbor, Super ortholog, Distance (default)
3320 13 : Ortholog, Super ortholog, Subtree neighbor, Distance
3321 14 : Subtree neighbor, Ortholog, Super ortholog, Distance
3322 15 : Subtree neighbor, Distance, Ortholog, Super ortholog
3323 16 : Ortholog, Distance, Subtree neighbor, Super ortholog
3324 17 : Ortholog, Subtree neighbor, Distance, Super ortholog
3326 a=<int> Bootstraps for tree construction (not in modes 1 and 2).
3329 L=<int> Threshold for orthologies for output. Default is 0.
3330 v=<int> Threshold for ultra-paralogies for output. Default is 50.
3332 U=<int> Threshold for orthologies for distance calculation. Default is 60.
3334 X=<int> In case of more than one putative orthologs:
3335 number of sd the distance query - LCA has to differ
3336 from the mean to generate a warning. Default is 2.
3338 Y=<int> In case of no putative orthologs:
3339 number of sd the distance query - root has to differ
3340 from mean to generate a warning. Default is 2.
3342 Z=<double> In case of one putative ortholog:
3343 threshold for factor between the two distances to their
3344 LCA (larger/smaller) to generate a warning. Default is 2.
3346 B=<int> Threshold for subtree-neighborings. Default is 0.
3348 b=<String> Build HMM from seed alignment with "hmmbuild -s" (optional).
3349 This is to prevent from finding multiple domains per sequence
3350 (i.e. prevents "cutting" the query sequence). Give path/name to
3353 j=<String> Name for temporary directory (optional).
3355 y=<int> Seed for random number generator. Default is 41.
3357 I Create and save a rooted, with duplication vs speciation,
3358 and orthology information annotated gene tree.
3359 If precalculated distances are used (modes 1 and 2): this gene
3360 tree is a NJ tree calculated based on the non-bootstrap resampled
3361 (original) pairwise distances.
3362 If precalculated distances are not used (modes 3 and 4): this gene
3363 is a consenus tree with ML branch length values and is also
3364 annotated with bootstrap values for each node.
3367 p Output ultra-paralogs.
3368 D Description from SWISS-PROT and TrEMBL.
3369 C Complete description from SWISS-PROT and TrEMBL.
3370 E 118 character output instead of 78 character output.
3372 K Keep intermediate files (they will go into the same directory
3373 as the output file, their names are the same as of the output
3374 file, with various suffixes added).
3376 s Ignore non SWISS-PROT sequences (i.e. sequences from TrEMBL)
3377 in the Pfam alignment.
3379 f Try to ignore TrEMBL "fragments" (sequences with "fragment" in
3382 + Parallel, use machines listed in file \$NODE_LIST.
3384 x RIO used as web server -- HTML output.