MuSeqBox is a
program for examining BLAST output from multi-query
sequence searches. It parses the BLAST output, extracts the
informative parameters of BLAST hits, and saves them in
tabular form in either text or HTML format. The hit tables
are optionally further analyzed with the program to produce
subsets of BLAST hits according to user-specified criteria.
In particular, BLASTX output may be further analyzed to
flag queries that may represent alternatively spliced
transcripts (e.g., those with an unusually large inserted
or deleted segment), encode full-length coding sequences, or
contain repeat structures. Users of the program should cite
Parameter field specifications
Queries selected by MuSeqBox are only potential candidates for the structures the user is looking for, because the selection is based on BLAST local sequence alignment results. Candidates should be confirmed by global sequence alignment and by laboratory experiments.
Select queries using user-set criteria: Users may specify their own criteria for selecting queries with BLAST hits. A query is selected if any HSP (high-scoring segment pair) meets the specified numerical ranges for: QLen (query sequence length); HLen (HSP length); SLen (database sequence length); CovS (percent coverage of the subject sequence, i.e., HLen relative to SLen); CovQ (percent coverage of the query sequence, i.e., HLen relative to QLen); pid (percent identity in the HSP); Gaps (number of gap symbols in the HSP for gapped alignment); Score (alignment bit score); and Eval (BLAST search expectation value).
Criteria are specified with the comparison operators >, >=, <, and <=. Multiple criteria can be set by checking the corresponding checkboxes and are combined with a logical AND. For example, to select queries with a sequence length of at least 600 nt and a BLAST expectation value less than 1e-10, the user first checks the two checkboxes for QLen and Eval, then fills the two blanks with 600 and 1e-10, respectively.
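The AND combination of criteria can be illustrated with a minimal Python sketch. This is not MuSeqBox code: the function names, the dictionary layout of the hits, and the sample values are assumptions made for illustration only.

```python
import operator

# The four comparison operators named in the text, mapped to Python functions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def hsp_passes(hsp, criteria):
    """True if one HSP (a dict of field -> number) meets every criterion (logical AND)."""
    return all(OPS[op](hsp[field], value) for field, op, value in criteria)

def select_queries(hits, criteria):
    """hits maps a query ID to its list of HSP dicts (hypothetical layout).
    A query is kept if any one of its HSPs passes all criteria."""
    return [qid for qid, hsps in hits.items()
            if any(hsp_passes(h, criteria) for h in hsps)]

# The example from the text: QLen >= 600 AND Eval < 1e-10.
criteria = [("QLen", ">=", 600), ("Eval", "<", 1e-10)]

# Made-up hits, not real BLAST results.
hits = {
    "q1": [{"QLen": 750, "Eval": 1e-40}],
    "q2": [{"QLen": 450, "Eval": 1e-50}],                              # too short
    "q3": [{"QLen": 900, "Eval": 1e-5}, {"QLen": 900, "Eval": 1e-20}], # second HSP passes
}
selected = select_queries(hits, criteria)
```

Here q2 is rejected because its length is below 600 even though its expectation value is very small, which is the effect of combining the criteria with AND.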
Select queries that are globally highly similar to matching protein subjects: This selection requires three arguments: pid, the minimal percent identity in each HSP; mao, the maximal allowed overlap at either end of the selected HSPs; and scv, the cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS over all selected HSPs that are non-overlapping or overlap by no more than mao).
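One possible reading of how pid, mao, and scv interact is sketched below in Python. The greedy left-to-right choice of HSPs along the subject, the function name, and the sample coordinates are all assumptions; the text does not say how MuSeqBox itself picks among overlapping HSPs.

```python
def cumulative_subject_coverage(hsps, min_pid, mao):
    """Sum CovS over HSPs with Pid >= min_pid, walking along the subject by
    start coordinate and skipping any HSP that overlaps the HSPs kept so far
    by more than mao subject residues (a greedy choice, assumed here)."""
    kept_end = None
    total = 0.0
    eligible = [h for h in hsps if h["Pid"] >= min_pid]
    for h in sorted(eligible, key=lambda h: h["Sx"]):
        if kept_end is not None and kept_end - h["Sx"] + 1 > mao:
            continue  # overlaps the kept HSPs by more than mao residues
        total += h["CovS"]
        kept_end = h["Sy"] if kept_end is None else max(kept_end, h["Sy"])
    return total

# Made-up HSPs on a subject of length 200 (CovS = 100 * HLen / SLen).
hsps = [
    {"Sx": 1,   "Sy": 100, "Pid": 90.0, "CovS": 50.0},
    {"Sx": 50,  "Sy": 120, "Pid": 95.0, "CovS": 35.5},  # overlaps the first by 51
    {"Sx": 96,  "Sy": 200, "Pid": 85.0, "CovS": 52.5},  # overlaps the first by 5
    {"Sx": 150, "Sy": 200, "Pid": 40.0, "CovS": 25.5},  # fails the pid threshold
]
total = cumulative_subject_coverage(hsps, min_pid=80.0, mao=10)
selected = total >= 90.0  # compare against an scv threshold of 90 percent
```

Because overlapping residues within the mao allowance are counted in both HSPs, the summed coverage can slightly exceed 100 percent under this reading.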
Select queries that potentially encode full-length coding sequences: This selection requires the following parameters: v5s, the maximal length of the variable 5'-terminal region (not contained in any HSP) of the subject sequence; v3s, the maximal length of the variable 3'-terminal region of the subject sequence; v5q and v3q, the same for the query sequence; scv, the cumulative percent coverage of the matched subject sequence (i.e., the sum of CovS for all non-overlapping HSPs); and qcv, the cumulative percent coverage of the query sequence (i.e., the sum of CovQ for all non-overlapping HSPs). Precisely, -F v5s v3s v5q v3q selects subject/query pairs for which the variable segment (not covered by HSPs) at the 5'-end is either at most v5s letters in the subject sequence OR at most v5q letters in the query sequence, AND the variable segment at the 3'-end is either at most v3s letters in the subject sequence OR at most v3q letters in the query sequence.
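The OR/AND rule for the variable terminal segments can be written out as a short Python sketch. The function names and sample coordinates are hypothetical, and the sketch assumes HSP coordinates are 1-based and given in forward orientation, so that the 5'-end corresponds to position 1; it does not cover the scv and qcv thresholds.

```python
def uncovered_ends(hsp_ranges, length):
    """Return (5'-end, 3'-end) lengths not covered by any HSP.
    hsp_ranges is a list of (from, to) pairs, 1-based and inclusive."""
    start = min(a for a, b in hsp_ranges)
    end = max(b for a, b in hsp_ranges)
    return start - 1, length - end

def is_full_length(subject_ends, query_ends, v5s, v3s, v5q, v3q):
    """Apply the rule from the text: (5' subject OR 5' query) AND (3' subject OR 3' query)."""
    s5, s3 = subject_ends
    q5, q3 = query_ends
    five_prime_ok = s5 <= v5s or q5 <= v5q
    three_prime_ok = s3 <= v3s or q3 <= v3q
    return five_prime_ok and three_prime_ok

# Made-up coordinates: subject of 350 residues, query of 1131 nucleotides.
subject_ends = uncovered_ends([(4, 120), (121, 310)], length=350)    # 3 and 40
query_ends = uncovered_ends([(201, 551), (552, 1121)], length=1131)  # 200 and 10
result = is_full_length(subject_ends, query_ends, v5s=10, v3s=10, v5q=30, v3q=30)
```

In this example the 5'-end passes on the subject side (3 letters uncovered) and the 3'-end passes on the query side (10 letters uncovered), so the pair is selected even though neither sequence satisfies both of its own limits.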
Select queries that represent potential alternatively spliced transcripts: This selection requires two parameters: indel, the minimal size of the inserted or deleted sequence segment, and type, which is either an extra insertion in the query (unmatched residues in the query between HSPs that are contiguous in the protein subject) or an extra deletion from the query (unmatched residues in the subject between HSPs that are contiguous in the query sequence).
Note: When the BLAST search finds only a single HSP, a large insertion in the query sequence may appear as gaps within that HSP under a gapped alignment scoring system. Users may set the Gaps criterion among the user-set criteria options to select queries that may contain such a large extra insertion.
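For the multi-HSP case of the alternative-splicing selection described above, the idea of an unmatched segment between consecutive HSPs can be sketched in Python as follows. The function name, the tol parameter (how far the other sequence may deviate from exact contiguity), and the sample coordinates are assumptions; the sketch also assumes forward-orientation coordinates and measures each gap in the units of the sequence it lies on.

```python
def find_indels(hsps, indel, kind, tol=0):
    """Scan consecutive HSPs (ordered by subject start) for a large unmatched
    segment on one sequence while the other stays contiguous within tol.
    kind is 'insertion' (extra segment in the query) or 'deletion' (segment
    present in the subject but missing from the query)."""
    ordered = sorted(hsps, key=lambda h: h["Sx"])
    found = []
    for prev, nxt in zip(ordered, ordered[1:]):
        q_gap = nxt["Qx"] - prev["Qy"] - 1  # unmatched query letters between HSPs
        s_gap = nxt["Sx"] - prev["Sy"] - 1  # unmatched subject letters between HSPs
        if kind == "insertion" and q_gap >= indel and abs(s_gap) <= tol:
            found.append((prev["Qy"] + 1, nxt["Qx"] - 1))
        elif kind == "deletion" and s_gap >= indel and abs(q_gap) <= tol:
            found.append((prev["Sy"] + 1, nxt["Sx"] - 1))
    return found

# Made-up HSPs: contiguous in the subject, 150 unmatched letters in the query.
hsps = [
    {"Qx": 1,   "Qy": 300, "Sx": 1,   "Sy": 100},
    {"Qx": 451, "Qy": 750, "Sx": 101, "Sy": 200},
]
insertions = find_indels(hsps, indel=90, kind="insertion")
deletions = find_indels(hsps, indel=90, kind="deletion")
```

The returned pairs are the coordinates of the unmatched segment on the sequence that carries it, which is the region a follow-up global alignment would need to examine.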
Select queries that may contain repeats or align to database sequences containing repeats: This selection requires two parameters: rps, the minimal potential repeat size (number of nucleotides or amino acids), and src, the origin of the repeats, either the query sequence or the subject sequence.
The MuSeqBox output consists of three parts:
Informative parameters extracted from the BLAST hits are: QueryID (query sequence identifier, usually referring to GenBank GI or ACCESSION number), SubjectID (subject sequence identifier, usually referring to GenBank GI or ACCESSION number), QLen (query sequence length), HSP (number of HSPs), HLen (HSP length), CovQ (percent query coverage, i.e., 100(HLen/QLen)), Qx and Qy (the query coordinates of the HSP, i.e., from and to), Sx and Sy (the subject coordinates of the HSP, i.e., from and to), SLen (subject sequence length), CovS (percent subject coverage, i.e., 100(HLen/SLen)), Pid (percent identical residues in the HSP), Psi (percent similar residues in the HSP), Ngap (number of gap symbols in the HSP alignment), Frame (reading frame used for translating nucleotide sequence to protein sequence), Score (alignment bit score), Eval (expectation value), Db (the name of the database in which the subject sequence is deposited), Annotation, and Source (the species from which the annotation is derived).
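The two coverage formulas given above, 100(HLen/QLen) and 100(HLen/SLen), amount to the following small Python sketch. The function name and the sample lengths are illustrative assumptions, and HLen is taken to be expressed in the same units as the length it is divided by.

```python
def percent_coverage(hlen, total_len):
    """Percent of a sequence of length total_len covered by an HSP of length hlen."""
    return 100.0 * hlen / total_len

# Made-up lengths: an HSP of 150 letters against a query of 600 and a subject of 200.
cov_q = percent_coverage(150, 600)  # CovQ = 100(HLen/QLen)
cov_s = percent_coverage(150, 200)  # CovS = 100(HLen/SLen)
```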
For the informative parameters, the MuSeqBox provides four print options:
For the web-based output, the query identifier (QueryID) and subject identifier (SubjectID) are linked to the GenBank nucleotide database or the GenPept database according to their types. Furthermore, each expectation value (Eval) is linked to a BLAST search against GenBank's nr database, so that the query sequence of that record can be re-submitted.