
4.9 BLAST server

4.9.1 What is BLAST

BLAST (Basic Local Alignment Search Tool) [2] is a fast, heuristic search tool for sequence databases.

4.9.2 Getting into BLAST

BLAST can be accessed from the front page (index page) of H-InvDB by clicking "BLAST" on the toolbar, as shown below:


Direct access to the H-Inv BLAST search page

Users may access the top search page of BLAST directly at the following URL:
http://h-invitational.jp/hinv/blast/blasttop.cgi

Performing a search using BLAST

Paste a sequence in FASTA format into the text box indicated in Fig. 4.9.1. It is also possible to upload a FASTA-format sequence from a file by clicking the browse box and following the path to the file that contains the sequence.


Fig. 4.9.1 View of the BLAST search page taken from H-Inv.

FASTA format was developed by Pearson [3] and is one of the simplest formats for sequence data. The basic format is as follows:

>sp|P05064|ALFA_MOUSE Fructose-bisphosphate aldolase A (EC 4.1.2.13) (Muscle-type aldolase) (Aldolase 1) - Mus musculus (Mouse).
PHPYPALTPEQKKELSDIAHRIVAPGKGILAADESTGSIAKRLQSIGTENTEENRRFYRQ
LLLTADDRVNPCIGGVILFHETLYQKADDGRPFPQVIKSKGGVVGIKVDKGVVPLAGTNG
ETTTQGLDGLSERCAQYKKDGADFAKWRCVLKIGEHTPSALAIMENANVLARYASICQQN
GIVPIVEPEILPDGDHDLKRCQYVTEKVLAAVYKALSDHHVYLEGTLLKPNMVTPGHACT
QKFSNEEIAMATVTALRRTVPPAVTGVTFLSGGQSEEEASINLNAINKCPLLKPWALTFS
YGRALQASALKAWGGKKENLKAAQEEYIKRALANSLACQGKYTPSGQSGAAASESLFISN
HAY

The identifier line always begins with a greater-than sign (>) and ends with a newline. The sequence begins on the next line and also ends with a newline. Multiple sequences can be submitted in the same file. For reasons of interoperability, please avoid using white space characters in the identifier line.
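The format rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of H-InvDB or BLAST; the function name `parse_fasta` and the two short example records are hypothetical.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (identifier line, sequence) tuples.

    A record starts at a line beginning with '>' and runs until the next
    '>' line or the end of the input. Sequence lines are joined and any
    white space inside them is removed.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical input containing two records in one submission.
example = """>seq1 first example
PHPYPALTPEQ KKELSDIAHR
IVAPGKG
>seq2 second example
MKV
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Because each record is delimited only by its identifier line, several sequences can be read from the same file in a single pass.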

If you are unfamiliar with the FASTA format, a linked page (docs/fasta.html) explains it and lists which non-alphanumeric characters are allowed.

Next, using the pull-down menu (Fig. 4.9.1), select the type of search. The options for searches are as follows:

Next, select the type of database you wish to search, again using the pull-down menu (Fig. 4.9.1). The sequence database type selected [protein or nucleotide] depends on both the query sequence type [nucleotide or amino acid] and the type of search being performed (protein/protein, protein/nucleotide, nucleotide/protein, or nucleotide/nucleotide).
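The relationship between query type, database type, and search type can be sketched as a lookup table. The program names below are the standard BLAST program names for each combination; whether the H-InvDB Program menu offers exactly this set is an assumption, and the helper `choose_program` is hypothetical.

```python
# Standard BLAST program for each (query type, database type) pair.
# Assumption: the H-InvDB menu may offer a different subset.
PROGRAM_BY_TYPES = {
    ("protein", "protein"): "blastp",
    ("protein", "nucleotide"): "tblastn",    # database is translated
    ("nucleotide", "protein"): "blastx",     # query is translated
    ("nucleotide", "nucleotide"): "blastn",
}


def choose_program(query_type, database_type):
    """Return the BLAST program matching the query and database types."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAM_BY_TYPES:
        raise ValueError(f"unsupported combination: {key}")
    return PROGRAM_BY_TYPES[key]


print(choose_program("nucleotide", "protein"))
```

A fifth standard program, tblastx, also compares a nucleotide query with a nucleotide database, but translates both; it is left out of the table to keep the mapping one-to-one.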

Once you have entered the query sequence in the sequence submission box and selected the appropriate choices from both the Program and Database pull-down menus, you may begin a basic BLAST search. Unless otherwise stated, a basic search runs with the default options activated. These default options are:

The search is initiated by clicking the search button under the sequence box (Fig. 4.9.1). Alternatively, you may clear the sequence box by pressing the clear sequence button (Fig. 4.9.1).

To perform a search with parameters other than the defaults, adjust the fields shown in Fig. 4.9.1.

The advanced options which may be used during a BLAST search are as follows:

    Low Complexity Filtering

    This option masks off (conceals) segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen [4].

    Why use low complexity filtering?

    Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

    What are the problems associated with low complexity filtering?

    It is not unusual for nothing at all to be masked by SEG when it is applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.

    A more detailed description of low complexity filtering can be found at (docs/filtered.html)
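To make the idea of masking concrete, the sketch below replaces low-complexity stretches of a query with "X". It is a deliberately simplified stand-in based on Shannon entropy in a sliding window and is not the SEG or DUST algorithm; the window size, entropy cut-off, function names, and example query are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Mask every window whose composition entropy is below min_entropy.

    Simplified illustration only: real filters (SEG, DUST) use different
    complexity measures and refinement steps.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Hypothetical query with a proline-rich stretch in the middle.
query = "MKTAYIAKQR" + "P" * 14 + "GSHMLEDKVRQW"
print(mask_low_complexity(query))
```

As with the real filters, only the query is altered; a query with no low-complexity window passes through unchanged.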

    BLAST lookup table filter

    This option is still experimental and may change in the near future. This option masks only for purposes of constructing the lookup table used by BLAST. The BLAST extensions are performed without masking.

    Expectation factor

    The statistical significance threshold for reporting matches against database sequences. The default value is 10, meaning that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul [2]. If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable.
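As a rough sketch of how the threshold is applied, the Karlin-Altschul statistics give the expected number of chance matches for a bit score S' as E = m * n * 2^(-S'), where m is the query length and n the database length. The code below uses raw lengths, whereas BLAST itself applies effective (edge-corrected) lengths, so the numbers are illustrative only; the function names are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance matches with at least this bit score.

    Uses E = m * n * 2**(-S'). Assumption: raw lengths are used here,
    while BLAST uses effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_expect(bit_score, query_length, database_length, threshold=10.0):
    """True if the match would be reported under the EXPECT threshold."""
    return expect_value(bit_score, query_length, database_length) <= threshold


# A 363-residue query against a hypothetical 10-million-residue database.
for score in (20.0, 40.0, 80.0):
    e = expect_value(score, 363, 10_000_000)
    print(score, e, passes_expect(score, 363, 10_000_000))
```

Each additional bit of score halves the E value, which is why lowering the threshold removes the weakest matches first.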

    Substitution matrix used

    A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a 'score' for aligning any possible pair of residues. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees.
    A more detailed description of matrices can be found at (docs/matrix_info.html)
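A small sketch may help show how a substitution matrix turns an alignment into a score. The values below are toy numbers chosen for illustration and are not taken from BLOSUM, PAM, or any matrix offered by the server; the names `TOY_MATRIX`, `pair_score`, and `ungapped_score` are hypothetical.

```python
# Toy scores for illustration only; NOT real BLOSUM or PAM values.
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2,  # similar hydrophobic residues
    ("K", "R"): 2,  # similar basic residues
}
DEFAULT_MISMATCH = -1  # score for any pair not listed above


def pair_score(a, b):
    """Look up the score for aligning residues a and b (symmetric)."""
    if (a, b) in TOY_MATRIX:
        return TOY_MATRIX[(a, b)]
    return TOY_MATRIX.get((b, a), DEFAULT_MISMATCH)


def ungapped_score(seq1, seq2):
    """Sum the pair scores of two equal-length, gap-free aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("ALIK", "ALLR"))
```

Matrices tailored to closely related sequences penalize substitutions more heavily, while matrices for distant relationships give more credit to conservative replacements such as the I/L and K/R pairs above.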

    Ungapped alignment option

    Leaving this box unchecked allows gaps to be introduced into sequence alignments. This default setting ensures that any similarities, even those that define a domain within the coding region, will be identified if the extent of local similarity is high enough.

    Query Genetic codes [BLASTX ONLY]

    When employing the BLASTX program (in which a translated nucleotide sequence is used as a query against a protein database), the genetic code to be used in the translation can be specified here. The standard genetic code is used by default.
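The translation step can be sketched as follows, assuming the standard genetic code and a query containing only A, C, G, and T. This is an illustration of six-frame translation in general, not the BLASTX implementation; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, code=STANDARD_CODE):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna, code=STANDARD_CODE):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:], code)
    return frames


print(six_frame_translations("ATGGCC"))
```

Passing a different codon table as `code` is the programmatic counterpart of choosing a non-standard genetic code in the menu.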

    Out of frame (OOF) frame shift penalty [BLASTX ONLY]

    Advanced options

    BLAST was originally a text-based tool with a number of 'UNIX-switch' type options. These switch options allow the user to enter custom gap penalties or 'word' sizes. The options and how to implement them are covered in greater detail at (docs/full_options.html)

    Graphical overview

    Enabling the graphical overview option shows an overview of the database sequences aligned to the query sequence. The score for each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Passing over a hit with the mouse causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.
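The five-group coloring can be sketched as a simple binning of the alignment score. The cut-offs and color names below follow the convention commonly seen in NCBI BLAST graphical output and are an assumption here; the H-InvDB server may use different values. The function name `score_color` is hypothetical.

```python
from bisect import bisect_right

# Assumed cut-offs and colors (NCBI-style convention, not confirmed
# for the H-InvDB server).
SCORE_CUTOFFS = [40, 50, 80, 200]
COLORS = ["black", "blue", "green", "magenta", "red"]


def score_color(score):
    """Return the color group for an alignment score."""
    return COLORS[bisect_right(SCORE_CUTOFFS, score)]


for score in (35, 45, 79.9, 150, 420):
    print(score, score_color(score))
```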

    Alignment view

    Alignment view allows the user to select various views of the data based upon interest. The views include pairwise, hit table,

    Number of short description matches reported

    This option allows the user to restrict the maximum number of high-scoring alignments displayed in the BLAST results. If there are more HSPs than the maximum number selected, BLAST reports only the highest-scoring hits, up to that maximum.

    Number of high scoring alignments reported

    Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 50. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT below), only the matches ascribed the greatest statistical significance are reported.
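The interaction between the reporting limit and the significance threshold can be sketched as a filter followed by a sort and a cut. The hit list, names, and E values below are made up for illustration, and `top_hits` is a hypothetical helper rather than part of BLAST.

```python
def top_hits(hits, limit=50, expect_threshold=10.0):
    """Keep hits within the EXPECT threshold, most significant first.

    hits is a list of (name, e_value) pairs; a smaller E value means a
    more significant match. At most `limit` hits are returned.
    """
    significant = [hit for hit in hits if hit[1] <= expect_threshold]
    significant.sort(key=lambda hit: hit[1])
    return significant[:limit]


# Hypothetical hits: (database sequence name, E value).
hits = [("seqA", 1e-5), ("seqB", 20.0), ("seqC", 1e-30), ("seqD", 0.5)]
print(top_hits(hits, limit=2))
```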

    EXPECT

    The statistical significance threshold for reporting matches against database sequences, expressed as the number of matches expected to be found by chance. If the E value of a match is greater than the EXPECT threshold, the match will not be reported. The lower the threshold, the more stringent the test, which leads to fewer 'false positive' matches being reported as hits.

    Color schema

    BLAST offers a number of different color formats for data display. As a general rule, the higher the schema number, the more alignment features are shown in color. Schema specifics are covered in greater detail at (docs/color_schema.html)

4.9.3 Performing a search using BLAST

This example uses the default options listed above and the following sequence, fructose-bisphosphate aldolase A, taken from SWISS-PROT [http://us.expasy.org/sprot/]:

>sp|P05064|ALFA_MOUSE Fructose-bisphosphate aldolase A (EC 4.1.2.13) (Muscle-type aldolase) (Aldolase 1) - Mus musculus (Mouse).
PHPYPALTPEQKKELSDIAHRIVAPGKGILAADESTGSIAKRLQSIGTENTEENRRFYRQ
LLLTADDRVNPCIGGVILFHETLYQKADDGRPFPQVIKSKGGVVGIKVDKGVVPLAGTNG
ETTTQGLDGLSERCAQYKKDGADFAKWRCVLKIGEHTPSALAIMENANVLARYASICQQN
GIVPIVEPEILPDGDHDLKRCQYVTEKVLAAVYKALSDHHVYLEGTLLKPNMVTPGHACT
QKFSNEEIAMATVTALRRTVPPAVTGVTFLSGGQSEEEASINLNAINKCPLLKPWALTFS
YGRALQASALKAWGGKKENLKAAQEEYIKRALANSLACQGKYTPSGQSGAAASESLFISN
HAY

The results of the BLAST search are shown in Figs. 4.9.2 to 4.9.4. Fig. 4.9.2 shows a literature reference for BLAST, the dataset being compared with the query sequence, and information about the query sequence. At the bottom of the figure is the graphical overview for the query sequence and its hits (see the Graphical overview option above).

Fig. 4.9.2 A view showing the graphical overview of BLAST for sample sequence EC 4.1.2.13.

Fig. 4.9.3 The view showing the top BLAST hits for sequence EC 4.1.2.13.

The top of Fig. 4.9.3 shows information regarding the BLAST hits, with their associated scores and E values. The lower part of Fig. 4.9.3 illustrates the best sequence hit obtained from the target database. The top row is the query sequence, the bottom row is the target sequence, and the middle row summarizes the pairwise alignment by marking matches, mismatches and gaps.
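How the middle row relates to the two sequences can be sketched as below. This simplified version marks only identical residues; actual protein BLAST output additionally marks conservative substitutions with "+", and nucleotide output uses "|" for matches. The function names and the two short aligned rows are hypothetical.

```python
def alignment_midline(query_row, subject_row):
    """Build the middle row of a pairwise alignment display.

    Identical residues are copied into the midline; mismatches and
    gap positions ('-') are shown as spaces.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        q if q == s and q != "-" else " "
        for q, s in zip(query_row, subject_row)
    )


def percent_identity(query_row, subject_row):
    """Percentage of alignment columns holding identical residues."""
    if not query_row:
        raise ValueError("alignment is empty")
    midline = alignment_midline(query_row, subject_row)
    matches = sum(1 for char in midline if char != " ")
    return 100.0 * matches / len(midline)


query_row = "PHPY-ALT"
subject_row = "PHPYPALS"
print(query_row)
print(alignment_midline(query_row, subject_row))
print(subject_row)
print(percent_identity(query_row, subject_row))
```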

Fig. 4.9.4 The view of the parameters for the BLAST search carried out on sequence EC 4.1.2.13.

Fig. 4.9.4 shows when the search was run and gives more information about the target database. Below the date and database information are the parameters of the BLAST search, including the matrix and gap penalties used in the alignments.

References

Revised: December 26, 2007