BLAST (biotechnology)

BLAST
Original author(s) Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David Lipman
Developer(s) NCBI
Stable release
2.16.0+ [1] / 25 June 2024
Written in C and C++ [2]
Operating system UNIX, Linux, Mac, MS-Windows
Type Bioinformatics tool
License Public domain
Website blast.ncbi.nlm.nih.gov/Blast.cgi

In bioinformatics, BLAST (basic local alignment search tool) [3] is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

Background

BLAST is one of the most widely used bioinformatics programs for sequence searching. [4] It addresses a fundamental problem in bioinformatics research. The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.

BLAST builds on FASTA, an earlier program for protein and DNA sequence similarity searches, and adds a novel stochastic model developed by Samuel Karlin and Stephen Altschul. [5] Karlin and Altschul proposed "a method for estimating similarities between the known DNA sequence of one organism with that of another", [3] and their work has been described as "the statistical foundation for BLAST." [6] Subsequently, Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH designed and implemented the BLAST program, which was published in the Journal of Molecular Biology in 1990 and has been cited over 100,000 times since. [7]

While BLAST is faster than any Smith-Waterman implementation in most cases, it cannot "guarantee the optimal alignments of the query and database sequences" as the Smith-Waterman algorithm does. The Smith-Waterman algorithm was an extension of a previous optimal method, the Needleman–Wunsch algorithm, which was the first sequence alignment algorithm guaranteed to find the best possible alignment. However, the time and space requirements of these optimal algorithms far exceed those of BLAST.

BLAST is more time-efficient than FASTA because it searches only for the more significant patterns in the sequences, yet it retains comparable sensitivity. The reason becomes clearer from the description of the BLAST algorithm below.

Examples of other questions that researchers use BLAST to answer are:

BLAST is also often used as part of other algorithms that require approximate sequence matching.

BLAST is available on the web on the NCBI website. Different types of BLASTs are available according to the query sequences and the target databases. Alternative implementations include AB-BLAST (formerly known as WU-BLAST), FSA-BLAST (last updated in 2006), and ScalaBLAST. [8] [9]

The original paper by Altschul, et al. [7] was the most highly cited paper published in the 1990s. [10]

Input

The input consists of one or more query sequences (in FASTA or GenBank format), the database to search, and optional parameters such as the scoring matrix. [11]

Output

BLAST output can be delivered in a variety of formats. These formats include HTML, plain text, and XML formatting. For NCBI's webpage, the default format for output is HTML. When performing a BLAST on NCBI, the results are given in a graphical format showing the hits found, a table showing sequence identifiers for the hits with scoring related data, as well as alignments for the sequence of interest and the hits received with corresponding BLAST scores for these. The easiest to read and most informative of these is probably the table.

If one is attempting to search for a proprietary sequence or simply one that is unavailable in databases available to the general public through sources such as NCBI, there is a BLAST program available for download to any computer, at no cost. This can be found at BLAST+ executables. There are also commercial programs available for purchase. Databases can be found on the NCBI site, as well as on the Index of BLAST databases (FTP).

Process

Using a heuristic method, BLAST finds similar sequences by locating short matches between the two sequences. This process of finding initial matches is called seeding. Only after this first match does BLAST begin to build local alignments. Short runs of letters shared between sequences, known as words, are central to this search. For example, suppose that the sequence contains the stretch of letters GLKFA. Under default conditions the word size is 3 letters, so the searched words are GLK, LKF, and KFA. The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database, and this result is then used to build an alignment. After the words of the sequence of interest have been made, their neighborhood words are also assembled. These words must have a score of at least the threshold T when compared with the query word using a scoring matrix.
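The word-listing and thresholding described above can be sketched in a few lines of Python. This is only an illustration, not the NCBI implementation: the function names are hypothetical, and the scoring table is a small hand-copied subset of BLOSUM62 that covers just the residue pairs used here (the PQG, PEG and PQA words are the ones scored in the algorithm overview later in this article).

```python
def query_words(sequence, k=3):
    """Return every overlapping k-letter word of the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hand-copied subset of BLOSUM62 (assumption: only the pairs needed below).
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a residue pair in either order; raises KeyError outside the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word_a, word_b):
    """Sum the substitution scores of two equal-length words, position by position."""
    return sum(pair_score(a, b) for a, b in zip(word_a, word_b))


def neighborhood(word, candidates, threshold):
    """Keep the candidate words that score at least `threshold` against `word`."""
    return [c for c in candidates if word_score(word, c) >= threshold]


print(query_words("GLKFA"))                            # ['GLK', 'LKF', 'KFA']
print(word_score("PQG", "PEG"))                        # 15
print(word_score("PQG", "PQA"))                        # 12
print(neighborhood("PQG", ["PQG", "PEG", "PQA"], 13))  # ['PQG', 'PEG']
```

A real search would score each query word against all 20^3 possible three-letter words using the full matrix; the candidate list here is kept tiny so the arithmetic can be checked by hand.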

One commonly used scoring matrix for BLAST searches is BLOSUM62, [12] although the optimal scoring matrix depends on sequence similarity. Once both words and neighborhood words are assembled and compiled, they are compared to the sequences in the database in order to find matches. The threshold score T determines whether or not a particular word will be included in the alignment. Once seeding has been conducted, the alignment, which is only 3 residues long, is extended in both directions by the algorithm used by BLAST. Each extension changes the score of the alignment, either increasing or decreasing it. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, if this score is lower than this pre-determined T, the alignment will cease to extend, preventing areas of poor alignment from being included in the BLAST results. Note that increasing T shrinks the search space by decreasing the number of neighborhood words, which at the same time speeds up BLAST.
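The extension step can be illustrated with a minimal ungapped sketch. Several things here are assumptions rather than BLAST's actual behaviour: the function names are hypothetical, scoring uses a plain match/mismatch scheme (+5 and -4, values quoted elsewhere in this article for nucleotide words), and extension in each direction stops once the running score falls more than `drop_off` below the best score seen in that direction.

```python
def _best_extension(pairs, match, mismatch, drop_off):
    """Walk along aligned letter pairs and return (best score, length at best)."""
    best = run = best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        run += match if a == b else mismatch
        if run > best:
            best, best_len = run, length
        elif best - run > drop_off:
            break  # score has fallen too far; stop extending this way
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, word_size=3,
                match=5, mismatch=-4, drop_off=8):
    """Extend an exact seed in both directions without gaps.

    Returns (query start, query end, score) with the end exclusive.
    """
    seed = zip(query[q_pos:q_pos + word_size], subject[s_pos:s_pos + word_size])
    seed_score = sum(match if a == b else mismatch for a, b in seed)

    right = zip(query[q_pos + word_size:], subject[s_pos + word_size:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    right_score, right_len = _best_extension(right, match, mismatch, drop_off)
    left_score, left_len = _best_extension(left, match, mismatch, drop_off)

    return (q_pos - left_len,
            q_pos + word_size + right_len,
            seed_score + left_score + right_score)


# Seed "ACG" at position 2 of both sequences; the flanks on the left do not match.
print(extend_seed("TTACGTACGG", "GGACGTACTT", 2, 2))           # (2, 8, 30)
# A tolerant drop_off lets the extension cross a single mismatch...
print(extend_seed("ACGTTACGT", "ACGTAACGT", 0, 0))             # (0, 9, 36)
# ...while drop_off=0 stops at the first decrease in score.
print(extend_seed("ACGTTACGT", "ACGTAACGT", 0, 0, drop_off=0)) # (0, 4, 20)
```

The last two calls show why the stopping rule matters: a strict rule ends the alignment at the first mismatch, whereas a tolerant one recovers the longer, higher-scoring segment.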

Algorithm

To run the software, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences. BLAST will find sub-sequences in the database which are similar to subsequences in the query. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.

The main idea of BLAST is that there are often High-scoring Segment Pairs (HSP) contained in a statistically significant alignment. BLAST searches for high scoring sequence alignments between the query sequence and the existing sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. However, the exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster. [13] The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.

An overview of the BLAST algorithm (a protein to protein search) is as follows: [13]

  1. Remove low-complexity regions or sequence repeats in the query sequence.
    "Low-complexity region" means a region of a sequence composed of few kinds of elements. These regions might give high scores that prevent the program from finding the actual significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the DUST program is used for DNA sequences. The XNU program, on the other hand, is used to mask tandem repeats in protein sequences.
  2. Make a k-letter word list of the query sequence.
    Taking k = 3 as an example, list every word of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) sequentially, until the last letter of the query sequence is included. The method is illustrated in figure 1.
    Fig. 1 The method to establish the k-letter query word list.
  3. List the possible matching words.
    This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2; however, BLAST only cares about the high-scoring words. The scores are created by comparing the word in the list in step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is respectively 15 and 12 with the BLOSUM62 weighting scheme. For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
  4. Organize the remaining high-scoring words into an efficient search tree.
    This allows the program to rapidly compare the high-scoring words to the database sequences.
  5. Repeat steps 3 and 4 for each k-letter word in the query sequence.
  6. Scan the database sequences for exact matches with the remaining high-scoring words.
    The BLAST program scans the database sequences for each remaining high-scoring word, such as PEG, at every position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
  7. Extend the exact matches to high-scoring segment pairs (HSPs).
    • The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
      Fig. 2 The process to extend the exact match. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis .
      Fig. 3 The positions of the exact matches.
    • To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity. Therefore, the list of possible matching words in step 3 becomes longer. Next, the exact matched regions that lie within distance A of each other on the same diagonal in figure 3 are joined into a longer new region. Finally, the new regions are extended by the same method as in the original version of BLAST, and the scores of the resulting HSPs are computed using a substitution matrix as before.
  8. List all of the HSPs in the database whose score is high enough to be considered.
    We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.
  9. Evaluate the significance of the HSP score.
    BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD. For local alignments containing gaps it is not proved.) In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation
    $$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$
    where
    $$\mu = \frac{\ln(K m' n')}{\lambda}$$
    The statistical parameters $\lambda$ and $K$ are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that $\lambda$ and $K$ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). $m'$ and $n'$ are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment that starts near the end of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated as
    $$m' \approx m - \frac{\ln(K m n)}{H}$$
    $$n' \approx n - \frac{\ln(K m n)}{H}$$
    where $H$ is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values, $\lambda = 0.318$, $K = 0.13$, and $H = 0.40$, for un-gapped local alignment using BLOSUM62 as the substitution matrix. Using the typical values for assessing the significance is called the lookup table method; it is not accurate. The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search for a database of D sequences is given by
    $$E \approx 1 - e^{-p(S > x) D}$$
    Furthermore, when $p < 0.1$, E could be approximated by the Poisson distribution as
    $$E \approx p D$$
    This expectation or expect value "E" (often called an E score or E-value or e-value) assessing the significance of the HSP score for un-gapped local alignment is reported in the BLAST results. The calculation shown here is modified if individual HSPs are combined, such as when producing gapped alignments (described below), due to the variation of the statistical parameters.
  10. Make two or more HSP regions into a longer alignment.
    Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45 > 40). However, the sum-of-scores method prefers the first set, because 65 + 40 (105) is greater than 52 + 45 (97). The original BLAST uses the Poisson method; gapped BLAST and WU-BLAST use the sum-of-scores method.
  11. Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
    • The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence.
    • BLAST2 produces a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E-value involves use of adequate gap penalties.
  12. Report every match whose expect score is lower than a threshold parameter E.
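The significance calculation of step 9 can be tried out numerically. The sketch below assumes the usual Gumbel form, in which the probability of a score of at least x is 1 - exp(-K m' n' e^(-λx)), with effective lengths m' and n' obtained by subtracting ln(K m n)/H from each length. The parameter values λ = 0.318, K = 0.13 and H = 0.40 are the commonly quoted figures for un-gapped BLOSUM62 alignments and should be treated as assumptions here; the function names are hypothetical, and the sequences are assumed long enough for the effective lengths to stay positive.

```python
import math

# Assumed typical parameters for un-gapped alignment with BLOSUM62.
LAMBDA, K, H = 0.318, 0.13, 0.40


def effective_lengths(m, n, k=K, h=H):
    """Shorten both lengths by ln(K*m*n)/H to compensate for the edge effect."""
    correction = math.log(k * m * n) / h
    return m - correction, n - correction


def p_value(x, m, n, lam=LAMBDA, k=K, h=H):
    """Probability that a random alignment scores at least x (Gumbel EVD)."""
    m_eff, n_eff = effective_lengths(m, n, k, h)
    mu = math.log(k * m_eff * n_eff) / lam
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-math.exp(-lam * (x - mu)))


def expect(x, m, n, d, lam=LAMBDA, k=K, h=H):
    """Expected number of chance hits scoring at least x among d sequences."""
    return -math.expm1(-p_value(x, m, n, lam, k, h) * d)


# A 250-residue query against 1000-residue database sequences, 1000 of them.
for score in (40, 60, 80):
    print(score, p_value(score, 250, 1000), expect(score, 250, 1000, 1000))
```

Raising the score lowers both numbers sharply, and for small p the expectation is close to p multiplied by the number of database sequences, which is the Poisson approximation mentioned in step 9.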

Types of BLAST

BLASTn (Nucleotide BLAST)

BLASTn compares one or more nucleotide sequences to a database or another sequence. This is useful when trying to identify evolutionary relationships between organisms. [15]

tBLASTn

tBLASTn is used to search for proteins in sequences that haven't been translated into proteins yet. It takes a protein sequence and compares it to all possible translations of a DNA sequence. This is useful when looking for similar protein-coding regions in DNA sequences that haven't been fully annotated, like ESTs (short, single-read cDNA sequences) and HTGs (draft genome sequences). Since these sequences don't have known protein translations, we can only search for them using tBLASTn. [16]

BLASTx

BLASTx compares a nucleotide query sequence, which can be translated into six different protein sequences, against a database of known protein sequences. This tool is useful when the reading frame of the DNA sequence is uncertain or contains errors that might cause mistakes in protein-coding. BLASTx provides combined statistics for hits across all frames, making it helpful for the initial analysis of new DNA sequences. [17]
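The six translations come from three reading-frame offsets on each strand. The following sketch shows one way to produce them, assuming the standard genetic code and an input made only of the letters A, C, G and T (any other codon is rendered as X). The function names are illustrative and are not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to the translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
# +1 MA*   +2 WP   +3 GL   -1 LGH   -2 *A   -3 RP
```

Each of the six strings would then be compared against the protein database, which is why a query with an uncertain reading frame can still produce a hit.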

BLASTp
Protein sequence being compared against nr database using BLASTp.

BLASTp, or Protein BLAST, is used to compare protein sequences. You can input one or more protein sequences that you want to compare against a single protein sequence or a database of protein sequences. This is useful when you're trying to identify a protein by finding similar sequences in existing protein databases. [18]

Parallel BLAST

Parallel BLAST versions of split databases are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelize BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partition). Databases are split into equal sized pieces and stored locally on each node. Each query is run on all nodes in parallel and the resultant BLAST output files from all nodes merged to yield the final output. Specific implementations include MPIblast, ScalaBLAST, DCBLAST and so on. [19]

MPIblast makes use of a database segmentation technique to parallelize the computation process. [20] This allows for significant performance improvements when conducting BLAST searches across a set of nodes in a cluster. In some scenarios a superlinear speedup is achievable. This makes MPIblast suitable for the extensive genomic datasets that are typically used in bioinformatics.

The running time of BLAST is generally O(n), where n is the size of the database. [21] The time to complete the search increases linearly as the size of the database increases. MPIblast uses parallel processing to speed up the search. The ideal running time for a parallel computation is O(n/p), where n is the size of the database and p is the number of processors, which would indicate that the job is evenly distributed among the p processors. This is visualized in the included graph. The superlinear speedup that can sometimes occur with MPIblast corresponds to a running time better than O(n/p). This occurs because the cache memory can be used to decrease the run time. [22]
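The database segmentation idea can be shown with a toy example. Everything in this sketch is an assumption made for illustration: the database is a list of (name, sequence) pairs, the "search" is a plain substring lookup standing in for a real BLAST run, and a thread pool stands in for the nodes of a cluster (in CPython, threads illustrate the structure but will not speed up this CPU-bound loop).

```python
from concurrent.futures import ThreadPoolExecutor


def split_database(sequences, pieces):
    """Deal the sequences into `pieces` lists of nearly equal size."""
    return [sequences[i::pieces] for i in range(pieces)]


def search_piece(word, piece):
    """Stand-in for a BLAST search: report where `word` first occurs."""
    return [(name, seq.find(word)) for name, seq in piece if word in seq]


def segmented_search(word, database, workers=4):
    """Search every piece in parallel, then merge the partial results."""
    pieces = split_database(database, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda piece: search_piece(word, piece), pieces))
    return sorted(hit for hits in partial for hit in hits)


database = [("s1", "AAACGT"), ("s2", "TTTT"), ("s3", "ACGACG"),
            ("s4", "GGACGG"), ("s5", "CCCC")]
print(segmented_search("ACG", database, workers=2))
# [('s1', 2), ('s3', 0), ('s4', 2)]
```

Because each piece holds roughly n/p sequences, each worker does about 1/p of the scanning, which is the O(n/p) ideal described above; the merge step is what real tools must handle carefully so that the combined output matches a single-node run.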

Alternatives to BLAST

The predecessor to BLAST, FASTA, can also be used for protein and DNA similarity searching. FASTA provides a similar set of programs for comparing proteins to protein and DNA databases, DNA to DNA and protein databases, and includes additional programs for working with unordered short peptides and DNA sequences. In addition, the FASTA package provides SSEARCH, a vectorized implementation of the rigorous Smith-Waterman algorithm. FASTA is slower than BLAST, but provides a much wider range of scoring matrices, making it easier to tailor a search to a specific evolutionary distance.

An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster. [23] Another software alternative similar to BLAT is PatternHunter.
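The difference between scanning and indexing can be made concrete with a small k-mer index. This is a simplified sketch rather than BLAT's actual scheme: it indexes every overlapping k-mer of a toy database held in a dictionary, and the function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Look up each query k-mer in the index instead of scanning the database."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for name, s_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, name, s_pos))
    return seeds


database = {"chr1": "GATTACA", "chr2": "TTACGG"}
index = build_kmer_index(database, k=4)
print(find_seeds("ATTACG", index, k=4))
# [(0, 'chr1', 1), (1, 'chr1', 2), (1, 'chr2', 0), (2, 'chr2', 1)]
```

The index is built once and reused for every query, so the cost of finding seeds depends on the query length and the number of hits rather than on the size of the database.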

Advances in sequencing technology in the late 2000s have made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.

For protein identification, a popular alternative is to search for known domains (for instance from Pfam) by matching with hidden Markov models, using tools such as HMMER.

An alternative to BLAST for comparing two banks of sequences is PLAST. PLAST provides a high-performance general purpose bank to bank sequence similarity search tool relying on the PLAST [24] and ORIS [25] algorithms. Results of PLAST are very similar to BLAST, but PLAST is significantly faster and capable of comparing large sets of sequences with a small memory (i.e. RAM) footprint.

For applications in metagenomics, where the task is to compare billions of short DNA reads against tens of millions of protein references, DIAMOND [26] runs at up to 20,000 times as fast as BLASTX, while maintaining a high level of sensitivity.

The open-source software MMseqs is an alternative to BLAST/PSI-BLAST, which improves on current search tools over the full range of speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed. [27]

Optical computing approaches have been suggested as promising alternatives to the current electrical implementations. OptCAM is an example of such approaches and is shown to be faster than BLAST. [28]

Comparing BLAST and the Smith-Waterman Process

While both Smith-Waterman and BLAST are used to find homologous sequences by searching and comparing a query sequence with those in the databases, they do have their differences.

Because BLAST is based on a heuristic algorithm, its results will not include all the possible hits within the database. BLAST misses hard-to-find matches.

An alternative that finds all the possible hits is the Smith-Waterman algorithm. This method differs from BLAST in two areas, accuracy and speed. Smith-Waterman provides better accuracy, in that it finds matches that BLAST cannot, because it does not exclude any information. It is therefore needed for detecting remote homology. Compared to BLAST, however, it is more time consuming and requires large amounts of computing power and memory. Advances such as FPGA chips and SIMD technology have nonetheless sped up the Smith-Waterman search process dramatically.

For more complete results from BLAST, the settings can be changed from their default settings. The optimal settings for a given sequence, however, may vary. The settings one can change are E-Value, gap costs, filters, word size, and substitution matrix.

Note, the algorithm used for BLAST was developed from the algorithm used for Smith-Waterman. BLAST employs an alignment which finds "local alignments between sequences by finding short matches and from these initial matches (local) alignments are created". [29]

BLAST output visualization

To help users interpret BLAST results, different software is available. According to installation and use, analysis features and technology, here are some available tools: [30]

Example visualizations of BLAST results are shown in Figures 4 and 5.

Fig. 4 Circos-style visualization of BLAST results generated using SequenceServer software.
Fig. 5 Length distribution of BLAST hits generated using SequenceServer software showing that the query (a predicted gene product) is longer compared to similar database sequences.

Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, comparison, and classifying taxonomy.

Identifying species
With the use of BLAST, you may be able to identify a species correctly or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.
Locating domains
When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny
Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for "first pass" phylogenetic analyses.
DNA mapping
When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s). NCBI has a "Magic-BLAST" tool built around BLAST for this purpose. [31]
Comparison
When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.
Classifying taxonomy
BLAST can use genetic sequences to compare multiple taxa against known taxonomical data. By doing this, it can provide a picture of the evolutionary relationships between various species (Fig.6). This is a useful way to identify orphan genes, since if the gene shows up in an organism outside of the ancestral lineage, then it wouldn't be classified as an orphan gene.
Fig. 6 Output of a BLASTP search showing that a gene found in Bufo japonicus is also found in many other species of the frog (Anura) lineage.
Although this method is helpful, some more accurate options to find homologs would be through pairwise sequence alignment and multiple sequence alignment.

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

<span class="mw-page-title-main">National Center for Biotechnology Information</span> Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

In molecular biology, reading frames are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open". Such an open reading frame (ORF) may contain a start codon and by definition cannot extend beyond a stop codon. That start codon indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

BLOSUM: Bioinformatics tool

In bioinformatics, the BLOSUM matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for highly conserved regions of protein families and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; unlike the PAM matrices, they are not extrapolated from comparisons of closely related proteins.
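The log-odds calculation can be illustrated with a toy example. The sketch below counts residue pairs in the columns of a small ungapped block and converts the observed pair frequencies into rounded scores in half-bit units, following the commonly described form of the calculation. It is a simplification: the sequence clustering step that distinguishes, for example, BLOSUM62 from BLOSUM80 is omitted, and the function name `log_odds_scores` and the two-letter toy block are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def log_odds_scores(block: List[str]) -> Dict[Tuple[str, str], int]:
    """Compute rounded log-odds scores (half-bit units) from an ungapped block."""
    # Count every unordered residue pair within each alignment column.
    pair_counts: Counter = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}  # observed

    # Marginal probability of each residue occurring in a pair.
    p: Counter = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Expected frequency under independence; unordered mixed pairs
        # can arise in two ways, hence the factor of 2.
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores
```

For the toy block `["AS", "AS", "AS", "AS", "SA"]`, identical pairs are observed more often than chance predicts and receive a positive score, while the A/S substitution is observed less often than expected and receives a negative one.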

David J. Lipman: American biologist

David J. Lipman is an American biologist who from 1989 to 2017 was the director of the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. NCBI is the home of GenBank, the U.S. node of the International Sequence Database Consortium, and PubMed, one of the most heavily used sites in the world for the search and retrieval of biomedical information. Lipman is one of the original authors of the BLAST sequence alignment program, and a respected figure in bioinformatics. In 2017, he left NCBI and became Chief Science Officer at Impossible Foods.

formatdb is a discontinued software tool that was used in molecular bioinformatics to format protein or nucleotide databases for BLAST. It has been replaced by makeblastdb and the NCBI "strongly encourage[s]" users to stop using formatdb.
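As a hedged illustration of the replacement workflow, the Python sketch below assembles a makeblastdb invocation using its `-in`, `-dbtype` and `-out` options. The helper names `makeblastdb_command` and `build_database` and the file names in the usage note are hypothetical, and actually running the command assumes that BLAST+ is installed and on the `PATH`.

```python
import subprocess
from typing import List


def makeblastdb_command(fasta_path: str, db_name: str,
                        dbtype: str = "nucl") -> List[str]:
    """Build the argument list for a makeblastdb call (does not run it)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype,
            "-out", db_name]


def build_database(fasta_path: str, db_name: str, dbtype: str = "nucl") -> None:
    """Run makeblastdb; raises CalledProcessError if the tool reports failure."""
    subprocess.run(makeblastdb_command(fasta_path, db_name, dbtype), check=True)
```

A call such as `build_database("sequences.fasta", "mydb", "prot")` would format a protein FASTA file into a database named `mydb`. Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.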

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments.

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.

HMMER: Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better against the profile-HMM than against a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

CS-BLAST (Context-Specific BLAST) is a protein sequence search tool that extends BLAST by using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities for each query sequence from short windows on the query sequence. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. Both programs are available as web servers and for free download.

PatternHunter is commercially available homology search software that uses sequence alignment techniques. It was initially developed in 2002 by three scientists: Bin Ma, John Tromp and Ming Li. They were motivated by a problem that many investigators face in genomics and proteomics studies: such studies rely heavily on homology searches that find short seed matches and then extend them. Describing homologous genes is an essential part of most evolutionary studies and is crucial to understanding the evolution of gene families and the relationship between domains and families. Homologous genes can only be studied effectively with search tools that find similar regions, or local alignments, between two protein or nucleic acid sequences. Homology is quantified by scores obtained from matching sequences, including mismatch and gap scores.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification to the earliest heuristic-based alignment efforts that allows for minor differences between the sequences of interest. Spaced seeds have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.
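The following Python sketch shows how a spaced seed, written as a string of ones and zeroes, can be used to find candidate matches between two sequences. It indexes the target by the characters at the relevant positions only, so windows that differ at an irrelevant position still collide. The function name `spaced_seed_hits` and the short seed `1101` are chosen for illustration; practical tools use longer, carefully optimized seeds.

```python
from typing import Dict, List, Tuple


def spaced_seed_hits(query: str, target: str,
                     seed: str = "1101") -> List[Tuple[int, int]]:
    """Return (query_pos, target_pos) pairs whose windows agree at every '1'."""
    care = [k for k, c in enumerate(seed) if c == "1"]  # relevant offsets
    span = len(seed)

    # Index every target window by its characters at the relevant offsets.
    index: Dict[str, List[int]] = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up each query window using the same mask.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits
```

With the seed `1101`, the windows `ACGT` and `ACTT` produce a hit because they differ only at the third, irrelevant position, whereas the contiguous seed `1111` finds no hit between them. Each hit would then normally be passed to an extension step to test whether it belongs to a longer alignment.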

References

  1. BLAST Release Notes. National Center for Biotechnology Information (US). 24 June 2024.
  2. "BLAST Developer Information". blast.ncbi.nlm.nih.gov.
  3. Douglas Martin (21 February 2008). "Samuel Karlin, Versatile Mathematician, Dies at 83". The New York Times.
  4. R. M. Casey (2005). "BLAST Sequences Aid in Genomics and Proteomics". Business Intelligence Network.
  5. "BLAST topics".
  6. Dan Stober (January 16, 2008). "Sam Karlin, mathematician who improved DNA analysis, dead at 83". Stanford.edu. Archived from the original on June 12, 2016. Retrieved July 16, 2019.
  7. Stephen Altschul; Warren Gish; Webb Miller; Eugene Myers; David J. Lipman (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
  8. Oehmen, C.; Nieplocha, J. (2006). "ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis". IEEE Transactions on Parallel and Distributed Systems. 17 (8): 740. doi:10.1109/TPDS.2006.112. S2CID 11122366.
  9. Oehmen, C. S.; Baxter, D. J. (2013). "ScalaBLAST 2.0: Rapid and robust BLAST calculations on multiprocessor systems". Bioinformatics. 29 (6): 797–798. doi:10.1093/bioinformatics/btt013. PMC 3597145. PMID 23361326.
  10. "Sense from Sequences: Stephen F. Altschul on Bettering BLAST". ScienceWatch. July–August 2000. Archived from the original on 7 October 2007.
  11. Penumarthi, Lasya R.; Baptista, Rodrigo P.; Beaudry, Megan S.; Glenn, Travis C.; Kissinger, Jessica C. (2024-12-18). "A new chromosome-level genome assembly and annotation of Cryptosporidium meleagridis". Scientific Data. 11 (1): 1388. doi:10.1038/s41597-024-04235-7. ISSN 2052-4463.
  12. Steven Henikoff; Jorja Henikoff (1992). "Amino Acid Substitution Matrices from Protein Blocks". PNAS. 89 (22): 10915–10919. Bibcode:1992PNAS...8910915H. doi:10.1073/pnas.89.22.10915. PMC 50453. PMID 1438297.
  13. Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Press. ISBN 978-0-87969-712-9.
  14. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis.
  15. "Library Guides: NCBI Bioinformatics Resources: An Introduction: BLAST: Compare & identify sequences".
  16. "Library Guides: NCBI Bioinformatics Resources: An Introduction: BLAST: Compare & identify sequences".
  17. "Library Guides: NCBI Bioinformatics Resources: An Introduction: BLAST: Compare & identify sequences".
  18. "Library Guides: NCBI Bioinformatics Resources: An Introduction: BLAST: Compare & identify sequences".
  19. Yim, WC; Cushman, JC (2017). "Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments". PeerJ. 5: e3486. doi:10.7717/peerj.3486. PMC 5483034. PMID 28652936.
  20. Darling, Ace; Carey, Lewis; Feng, Wei-Chun (2003). "The design, implementation, and evaluation of mpiBLAST" (PDF). University of Wisconsin-Madison. Retrieved 2023-04-17.
  21. Kellis, Manolis (5 October 2020). "The Blast Algorithm (Basic Alignment Search Tool)". LibreTexts. Retrieved 2023-04-17.
  22. Darling, Ace; Carey, Lewis; Feng, Wei-Chun (2003). "The design, implementation, and evaluation of mpiBLAST" (PDF). University of Wisconsin-Madison. Retrieved 2023-04-17.
  23. Kent, W. James (2002-04-01). "BLAT—The BLAST-Like Alignment Tool". Genome Research. 12 (4): 656–664. doi:10.1101/gr.229202. ISSN 1088-9051. PMC 187518. PMID 11932250.
  24. Nguyen, Van Hoa; Lavenier, Dominique (2009). "PLAST: parallel local alignment search tool for database comparison". BMC Bioinformatics. 10: 329. doi:10.1186/1471-2105-10-329. PMC 2770072. PMID 19821978.
  25. Lavenier, D. (2009). "Ordered index seed algorithm for intensive DNA sequence comparison" (PDF). 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1–8. CiteSeerX 10.1.1.155.3633. doi:10.1109/IPDPS.2008.4536172. ISBN 978-1-4244-1693-6. S2CID 10804289.
  26. Buchfink, Xie and Huson (2015). "Fast and sensitive protein alignment using DIAMOND". Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. PMID 25402007. S2CID 5346781.
  27. Steinegger, Martin; Soeding, Johannes (2017-10-16). "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nature Biotechnology. 35 (11): 1026–1028. doi:10.1038/nbt.3988. hdl:11858/00-001M-0000-002E-1967-3. PMID 29035372. S2CID 402352.
  28. Maleki, Ehsan; Koohi, Somayyeh; Kavehvash, Zahra; Mashaghi, Alireza (2020). "OptCAM: An ultra-fast all-optical architecture for DNA variant discovery". Journal of Biophotonics. 13 (1): e201900227. doi:10.1002/jbio.201900227. PMID 31397961.
  29. "Bioinformatics Explained: BLAST versus Smith-Waterman" (PDF). 4 July 2007.
  30. Neumann, Kumar and Shalchian-Tabrizi (2014). "BLAST output visualization in the new sequencing era". Briefings in Bioinformatics. 15 (4): 484–503. doi:10.1093/bib/bbt009. PMID 23603091.
  31. "NCBI Magic-BLAST". ncbi.github.io. Retrieved 16 May 2019.