Warren Gish

Warren Richard Gish
Nationality: American
Alma mater: University of California, Berkeley
Known for: BLAST
Scientific career
Fields: Bioinformatics
Institutions: National Center for Biotechnology Information; Washington University in St. Louis; Advanced Biocomputing LLC; University of California, Berkeley
Thesis: I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (1988)
Doctoral advisor: Michael Botchan [1]

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007. [2] [3]

Education

After initially studying physics, Gish obtained an A.B. degree in Biochemistry from the University of California, Berkeley, and completed work for his Ph.D. degree in Molecular Biology at the same institution in 1988. [1]

Research

Gish is primarily known for his contributions to NCBI BLAST, [4] [5] his creation of the BLAST Network Service and nr (non-redundant) databases, his 1996 release of the original gapped BLAST (WU-BLAST 2.0), and most recently his development and support of AB-BLAST. At Washington University in St. Louis, Gish also led the genome analysis group which annotated all finished human, mouse and rat genome data produced by the University's Genome Sequencing Center from 1995 through 2002.

As a graduate student, Gish applied the Quine-McCluskey algorithm to the analysis of splice site recognition sequences. In 1985, with a view toward rapid identification of restriction enzyme recognition sites in DNA, Gish developed a DFA function library in the C language. The idea to apply a finite-state machine to this problem had been suggested by fellow graduate student and BSD UNIX developer Mike Karels. Gish's DFA implementation was that of a Mealy machine architecture, which is more compact than an equivalent Moore machine and hence faster. Construction of the DFA was O(n), where n is the sum of the lengths of the query sequences. The DFA could then be used to scan subject sequences in a single pass with no backtracking in O(m) time, where m is the total length of the subject(s). The method of DFA construction was recognized later as being a consolidation of two algorithms, Algorithms 3 and 4 described by Alfred V. Aho and Margaret J. Corasick. [6]
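
The construction described above can be illustrated with a short sketch. This is not Gish's C library; it is a minimal Python rendering of the general Aho-Corasick idea, in which failure links and the full transition table are filled in during one breadth-first pass. The function names (`build_dfa`, `scan`), the four-letter alphabet, and the choice to attach outputs to states rather than to transitions (unlike the Mealy design mentioned above) are all assumptions made for illustration.

```python
from collections import deque

ALPHABET = "ACGT"  # assumption: patterns contain only these symbols


def build_dfa(patterns):
    """Return (delta, out): delta[state][symbol] is the next state,
    out[state] lists the patterns that end when that state is reached."""
    # Phase 1: build a trie of the patterns (the "goto" function).
    delta = [{}]
    out = [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in delta[state]:
                delta.append({})
                out.append([])
                delta[state][ch] = len(delta) - 1
            state = delta[state][ch]
        out[state].append(pat)

    # Phase 2: one breadth-first pass that computes failure links and
    # fills in every missing transition, yielding a complete DFA.
    fail = [0] * len(delta)
    queue = deque()
    for ch in ALPHABET:
        nxt = delta[0].get(ch)
        if nxt is None:
            delta[0][ch] = 0
        else:
            queue.append(nxt)
    while queue:
        state = queue.popleft()
        out[state] = out[state] + out[fail[state]]
        for ch in ALPHABET:
            nxt = delta[state].get(ch)
            if nxt is None:
                delta[state][ch] = delta[fail[state]][ch]
            else:
                fail[nxt] = delta[fail[state]][ch]
                queue.append(nxt)
    return delta, out


def scan(delta, out, text):
    """Scan text in a single pass with no backtracking.
    Returns (start_position, pattern) pairs in order of discovery."""
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        state = delta[state].get(ch, 0)  # unknown symbols reset to the start
        for pat in out[state]:
            hits.append((pos - len(pat) + 1, pat))
    return hits


# GAATTC and GGATCC are the EcoRI and BamHI recognition sites;
# AATT is included only to show an overlapping match.
delta, out = build_dfa(["GAATTC", "GGATCC", "AATT"])
hits = scan(delta, out, "TTGAATTCGGATCCA")
# hits == [(3, "AATT"), (2, "GAATTC"), (8, "GGATCC")]
```

Each subject symbol costs one table lookup, which is where the single-pass O(m) scan comes from.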

While working for U.C. Berkeley in December 1986, Gish sped up the FASTP program [7] (later known as FASTA [8]) of William R. Pearson and David J. Lipman by 2- to 3-fold without altering the results. When the performance modifications were communicated to Pearson and Lipman, Gish further suggested that a DFA (rather than a lookup table) would yield faster k-tuple identification and improve the overall speed of the program by perhaps as much as 10% in some cases; however, the authors deemed such a marginal improvement, even in the best case, not worth the added code complexity. Gish also envisioned at this time a centralized search service, wherein all nucleotide sequences from GenBank would be maintained in memory to eliminate I/O bottlenecks (and stored in compressed form to conserve memory), with clients invoking FASTN searches remotely via the Internet.
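
For readers unfamiliar with the lookup-table approach mentioned above, the following is a simplified sketch of k-tuple matching in general, not the FASTP source code. It indexes every k-tuple of the query in a table, then scans the subject and counts matches per diagonal (subject offset minus query offset). The function names and the dictionary-based table are assumptions chosen for readability.

```python
from collections import defaultdict


def build_lookup(query, k):
    """Map each k-tuple in the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(table, subject, k):
    """Count k-tuple matches on each diagonal (subject index minus query index)."""
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


table = build_lookup("ACGTAC", 3)
counts = diagonal_hits(table, "TTACGTACGG", 3)
# counts == {-2: 1, 2: 4, 6: 1}; diagonal 2 holds the strongest match
```

A DFA such as the one Gish proposed would replace the per-position table lookup with a state transition, which is the substitution the paragraph above describes.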

Gish's earliest contributions to BLAST were made while working at the NCBI, starting in July 1989. Even in early prototypes, BLAST was typically much faster than FASTA. Gish recognized the potential added benefit in this application of using a DFA for word-hit recognition. He reworked his earlier DFA code into a flexible form that he incorporated into all BLAST search modes. His other contributions to BLAST include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; parallel processing; memory-mapped I/O; the use of sentinel bytes and sentinel words at the start and end of sequences to improve the speed of word-hit extension; the original implementations of BLASTX, [9] TBLASTN [4] and TBLASTX (unpublished); the transparent use of external (plug-in) programs such as seg, xnu, and dust to mask low-complexity regions in query sequences at run time; the NCBI BLAST E-mail Service with optional public key-encrypted communications; the NCBI Experimental BLAST Network Service; and the NCBI non-redundant (nr) protein and nucleotide sequence databases, typically updated on a daily basis with all data from GenBank, Swiss-Prot, and the PIR. Gish developed the first BLAST API, which was used in EST [10] annotation and Entrez data production, as well as in the NCBI BLAST version 1.4 application suite (Gish, unpublished). Gish was also the creator of and project manager for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker). First opened to outside users in December 1989, the NCBI Experimental BLAST Network Service, running the latest BLAST software on SMP hardware against the latest releases of the major sequence databases, quickly established the NCBI as a convenient, one-stop shop for sequence similarity searching.
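
Of the contributions listed above, compressed nucleotide storage is the easiest to illustrate briefly. The sketch below packs four bases into each byte at two bits per base. The bit order, the A/C/G/T code assignment, and the function names are assumptions for illustration only; the article does not describe the actual BLAST database layout, and this sketch does not handle ambiguity codes such as N.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code assignment
BASES = "ACGT"


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte.
    Returns (packed_bytes, original_length)."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)  # first base occupies the high-order bits
        packed[i // 4] |= CODE[base] << shift
    return bytes(packed), len(seq)


def unpack(packed, length):
    """Recover the nucleotide string from its packed form."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


packed, n = pack("GATTACA")
# 7 bases fit in 2 bytes; unpack(packed, n) returns "GATTACA"
```

The original length has to be stored alongside the bytes because the final byte may be only partly filled.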

At Washington University in St. Louis, Gish revolutionized similarity searching by developing the first BLAST suite of programs to combine rapid gapped sequence alignment with statistical evaluation methods appropriate for gapped alignment scores. The resulting search programs were significantly more sensitive but only marginally slower than ungapped BLAST, due to a novel application of the BLAST dropoff score X during gapped alignment extension. Sensitivity of gapped BLAST was further improved by the novel application of Karlin-Altschul Sum statistics [11] to the evaluation of multiple, gapped alignment scores in all BLAST search modes. Sum statistics were originally developed analytically for the evaluation of multiple, ungapped alignment scores. The empirical use of Sum statistics in the treatment of gapped alignment scores was validated in collaboration with Stephen Altschul, from 1994 to 1995. In May 1996, WU-BLAST version 2.0 with gapped alignments was publicly released in the form of a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after having forked in 1994). Little NIH funding was received for his WU-BLAST development, with an average of 20% FTE starting in November 1995, and ending shortly after the September 1997 release of the NCBI gapped BLAST (“blastall”). As an option to WU-BLAST, Gish implemented a faster, more memory-efficient and more sensitive two-hit BLAST algorithm than was used by the NCBI software for many years. In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. This was also the first time any BLAST package introduced a new database format transparently to existing users, without abandoning support for prior formats, as a result of abstracting the database I/O functions away from the data analysis functions.
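
The role of the dropoff score X can be seen in a small sketch. For simplicity it shows ungapped extension only, whereas the paragraph above concerns the application of X during gapped extension; it is not WU-BLAST code. The function name and the +5/-4 match/mismatch scores are assumptions.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Extend an alignment rightward without gaps.
    Stops once the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_at_best_score)."""
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # further extension is unlikely to recover
    return best, best_len


# A single mismatch is tolerated when X is generous...
extend_right("ACGTAACGT", "ACGTCACGT", 0, 0, x_drop=10)  # (36, 9)
# ...but ends the extension when X is small.
extend_right("ACGTAACGT", "ACGTCACGT", 0, 0, x_drop=3)   # (20, 4)
```

A larger X explores more of the sequence at greater cost, while a smaller X terminates unpromising extensions early; this trade-off between sensitivity and speed is the one at stake in the passage above.
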
WU-BLAST with XDF was the first BLAST suite to support indexed retrieval of NCBI standard FASTA-format sequence identifiers (including the entire range of NCBI identifiers); the first to allow retrieval of individual sequences in part or in whole, natively, translated or reverse-complemented; and the first able to dump the entire contents of a BLAST database back into human-readable FASTA format. In 2000, unique support for reporting of links (consistent sets of HSPs; also called chains in some later software packages) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the expected longest intron in the species of interest) and with the distance limitation entering into the calculation of E-values. Between 2001 and 2003, Gish improved the speed of the DFA code used in WU-BLAST. Gish also proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST); implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to aid analysis of segmented query sequences from shotgun sequencing assemblies; and directed use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking genome sequences for repetitive elements and low-complexity sequences (the MaskerAid [12] package for RepeatMasker). With doctoral student Miao Zhang, Gish directed development of EXALIN, [13] which significantly improved the accuracy of spliced alignment predictions by a novel approach that combined information from donor and acceptor splice site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally utilize the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.

In 2008, Gish founded Advanced Biocomputing, LLC, where he continues to improve and support the AB-BLAST package. [citation needed]

References

  1. Gish, Warren Richard (1988). I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (PhD thesis). University of California, Berkeley. ProQuest 303669506.
  2. Warren Gish publications indexed by Microsoft Academic
  3. Warren Gish at DBLP Bibliography Server
  4. Altschul, S.; Gish, W.; Miller, W.; Myers, E.; Lipman, D. (1990). "Basic Local Alignment Search Tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. S2CID 14441902.
  5. Sense from Sequences: Stephen F. Altschul on Bettering BLAST
  6. Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM. 18 (6): 333–340. doi:10.1145/360825.360855. S2CID 207735784.
  7. Lipman, D. J.; Pearson, W. R. (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–1441. Bibcode:1985Sci...227.1435L. doi:10.1126/science.2983426. PMID 2983426.
  8. Pearson, W. R.; Lipman, D. J. (1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–2448. Bibcode:1988PNAS...85.2444P. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.
  9. Gish, W.; States, D. J. (1993). "Identification of protein coding regions by database similarity search". Nature Genetics. 3 (3): 266–272. doi:10.1038/ng0393-266. PMID 8485583. S2CID 15295142.
  10. Boguski, M. S.; Lowe, T. M.; Tolstoshev, C. M. (1993). "dbEST--database for 'expressed sequence tags'". Nature Genetics. 4 (4): 332–333. doi:10.1038/ng0893-332. PMID 8401577. S2CID 40138950.
  11. Karlin, S.; Altschul, S. F. (1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America. 90 (12): 5873–5877. Bibcode:1993PNAS...90.5873K. doi:10.1073/pnas.90.12.5873. PMC 46825. PMID 8390686.
  12. Bedell, J. A.; Korf, I.; Gish, W. (2000). "MaskerAid: A performance enhancement to RepeatMasker". Bioinformatics. 16 (11): 1040–1041. doi:10.1093/bioinformatics/16.11.1040. PMID 11159316.
  13. Zhang, M.; Gish, W. (2005). "Improved spliced alignment from an information theoretic approach". Bioinformatics. 22 (1): 13–20. doi:10.1093/bioinformatics/bti748. PMID 16267086.