PatternHunter


PatternHunter is a commercially available homology search software package that uses sequence alignment techniques. It was developed in 2002 by three scientists: Bin Ma, John Tromp and Ming Li. [1] :440 They set out to solve a problem that many investigators face in genomics and proteomics: such studies rely heavily on homology searches that find short seed matches and then extend them into longer alignments. Describing homologous genes is an essential part of most evolutionary studies and is crucial to understanding the evolution of gene families and the relationship between domains and families. [2] :7 Homologous genes can be studied effectively only with search tools that find similar regions, or local alignments, between two protein or nucleic acid sequences. [3] :15 Homology is quantified by scores derived from matching sequences, together with “mismatch and gap scores”. [4] :164


Development

In comparative genomics, for example, it is necessary to compare huge chromosomes such as those of the human genome. However, the rapid growth of genomic data creates a dilemma for existing homology search methods: enlarging the seed size lowers sensitivity, while reducing the seed size slows the computation. Several sequence alignment programs have been developed to determine homology between genes. These include FASTA, the BLAST family, QUASAR, MUMmer, SENSEI, SIM, and REPuter. [1] :440 Most of them use the Smith–Waterman alignment technique, which compares bases against other bases but is too slow. BLAST improves on this technique by finding short, exact seed matches that it later extends into longer alignments. [5] :737 However, when dealing with long sequences, these techniques are extremely slow and require considerable memory. SENSEI is more efficient than the other methods, but it is limited to ungapped alignments. Megablast, on the other hand, produces poor-quality output and does not scale well to large sequences. Techniques such as MUMmer and QUASAR employ suffix trees, which are designed to handle exact matches, so these methods apply only to the comparison of highly similar sequences. All of these problems call for a fast, reliable tool that can handle all types of sequences efficiently without consuming excessive computer resources.

Approach

PatternHunter uses multiple seeds (short search strings) with optimized spacing between their positions. Seed-based searches are extremely fast because they examine homology only in places where hits are found. The sensitivity of a seed is strongly influenced by how its matching positions are spaced. Large seeds are unable to find isolated homologies, whereas small ones generate numerous random hits that slow the computation. PatternHunter balances the two by choosing an optimal spacing. It uses k (k = 11) non-consecutive letters as a seed, in contrast with BLAST, which uses k consecutive letters. The first stage of a PatternHunter analysis is a filtering phase in which the program looks for matches at the k non-consecutive positions specified by the most advantageous pattern. [6] :11 The second stage is the alignment phase, which is identical to that of BLAST. In addition, PatternHunter can use more than one seed at a time, which raises the sensitivity of the tool without reducing its speed.
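The filtering phase can be illustrated with a minimal Python sketch. It is not PatternHunter's actual implementation: the function names (`seed_key`, `build_index`, `find_hits`) are hypothetical, and the pattern `111010010100110111` (weight 11, span 18) is the seed reported in the original PatternHunter paper rather than something stated in this article. A `1` marks a position that must match and a `0` marks a position that is ignored.

```python
from collections import defaultdict

# Assumption: weight-11, span-18 spaced seed from the PatternHunter paper.
# '1' = position must match, '0' = position is ignored.
SEED = "111010010100110111"


def seed_key(seq, start, seed=SEED):
    """Return the letters of seq under the '1' positions of the seed."""
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")


def build_index(seq, seed=SEED):
    """Map every seed key found in seq to the list of its start offsets."""
    index = defaultdict(list)
    for start in range(len(seq) - len(seed) + 1):
        index[seed_key(seq, start, seed)].append(start)
    return index


def find_hits(query, target, seed=SEED):
    """Return (query_offset, target_offset) pairs that agree at all '1' positions."""
    index = build_index(target, seed)
    hits = []
    for q in range(len(query) - len(seed) + 1):
        for t in index.get(seed_key(query, q, seed), []):
            hits.append((q, t))
    return hits
```

With the default seed, a query that differs from the target only at ignored positions still produces a hit, whereas a contiguous seed such as `"1" * 11` requires eleven consecutive matching letters. Each hit would then be passed to the alignment phase for extension.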

Speed

PatternHunter analyzes all types of sequences quickly. On a modern computer, it takes a few seconds to handle prokaryotic genomes, minutes to process Arabidopsis thaliana sequences and several hours to process a human chromosome. [1] :440 Compared with other tools, PatternHunter is approximately a hundred times faster than BLAST and Megablast. [7] It is about 3000 times faster than a Smith–Waterman algorithm. In addition, the program has a user-friendly interface that allows users to customize the search parameters.

Sensitivity

PatternHunter can attain optimal sensitivity while retaining the same speed as a conventional BLAST search.
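The effect of seed shape on sensitivity can be estimated with a small simulation. The sketch below is illustrative only: the function name `hit_probability`, the region length of 64, the 70% identity level and the trial count are assumptions chosen for the example, and the spaced pattern is the weight-11 seed from the original PatternHunter paper. Each trial draws a random homologous region as a sequence of match/mismatch positions and checks whether the seed fits entirely on matches somewhere in it.

```python
import random


def hit_probability(seed, length=64, identity=0.7, trials=5000, rng_seed=0):
    """Monte Carlo estimate of the chance that a seed hits a random region.

    A region is modelled as `length` independent positions, each a match
    with probability `identity`. The seed hits if, at some offset, every
    '1' position of the seed lies on a match.
    """
    rng = random.Random(rng_seed)  # fixed seed keeps the estimate repeatable
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = 0
    for _ in range(trials):
        region = [rng.random() < identity for _ in range(length)]
        if any(all(region[s + i] for i in care)
               for s in range(length - span + 1)):
            hits += 1
    return hits / trials


contiguous = hit_probability("1" * 11)          # BLAST-style consecutive seed
spaced = hit_probability("111010010100110111")  # spaced seed of equal weight
print(f"contiguous: {contiguous:.3f}  spaced: {spaced:.3f}")
```

Both seeds have weight 11, so they produce random hits at a similar rate, but under these assumptions the spaced seed should report a noticeably higher hit probability. This is the sense in which spacing raises sensitivity without a matching loss of speed.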

Specifications

PatternHunter is written in Java. Consequently, the program runs in any Java 1.4 environment. [7]

Future advances

Homology search is a time-consuming procedure. Challenges remain in DNA-DNA searches as well as translated DNA-protein searches because of the vast size of the databases and the small size of the queries. PatternHunter has been upgraded to PatternHunter II, which speeds up DNA-protein searches a hundredfold without altering the sensitivity. There are also plans to improve PatternHunter so that it attains the high sensitivity of the Smith–Waterman algorithm at BLAST speed. A novel translated PatternHunter intended to speed up tBLASTx is also in development. [4] :174

Related Research Articles

Bioinformatics Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Sequence alignment process in bioinformatics that aligns (identifies equivalent sites within) molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

National Center for Biotechnology Information Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others. Since the development of methods of high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of an organism from which the new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by the study of the similarities between the compared sequences. Nowadays, there are many tools and techniques that provide the sequence comparisons and analyze the alignment product to understand its biology.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be either of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

BioJava is an open-source software project dedicated to providing Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the ability to be translated. An ORF is a continuous stretch of codons that begins with a start codon and ends at a stop codon. An ATG codon within the ORF may indicate where translation starts. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation. In eukaryotic genes with multiple exons, introns are removed and exons are then joined together after transcription to yield the final mRNA for protein translation. In the context of gene finding, the start-stop definition of an ORF therefore only applies to spliced mRNAs, not genomic DNA, since introns may contain stop codons and/or cause shifts between reading frames. An alternative definition says that an ORF is a sequence that has a length divisible by three and is bounded by stop codons. This more general definition can also be useful in the context of transcriptomics and/or metagenomics, where start and/or stop codon may not be present in the obtained sequences. Such an ORF corresponds to parts of a gene rather than the complete gene.

Smith–Waterman algorithm

The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Multiple sequence alignment Alignment of more than two molecular sequences

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments.

Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.

YASS is a free pairwise sequence alignment program for nucleotide sequences; that is, it can search for similarities between DNA or RNA sequences. YASS accepts nucleotide sequences in either plain text or the FASTA format, and its output formats include the BLAST tabular output. YASS uses several transition-constrained spaced seed k-mers, which allow considerably improved sensitivity. YASS can be used locally on a user's machine, or as SaaS on the YASS web server, which produces a browser-based dot-plot.

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and Mac OS.

The UCSC Genome Browser is an on-line, and downloadable, genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized to support fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.

Non-coding RNAs have been discovered using both experimental and bioinformatic approaches. Bioinformatic approaches can be divided into three main categories. The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs. The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties. Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

References

  1. Ma, Bin; Tromp, John; Li, Ming (2002). "PatternHunter: Faster and More Sensitive Homology Search". Bioinformatics. 18 (3): 440–445. doi:10.1093/bioinformatics/18.3.440. PMID 11934743.
  2. Joseph, Jacob M. (2012). On the identification and investigation of homologous gene families, with particular emphasis on the accuracy of multidomain families (PDF) (PhD). Carnegie Mellon University.
  3. Pevsner, Jonathan (2009). Bioinformatics and Functional Genomics (2nd ed.). New Jersey: Wiley Blackwell. ISBN 9780470451489.
  4. Li, M.; Ma, B.; Kisman, D.; Tromp, J. (2003). "PatternHunter II: Highly sensitive and fast homology search". Genome Informatics. International Conference on Genome Informatics. 14: 164–175. PMID 15706531.
  5. Pearson, W. R. (1991). "Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms". Genomics. 11 (3): 635–650. doi:10.1016/0888-7543(91)90071-L. PMID 1774068.
  6. Zhang, Louxin. "Sequence Database Search Techniques I: Blast and PatternHunter tools" (PDF). Retrieved 6 December 2013.
  7. "PatternHunter Brochure" (PDF). Archived from the original (PDF) on 11 December 2013. Retrieved 30 November 2013.