Fast statistical alignment

Last updated
FSA
Developer(s) Robert Bradley (UC Berkeley), Colin Dewey (UW Madison), Lior Pachter (UC Berkeley)
Stable release
1.5.2
Operating system UNIX, Linux, Mac
Type Bioinformatics tool
Licence Open source

Fast statistical alignment or FSA is a multiple sequence alignment program for aligning many proteins, RNAs, or long genomic DNA sequences. Along with MUSCLE and MAFFT, FSA is one of the few sequence alignment programs which can align datasets of hundreds or thousands of sequences. FSA uses a different optimization criterion which allows it to more reliably identify non-homologous sequences than these other programs, although this increased accuracy comes at the cost of decreased speed.

Contents

FSA is currently being used for multiple projects, including sequencing new worm genomes and analyzing in vivo transcription factor binding in flies.

Input/Output

This program accepts sequences in FASTA format and outputs alignments in FASTA format or Stockholm format.

Algorithm

The algorithm for the aligning of the input sequences has 4 core components.

Pair Hidden Markov Model for generating posterior probabilities

The algorithm starts first by determining posterior probabilities of alignment between any two random sequences from the pool of sequences being aligned. The posterior probabilities for each column reinforce the prediction of alignment probability between a sequence pair and also filter out columns that can be unreliably aligned. These probabilities also allow for the prediction and estimate of homology between any sequence pair. A standard five-state pair hidden Markov model (Pair HMM) is used to determine these posterior probabilities of alignment for any two input sequences. The Pair HMM model uses two sets of Delete (D) and Insert (I) states to account for symbol deletion and insertions between two aligned sequences, but it can also have three states without a significant loss of accuracy.

Since the number of pair-wise comparisons needed to determine the posterior probability distributions of any two pairs of sequences is computationally expensive and quadratic in the amount of sequences that are being aligned, it is decreased by using a randomized approach inspired by the Erdos-Renyi theory of random graphs. This significantly reduces the runtimes of datasets and the computational cost of running the multiple alignments.

Merging Probabilities

The posterior probabilities for each column in the sequence pairs are sorted using a weighting function that uses a steepest-ascent algorithm.

Sequence Annealing

Most existing programs that run multiple sequence alignment algorithms are based on progressive alignment where the process starts with a "null alignment," a state where none of the sequences have been aligned. The pool of sequences is then aligned either through pairwise comparisons or through an alignment of a pair of partial alignments of subsequences. This process can cause issues in alignment because the resulting multiple sequence alignment can and will heavily depend on the sequences that are aligned at the start. There is no realigning of previously aligned sequences that could correct the MSA.

FSA uses the sequence annealing technique to overcome this issue. The sorted posterior probabilities are used with the sequence annealing technique to generate a multiple alignment. The technique finds the alignment between two sequences that minimizes the expected distance to the truth. In this case, the distance between two sequences is the number of columns in which the character from one sequence is not homologous to the character in the same column in the second sequence.

The sequence annealing technique, by determining an alignment with the minimum expected distance to the truth, conversely finds the alignment with the maximum expected accuracy. The accuracy of an alignment depends on a "true" alignment as reference and indicates the fraction of columns where the sequences are homologous. This accuracy is then used as an objective function that starts with the unaligned sequences (null alignment) and aligns characters in different columns based on the increasing accuracy of an alignment.

Ordering of the alignment

FSA aligns multiple sequences based on homology within columns instead of strictly a consideration of indels and substitutions. As such, FSA considers alignments to be equivalent if for every position along the sequences in both alignments, the same statement about homology can be made. For example, when considering pairwise comparisons, if there is a gap at a specific position in two alignments, then it can be said that the two sequences being compared are not homologous at said position. This can result in alignments where gap-open events can differ and yet still be considered equivalent. As such, FSA chooses to output the alignment in which there is a minimum amount of "gap openings".

Parallelization

To handle overly large datasets, FSA is able to divide the work of running all necessary pairwise comparisons and alignments to different processors. This is handled by using a "fixed-size chunking" strategy that distributes the pairwise comparisons to each available processor in chunks. Each processor is therefore able to run the posterior probability calculation on a chunk of pairwise comparisons before merging the collected data back to a single processor for sequence annealing.

Visualization

The results of the multiple sequence alignment under FSA can be displayed under the FSA's own GUI. The GUI is able to display and color label different measures of alignment quality on the columns of characters within the alignment itself. The five different measures that can be observed and are approximated under the FSA model include accuracy, sensitivity, certainty, specificity, and consistency.

Comparisons to other programs

FSA has been benchmarked against multiple alignment databases for protein (SABmark 1.65 and BAliBASE 3), RNA (BRAliBase 2.1 and Consanmix80), and DNA sequences. These benchmarks were conducted alongside other popular alignment programs such as ClustalW, MAFFT, MUSCLE, T-Coffee, and so on. Overall, at the time that FSA's abstract and research paper was received for review, FSA outperformed most alignment programs in accuracy and positive predictive values with sensitivities being on-par with the better-performing programs such as MAFFT and ProbConsRNA. Runtime comparisons were also conducted by comparing the timings to align 16S ribosomal sequences. MAFFT performed the alignment faster than the other alignment programs while MUSCLE and FSA (using a 3-state HMM and with disabled iterative refinement) were the next fastest programs.

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

<span class="mw-page-title-main">Structural alignment</span> Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

<span class="mw-page-title-main">Clustal</span> Bioinformatics computer program

Clustal is a computer program used for multiple sequence alignment in bioinformatics. The software and its algorithms have gone through several iterations, with ClustalΩ (Omega) being the latest version as of 2011. It is available as standalone software, via a web interface, and through a server hosted by the European Bioinformatics Institute.

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It differs from the homology modeling method of structure prediction as it is used for proteins which do not have their homologous protein structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for those proteins which do. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein which one wishes to model.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

T-Coffee is a multiple sequence alignment software using a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine multiple sequences alignments obtained previously and in the latest versions can use structural information from PDB files (3D-Coffee). It has advanced features to evaluate the quality of the alignments and some capacity for identifying occurrence of motifs (Mocca). It produces alignment in the aln format (Clustal) by default, but can also produce PIR, MSF, and FASTA format. The most common input formats are supported.

In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.

<span class="mw-page-title-main">Homology modeling</span> Method of protein structure prediction using other known proteins

Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein. Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure.

AMAP is a multiple sequence alignment program based on sequence annealing. This approach consists of building up the multiple alignment one match at a time, thereby circumventing many of the problems of progressive alignment. The AMAP parameters can be used to tune the sensitivity-specificity tradeoff.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

In bioinformatics, Stemloc is an open source software for multiple RNA sequence alignment and RNA structure prediction based on probabilistic models of RNA structure known as Pair stochastic context-free grammars. Stemloc attempts to simultaneously predict and align the structure of RNA sequences with an improved time and space cost compared to previous methods with the same motive. The resulting software implements constrained versions of the Sankoff algorithm by introducing both fold and alignment constraints, which reduces processor and memory usage and allows for larger RNA sequences to be analyzed on commodity hardware. Stemloc was written in 2004 by Ian Holmes.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

RAPTOR is protein threading software used for protein structure prediction. It has been replaced by RaptorX, which is much more accurate than RAPTOR.

CS-BLAST (Context-Specific BLAST) is a tool that searches a protein sequence that extends BLAST, using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities on each query sequence from short windows on the query sequences. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST is the context-specific analog of PSI-BLAST, which computes the mutation profile with substitution probabilities and mixes it with the query profile. CSI-BLAST is the context specific analog of PSI-BLAST. Both of these programs are available as web-server and are available for free download.

Phyre and Phyre2 are free web-based services for protein structure prediction. Phyre is among the most popular methods for protein structure prediction having been cited over 1500 times. Like other remote homology recognition techniques, it is able to regularly generate reliable protein models when other widely used methods such as PSI-BLAST cannot. Phyre2 has been designed to ensure a user-friendly interface for users inexpert in protein structure prediction methods. Its development is funded by the Biotechnology and Biological Sciences Research Council.

The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. Sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from the functions of proteins with similar sequences. HHsearch and HHblits are two main programs in the package and the entry point to its search function, the latter being a faster iteration. HHpred is an online server for protein structure prediction that uses homology information from HH-suite.

PatternHunter is a commercially available homology search instrument software that uses sequence alignment techniques. It was initially developed in the year 2002 by three scientists: Bin Ma, John Tramp and Ming Li. These scientists were driven by the desire to solve the problem that many investigators face during studies that involve genomics and proteomics. These scientists realized that such studies greatly relied on homology studies that established short seed matches that were subsequently lengthened. Describing homologous genes was an essential part of most evolutionary studies and was crucial to the understanding of the evolution of gene families, the relationship between domains and families. Homologous genes could only be studied effectively using search tools that established like portions or local placement between two proteins or nucleic acid sequences. Homology was quantified by scores obtained from matching sequences, “mismatch and gap scores”.

References

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009). "Fast Statistical Alignment". PLOS Computational Biology. 5 (5): e1000392. Bibcode:2009PLSCB...5E0392B. doi: 10.1371/journal.pcbi.1000392 . PMC   2684580 . PMID   19478997.

Schwartz AS, Pachter L (2007) Multiple alignment by sequence annealing. Bioinformatics 23: e24-9.

Eddy SR. Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 1995;3:114-20. PMID 7584426.