T-Coffee

Developer(s) Cédric Notredame, Centro de Regulacio Genomica (CRG) - Barcelona
Stable release 13.45.0.4846264 / 15 October 2020
Preview release 13.45.33.7d7e789 / 23 December 2020
Operating system UNIX, Linux, MS-Windows, Mac OS X
Type Bioinformatics tool
Licence GPL
Website www.tcoffee.org

T-Coffee (Tree-based Consistency Objective Function for Alignment Evaluation) is multiple sequence alignment software that uses a progressive approach. [1] It generates a library of pairwise alignments to guide the multiple sequence alignment. It can also combine previously computed multiple sequence alignments and, in recent versions, can use structural information from PDB files (3D-Coffee). It has advanced features for evaluating alignment quality and some capacity for identifying occurrences of motifs (Mocca). By default it produces alignments in the aln (Clustal) format, but it can also produce PIR, MSF, and FASTA output. The most common input formats (FASTA, PIR) are supported.

Algorithm

The T-Coffee algorithm has two main features. The first is its use of heterogeneous data sources, which provides a simple and flexible means of generating multiple alignments: T-Coffee can compute a multiple alignment from a library generated using a mixture of local and global pair-wise alignments. [1]

The second is the optimization method, which finds the multiple alignment that best fits the pair-wise alignments in the input library. It uses a progressive strategy comparable to the one used in ClustalW, which has the advantage of being fast and robust. The library information is used at every step of the progressive alignment, so that each step takes into account the alignments between all pairs of sequences. [1]

Generating a primary library of alignments

The library contains a set of pair-wise alignments between all of the sequences to be aligned; these alignments are not required to be consistent with one another. The library holds information on each of the N(N-1)/2 sequence pairs, where N is the number of sequences. Two alignment sources are used for each pair of sequences, one local and one global. [1]
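
The number of pairs grows quadratically with the number of sequences. As a minimal illustration (the sequence names are made up for this sketch and are not part of T-Coffee), the pairs a primary library has to cover can be enumerated as follows:

```python
from itertools import combinations

def sequence_pairs(names):
    """Return every unordered pair of sequence names, N(N-1)/2 in total."""
    return list(combinations(names, 2))

names = ["seqA", "seqB", "seqC", "seqD"]  # hypothetical sequence names
pairs = sequence_pairs(names)
print(len(pairs))  # 6, i.e. 4 * 3 / 2
```

With 100 sequences the same function returns 4,950 pairs, each of which needs both a global and a local alignment source.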

Global alignments are constructed by running ClustalW on the sequences two at a time, giving one full-length alignment between each pair of sequences. The local alignments are the ten top-scoring non-intersecting local alignments for each pair, gathered using the Lalign program of the FASTA package. [1]

Each alignment is represented in the library as a list of pair-wise residue matches, and each matched pair acts as a constraint. Some constraints are more relevant than others: the importance of a constraint depends on how likely it is to be correct. When the multiple alignment is computed, a weighting scheme gives priority to the most reliable residue pairs. [1]
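
The representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not T-Coffee's own code: the function name and data layout are hypothetical, and the weighting (every residue pair inherits the percent identity of the pairwise alignment it came from) is an assumption of this sketch based on the scheme reported in [1].

```python
def alignment_to_constraints(name_a, row_a, name_b, row_b):
    """Turn one gapped pairwise alignment into weighted residue-pair constraints.

    Returns a dict mapping ((name_a, i), (name_b, j)) -> weight, where i and j
    are 1-based residue positions in the ungapped sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = []
    identical = 0
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-":
            i += 1
        if b != "-":
            j += 1
        if a != "-" and b != "-":
            matches.append(((name_a, i), (name_b, j)))
            if a == b:
                identical += 1
    # Assumed weighting: percent identity over the aligned (ungapped) columns.
    weight = round(100 * identical / len(matches)) if matches else 0
    return {pair: weight for pair in matches}

constraints = alignment_to_constraints("A", "GARFIELD", "B", "GAR-IELS")
print(len(constraints))                   # 7 aligned residue pairs
print(constraints[(("A", 5), ("B", 4))])  # 86 (6 identical out of 7)
```

The gapped column contributes no constraint, and every remaining pair carries the same weight because the weight is a property of the whole pairwise alignment.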

Combination of the libraries

Efficient combination of local and global alignment information is an important feature of T-Coffee. The ClustalW and Lalign primary libraries are combined by simple addition: a residue pair that appears in both libraries is merged into a single entry whose weight is the sum of the two weights, and a pair that appears in only one library gets a new entry of its own. Pairs with a weight of zero are not represented. [1] Each pair of aligned residues in the library can then be assigned a weight that reflects how consistently those residues align with the rest of the library. This step is called library extension.
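
Both steps can be illustrated with a small Python sketch. The merge follows the addition rule stated above. The extension assumes the triplet approach described in [1], in which a residue pair x-y gains, for every residue z of a third sequence aligned to both, the smaller of the two weights w(x, z) and w(z, y). All names, the data layout, and the toy weights are hypothetical.

```python
from collections import defaultdict

def pair_key(x, y):
    """Canonical key for an unordered residue pair; residues are (name, pos)."""
    return (x, y) if x <= y else (y, x)

def merge_libraries(*libraries):
    """Add libraries together: duplicated pairs get the sum of their weights."""
    merged = defaultdict(int)
    for lib in libraries:
        for (x, y), weight in lib.items():
            merged[pair_key(x, y)] += weight
    # Pairs with a weight of zero are not represented.
    return {pair: w for pair, w in merged.items() if w > 0}

def extend_library(library):
    """Triplet extension (assumed form): add min(w(x,z), w(z,y)) for each z."""
    neighbours = defaultdict(dict)
    for (x, y), w in library.items():
        neighbours[x][y] = w
        neighbours[y][x] = w
    extended = dict(library)
    for z, partners in neighbours.items():
        items = sorted(partners.items())
        for idx, (x, wx) in enumerate(items):
            for y, wy in items[idx + 1:]:
                if x[0] == y[0]:
                    continue  # two residues of the same sequence are never paired
                key = pair_key(x, y)
                extended[key] = extended.get(key, 0) + min(wx, wy)
    return extended

a1, b1, c1 = ("A", 1), ("B", 1), ("C", 1)
global_lib = {pair_key(a1, b1): 44, pair_key(a1, c1): 77, pair_key(b1, c1): 100}
local_lib = {pair_key(a1, b1): 44}

primary = merge_libraries(global_lib, local_lib)
extended = extend_library(primary)
print(primary[pair_key(a1, b1)])   # 88
print(extended[pair_key(a1, b1)])  # 165 = 88 + min(77, 100)
```

In this toy case the A-B pair is supported directly (weight 88) and indirectly through sequence C, so its extended weight is higher than its primary weight.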

Comparisons with other alignment software

While the default output is a Clustal-like format, it differs enough from the output of ClustalW/X that many programs supporting the Clustal format cannot read it. ClustalX can import T-Coffee output, so the simplest fix is usually to import T-Coffee's output into ClustalX and then re-export it. Another possibility is to request the strict ClustalW output format with the option "-output=clustalw_aln".
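
As a rough illustration of the second approach, the following Python sketch builds a command line that passes the option quoted above. It assumes the executable is installed as t_coffee and accepts the input file as its first argument; the file name sequences.fasta and both helper functions are hypothetical.

```python
import subprocess

def tcoffee_command(fasta_path, output_format="clustalw_aln"):
    """Build the argument list for a T-Coffee run with an explicit output format."""
    return ["t_coffee", fasta_path, f"-output={output_format}"]

def run_tcoffee(fasta_path, output_format="clustalw_aln"):
    """Run T-Coffee; raises CalledProcessError if the program exits with an error."""
    return subprocess.run(tcoffee_command(fasta_path, output_format), check=True)

cmd = tcoffee_command("sequences.fasta")  # hypothetical input file
print(" ".join(cmd))  # t_coffee sequences.fasta -output=clustalw_aln
```

Calling run_tcoffee("sequences.fasta") would execute the command, provided T-Coffee is installed and the file exists.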

A distinctive feature of T-Coffee is its ability to combine different methods and data types. In its latest version, T-Coffee can combine protein sequences and structures, as well as RNA sequences and structures. It can also run and combine the output of the most common sequence and structure alignment packages.

T-Coffee comes with a sophisticated sequence reformatting utility named seq_reformat. Extensive documentation is available online.

Variations

Evaluation

TCS (Transitive Consistency Score) is an extended version of the T-Coffee scoring scheme. [14] It uses T-Coffee libraries of pairwise alignments to evaluate any third-party MSA. Pairwise projections can be produced using fast or slow methods, allowing a trade-off between speed and accuracy. TCS has been shown to give significantly better estimates of structural accuracy and more accurate phylogenetic trees than Heads-or-Tails, GUIDANCE, Gblocks, and trimAl. [15]
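
The general idea of scoring a third-party MSA against a library can be sketched as follows. This is not the published TCS formula: it is a deliberately simplified stand-in that scores each column by the fraction of its aligned residue pairs that appear in a library with non-zero weight. The function name, the data layout, and the toy data are hypothetical.

```python
from itertools import combinations

def column_support(msa, library):
    """Score each MSA column by the share of its residue pairs found in a library.

    msa maps sequence name -> gapped row (all rows the same length).
    library maps ((name_x, pos_x), (name_y, pos_y)) -> weight, with
    name_x sorting before name_y and 1-based ungapped positions.
    Columns with fewer than two residues get None.
    """
    names = sorted(msa)
    length = len(msa[names[0]])
    counters = {name: 0 for name in names}
    scores = []
    for col in range(length):
        residues = []
        for name in names:
            if msa[name][col] != "-":
                counters[name] += 1
                residues.append((name, counters[name]))
        pairs = list(combinations(residues, 2))
        if not pairs:
            scores.append(None)
            continue
        supported = sum(1 for p in pairs if library.get(p, 0) > 0)
        scores.append(supported / len(pairs))
    return scores

msa = {"A": "GA-", "B": "GAR", "C": "-AR"}  # toy alignment
library = {
    (("A", 1), ("B", 1)): 100,
    (("A", 2), ("B", 2)): 100,
    (("B", 2), ("C", 1)): 50,
}
scores = column_support(msa, library)
print(scores[0], scores[2])  # 1.0 0.0
```

Columns whose residue pairs are well supported by the library score close to 1, while columns the library does not support score close to 0, which is the kind of signal that can be used to filter unreliable positions.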

References

  1. Notredame C, Higgins DG, Heringa J (2000-09-08). "T-Coffee: A novel method for fast and accurate multiple sequence alignment". J Mol Biol. 302 (1): 205–217. doi:10.1006/jmbi.2000.4042. PMID 10964570. S2CID 10189971.
  2. Wallace, Iain M.; O'Sullivan, Orla; Higgins, Desmond G.; Notredame, Cedric (2006). "M-Coffee: combining multiple sequence alignment methods with T-Coffee". Nucleic Acids Research. 34 (6): 1692–1699. doi:10.1093/nar/gkl091. ISSN 1362-4962. PMC 1410914. PMID 16556910.
  3. Armougom, Fabrice; Moretti, Sébastien; Poirot, Olivier; Audic, Stéphane; Dumas, Pierre; Schaeli, Basile; Keduas, Vladimir; Notredame, Cedric (2006-07-01). "Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee". Nucleic Acids Research. 34 (Web Server issue): W604–608. doi:10.1093/nar/gkl092. ISSN 1362-4962. PMC 1538866. PMID 16845081.
  4. Zhang, Yang; Skolnick, Jeffrey (2005). "TM-align: a protein structure alignment algorithm based on the TM-score". Nucleic Acids Research. 33 (7): 2302–2309. doi:10.1093/nar/gki524. ISSN 1362-4962. PMC 1084323. PMID 15849316.
  5. Konagurthu, Arun S.; Whisstock, James C.; Stuckey, Peter J.; Lesk, Arthur M. (2006-08-15). "MUSTANG: a multiple structural alignment algorithm". Proteins. 64 (3): 559–574. doi:10.1002/prot.20921. ISSN 1097-0134. PMID 16736488. S2CID 14074658.
  6. Sun, Zheng; Tian, Weidong (2012). "SAP--a sequence mapping and analyzing program for long sequence reads alignment and accurate variants discovery". PLOS ONE. 7 (8): e42887. Bibcode:2012PLoSO...742887S. doi:10.1371/journal.pone.0042887. ISSN 1932-6203. PMC 3413671. PMID 22880129.
  7. Wilm, Andreas; Higgins, Desmond G.; Notredame, Cédric (May 2008). "R-Coffee: a method for multiple alignment of non-coding RNA". Nucleic Acids Research. 36 (9): e52. doi:10.1093/nar/gkn174. ISSN 1362-4962. PMC 2396437. PMID 18420654.
  8. Moretti, Sébastien; Wilm, Andreas; Higgins, Desmond G.; Xenarios, Ioannis; Notredame, Cédric (2008-07-01). "R-Coffee: a web server for accurately aligning noncoding RNA sequences". Nucleic Acids Research. 36 (Web Server issue): W10–13. doi:10.1093/nar/gkn278. ISSN 1362-4962. PMC 2447777. PMID 18483080.
  9. Di Tommaso P, Moretti S, Xenarios I, Orobitg M, Montanyola A, Chang JM, Taly JF, Notredame C (Jul 2011). "T-Coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension". Nucleic Acids Res. 39 (Web Server issue): W13–7. doi:10.1093/nar/gkr245. PMC 3125728. PMID 21558174.
  10. Kemena C, Notredame C (2009-10-01). "Upcoming challenges for multiple sequence alignment methods in the high-throughput era". Bioinformatics. 25 (19): 2455–65. doi:10.1093/bioinformatics/btp452. PMC 2752613. PMID 19648142.
  11. Chang JM, Di Tommaso P, Taly JF, Notredame C (2012-03-28). "Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee". BMC Bioinformatics. 13: S1. doi:10.1186/1471-2105-13-S4-S1. PMC 3303701. PMID 22536955.
  12. Erb I, González-Vallinas JR, Bussotti G, Blanco E, Eyras E, Notredame C (Apr 2012). "Use of ChIP-Seq data for the design of a multiple promoter-alignment method". Nucleic Acids Res. 40 (7): e52. doi:10.1093/nar/gkr1292. PMC 3326335. PMID 22230796.
  13. "T-Coffee Server". tcoffee.crg.eu. Retrieved 2023-12-26.
  14. Chang, JM; Di Tommaso, P; Lefort, V; Gascuel, O; Notredame, C (1 July 2015). "TCS: a web server for multiple sequence alignment evaluation and phylogenetic reconstruction". Nucleic Acids Research. 43 (W1): W3-6. doi:10.1093/nar/gkv310. PMC 4489230. PMID 25855806.
  15. Chang, JM; Di Tommaso, P; Notredame, C (Jun 2014). "TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction". Molecular Biology and Evolution. 31 (6): 1625–37. doi:10.1093/molbev/msu117. PMID 24694831.