List of software to detect low complexity regions in proteins

Computational methods can study protein sequences to identify regions with low complexity, which can have particular properties regarding their function and structure.

| Name | Last update | Usage | Description | Open source? | Reference |
|---|---|---|---|---|---|
| SAPS | 1992 | downloadable / web | It describes several protein sequence statistics for the evaluation of distinctive characteristics of residue content and arrangement in primary structures. | yes | [1] |
| SEG | 1993 | downloadable | It is a two-pass algorithm: it first identifies the LCRs and then performs local optimization, masking the LCRs with Xs. | yes | [2] |
| fLPS | 2017 | downloadable / web | It can readily handle very large protein data sets, such as might come from metagenomics projects. It is useful in searching for proteins with similar CBRs and for making functional inferences about CBRs for a protein of interest. | yes | [3] |
| CAST | 2000 | web | It identifies LCRs using dynamic programming. | no | [4] |
| SIMPLE | 2002 | downloadable / web | It facilitates the quantification of the amount of simple sequence in proteins and determines the type of short motifs that show clustering above a certain threshold. | yes | [5] |
| 0j.py | 2001 | on request | A tool for demarcating low complexity protein domains. | no | [6] |
| DSR | 2003 | on request | It calculates complexity using reciprocal complexity. | no | [7] |
| ScanCom | 2003 | on request | It calculates the compositional complexity using the linguistic complexity measure. | no | [8] |
| CARD | 2005 | on request | It is based on the complexity analysis of subsequences delimited by pairs of identical, repeating subsequences. | no | [9] |
| BIAS | 2006 | downloadable / web | It uses discrete scan statistics that provide a highly accurate multiple test correction to compute analytical estimates of the significance of each compositionally biased segment. | yes | [10] |
| GBA | 2006 | on request | A graph-based algorithm that constructs a graph of the sequence. | no | [11] |
| SubSeqer | 2008 | web | A graph-based approach for the detection and identification of repetitive elements in low-complexity sequences. | no | [12] |
| ANNIE | 2009 | web | It automates the protein sequence analysis process. | no | [13] |
| LPS-annotate | 2011 | on request | This algorithm defines compositional bias through a thorough search for lowest-probability subsequences (LPSs) and serves as a workbench of tools with which molecular biologists can generate hypotheses and inferences about the proteins they are investigating. | no | [14] |
| LCR-eXXXplorer | 2015 | web | A web platform to search, visualize and share data for low complexity regions in protein sequences. LCR-eXXXplorer offers tools for displaying LCRs from the UniProt/SwissProt knowledgebase, in combination with other relevant protein features, predicted or experimentally verified. Users may also perform queries against a custom-designed sequence/LCR-centric database. | no | [15] |
| XNU | 1993 | downloadable | It uses the PAM120 scoring matrix for the calculation of complexity. | yes | [16] |
| AlcoR | 2022 | downloadable | A compression-based and alignment-free tool for detecting low-complexity regions in biological data. | yes | [17] |
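
Several of the tools listed above score stretches of sequence with a complexity measure and mask whatever falls below a cutoff. The following is a minimal sketch of that general idea, using Shannon entropy over a sliding window. It is not a reimplementation of SEG or of any other listed tool; the function names, the window length of 12 and the 2.2-bit threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy, in bits per residue, of the composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged 0-based, half-open (start, end) intervals that are
    covered by at least one window whose entropy is <= threshold.

    `window` and `threshold` are illustrative values (assumptions).
    """
    regions = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


def mask(seq: str, regions) -> str:
    """Replace every residue inside the given regions with 'X'."""
    chars = list(seq)
    for start, end in regions:
        chars[start:end] = "X" * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical sequence: a poly-Q tract flanked by diverse residues.
    example = "ACDEFGHIKLMN" + "Q" * 20 + "PRSTVWYACDEF"
    found = low_complexity_windows(example)
    print(found)
    print(mask(example, found))
```

Because every window that overlaps the biased tract strongly enough is also reported, the masked interval extends somewhat into the flanks; real tools refine the boundaries in a second step, as the SEG entry describes.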

For a comprehensive review of the various methods and tools, see [18].

In addition, a web meta-server named PLAtform of TOols for LOw COmplexity (PlaToLoCo) has been developed for the visualization and annotation of low complexity regions in proteins. [19] PlaToLoCo integrates and collects the output of five different state-of-the-art tools for discovering LCRs and provides functional annotations such as domain detection, transmembrane segment prediction, and calculation of amino acid frequencies. Furthermore, the union or intersection of the results that the tools return for a query sequence can be obtained.
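
Combining the predictions of several tools by union or intersection can be sketched as follows. This is only an illustration of the set logic, not PlaToLoCo's code or API; the function names, the tool labels and the 1-based inclusive coordinate convention are assumptions.

```python
def to_positions(regions):
    """Expand 1-based, inclusive (start, end) regions into a set of positions."""
    return {pos for start, end in regions for pos in range(start, end + 1)}


def to_regions(positions):
    """Collapse a set of positions back into sorted, inclusive (start, end) regions."""
    regions = []
    for pos in sorted(positions):
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos  # Extends the current run of positions.
        else:
            regions.append([pos, pos])
    return [tuple(region) for region in regions]


def combine(predictions, mode="union"):
    """Combine per-tool LCR predictions for one query sequence.

    `predictions` maps a tool label to its list of (start, end) regions.
    `mode` is "union" (reported by any tool) or "intersection" (by all tools).
    """
    if mode not in ("union", "intersection"):
        raise ValueError("mode must be 'union' or 'intersection'")
    position_sets = [to_positions(regions) for regions in predictions.values()]
    if not position_sets:
        return []
    if mode == "union":
        combined = set.union(*position_sets)
    else:
        combined = set.intersection(*position_sets)
    return to_regions(combined)


if __name__ == "__main__":
    # Hypothetical outputs of two tools for the same protein.
    predictions = {"tool_a": [(5, 20)], "tool_b": [(15, 30), (40, 45)]}
    print(combine(predictions, "union"))         # [(5, 30), (40, 45)]
    print(combine(predictions, "intersection"))  # [(15, 20)]
```

The union favors sensitivity, since a residue is kept if any tool flags it, whereas the intersection keeps only residues on which all tools agree.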

A neural network web server named LCR-hound has been developed to predict the function of prokaryotic and eukaryotic LCRs based on their amino acid or di-amino acid content. [20]
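
The kind of input such a predictor works from, amino acid and di-amino acid content, can be computed as in the sketch below. It shows only one plausible way to turn an LCR into a numeric feature vector; the function name, the feature ordering and the normalization are assumptions, and nothing here reproduces LCR-hound's actual encoding or network.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # The 20 standard residues.


def composition_features(region: str):
    """Return 20 amino acid frequencies followed by 400 di-amino acid
    (adjacent residue pair) frequencies for one low complexity region.
    """
    n = len(region)
    if n == 0:
        raise ValueError("region must not be empty")

    # Fraction of the region made up by each amino acid.
    aa_counts = Counter(region)
    aa_freqs = [aa_counts[aa] / n for aa in AMINO_ACIDS]

    # Fraction of overlapping adjacent pairs made up by each di-amino acid.
    pair_counts = Counter(region[i:i + 2] for i in range(n - 1))
    n_pairs = max(n - 1, 1)  # Avoids division by zero for a single residue.
    pair_freqs = [
        pair_counts[a + b] / n_pairs
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]
    return aa_freqs + pair_freqs


if __name__ == "__main__":
    features = composition_features("QQQQPQQQ")  # Hypothetical LCR.
    print(len(features))  # 420
```

A vector of this form could then be passed to any classifier; which features and model a given server actually uses has to be taken from its own publication.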

References

  1. Brendel V, Bucher P, Nourbakhsh IR, Blaisdell BE, Karlin S (15 Mar 1992). "Methods and algorithms for statistical analysis of protein sequences". Proc Natl Acad Sci U S A. 89 (6): 2002–2006. Bibcode:1992PNAS...89.2002B. doi:10.1073/pnas.89.6.2002. PMC 48584. PMID 1549558.
  2. Wootton JC, Federhen S (June 1993). "Statistics of local complexity in amino acid sequences and sequence databases". Computers and Chemistry. 17 (2): 149–163. doi:10.1016/0097-8485(93)85006-X.
  3. Harrison PM (13 Nov 2017). "fLPS: Fast discovery of compositional biases for the protein universe". BMC Bioinformatics. 18 (1): 476. doi:10.1186/s12859-017-1906-3. PMC 5684748. PMID 29132292.
  4. Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA (Oct 2000). "CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts". Bioinformatics. 16 (10): 915–922. doi:10.1093/bioinformatics/16.10.915. PMID 11120681.
  5. Albà MM, Laskowski RA, Hancock JM (May 2002). "Detecting cryptically simple protein sequences using the SIMPLE algorithm". Bioinformatics. 18 (5): 672–678. doi:10.1093/bioinformatics/18.5.672. PMID 12050063.
  6. Wise MJ (2001). "0j.py: a software tool for low complexity proteins and protein domains". Bioinformatics. 17 (Suppl 1): S288–S295. doi:10.1093/bioinformatics/17.suppl_1.s288. PMID 11473020.
  7. Wan H, Li L, Federhen S, Wootton JC (2003). "Discovering simple regions in biological sequences associated with scoring schemes". J Comput Biol. 10 (2): 171–185. doi:10.1089/106652703321825955. PMID 12804090.
  8. Nandi T, Dash D, Ghai R, B-Rao C, Kannan K, Brahmachari SK, Ramakrishnan C, Ramachandran S (2003). "A new algorithm for detecting low-complexity regions in protein sequences". J Biomol Struct Dyn. 20 (5): 657–668. doi:10.1080/07391102.2003.10506882. PMID 12643768. S2CID 45635217.
  9. Shin SW, Kim SM (15 Jan 2005). "A novel complexity measure for comparative analysis of protein sequences from complete genomes". Bioinformatics. 21 (2): 160–170. doi:10.1093/bioinformatics/bth497. PMID 15333459.
  10. Kuznetsov IB, Hwang S (1 May 2006). "A novel sensitive method for the detection of user-defined compositional bias in biological sequences". Bioinformatics. 22 (9): 1055–1063. doi:10.1093/bioinformatics/btl049. PMID 16500936.
  11. Li X, Kahveci T (15 Dec 2006). "A Novel algorithm for identifying low-complexity regions in a protein sequence". Bioinformatics. 22 (24): 2980–2987. doi:10.1093/bioinformatics/btl495. PMID 17018537.
  12. He D, Parkinson J (1 Apr 2008). "SubSeqer: a graph-based approach for the detection and identification of repetitive elements in low-complexity sequences". Bioinformatics. 24 (7): 1016–1017. doi:10.1093/bioinformatics/btn073. PMID 18304932.
  13. Ooi HS, Kwo CY, Wildpaner M, Sirota FL, Eisenhaber B, Maurer-Stroh S, Wong WC, Schleiffer A, Eisenhaber F, Schneider G (Jul 2009). "ANNIE: integrated de novo protein sequence annotation". Nucleic Acids Res. 37 (Web server issue): W435–W440. doi:10.1093/nar/gkp254. PMC 2703921. PMID 19389726.
  14. Harbi D, Kumar M, Harrison PM (6 Jan 2011). "LPS-annotate: complete annotation of compositionally biased regions in the protein knowledgebase". Database (Oxford). 2011: baq031. doi:10.1093/database/baq031. PMC 3017391. PMID 21216786.
  15. Kirmitzoglou I, Promponas VJ (1 Jul 2015). "LCR-eXXXplorer: a web platform to search, visualize and share data for low complexity regions in protein sequences". Bioinformatics. 31 (13): 2208–2210. doi:10.1093/bioinformatics/btv115. PMC 4481844. PMID 25712690.
  16. Claverie JM, States D (June 1993). "Information enhancement methods for large scale sequence analysis". Computers and Chemistry. 17 (2): 191–201. doi:10.1016/0097-8485(93)85010-a.
  17. Silva JM, Qi W, Pinho AJ, Pratas D (2022-12-28). "AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data". GigaScience. 12. doi:10.1093/gigascience/giad101. ISSN 2047-217X. PMC 10716826. PMID 38091509.
  18. Mier P, Paladin L, Tamana S, Petrosian S, Hajdu-Soltész B, Urbanek A, Gruca A, Plewczynski D, Grynberg M, Bernadó P, Gáspári Z (2020-03-23). "Disentangling the complexity of low complexity proteins". Briefings in Bioinformatics. 21 (2): 458–472. doi:10.1093/bib/bbz007. ISSN 1467-5463. PMC 7299295. PMID 30698641.
  19. Jarnot P, Ziemska-Legiecka J, Dobson L, Merski M, Mier P, Andrade-Navarro MA, Hancock JM, Dosztányi Z, Paladin L, Necci M, Piovesan D (2020-07-02). "PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins". Nucleic Acids Research. 48 (W1): W77–W84. doi:10.1093/nar/gkaa339. ISSN 0305-1048. PMC 7319588. PMID 32421769.
  20. Ntountoumi C, Vlastaridis P, Mossialos D, Stathopoulos C, Iliopoulos I, Promponas V, Oliver SG, Amoutzias GD (2019-11-04). "Low complexity regions in the proteins of prokaryotes perform important functional roles and are highly conserved". Nucleic Acids Research. 47 (19): 9998–10009. doi:10.1093/nar/gkz730. ISSN 0305-1048. PMC 6821194. PMID 31504783.