Computational methods use different properties of protein sequences and structures to find, characterize and annotate protein tandem repeats.
Name | Last update | Usage | Result types | Description | Open source? | Repeat type specific | Reference |
---|---|---|---|---|---|---|---|
ard2 | 2013 | web | annotated sequence | Neural network | no | alpha-solenoid | [1] |
DECIPHER | 2021 | downloadable | | Detection of tandem and/or interspersed repeats by orthology (DetectRepeats function in R package) | yes | no | [2] |
TRUST | 2004 | downloadable / web | unit position, multiple sequence alignment | Ab-initio determination of internal repeats in proteins. Exploits transitivity of alignments | ? | no | [3] |
T-REKS | 2009 | downloadable / web | repeat unit | Clustering of lengths between identical short strings by using a K-means algorithm | yes | no | [4] |
HHRepID | 2008 | downloadable / web | Identification of repeats in protein sequences via HMM-HMM comparison to exploit evolutionary information in the form of multiple sequence alignments of homologs | no | [5] | |||
RADAR | 2018 | downloadable / web | unit position, multiple sequence alignment | RADAR identifies short composition biased and gapped approximate repeats, as well as complex repeat architectures involving many different types of repeats in a query sequence | yes | no | [6] [7] |
XSTREAM | 2007 | web | unit position, different periods, multiple sequence alignment | data-mining tool designed to efficiently identify Tandem Repeat (TR) patterns in biological sequence data. The program uses a seed-extension strategy coupled with several post-processing algorithms to analyze FASTA-formatted protein or nucleotide sequences | no | no | [8] |
TRED | 2007 | downloadable | | definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats | no | no | |
TRAL | 2015 | downloadable | | Detects tandem repeats with both de novo software and sequence profile HMMs; statistical significance analysis of putative tandem repeats, and filtering of redundant predictions | yes | | [9] |
DOTTER | 1995 | downloadable | | Graphical dotplot program for detailed comparison of two sequences | | | [10] |
0J.PY | | | | | | | [11] |
PTRStalker | 2012 | downloadable | unit position, multiple sequence alignment | Ab-initio detection of fuzzy tandem repeats in protein amino acid sequences. | no | [12] | ||
TRDistiller | 2015 | | | Rapid sorting of tandem repeat (TR)- and no-TR-containing sequences | | | [13] |
REPRO | 2000 | web | | Repeats detection based on a variation of the Smith-Waterman local alignment strategy followed by a graph-based iterative clustering procedure | no | no | [14] |
REP | 2000 | web | | | no | yes | |

Name | Last update | Usage | Result types | Description | Open source? | Repeat type specific | Reference |
---|---|---|---|---|---|---|---|
TAPO | 2016 | web | unit position | Uses periodicities of atomic coordinates and other types of structural representation, including strings generated by conformational alphabets, residue contact maps, and arrangements of vectors of secondary structure elements | no | no | [15] |
SYMD | 2014 | galaxy | repeat geometry | Detects internally symmetric protein structures through an “alignment scan” procedure in which a protein structure is aligned to itself after circularly permuting the second copy by all possible numbers of residues | no | no | [16] |
RAPHAEL | 2012 | web | repeat probability | Reduces the three-dimensional structure to a wave function, then determines periodicity information | no | no | [17] |
CE-SYMM | 2021 | | | | | | |
ProSTRIP | 2010 | | | | | | |
DAVROS | 2004 | | | | | | |
RQA | 2009 | | | | | | |
OPAAS | 2006 | | | | | | |
Gplus | 2009 | | | | | | |
REUPRED | 2016 | | | | | | |
ConSole | 2015 | | | | | | |
RepeatsDB-Lite | 2017 | | | | | | |
PRIGSA | 2014 | | | | | | |
Swelfe | 2008 | | | | | | |
Frustratometer | 2021 | | | | | | |

BioCreAtIvE is a community-wide effort for evaluating information extraction and text mining developments in the biological domain.
Multifactor dimensionality reduction (MDR) is a statistical approach, also used in automated machine learning approaches, for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify nonadditive interactions among discrete variables that influence a binary outcome and is considered a nonparametric and model-free alternative to traditional statistical methods such as logistic regression.
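The core pooling step of MDR can be illustrated with a minimal Python sketch. It assumes rows of discrete attribute values and binary labels (1 for cases, 0 for controls) with both classes present; the function name `mdr_best_pair` is hypothetical, only attribute pairs are searched, and the cross-validation that practical MDR analyses normally use is omitted.

```python
from collections import defaultdict
from itertools import combinations


def mdr_best_pair(rows, labels):
    """Return (accuracy, (a, b)) for the attribute pair whose pooled
    high-risk / low-risk cells best separate cases (1) from controls (0)."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    best = None
    for a, b in combinations(range(len(rows[0])), 2):
        # Count controls and cases in every cell of the two-attribute table.
        counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
        for row, y in zip(rows, labels):
            counts[(row[a], row[b])][y] += 1
        # A cell is "high risk" when its case:control ratio exceeds the overall ratio.
        high_risk = {cell for cell, (ctrl, case) in counts.items()
                     if case > threshold * ctrl}
        # Classify each sample by the risk group of its cell and score the result.
        correct = sum(((row[a], row[b]) in high_risk) == bool(y)
                      for row, y in zip(rows, labels))
        accuracy = correct / len(labels)
        if best is None or accuracy > best[0]:
            best = (accuracy, (a, b))
    return best
```

Collapsing the multi-way table into a single high-risk/low-risk variable is what reduces the dimensionality; a purely nonadditive (XOR-like) interaction between two attributes is recovered even though neither attribute is informative on its own.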
An alpha solenoid is a protein fold composed of repeating alpha helix subunits, commonly helix-turn-helix motifs, arranged in antiparallel fashion to form a superhelix. Alpha solenoids are known for their flexibility and plasticity. Like beta propellers, alpha solenoids are a form of solenoid protein domain commonly found in the proteins comprising the nuclear pore complex. They are also common in membrane coat proteins known as coatomers, such as clathrin, and in regulatory proteins that form extensive protein-protein interactions with their binding partners. Examples of alpha solenoid structures binding RNA and lipids have also been described.
Structural and physical properties of DNA provide important constraints on the binding sites formed on surfaces of DNA-binding proteins. Characteristics of such binding sites may be used for predicting DNA-binding sites from the structural and even sequence properties of unbound proteins. This approach has been successfully implemented for predicting the protein–protein interface. Here, this approach is adopted for predicting DNA-binding sites in DNA-binding proteins. The first attempt to use sequence and evolutionary features to predict DNA-binding sites in proteins was made by Ahmad et al. (2004) and Ahmad and Sarai (2005). Some methods use structural information to predict DNA-binding sites and therefore require a three-dimensional structure of the protein, while others use only sequence information and do not require protein structure in order to make a prediction.
DNA binding sites are a type of binding site found in DNA where other molecules may bind. DNA binding sites are distinct from other binding sites in that (1) they are part of a DNA sequence and (2) they are bound by DNA-binding proteins. DNA binding sites are often associated with specialized proteins known as transcription factors, and are thus linked to transcriptional regulation. The sum of DNA binding sites of a specific transcription factor is referred to as its cistrome. DNA binding sites also encompass the targets of other proteins, like restriction enzymes, site-specific recombinases and methyltransferases.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
A supertree is a single phylogenetic tree assembled from a combination of smaller phylogenetic trees, which may have been assembled using different datasets or a different selection of taxa. Supertree algorithms can highlight areas where additional data would most usefully resolve any ambiguities. The input trees of a supertree should behave as samples from the larger tree.
David Tudor Jones is a Professor of Bioinformatics and Head of the Bioinformatics Group at University College London. He is also the director of the Bloomsbury Center for Bioinformatics, a joint research centre between UCL and Birkbeck, University of London, which also provides bioinformatics training and support services to biomedical researchers. As of 2013, he was a member of the editorial boards of PLoS ONE, BioData Mining, Advanced Bioinformatics, Chemical Biology & Drug Design, and Proteins: Structure, Function and Bioinformatics.
Schellman loops are commonly occurring structural features of proteins and polypeptides. Each has six amino acid residues with two specific inter-mainchain hydrogen bonds and a characteristic main chain dihedral angle conformation. The CO group of residue i is hydrogen-bonded to the NH of residue i+5, and the CO group of residue i+1 is hydrogen-bonded to the NH of residue i+4. Residues i+1, i+2, and i+3 have negative φ (phi) angle values and the phi value of residue i+4 is positive. Schellman loops incorporate a three amino acid residue RL nest, in which three mainchain NH groups form a concavity for hydrogen bonding to carbonyl oxygens. About 2.5% of amino acids in proteins belong to Schellman loops. Two websites are available for examining small motifs in proteins: Motivated Proteins and PDBeMotif.
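The hydrogen-bond and phi-angle criteria above translate directly into a check. The following Python sketch is illustrative only: the function name `is_schellman_loop` and the input layout (a list of phi angles in degrees indexed by residue, and a set of `(CO residue, NH residue)` index pairs for mainchain hydrogen bonds) are assumptions, and it does not test for the RL nest.

```python
def is_schellman_loop(i, phi, hbonds):
    """Check whether residues i..i+5 satisfy the Schellman loop criteria.

    phi    -- list of phi angles in degrees, one per residue (assumed input format)
    hbonds -- set of (co_residue, nh_residue) index pairs (assumed input format)
    """
    if i < 0 or i + 5 >= len(phi):
        return False  # the six-residue window does not fit in the chain
    # CO(i) -> NH(i+5) and CO(i+1) -> NH(i+4) hydrogen bonds must both be present.
    has_bonds = (i, i + 5) in hbonds and (i + 1, i + 4) in hbonds
    # Residues i+1, i+2 and i+3 have negative phi; residue i+4 has positive phi.
    negative_phi = all(phi[j] < 0 for j in (i + 1, i + 2, i + 3))
    return has_bonds and negative_phi and phi[i + 4] > 0
```

In practice the phi angles and hydrogen bonds would come from a structure analysis program; here they are simply passed in as plain Python values.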
An array of protein tandem repeats is defined as several adjacent copies having the same or similar sequence motifs. These periodic sequences are generated by internal duplications in both coding and non-coding genomic sequences. Repetitive units of protein tandem repeats are considerably diverse, ranging from the repetition of a single amino acid to domains of 100 or more residues.
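As a toy illustration of this definition, the Python sketch below reports runs of exactly repeated adjacent units, from single amino acids up to a chosen unit length. It is not the algorithm of any tool listed above: the function names are hypothetical, and real detectors also handle substitutions and insertions or deletions between copies, which this sketch ignores.

```python
def _is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    return (unit + unit).find(unit, 1) == len(unit)


def find_exact_tandem_repeats(seq, min_copies=2, max_unit=10):
    """Return (start, unit, copies) tuples for runs of exactly repeated units."""
    hits = []
    n = len(seq)
    for unit_len in range(1, max_unit + 1):
        i = 0
        while i + unit_len * min_copies <= n:
            unit = seq[i:i + unit_len]
            if not _is_primitive(unit):
                i += 1  # e.g. "QQ" is already covered by the unit "Q"
                continue
            # Count how many adjacent copies of the unit follow position i.
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
                i += copies * unit_len  # continue after the repeat array
            else:
                i += 1
    return hits
```

For example, a poly-glutamine stretch is reported as a single-residue unit, while a Gly-Ser linker is reported with the two-residue unit `GS`.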
Bloom filters are space-efficient probabilistic data structures used to test whether an element is a member of a set. They require much less space than other data structures for representing sets, at the cost of a non-zero false positive rate on queries. Because different elements can share hash values across the hash functions, querying for an element that was never added may return a positive if other elements with the same hash values have been added to the Bloom filter. Assuming that each hash function selects any index of the Bloom filter with equal probability, the false positive rate is a function of the number of bits, the number of hash functions, and the number of elements in the Bloom filter. This allows the user to manage the risk of a false positive by trading away some of the space benefits of the Bloom filter.
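The dependence of the false positive rate on the number of bits m, the number of hash functions k, and the number of inserted elements n is commonly approximated as (1 - e^(-kn/m))^k. The Python sketch below computes that approximation and pairs it with a toy filter; the class name, the use of one byte per bit, and the derivation of the k indices from a single SHA-256 digest are illustrative assumptions rather than a reference implementation.

```python
import hashlib
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Common approximation (1 - e^(-k*n/m))^k, assuming uniform, independent hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


class BloomFilter:
    """Toy Bloom filter with m bits and k hash-derived indices per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability rather than compactness

    def _indices(self, item: str):
        # Derive k indices from one digest by double hashing (an illustrative choice).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indices(item))
```

Increasing m for a fixed n lowers the estimated false positive rate, which is the space-versus-accuracy trade-off described above.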
A toroid repeat is a protein fold composed of repeating subunits, arranged in circular fashion to form a closed structure.
Computational methods use protein sequence and/or protein structure to predict protein aggregation. The table below shows the main features of software for prediction of protein aggregation.
DIMPL is a bioinformatic pipeline that enables the extraction and selection of bacterial GC-rich intergenic regions (IGRs) that are enriched for structured non-coding RNAs (ncRNAs). The method of enriching bacterial IGRs for ncRNA motif discovery was first reported in the study "Genome-wide discovery of structured noncoding RNAs in bacteria".