Alignment-free sequence analysis

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives to alignment-based approaches. [1]

The emergence of, and need for, the analysis of different types of data generated through biological research gave rise to the field of bioinformatics. [2] Molecular sequence and structure data of DNA, RNA, and proteins, gene expression profiles or microarray data, and metabolic pathway data are some of the major types of data analysed in bioinformatics. Among them, sequence data is increasing at an exponential rate due to the advent of next-generation sequencing technologies. Since the origin of bioinformatics, sequence analysis has remained the major area of research, with a wide range of applications in database searching, genome annotation, comparative genomics, molecular phylogeny and gene prediction. The pioneering approaches for sequence analysis were based on sequence alignment, either global or local, pairwise or multiple. [3] [4] Alignment-based approaches generally give excellent results when the sequences under study are closely related and can be reliably aligned, but when the sequences are divergent, a reliable alignment cannot be obtained and hence the applications of sequence alignment are limited. Another limitation of alignment-based approaches is their computational complexity: they are time-consuming and therefore of limited use when dealing with large-scale sequence data. [5] The volume of sequence data generated by next-generation sequencing technologies poses challenges for alignment-based algorithms in assembly, annotation and comparative studies.

Alignment-free methods

Alignment-free methods can broadly be classified into six categories: a) methods based on k-mer/word frequency, b) methods based on the length of common substrings, c) methods based on the number of (spaced) word matches, d) methods based on micro-alignments, e) methods based on information theory and f) methods based on graphical representation. Alignment-free approaches have been used in sequence similarity searches, [6] clustering and classification of sequences, [7] and more recently in phylogenetics [8] [9] (Figure 1).

Such molecular phylogeny analyses employing alignment-free approaches are said to be part of next-generation phylogenomics. [9] A number of review articles provide in-depth reviews of alignment-free methods in sequence analysis. [1] [10] [11] [12] [13] [14] [15]

The AFproject is an international collaboration to benchmark and compare software tools for alignment-free sequence comparison. [16]

Methods based on k-mer/word frequency

The popular methods based on k-mer/word frequencies include feature frequency profile (FFP), [17] [18] composition vector (CV), [19] [20] return time distribution (RTD), [21] frequency chaos game representation (FCGR) [22] and Spaced Words. [23]

Feature frequency profile (FFP)

The FFP-based method starts by counting each possible k-mer in the sequences (the number of possible k-mers is $4^k$ for nucleotide sequences and $20^k$ for protein sequences). Each k-mer count in a sequence is then normalized by dividing it by the total count of all k-mers in that sequence, which converts each sequence into its feature frequency profile. The pairwise distance between two sequences is then calculated as the Jensen–Shannon (JS) divergence between their respective FFPs. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA.
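The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the published FFP software: the function names `ffp` and `js_divergence` and the toy sequences are assumptions made for the example, profiles are stored sparsely (absent k-mers have frequency zero), and the divergence uses base-2 logarithms.

```python
from collections import Counter
from math import log2

def ffp(seq, k):
    """Feature frequency profile: relative frequency of each k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd

# Hypothetical toy sequences; real applications use whole genomes or proteomes.
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTTCGTAC", "s3": "GGGGCCCCGG"}
profiles = {name: ffp(s, 3) for name, s in seqs.items()}
dist = {(a, b): js_divergence(profiles[a], profiles[b])
        for a in seqs for b in seqs}
```

The resulting `dist` dictionary plays the role of the distance matrix that would be passed to a tree-building method such as neighbor-joining.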

Composition vector (CV)

In this method, the frequency of appearance of each possible k-mer in a given sequence is calculated. The next characteristic step of this method is the subtraction of a random background from these frequencies using a Markov model, which reduces the influence of random neutral mutations and highlights the role of selective evolution. The normalized frequencies are put in a fixed order to form the composition vector (CV) of a given sequence. A cosine distance function is then used to compute the pairwise distance between the CVs of sequences. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA. This method can be extended, by resorting to efficient pattern matching algorithms, to include in the computation of the composition vectors: (i) all k-mers for any value of k, (ii) all substrings of any length up to an arbitrarily set maximum k value, (iii) all maximal substrings, where a substring is maximal if extending it by any character would cause a decrease in its occurrence count. [24] [25]
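A minimal sketch of the background subtraction and cosine distance is given below. The text above does not state the formulas, so the ones used here are an assumption taken from the composition-vector literature: the predicted frequency of a word is p0 = p(prefix) · p(suffix) / p(middle), each component is (p − p0)/p0, and the distance is (1 − cosine correlation)/2. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def composition_vector(seq, k):
    """Background-subtracted k-mer frequencies (intended for k >= 3)."""
    p_k, p_k1, p_k2 = freqs(seq, k), freqs(seq, k - 1), freqs(seq, k - 2)
    cv = {}
    for prefix in p_k1:
        for letter in set(seq):
            word = prefix + letter
            suffix = word[1:]
            if suffix not in p_k1:
                continue  # predicted frequency would be zero; skip the word
            # Markov prediction from the two (k-1)-mers and the shared (k-2)-mer
            p0 = p_k1[prefix] * p_k1[suffix] / p_k2[word[1:-1]]
            cv[word] = (p_k.get(word, 0.0) - p0) / p0
    return cv

def cosine_distance(u, v):
    """(1 - cosine correlation) / 2, so that the value lies in [0, 1]."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return (1 - dot / (norm_u * norm_v)) / 2

cv_a = composition_vector("ACGTACGGTCATTGCA", 3)  # toy sequences
cv_b = composition_vector("GGGTTTACACACGTTT", 3)
d_ab = cosine_distance(cv_a, cv_b)
```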

Return time distribution (RTD)

The RTD-based method does not count the k-mers in sequences; instead it computes the time required for the reappearance of k-mers. The time refers to the number of residues between successive appearances of a particular k-mer. The occurrence of each k-mer in a sequence is thus described in the form of an RTD, which is then summarised using two statistical parameters, the mean (μ) and the standard deviation (σ). Each sequence is therefore represented as a numeric vector of size $2 \cdot 4^k$ containing the μ and σ of the $4^k$ RTDs. The pairwise distance between sequences is calculated using the Euclidean distance measure. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA. A recent approach, Pattern Extraction through Entropy Retrieval (PEER), provides direct detection of the k-mer length and summarises the occurrence interval using entropy.
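The construction of the RTD vector can be illustrated as follows. This is a sketch under stated assumptions, not the published implementation: the return time is taken as the difference between the start positions of successive occurrences, the population standard deviation is used, and k-mers that occur fewer than twice contribute μ = σ = 0. The function names are hypothetical.

```python
from itertools import product
from math import dist
from statistics import mean, pstdev

def rtd_vector(seq, k, alphabet="ACGT"):
    """Vector of (mean, std) of return times for every possible k-mer."""
    positions = {"".join(w): [] for w in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in positions:  # ignore words with ambiguous characters
            positions[word].append(i)
    vec = []
    for word in sorted(positions):
        pos = positions[word]
        # return times: gaps between successive start positions (assumption)
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        vec += [mean(gaps), pstdev(gaps)] if gaps else [0.0, 0.0]
    return vec

def rtd_distance(seq_a, seq_b, k):
    """Euclidean distance between the RTD vectors of two sequences."""
    return dist(rtd_vector(seq_a, k), rtd_vector(seq_b, k))
```

For DNA and k = 2 the vector has 2 · 4² = 32 entries, matching the size given above.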

Frequency chaos game representation (FCGR)

The FCGR methods have evolved from the chaos game representation (CGR) technique, which provides a scale-independent representation for genomic sequences. [26] A CGR can be divided by grid lines such that each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence. Such a representation is termed a frequency chaos game representation (FCGR), and each sequence is converted into its FCGR. The pairwise distance between the FCGRs of sequences can be calculated using the Pearson distance, the Hamming distance or the Euclidean distance. [27]
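The following sketch shows how a CGR trajectory can be binned into an FCGR grid. The corner assignment (A, C, G, T at the four corners of the unit square) is a common convention assumed here, and the function names are hypothetical. After at least k steps, the grid cell of a point is determined by the last k nucleotides, so the grid is a table of k-mer counts.

```python
from math import dist

# Assumed corner convention for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def fcgr(seq, k):
    """Count CGR points in a 2^k x 2^k grid (one cell per k-mer)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # start at the centre of the square
    for i, base in enumerate(seq):
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        if i >= k - 1:  # a full k-mer has been read
            col = min(int(x * size), size - 1)  # clamp against rounding
            row = min(int(y * size), size - 1)
            grid[row][col] += 1
    return grid

def fcgr_distance(seq_a, seq_b, k):
    """Euclidean distance between the flattened FCGR grids."""
    flat_a = [c for row in fcgr(seq_a, k) for c in row]
    flat_b = [c for row in fcgr(seq_b, k) for c in row]
    return dist(flat_a, flat_b)
```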

Spaced-word frequencies

While most alignment-free algorithms compare the word composition of sequences, Spaced Words uses a pattern of match and don't-care positions. The occurrence of a spaced word in a sequence is then defined by the characters at the match positions only, while the characters at the don't-care positions are ignored. Instead of comparing the frequencies of contiguous words in the input sequences, this approach compares the frequencies of the spaced words according to the pre-defined pattern. [23] Note that the pre-defined pattern can be selected by analysis of the variance of the number of matches, [28] the probability of the first occurrence under several models, [29] or the Pearson correlation coefficient between the expected word frequency and the true alignment distance. [30]
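As an illustration, spaced words can be extracted with a binary pattern in which `1` marks a match position and `0` a don't-care position. The function name and the pattern encoding are assumptions made for this sketch.

```python
from collections import Counter

def spaced_word_counts(seq, pattern):
    """Count spaced words: keep only characters at match positions ('1')."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return Counter(
        "".join(seq[start + i] for i in match_pos)
        for start in range(len(seq) - span + 1)
    )

# With pattern "101" the middle character of each window is ignored,
# so ACG and ATG both yield the spaced word AG.
counts_a = spaced_word_counts("ACGT", "101")
counts_b = spaced_word_counts("ATGT", "101")
```

The resulting count tables can be normalized and compared in the same way as contiguous k-mer frequency profiles.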

Methods based on length of common substrings

The methods in this category employ the similarity and differences of substrings in a pair of sequences. These algorithms were mostly used for string processing in computer science. [31]

Average common substring (ACS)

In this approach, for a chosen pair of sequences (A and B, of lengths n and m respectively), the longest substring starting at some position in one sequence (A) that exactly matches a substring at any position in the other sequence (B) is identified. In this way, the lengths of the longest substrings starting at the different positions in sequence A and having exact matches at some position in sequence B are calculated. All these lengths are averaged to derive a measure $L(A,B)$. Intuitively, the larger $L(A,B)$ is, the more similar the two sequences are. To account for differences in sequence length, $L(A,B)$ is normalized [i.e. $L(A,B)/\log(m)$]. This gives the similarity measure between the sequences.

In order to derive a distance measure, the inverse of the similarity measure is taken and a correction term is subtracted from it to ensure that $d(A,A)$ will be zero. Thus

$$d(A,B) = \frac{\log m}{L(A,B)} - \frac{\log n}{L(A,A)}.$$

This measure $d(A,B)$ is not symmetric, so one has to compute $d_s(A,B) = d_s(B,A) = \frac{d(A,B) + d(B,A)}{2}$, which gives the final ACS measure between the two strings (A and B). [32] The subsequence/substring search can be efficiently performed by using suffix trees. [33] [34] [35]
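A naive sketch of the ACS computation is shown below. It writes L(A,B) for the average longest-match length, uses d(A,B) = log(m)/L(A,B) − log(n)/L(A,A) for the one-sided distance, and averages the two directions. It uses direct substring search instead of suffix trees, so it is only practical for short sequences; the function names are hypothetical, and the sequences are assumed to share at least one character (otherwise L(A,B) is zero and the distance is undefined).

```python
from math import log

def avg_common_substring(a, b):
    """L(A,B): mean over positions i of the longest prefix of a[i:] found in b."""
    total = 0
    for i in range(len(a)):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)

def one_sided_distance(a, b):
    """d(A,B) = log(m)/L(A,B) - log(n)/L(A,A), with n = |A| and m = |B|."""
    return (log(len(b)) / avg_common_substring(a, b)
            - log(len(a)) / avg_common_substring(a, a))

def acs_distance(a, b):
    """Symmetric ACS distance: average of the two one-sided distances."""
    return (one_sided_distance(a, b) + one_sided_distance(b, a)) / 2
```

A production implementation would replace the inner `in` test with a suffix tree or suffix array lookup, as noted above.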

k-mismatch average common substring approach (kmacs)

This approach is a generalization of the ACS approach. To define the distance between two DNA or protein sequences, kmacs estimates, for each position i of the first sequence, the longest substring starting at i and matching a substring of the second sequence with up to k mismatches. It defines the average of these values as a measure of similarity between the sequences and turns this into a symmetric distance measure. Kmacs does not compute exact k-mismatch substrings, since this would be computationally too costly, but approximates such substrings. [36]

Mutation distances (Kr)

This approach, which is closely related to ACS, calculates the number of substitutions per site between two DNA sequences using the shortest absent substrings (termed shustrings). [37]

Length distribution of k-mismatch common substrings

This approach uses the program kmacs [36] to calculate longest common substrings with up to k mismatches for a pair of DNA sequences. The phylogenetic distance between the sequences can then be estimated from a local maximum in the length distribution of the k-mismatch common substrings. [38]

Methods based on the number of (spaced) word matches

$D_2^S$ and $D_2^*$

These approaches are variants of the $D_2$ statistic, which counts the number of k-mer matches between two sequences. They improve on the simple $D_2$ statistic by taking the background distribution of the compared sequences into account. [39]
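The basic k-mer match count that these variants build on, commonly written D2, can be sketched as the sum over all words of the product of their counts in the two sequences. The function name is hypothetical, and the background-corrected variants are not implemented here.

```python
from collections import Counter

def d2(seq_a, seq_b, k):
    """Number of k-mer matches: sum over words of count_a(w) * count_b(w)."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(c * counts_b[w] for w, c in counts_a.items())
```

Each pair of positions (one in each sequence) that starts the same k-mer contributes one match, which is why the counts are multiplied.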

MASH

This is an extremely fast method that uses the MinHash bottom sketch strategy for estimating the Jaccard index of the k-mer sets of two input sequences. That is, it estimates the ratio of k-mer matches to the total number of k-mers of the sequences. This can be used, in turn, to estimate the evolutionary distance between the compared sequences, measured as the number of substitutions per sequence position since the sequences evolved from their last common ancestor. [40]
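The bottom-sketch idea can be illustrated as follows. This is a simplified sketch, not the MASH program: it hashes k-mers with SHA-1 from the standard library, ignores reverse complements, and uses hypothetical function names. The conversion from a Jaccard estimate j to a distance, D = −(1/k) · ln(2j / (1 + j)), is the formula given in the MASH publication and is not stated in the text above.

```python
import hashlib
from math import log

def kmer_hashes(seq, k):
    """Set of 64-bit hash values of all k-mers in seq."""
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }

def bottom_sketch(seq, k, s):
    """The s smallest k-mer hash values of seq."""
    return sorted(kmer_hashes(seq, k))[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Fraction of the s smallest hashes of the union found in both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in merged) / len(merged)

def mash_distance(j, k):
    """Substitutions per site estimated from a Jaccard index j."""
    if j == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return -log(2 * j / (1 + j)) / k
```

When the sketch size s exceeds the number of distinct k-mers, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for speed and memory.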

Slope-Tree

This approach calculates a distance value between two protein sequences based on the decay of the number of k-mer matches as k increases. [41]

Slope-SpaM

This method calculates the number $N_k$ of k-mer or spaced-word matches (SpaM) for different values of k, the word length or the number of match positions in the underlying pattern, respectively. The slope of an affine-linear function that depends on $N_k$ is calculated to estimate the Jukes-Cantor distance between the input sequences. [42]

Skmer

Skmer calculates distances between species from unassembled sequencing reads. Similar to MASH, it uses the Jaccard index on the sets of k-mers from the input sequences. In contrast to MASH, the program is still accurate for low sequencing coverage, so it can be used for genome skimming. [43]

Methods based on micro-alignments

Strictly speaking, these methods are not alignment-free. They use simple gap-free micro-alignments in which sequences are required to match at certain pre-defined positions. The residues aligned at the remaining positions of the micro-alignments, where mismatches are allowed, are then used for phylogeny inference.

Co-phylog

This method searches for so-called structures that are defined as pairs of k-mer matches between two DNA sequences that are one position apart in both sequences. The two k-mer matches are called the context, and the position between them is called the object. Co-phylog then defines the distance between two sequences as the fraction of such structures for which the two nucleotides in the object are different. The approach can be applied to unassembled sequencing reads. [44]

andi

andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches. Such word matches can be efficiently found using suffix arrays. The gap-free alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences. The resulting distance estimates are accurate for up to around 0.6 substitutions per position. [45]

Filtered Spaced-Word Matches (FSWM)

FSWM uses a pre-defined binary pattern P representing so-called match positions and don't-care positions. For a pair of input DNA sequences, it then searches for spaced-word matches with respect to P, i.e. for local gap-free alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Spurious low-scoring spaced-word matches are discarded, and evolutionary distances between the input sequences are estimated based on the nucleotides aligned to each other at the don't-care positions of the remaining, homologous spaced-word matches. [46] FSWM has been adapted to estimate distances based on unassembled NGS reads; this version of the program is called Read-SpaM. [47]

Prot-SpaM

Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences. [48]

Multi-SpaM

Multi-SpaM (Multiple Spaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison. [49] Given a binary pattern P of match positions and don't-care positions, the program searches for P-blocks, i.e. local gap-free four-way alignments with matching nucleotides at the match positions of P and possible mismatches at the don't-care positions. Such four-way alignments are randomly sampled from a set of input genome sequences. For each P-block, an unrooted tree topology is calculated using RAxML. [50] The program Quartet MaxCut is then used to calculate a supertree from these trees.

Methods based on information theory

Information theory has provided successful methods for alignment-free sequence analysis and comparison. The existing applications of information theory include global and local characterization of DNA, RNA and proteins, ranging from estimating genome entropy to motif and region classification. It also holds promise in gene mapping, next-generation sequencing analysis and metagenomics. [51]

Base–base correlation (BBC)

Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation:

$$T_{ij}(K) = \sum_{\ell=1}^{K} P_{ij}(\ell) \cdot \log_2\left(\frac{P_{ij}(\ell)}{P_i P_j}\right)$$

Here $P_i$ and $P_j$ denote the probabilities of bases i and j in the genome, and $P_{ij}(\ell)$ indicates the probability of bases i and j at distance $\ell$ in the genome. The parameter K indicates the maximum distance between the bases i and j. The variation in the values of the 16 parameters reflects variation in the genome content and length. [52] [53] [54]

Information correlation and partial information correlation (IC-PIC)

IC-PIC (information correlation and partial information correlation) based method employs the base correlation property of DNA sequence. IC and PIC were calculated using following formulas,

The final vector is obtained as follows:

which defines the range of distance between bases. [55]

The pairwise distance between sequences is calculated using the Euclidean distance measure. The distance matrix thus obtained can be used to construct a phylogenetic tree using clustering algorithms such as neighbor-joining or UPGMA.

Compression

Compression-based methods rely on effective approximations to Kolmogorov complexity, for example Lempel-Ziv complexity. In general, compression-based methods use the mutual information between the sequences. This is expressed in terms of conditional Kolmogorov complexity, that is, the length of the shortest self-delimiting program required to generate a string given prior knowledge of the other string. This measure is related to measuring k-words in a sequence, as they can easily be used to generate the sequence. It is sometimes a computationally intensive method. The theoretical basis for the Kolmogorov complexity approach was laid by Bennett, Gacs, Li, Vitanyi, and Zurek (1998), who proposed the information distance. [56] Because Kolmogorov complexity is incomputable, it is approximated by compression algorithms; the better the compressor, the better the approximation. Li, Badger, Chen, Kwong, Kearney, and Zhang (2001) used a non-optimal but normalized form of this approach, [57] the optimal normalized form was given by Li, Chen, Li, Ma, and Vitanyi (2003), [58] and it was treated more extensively and proven by Cilibrasi and Vitanyi (2005). [59] Otu and Sayood (2003) used the Lempel-Ziv complexity method to construct five different distance measures for phylogenetic tree construction. [60]
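A compression-based distance can be sketched with a general-purpose compressor. The example below uses the normalized compression distance, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; this is the form associated with Cilibrasi and Vitanyi. The choice of `zlib` as compressor, the function names and the random toy sequences are assumptions made for illustration; dedicated DNA compressors give better approximations.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude stand-in for complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical toy data: two independent random DNA sequences.
rng = random.Random(0)
seq_a = "".join(rng.choices("ACGT", k=2000)).encode()
seq_b = "".join(rng.choices("ACGT", k=2000)).encode()

same = ncd(seq_a, seq_a)       # close to 0: the copy compresses almost for free
different = ncd(seq_a, seq_b)  # close to 1: little shared information
```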

Context modeling compression

In context modeling compression, the next-symbol predictions of one or more statistical models are combined, or compete, to yield a prediction that is based on events recorded in the past. The algorithmic information content derived from each symbol prediction can be used to compute algorithmic information profiles in a time proportional to the length of the sequence. The process has been applied to DNA sequence analysis. [61]

Methods based on graphical representation

Iterated maps

The use of iterated maps for sequence analysis was first introduced by H. J. Jeffrey in 1990, [26] when he proposed applying the Chaos Game to map genomic sequences into a unit square. That report named the procedure Chaos Game Representation (CGR). However, only three years later this approach was dismissed by N. Goldman as a projection of a Markov transition table. [62] This objection was overruled by the end of that decade, when the opposite was found to be the case: CGR bijectively maps Markov transitions into a fractal, order-free (degree-free) representation. [63] The realization that iterated maps provide a bijective map between the symbolic space and numeric space led to the identification of a variety of alignment-free approaches to sequence comparison and characterization. These developments were reviewed in late 2013 by J. S. Almeida. [64] A number of web apps, such as https://github.com/usm/usm.github.com/wiki, [65] are available to demonstrate how to encode and compare arbitrary symbolic sequences in a manner that takes full advantage of modern MapReduce distribution developed for cloud computing.

Comparison of alignment based and alignment-free methods

| Alignment-based methods | Alignment-free methods |
| --- | --- |
| These methods assume that homologous regions are contiguous (with gaps) | Does not assume such contiguity of homologous regions |
| Computes all possible pairwise comparisons of sequences; hence computationally expensive | Based on occurrences of sub-sequences; composition; computationally inexpensive, can be memory-intensive |
| Well-established approach in phylogenomics | Relatively recent and application in phylogenomics is limited; needs further testing for robustness and scalability |
| Requires substitution/evolutionary models | Less dependent on substitution/evolutionary models |
| Sensitive to stochastic sequence variation, recombination, horizontal (or lateral) genetic transfer, rate heterogeneity and sequences of varied lengths, especially when similarity lies in the "twilight zone" | Less sensitive to stochastic sequence variation, recombination, horizontal (or lateral) genetic transfer, rate heterogeneity and sequences of varied lengths |
| Best practice uses inference algorithms with complexity at least $O(n^2)$; less time-efficient | Inference algorithms typically $O(n^2)$ or less; more time-efficient |
| Heuristic in nature; statistical significance of how alignment scores relate to homology is difficult to assess | Exact solutions; statistical significance of the sequence distances (and degree of similarity) can be readily assessed |
| Relies on dynamic programming (computationally expensive) to find the alignment that has the optimal score | Side-steps computationally expensive dynamic programming by indexing word counts or positions in fractal space [66] |

Applications of alignment-free methods

List of web servers/software for alignment-free methods

| Name | Description | Availability | Reference |
| --- | --- | --- | --- |
| Protcomp | Most Expressed Features scoring approach | PROTCOMP | [87] |
| kmacs | k-mismatch average common substring approach | kmacs | [36] |
| Spaced words | Spaced-word frequencies | spaced-words | [23] |
| Co-phylog | Assembly-free micro-alignment approach | Co-phylog | [44] |
| Prot-SpaM | Proteome-based spaced-word matches | Prot-SpaM | [48] |
| FSWM | Filtered Spaced-Word Matches | FSWM | [46] |
| FFP | Feature frequency profile based phylogeny | FFP | [17] |
| CVTree | Composition vector based server for phylogeny | CVTree | [88] |
| RTD Phylogeny | Return time distribution based server for phylogeny | RTD Phylogeny | [21] |
| AGP | A multimethods web server for alignment-free genome phylogeny | AGP | [89] |
| Alfy | Alignment-free detection of local similarity among viral and bacterial genomes | Alfy | [8] |
| decaf+py | DistancE Calculation using Alignment-Free methods in PYthon | decaf+py | [90] |
| Dengue Subtyper | Genotyping of Dengue viruses based on RTD | Dengue Subtyper | [21] |
| WNV Typer | Genotyping of West Nile viruses based on RTD | WNV Typer | [77] |
| AllergenFP | Allergenicity prediction by descriptor fingerprints | AllergenFP | [79] |
| kSNP v2 | Alignment-free SNP discovery | kSNP v2 | [80] |
| d2Tools | Comparison of metatranscriptomic samples based on k-tuple frequencies | d2Tools | [91] |
| rush | Recombination detection Using SHustrings | rush | [81] |
| smash | Genomic rearrangements detection and visualisation | smash | [67] |
| Smash++ | Finding and visualizing genomic rearrangements | Smash++ | [68] |
| GScompare | Oligonucleotide-based fast clustering of bacterial genomes | GScompare | |
| COMET | Alignment-free subtyping of HIV-1, HIV-2 and HCV viral sequences | COMET | [78] |
| USM | Fractal MapReduce decomposition of sequence alignment | usm.github.io | [65] |
| FALCON | Alignment-free method to infer metagenomic composition of ancient DNA | FALCON | [73] |
| Kraken | Taxonomic classification using exact k-mer matches | Kraken 2 | [74] |
| EAGLE | An ultra-fast tool to find relative absent words in genomic data | EAGLE2 | [92] |
| CLC | Phylogenetic trees using reference-free k-mer based matching | CLC Microbial Genome Module | [93] |
| xgTaxonomy | A tool for metagenomic classification that uses data compression algorithms to classify genomic sequences | xgTaxonomy | [84] |
| AlcoR | An extremely efficient method for identifying and visualizing low-complexity regions in genomic and proteomic sequences | AlcoR | [86] |
| AltaiR | A C toolkit for alignment-free and temporal analysis of multi-FASTA data | AltaiR | [85] |

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be either of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

<span class="mw-page-title-main">Comparative genomics</span> Field of biological research

Comparative genomics is a branch of biological research that examines genome sequences across a spectrum of species, spanning from humans and mice to a diverse array of organisms from bacteria to chimpanzees. This large-scale holistic approach compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes. Comparison of whole genome sequences provides a highly detailed view of how organisms are related to each other at the gene level. By comparing whole genome sequences, researchers gain insights into genetic relationships between organisms and study evolutionary changes. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, Comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved or common among species, as well as genes that give unique characteristics of each organism. Moreover, these studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

<span class="mw-page-title-main">Smith–Waterman algorithm</span> Algorithm for determining similar regions between two molecular sequences

The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings of nucleic acid sequences or protein sequences. Instead of looking at the entire sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

<span class="mw-page-title-main">Clustal</span> Bioinformatics computer program

Clustal is a computer program used for multiple sequence alignment in bioinformatics. The software and its algorithms have gone through several iterations, with ClustalΩ (Omega) being the latest version as of 2011. It is available as standalone software, via a web interface, and through a server hosted by the European Bioinformatics Institute.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations, insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides.

Ancestral reconstruction is the extrapolation back in time from measured characteristics of individuals, populations, or species to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships among individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago. These states include the genetic sequence, the amino acid sequence of a protein, the composition of a genome, a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species. This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences. In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.

<i>k</i>-mer Substrings of length k contained in a biological sequence

In bioinformatics, k-mers are substrings of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where is number of possible monomers.

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome. Binning methods can be based on either compositional features or alignment (similarity), or both.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.

Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. Spaced seeds are a straightforward modification to the earliest heuristic-based alignment efforts that allows for minor differences between the sequences of interest. Spaced seeds have been used in homology search, alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.
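The zeroes-and-ones representation can be illustrated with a short Python sketch. The function names are hypothetical, and irrelevant positions are masked with an asterisk purely for display, which assumes that `*` does not occur in the sequences themselves. Two windows count as a hit when they agree at every relevant position, so a substitution under a zero does not prevent the match.

```python
def spaced_word(seq, start, seed):
    """Mask the window at `start`: keep characters under '1', show '*' under '0'."""
    window = seq[start:start + len(seed)]
    if len(window) < len(seed):
        raise ValueError("seed does not fit at this position")
    return "".join(c if s == "1" else "*" for c, s in zip(window, seed))


def spaced_words(seq, seed):
    """Masked words for every window of seq, in left-to-right order."""
    return [spaced_word(seq, i, seed)
            for i in range(len(seq) - len(seed) + 1)]


def seed_hits(a, b, seed):
    """Position pairs (i, j) where a and b agree at all relevant seed positions."""
    index = {}
    for j, word in enumerate(spaced_words(b, seed)):
        index.setdefault(word, []).append(j)
    return [(i, j)
            for i, word in enumerate(spaced_words(a, seed))
            for j in index.get(word, [])]


a, b = "ACGTAC", "ACTTAC"          # differ only at the third position
print(seed_hits(a, b, "1101"))     # the spaced seed tolerates the substitution
print(seed_hits(a, b, "1111"))     # a contiguous seed of the same length does not
```

In this toy case the seed `1101` reports a hit at the start of both sequences, while the contiguous seed `1111` reports none.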

Phylogenetic reconciliation: Technique in evolutionary study

In phylogenetics, reconciliation is an approach to connect the history of two or more coevolving biological entities. The general idea of reconciliation is that a phylogenetic tree representing the evolution of an entity can be drawn within another phylogenetic tree representing an encompassing entity to reveal their interdependence and the evolutionary events that have marked their shared history. The development of reconciliation approaches started in the 1980s, mainly to depict the coevolution of a gene and a genome, and of a host and a symbiont, which can be mutualist, commensalist or parasitic. It has also been used, for example, to detect horizontal gene transfer or to understand the dynamics of genome evolution.

References

  1. Vinga S, Almeida J (March 2003). "Alignment-free sequence comparison-a review". Bioinformatics. 19 (4): 513–523. doi:10.1093/bioinformatics/btg005. PMID 12611807.
  2. Rothberg J, Merriman B, Higgs G (September 2012). "Bioinformatics. Introduction". The Yale Journal of Biology and Medicine. 85 (3): 305–308. PMC   3447194 . PMID   23189382.
  3. Batzoglou S (March 2005). "The many faces of sequence alignment". Briefings in Bioinformatics. 6 (1): 6–22. doi: 10.1093/bib/6.1.6 . PMID   15826353.
  4. Mullan L (March 2006). "Pairwise sequence alignment--it's all about us!". Briefings in Bioinformatics. 7 (1): 113–115. doi:10.1093/bib/bbk008. PMID   16761368.
  5. Kemena C, Notredame C (October 2009). "Upcoming challenges for multiple sequence alignment methods in the high-throughput era". Bioinformatics. 25 (19): 2455–2465. doi:10.1093/bioinformatics/btp452. PMC   2752613 . PMID   19648142.
  6. Hide W, Burke J, Davison DB (1994). "Biological evaluation of d2, an algorithm for high-performance sequence comparison". Journal of Computational Biology. 1 (3): 199–215. doi:10.1089/cmb.1994.1.199. PMID   8790465.
  7. Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA (November 1999). "A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base". Genome Research. 9 (11): 1143–1155. doi:10.1101/gr.9.11.1143. PMC   310831 . PMID   10568754.
  8. Domazet-Lošo M, Haubold B (June 2011). "Alignment-free detection of local similarity among viral and bacterial genomes". Bioinformatics. 27 (11): 1466–1472. doi:10.1093/bioinformatics/btr176. PMID 21471011.
  9. Chan CX, Ragan MA (January 2013). "Next-generation phylogenomics". Biology Direct. 8: 3. doi:10.1186/1745-6150-8-3. PMC 3564786. PMID 23339707.
  10. Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (May 2014). "New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing". Briefings in Bioinformatics. 15 (3): 343–353. doi:10.1093/bib/bbt067. PMC   4017329 . PMID   24064230.
  11. Haubold B (May 2014). "Alignment-free phylogenetics and population genetics". Briefings in Bioinformatics. 15 (3): 407–418. doi:10.1093/bib/bbt083. PMID 24291823.
  12. Bonham-Carter O, Steele J, Bastola D (November 2014). "Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis". Briefings in Bioinformatics. 15 (6): 890–905. doi:10.1093/bib/bbt052. PMC   4296134 . PMID   23904502.
  13. Zielezinski A, Vinga S, Almeida J, Karlowski WM (October 2017). "Alignment-free sequence comparison: benefits, applications, and tools". Genome Biology. 18 (1): 186. doi: 10.1186/s13059-017-1319-7 . PMC   5627421 . PMID   28974235.
  14. Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, et al. (March 2019). "Alignment-free inference of hierarchical and reticulate phylogenomic relationships". Briefings in Bioinformatics. 20 (2): 426–435. doi:10.1093/bib/bbx067. PMC 6433738. PMID 28673025.
  15. Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F (July 2018). "Alignment-Free Sequence Analysis and Applications". Annual Review of Biomedical Data Science. 1: 93–114. arXiv: 1803.09727 . Bibcode:2018arXiv180309727R. doi:10.1146/annurev-biodatasci-080917-013431. PMC   6905628 . PMID   31828235.
  16. Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, et al. (July 2019). "Benchmarking of alignment-free sequence comparison methods". Genome Biology. 20 (1): 144. doi: 10.1186/s13059-019-1755-7 . PMC   6659240 . PMID   31345254.
  17. Sims GE, Jun SR, Wu GA, Kim SH (October 2009). "Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions". Proceedings of the National Academy of Sciences of the United States of America. 106 (40): 17077–17082. Bibcode:2009PNAS..10617077S. doi:10.1073/pnas.0909377106. PMC 2761373. PMID 19805074.
  18. Sims GE, Kim SH (May 2011). "Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs)". Proceedings of the National Academy of Sciences of the United States of America. 108 (20): 8329–8334. Bibcode:2011PNAS..108.8329S. doi: 10.1073/pnas.1105168108 . PMC   3100984 . PMID   21536867.
  19. Gao L, Qi J (March 2007). "Whole genome molecular phylogeny of large dsDNA viruses using composition vector method". BMC Evolutionary Biology. 7 (1): 41. Bibcode:2007BMCEE...7...41G. doi: 10.1186/1471-2148-7-41 . PMC   1839080 . PMID   17359548.
  20. Wang H, Xu Z, Gao L, Hao B (August 2009). "A fungal phylogeny based on 82 complete genomes using the composition vector method". BMC Evolutionary Biology. 9 (1): 195. Bibcode:2009BMCEE...9..195W. doi: 10.1186/1471-2148-9-195 . PMC   3087519 . PMID   19664262.
  21. Kolekar P, Kale M, Kulkarni-Kale U (November 2012). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular Phylogenetics and Evolution. 65 (2): 510–522. doi:10.1016/j.ympev.2012.07.003. PMID 22820020.
  22. Hatje K, Kollmar M (2012). "A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method". Frontiers in Plant Science. 3: 192. doi: 10.3389/fpls.2012.00192 . PMC   3429886 . PMID   22952468.
  23. Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (July 2014). "Fast alignment-free sequence comparison using spaced-word frequencies". Bioinformatics. 30 (14): 1991–1999. doi:10.1093/bioinformatics/btu177. PMC 4080745. PMID 24700317.
  24. Apostolico A, Denas O (October 2008). "Fast algorithms for computing sequence distances by exhaustive substring composition". Algorithms for Molecular Biology. 3: 13. doi: 10.1186/1748-7188-3-13 . PMC   2615014 . PMID   18957094.
  25. Apostolico A, Denas O, Dress A (September 2010). "Efficient tools for comparative substring analysis". Journal of Biotechnology. 149 (3): 120–126. doi:10.1016/j.jbiotec.2010.05.006. PMID   20682467.
  26. Jeffrey HJ (April 1990). "Chaos game representation of gene structure". Nucleic Acids Research. 18 (8): 2163–2170. doi:10.1093/nar/18.8.2163. PMC 330698. PMID 2336393.
  27. Wang Y, Hill K, Singh S, Kari L (February 2005). "The spectrum of genomic signatures: from dinucleotides to chaos game representation". Gene. 346: 173–185. doi:10.1016/j.gene.2004.10.021. PMID   15716010.
  28. Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B (October 2016). "rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison". PLOS Computational Biology. 12 (10): e1005107. arXiv: 1511.04001 . Bibcode:2016PLSCB..12E5107H. doi: 10.1371/journal.pcbi.1005107 . PMC   5070788 . PMID   27760124.
  29. Noé L (Feb 14, 2017). "Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds". Algorithms for Molecular Biology. 12 (1): 1. doi: 10.1186/s13015-017-0092-1 . PMC   5310094 . PMID   28289437.
  30. Noé L, Martin DE (December 2014). "A coverage criterion for spaced seeds and its applications to support vector machine string kernels and k-mer distances". Journal of Computational Biology. 21 (12): 947–963. arXiv:1412.2587. Bibcode:2014arXiv1412.2587N. doi:10.1089/cmb.2014.0173. PMC 4253314. PMID 25393923.
  31. Gusfield D (1997). Algorithms on strings, trees, and sequences: computer science and computational biology (Reprinted (with corr.) ed.). Cambridge [u.a.]: Cambridge Univ. Press. ISBN   9780521585194.
  32. Ulitsky I, Burstein D, Tuller T, Chor B (March 2006). "The average common substring approach to phylogenomic reconstruction". Journal of Computational Biology. 13 (2): 336–350. CiteSeerX   10.1.1.106.5122 . doi:10.1089/cmb.2006.13.336. PMID   16597244.
  33. Weiner P (1973). "Linear pattern matching algorithms". 14th Annual Symposium on Switching and Automata Theory (swat 1973). pp. 1–11. CiteSeerX   10.1.1.474.9582 . doi:10.1109/SWAT.1973.13.
  34. He D (2006). "Using suffix tree to discover complex repetitive patterns in DNA sequences". 2006 International Conference of the IEEE Engineering in Medicine and Biology Society. Vol. 1. pp. 3474–7. doi:10.1109/IEMBS.2006.260445. ISBN   978-1-4244-0032-4. PMID   17945779. S2CID   5953866.
  35. Välimäki N, Gerlach W, Dixit K, Mäkinen V (March 2007). "Compressed suffix tree--a basis for genome-scale sequence analysis". Bioinformatics. 23 (5): 629–630. doi: 10.1093/bioinformatics/btl681 . PMID   17237063.
  36. Leimeister CA, Morgenstern B (July 2014). "Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison". Bioinformatics. 30 (14): 2000–2008. doi:10.1093/bioinformatics/btu331. PMC 4080746. PMID 24828656.
  37. Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T (October 2009). "Estimating mutation distances from unaligned genomes". Journal of Computational Biology. 16 (10): 1487–1500. doi:10.1089/cmb.2009.0106. hdl: 11858/00-001M-0000-000F-D624-D . PMID   19803738.
  38. Morgenstern B, Schöbel S, Leimeister CA (2017). "Phylogeny reconstruction based on the length distribution of k-mismatch common substrings". Algorithms for Molecular Biology. 12: 27. doi: 10.1186/s13015-017-0118-8 . PMC   5724348 . PMID   29238399.
  39. Reinert G, Chew D, Sun F, Waterman MS (December 2009). "Alignment-free sequence comparison (I): statistics and power". Journal of Computational Biology. 16 (12): 1615–1634. doi:10.1089/cmb.2009.0198. PMC   2818754 . PMID   20001252.
  40. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (June 2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (1): 132. doi: 10.1186/s13059-016-0997-x . PMC   4915045 . PMID   27323842.
  41. Bromberg R, Grishin NV, Otwinowski Z (June 2016). "Phylogeny Reconstruction with Alignment-Free Method That Corrects for Horizontal Gene Transfer". PLOS Computational Biology. 12 (6): e1004985. Bibcode:2016PLSCB..12E4985B. doi: 10.1371/journal.pcbi.1004985 . PMC   4918981 . PMID   27336403.
  42. Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B (2020). "The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances". PLOS ONE. 15 (2): e0228070. Bibcode:2020PLoSO..1528070R. doi: 10.1371/journal.pone.0228070 . PMC   7010260 . PMID   32040534.
  43. Sarmashghi S, Bohmann K, P Gilbert MT, Bafna V, Mirarab S (February 2019). "Skmer: assembly-free and alignment-free sample identification using genome skims". Genome Biology. 20 (1): 34. doi: 10.1186/s13059-019-1632-4 . PMC   6374904 . PMID   30760303.
  44. Yi H, Jin L (April 2013). "Co-phylog: an assembly-free phylogenomic approach for closely related organisms". Nucleic Acids Research. 41 (7): e75. doi:10.1093/nar/gkt003. PMC 3627563. PMID 23335788.
  45. Haubold B, Klötzl F, Pfaffelhuber P (April 2015). "andi: fast and accurate estimation of evolutionary distances between closely related genomes". Bioinformatics. 31 (8): 1169–1175. doi: 10.1093/bioinformatics/btu815 . PMID   25504847.
  46. Leimeister CA, Sohrabi-Jahromi S, Morgenstern B (April 2017). "Fast and accurate phylogeny reconstruction using filtered spaced-word matches". Bioinformatics. 33 (7): 971–979. doi:10.1093/bioinformatics/btw776. PMC 5409309. PMID 28073754.
  47. Lau AK, Dörrer S, Leimeister CA, Bleidorn C, Morgenstern B (December 2019). "Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage". BMC Bioinformatics. 20 (Suppl 20): 638. doi: 10.1186/s12859-019-3205-7 . PMC   6916211 . PMID   31842735.
  48. Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B (March 2019). "Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences". GigaScience. 8 (3): giy148. doi:10.1093/gigascience/giy148. PMC 6436989. PMID 30535314.
  49. Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B (March 2020). "'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees". NAR Genomics and Bioinformatics. 2 (1): lqz013. doi: 10.1093/nargab/lqz013 . PMC   7671388 . PMID   33575565.
  50. Stamatakis A (November 2006). "RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models". Bioinformatics. 22 (21): 2688–2690. doi: 10.1093/bioinformatics/btl446 . PMID   16928733.
  51. Vinga S (May 2014). "Information theory applications for biological sequence analysis". Briefings in Bioinformatics. 15 (3): 376–389. doi: 10.1093/bib/bbt068 . PMC   7109941 . PMID   24058049.
  52. Liu Z, Meng J, Sun X (April 2008). "A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping". Biochemical and Biophysical Research Communications. 368 (2): 223–230. doi:10.1016/j.bbrc.2008.01.070. PMID   18230342.
  53. Liu ZH, Sun X (2008). "Coronavirus phylogeny based on base-base correlation". International Journal of Bioinformatics Research and Applications. 4 (2): 211–220. doi:10.1504/ijbra.2008.018347. PMID   18490264.
  54. Cheng J, Zeng X, Ren G, Liu Z (March 2013). "CGAP: a new comprehensive platform for the comparative analysis of chloroplast genomes". BMC Bioinformatics. 14: 95. doi: 10.1186/1471-2105-14-95 . PMC   3636126 . PMID   23496817.
  55. Gao Y, Luo L (January 2012). "Genome-based phylogeny of dsDNA viruses by a novel alignment-free method". Gene. 492 (1): 309–314. doi:10.1016/j.gene.2011.11.004. PMID   22100880.
  56. Bennett CH, Gacs P, Li M, Vitanyi P, Zurek W. "Information distance". IEEE Transactions on Information Theory. 44: 1407–1423.
  57. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001). "An information-based sequence distance and its application to whole mitochondrial genome phylogeny". Bioinformatics. 17: 149–154.
  58. Li M, Chen X, Li X, Ma B, Vitanyi PM (2004). "The similarity metric". IEEE Transactions on Information Theory. 50 (12): 3250–3264.
  59. Cilibrasi RL, Vitanyi PM (2005). "Clustering by compression". IEEE Transactions on Information Theory. 51 (4): 1523–1545.
  60. Otu HH, Sayood K (November 2003). "A new sequence distance measure for phylogenetic tree construction". Bioinformatics. 19 (16): 2122–2130. doi: 10.1093/bioinformatics/btg295 . PMID   14594718.
  61. Pinho AJ, Garcia SP, Pratas D, Ferreira PJ (Nov 21, 2013). "DNA sequences at a glance". PLOS ONE. 8 (11): e79922. Bibcode:2013PLoSO...879922P. doi: 10.1371/journal.pone.0079922 . PMC   3836782 . PMID   24278218.
  62. Goldman N (May 1993). "Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences". Nucleic Acids Research. 21 (10): 2487–2491. doi:10.1093/nar/21.10.2487. PMC   309551 . PMID   8506142.
  63. Almeida JS, Carriço JA, Maretzek A, Noble PA, Fletcher M (May 2001). "Analysis of genomic sequences by Chaos Game Representation". Bioinformatics. 17 (5): 429–437. doi: 10.1093/bioinformatics/17.5.429 . PMID   11331237.
  64. Almeida JS (May 2014). "Sequence analysis by iterated maps, a review". Briefings in Bioinformatics. 15 (3): 369–375. doi:10.1093/bib/bbt072. PMC   4017330 . PMID   24162172.
  65. Almeida JS, Grüneberg A, Maass W, Vinga S (May 2012). "Fractal MapReduce decomposition of sequence alignment". Algorithms for Molecular Biology. 7 (1): 12. doi:10.1186/1748-7188-7-12. PMC 3394223. PMID 22551205.
  66. Vinga S, Carvalho AM, Francisco AP, Russo LM, Almeida JS (May 2012). "Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis". Algorithms for Molecular Biology. 7 (1): 10. doi: 10.1186/1748-7188-7-10 . PMC   3402988 . PMID   22551152.
  67. Pratas D, Silva RM, Pinho AJ, Ferreira PJ (May 2015). "An alignment-free method to find and visualise rearrangements between pairs of DNA sequences". Scientific Reports. 5 (10203): 10203. Bibcode:2015NatSR...510203P. doi:10.1038/srep10203. PMC 4434998. PMID 25984837.
  68. Hosseini M, Pratas D, Morgenstern B, Pinho AJ (May 2020). "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements". GigaScience. 9 (5): giaa048. doi:10.1093/gigascience/giaa048. PMC 7238676. PMID 32432328.
  69. Bernard G, Greenfield P, Ragan MA, Chan CX (Nov 20, 2018). "k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank". mSystems. 3 (6): e00257–18. doi:10.1128/mSystems.00257-18. PMC   6247013 . PMID   30505941.
  70. Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F (May 2014). "New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing". Briefings in Bioinformatics. 15 (3): 343–353. doi:10.1093/bib/bbt067. PMC 4017329. PMID 24064230.
  71. Břinda K, Sykulski M, Kucherov G (November 2015). "Spaced seeds improve k-mer-based metagenomic classification". Bioinformatics. 31 (22): 3584–3592. arXiv: 1502.06256 . Bibcode:2015Bioin..31.3584B. doi:10.1093/bioinformatics/btv419. PMID   26209798. S2CID   8626694.
  72. Ounit R, Lonardi S (December 2016). "Higher classification sensitivity of short metagenomic reads with CLARK-S". Bioinformatics. 32 (24): 3823–3825. doi: 10.1093/bioinformatics/btw542 . PMID   27540266.
  73. Pratas D, Pinho AJ, Silva RM, Rodrigues JM, Hosseini M, Caetano T, Ferreira PJ (February 2018). "FALCON: a method to infer metagenomic composition of ancient DNA". bioRxiv 10.1101/267179.
  74. Wood DE, Salzberg SL (March 2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. PMC 4053813. PMID 24580807.
  75. Pinello L, Lo Bosco G, Yuan GC (May 2014). "Applications of alignment-free methods in epigenomics". Briefings in Bioinformatics. 15 (3): 419–430. doi:10.1093/bib/bbt078. PMC   4017331 . PMID   24197932.
  76. La Rosa M, Fiannaca A, Rizzo R, Urso A (2013). "Alignment-free analysis of barcode sequences by means of compression-based methods". BMC Bioinformatics. 14 (Suppl 7): S4. doi: 10.1186/1471-2105-14-S7-S4 . PMC   3633054 . PMID   23815444.
  77. Kolekar P, Hake N, Kale M, Kulkarni-Kale U (March 2014). "WNV Typer: a server for genotyping of West Nile viruses using an alignment-free method based on a return time distribution". Journal of Virological Methods. 198: 41–55. doi:10.1016/j.jviromet.2013.12.012. PMID 24388930.
  78. Struck D, Lawyer G, Ternes AM, Schmit JC, Bercoff DP (October 2014). "COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification". Nucleic Acids Research. 42 (18): e144. doi:10.1093/nar/gku739. PMC 4191385. PMID 25120265.
  79. Dimitrov I, Naneva L, Doytchinova I, Bangov I (March 2014). "AllergenFP: allergenicity prediction by descriptor fingerprints". Bioinformatics. 30 (6): 846–851. doi:10.1093/bioinformatics/btt619. PMID 24167156.
  80. Gardner SN, Hall BG (Dec 9, 2013). "When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes". PLOS ONE. 8 (12): e81760. Bibcode:2013PLoSO...881760G. doi:10.1371/journal.pone.0081760. PMC 3857212. PMID 24349125.
  81. Haubold B, Krause L, Horn T, Pfaffelhuber P (December 2013). "An alignment-free test for recombination". Bioinformatics. 29 (24): 3121–3127. doi:10.1093/bioinformatics/btt550. PMC 5994939. PMID 24064419.
  82. Silva JM, Pratas D, Caetano T, Matos S (August 2022). "The complexity landscape of viral genomes". GigaScience. 11: 1–16. doi:10.1093/gigascience/giac079. PMC   9366995 . PMID   35950839.
  83. Silva JM, Pratas D, Caetano T, Matos S (2022), Pinho AJ, Georgieva P, Teixeira LF, Sánchez JA (eds.), "Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods", Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 13256, Cham: Springer International Publishing, pp. 309–320, doi:10.1007/978-3-031-04881-4_25, ISBN   978-3-031-04880-7 , retrieved 2022-08-31
  84. Silva, Jorge Miguel; Almeida, João Rafael (2024-10-01). "Enhancing metagenomic classification with compression-based features". Artificial Intelligence in Medicine. 156: 102948. doi:10.1016/j.artmed.2024.102948. ISSN 0933-3657. PMID 39173422.
  85. Silva, Jorge M; Pinho, Armando J; Pratas, Diogo (2024). "AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data". GigaScience. 13. doi:10.1093/gigascience/giae086. ISSN 2047-217X. PMC 11590114. PMID 39589438.
  86. Silva JM, Qi W, Pinho AJ, Pratas D (December 2022). "AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data". GigaScience. 12. doi:10.1093/gigascience/giad101. PMC 10716826. PMID 38091509.
  87. Di Biasi L, Piotto S. ARISE: Artificial Intelligence Semantic Search Engine. WIVACE2021.
  88. Xu Z, Hao B (July 2009). "CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes". Nucleic Acids Research. 37 (Web Server issue): W174–W178. doi:10.1093/nar/gkp278. PMC   2703908 . PMID   19398429.
  89. Cheng J, Cao F, Liu Z (May 2013). "AGP: a multimethods web server for alignment-free genome phylogeny". Molecular Biology and Evolution. 30 (5): 1032–1037. doi: 10.1093/molbev/mst021 . PMC   7574599 . PMID   23389766.
  90. Höhl M, Rigoutsos I, Ragan MA (February 2007). "Pattern-based phylogenetic distance estimation and tree reconstruction". Evolutionary Bioinformatics Online. 2: 359–375. arXiv: q-bio/0605002 . Bibcode:2006q.bio.....5002H. PMC   2674673 . PMID   19455227.
  91. Wang Y, Liu L, Chen L, Chen T, Sun F (Jan 2, 2014). "Comparison of metatranscriptomic samples based on k-tuple frequencies". PLOS ONE. 9 (1): e84348. Bibcode:2014PLoSO...984348W. doi: 10.1371/journal.pone.0084348 . PMC   3879298 . PMID   24392128.
  92. Pratas D, Silva JM (January 2021). "Persistent minimal sequences of SARS-CoV-2". Bioinformatics. 36 (21): 5129–5132. doi: 10.1093/bioinformatics/btaa686 . PMC   7559010 . PMID   32730589.
  93. "CLC Microbial Genomics Module". QIAGEN Bioinformatics. 2019.