In bioinformatics, k-mers are substrings of length contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are capitalized upon to assemble DNA sequences, [1] improve heterologous gene expression, [2] [3] identify species in metagenomic samples, [4] and create attenuated vaccines. [5] Usually, the term k-mer refers to all of a sequence's subsequences of length , such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length will have k-mers and total possible k-mers, where is number of possible monomers (e.g. four in the case of DNA).
k-mers are simply length subsequences. For example, all the possible k-mers of a DNA sequence are shown below:
k | k-mers |
---|---|
1 | G, T, A, C |
2 | GT, TA, AG, GA, AG, GC, CT, TG |
3 | GTA, TAG, AGA, GAG, AGC, GCT, CTG, TGT |
4 | GTAG, TAGA, AGAG, GAGC, AGCT, GCTG, CTGT |
5 | GTAGA, TAGAG, AGAGC, GAGCT, AGCTG, GCTGT |
6 | GTAGAG, TAGAGC, AGAGCT, GAGCTG, AGCTGT |
7 | GTAGAGC, TAGAGCT, AGAGCTG, GAGCTGT |
8 | GTAGAGCT, TAGAGCTG, AGAGCTGT |
9 | GTAGAGCTG, TAGAGCTGT |
10 | GTAGAGCTGT |
A method of visualizing k-mers, the k-mer spectrum, shows the multiplicity of each k-mer in a sequence versus the number of k-mers with that multiplicity. [6] The number of modes in a k-mer spectrum for a species's genome varies, with most species having a unimodal distribution. [7] However, all mammals have a multimodal distribution. The number of modes within a k-mer spectrum can vary between regions of genomes as well: humans have unimodal k-mer spectra in 5' UTRs and exons but multimodal spectra in 3' UTRs and introns.
The frequency of k-mer usage is affected by numerous forces, working at multiple levels, which are often in conflict. It is important to note that k-mers for higher values of k are affected by the forces affecting lower values of k as well. For example, if the 1-mer A does not occur in a sequence, none of the 2-mers containing A (AA, AT, AG, and AC) will occur either, thereby linking the effects of the different forces.
When k = 1, there are four DNA k-mers, i.e., A, T, G, and C. At the molecular level, there are three hydrogen bonds between G and C, whereas there are only two between A and T. GC bonds, as a result of the extra hydrogen bond (and stronger stacking interactions), are more thermally stable than AT bonds. [8] Mammals and birds have a higher ratio of Gs and Cs to As and Ts (GC-content), which led to the hypothesis that thermal stability was a driving factor of GC-content variation. [9] However, while promising, this hypothesis did not hold up under scrutiny: analysis among a variety of prokaryotes showed no evidence of GC-content correlating with temperature as the thermal adaptation hypothesis would predict. [10] Indeed, if natural selection were to be the driving force behind GC-content variation, that would require that single nucleotide changes, which are often silent, to alter the fitness of an organism. [11]
Rather, current evidence suggests that GC‐biased gene conversion (gBGC) is a driving factor behind variation in GC content. [11] gBGC is a process that occurs during recombination which replaces As and Ts with Gs and Cs. [12] This process, though distinct from natural selection, can nevertheless exert selective pressure on DNA biased towards GC replacements being fixed in the genome. gBGC can therefore be seen as an "impostor" of natural selection. As would be expected, GC content is greater at sites experiencing greater recombination. [13] Furthermore, organisms with higher rates of recombination exhibit higher GC content, in keeping with the gBGC hypothesis's predicted effects. [14] Interestingly, gBGC does not appear to be limited to eukaryotes. [15] Asexual organisms such as bacteria and archaea also experience recombination by means of gene conversion, a process of homologous sequence replacement resulting in multiple identical sequences throughout the genome. [16] That recombination is able to drive up GC content in all domains of life suggests that gBGC is universally conserved. Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined. The exact mechanism and evolutionary advantage or disadvantage of gBGC is currently unknown. [17]
Despite the comparatively large body of literature discussing GC-content biases, relatively little has been written about dinucleotide biases. What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably. [18] This is an important insight that must not be overlooked. If dinucleotide bias were subject to pressures resulting from translation, then there would be differing patterns of dinucleotide bias in coding and noncoding regions driven by some dinucelotides' reduced translational efficiency. [19] Because there is not, it can therefore be inferred that the forces modulating dinucleotide bias are independent of translation. Further evidence against translational pressures affecting dinucleotide bias is the fact that the dinucleotide biases of viruses, which rely heavily on translational efficiency, are shaped by their viral family more than by their hosts, whose translational machinery the viruses hijack. [20]
Counter to gBGC's increasing GC-content is CG suppression, which reduces the frequency of CG 2-mers due to deamination of methylated CG dinucleotides, resulting in substitutions of CGs with TGs, thereby reducing the GC-content. [21] This interaction highlights the interrelationship between the forces affecting k-mers for varying values of k.
One interesting fact about dinucleotide bias is that it can serve as a "distance" measurement between phylogenetically similar genomes. The genomes of pairs of organisms that are closely related share more similar dinucleotide biases than between pairs of more distantly related organisms. [18]
There are twenty natural amino acids that are used to build the proteins that DNA encodes. However, there are only four nucleotides. Therefore, there cannot be a one-to-one correspondence between nucleotides and amino acids. Similarly, there are 16 2-mers, which is also not enough to unambiguously represent every amino acid. However, there are 64 distinct 3-mers in DNA, which is enough to uniquely represent each amino acid. These non-overlapping 3-mers are called codons. While each codon only maps to one amino acid, each amino acid can be represented by multiple codons. Thus, the same amino acid sequence can have multiple DNA representations. Interestingly, each codon for an amino acid is not used in equal proportions. [22] This is called codon-usage bias (CUB). When k = 3, a distinction must be made between true 3-mer frequency and CUB. For example, the sequence ATGGCA has four 3-mer words within it (ATG, TGG, GGC, and GCA) while only containing two codons (ATG and GCA). However, CUB is a major driving factor of 3-mer usage bias (accounting for up to ⅓ of it, since ⅓ of the k-mers in a coding region are codons) and will be the main focus of this section.
The exact cause of variation between the frequencies of various codons is not fully understood. It is known that codon preference is correlated with tRNA abundances, with codons matching more abundant tRNAs being correspondingly more frequent [22] and that more highly expressed proteins exhibit greater CUB. [23] This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation.
Similar to the effect seen in dinucleotide bias, the tetranucleotide biases of phylogenetically similar organisms are more similar than between less closely related organisms. [4] The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level. [24]
The frequency of a set of k-mers in a species's genome, in a genomic region, or in a class of sequences can be used as a "signature" of the underlying sequence. Comparing these frequencies is computationally easier than sequence alignment and is an important method in alignment-free sequence analysis. It can also be used as a first stage analysis before an alignment.
In sequence assembly, k-mers are used during the construction of De Bruijn graphs. [25] [26] In order to create a De Bruijn Graph, the k-mers stored in each edge with length must overlap another string in another edge by in order to create a vertex. Reads generated from next-generation sequencing will typically have different read lengths being generated. For example, reads by Illumina's sequencing technology capture reads of 100-mers. However, the problem with the sequencing is that only small fractions out of all the possible 100-mers that are present in the genome are actually generated. This is due to read errors, but more importantly, just simple coverage holes that occur during sequencing. The problem is that these small fractions of the possible k-mers violate the key assumption of De Bruijn graphs that all the k-mer reads must overlap its adjoining k-mer in the genome by (which cannot occur when all the possible k-mers are not present).
The solution to this problem is to break these k-mer sized reads into smaller k-mers, such that the resulting smaller k-mers will represent all the possible k-mers of that smaller size that are present in the genome. [27] Furthermore, splitting the k-mers into smaller sizes also helps alleviate the problem of different initial read lengths. In this example, the five reads do not account for all the possible 7-mers of the genome, and as such, a De Bruijn graph cannot be created. But, when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a De Bruijn graph.
Beyond being used directly for sequence assembly, k-mers can also be used to detect genome mis-assembly by identifying k-mers that are overrepresented which suggest the presence of repeated DNA sequences that have been combined. [28] In addition, k-mers are also used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of metagenomics. [29] [30]
The choice of the k-mer size has many different effects on the sequence assembly. These effects vary greatly between lower sized and larger sized k-mers. Therefore, an understanding of the different k-mer sizes must be achieved in order to choose a suitable size that balances the effects. The effects of the sizes are outlined below.
With respect to disease, dinucleotide bias has been applied to the detection of genetic islands associated with pathogenicity. [11] Prior work has also shown that tetranucleotide biases are able to effectively detect horizontal gene transfer in both prokaryotes [32] and eukaryotes. [33]
Another application of k-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of Erwinia with moderate success. [34] Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA. Because GC bonds are more thermally stable, sequences with higher GC content exhibit a higher Tm. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as factor in determining species boundaries as part of the phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community. [35]
Other applications within genetics and genomics include:
k-mer frequency and spectrum variation is heavily used in metagenomics for both analysis [47] [48] and binning. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism (or operational taxonomic unit), which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (k = 4) frequencies. [49] Other tools that similarly rely on k-mer frequency for metagenomic binning are CompostBin (k = 6), [50] PCAHIER, [51] PhyloPythia (5 ≤ k ≤ 6), [52] CLARK (k ≥ 20), [53] and TACOA (2 ≤ k ≤ 6). [54] Recent developments have also applied deep learning to metagenomic binning using k-mers. [55]
Other applications within metagenomics include:
Modifying k-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.
With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used yield higher rates of protein synthesis. [61] In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates. [2] [3] Similarly, codon pair optimization, a combination of dinucelotide and codon optimization, has also been successfully used to increase expression. [62]
The most studied application of k-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode dengue virus, the virus that causes dengue fever, such that its codon-pair bias was more different to mammalian codon-usage preference than the wild type. [63] Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened pathogenicity while eliciting a strong immune response. This approach has also been used effectively to create an influenza vaccine [64] as well a vaccine for Marek's disease herpesvirus (MDV). [65] Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.
Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias. [66] [67] By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.
GC-content, due to its effect on DNA melting point, is used to predict annealing temperature in PCR, another important biotechnology tool.
Determining the possible k-mers of a read can be done by simply cycling over the string length by one and taking out each substring of length . The pseudocode to achieve this is as follows:
procedure k-mers(string seq, integer k) is L ← length(seq) arr ← new array of L − k + 1 empty strings // iterate over the number of k-mers in seq,// storing the nth k-mer in the output arrayfor n ← 0 to L − k + 1 exclusive do arr[n] ← subsequence of seq from letter n inclusive to letter n + k exclusive return arr
Because the number of k-mers grows exponentially for values of k, counting k-mers for large values of k (usually >10) is a computationally difficult task. While simple implementations such as the above pseudocode work for small values of k, they need to be adapted for high-throughput applications or when k is large. To solve this problem, various tools have been developed:
In molecular biology, a stop codon is a codon that signals the termination of the translation process of the current protein. Most codons in messenger RNA correspond to the addition of an amino acid to a growing polypeptide chain, which may ultimately become a protein; stop codons signal the termination of this process by binding release factors, which cause the ribosomal subunits to disassociate, releasing the amino acid chain.
Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.
Molecular evolution describes how inherited DNA and/or RNA change over evolutionary time, and the consequences of this for proteins and other components of cells and organisms. Molecular evolution is the basis of phylogenetic approaches to describing the tree of life. Molecular evolution overlaps with population genetics, especially on shorter timescales. Topics in molecular evolution include the origins of new genes, the genetic nature of complex traits, the genetic basis of adaptation and speciation, the evolution of development, and patterns and processes underlying genomic changes during evolution.
The CpG sites or CG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites occur with high frequency in genomic regions called CpG islands.
Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides that encodes a specific amino acid residue in a polypeptide chain or for the termination of translation.
Chargaff's rules state that in the DNA of any species and any organism, the amount of guanine should be equal to the amount of cytosine and the amount of adenine should be equal to the amount of thymine. Further, a 1:1 stoichiometric ratio of purine and pyrimidine bases should exist. This pattern is found in both strands of the DNA. They were discovered by Austrian-born chemist Erwin Chargaff in the late 1940s.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In molecular biology and genetics, GC-content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). This measure indicates the proportion of G and C bases out of an implied four total bases, also including adenine and thymine in DNA and adenine and uracil in RNA.
Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.
Gene conversion is the process by which one DNA sequence replaces a homologous sequence such that the sequences become identical after the conversion event. Gene conversion can be either allelic, meaning that one allele of the same gene replaces another allele, or ectopic, meaning that one paralogous DNA sequence converts another.
Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.
De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genome. Binning methods can be based on either compositional features or alignment (similarity), or both.
In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.
SPAdes is a genome assembly algorithm which was designed for single cell and multi-cells bacterial data sets. Therefore, it might not be suitable for large genomes projects.
Viral metagenomics uses metagenomic technologies to detect viral genomic material from diverse environmental and clinical samples. Viruses are the most abundant biological entity and are extremely diverse; however, only a small fraction of viruses have been sequenced and only an even smaller fraction have been isolated and cultured. Sequencing viruses can be challenging because viruses lack a universally conserved marker gene so gene-based approaches are limited. Metagenomics can be used to study and analyze unculturable viruses and has been an important tool in understanding viral diversity and abundance and in the discovery of novel viruses. For example, metagenomics methods have been used to describe viruses associated with cancerous tumors and in terrestrial ecosystems.
Horizontal or lateral gene transfer is the transmission of portions of genomic DNA between organisms through a process decoupled from vertical inheritance. In the presence of HGT events, different fragments of the genome are the result of different evolutionary histories. This can therefore complicate investigations of the evolutionary relatedness of lineages and species. Also, as HGT can bring into genomes radically different genotypes from distant lineages, or even new genes bearing new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation. For example, of particular relevance to human health is the lateral transfer of antibiotic resistance and pathogenicity determinants, leading to the emergence of pathogenic lineages.
De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.
{{cite journal}}
: CS1 maint: DOI inactive as of July 2024 (link){{cite journal}}
: CS1 maint: multiple names: authors list (link)