Dot plot (bioinformatics)

Last updated
A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. Zinc-finger-dot-plot.png
A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence.

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment. It is a type of recurrence plot.

Contents

History

One way to visualize the similarity between two protein or nucleic acid sequences is to use a similarity matrix, known as a dot plot. These were introduced by Gibbs and McIntyre in 1970 [1] and are two-dimensional matrices that have the sequences of the proteins being compared along the vertical and horizontal axes. For a simple visual representation of the similarity between two sequences, individual cells in the matrix can be shaded black if residues are identical, so that matching sequence segments appear as runs of diagonal lines across the matrix.

Interpretation

Some idea of the similarity of the two sequences can be gleaned from the number and length of matching segments shown in the matrix. Identical proteins will obviously have a diagonal line in the center of the matrix. Insertions and deletions between sequences give rise to disruptions in this diagonal. Regions of local similarity or repetitive sequences give rise to further diagonal matches in addition to the central diagonal. One way of reducing this noise is to only shade runs or 'tuples' of residues, e.g. a tuple of 3 corresponds to three residues in a row. This is effective because the probability of matching three residues in a row by chance is much lower than single-residue matches.

Dot plots compare two sequences by organizing one sequence on the x-axis, and another on the y-axis, of a plot. When the residues of both sequences match at the same location on the plot, a dot is drawn at the corresponding position. Note, that the sequences can be written backwards or forwards, however the sequences on both axes must be written in the same direction. Also note, that the direction of the sequences on the axes will determine the direction of the line on the dot plot. Once the dots have been plotted, they will combine to form lines. The closeness of the sequences in similarity will determine how close the diagonal line is to what a graph showing a curve demonstrating a direct relationship is. This relationship is affected by certain sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts include insertions, deletions, and mutations. The presence of one of these features, or the presence of multiple features, will cause for multiple lines to be plotted in a various possibility of configurations, depending on the features present in the sequences. A feature that will cause a very different result on the dot plot is the presence of low-complexity region/regions. Low-complexity regions are regions in the sequence with only a few amino acids, which in turn, causes redundancy within that small or limited region. These regions are typically found around the diagonal, and may or may not have a square in the middle of the dot plot.

Software to create dot plots

In addition to the tools listed above, the NCBI Blast Server at https://blast.ncbi.nlm.nih.gov/Blast.cgi includes Dot Plots in its output.

See also

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be either of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.

<span class="mw-page-title-main">Protein family</span> Group of evolutionarily-related proteins

A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.

<span class="mw-page-title-main">Structural alignment</span> Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

BioJava is an open-source software project dedicated to provide Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as to parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

A Gap penalty is a method of scoring alignments of two or more sequences. When aligning sequences, introducing gaps in the sequences can allow an alignment algorithm to match more terms than a gap-less alignment can. However, minimizing gaps in an alignment is important to create a useful alignment. Too many gaps can cause an alignment to become meaningless. Gap penalties are used to adjust alignment scores based on the number and length of gaps. The five main types of gap penalties are constant, linear, affine, convex, and profile-based.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

In molecular biology, open reading frames (ORFs) are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open". Such an ORF may contain a start codon and by definition cannot extend beyond a stop codon. That start codon indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

<span class="mw-page-title-main">Protein contact map</span>

A protein contact map represents the distance between all possible amino acid residue pairs of a three-dimensional protein structure using a binary two-dimensional matrix. For two residues and , the element of the matrix is 1 if the two residues are closer than a predetermined threshold, and 0 otherwise. Various contact definitions have been proposed: The distance between the Cα-Cα atom with threshold 6-12 Å; distance between Cβ-Cβ atoms with threshold 6-12 Å ; and distance between the side-chain centers of mass.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

The Viral Bioinformatics Resource Center (VBRC) is an online resource providing access to a database of curated viral genomes and a variety of tools for bioinformatic genome analysis. This resource was one of eight BRCs funded by NIAID with the goal of promoting research against emerging and re-emerging pathogens, particularly those seen as potential bioterrorism threats. The VBRC is now supported by Dr. Chris Upton at the University of Victoria.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

TRANSFAC is a manually curated database of eukaryotic transcription factors, their genomic binding sites and DNA binding profiles. The contents of the database can be used to predict potential transcription factor binding sites.

References

  1. Gibbs, Adrian J.; McIntyre, George A. (1970). "The Diagram, a Method for Comparing Sequences. Its Use with Amino Acid and Nucleotide Sequences". Eur. J. Biochem. 16 (1): 1–11. doi: 10.1111/j.1432-1033.1970.tb01046.x . PMID   5456129.
  2. Klopp, Christophe; Cabanettes, Floréal (2018-02-23). "D-GENIES : Dot plot large GENomes in an interactive, efficient and simple way". PeerJ. 6: e4958. doi: 10.7287/peerj.preprints.26567v1 . PMC   5991294 . PMID   29888139.
  3. Rice, P.; Longden, I.; Bleasby, A. (June 2000). "EMBOSS: the European Molecular Biology Open Software Suite". Trends in Genetics. 16 (6): 276–277. doi:10.1016/s0168-9525(00)02024-2. ISSN   0168-9525. PMID   10827456.
  4. Sonnhammer, E. L.; Durbin, R. (1995-12-29). "A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis". Gene. 167 (1–2): GC1–10. doi:10.1016/0378-1119(95)00714-8. ISSN   0378-1119. PMID   8566757.
  5. Brodie, Ryan; Roper, Rachel L.; Upton, Chris (2004-01-22). "JDotter: a Java interface to multiple dotplots generated by dotter". Bioinformatics. 20 (2): 279–281. doi: 10.1093/bioinformatics/btg406 . ISSN   1367-4803. PMID   14734323.
  6. Seibt, Kathrin M.; Schmidt, Thomas; Heitkam, Tony (2018-10-15). "FlexiDot: Highly customizable, ambiguity-aware dotplots for visual sequence analyses". Bioinformatics. 34 (20): 3575–3577. doi: 10.1093/bioinformatics/bty395 . PMID   29762645.
  7. Krumsiek, Jan; Arnold, Roland; Rattei, Thomas (2007-04-15). "Gepard: a rapid and sensitive tool for creating dotplots on genome scale". Bioinformatics. 23 (8): 1026–1028. doi: 10.1093/bioinformatics/btm039 . ISSN   1367-4803. PMID   17309896.
  8. Frith MC. and Kawaguchi R. (2015). "Split-alignment of genomes finds orthologies more accurately". Genome Biol. 16 (1): 106. doi: 10.1186/s13059-015-0670-9 . PMC   4464727 . PMID   25994148.
  9. Harris, R. S. (2007). Improved pairwise alignment of genomic DNA. Ph.D. thesis. Pennsylvania: The Pennsylvania State University.
  10. Noe L., Kucherov. G. (2005). "YASS: enhancing the sensitivity of DNA similarity search". Nucleic Acids Research. 33 (2): W540–W543. doi:10.1093/nar/gki478. PMC   1160238 . PMID   15980530.