Sequence analysis

Last updated
Not to be confused with sequential analysis, sequence analysis of synthetic polymers, or sequence analysis in social sciences.

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others. [1]

Contents

Since the development of methods of high-throughput production of gene and protein sequences, the rate of addition of new sequences to the databases increased very rapidly. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of an organism from which the new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by the study of the similarities between the compared sequences. Nowadays, there are many tools and techniques that provide the sequence comparisons (sequence alignment) and analyze the alignment product to understand its biology.

Sequence analysis in molecular biology includes a very wide range of relevant topics:

  1. The comparison of sequences in order to find similarity, often to infer if they are related (homologous)
  2. Identification of intrinsic features of the sequence such as active sites, post translational modification sites, gene-structures, reading frames, distributions of introns and exons and regulatory elements
  3. Identification of sequence differences and variations such as point mutations and single nucleotide polymorphism (SNP) in order to get the genetic marker.
  4. Revealing the evolution and genetic diversity of sequences and organisms
  5. Identification of molecular structure from sequence alone.

History

Since the very first sequences of the insulin protein were characterized by Fred Sanger in 1951, biologists have been trying to use this knowledge to understand the function of molecules. [2] [3] He and his colleagues' discoveries contributed to the successful sequencing of the first DNA-based genome. [4] The method used in this study, which is called the “Sanger method” or Sanger sequencing, was a milestone in sequencing long strand molecules such as DNA. This method was eventually used in the human genome project. [5] According to Michael Levitt, sequence analysis was born in the period from 1969 to 1977. [6] In 1969 the analysis of sequences of transfer RNAs was used to infer residue interactions from correlated changes in the nucleotide sequences, giving rise to a model of the tRNA secondary structure. [7] In 1970, Saul B. Needleman and Christian D. Wunsch published the first computer algorithm for aligning two sequences. [8] Over this time, developments in obtaining nucleotide sequence improved greatly, leading to the publication of the first complete genome of a bacteriophage in 1977. [9] Robert Holley and his team in Cornell University were believed to be the first to sequence an RNA molecule. [10]

Sequence alignment

Example multiple sequence alignment WPP domain alignment.PNG
Example multiple sequence alignment

There are millions of protein and nucleotide sequences known. These sequences fall into many groups of related sequences known as protein families or gene families. Relationships between these sequences are usually discovered by aligning them together and assigning this alignment a score. There are two main types of sequence alignment. Pair-wise sequence alignment only compares two sequences at a time and multiple sequence alignment compares many sequences. Two important algorithms for aligning pairs of sequences are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. Popular tools for sequence alignment include:

A common use for pairwise sequence alignment is to take a sequence of interest and compare it to all known sequences in a database to identify homologous sequences. In general, the matches in the database are ordered to show the most closely related sequences first, followed by sequences with diminishing similarity. These matches are usually reported with a measure of statistical significance such as an Expectation value.

Profile comparison

In 1987, Michael Gribskov, Andrew McLachlan, and David Eisenberg introduced the method of profile comparison for identifying distant similarities between proteins. [11] Rather than using a single sequence, profile methods use a multiple sequence alignment to encode a profile which contains information about the conservation level of each residue. These profiles can then be used to search collections of sequences to find sequences that are related. Profiles are also known as Position Specific Scoring Matrices (PSSMs). In 1993, a probabilistic interpretation of profiles was introduced by Anders Krogh and colleagues using hidden Markov models. [12] [13] These models have become known as profile-HMMs.

In recent years,[ when? ] methods have been developed that allow the comparison of profiles directly to each other. These are known as profile-profile comparison methods. [14]

Sequence assembly

Sequence assembly refers to the reconstruction of a DNA sequence by aligning and merging small DNA fragments. It is an integral part of modern DNA sequencing. Since presently-available DNA sequencing technologies are ill-suited for reading long sequences, large pieces of DNA (such as genomes) are often sequenced by (1) cutting the DNA into small pieces, (2) reading the small fragments, and (3) reconstituting the original DNA by merging the information on various fragments.

Recently, sequencing multiple species at one time is one of the top research objectives. Metagenomics is the study of microbial communities directly obtained from the environment. Different from cultured microorganisms from the lab, the wild sample usually contains dozens, sometimes even thousands of types of microorganisms from their original habitats. [15] Recovering the original genomes can prove to be very challenging.

Gene prediction

Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include the prediction of other functional elements such as regulatory regions. Geri is one of the first and most important steps in understanding the genome of a species once it has been sequenced. In general, the prediction of bacterial genes is significantly simpler and more accurate than the prediction of genes in eukaryotic species that usually have complex intron/exon patterns. Identifying genes in long sequences remains a problem, especially when the number of genes is unknown. Hidden markov models can be part of the solution. [16] Machine learning has played a significant role in predicting the sequence of transcription factors. [17] Traditional sequencing analysis focused on the statistical parameters of the nucleotide sequence itself (The most common programs used are listed in Table 4.1). Another method is to identify homologous sequences based on other known gene sequences (Tools see Table 4.3). [18] The two methods described here are focused on the sequence. However, the shape feature of these molecules such as DNA and protein have also been studied and proposed to have an equivalent, if not higher, influence on the behaviors of these molecules. [19]

Protein structure prediction

Target protein structure (3dsm, shown in ribbons), with Calpha backbones (in gray) of 354 predicted models for it submitted in the CASP8 structure-prediction experiment. Target3dsmRib 354predictedModels CASP8.jpg
Target protein structure (3dsm, shown in ribbons), with Calpha backbones (in gray) of 354 predicted models for it submitted in the CASP8 structure-prediction experiment.

The 3D structures of molecules are of major importance to their functions in nature. Since structural prediction of large molecules at an atomic level is a largely intractable problem, some biologists introduced ways to predict 3D structure at a primary sequence level. This includes the biochemical or statistical analysis of amino acid residues in local regions and structural the inference from homologs (or other potentially related proteins) with known 3D structures.

There have been a large number of diverse approaches to solve the structure prediction problem. In order to determine which methods were most effective, a structure prediction competition was founded called CASP (Critical Assessment of Structure Prediction). [20]

Methodology

The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include:

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

In genetics and biochemistry, sequencing means to determine the primary structure of an unbranched biopolymer. Sequencing results in a symbolic linear depiction known as a sequence which succinctly summarizes much of the atomic-level structure of the sequenced molecule.

<span class="mw-page-title-main">Genomics</span> Discipline in genetics

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

<span class="mw-page-title-main">Nucleic acid sequence</span> Succession of nucleotides in a nucleic acid

A nucleic acid sequence is a succession of bases within the nucleotides forming alleles within a DNA or RNA (GACU) molecule. This succession is denoted by a series of a set of five different letters that indicate the order of the nucleotides. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, with its double helix, there are two possible directions for the notated sequence; of these two, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.

In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

<span class="mw-page-title-main">Functional genomics</span> Field of molecular biology

Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.

<span class="mw-page-title-main">DNA sequencing</span> Process of determining the nucleic acid sequence

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

<span class="mw-page-title-main">Conserved sequence</span> Similar DNA, RNA or protein sequences within genomes or among species

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling.

Anders Krogh is a bioinformatician at the University of Copenhagen, where he leads the university's bioinformatics center. He is known for his pioneering work on the use of hidden Markov models in bioinformatics, and is co-author of a widely used textbook in bioinformatics. In addition, he also co-authored one of the early textbooks on neural networks. His current research interests include promoter analysis, non-coding RNA, gene prediction and protein structure prediction.

<span class="mw-page-title-main">HMMER</span> Software package for sequence analysis

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and macOS.

<span class="mw-page-title-main">Richard M. Durbin</span> British computational biologist

Richard Michael Durbin is a British computational biologist and Al-Kindi Professor of Genetics at the University of Cambridge. He also serves as an associate faculty member at the Wellcome Sanger Institute where he was previously a senior group leader.

<span class="mw-page-title-main">Nucleic acid secondary structure</span>

Nucleic acid secondary structure is the basepairing interactions within a single nucleic acid polymer or between two polymers. It can be represented as a list of bases which are paired in a nucleic acid molecule. The secondary structures of biological DNAs and RNAs tend to be different: biological DNA mostly exists as fully base paired double helices, while biological RNA is single stranded and often forms complex and intricate base-pairing interactions due to its increased ability to form hydrogen bonds stemming from the extra hydroxyl group in the ribose sugar.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Non-coding RNAs have been discovered using both experimental and bioinformatic approaches. Bioinformatic approaches can be divided into three main categories. The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs. The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties. Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.

References

  1. Durbin, Richard M.; Eddy, Sean R.; Krogh, Anders; Mitchison, Graeme (1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1st ed.), Cambridge, New York: Cambridge University Press, ISBN   0-521-62971-3, OCLC   593254083
  2. Sanger F; Tuppy H (September 1951). "The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates". Biochem. J. 49 (4): 463–81. doi:10.1042/bj0490463. PMC   1197535 . PMID   14886310.
  3. SANGER F; TUPPY H (September 1951). "The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates". Biochem. J. 49 (4): 481–90. doi:10.1042/bj0490481. PMC   1197536 . PMID   14886311.
  4. Sanger, F; Nicklen, S; Coulson, AR (December 1977). "DNA sequencing with chain-terminating inhibitors". Proc Natl Acad Sci U S A. 74 (12): 441–448. Bibcode:1977PNAS...74.5463S. doi: 10.1073/pnas.74.12.5463 . PMC   431765 . PMID   271968.
  5. Sanger, F; Air, GM; Barrell, BG; Brown, NL; Coulson, AR; Fiddes, CA; Hutchison, CA; Slocombe, PM; Smith, M (February 1977). "Nucleotide sequence of bacteriophage phi X174 DNA". Nature. 265 (5596): 687–695. Bibcode:1977Natur.265..687S. doi:10.1038/265687a0. PMID   870828. S2CID   4206886.
  6. Levitt M (May 2001). "The birth of computational structural biology". Nature Structural & Molecular Biology. 8 (5): 392–3. doi:10.1038/87545. PMID   11323711. S2CID   6519868.
  7. Levitt M (November 1969). "Detailed molecular model for transfer ribonucleic acid". Nature. 224 (5221): 759–63. Bibcode:1969Natur.224..759L. doi:10.1038/224759a0. PMID   5361649. S2CID   983981.
  8. Needleman SB; Wunsch CD (March 1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". J. Mol. Biol. 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4. PMID   5420325.
  9. Sanger F, Air GM, Barrell BG, et al. (February 1977). "Nucleotide sequence of bacteriophage phi X174 DNA". Nature. 265 (5596): 687–95. Bibcode:1977Natur.265..687S. doi:10.1038/265687a0. PMID   870828. S2CID   4206886.
  10. Holley, RW; Apgar, J; Everett, GA; Madison, JT; Marquisee, M; Merrill, SH; Penswick, JR; Zamir, A (May 1965). "Structure of a Ribonucleic Acid". Science. 147 (3664): 1462–1465. Bibcode:1965Sci...147.1462H. doi:10.1126/science.147.3664.1462. PMID   14263761. S2CID   40989800.
  11. Gribskov M; McLachlan AD; Eisenberg D (July 1987). "Profile analysis: detection of distantly related proteins". Proc. Natl. Acad. Sci. U.S.A. 84 (13): 4355–8. Bibcode:1987PNAS...84.4355G. doi: 10.1073/pnas.84.13.4355 . PMC   305087 . PMID   3474607.
  12. Brown M; Hughey R; Krogh A; Mian IS; Sjölander K; Haussler D (1993). "Using Dirichlet mixture priors to derive hidden Markov models for protein families". Proc Int Conf Intell Syst Mol Biol. 1: 47–55. PMID   7584370.
  13. Krogh A; Brown M; Mian IS; Sjölander K; Haussler D (February 1994). "Hidden Markov models in computational biology. Applications to protein modeling". J. Mol. Biol. 235 (5): 1501–31. doi:10.1006/jmbi.1994.1104. PMID   8107089. S2CID   2160404.
  14. Ye X; Wang G; Altschul SF (December 2011). "An assessment of substitution scores for protein profile-profile comparison". Bioinformatics. 27 (24): 3356–63. doi:10.1093/bioinformatics/btr565. PMC   3232366 . PMID   21998158.
  15. Wooley, JC; Godzik, A; Friedberg, I (Feb 26, 2010). "A primer on metagenomics". PLOS Comput Biol. 6 (2): e1000667. Bibcode:2010PLSCB...6E0667W. doi: 10.1371/journal.pcbi.1000667 . PMC   2829047 . PMID   20195499.
  16. Stanke, M; Waack, S (Oct 19, 2003). "Gene prediction with a hidden Markov model and a new intron submodel". Bioinformatics. 19 Suppl 2 (2): 215–25. doi: 10.1093/bioinformatics/btg1080 . PMID   14534192.
  17. Alipanahi, B; Delong, A; Weirauch, MT; Frey, BJ (Aug 2015). "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning". Nat Biotechnol. 33 (8): 831–8. doi: 10.1038/nbt.3300 . PMID   26213851.
  18. Wooley, JC; Godzik, A; Friedberg, I (Feb 26, 2010). "A primer on metagenomics". PLOS Comput Biol. 6 (2): e1000667. Bibcode:2010PLSCB...6E0667W. doi: 10.1371/journal.pcbi.1000667 . PMC   2829047 . PMID   20195499.
  19. Abe, N; Dror, I; Yang, L; Slattery, M; Zhou, T; Bussemaker, HJ; Rohs R, R; Mann, RS (Apr 9, 2015). "Deconvolving the recognition of DNA shape from sequence". Cell. 161 (2): 307–18. doi:10.1016/j.cell.2015.02.008. PMC   4422406 . PMID   25843630.
  20. Moult J; Hubbard T; Bryant SH; Fidelis K; Pedersen JT (1997). "Critical assessment of methods of protein structure prediction (CASP): round II". Proteins. Suppl 1 (S1): 2–6. doi:10.1002/(SICI)1097-0134(1997)1+<2::AID-PROT2>3.0.CO;2-T. PMID   9485489. S2CID   26823924.