Representative sequences

In social sciences and other domains, representative sequences are whole sequences that best characterize or summarize a set of sequences. [1] In bioinformatics, representative sequences also designate substrings of a sequence that characterize the sequence. [2] [3]

Social sciences

Representative sequences covering 27% of 2000 cohabitation sequences between age 15 and 30 (extract of biographical data from the Swiss Household Panel)

In sequence analysis in the social sciences, representative sequences are used to summarize sets of sequences describing, for example, the family life courses or professional careers of several thousand individuals. [4]

The identification of representative sequences [1] [4] proceeds from the pairwise dissimilarities between sequences. One typical solution is the medoid sequence, i.e., the observed sequence that minimizes the sum of its distances to all other sequences in the set. Another solution is the densest observed sequence, i.e., the sequence with the greatest number of other sequences in its neighborhood. When the sequences are highly diverse, a single representative is often insufficient to characterize the set. In such cases, the aim is to find the smallest possible set of representative sequences that covers a given percentage of all sequences, where a sequence is covered if it lies in the neighborhood of at least one representative.
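The three notions above (medoid, densest sequence, covering set) can be illustrated with a minimal Python sketch. It assumes a precomputed pairwise dissimilarity matrix and a fixed neighborhood radius. The function names, the Hamming distance, and the toy sequences are illustrative assumptions, and the greedy search is a simple heuristic that is not guaranteed to return the smallest covering set, nor is it necessarily the procedure used in the cited works.

```python
from typing import List, Sequence, Set

Matrix = Sequence[Sequence[float]]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def medoid(diss: Matrix) -> int:
    """Index of the sequence with the smallest sum of distances to all others."""
    return min(range(len(diss)), key=lambda i: sum(diss[i]))


def neighborhood(diss: Matrix, i: int, radius: float) -> Set[int]:
    """Indices of the sequences within `radius` of sequence i (i included)."""
    return {j for j, d in enumerate(diss[i]) if d <= radius}


def densest(diss: Matrix, radius: float) -> int:
    """Index of the sequence with the most sequences in its neighborhood."""
    return max(range(len(diss)),
               key=lambda i: len(neighborhood(diss, i, radius)))


def covering_set(diss: Matrix, radius: float, coverage: float) -> List[int]:
    """Greedy heuristic: repeatedly add the sequence whose neighborhood
    covers the most still-uncovered sequences, until the requested share
    (0 < coverage <= 1) of all sequences is covered."""
    n = len(diss)
    covered: Set[int] = set()
    reps: List[int] = []
    while len(covered) < coverage * n:
        best = max(range(n),
                   key=lambda i: len(neighborhood(diss, i, radius) - covered))
        reps.append(best)
        covered |= neighborhood(diss, best, radius)
    return reps


# Toy data (hypothetical): five state sequences of length 4.
seqs = ["AAAA", "AAAB", "AABB", "BBBB", "BBBA"]
D = [[hamming(a, b) for b in seqs] for a in seqs]

print(medoid(D))                  # index of the medoid sequence
print(densest(D, radius=1))       # index of the densest sequence
print(covering_set(D, radius=1, coverage=0.8))
```

The medoid and the densest sequence need not coincide: the first depends on all distances, the second only on how many sequences fall within the chosen radius.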

Another solution is to select the medoids of relative frequency groups. The method consists of sorting the sequences (for example, according to the first principal coordinate of the pairwise dissimilarity matrix), splitting the sorted list into equal-sized groups (called relative frequency groups), and selecting the medoid of each group. [5]
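The sort, split, and select steps can be sketched as follows in plain Python. This is only an illustration under stated assumptions: the first principal coordinate is approximated by power iteration on the double-centered matrix of squared dissimilarities (which presumes that the eigenvalue of largest magnitude is positive), the sign of the coordinate is arbitrary so the group order may be reversed, and all function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def first_principal_coordinate(diss: Matrix, iterations: int = 200) -> List[float]:
    """Approximate the leading eigenvector of the double-centered matrix
    of squared dissimilarities (classical scaling) by power iteration.
    Only the ordering of the returned values is used below."""
    n = len(diss)
    sq = [[d * d for d in row] for row in diss]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # B = -1/2 * (squared distances, centered by rows and columns)
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def relative_frequency_medoids(diss: Matrix, k: int) -> List[int]:
    """Sort sequences by the first principal coordinate, split the sorted
    list into k (nearly) equal-sized groups, and return the medoid of each
    group. Requires 1 <= k <= number of sequences."""
    n = len(diss)
    coord = first_principal_coordinate(diss)
    order = sorted(range(n), key=lambda i: coord[i])
    medoids = []
    for g in range(k):
        group = order[g * n // k:(g + 1) * n // k]
        # Medoid: smallest sum of distances to the members of its own group.
        medoids.append(min(group, key=lambda i: sum(diss[i][j] for j in group)))
    return medoids


# Toy data (hypothetical): dissimilarities between six points on a line.
positions = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in positions] for p in positions]
print(relative_frequency_medoids(D, k=2))
```

Any other sort key (for instance a substantively meaningful score) could replace the principal coordinate without changing the split and select steps.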

The methods for identifying representative sequences described above have been implemented in the R package TraMineR. [6]

Bioinformatics

Representative sequences are short regions within protein sequences that can be used to approximate the evolutionary relationships of those proteins, or of the organisms from which they come. More precisely, they are contiguous subsequences (typically 300 residues) from ubiquitous, conserved proteins, chosen such that each orthologous family of representative sequences, taken alone, gives a distance matrix in close agreement with the consensus matrix. [7]

Use

Protein sequences can provide data about the biological function and evolution of proteins and protein domains. Grouping and interrelating protein sequences can therefore provide information about both human biological processes and the evolutionary development of biological processes on Earth. Sequence clustering reduces a large database of sequences to a smaller set of sequence representatives, each of which should represent its cluster at the sequence level. These representatives allow effective coverage of the original database, and of sequence space, with fewer sequences. The resulting database of sequence representatives is called non-redundant, because similar (or redundant) sequences have been removed at a given similarity threshold.
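The reduction of a sequence database to a non-redundant set can be sketched with a greedy incremental clustering in Python. This is a simplified illustration, not the algorithm of any particular tool: the similarity measure is the ratio from the standard library's `difflib.SequenceMatcher`, used here as a stand-in for an alignment-based sequence identity, and the function names, threshold, and toy sequences are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; real pipelines would typically
    use an alignment-based sequence identity instead."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def non_redundant(seqs: Iterable[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering: visit sequences from longest to
    shortest; a sequence joins the first representative it matches at
    or above `threshold`, otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]  # no representative was similar enough
    return clusters


# Toy data (hypothetical peptide fragments).
database = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQ", "GSHMLEDPVD", "GSHMLEDPV"]
clusters = non_redundant(database, threshold=0.8)
for rep, members in clusters.items():
    print(rep, "represents", len(members), "sequence(s)")
```

The keys of the returned dictionary form the non-redundant set. Lowering the threshold merges more sequences into each cluster and yields fewer representatives.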

See also

Sequence analysis in social sciences

Sequence analysis in bioinformatics


References

  1. Gabadinho, Alexis; Ritschard, Gilbert; Studer, Matthias; Müller, Nicolas S. (2011), Fred, Ana; Dietz, Jan L. G.; Liu, Kecheng; Filipe, Joaquim (eds.), "Extracting and Rendering Representative Sequences", Knowledge Discovery, Knowledge Engineering and Knowledge Management, Communications in Computer and Information Science, Berlin, Heidelberg: Springer Berlin Heidelberg, vol. 128, pp. 94–106, doi:10.1007/978-3-642-19032-2_7, ISBN 978-3-642-19031-5, retrieved 2023-06-12
  2. Kuri-Morales, Angel F.; Ortiz-Posadas, Martha R. (2005), Gelbukh, Alexander; de Albornoz, Álvaro; Terashima-Marín, Hugo (eds.), "A New Approach to Sequence Representation of Proteins in Bioinformatics", MICAI 2005: Advances in Artificial Intelligence, Berlin, Heidelberg: Springer Berlin Heidelberg, vol. 3789, pp. 880–889, doi:10.1007/11579427_90, ISBN 978-3-540-29896-0, retrieved 2023-06-12
  3. Chen, William L.; Leland, Burton A.; Durant, Joseph L.; Grier, David L.; Christie, Bradley D.; Nourse, James G.; Taylor, Keith T. (2011-09-26). "Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics". Journal of Chemical Information and Modeling. 51 (9): 2186–2208. doi:10.1021/ci2001988. ISSN 1549-9596. PMID 21800899.
  4. Gabadinho, Alexis; Ritschard, Gilbert (2013). Levy, René; Widmer, Eric D. (eds.). "Searching for typical life trajectories, applied to childbirth histories". Gendered Life Courses, Between Standardization and Individualization: A European Approach Applied to Switzerland. Zurich: LIT: 287–312.
  5. Fasang, Anette Eva; Liao, Tim Futing (2014). "Visualizing Sequences in the Social Sciences: Relative Frequency Sequence Plots". Sociological Methods & Research. 43 (4): 643–676. doi:10.1177/0049124113506563. hdl:10419/209702. ISSN 0049-1241. S2CID 61487252.
  6. Gabadinho, Alexis; Ritschard, Gilbert; Müller, Nicolas S.; Studer, Matthias (2011). "Analyzing and Visualizing State Sequences in R with TraMineR". Journal of Statistical Software. 40 (4). doi:10.18637/jss.v040.i04. ISSN 1548-7660.
  7. Bern, Marshall; Goldberg, David (November 2, 2004). "Automatic selection of representative proteins for bacterial phylogeny". BMC Evolutionary Biology. 5 (34): 34. doi:10.1186/1471-2148-5-34. PMC 1175084. PMID 15927057.