Stemloc

Developer(s)	Ian Holmes (UC Berkeley)
Stable release	1
Written in	C++ (part of the DART package)
Operating system	UNIX, Linux, Mac, Cygwin on Windows XP
Type	Bioinformatics tool
Licence	Open source
Website	Stemloc homepage

Last updated December 24, 2023

In bioinformatics, Stemloc is open-source software for multiple RNA sequence alignment and RNA structure prediction. It is based on probabilistic models of RNA structure known as pair stochastic context-free grammars (also called probabilistic context-free grammars). Stemloc attempts to simultaneously predict and align the structure of RNA sequences at a lower time and space cost than previous methods with the same aim. It implements constrained versions of the Sankoff algorithm by introducing both fold and alignment constraints, which reduces processor and memory usage and allows larger RNA sequences to be analyzed on commodity hardware. Stemloc was written in 2004 by Ian Holmes.

Terminology

Fold: The secondary structure that an RNA molecule acquires through intra-molecular interactions; RNA folding is the process by which this happens.
Fold envelope: The set of candidate folds to be considered in the algorithm.
Alignment envelope: The set of candidate alignments to be considered in the algorithm.

Background

An algorithm published by David Sankoff in 1985 uses dynamic programming to simultaneously align multiple RNA sequences and predict their structures. For $N$ sequences of length $L$, the Sankoff algorithm takes $O(L^{3N})$ time and $O(L^{2N})$ space. This is prohibitively expensive, which motivated the development of more efficient RNA analysis tools such as Stemloc. The initial goal of Stemloc was to reduce the time and space cost of simultaneous alignment and structure prediction of two RNA sequences by using a stochastic context-free grammar (SCFG) scoring scheme and by implementing constrained versions of the Sankoff algorithm.
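To make this scaling concrete, the following minimal sketch evaluates the two growth terms for the pairwise case ($N = 2$), ignoring constant factors. The function name `sankoff_cost` is illustrative and is not part of Stemloc.

```python
def sankoff_cost(length: int, n_sequences: int) -> tuple[int, int]:
    """Return the growth terms (L^(3N), L^(2N)) for time and space.

    Constant factors are ignored, so these are orders of magnitude only.
    """
    time_term = length ** (3 * n_sequences)
    space_term = length ** (2 * n_sequences)
    return time_term, space_term


# Pairwise case: time grows as L^6 and space as L^4.
for length in (50, 100, 200):
    time_term, space_term = sankoff_cost(length, 2)
    print(f"L={length}: time ~ {time_term:.1e}, space ~ {space_term:.1e}")
```

Under these assumptions, doubling the sequence length multiplies the pairwise time term by 64 and the space term by 16, which is why constraining the search matters even for two sequences.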

Stemloc uses alignment envelopes and fold envelopes to simultaneously constrain both the alignment and the secondary structures of the sequences being compared. Fold envelopes can be used to "prune" the search over secondary structures and determine which subsequences of the two RNA sequences are considered in the algorithm, for example by including or excluding specific hydrogen-bonded base pairings. Alignment envelopes can be used to "prune" the search over alignments and determine the possible "cutpoints" in the alignment of the two sequences, for example by including or excluding specific residue-level homologies. Fold envelopes are pre-calculated for each sequence individually, and alignment envelopes are pre-calculated by comparing the two sequences while ignoring secondary structure. Both global and local alignment are supported.
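The idea of an envelope as a set of permitted coordinates can be sketched as follows. This is only an illustration: it uses a fixed-width band around the diagonal and a maximum subsequence length as stand-in constraints, whereas Stemloc derives its envelopes from pre-folding and pre-alignment. Both function names are hypothetical.

```python
def banded_alignment_envelope(len_x, len_y, band):
    """Cutpoints (i, j) kept when j lies within `band` of the rescaled diagonal."""
    envelope = set()
    for i in range(len_x + 1):
        diag = round(i * len_y / len_x) if len_x else 0
        for j in range(max(0, diag - band), min(len_y, diag + band) + 1):
            envelope.add((i, j))
    return envelope


def length_limited_fold_envelope(length, max_span):
    """Subsequences (start, end), end exclusive, no longer than `max_span`."""
    return {
        (start, end)
        for start in range(length + 1)
        for end in range(start, min(length, start + max_span) + 1)
    }


full = (40 + 1) * (40 + 1)
kept = banded_alignment_envelope(40, 40, band=5)
print(f"alignment cutpoints kept: {len(kept)} of {full}")
```

A dynamic programming routine would then visit only the coordinates present in these sets, which is where the savings in time and memory come from.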

Input

Input to Stemloc can be in either FASTA or Stockholm format. A sample invocation is shown below:

stemloc --local dynalign.trna

The "--local" option runs the analysis in local alignment mode; "--global" selects global alignment mode instead.

Output

The output is in Stockholm format. It shows the sequence names, the coordinates of the matches, the alignment, the consensus primary sequence, the secondary structure of each sequence, the consensus secondary structure, and the log-odds score of the alignment in bits. The "//" line separates alignments or marks the end of the file. Sample output is shown below:

# STOCKHOLM 1.0
#=GR RD0260/26-67 SS ..<<<<<.......>>>>>.....(<<<<.......>>>>).
RD0260/26-67         UACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUC
#=GC PS_cons         UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C
RD0500/26-66         UACGACCCUGUCACGGUCGUGA-CGCGGGUUCgAAUCCCGCC
#=GR RD0500/26-66 SS ..<<<<<.......>>>>>...-.<<<<<.......>>>>>.
#=GC SS_cons         ..<<<<<.......>>>>>.....<<<<<.......>>>>>.
#=GF SC      31.872
//
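Output of this kind can be read with a few lines of code. The sketch below is a minimal reader for a single alignment that handles only the line types appearing in the sample above (sequence lines, "#=GR", "#=GC", "#=GF" and "//"). The function name `parse_stockholm` and the dictionary layout are assumptions made for illustration; this is not a full Stockholm parser.

```python
def parse_stockholm(text):
    """Split one Stockholm alignment into sequences and annotations."""
    record = {"seqs": {}, "gr": {}, "gc": {}, "gf": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("# STOCKHOLM"):
            continue
        if line == "//":              # end of this alignment
            break
        fields = line.split()
        if fields[0] == "#=GR":       # per-sequence annotation: name, tag, value
            record["gr"][(fields[1], fields[2])] = fields[3]
        elif fields[0] == "#=GC":     # per-column annotation: tag, value
            record["gc"][fields[1]] = fields[2]
        elif fields[0] == "#=GF":     # per-file annotation: tag, free text
            record["gf"][fields[1]] = " ".join(fields[2:])
        elif not line.startswith("#"):
            record["seqs"][fields[0]] = fields[1]
    return record


SAMPLE = """\
# STOCKHOLM 1.0
#=GR RD0260/26-67 SS ..<<<<<.......>>>>>.....(<<<<.......>>>>).
RD0260/26-67         UACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUC
#=GC PS_cons         UAC..CCCUGUCACGG..G.GA..G.GGGUUC.AAUCCC..C
RD0500/26-66         UACGACCCUGUCACGGUCGUGA-CGCGGGUUCgAAUCCCGCC
#=GR RD0500/26-66 SS ..<<<<<.......>>>>>...-.<<<<<.......>>>>>.
#=GC SS_cons         ..<<<<<.......>>>>>.....<<<<<.......>>>>>.
#=GF SC      31.872
//
"""

record = parse_stockholm(SAMPLE)
print("score in bits:", float(record["gf"]["SC"]))
print("consensus structure:", record["gc"]["SS_cons"])
```

Every sequence and annotation row in the sample has the same number of columns, so the structure strings can be indexed position by position against the aligned sequences.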

Process

Stemloc relies heavily on stochastic context-free grammars, which act as the scoring scheme for the algorithm. Because Sankoff's algorithm considers all possible folds and all possible alignments, it is thorough and accurate, but it takes a considerable amount of time to produce any output. To improve on this, Stemloc allows the user to constrain the total number of folds and alignments to be considered. More specifically, each sequence can be pre-folded individually in $O(L^{3})$ time, and the sequences can be pre-aligned, ignoring secondary structure, in $O(L^{2})$ time. For example, the "-fast" option used below considers only the 100 best RNA structures rather than all possible folds. The "-log DOTPLOT" option outputs a visual representation of the fold and alignment envelopes.

stemloc nanos-tiny.rna -fast -log DOTPLOT
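The cubic cost of folding a single sequence can be seen in a classic base-pair maximization recursion (the Nussinov algorithm). This is a deliberately simpler technique than the SCFG scoring Stemloc uses and is shown only to illustrate where the three nested loops over sequence positions come from. The minimum loop length of 3 and the set of allowed pairs are assumptions of this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Nussinov recursion: maximum number of nested base pairs in `seq`."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):                # O(L) subsequence lengths
        for i in range(n - span):                      # O(L) start positions
            j = i + span
            score = best[i + 1][j]                     # position i left unpaired
            for k in range(i + min_loop + 1, j + 1):   # O(L) partners for i
                if (seq[i], seq[k]) in PAIRS:
                    right = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_base_pairs("GGGAAAUCCC"))
```

A fold envelope restricts which (i, j) subsequences such a recursion needs to visit, so the same loops run over far fewer cells.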

Constraining the envelopes

The central idea of Stemloc is that the user can set a limit on the number of folds and alignments that are sampled to create the envelopes. This is done with the options "-nf" and "-na", which set the number of folds and the number of alignments to be considered, respectively. A value of -1 removes the limit, so using -1 for both parameters runs the unconstrained Sankoff algorithm on the input dataset:

stemloc nanos-tiny.rna -nf -1 -na -1

Parameter training

Another feature of Stemloc is its ability to estimate the parameters of probabilistic models such as stochastic context-free grammars from data. Stemloc uses the Inside-Outside algorithm to maximize the likelihood of a training set under the grammar. The default parameters were trained on a selection of pairwise alignments with between 30% and 40% sequence identity from Rfam version 5.0. These parameters are not always effective for other data, which is why the ability to retrain them can be helpful.
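The likelihood that training maximizes is computed by the inside pass of the Inside-Outside algorithm. The sketch below shows only that inside pass, for a single-sequence grammar in Chomsky normal form; Stemloc works with pair grammars over two sequences, and full training also needs the outside pass and a re-estimation step. The toy grammar and the function name `inside_probability` are hypothetical.

```python
from collections import defaultdict


def inside_probability(tokens, binary_rules, lexical_rules, start="S"):
    """Total probability of `tokens` under a CNF stochastic grammar.

    binary_rules:  {(A, B, C): p} for rules A -> B C
    lexical_rules: {(A, word): p} for rules A -> word
    """
    n = len(tokens)
    inside = defaultdict(float)   # (i, j, A) -> P(A derives tokens[i:j])
    for i, word in enumerate(tokens):
        for (lhs, terminal), p in lexical_rules.items():
            if terminal == word:
                inside[i, i + 1, lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for split in range(i + 1, j):
                for (lhs, left, right), p in binary_rules.items():
                    inside[i, j, lhs] += (
                        p * inside[i, split, left] * inside[split, j, right]
                    )
    return inside[0, n, start]


# Toy grammar (hypothetical): S -> S S with probability 0.4, S -> 'a' with 0.6.
binary = {("S", "S", "S"): 0.4}
lexical = {("S", "a"): 0.6}
print(inside_probability(list("aaa"), binary, lexical))
```

Summing over every split point and rule is what makes the result a total likelihood over all parses rather than the score of a single best parse.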

In practice

Stemloc has since been used in a variety of research publications on RNA structure analysis, most notably in the study of optimal multiple sequence alignment.

References

Holmes I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005 Mar 24;6:73.
Sankoff D. (1985) Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM Journal on Applied Mathematics. 1985 Oct;45(5).

External links

Stemloc homepage and tutorial
