Trajectory inference

Trajectory inference as implemented in Slingshot for (a) a simulated two-dimensional dataset and (b) a single-cell RNA-seq dataset of the olfactory epithelium.

Trajectory inference or pseudotemporal ordering is a computational technique used in single-cell transcriptomics to determine the pattern of a dynamic process experienced by cells and then arrange cells based on their progression through the process. Single-cell protocols have much higher levels of noise than bulk RNA-seq, [1] so a common step in a single-cell transcriptomics workflow is the clustering of cells into subgroups. [2] Clustering can contend with this inherent variation by combining the signal from many cells, while allowing for the identification of cell types. [3] However, some differences in gene expression between cells are the result of dynamic processes such as the cell cycle, cell differentiation, or response to an external stimulus. Trajectory inference seeks to characterize such differences by placing cells along a continuous path that represents the evolution of the process rather than dividing cells into discrete clusters. [4] In some methods this is done by projecting cells onto an axis called pseudotime, which represents the progression through the process. [5]

Methods

Since 2015, more than 50 algorithms for trajectory inference have been created. [6] Although the approaches taken are diverse, the methods share some common steps. Typically, an algorithm consists of dimensionality reduction to reduce the complexity of the data, trajectory building to determine the structure of the dynamic process, and projection of the data onto the trajectory so that cells are positioned by their evolution through the process and cells with similar expression profiles are situated near each other. [6] Trajectory inference algorithms differ in the specific procedure used for dimensionality reduction, the kinds of structures that can be used to represent the dynamic process, and the prior information that is required or can be provided. [2]
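
The projection step can be illustrated with a minimal Python sketch. It assumes that the trajectory has already been built and is given as a two-dimensional piecewise-linear path, and it takes a cell's pseudotime to be the arc length of its closest point on that path. The function names, the path and the cell coordinates are invented for illustration and do not come from any particular published method.

```python
import math

def project_onto_path(point, path):
    """Return (arc_length, distance) of the point on a 2-D polyline closest to `point`."""
    best_arc, best_dist = 0.0, math.inf
    travelled = 0.0  # length of the segments already passed
    for (ax, ay), (bx, by) in zip(path, path[1:]):
        dx, dy = bx - ax, by - ay
        seg_len_sq = dx * dx + dy * dy
        seg_len = math.sqrt(seg_len_sq)
        # Fraction along the segment of the orthogonal projection, clamped to [0, 1].
        if seg_len_sq == 0:
            t = 0.0
        else:
            t = ((point[0] - ax) * dx + (point[1] - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
        px, py = ax + t * dx, ay + t * dy
        dist = math.hypot(point[0] - px, point[1] - py)
        if dist < best_dist:
            best_dist = dist
            best_arc = travelled + t * seg_len
        travelled += seg_len
    return best_arc, best_dist

def pseudotime(cells, path):
    """Pseudotime of each cell: arc length of its projection onto the path."""
    return [project_onto_path(cell, path)[0] for cell in cells]

# Hypothetical trajectory with one bend, and four toy cells scattered around it.
path = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
cells = [(0.5, 0.1), (1.5, -0.2), (2.1, 1.0), (1.9, 2.5)]
times = pseudotime(cells, path)
```

Cells that lie beyond either end of the path are clamped to the corresponding endpoint, which is one of several possible conventions.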

PCA of a multivariate Gaussian distribution. The vectors shown are the first (longer vector) and second principal components, which indicate the directions of maximum variance.

Dimensionality reduction

The data produced by single-cell RNA-seq can consist of thousands of cells, each with expression levels recorded across thousands of genes. [7] To process data of such high dimensionality efficiently, many trajectory inference algorithms employ a dimensionality reduction procedure such as principal component analysis (PCA), independent component analysis (ICA), or t-SNE as their first step. [8] The purpose of this step is to combine many features of the data into a smaller number of more informative measures. [4] For example, a coordinate resulting from dimensionality reduction could combine expression levels from many genes that are associated with the cell cycle into one value that represents a cell's position in the cell cycle. [8] Such a transformation corresponds to dimensionality reduction in the feature space, but dimensionality reduction can also be applied to the sample space by clustering together groups of similar cells. [1]
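
As a rough illustration of how several correlated genes can be combined into a single coordinate, the following Python sketch computes the first principal component of a small cells-by-genes matrix by power iteration on the sample covariance matrix. The expression values are invented, the function name is hypothetical, and power iteration stands in for the full decompositions used by real PCA implementations; the iteration also assumes that the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit vector, per-row scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features (genes).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting vector (an assumption of this sketch)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Score of each cell: projection of its centered profile onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data: four cells, three genes. Genes 0 and 1 rise together; gene 2 is constant.
expression = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 5.0],
    [3.0, 6.0, 5.0],
    [4.0, 8.0, 5.0],
]
component, scores = first_principal_component(expression)
```

In this example the component weights genes 0 and 1 in a 1:2 ratio and ignores the constant gene, so the single score per cell summarizes both varying genes.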

Trajectory building

A graph with six vertices. Many trajectory inference algorithms use graphs to build the trajectory.

Many methods represent the structure of the dynamic process via a graph-based approach. In such an approach, the vertices of the graph correspond to states in the dynamic process, such as cell types in cell differentiation, and the edges between the vertices correspond to transitions between the states. [6] The creation of the trajectory graph can be accomplished using k-nearest neighbors or minimum spanning tree algorithms. [9] The topology of the trajectory refers to the structure of the graph, and different algorithms are limited to creating graph topologies of a particular type, such as linear, branching, or cyclic. [4]
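
A minimum spanning tree over a handful of states can be sketched with Prim's algorithm as follows. The coordinates are hypothetical positions of five states in a reduced two-dimensional space, the function name is invented, and the quadratic search is written for clarity rather than speed.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns a list of (i, j) edges."""
    n = len(points)
    in_tree = {0}  # grow the tree from vertex 0
    edges = []
    while len(in_tree) < n:
        best = None
        # Find the shortest edge that leaves the current tree.
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    d = math.dist(points[i], points[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        _, i, j = best
        in_tree.add(j)
        edges.append((i, j))
    return edges

# Hypothetical states: a linear stretch (0, 1, 2) that splits into two branches (3, 4).
states = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
tree_edges = minimum_spanning_tree(states)
normalized = sorted(tuple(sorted(edge)) for edge in tree_edges)
```

In the resulting tree, vertex 2 has three neighbours, so the topology is branching rather than linear.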

Use of prior information

Some methods require or allow for the input of prior information which is used to guide the creation of the trajectory. The use of prior information can lead to more accurate trajectory determination, but poor priors can lead the algorithm astray or bias results towards expectations. [6] Examples of prior information that can be used in trajectory inference are the selection of start cells that are at the beginning of the trajectory, the number of branches in the trajectory, and the number of end states for the trajectory. [10]

Software

MARGARET

MARGARET employs a deep unsupervised metric learning approach for inferring the cellular latent space and cell clusters. The trajectory is modeled using a cluster-connectivity graph to capture complex trajectory topologies. MARGARET utilizes the inferred trajectory for determining terminal states and inferring cell-fate plasticity using a scalable Absorbing Markov chain model. [11]
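
The role of an absorbing Markov chain can be illustrated with a textbook computation of absorption probabilities, that is, the probability that a random walk started in a given state eventually ends in each terminal state. The Python sketch below is not MARGARET's implementation: the transition matrix, the state labels and the function name are invented, and a simple fixed-point iteration is used in place of the closed-form solution B = (I − Q)⁻¹R, where Q holds transitions among transient states and R holds transitions from transient to absorbing states.

```python
def absorption_probabilities(P, absorbing, sweeps=1000):
    """B[i][k] = probability that a walk started in state i is absorbed in absorbing[k]."""
    n = len(P)
    # Absorbing states end in themselves with probability 1.
    B = [[1.0 if i == a else 0.0 for a in absorbing] for i in range(n)]
    transient = [i for i in range(n) if i not in absorbing]
    for _ in range(sweeps):
        for i in transient:
            # One step of the walk: average the neighbours' current estimates.
            B[i] = [sum(P[i][j] * B[j][k] for j in range(n))
                    for k in range(len(absorbing))]
    return B

# Hypothetical states: 0 = progenitor, 1 = intermediate, 2 = fate A, 3 = fate B.
P = [
    [0.5, 0.3, 0.0, 0.2],  # progenitor
    [0.0, 0.2, 0.6, 0.2],  # intermediate
    [0.0, 0.0, 1.0, 0.0],  # fate A (absorbing)
    [0.0, 0.0, 0.0, 1.0],  # fate B (absorbing)
]
fates = absorption_probabilities(P, absorbing=[2, 3])
```

A state whose absorption probabilities are spread over several terminal states, such as the progenitor here (0.45 for fate A and 0.55 for fate B), can be read as less committed than one whose probabilities concentrate on a single fate.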

Monocle

Monocle first employs a differential expression test to reduce the number of genes, then applies independent component analysis for additional dimensionality reduction. To build the trajectory, Monocle computes a minimum spanning tree and then finds the longest connected path in that tree. Each cell is projected onto the nearest point along that path. [5]
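
The longest path in a tree can be found with two traversals: the vertex farthest from an arbitrary start is one end of the path, and the vertex farthest from that end is the other. The Python sketch below illustrates this general technique on an invented five-cell tree; it is not Monocle's code, and the cell names and edge weights are hypothetical.

```python
def farthest(tree, start):
    """Return (farthest node from start, parent map) for a weighted tree."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in tree[u]:
            if v not in dist:  # in a tree, each node is reached exactly once
                dist[v] = dist[u] + w
                parent[v] = u
                stack.append(v)
    return max(dist, key=dist.get), parent

def longest_path(tree):
    """Longest path in a tree with positive edge weights, via two traversals."""
    one_end, _ = farthest(tree, next(iter(tree)))
    other_end, parent = farthest(tree, one_end)
    # Walk back from the second end to the first using the parent links.
    path = [other_end]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

# Hypothetical spanning tree: a backbone c0-c1-c2-c3 with a short side branch to c4.
tree = {
    "c0": [("c1", 1.0)],
    "c1": [("c0", 1.0), ("c2", 1.0), ("c4", 0.4)],
    "c2": [("c1", 1.0), ("c3", 1.0)],
    "c3": [("c2", 1.0)],
    "c4": [("c1", 0.4)],
}
backbone = longest_path(tree)
```

The path has no intrinsic direction, so which end counts as the start of the process has to be decided separately. Cells off the backbone, such as c4 in this example, would then be assigned to their nearest point on it.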

p-Creode

p-Creode finds the most likely path through a density-adjusted k-nearest neighbor graph. Graphs from an ensemble are scored with a graph similarity metric to select the most representative topology. p-Creode has been tested on a range of single-cell platforms, including mass cytometry, multiplex immunofluorescence, [12] and single-cell RNA-seq. No prior information is required. [13]

Slingshot

Slingshot takes cluster labels as input and then orders these clusters into lineages by the construction of a minimum spanning tree. Paths through the tree are smoothed by fitting simultaneous principal curves and a cell's pseudotime value is determined by its projection onto one or more of these curves. Prior information, such as initial and terminal clusters, is optional. [10]
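
Once a tree over clusters and a starting cluster are available, one simple way to read lineages off the tree is to list every path from the start to a leaf. The Python sketch below assumes that reading of "lineage"; the cluster labels and the function name are hypothetical, and the smoothing with simultaneous principal curves is not shown.

```python
def lineages(tree, start):
    """Enumerate paths from `start` to every leaf of an undirected tree {node: [neighbours]}."""
    result = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        # In a tree, the only neighbour already on the path is the parent.
        children = [n for n in tree[node] if n not in path]
        if not children:
            result.append(path)
        for child in children:
            stack.append((child, path + [child]))
    return sorted(result)

# Hypothetical cluster tree with one branch point.
cluster_tree = {
    "stem": ["progenitor"],
    "progenitor": ["stem", "neuron", "glia"],
    "neuron": ["progenitor"],
    "glia": ["progenitor"],
}
paths = lineages(cluster_tree, start="stem")
```

Choosing a different starting cluster changes the lineages even though the tree is the same, which is where optional prior information about initial clusters enters.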

TSCAN

TSCAN performs dimensionality reduction using principal component analysis and clusters cells using a mixture model. A minimum spanning tree is calculated using the centers of the clusters and the trajectory is determined as the longest connected path of that tree. TSCAN is an unsupervised algorithm that requires no prior information. [14]

Wanderlust/Wishbone

Wanderlust was developed for analysis of mass cytometry data, but has been adapted for single-cell transcriptomics applications. A k-nearest neighbors algorithm is used to construct a graph that connects every cell to the k cells closest to it with respect to a metric such as Euclidean distance or cosine distance. Wanderlust requires the input of a starting cell as prior information. [15]
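
The idea of ordering cells by their distance from a starting cell along a nearest-neighbour graph can be sketched as follows. This is a simplified Python illustration, not the Wanderlust algorithm itself: it builds a single symmetrized k-nearest neighbors graph with Euclidean distances and runs Dijkstra's algorithm from the start cell, leaving out any refinements of the published method. The coordinates and function names are invented.

```python
import heapq
import math

def knn_graph(points, k):
    """Undirected graph linking each point to its k nearest neighbours (Euclidean)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d  # symmetrize so edges can be walked both ways
    return graph

def shortest_path_distances(graph, start):
    """Dijkstra's algorithm: graph distance from `start` to every reachable node."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical cells lying along an arch; cell 0 is the user-supplied start cell.
cells = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 3), (3, 2), (3, 1), (3, 0)]
graph = knn_graph(cells, k=2)
dist = shortest_path_distances(graph, start=0)
ordering = sorted(dist, key=dist.get)
```

The last cell is only 3 units from the start in a straight line, but about 7.83 units away along the graph, so the graph distance follows the arch instead of cutting across it.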

Wishbone is built on Wanderlust and allows for a bifurcation in the graph topology, whereas Wanderlust creates a linear graph. Wishbone combines principal component analysis and diffusion maps to achieve dimensionality reduction and, like Wanderlust, then constructs a k-nearest neighbors graph. [16]

Waterfall

Waterfall performs dimensionality reduction via principal component analysis and uses a k-means algorithm to find cell clusters. A minimum spanning tree is built between the centers of the clusters. Waterfall is entirely unsupervised, requiring no prior information, and produces linear trajectories. [17]
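
The clustering step can be illustrated with a minimal version of Lloyd's k-means algorithm in Python. The sketch assumes that the cells have already been reduced to two dimensions and that initial centers are supplied by the caller; the data, the starting centers and the function name are invented and are not taken from Waterfall.

```python
import math

def k_means(points, centers, iterations=100):
    """Lloyd's algorithm; returns (labels, centers) after convergence or `iterations` rounds."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy cells in a reduced two-dimensional space, forming three well-separated groups.
cells = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels, centers = k_means(cells, centers=[(0, 0), (5, 5), (10, 0)])
```

The resulting cluster centers are the points between which a spanning tree would then be built.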

References

  1. Bacher, Rhonda; Kendziorski, Christina (2016-04-07). "Design and computational analysis of single-cell RNA-sequencing experiments". Genome Biology. 17 (1): 63. doi:10.1186/s13059-016-0927-y. ISSN 1474-760X. PMC 4823857. PMID 27052890.
  2. Hwang, Byungjin; Lee, Ji Hyun; Bang, Duhee (2018-08-07). "Single-cell RNA sequencing technologies and bioinformatics pipelines". Experimental & Molecular Medicine. 50 (8): 1–14. doi:10.1038/s12276-018-0071-8. ISSN 2092-6413. PMC 6082860. PMID 30089861.
  3. Stegle, Oliver; Teichmann, Sarah A.; Marioni, John C. (2015-01-28). "Computational and analytical challenges in single-cell transcriptomics". Nature Reviews Genetics. 16 (3): 133–145. doi:10.1038/nrg3833. ISSN 1471-0056. PMID 25628217. S2CID 205486032.
  4. Cannoodt, Robrecht; Saelens, Wouter; Saeys, Yvan (2016-10-19). "Computational methods for trajectory inference from single-cell transcriptomics". European Journal of Immunology. 46 (11): 2496–2506. doi:10.1002/eji.201646347. ISSN 0014-2980. PMID 27682842. S2CID 19562455.
  5. Trapnell, Cole; Cacchiarelli, Davide; Grimsby, Jonna; Pokharel, Prapti; Li, Shuqiang; Morse, Michael; Lennon, Niall J; Livak, Kenneth J; Mikkelsen, Tarjei S (2014-03-23). "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells". Nature Biotechnology. 32 (4): 381–386. doi:10.1038/nbt.2859. ISSN 1087-0156. PMC 4122333. PMID 24658644.
  6. Saelens, Wouter; Cannoodt, Robrecht; Todorov, Helena; Saeys, Yvan (2019-01-04). "A comparison of single-cell trajectory inference methods". Nature Biotechnology. 37 (5): 547–555. doi:10.1038/s41587-019-0071-9. PMID 30936559. S2CID 89616753.
  7. Conesa, Ana; Madrigal, Pedro; Tarazona, Sonia; Gomez-Cabrero, David; Cervera, Alejandra; McPherson, Andrew; Szcześniak, Michał Wojciech; Gaffney, Daniel J.; Elo, Laura L. (2016-01-26). "A survey of best practices for RNA-seq data analysis". Genome Biology. 17 (1): 13. doi:10.1186/s13059-016-0881-8. ISSN 1474-760X. PMC 4728800. PMID 26813401.
  8. Yosef, Nir; Regev, Aviv; Wagner, Allon (November 2016). "Revealing the vectors of cellular identity with single-cell genomics". Nature Biotechnology. 34 (11): 1145–1160. doi:10.1038/nbt.3711. ISSN 1546-1696. PMC 5465644. PMID 27824854.
  9. Cahan, Patrick; Tan, Yuqi; Kumar, Pavithra (2017-01-01). "Understanding development and stem cells using single cell-based analyses of gene expression". Development. 144 (1): 17–32. doi:10.1242/dev.133058. ISSN 1477-9129. PMC 5278625. PMID 28049689.
  10. Street, Kelly; Risso, Davide; Fletcher, Russell B.; Das, Diya; Ngai, John; Yosef, Nir; Purdom, Elizabeth; Dudoit, Sandrine (2018-06-19). "Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics". BMC Genomics. 19 (1): 477. doi:10.1186/s12864-018-4772-0. PMC 6007078. PMID 29914354.
  11. Pandey, Kushagra; Zafar, Hamim (2022). "Inference of cell state transitions and cell fate plasticity from single-cell with MARGARET". Nucleic Acids Research. 50 (15): e86. doi:10.1093/nar/gkac412. ISSN 0305-1048. PMC 9410915. PMID 35639499.
  12. Gerdes, M. J.; Sevinsky, C. J.; Sood, A.; Adak, S.; Bello, M. O.; Bordwell, A.; Can, A.; Corwin, A.; Dinn, S. (2013-07-01). "Highly multiplexed single-cell analysis of formalin-fixed, paraffin-embedded cancer tissue". Proceedings of the National Academy of Sciences. 110 (29): 11982–11987. Bibcode:2013PNAS..11011982G. doi:10.1073/pnas.1300136110. ISSN 0027-8424. PMC 3718135. PMID 23818604.
  13. Lau, Ken S.; Coffey, Robert J.; Gerdes, Michael J.; Liu, Qi; Franklin, Jeffrey L.; Roland, Joseph T.; Ping, Jie; Simmons, Alan J.; McKinley, Eliot T. (2018-01-24). "Unsupervised Trajectory Analysis of Single-Cell RNA-Seq and Imaging Data Reveals Alternative Tuft Cell Origins in the Gut". Cell Systems. 6 (1): 37–51.e9. doi:10.1016/j.cels.2017.10.012. ISSN 2405-4712. PMC 5799016. PMID 29153838.
  14. Ji, Zhicheng; Ji, Hongkai (2016-05-13). "TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis". Nucleic Acids Research. 44 (13): e117. doi:10.1093/nar/gkw430. ISSN 0305-1048. PMC 4994863. PMID 27179027.
  15. Bendall, Sean C.; Davis, Kara L.; Amir, El-ad David; Tadmor, Michelle D.; Simonds, Erin F.; Chen, Tiffany J.; Shenfeld, Daniel K.; Nolan, Garry P.; Pe'Er, Dana (2014-04-24). "Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development". Cell. 157 (3): 714–725. doi:10.1016/j.cell.2014.04.005. ISSN 0092-8674. PMC 4045247. PMID 24766814.
  16. Setty, Manu; Tadmor, Michelle D; Reich-Zeliger, Shlomit; Angel, Omer; Salame, Tomer Meir; Kathail, Pooja; Choi, Kristy; Bendall, Sean; Friedman, Nir (2016-05-02). "Wishbone identifies bifurcating developmental trajectories from single-cell data". Nature Biotechnology. 34 (6): 637–645. doi:10.1038/nbt.3569. ISSN 1087-0156. PMC 4900897. PMID 27136076.
  17. Shin, Jaehoon; Berg, Daniel A.; Zhu, Yunhua; Shin, Joseph Y.; Song, Juan; Bonaguidi, Michael A.; Enikolopov, Grigori; Nauen, David W.; Christian, Kimberly M.; Ming, Guo-li; Song, Hongjun (2015-09-03). "Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis". Cell Stem Cell. 17 (3): 360–372. doi:10.1016/j.stem.2015.07.013. ISSN 1934-5909. PMC 8638014. PMID 26299571.