Targeted projection pursuit

In this example, targeted projection pursuit (TPP) is used to explore projections of a gene expression data set. Each of the 122 points corresponds to a tumor sample belonging to one of four diagnostic classes (represented by color). For each sample, the expression levels of 100 genes were recorded (represented by the axes). The animation shows that TPP separates two of the classes clearly (red and purple) but cannot distinguish the other two (blue and green). The positions of the axes then indicate which genes' activation is most associated with each class.

Targeted projection pursuit is a statistical technique used for exploratory data analysis, information visualization, and feature selection. It allows the user to interactively explore high-dimensional data (typically with tens to hundreds of attributes) to find features or patterns of potential interest.

Conventional, or "blind", projection pursuit finds the most "interesting" projections of multidimensional data, using a search algorithm that optimizes a fixed criterion of "interestingness", such as deviation from a normal distribution. In contrast, targeted projection pursuit allows the user to explore the space of projections by manipulating data points directly in an interactive scatter plot.
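The idea of fitting a projection to a user-chosen arrangement of points can be illustrated with a minimal sketch. The text does not specify how the projection is computed, so this example uses ordinary least squares as a stand-in: given a data matrix and a target 2D position for each point, it finds the linear projection whose output best matches the targets. The function name fit_projection, the synthetic data, and the triangular target layout are all assumptions made for illustration.

```python
import numpy as np

def fit_projection(X, target):
    """Return a (n_attrs x 2) matrix P such that centered X @ P
    approximates the centered target positions in the least-squares sense.
    Hypothetical helper; least squares is an assumed stand-in optimizer."""
    Xc = X - X.mean(axis=0)
    Tc = target - target.mean(axis=0)
    P, *_ = np.linalg.lstsq(Xc, Tc, rcond=None)
    return P

rng = np.random.default_rng(0)
n_per_class, n_attrs = 30, 10
labels = np.repeat([0, 1, 2], n_per_class)

# Synthetic data: each class is shifted along a different attribute,
# and the remaining attributes are pure noise.
X = rng.normal(size=(labels.size, n_attrs))
for cls in range(3):
    X[labels == cls, cls] += 4.0

# Target view: stands in for the user dragging each class
# towards a different corner of a triangle in the scatter plot.
corners = np.array([[0.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
target = corners[labels]

P = fit_projection(X, target)
view = (X - X.mean(axis=0)) @ P  # 2D coordinates of every point

# Each row of P is the 2D image of one attribute axis; long axes mark
# the attributes that contribute most to the separation.
axis_lengths = np.linalg.norm(P, axis=1)
top_attrs = np.argsort(axis_lengths)[::-1][:3]
print("most influential attributes:", sorted(top_attrs.tolist()))
```

In this sketch the rows of P play the role of the axes drawn in the scatter plot: attributes with long projected axes are the ones that most strongly drive the separation, which is how the view can double as a feature selection aid.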

Targeted projection pursuit has found applications in DNA microarray data analysis,[1] protein sequence analysis,[2] graph layout,[3] and digital signal processing.[4] It is available as a package for the WEKA machine learning toolkit.

References

  1. Faith, Joseph; Robert Mintram; Maia Angelova (2006). "Targeted Projection Pursuit for Visualising Gene Expression Data Classifications" (PDF). Bioinformatics. 22 (21): 2667–267. doi:10.1093/bioinformatics/btl463. PMID 16954139.
  2. Haddow, Chris; Marcus Durrant; Justin Perry; Joe Faith (2011). "Predicting Functional Residues of Protein Sequence Alignments as a Feature Selection Task". International Journal of Data Mining and Bioinformatics. 5 (6): 691–705. doi:10.1504/IJDMB.2011.045417. PMID 22295751.
  3. Gibson, Helen; Joe Faith (2011). "Node-Attribute Graph Layout for Small-World Networks". Proceedings of 15th International Conference on Information Visualisation.
  4. Sujan, Rajbhandari; Joe Faith (2010). "The Use of Linear Projections in the Visual Analysis of Signals in an Indoor Optical Wireless Link". 2010 7th International Symposium on Communication Systems, Networks & Digital Signal Processing (CSNDSP 2010). IEEE. pp. 576–581. doi:10.1109/CSNDSP16145.2010.5580367. ISBN 978-1-4244-8858-2.

Further reading