Discovery Net


Discovery Net is one of the earliest examples of a scientific workflow system, allowing users to coordinate the execution of remote services based on Web service and Grid Services standards such as OGSA (the Open Grid Services Architecture). The system was designed and implemented at Imperial College London as part of the Discovery Net pilot project funded by the UK e-Science Programme (E-Science § UK programme). Many of the concepts pioneered by Discovery Net were later incorporated into a variety of other scientific workflow systems.


History

The Discovery Net system was developed as part of the Discovery Net pilot project (2001–2005), a £2m research project funded by the EPSRC under the UK e-Science Programme (E-Science § UK programme). The research was conducted at Imperial College London as a collaboration between the Departments of Computing, Physics, Biochemistry and Earth Science & Engineering. As a single-institution project it was unique among the pilot projects funded by the EPSRC; the other ten were all multi-institutional.

The aims of the Discovery Net project were to investigate and address the key issues in developing an e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices. It originally considered requirements from applications in life science, geo-hazard monitoring, environmental modelling and renewable energy. The project successfully delivered on all its objectives, including the development of the Discovery Net workflow platform and workflow system. Over the years the system evolved to address applications in many other areas, including bioinformatics, cheminformatics, health informatics, text mining, and financial and business applications.

Scientific workflow system

The Discovery Net system developed within the project is one of the earliest examples of a scientific workflow system. It is an e-Science platform based on a workflow model supporting the integration of distributed data sources and analytical tools, thus enabling end-users to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid.

Architecture and workflow server

The system is based on a multi-tier architecture, with a workflow server providing a number of supporting functions needed for workflow authoring and execution, such as integration and access to remote computational and data resources, collaboration tools, visualisers and publishing mechanisms. The architecture evolved over the years, with a focus on the internals of the workflow server (Ghanem et al. 2009), to support extensibility across multiple application domains as well as different execution environments.

Visual workflow authoring

Discovery Net workflows are represented and stored using DPML (Discovery Process Markup Language), an XML-based representation language for workflow graphs supporting both a data flow model of computation (for analytical workflows) and a control flow model (for orchestrating multiple disjoint workflows).
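For illustration, the sketch below (in Python, using the standard library's ElementTree module) constructs a two-node workflow graph and serialises it to a DPML-like XML document. All element and attribute names here are hypothetical; the actual DPML schema is not reproduced in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical DPML-like document: a data source node feeding an
# analysis node through one typed connection. Names are illustrative.
workflow = ET.Element("workflow", name="annotate_sequences")

src = ET.SubElement(workflow, "node", id="n1", component="FastaReader")
ET.SubElement(src, "parameter", name="path", value="genome.fasta")
ET.SubElement(src, "output", port="sequences", type="FASTA")

blast = ET.SubElement(workflow, "node", id="n2", component="SequenceSearch")
ET.SubElement(blast, "input", port="query", type="FASTA")
ET.SubElement(blast, "output", port="hits", type="Table")

# A directed edge: the tail is n1's output port, the head is n2's input port.
ET.SubElement(workflow, "connection",
              fromNode="n1", fromPort="sequences",
              toNode="n2", toPort="query")

print(ET.tostring(workflow, encoding="unicode"))
```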

As with most modern workflow systems, the system supported a drag-and-drop visual interface enabling users to easily construct their applications by connecting nodes together.

Within DPML, each node in a workflow graph represents an executable component (e.g. a computational tool or a wrapper that can extract data from a particular data source). Each component has a number of parameters that can be set by the user and also a number of input and output ports for receiving and transmitting data.

Each directed edge in the graph represents a connection from an output port, namely the tail of the edge, to an input port, namely the head of the edge. A port is connected if there are one or more connections from or to that port. In addition, each node in the graph provides metadata describing the input and output ports of the component, including the type of data that can be passed to the component and the parameters of the service that a user might want to change. Such information is used for the verification of workflows and to ensure meaningful chaining of components. A connection between an input and an output port is valid only if the types are compatible, which is strictly enforced.
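The sketch below illustrates this kind of port-level verification, assuming a simplified model in which port types are plain strings; it is an illustration of the idea rather than Discovery Net's actual metadata model.

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str
    datatype: str                 # e.g. "FASTA", "Table"

@dataclass
class Node:
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

def connect(src, out_port, dst, in_port):
    """Create an edge only if the two port types are compatible."""
    tail, head = src.outputs[out_port], dst.inputs[in_port]
    if tail.datatype != head.datatype:   # strict type enforcement
        raise TypeError(f"{src.name}.{out_port} ({tail.datatype}) cannot "
                        f"feed {dst.name}.{in_port} ({head.datatype})")
    return (src.name, out_port, dst.name, in_port)

reader = Node("FastaReader", outputs={"sequences": Port("sequences", "FASTA")})
search = Node("SequenceSearch", inputs={"query": Port("query", "FASTA")})
edge = connect(reader, "sequences", search, "query")   # passes verification
```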

Separation between data and control flows

A key contribution of the system is its clean separation between the data flow and control flow models of computation within a scientific workflow. This is achieved through the concept of embedding, which enables complete data flow fragments to be embedded within block-structured fragments of control flow constructs. This results in simpler workflow graphs compared to other scientific workflow systems, e.g. the Taverna workbench and the Kepler scientific workflow system, and also provides the opportunity to apply formal methods to the analysis of workflow properties.
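The following sketch conveys the embedding idea under strong simplifications: a block-structured control-flow construct (a for-each loop) whose body is a complete, self-contained data-flow fragment. The function names and the toy pipeline are purely illustrative.

```python
def dataflow_fragment(sequence):
    """A self-contained data-flow pipeline: parse -> filter -> score."""
    parsed = sequence.strip().upper()
    if "N" in parsed:                 # filter out ambiguous reads
        return None
    return parsed.count("G") + parsed.count("C")   # toy GC score

def for_each(items, fragment):
    """A control-flow construct embedding a data-flow fragment:
    the fragment runs once per item, independently of the loop logic."""
    return [fragment(item) for item in items]

scores = for_each(["acgt", "ggcc", "anct"], dataflow_fragment)
print(scores)   # [2, 4, None]
```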

Data management and multiple data models

A key feature of the design of the system has been its support for data management within the workflow engine itself. This is an important feature, since scientific experiments typically generate and use large amounts of heterogeneous and distributed data sets. The system was thus designed to support persistence and caching of intermediate data products, and to support scalable workflow execution over potentially large data sets using remote compute resources.
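A minimal sketch of such intermediate-product caching is shown below, assuming that a data product can be keyed by its component name, parameter settings and input data; it illustrates the general idea and is not Discovery Net's persistence layer.

```python
import hashlib, json, pathlib, pickle

CACHE_DIR = pathlib.Path("workflow_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_run(component_name, params, input_data, run_fn):
    """Return a persisted intermediate product if one exists,
    otherwise compute it and persist it for later reuse."""
    key_src = json.dumps([component_name, params, repr(input_data)],
                         sort_keys=True)
    key = hashlib.sha256(key_src.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():                        # reuse the cached product
        return pickle.loads(path.read_bytes())
    result = run_fn(input_data, **params)    # compute, then persist
    path.write_bytes(pickle.dumps(result))
    return result

# Re-running with identical parameters hits the cache instead of recomputing.
total = cached_run("SumRows", {"column": 0}, [[1, 2], [3, 4]],
                   lambda rows, column: sum(r[column] for r in rows))
```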

A second important aspect of the Discovery Net system is its typed workflow language, which is extensible to support arbitrary data types defined by the user. Data typing simplifies scientific workflow development, enables the optimization of workflows and enhances error checking for workflow validation. The system included a number of default data types for the purpose of supporting data mining in a variety of scientific applications. These included a relational model for tabular data, a bioinformatics data model (FASTA) for representing gene sequences and a stand-off markup model for text mining based on the Tipster architecture.

Each model has an associated set of data import and export components, as well as specific visualizers, which integrate with the generic import, export and visualization tools already present in the system. As an example, chemical compounds represented in the widely used SMILES (Simplified Molecular Input Line Entry Specification) format can be imported into data tables, where they can be rendered either as a three-dimensional representation or as a structural formula. The relational model also serves as the base data model for data integration, and is used for the majority of generic data cleaning and transformation tasks.
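As a toy illustration of a typed tabular model, the sketch below defines a minimal relational table whose columns carry type tags, with one column typed to hold SMILES strings; the class and type names are hypothetical, and rendering of structures is out of scope.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    datatype: str   # e.g. "str", "int", "smiles"

class Table:
    """A toy relational model: typed columns plus a list of rows."""
    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def append(self, row):
        assert len(row) == len(self.columns)   # enforce the schema width
        self.rows.append(row)

# Importing chemical compounds into a table with a SMILES-typed column.
compounds = Table([Column("name", "str"), Column("structure", "smiles")])
compounds.append(["caffeine", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"])
compounds.append(["aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"])
```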

Applications

The system won the "Most Innovative Data Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a Malaria genome case study. Many of the features of the system (architectural features, the visual front-end, simplified access to remote Web and Grid Services, and the inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems, especially bioinformatics workflow management systems.

Beyond the original Discovery Net project, the system has been used in a large number of scientific applications, for example the BAIR (Biological Atlas of Insulin Resistance) project funded by the Wellcome Trust, and in a large number of projects funded by both the EPSRC and BBSRC in the UK. The Discovery Net technology and system have also evolved into commercial products through the Imperial College spinout company InforSense Ltd, which further extended and applied the system in a wide variety of commercial applications, as well as through further research projects, including SIMDAT, TOPCOMBI, BRIDGE and ARGUGRID.[citation needed][1]

See also

Related Research Articles

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

<span class="mw-page-title-main">Workflow</span> Pattern of activity often with a result

A workflow is a generic term for orchestrated and repeatable patterns of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, the work of a person or group, the work of an organization of staff, or one or more simple or complex mechanisms.

E-Science or eScience is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing; the term sometimes includes technologies that enable distributed collaboration, such as the Access Grid. The term was created by John Taylor, the Director General of the United Kingdom's Office of Science and Technology, in 1999 and was used to describe a large funding initiative starting in November 2000. E-science has been more broadly interpreted since then, as "the application of computer technology to the undertaking of modern scientific investigation, including the preparation, experimentation, data collection, results dissemination, and long-term storage and accessibility of all materials generated through the scientific process. These may include data modeling and analysis, electronic/digitized laboratory notebooks, raw and fitted data sets, manuscript production and draft versions, pre-prints, and print and/or electronic publications." In 2014, the IEEE eScience Conference Series condensed the definition to "eScience promotes innovation in collaborative, computationally- or data-intensive research across all disciplines, throughout the research lifecycle" as one of the working definitions used by the organizers. E-science encompasses "what is often referred to as big data [which] has revolutionized science... [such as] the Large Hadron Collider (LHC) at CERN... [that] generates around 780 terabytes per year... highly data intensive modern fields of science...that generate large amounts of E-science data include: computational biology, bioinformatics, genomics" and the human digital footprint for the social sciences.

A workflow pattern is a specialized form of design pattern as defined in the area of software engineering or business process engineering. Workflow patterns refer specifically to recurrent problems and proven solutions related to the development of workflow applications in particular, and more broadly, process-oriented applications.

<span class="mw-page-title-main">Advanced Resource Connector</span> Grid computing software

Advanced Resource Connector (ARC) is a grid computing middleware introduced by NorduGrid. It provides a common interface for submission of computational tasks to different distributed computing systems and thus can enable grid infrastructures of varying size and complexity. The set of services and utilities providing the interface is known as ARC Computing Element (ARC-CE). ARC-CE functionality includes data staging and caching, developed in order to support data-intensive distributed computing. ARC is an open source software distributed under the Apache License 2.0.

<span class="mw-page-title-main">Carole Goble</span> British computer scientist

Carole Anne Goble is a British academic who is Professor of Computer Science at the University of Manchester. She is principal investigator (PI) of the myGrid, BioCatalogue and myExperiment projects and co-leads the Information Management Group (IMG) with Norman Paton.

The myGrid consortium produces and uses a suite of tools designed to “help e-Scientists get on with science and get on with scientists”. The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry.

<span class="mw-page-title-main">Apache Taverna</span>

Apache Taverna was an open source software tool for designing and executing workflows, initially created by the myGrid project under the name Taverna Workbench, then a project under the Apache incubator. Taverna allowed users to integrate many different software components, including WSDL SOAP or REST Web services, such as those provided by the National Center for Biotechnology Information, the European Bioinformatics Institute, the DNA Databank of Japan (DDBJ), SoapLab, BioMOBY and EMBOSS. The set of available services was not fixed, and users could import new service descriptions into the Taverna Workbench.

<span class="mw-page-title-main">Renaissance Computing Institute</span>

Renaissance Computing Institute (RENCI) was launched in 2004 as a collaboration involving the State of North Carolina, University of North Carolina at Chapel Hill (UNC-CH), Duke University, and North Carolina State University. RENCI is organizationally structured as a research institute within UNC-CH, and its main campus is located in Chapel Hill, NC, a few miles from the UNC-CH campus. RENCI has engagement centers at UNC-CH, Duke University (Durham), and North Carolina State University (Raleigh).

<span class="mw-page-title-main">Galaxy (computational biology)</span>

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

A sensor grid integrates wireless sensor networks with grid computing concepts to enable real-time data collection and the sharing of computational and storage resources for sensor data processing and management. It is an enabling technology for building large-scale infrastructures, integrating heterogeneous sensor, data and computational resources deployed over a wide area, to undertake complicated surveillance tasks such as environmental monitoring.

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components. In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command-line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.

<span class="mw-page-title-main">UGENE</span>

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application.

<span class="mw-page-title-main">David De Roure</span> English computer scientist

David Charles De Roure is an English computer scientist who is a professor of e-Research at the University of Oxford, where he is responsible for Digital Humanities in The Oxford Research Centre in the Humanities (TORCH), and is a Turing Fellow at The Alan Turing Institute. He is a supernumerary Fellow of Wolfson College, Oxford, and Oxford Martin School Senior Alumni Fellow.

A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, that relate to bioinformatics.

Norman William Paton is a Professor in the Department of Computer Science at the University of Manchester in the UK where he co-leads the Information Management Group (IMG) with Carole Goble.

<span class="mw-page-title-main">OnlineHPC</span>

OnlineHPC was a free public web service that provided an online workflow editor and tools for working with high performance computers. It allowed users to design and execute workflows using the online workflow designer and to work with high performance computing resources – clusters and clouds. Access to high performance resources was available both directly from the service user interface and from workflow components. The workflow engine of the OnlineHPC service was Taverna, which is traditionally used for scientific workflow execution in domains such as bioinformatics, cheminformatics, medicine, astronomy, social science, music, and digital preservation.

Data mining, the process of discovering patterns in large data sets, has been used in many applications.

Pegasus is an open-source workflow management system. It provides the necessary abstractions for scientists to create scientific workflows and allows for transparent execution of these workflows on a range of computing platforms including high performance computing clusters, clouds, and national cyberinfrastructure. In Pegasus, workflows are described abstractly as directed acyclic graphs (DAGs) using a provided API for Jupyter Notebooks, Python, R, or Java. During execution, Pegasus translates the constructed abstract workflow into an executable workflow which is executed and managed by HTCondor.

References

  1. "New partnership launched to improve IT analytics | Imperial News | Imperial College London". Imperial News. Retrieved 2019-04-25.
  2. Ghanem, M; Guo, Y; Rowe, A; Wendel, P (2002). "Grid-based knowledge discovery services for high throughput informatics". Proceedings 11th IEEE International Symposium on High Performance Distributed Computing. p. 416. doi:10.1109/HPDC.2002.1029946. ISBN 0-7695-1686-6. S2CID 28782519.
  3. Ćurčin, V; Ghanem, M; Guo, Y; Köhler, M; Rowe, A; Syed, J; Wendel, P (2002). "Discovery net". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 658–63. doi:10.1145/775047.775145. ISBN 1-58113-567-X. S2CID 14652611.
  4. Jameel Syed, Moustafa Ghanem, Yike Guo. Discovery processes: representation and re-use. Proceedings of the First UK e-Science All-hands Conference, Sheffield, UK. September, 2002.
  5. Nikolaos Giannadakis, Moustafa Ghanem, Yike Guo. Information integration for e-Science. Proceedings of the First UK e-Science All-hands Conference, Sheffield, UK. September, 2002.
  6. Ghanem, Moustafa M; Guo, Yike; Lodhi, Huma; Zhang, Yong (2002). "Automatic scientific text classification using local patterns". ACM SIGKDD Explorations Newsletter. 4 (2): 95. doi:10.1145/772862.772876. S2CID 6328759.
  7. Rowe, A; Kalaitzopoulos, D; Osmond, M; Ghanem, M; Guo, Y (2003). "The discovery net system for high throughput bioinformatics". Bioinformatics. 19 Suppl 1: i225–31. doi:10.1093/bioinformatics/btg1031. PMID 12855463.
  8. Alsairafi, Salman; Emmanouil, Filippia-Sofia; Ghanem, Moustafa; Giannadakis, Nikolaos; Guo, Yike; Kalaitzopoulos, Dimitrios; Osmond, Michelle; Rowe, Anthony; Syed, Jameel; Wendel, Patrick (2003). "The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery". The International Journal of High Performance Computing Applications. 17 (3): 297. doi:10.1177/1094342003173003. S2CID 15707637.
  9. Giannadakis, Nikolaos; Rowe, Anthony; Ghanem, Moustafa; Guo, Yi-ke (2003). "InfoGrid: Providing information integration for knowledge discovery". Information Sciences. 155 (3–4): 199–226. doi:10.1016/S0020-0255(03)00170-1.
  10. Moustafa Ghanem, Yike Guo, Anthony Rowe. Integrated data and text mining in support of bioinformatics. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004, Nottingham, UK. September, 2004.
  11. Vasa Curcin, Moustafa Ghanem, Yike Guo. SARS analysis on the Grid. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004, Nottingham, UK. September, 2004.
  12. Peter Au, Vasa Curcin, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Mohammad Jafri, Michelle Osmond, Anthony Rowe, Jameel Syed, Patrick Wendel, Yong Zhang. Why Grid-based data mining matters? Fighting natural disasters on the Grid: From SARS to land slides. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004. September, 2004.
  13. Curcin, V; Ghanem, M; Yike Guo; Rowe, A; He, W; Hao Pei; Lu Qiang; Yuanyuan Li (2004). "IT service infrastructure for integrative systems biology". IEEE International Conference on Services Computing, 2004 (SCC 2004). Proceedings. pp. 123–31. doi:10.1109/SCC.2004.1357998. ISBN 0-7695-2225-4. S2CID 28687432.
  14. Moustafa Ghanem, Vasa Curcin, Yike Guo, Neil Davis, Rob Gaizauskas, Yikun Guo, Henk Harkema, Ian Roberts, Jonathan Ratcliffe. GoTag: A case study in using a shared UK e-Science infrastructure. 4th UK e-Science All Hands Meeting 2005. September, 2005.
  15. Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo, Moustafa Ghanem, Tom Barnwell, Yike Guo, Jonathan Ratcliffe. Three Approaches to GO-Tagging Biomedical Abstracts. CEUR Workshop Proceedings. April, 2006.
  16. Ghanem, Moustafa; Azam, Nabeel; Boniface, Mike; Ferris, Justin (2006). "Grid-Enabled Workflows for Industrial Product Design" (PDF). 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06). p. 96. doi:10.1109/E-SCIENCE.2006.261180. ISBN 0-7695-2734-5.
  17. Moustafa Ghanem, Nabeel Azam, Mike Boniface. Workflow Interoperability in Grid-based Systems. Cracow Grid Workshop 2006. October, 2006.
  18. Vasa Curcin, Moustafa Ghanem, Yike Guo, Kostas Stathis, Francesca Toni. Building next generation Service-Oriented Architectures using argumentation agents. 3rd International Conference on Grid Services Engineering and Management (GSEM 2006). Springer Verlag. September, 2006.
  19. Patrick Wendel, Arnold Fung, Moustafa Ghanem, Yike Guo. Designing a Java-based Grid scheduler using commodity services. Proceedings of the UK e-Science All Hands Meeting 2006. Nottingham, UK. September, 2006.
  20. Qiang Lu, Xinzhong Li, Moustafa Ghanem, Yike Guo, Haiyan Pan. Integrating R into Discovery Net. Proceedings of the UK e-Science All Hands Meeting 2006. September, 2006.
  21. 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06). doi:10.1109/E-SCIENCE.2006.17. S2CID 18097525.
  22. Richards, M; Ghanem, M; Osmond, M; Guo, Y; Hassard, J (2006). "Grid-based analysis of air pollution data". Ecological Modelling. 194 (1–3): 274–286. doi:10.1016/j.ecolmodel.2005.10.042.
  23. Syed, Jameel; Ghanem, Moustafa; Guo, Yike (2007). "Supporting scientific discovery processes in Discovery Net". Concurrency and Computation: Practice and Experience. 19 (2): 167. doi:10.1002/cpe.1049. S2CID 16212949.
  24. Vasa Curcin, Moustafa Ghanem, Yike Guo, John Darlington. Mining adverse drug reactions with e-science workflows. Proceedings of the 4th Cairo International Biomedical Engineering Conference (CIBEC 2008). December, 2008.
  25. Curcin, V; Ghanem, M (2008). "Scientific workflow systems - can one size fit all?". 2008 Cairo International Biomedical Engineering Conference. pp. 1–9. doi:10.1109/CIBEC.2008.4786077. ISBN 978-1-4244-2694-2. S2CID 1885579.
  26. Ghanem, Moustafa; Curcin, Vasa; Wendel, Patrick; Guo, Yike (2009). "Building and Using Analytical Workflows in Discovery Net". Data Mining Techniques in Grid Computing Environments. pp. 119–39. doi:10.1002/9780470699904.ch8. ISBN 978-0-470-69990-4.
  27. Curcin, Vasa; Ghanem, Moustafa M; Guo, Yike (2009). "Analysing scientific workflows with Computational Tree Logic". Cluster Computing. 12 (4): 399. doi:10.1007/s10586-009-0099-6. S2CID 12600641.
  28. Antje Wolf, Martin Hofmann-Apitius, Moustafa Ghanem, Nabeel Azam, Dimitrios Kalaitzopoulos, Kunqian Yu, Vinod Kasam. DockFlow - A prototypic PharmaGrid for virtual screening integrating four different docking tools. In Proceedings of HealthGrid 2009, Volume 147, pp. 3–12, Studies in Health Technology and Informatics. May, 2009.