Discovery Net is one of the earliest examples of a scientific workflow system allowing users to coordinate the execution of remote services based on Web service and Grid Services (OGSA, the Open Grid Services Architecture) standards. The system was designed and implemented at Imperial College London as part of the Discovery Net pilot project funded by the UK e-Science Programme. Many of the concepts pioneered by Discovery Net were later incorporated into a variety of other scientific workflow systems.

History: The Discovery Net e-Science Pilot Project

The Discovery Net system was developed as part of the Discovery Net pilot project (2001–2005), a £2m research project funded by the EPSRC under the UK e-Science Programme. The research was conducted at Imperial College London as a collaboration between the Departments of Computing, Physics, Biochemistry and Earth Science & Engineering. As a single-institution project, it was unique among the pilot projects funded by the EPSRC; the other 10 were all multi-institutional. The aims of the Discovery Net project were to investigate and address the key issues in developing an e-Science platform for scientific discovery from the data generated by a wide variety of high-throughput devices. It originally considered requirements from applications in life science, geo-hazard monitoring, environmental modelling and renewable energy. The project successfully delivered on all its objectives, including the development of the Discovery Net workflow platform and workflow system. Over the years the system evolved to address applications in many other areas, including bioinformatics, cheminformatics, health informatics, text mining, and financial and business applications.

Scientific workflow system

The Discovery Net system developed within the project is one of the earliest examples of scientific workflow systems. It is an e-Science platform based on a workflow model that supports the integration of distributed data sources and analytical tools, enabling end users to derive new knowledge from devices, sensors, databases, analysis components and computational resources that reside across the Internet or grid.

Architecture and workflow server

The system is based on a multi-tier architecture, with a workflow server providing a number of supporting functions needed for workflow authoring and execution, such as integration of and access to remote computational and data resources, collaboration tools, visualisers and publishing mechanisms. The architecture itself evolved over the years, with a focus on the internals of the workflow server (Ghanem et al. 2009), to support extensibility across multiple application domains as well as different execution environments.

Visual workflow authoring

Discovery Net workflows are represented and stored using DPML (Discovery Process Markup Language), an XML-based representation language for workflow graphs supporting both a data flow model of computation (for analytical workflows) and a control flow model (for orchestrating multiple disjoint workflows). As with most modern workflow systems, the system supported a drag-and-drop visual interface enabling users to construct their applications by connecting nodes together.

Within DPML, each node in a workflow graph represents an executable component (e.g. a computational tool or a wrapper that can extract data from a particular data source). Each component has a number of parameters that can be set by the user and a number of input and output ports for receiving and transmitting data. Each directed edge in the graph represents a connection from an output port (the tail of the edge) to an input port (the head of the edge). A port is connected if there are one or more connections from or to that port. In addition, each node in the graph provides metadata describing the input and output ports of the component, including the type of data that can be passed to the component and the parameters of the service that a user might want to change. This information is used to verify workflows and to ensure meaningful chaining of components. A connection between an output port and an input port is valid only if their types are compatible, and this rule is strictly enforced.
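
The graph model described above can be illustrated with a minimal Python sketch. It is not DPML and not the Discovery Net API: the class names `Node` and `Workflow`, the port and type names, and the rule that "compatible" means an exact match of type names are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An executable component: user-set parameters plus typed ports."""
    name: str
    inputs: dict = field(default_factory=dict)   # input port name -> type name
    outputs: dict = field(default_factory=dict)  # output port name -> type name
    params: dict = field(default_factory=dict)   # user-settable parameters


class Workflow:
    """A directed graph whose edges join output ports to input ports."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # each edge: (tail node, output port, head node, input port)

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, tail, out_port, head, in_port):
        out_type = self.nodes[tail].outputs[out_port]
        in_type = self.nodes[head].inputs[in_port]
        # Assumption: "compatible" is simplified to exact type-name equality.
        if out_type != in_type:
            raise TypeError(
                f"{tail}.{out_port} ({out_type}) cannot feed "
                f"{head}.{in_port} ({in_type})"
            )
        self.edges.append((tail, out_port, head, in_port))

    def is_connected(self, node, port):
        """A port is connected if at least one edge starts or ends at it."""
        return any(
            (t, o) == (node, port) or (h, i) == (node, port)
            for t, o, h, i in self.edges
        )


# Hypothetical components, used only to exercise the sketch.
wf = Workflow()
wf.add_node(Node("load_table", outputs={"out": "table"}, params={"path": "data.csv"}))
wf.add_node(Node("filter_rows", inputs={"in": "table"}, outputs={"out": "table"}))
wf.add_node(Node("align", inputs={"in": "sequence"}))

wf.connect("load_table", "out", "filter_rows", "in")  # accepted: table -> table
```

Attempting `wf.connect("load_table", "out", "align", "in")` raises `TypeError` in this sketch, because a `table` output cannot feed a `sequence` input. Checking at connection time, rather than at execution time, mirrors the idea that port metadata is used to verify a workflow before it runs.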

Separation between data and control flows

A key contribution of the system is its clean separation between the data flow and control flow models of computation within scientific workflows. This is achieved through the concept of embedding, which enables complete data flow fragments to be embedded within block-structured fragments of control flow constructs. This both results in simpler workflow graphs than in other scientific workflow systems, e.g. the Taverna workbench and the Kepler scientific workflow system, and provides the opportunity to apply formal methods for the analysis of their properties.
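
One way to picture embedding is the following Python sketch, in which data flow fragments are self-contained units and control flow constructs are blocks that only decide which fragment runs next. The classes `DataFlow`, `Sequence` and `IfElse` are hypothetical names chosen for this illustration, and reducing a data flow fragment to a linear chain of functions is a deliberate simplification; none of this is taken from the Discovery Net implementation.

```python
class DataFlow:
    """A data flow fragment, simplified here to a linear chain of functions."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:
            data = step(data)
        return data


class Sequence:
    """Control flow block: run the embedded blocks one after another."""

    def __init__(self, *blocks):
        self.blocks = blocks

    def run(self, data):
        for block in self.blocks:
            data = block.run(data)
        return data


class IfElse:
    """Control flow block: choose one of two embedded blocks."""

    def __init__(self, condition, then_block, else_block):
        self.condition = condition
        self.then_block = then_block
        self.else_block = else_block

    def run(self, data):
        branch = self.then_block if self.condition(data) else self.else_block
        return branch.run(data)


# Data flow fragments contain no branching of their own.
clean = DataFlow(lambda rows: [r for r in rows if r is not None])
total = DataFlow(sum)
count = DataFlow(len)

# The control flow layer embeds whole fragments inside block-structured constructs.
program = Sequence(clean, IfElse(lambda rows: len(rows) > 2, total, count))

large = program.run([1, None, 2, 3])  # cleaned to [1, 2, 3], then summed
small = program.run([1, None])        # cleaned to [1], then counted
```

Because branching lives only in the outer blocks, each data flow fragment stays a plain graph, which is the property that keeps the graphs simple and makes the block structure amenable to formal analysis.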

Data management and multiple data models

A key feature of the design of the system has been its support for data management within the workflow engine itself. This is important because scientific experiments typically generate and use large amounts of heterogeneous and distributed data. The system was thus designed to support persistence and caching of intermediate data products, and to support scalable workflow execution over potentially large data sets using remote compute resources.

A second important aspect of the Discovery Net system is that it is based on a typed workflow language that can be extended to support arbitrary data types defined by the user. Data typing simplifies scientific workflow development, enables optimization of workflows and improves error checking during workflow validation. The system included a number of default data types to support data mining in a variety of scientific applications. These included a relational model for tabular data, a bioinformatics data model (FASTA) for representing gene sequences, and a stand-off markup model for text mining based on the Tipster architecture.

Each model has an associated set of data import and export components, as well as specific visualizers, which integrate with the generic import, export and visualization tools already present in the system. As an example, chemical compounds represented in the widely used SMILES (Simplified molecular input line entry specification) format can be imported into data tables, where they can be rendered either as a three-dimensional representation or as a structural formula. The relational model also serves as the base data model for data integration, and is used for the majority of generic data cleaning and transformation tasks.
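
The idea of caching intermediate data products can be sketched as follows. This is an illustration under stated assumptions, not the Discovery Net mechanism: the class `ResultCache`, the choice of a SHA-256 key over the component name, parameters and input data, and the restriction to JSON-serialisable inputs are all hypothetical.

```python
import hashlib
import json


class ResultCache:
    """In-memory cache of intermediate results, keyed by component and inputs."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(component, params, data):
        # Assumption: parameters and data are JSON-serialisable.
        payload = json.dumps([component, params, data], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, component, func, params, data):
        """Return the cached result if present, otherwise compute and store it."""
        key = self._key(component, params, data)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = func(data, **params)
        self._store[key] = result
        return result


calls = []


def scale(rows, factor):
    """Hypothetical component: multiply every value by a factor."""
    calls.append(factor)
    return [r * factor for r in rows]


cache = ResultCache()
first = cache.run("scale", scale, {"factor": 2}, [1, 2, 3])   # computed
second = cache.run("scale", scale, {"factor": 2}, [1, 2, 3])  # served from cache
third = cache.run("scale", scale, {"factor": 3}, [1, 2, 3])   # new parameter, recomputed
```

Keying on the parameters as well as the input means that changing a single setting in one component invalidates only the results downstream of it, which is what makes re-running a partly edited workflow cheap. A persistent store would replace the dictionary in a real system.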

Applications

The system won the "Most Innovative Data Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a Malaria genome case study. Many of the features of the system (architecture features, visual front-end, simplified access to remote Web and Grid Services and inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems, especially bioinformatics workflow management systems.

Beyond the original Discovery Net project, the system has been used in a large number of scientific applications, for example the BAIR (Biological Atlas of Insulin Resistance) project funded by the Wellcome Trust, and in a large number of projects funded by both the EPSRC and the BBSRC in the UK. The Discovery Net technology and system have also evolved into commercial products through the Imperial College spinout company InforSense Ltd, which further extended and applied the system in a wide variety of commercial applications as well as through further research projects, including SIMDAT, TOPCOMBI, BRIDGE and ARGUGRID.

References

# Jameel Syed, Moustafa Ghanem, Yike Guo. Discovery processes: representation and re-use. Proceedings of the First UK e-Science All-hands Conference, Sheffield, UK. September 2002.
# Nikolaos Giannadakis, Moustafa Ghanem, Yike Guo. Information integration for e-Science. Proceedings of the First UK e-Science All-hands Conference, Sheffield, UK. September 2002.
# Moustafa Ghanem, Yike Guo, Anthony Rowe. Integrated data and text mining in support of bioinformatics. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004, Nottingham, UK. September 2004.
# Vasa Curcin, Moustafa Ghanem, Yike Guo. SARS analysis on the Grid. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004, Nottingham, UK. September 2004.
# Peter Au, Vasa Curcin, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Mohammad Jafri, Michelle Osmond, Anthony Rowe, Jameel Syed, Patrick Wendel, Yong Zhang. Why Grid-based data mining matters? Fighting natural disasters on the Grid: From SARS to land slides. Proceedings of the 3rd UK e-Science All-hands Conference AHM 2004. September 2004.
# Moustafa Ghanem, Vasa Curcin, Yike Guo, Neil Davis, Rob Gaizauskas, Yikun Guo, Henk Harkema, Ian Roberts, Jonathan Ratcliffe. GoTag: A case study in using a shared UK e-Science infrastructure. 4th UK e-Science All Hands Meeting 2005. September 2005.
# Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo, Moustafa Ghanem, Tom Barnwell, Yike Guo, Jonathan Ratcliffe. Three Approaches to GO-Tagging Biomedical Abstracts. CEUR Workshop Proceedings. April 2006.
# Moustafa Ghanem, Nabeel Azam, Mike Boniface. Workflow Interoperability in Grid-based Systems. Cracow Grid Workshop 2006. October 2006.
# Vasa Curcin, Moustafa Ghanem, Yike Guo, Kostas Stathis, Francesca Toni. Building next generation Service-Oriented Architectures using argumentation agents. 3rd International Conference on Grid Services Engineering and Management (GSEM 2006). Springer Verlag. September 2006.
# Patrick Wendel, Arnold Fung, Moustafa Ghanem, Yike Guo. Designing a Java-based Grid scheduler using commodity services. Proceedings of the UK e-Science All Hands Meeting 2006, Nottingham, UK. September 2006.
# Qiang Lu, Xinzhong Li, Moustafa Ghanem, Yike Guo, Haiyan Pan. Integrating R into Discovery Net. Proceedings of the UK e-Science All Hands Meeting 2006. September 2006.
# Vasa Curcin, Moustafa Ghanem, Yike Guo, John Darlington. Mining adverse drug reactions with e-science workflows. Proceedings of the 4th Cairo International Biomedical Engineering Conference (CIBEC 2008). December 2008.
# Vasa Curcin, Moustafa M Ghanem, Yike Guo. Analysing scientific workflows with Computational Tree Logic. Cluster Computing, volume 12, issue 4, p. 399. 2009. doi:10.1007/s10586-009-0099-6.
# Antje Wolf, Martin Hofmann-Apitius, Moustafa Ghanem, Nabeel Azam, Dimitrios Kalaitzopoulos, Kunqian Yu, Vinod Kasam. DockFlow - A prototypic PharmaGrid for virtual screening integrating four different docking tools. Proceedings of HealthGrid 2009, Studies in Health Technology and Informatics, volume 147, pp. 3–12. May 2009.

External links

* List of e-Science Pilot Projects funded by the EPSRC "https://web.archive.org/web/20100723012926/http://www.epsrc.ac.uk/about/progs/rii/escience/Pages/fundedprojects.aspx"
* SIMDAT "http://www.simdat.org/"
* The BRIDGE Project "http://www.bridge-grid.eu/"
* The ARGUGRID Project "http://www.argugrid.eu/"
* BAIR project "https://web.archive.org/web/20100430111119/http://www.bair.org.uk/"
* InforSense Ltd. "https://web.archive.org/web/20100328015758/http://www.inforsense.com/"