Scientific Workflow System

A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application.


Applications

Distributed scientists can collaborate on conducting large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision. More specialized scientific workflow systems provide a visual programming front end that lets users construct their applications as a visual graph by connecting nodes together, and tools have also been developed to build such applications in a platform-independent manner. Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline. A bioinformatics workflow management system is a specialized scientific workflow system focused on bioinformatics.
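The graph model described above can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular workflow system: the function `run_workflow`, the step names, and the edge list are all hypothetical, and the sketch assumes Python 3.9 or later for the standard-library `graphlib` module. Each edge `(producer, consumer)` feeds the output of one step into the input of the next.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, edges):
    """Run every step after the steps it depends on and return all outputs.

    steps: dict mapping a step name to a callable
    edges: list of (producer, consumer) pairs; the producer's output
           becomes a positional argument of the consumer
    """
    # Map each step to the list of steps whose output it consumes.
    upstream = {name: [] for name in steps}
    for producer, consumer in edges:
        upstream[consumer].append(producer)

    results = {}
    # static_order() yields each step only after all of its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        args = [results[p] for p in upstream[name]]
        results[name] = steps[name](*args)
    return results


# Hypothetical four-step workflow: one source, two analyses, one report.
steps = {
    "load": lambda: [3.0, 1.0, 2.0],
    "sort": lambda xs: sorted(xs),
    "mean": lambda xs: sum(xs) / len(xs),
    "report": lambda ordered, avg: f"min={ordered[0]} mean={avg}",
}
edges = [("load", "sort"), ("load", "mean"), ("sort", "report"), ("mean", "report")]

results = run_workflow(steps, edges)
print(results["report"])  # min=1.0 mean=2.0
```

The path load, sort, report is a pipeline in the sense used above: a sequence of edges in which each output feeds the next input.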


Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, using a scripting language such as Python with a command-line interface, or more recently using open-source web applications such as Jupyter Notebook.

There are many motives for differentiating scientific workflows from traditional business process workflows. These include:
* providing an easy-to-use environment for individual application scientists themselves to create their own workflows.
* providing interactive tools for the scientists enabling them to execute their workflows and view their results in real-time.
* simplifying the process of sharing and reusing workflows between the scientists.
* enabling scientists to track the provenance of the workflow execution results and the workflow creation steps.

By focusing on the scientists, the design of a scientific workflow system shifts away from workflow scheduling activities, which grid computing environments typically consider in order to optimize the execution of complex computations on predefined resources, and toward a domain-specific view of what data types, tools, and distributed resources should be made available to the scientists and how they can be made easily accessible with specific quality-of-service requirements.

Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure, and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, the experimental methods used, machine calibrations and parameters, and so on. It is the same in e-Science, except that provenance data are a record of the workflow activities invoked, services and databases accessed, data sets used, and so forth. Such information is useful for a scientist to interpret their workflow results and for other scientists to establish trust in the experimental result.
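As a rough illustration of provenance capture, the sketch below wraps each step of a small script-style workflow and records which activity ran, with which parameters, on which data. The helper names (`digest`, `run_with_provenance`), the record fields, and the two example steps are assumptions made for this example; real workflow systems define their own provenance models.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(value):
    """Return a short, stable fingerprint of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(log, name, func, data, **params):
    """Run one step and append a provenance record for it to `log`."""
    started = datetime.now(timezone.utc).isoformat()
    output = func(data, **params)
    log.append({
        "activity": name,                 # which workflow activity was invoked
        "parameters": params,             # settings it was run with
        "input_digest": digest(data),     # fingerprint of the data set used
        "output_digest": digest(output),  # fingerprint of the result
        "started": started,
    })
    return output


# Two hypothetical analysis steps.
def scale(values, factor):
    return [v * factor for v in values]


def threshold(values, minimum):
    return [v for v in values if v >= minimum]


provenance = []
raw = [1, 2, 3, 4]
scaled = run_with_provenance(provenance, "scale", scale, raw, factor=10)
kept = run_with_provenance(provenance, "threshold", threshold, scaled, minimum=25)

for record in provenance:
    print(record["activity"], record["parameters"])
```

Because the output digest of one record equals the input digest of the next, a reader of the log can trace a result back through the chain of activities that produced it.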


Sharing workflows

Social networking communities such as myExperiment have been developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on the Galaxy installation.


Analysis

A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.

Workflow analysis techniques can be used to verify certain properties of such workflows before executing them. An example of a theoretical formal analysis framework for verifying and profiling the control-flow and data-flow aspects of scientific workflows for the Discovery Net system is described in the paper "The design and implementation of a workflow analysis tool" by Curcin et al. The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of workflow languages, including the execution properties of nodes and arcs in the workflow graph, an understanding of functional equivalences between workflow patterns, and many other issues. Such analysis is difficult, and addressing these issues requires building on formal methods used in computer science research (e.g. Petri nets) to develop user-level tools for reasoning about the properties of both workflows and workflow systems. The lack of such tools in the past stopped automated workflow management solutions from maturing from nice-to-have academic toys to production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
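To make the idea of checking a workflow before running it concrete, the sketch below performs two very simple static checks on a workflow graph: that the data type produced at the source of each edge matches the type expected at its target, and that the graph contains no cycle. It is far simpler than the formal framework of Curcin et al. and is not based on it; the function `check_workflow`, the node table, and the type labels are hypothetical, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter


def check_workflow(nodes, edges):
    """Return a list of problems found without executing any step.

    nodes: dict mapping a node name to (input_type, output_type);
           input_type is None for source nodes
    edges: list of (source, target) pairs
    """
    problems = []
    upstream = {name: set() for name in nodes}
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            problems.append(f"edge {src}->{dst} refers to an unknown node")
            continue
        upstream[dst].add(src)
        out_type = nodes[src][1]
        in_type = nodes[dst][0]
        if out_type != in_type:
            problems.append(f"edge {src}->{dst}: {out_type} does not match {in_type}")
    try:
        # prepare() raises CycleError if the dependency graph is cyclic.
        TopologicalSorter(upstream).prepare()
    except CycleError:
        problems.append("workflow graph contains a cycle")
    return problems


# Hypothetical three-node workflow with illustrative type labels.
nodes = {
    "read": (None, "table"),
    "clean": ("table", "table"),
    "plot": ("table", "image"),
}
good_edges = [("read", "clean"), ("clean", "plot")]
bad_edges = good_edges + [("plot", "clean")]

print(check_workflow(nodes, good_edges))
print(check_workflow(nodes, bad_edges))
```

The second call reports both a type mismatch on the added edge and the cycle it introduces, without running any of the steps.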


Notable systems

Notable scientific workflow systems include:
* Anduril, bioinformatics and image analysis
* Apache Airavata, a general purpose workflow management system
* Apache Airflow, a general purpose workflow management system
* Apache Taverna, widely used in bioinformatics, astronomy, biodiversity
* BioBIKE, a cloud-based bioinformatics platform
* Bioclipse, a graphical workbench with a scripting environment that lets you perform complex actions as a kind of workflow
* Collective Knowledge, a Python-based general workflow and experiment crowdsourcing framework with JSON API and cross-platform package manager
* Common Workflow Language, a community-developed YAML-based workflow language, supported by multiple engine implementations
* Cuneiform, a functional workflow language
* Discovery Net, one of the earliest examples of a scientific workflow system
* Galaxy, initially targeted at genomics
* GenePattern, a powerful scientific workflow system that provides access to hundreds of genomic analysis tools
* Kepler, a scientific workflow management system
* KNIME, an open-source data analytics platform
* Pegasus, an open-source scientific workflow management system
* OnlineHPC, online scientific workflow designer and high performance computing toolkit
* Orange, open source data visualization and analysis
* Pipeline Pilot, graphical programming with many tools to address cheminformatics workflows
* Swift parallel scripting language, a scripting language with many of the capabilities of scientific workflow systems built in
* VisTrails, a scientific workflow system developed in Python

More than 280 computational data analysis workflow systems have been identified, although the distinction between "data analysis workflows" and "scientific workflows" is fluid, as not all analysis workflow systems are used for scientific purposes.


See also

* Bioinformatics workflow management systems
* e-Science
* Grid computing
* Workflow engine


External links

* Yu, Jia; Buyya, Rajkumar (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record. 34 (3): 44. doi:10.1145/1084805.1084814.
* "Scientific workflow systems - can one size fit all?", a paper in CIBEC'08 comparing the features of multiple scientific workflow systems.
* List of software tools related to scientific workflows on the DataONE website