
A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application. Scientific workflow systems are generally developed for use by scientists from different disciplines like astronomy, earth science, and bioinformatics. All such systems are based on an abstract representation of how a computation proceeds in the form of a directed graph, where each node represents a task to be executed and edges represent either data flow or execution dependencies between different tasks. Each system typically provides a visual front-end, allowing the user to build and modify complex applications with little or no programming expertise.
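
The directed-graph representation can be illustrated with a minimal Python sketch. It is not taken from any particular workflow system: the task names, the `run_workflow` helper, and the sample data are hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) is assumed for ordering the nodes.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run every task after all of its upstream tasks; return results by name."""
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*upstream)
    return results


# Hypothetical three-node workflow: load -> clean -> summarize.
tasks = {
    "load": lambda: [3.0, 4.0, None, 5.0],
    "clean": lambda rows: [r for r in rows if r is not None],
    "summarize": lambda rows: sum(rows) / len(rows),
}
# Each key is a node; its list holds the nodes whose output it consumes.
dependencies = {"load": [], "clean": ["load"], "summarize": ["clean"]}

results = run_workflow(tasks, dependencies)
print(results["summarize"])  # 4.0
```

Here the edges carry data (the result of an upstream task becomes an argument of the downstream one). A system that only tracks execution dependencies would use the same ordering but ignore the returned values.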


Applications

Scientists working at different sites can collaborate on large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision. More specialized scientific workflow systems provide a visual programming front end that lets users construct their applications as a visual graph by connecting nodes together, and tools have also been developed to build such applications in a platform-independent manner. Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline.
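
A pipeline in this sense, where each edge feeds one application's output into the next application's input, might be sketched in Python as follows. The stage functions (`parse`, `normalize`, `report`) and the `run_pipeline` helper are hypothetical stand-ins for real applications.

```python
from functools import reduce


def run_pipeline(stages, initial):
    """Feed the output of each stage into the input of the next one."""
    return reduce(lambda data, stage: stage(data), stages, initial)


# Hypothetical stages standing in for separate applications.
def parse(text):
    return [float(token) for token in text.split(",")]


def normalize(values):
    peak = max(values)
    return [v / peak for v in values]


def report(values):
    return ", ".join(f"{v:.2f}" for v in values)


output = run_pipeline([parse, normalize, report], "2,4,8")
print(output)  # 0.25, 0.50, 1.00
```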


Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, using a scripting language such as Python with a command-line interface, or more recently using open-source web applications such as Jupyter Notebook.

There are many motives for differentiating scientific workflows from traditional business process workflows. These include:
* providing an easy-to-use environment for individual application scientists themselves to create their own workflows
* providing interactive tools for the scientists enabling them to execute their workflows and view their results in real time
* simplifying the process of sharing and reusing workflows between the scientists
* enabling scientists to track the provenance of the workflow execution results and the workflow creation steps

By focusing on the scientists, the design of a scientific workflow system shifts away from the workflow scheduling activities, typically considered by grid computing environments for optimizing the execution of complex computations on predefined resources, to a domain-specific view of what data types, tools and distributed resources should be made available to the scientists and how to make them easily accessible and with specific Quality of Service requirements.

Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure, and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result, how it was obtained, experimental methods used, machine calibrations and parameters, etc. It is the same in e-Science, except provenance data are a record of the workflow activities invoked, services and databases accessed, data sets used, and so forth. Such information is useful for a scientist to interpret their workflow results and for other scientists to establish trust in the experimental result.
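
As a rough illustration of provenance capture, the Python sketch below records which activity ran, when it started, and a content hash of its input and output. The record fields, the `run_with_provenance` helper, and the two sample steps are assumptions made for this example. Real systems typically record far more (services and databases accessed, parameters, software versions).

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(value):
    """Short content hash so data can be compared across steps and runs."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_with_provenance(steps, data, log):
    """Run steps in order, appending one provenance record per step to log."""
    for step in steps:
        started = datetime.now(timezone.utc).isoformat()
        result = step(data)
        log.append({
            "activity": step.__name__,      # which workflow activity was invoked
            "started": started,             # when it was invoked (UTC)
            "input": fingerprint(data),     # what it consumed
            "output": fingerprint(result),  # what it produced
        })
        data = result
    return data


# Hypothetical analysis steps.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def mean(rows):
    return sum(rows) / len(rows)


log = []
result = run_with_provenance([drop_missing, mean], [1.0, None, 3.0], log)
print(result)  # 2.0
print(json.dumps(log, indent=2))
```

Because the output hash of one record equals the input hash of the next, a reader of the log can trace a final result back through every step that contributed to it.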


Sharing workflows

Social networking communities such as myExperiment have been developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publication of workflow definitions and workflow results directly on the Galaxy installation.


Analysis

A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.

Workflow analysis techniques can be used to verify certain properties of such workflows before executing them. An example of a theoretical formal analysis framework for the verification and profiling of the control-flow and data-flow aspects of scientific workflows for the Discovery Net system is described in the paper "The design and implementation of a workflow analysis tool" by Curcin et al. The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of the workflow language, including the execution properties of nodes and arcs in the workflow graph, an understanding of functional equivalencies between workflow patterns, and many other issues. Doing such analysis is difficult, and addressing these issues requires building on formal methods used in computer science research (e.g. Petri nets) to develop user-level tools to reason about the properties of both workflows and workflow systems. The lack of such tools in the past stopped automated workflow management solutions from maturing from nice-to-have academic toys to production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
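
The idea of verifying properties before execution can be shown with a much simpler check than the formal analysis described above. The following Python sketch is not the Discovery Net tool and does not use Petri nets. It only inspects the structure of a hypothetical dependency mapping for undefined tasks and cycles, using the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter


def validate_workflow(dependencies):
    """Return a list of structural problems found without running any task."""
    problems = []
    # Every upstream task named in an edge must itself be defined.
    for task, upstream in dependencies.items():
        for dep in upstream:
            if dep not in dependencies:
                problems.append(f"{task} depends on undefined task {dep}")
    # A workflow with a dependency cycle can never be scheduled.
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        cycle = " -> ".join(err.args[1])  # args[1] holds the nodes of the cycle
        problems.append(f"cycle detected: {cycle}")
    return problems


print(validate_workflow({"load": [], "clean": ["load"]}))
print(validate_workflow({"clean": ["laod"]}))
print(validate_workflow({"a": ["b"], "b": ["a"]}))
```

Checks of this kind catch wiring mistakes early, but they say nothing about execution semantics or functional equivalence between workflow patterns, which is where the formal methods mentioned above come in.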


Notable systems

Notable scientific workflow systems include:
* Anduril, bioinformatics and image analysis
* Apache Airavata, a general purpose workflow management system
* Apache Airflow, a general purpose workflow management system
* Apache Taverna, widely used in bioinformatics, astronomy, biodiversity
* BioBIKE, a cloud-based bioinformatics platform
* Bioclipse, a graphical workbench, with a scripting environment that lets you perform complex actions as a kind of workflow
* Collective Knowledge, a Python-based general workflow and experiment crowdsourcing framework with JSON API and cross-platform package manager
* Common Workflow Language, a community-developed YAML-based workflow language, supported by multiple engine implementations
* Cuneiform, a functional workflow language
* Clone Manager from Sci-Ed
* CLC bio, a bioinformatics analysis and workflow management platform from QIAGEN Digital Insights
* Discovery Net, one of the earliest examples of a scientific workflow system
* Galaxy, initially targeted at genomics
* GenePattern, a scientific workflow system that provides access to hundreds of genomic analysis tools
* Kepler, a scientific workflow management system
* KNIME, an open-source data analytics platform
* Nextflow, a bioinformatic data analysis workflow system
* OnlineHPC, online scientific workflow designer and high performance computing toolkit
* Orange, open source data visualization and analysis
* Pegasus, an open-source scientific workflow management system
* Pipeline Pilot, graphical programming with many tools to address cheminformatics workflows
* Swift parallel scripting language, a scripting language with many of the capabilities of scientific workflow systems built-in
* UGENE, provides a workflow management system that is installed on a local computer
* VisTrails, a scientific workflow system developed in Python

More than 280 computational data analysis workflow systems have been identified, although the distinction between data analysis workflows and scientific workflows is fluid, as not all analysis workflow systems are used for scientific purposes.


Comparisons between bioinformatics workflow systems

With a large number of bioinformatics workflow systems to choose from, it becomes difficult to understand and compare the features of the different workflow systems. There has been little work conducted in evaluating and comparing the systems from a bioinformatician's perspective, especially when it comes to comparing the data types they can deal with, the in-built functionalities that are provided to the user or even their performance or usability. Examples of existing comparisons include:
* The paper "Scientific workflow systems - can one size fit all?", which provides a high-level framework for comparing workflow systems based on their control flow and data flow properties. The systems compared include Discovery Net, Taverna, Triana, Kepler as well as YAWL and BPEL.
* The paper "Meta-workflows: pattern-based interoperability between Galaxy and Taverna", which provides a more user-oriented comparison between Taverna and Galaxy in the context of enabling interoperability between both systems.
* The infrastructure paper "Delivering ICT Infrastructure for Biomedical Research", which compares two workflow systems, Anduril and Chipster, in terms of infrastructure requirements in a cloud-delivery model.
* The paper "A review of bioinformatic pipeline frameworks", which attempts to classify workflow management systems based on three dimensions: "using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface".


See also

* e-Science
* Grid computing
* Workflow engine


External links

* Yu, Jia; Buyya, Rajkumar (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record. 34 (3): 44. doi:10.1145/1084805.1084814. CiteSeerX 10.1.1.63.3176. S2CID 538714.
* "Scientific workflow systems - can one size fit all?", paper in CIBEC'08 comparing the features of multiple scientific workflow systems.
* List of software tools related to scientific workflows on the DataONE website