Cuneiform is an open-source workflow language for large-scale scientific data analysis. It is a statically typed functional programming language promoting parallel computing. It features a versatile foreign function interface that allows users to integrate software from many external programming languages. At the organizational level, Cuneiform provides facilities like conditional branching and general recursion, making it Turing-complete. In this, Cuneiform attempts to close the gap between scientific workflow systems like Taverna, KNIME, or Galaxy and large-scale data analysis programming models like MapReduce or Pig Latin, while offering the generality of a functional programming language.

Cuneiform is implemented in distributed Erlang. If run in distributed mode, it drives a POSIX-compliant distributed file system like Gluster or Ceph (or a FUSE integration of some other file system, e.g., HDFS). Alternatively, Cuneiform scripts can be executed on top of HTCondor or Hadoop.

Cuneiform is influenced by the work of Peter Kelly, who proposes functional programming as a model for scientific workflow execution. In this, Cuneiform is distinct from related workflow languages based on dataflow programming, like Swift.


External software integration

External tools and libraries (e.g., R or Python libraries) are integrated via a foreign function interface. In this, Cuneiform resembles, e.g., KNIME, which allows the use of external software through snippet nodes, or Taverna, which offers BeanShell services for integrating Java software. By defining a task in a foreign language, it is possible to use the API of an external tool or library. This way, tools can be integrated directly without the need to write a wrapper or reimplement the tool.

Currently supported foreign programming languages are:
* Bash
* Elixir
* Erlang
* Java
* JavaScript
* MATLAB
* GNU Octave
* Perl
* Python
* R
* Racket

Foreign language support for AWK and gnuplot is a planned addition.


Type system

Cuneiform provides a simple, statically checked type system. While Cuneiform provides lists as compound data types, it omits traditional list accessors (head and tail) to avoid the runtime errors that might arise when accessing the empty list. Instead, lists are accessed in an all-or-nothing fashion by only mapping or folding over them. Additionally, Cuneiform omits arithmetic at the organizational level, which excludes the possibility of division by zero. Omitting every partially defined operation guarantees that runtime errors can arise exclusively in foreign code.
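The reasoning behind omitting head and tail can be illustrated outside Cuneiform. The following is a minimal Python sketch, used here only as an analogy: the helper names head, map_all, and fold_all are hypothetical and are not part of Cuneiform. It contrasts a partial accessor, which fails on the empty list, with mapping and folding, which are defined for every list.

```python
from functools import reduce


def head(xs):
    # Partial operation: undefined for the empty list.
    return xs[0]


def map_all(f, xs):
    # Total operation: the empty list maps to the empty list.
    return [f(x) for x in xs]


def fold_all(f, acc, xs):
    # Total operation: folding the empty list returns the initial accumulator.
    return reduce(f, xs, acc)


empty = []
mapped = map_all(str.upper, empty)
folded = fold_all(lambda acc, x: acc + x, "", empty)

try:
    head(empty)
    head_failed = False
except IndexError:
    # Accessing the first element of an empty list is a runtime error.
    head_failed = True
```

A language that offers only the two total operations cannot reach the failing branch, which is the guarantee the type system aims for at the organizational level.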


Base data types

Cuneiform provides Booleans, strings, and files as base data types. Files are used to exchange data in arbitrary format between foreign functions.


Records and pattern matching

Cuneiform provides records (structs) as compound data types. The example below shows the definition of a variable r being a record with two fields a1 and a2, the first being a string and the second being a Boolean.
let r : <a1 : Str, a2 : Bool> = <a1 = "my string", a2 = true>;

Records can be accessed either via projection or via pattern matching. The example below extracts the two fields a1 and a2 from the record r.
let a1 : Str = ( r|a1 );
let <a2 = a2 : Bool> = r;


Lists and list processing

Furthermore, Cuneiform provides lists as compound data types. The example below shows the definition of a variable xs being a file list with three elements.
let xs : [File] = ['a.txt', 'b.txt', 'c.txt' : File];

Lists can be processed with the for and fold operators. The for operator can be given multiple lists, which it consumes element-wise (similar to for/list in Racket, mapcar in Common Lisp, or zipwith in Erlang). The example below shows how to map over a single list, the result being a file list.
for x <- xs do
  process-one( arg1 = x )
  : File
end;

The example below shows how to zip two lists, the result also being a file list.
for x <- xs, y <- ys do
  process-two( arg1 = x, arg2 = y )
  : File
end;

Finally, lists can be aggregated by using the fold operator. The following example sums up the elements of a list.
fold acc = 0, x <- xs do
  add( a = acc, b = x )
end;
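For readers more familiar with mainstream languages, the three list operations can be approximated in Python. This is only an analogy under stated assumptions: process_one, process_two, and add are hypothetical stand-ins for foreign tasks, and the numeric list is made up for the fold.

```python
from functools import reduce


def process_one(arg1):
    # Hypothetical stand-in for a foreign task producing one output file name.
    return arg1 + ".out"


def process_two(arg1, arg2):
    # Hypothetical stand-in for a foreign task combining two inputs.
    return arg1 + "+" + arg2


def add(a, b):
    # Hypothetical stand-in for a foreign addition task.
    return a + b


xs = ["a.txt", "b.txt", "c.txt"]
ys = ["d.txt", "e.txt", "f.txt"]

# for x <- xs do ... end: one application per element.
mapped = [process_one(arg1=x) for x in xs]

# for x <- xs, y <- ys do ... end: the lists are consumed pairwise,
# not as a cross product.
zipped = [process_two(arg1=x, arg2=y) for x, y in zip(xs, ys)]

# fold acc = 0, x <- ns do ... end: the accumulator is threaded through.
ns = [1, 2, 3]
total = reduce(lambda acc, x: add(a=acc, b=x), ns, 0)
```

Note that the applications in the two for forms do not depend on one another, whereas each step of the fold depends on the previous accumulator value.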


Parallel execution

Cuneiform is a purely functional language, i.e., it does not support mutable references. As a consequence, it can use subterm independence to divide a program into parallelizable portions. The Cuneiform scheduler distributes these portions to worker nodes. In addition, Cuneiform uses a call-by-name evaluation strategy to compute values only if they contribute to the computation result. Finally, foreign function applications are memoized to speed up computations that contain previously derived results.

For example, if a program applies two functions f and g independently and passes both results to a third function h, the applications of f and g can run in parallel, while h is dependent and can be started only when both f and g have finished. Likewise, mapping a function f over a three-element list creates three parallel applications of f. Similarly, if two fields of a record r are constructed by applying f and g, the two applications are independent and can thus run in parallel.
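The scheduling idea can be sketched in Python with a thread pool. This is an illustration only and does not reflect how the Cuneiform scheduler is implemented: f, g, and h are hypothetical placeholder functions, the input names are made up, and lru_cache merely stands in for the memoization of foreign function applications. Call-by-name evaluation is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls = []  # records every real (non-cached) application


@lru_cache(maxsize=None)
def f(x):
    calls.append(("f", x))
    return "f(" + x + ")"


@lru_cache(maxsize=None)
def g(x):
    calls.append(("g", x))
    return "g(" + x + ")"


def h(a, b):
    # h depends on both results, so it can only start after f and g finish.
    return "h(" + a + ", " + b + ")"


# f and g share no data, so they may be submitted at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_f = pool.submit(f, "in")
    fut_g = pool.submit(g, "in")
    result = h(fut_f.result(), fut_g.result())

# Mapping f over a three-element list yields three independent applications.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(f, ["a.txt", "b.txt", "c.txt"]))

# A repeated application is answered from the cache instead of being rerun.
again = f("in")
```

Because the functions have no side effects on shared state apart from the bookkeeping list, the order in which the pool runs them does not affect the results.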


Examples

A hello-world script:
def greet( person : Str ) -> <out : Str>
in Bash *{
  out="Hello $person"
}*

( greet( person = "world" )|out );

This script defines a task greet in Bash which prepends "Hello " to its string argument person. The function produces a record with a single string field out. Applying greet, binding the argument person to the string "world", produces the record <out = "Hello world">. Projecting this record to its field out evaluates to the string "Hello world".

Command line tools can be integrated by defining a task in Bash:
def samtoolsSort( bam : File ) -> <sorted : File>
in Bash *{ ... }*

In this example, a task samtoolsSort is defined. It calls the tool SAMtools, consuming an input file in BAM format and producing a sorted output file, also in BAM format.


Release history

In April 2016, Cuneiform's implementation language switched from Java to Erlang and, in February 2018, its major distributed execution platform changed from Hadoop to distributed Erlang. Additionally, from 2015 to 2018, HTCondor was maintained as an alternative execution platform. Cuneiform's surface syntax was revised twice, as reflected in the major version number.


Version 1

In its first draft, published in May 2014, Cuneiform was closely related to Make in that it constructed a static data dependency graph which the interpreter traversed during execution. The major difference from later versions was the lack of conditionals, recursion, and static type checking. Files were distinguished from strings by prefixing single-quoted string values with a tilde ~. The script's query expression was introduced with the target keyword. Bash was the default foreign language. Function application had to be performed using an apply form that took task as its first keyword argument. One year later, this surface syntax was replaced by a streamlined but similar version. The following example script downloads a reference genome from an FTP server.
declare download-ref-genome;

deftask download-fa( fa : ~path ~id ) *{ ... }*

ref-genome-path = ~'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes';
ref-genome-id = ~'chr22';

ref-genome = apply(
    task : download-fa
    path : ref-genome-path
    id : ref-genome-id
);

target ref-genome;


Version 2

The second draft of the Cuneiform surface syntax, first published in March 2015, remained in use for three years, outlasting the transition from Java to Erlang as Cuneiform's implementation language. Evaluation differs from earlier approaches in that the interpreter reduces a query expression instead of traversing a static graph. While this surface syntax was in use, the interpreter was formalized and simplified, which resulted in a first specification of Cuneiform's semantics. The syntax featured conditionals. However, Booleans were encoded as lists, recycling the empty list as Boolean false and the non-empty list as Boolean true. Recursion was added later as a byproduct of formalization. Static type checking, however, was introduced only in Version 3. The following script decompresses a zipped file and splits it into evenly sized partitions.
deftask unzip( ... : zip( File ) ) in bash *{ ... }*

deftask split( ... : file( File ) ) in bash *{ ... }*

sotu = "sotu/stateoftheunion1790-2014.txt.zip";
fileLst = split( file: unzip( zip: sotu ) );

fileLst;


Version 3

The current version of Cuneiform's surface syntax, in comparison to earlier drafts, is an attempt to close the gap to mainstream functional programming languages. It features a simple, statically checked type system and introduces records in addition to lists as a second type of compound data structure. Booleans are a separate base data type. The following script untars a file, resulting in a file list.
def untar( tar : File ) -> <fileLst : [File]>
in Bash *{ ... }*

let hg38Tar : File = 'hg38/hg38.tar';

let <fileLst = faLst : [File]> = untar( tar = hg38Tar );

faLst;

