Data-centric programming language

A data-centric programming language is one whose primary function is the management and manipulation of data. A data-centric programming language includes built-in processing primitives for accessing data stored in sets, tables, lists, and other data structures and databases, and for the specific manipulation and transformation of data required by an application. Data-centric programming languages are typically declarative and often dataflow-oriented: a program defines the desired processing result, and the specific processing steps required to produce it are left to the language compiler. The SQL relational database language is an example of a declarative, data-centric language. Declarative, data-centric programming languages are ideal for data-intensive computing applications.
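
For example, the following SQL query (a minimal sketch; the orders table and its columns are hypothetical) states only the desired result, a spending total per customer, and leaves the scan, grouping, and sorting strategy to the database engine:

    -- Hypothetical table: orders(customer_id, amount, order_date)
    -- Declarative: describes what to compute, not how to compute it.
    SELECT customer_id,
           SUM(amount) AS total_spent
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY customer_id
    ORDER BY total_spent DESC;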


Background

The rapid growth of the Internet and World Wide Web has led to huge amounts of information available online and to the need for Big Data processing capabilities. Business and government organizations create large amounts of both structured and unstructured information which needs to be processed, analyzed, and linked. The storing, managing, accessing, and processing of this vast amount of data represents a fundamental need and an immense challenge in order to satisfy needs to search, analyze, mine, and visualize this data as information. Declarative, data-centric languages are increasingly addressing these problems, because focusing on the data makes them much simpler to express. Computer system architectures such as Hadoop and HPCC, which can support data-parallel applications, are a potential solution to the terabyte- and petabyte-scale data processing requirements of data-intensive computing. Clusters of commodity hardware are commonly used to address Big Data problems.

The fundamental challenges for Big Data applications and data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing the associated data analysis cycles to support practical, timely applications, and developing new algorithms that can scale to search and process massive amounts of data. The National Science Foundation has identified key issues related to data-intensive computing, such as programming abstractions, including models, languages, and algorithms, that allow a natural expression of parallel processing of data. Declarative, data-centric programming languages are well suited to this class of problems: applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the computing cluster. The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations, incorporating shared libraries of common data manipulation algorithms such as sorting. Declarative, data-centric programming languages are inherently adaptable to various forms of distributed computing, including clusters, data grids, and cloud computing. Adopting them suggests more than just adapting to a new computing capability; it also suggests changes in how data analysis is approached and how applications are designed.


Data-centric language examples

SQL is the best-known declarative, data-centric programming language; it has been in use since the 1980s and became a de facto standard for use with relational databases. However, a variety of new system architectures and associated programming languages have been implemented for data-intensive computing, Big Data applications, and large-scale data analysis applications. Most data growth is in unstructured form (Gantz et al., 2007), and new processing paradigms with more flexible data models were needed. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in the open-source implementation Hadoop used by Yahoo, Facebook, and others, and the HPCC system architecture offered by LexisNexis Risk Solutions.


Hadoop Pig

Hadoop is an open-source software project sponsored by The Apache Software Foundation (http://www.apache.org) which implements the MapReduce architecture. The Hadoop execution environment supports additional distributed data processing capabilities designed to run on the Hadoop MapReduce architecture. These include Pig, a high-level dataflow programming language and execution framework for data-intensive computing. Pig was developed at Yahoo! to provide a specific data-centric language notation for data analysis applications, to improve programmer productivity, and to reduce development cycles when using the Hadoop MapReduce environment. Pig programs are automatically translated into sequences of MapReduce programs as needed by the execution environment. Pig provides language capabilities for loading, storing, filtering, grouping, de-duplication, ordering, sorting, aggregation, and joining operations on data (Olston et al., 2008). Figure 1 shows a sample Pig program and Figure 2 shows how it is translated into a series of MapReduce operations.
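
The figures themselves are not reproduced here, but a program of the kind Figure 1 describes can be sketched. The following Pig Latin fragment (illustrative only; file names and fields are hypothetical, in the spirit of the canonical example from the Pig Latin paper) finds the five most-clicked URLs among users aged 18 to 25:

    -- Load, filter, join, group, aggregate, order, limit: each statement
    -- names a dataflow step; Pig compiles the chain into MapReduce jobs.
    Users = LOAD 'users' AS (name, age);
    Fltrd = FILTER Users BY age >= 18 AND age <= 25;
    Pages = LOAD 'pages' AS (user, url);
    Jnd   = JOIN Fltrd BY name, Pages BY user;
    Grpd  = GROUP Jnd BY url;
    Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
    Srtd  = ORDER Smmd BY clicks DESC;
    Top5  = LIMIT Srtd 5;
    STORE Top5 INTO 'top5sites';

Each alias (Users, Fltrd, and so on) names an intermediate dataset rather than a variable, which is what makes the program a declared dataflow rather than a sequence of imperative steps.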


HPCC ECL

The HPCC data-intensive computing platform from LexisNexis Risk Solutions includes a high-level declarative, data-centric programming language called ECL. ECL allows the programmer to define what the data processing result should be and the dataflows and transformations that are necessary to achieve it. The ECL language includes extensive capabilities for data definition, filtering, data management, and data transformation, and provides an extensive set of built-in functions to operate on records in datasets, which can include user-defined transformation functions. ECL programs are compiled into optimized C++ source code, which is subsequently compiled into executable code and distributed to the nodes of a processing cluster. ECL combines data representation with algorithm implementation, and is the fusion of both a query language and a parallel data processing language.

ECL includes built-in data transform operations which process entire datasets, including PROJECT, ITERATE, ROLLUP, JOIN, COMBINE, FETCH, NORMALIZE, DENORMALIZE, and PROCESS. For example, the transform function defined for a JOIN operation receives two records, one from each dataset being joined, can perform any operations on the fields in the pair of records, and returns an output record which can be completely different from either of the input records. Example syntax for the JOIN operation from the ECL Language Reference Manual is shown in Figure 3. Figure 4 shows an example of the equivalent ECL code for the Pig example program shown in Figure 1.

The ECL programming language also provides built-in primitives for natural language processing (NLP) with PATTERN statements and the built-in PARSE operation. PATTERN statements allow matching patterns, including regular expressions, to be defined and used to parse information from unstructured data such as raw text. PATTERN statements can be combined to implement complex parsing operations or complete grammars from Backus–Naur form (BNF) definitions. The PARSE operation runs across a dataset of records on a specific field within each record; this field could, for example, be an entire line in a text file. Using this capability of the ECL language, it is possible to implement parallel information extraction applications across document files and all types of unstructured and semi-structured data, including XML-based documents and Web pages. Figure 5 shows an example of ECL code used in a log analysis application which incorporates NLP.
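
As with the Pig section, the figures are not reproduced here; the following ECL fragment (an illustrative sketch only; logical file names, record layouts, and field names are hypothetical) shows the same top-URLs dataflow, including a JOIN with an explicit TRANSFORM of the kind described above:

    // Record layouts and dataset declarations (hypothetical logical files).
    UserRec := RECORD
        STRING20  name;
        UNSIGNED1 age;
    END;
    PageRec := RECORD
        STRING20  user;
        STRING100 url;
    END;
    Users := DATASET('~thor::users', UserRec, THOR);
    Pages := DATASET('~thor::pages', PageRec, THOR);

    Fltrd := Users(age >= 18 AND age <= 25);  // declarative record filter

    // The TRANSFORM receives one record from each side of the JOIN and
    // builds an output record that need not match either input layout.
    JndRec := RECORD
        STRING100 url;
    END;
    JndRec DoJoin(UserRec L, PageRec R) := TRANSFORM
        SELF.url := R.url;
    END;
    Jnd := JOIN(Fltrd, Pages, LEFT.name = RIGHT.user, DoJoin(LEFT, RIGHT));

    // Grouped aggregate, descending sort, first five rows.
    Smmd := TABLE(Jnd, {url, UNSIGNED clicks := COUNT(GROUP)}, url);
    OUTPUT(CHOOSEN(SORT(Smmd, -clicks), 5));

A PATTERN/PARSE sketch in the same hedged spirit (the dataset, field names, and flag choice are assumptions) illustrates the NLP primitives:

    // Extract word-like tokens from each line of a raw text dataset.
    TextRec := RECORD
        STRING200 line;
    END;
    Logs := DATASET('~thor::rawlogs', TextRec, THOR);

    PATTERN Alpha := PATTERN('[A-Za-z]');
    PATTERN Word  := Alpha+;

    // PARSE runs the pattern across the named field of every record.
    Tokens := PARSE(Logs, line,
                    Word,
                    {STRING token := MATCHTEXT(Word)},
                    SCAN ALL);
    OUTPUT(Tokens);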


See also

* Programming language
* Declarative programming
* Data-intensive computing
* Parallel computing
* Distributed computing
* Data parallelism
* Big Data
* ECL (data-centric programming language for Big Data)


References

* J. F. Gantz, D. Reinsel, C. Chute, W. Schlichting, J. McArthur, S. Minton, J. Xheneti, A. Toncheva, and A. Manfrediz. "The Expanding Digital Universe." IDC White Paper, 2007.
* C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: A Not-So-Foreign Language for Data Processing." Stanford University, 2008.