Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003, and the szl runtime was open-sourced in August 2010. However, since the MapReduce table aggregators have not been released, the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf. Sawzall has been replaced by Lingo (logs in Go) for most purposes within Google.


Motivation

Google's server logs are stored as large collections of records (Protocol Buffers) that are partitioned over many disks within GFS. To perform calculations involving the logs, engineers can write MapReduce programs in C++ or Java. Such programs need to be compiled and may be more verbose than necessary, so writing one to analyze the logs can be time-consuming. To make it easier to write quick scripts, Rob Pike et al. developed the Sawzall language. A Sawzall script runs within the Map phase of a MapReduce and "emits" values to tables. The Reduce phase (which the script writer does not have to be concerned about) then aggregates the tables from multiple runs into a single set of tables. Currently, only the language runtime (which runs a Sawzall script once over a single input) has been open-sourced; the supporting program built on MapReduce has not been released.
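The division of labor described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Google's implementation: the names `run_script`, `map_phase`, and `reduce_phase` are hypothetical, records are assumed to be numeric strings, and every table is assumed to be a sum table.

```python
from collections import defaultdict

def run_script(record, emit):
    """Stand-in for a Sawzall script: sees one record, outputs only via emit."""
    emit("count", 1)
    emit("total", float(record))

def map_phase(shard):
    """Run the script once per record and keep this shard's partial sums."""
    tables = defaultdict(float)

    def emit(table, value):
        tables[table] += value

    for record in shard:
        run_script(record, emit)
    return dict(tables)

def reduce_phase(partials):
    """Merge the per-shard tables into a single set of tables."""
    merged = defaultdict(float)
    for tables in partials:
        for name, value in tables.items():
            merged[name] += value
    return dict(merged)

# Two hypothetical shards of log records, processed independently.
shards = [["1.5", "2.5"], ["4.0"]]
result = reduce_phase(map_phase(shard) for shard in shards)
print(result)  # {'count': 3.0, 'total': 8.0}
```

The script body never sees more than one record and never touches the merge step, which is why the script writer can ignore the Reduce phase.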


Features

Some interesting features include:
* A Sawzall script has a single input (a log record) and can output only by emitting to tables. The script can have no other side effects.
* A script can define any number of output tables. Table types include:
** collection saves every value emitted
** sum saves the sum of every emitted value
** maximum(n) saves only the highest n values on a given weight
* In addition, there are several statistical table types that give inexact results. The higher the parameter n, the more accurate the estimates are.
** sample(n) gives a random sample of n values from all the emitted values
** quantile(n) calculates a cumulative probability distribution of the given numbers
** top(n) gives n values that are probably the most frequent of the emitted values
** unique(n) estimates the number of unique values emitted

Sawzall's design favors efficiency and engine simplicity over power:
* Sawzall is statically typed, and the engine compiles the script to x86 before running it.
* Sawzall supports the compound data types lists, maps, and structs. However, there are no references or pointers. All assignments and function arguments create copies. This means that recursive data structures and cycles are impossible.
* As in C, functions can modify global variables and local variables, but they are not closures.
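To make the exact table types concrete, here is a minimal Python sketch of collection, sum, and maximum(n). The class names and the `emit`/`result` methods are hypothetical and chosen for illustration; they are not part of the szl runtime, and the heap-based approach for maximum(n) is an assumption about one reasonable way to keep the n highest-weighted values.

```python
import heapq

class CollectionTable:
    """collection: saves every value emitted."""
    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)

class SumTable:
    """sum: saves the sum of every emitted value."""
    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value

class MaximumTable:
    """maximum(n): saves only the n values with the highest weights."""
    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (weight, sequence, value)
        self._seq = 0    # tie-breaker so values themselves are never compared

    def emit(self, value, weight):
        if self.n <= 0:
            return
        entry = (weight, self._seq, value)
        self._seq += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif weight > self._heap[0][0]:
            # Evict the current lowest-weighted entry.
            heapq.heapreplace(self._heap, entry)

    def result(self):
        """Return the kept values, highest weight first."""
        return [value for _, _, value in sorted(self._heap, reverse=True)]

# Hypothetical usage: keep the two slowest requests by latency.
slowest = MaximumTable(2)
for url, latency_ms in [("/a", 120), ("/b", 450), ("/c", 80), ("/d", 300)]:
    slowest.emit(url, weight=latency_ms)
print(slowest.result())  # ['/b', '/d']
```

Each of these tables holds a bounded or trivially mergeable state, which is what lets partial results from many machines be combined later.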


Sawzall code

This complete Sawzall program will read the input and produce three results: the number of records, the sum of the values, and the sum of the squares of the values.

    count: table sum of int;
    total: table sum of float;
    sum_of_squares: table sum of float;
    x: float = input;
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;
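For readers unfamiliar with the syntax, the following Python sketch accumulates the same three quantities and shows one thing they are useful for: the mean and standard deviation can be derived from them afterwards. The function names `summarize` and `mean_and_stddev` and the sample values are hypothetical, and the post-processing step is an illustration rather than part of the Sawzall program.

```python
import math

def summarize(values):
    """Accumulate the same three sums the Sawzall program emits."""
    count = 0
    total = 0.0
    sum_of_squares = 0.0
    for x in values:
        count += 1
        total += x
        sum_of_squares += x * x
    return count, total, sum_of_squares

def mean_and_stddev(count, total, sum_of_squares):
    """Derive mean and population standard deviation from the three sums."""
    mean = total / count
    variance = sum_of_squares / count - mean * mean
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))

count, total, sum_of_squares = summarize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean, stddev = mean_and_stddev(count, total, sum_of_squares)
print(count, total, sum_of_squares, mean, stddev)  # 8 40.0 232.0 5.0 2.0
```

Because all three quantities are plain sums, partial results from separate runs can simply be added together before the mean and standard deviation are computed.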


See also

* Pig – similar tool and language for use with Apache Hadoop
* Sawmill (software)


Further reading

* S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: 19th ACM Symposium on Operating Systems Principles, Proceedings, 17 ACM Press, 2003, pp. 29–43.


External links


Google Code Archive - Long-term storage for Google Code Project Hosting.
