
Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003, and the szl runtime was open-sourced in August 2010. However, since the MapReduce table aggregators have not been released, the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf. Sawzall has been replaced by Lingo (logs in Go) for most purposes within Google.


Motivation

Google's server logs are stored as large collections of records (Protocol Buffers) that are partitioned over many disks within GFS. To perform calculations involving the logs, engineers can write MapReduce programs in C++ or Java. MapReduce programs need to be compiled and may be more verbose than necessary, so writing a program to analyze the logs can be time-consuming. To make it easier to write quick scripts, Rob Pike et al. developed the Sawzall language. A Sawzall script runs within the Map phase of a MapReduce and "emits" values to tables. Then the Reduce phase (which the script writer does not have to be concerned about) aggregates the tables from multiple runs into a single set of tables. Only the language runtime (which runs a Sawzall script once over a single input) has been open-sourced; the supporting program built on MapReduce has not been released.
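The division of labor described above can be illustrated with a small Python simulation. This is only a sketch, not Sawzall or Google's MapReduce: the names `run_script`, `map_phase`, and `reduce_phase` are hypothetical, the records are invented byte strings, and every table is assumed to be a simple sum table.

```python
from collections import defaultdict


def run_script(record, emit):
    """Hypothetical stand-in for a script: sees one record, emits to tables."""
    emit("count", 1)
    emit("bytes", len(record))


def map_phase(shard):
    """Run the script once per record in a shard, building partial sum tables."""
    tables = defaultdict(int)

    def emit(table, value):
        tables[table] += value

    for record in shard:
        run_script(record, emit)
    return dict(tables)


def reduce_phase(partials):
    """Merge the partial tables from every shard into a single set of tables."""
    merged = defaultdict(int)
    for partial in partials:
        for table, value in partial.items():
            merged[table] += value
    return dict(merged)


# Two invented shards standing in for log records partitioned over many disks.
shards = [[b"GET /a", b"GET /bc"], [b"POST /d"]]
result = reduce_phase(map_phase(shard) for shard in shards)
```

The script function never sees more than one record and never touches the merge step, which is the property that lets the script writer ignore the Reduce phase.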


Features

Some interesting features include:
* A Sawzall script has a single input (a log record) and can output only by emitting to tables. The script can have no other side-effects.
* A script can define any number of output tables. Table types include:
** collection saves every value emitted
** sum saves the sum of every emitted value
** maximum(n) saves only the highest n values on a given weight.
* In addition, there are several statistical table types that give inexact results. The higher the parameter n, the more accurate the estimates are.
** sample(n) gives a random sample of n values from all the emitted values
** quantile(n) calculates a cumulative probability distribution of the given numbers.
** top(n) gives n values that are probably the most frequent of the emitted values.
** unique(n) estimates the number of unique values emitted.

Sawzall's design favors efficiency and engine simplicity over power:
* Sawzall is statically typed, and the engine compiles the script to x86 before running it.
* Sawzall supports the compound data types lists, maps, and structs. However, there are no references or pointers. All assignments and function arguments create copies. This means that recursive data structures and cycles are impossible.
* Like C, functions can modify global variables and local variables but are not closures.
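To make the exact table types more concrete, here is a minimal Python sketch of collection, sum, and maximum(n). The class names and the `emit` and `result` methods are hypothetical and are not taken from the szl runtime; the min-heap used for maximum(n) is one possible approach, not a description of how Sawzall implements it.

```python
import heapq


class CollectionTable:
    """Saves every value emitted."""

    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)


class SumTable:
    """Saves the sum of every emitted value."""

    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value


class MaximumTable:
    """Saves only the n values with the highest weights."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap of (weight, sequence number, value)
        self._seq = 0    # tie-breaker so values themselves are never compared

    def emit(self, value, weight):
        item = (weight, self._seq, value)
        self._seq += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, item)
        elif weight > self._heap[0][0]:
            # The new weight beats the smallest kept weight, so swap it in.
            heapq.heapreplace(self._heap, item)

    def result(self):
        """Return the kept values, highest weight first."""
        return [value for _, _, value in sorted(self._heap, reverse=True)]


best = MaximumTable(2)
for value, weight in [("a", 3), ("b", 10), ("c", 7), ("d", 1)]:
    best.emit(value, weight)

total = SumTable()
for value in [1, 2, 3]:
    total.emit(value)
```

Here `best.result()` returns `["b", "c"]` and `total.total` is `6`. The statistical table types are left out because their estimation methods are not described in this article.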


Sawzall code

This complete Sawzall program will read the input and produce three results: the number of records, the sum of the values, and the sum of the squares of the values.

 count: table sum of int;
 total: table sum of float;
 sum_of_squares: table sum of float;
 x: float = input;
 emit count <- 1;
 emit total <- x;
 emit sum_of_squares <- x * x;
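For readers unfamiliar with Sawzall, the following Python sketch mimics what the program above computes when applied to a stream of records. The `process` function, the `tables` dictionary, and the sample values are invented for illustration. The final three lines show one plausible use of the results (deriving the mean and standard deviation), which is an assumption about intent rather than something the program itself does.

```python
import math


def process(x, tables):
    """Mimic one run of the script on a single input value x."""
    tables["count"] += 1               # emit count <- 1;
    tables["total"] += x               # emit total <- x;
    tables["sum_of_squares"] += x * x  # emit sum_of_squares <- x * x;


# Three sum tables, matching the three table declarations in the program.
tables = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}

# Invented sample inputs standing in for log records.
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    process(x, tables)

# Assumed follow-up analysis: the three sums suffice for mean and variance.
mean = tables["total"] / tables["count"]
variance = tables["sum_of_squares"] / tables["count"] - mean * mean
std_dev = math.sqrt(variance)
```

With these inputs the tables hold a count of 8, a total of 40.0, and a sum of squares of 232.0, giving a mean of 5.0 and a standard deviation of 2.0.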


See also

* Pig – similar tool and language for use with Apache Hadoop
* Sawmill (software)


References


Further reading

* S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, in: 19th ACM Symposium on Operating Systems Principles, Proceedings, 17 ACM Press, 2003, pp. 29–43.


External links


Google Code Archive - Long-term storage for Google Code Project Hosting.
