A computer cluster is a set of
computers that work together so that they can be viewed as a single system. Unlike
grid computers, computer clusters have each
node
In general, a node is a localized swelling (a "knot") or a point of intersection (a vertex).
Node may refer to:
In mathematics
* Vertex (graph theory), a vertex in a mathematical graph
*Vertex (geometry), a point where two or more curves, lines ...
set to perform the same task, controlled and scheduled by software.
The components of a cluster are usually connected to each other through fast
local area network
A local area network (LAN) is a computer network that interconnects computers within a limited area such as a residence, school, laboratory, university campus or office building. By contrast, a wide area network (WAN) not only covers a larger ...
s, with each
node
In general, a node is a localized swelling (a "knot") or a point of intersection (a vertex).
Node may refer to:
In mathematics
* Vertex (graph theory), a vertex in a mathematical graph
*Vertex (geometry), a point where two or more curves, lines ...
(computer used as a server) running its own instance of an
operating system
An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs.
Time-sharing operating systems schedule tasks for efficient use of the system and may also i ...
. In most circumstances, all of the nodes use the same hardware and the same operating system, although in some setups (e.g. using
Open Source Cluster Application Resources
Open Source Cluster Application Resources (OSCAR) is a Linux-based software installation for high-performance cluster computing. OSCAR allows users to install a Beowulf type high performance computing cluster.
See also
* TORQUE Resource Manager ...
(OSCAR)), different operating systems can be used on each computer, or different hardware.
Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.
Computer clusters emerged as a result of convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance
distributed computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. Distributed computing is a field of computer sci ...
. They have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest
supercomputers in the world such as
IBM's Sequoia. Prior to the advent of clusters, single unit
fault tolerant
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
mainframes
A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise ...
with
modular redundancy were employed; but the lower upfront cost of clusters, and increased speed of network fabric has favoured the adoption of clusters. In contrast to high-reliability mainframes clusters are cheaper to scale out, but also have increased complexity in error handling, as in clusters error modes are not opaque to running programs.
Basic concepts
The desire to get more computing power and better reliability by orchestrating a number of low-cost
commercial off-the-shelf computers has given rise to a variety of architectures and configurations.
The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast
local area network
A local area network (LAN) is a computer network that interconnects computers within a limited area such as a residence, school, laboratory, university campus or office building. By contrast, a wide area network (WAN) not only covers a larger ...
.
The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit, e.g. via a
single system image In distributed computing, a single system image (SSI) cluster is a cluster of machines that appears to be one single system. The concept is often considered synonymous with that of a distributed operating system, but a single image may be presented ...
concept.
Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as
peer-to-peer
Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the network. They are said to form a peer-to-peer ...
or
grid computing which also use many nodes, but with a far more
distributed nature.
A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast
supercomputer. A basic approach to building a cluster is that of a
Beowulf cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional
high-performance computing
High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.
Overview
HPC integrates systems administration (including network and security knowledge) and parallel programming into a mult ...
. An early project that showed the viability of the concept was the 133-node
Stone Soupercomputer The Stone Soupercomputer was a Beowulf-style computer cluster built at the US Oak Ridge National Laboratory in the late 1990s.
A group of lab employees including William W. Hargrove and Forrest M. Hoffman applied for a grant to build a cluster in ...
.
The developers used
Linux
Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, w ...
, the
Parallel Virtual Machine
Parallel Virtual Machine (PVM) is a software tool for parallel networking of computers. It is designed to allow a network of heterogeneous Unix and/or Windows machines to be used as a single distributed parallel processor. Thus large computatio ...
toolkit and the
Message Passing Interface library to achieve high performance at a relatively low cost.
Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance. The
TOP500
The TOP500 project ranks and details the 500 most powerful non- distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The first of these updates always coinci ...
organization's semiannual list of the 500 fastest
supercomputers often includes many clusters, e.g. the world's fastest machine in 2011 was the
K computer
The K computer named for the Japanese word/numeral , meaning 10 quadrillion (1016)See Japanese numbers was a supercomputer manufactured by Fujitsu, installed at the Riken Advanced Institute for Computational Science campus in Kobe, Hyōgo Pref ...
which has a
distributed memory
In computer science, distributed memory refers to a multiprocessor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data are required, the computational task mu ...
, cluster architecture.
History
Greg Pfister has stated that clusters were not invented by any specific vendor but by customers who could not fit all their work on one computer, or needed a backup. Pfister estimates the date as some time in the 1960s. The formal engineering basis of cluster computing as a means of doing parallel work of any sort was arguably invented by
Gene Amdahl
Gene Myron Amdahl (November 16, 1922 – November 10, 2015) was an American computer architect and high-tech entrepreneur, chiefly known for his work on mainframe computers at IBM and later his own companies, especially Amdahl Corporation ...
of
IBM, who in 1967 published what has come to be regarded as the seminal paper on parallel processing:
Amdahl's Law
In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states tha ...
.
The history of early computer clusters is more or less directly tied into the history of early networks, as one of the primary motivations for the development of a network was to link computing resources, creating a de facto computer cluster.
The first production system designed as a cluster was the Burroughs
B5700 in the mid-1960s. This allowed up to four computers, each with either one or two processors, to be tightly coupled to a common disk storage subsystem in order to distribute the workload. Unlike standard multiprocessor systems, each computer could be restarted without disrupting overall operation.
The first commercial loosely coupled clustering product was
Datapoint Corporation's "Attached Resource Computer" (ARC) system, developed in 1977, and using
ARCnet
Attached Resource Computer NETwork (ARCNET or ARCnet) is a communications protocol for local area networks. ARCNET was the first widely available networking system for microcomputers; it became popular in the 1980s for office automation tasks. It ...
as the cluster interface. Clustering per se did not really take off until
Digital Equipment Corporation
Digital Equipment Corporation (DEC ), using the trademark Digital, was a major American company in the computer industry from the 1960s to the 1990s. The company was co-founded by Ken Olsen and Harlan Anderson in 1957. Olsen was president un ...
released their
VAXcluster
A VMScluster, originally known as a VAXcluster, is a computer cluster involving a group of computers running the OpenVMS operating system. Whereas tightly coupled multiprocessor systems run a single copy of the operating system, a VMScluster is l ...
product in 1984 for the
VMS operating system. The ARC and VAXcluster products not only supported parallel computing, but also shared
file systems and
peripheral devices. The idea was to provide the advantages of parallel processing, while maintaining data reliability and uniqueness. Two other noteworthy early commercial clusters were the
''Tandem Himalayan'' (a circa 1994 high-availability product) and the ''IBM S/390 Parallel Sysplex'' (also circa 1994, primarily for business use).
Within the same time frame, while computer clusters used parallelism outside the computer on a commodity network,
supercomputers began to use them within the same computer. Following the success of the
CDC 6600
The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM ...
in 1964, the
Cray 1
The Cray-1 was a supercomputer designed, manufactured and marketed by Cray Research. Announced in 1975, the first Cray-1 system was installed at Los Alamos National Laboratory in 1976. Eventually, over 100 Cray-1s were sold, making it one of the ...
was delivered in 1976, and introduced internal parallelism via
vector processing
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called ...
.
While early supercomputers excluded clusters and relied on
shared memory
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between progr ...
, in time some of the fastest supercomputers (e.g. the
K computer
The K computer named for the Japanese word/numeral , meaning 10 quadrillion (1016)See Japanese numbers was a supercomputer manufactured by Fujitsu, installed at the Riken Advanced Institute for Computational Science campus in Kobe, Hyōgo Pref ...
) relied on cluster architectures.
Attributes of clusters
Computer clusters may be configured for different purposes ranging from general purpose business needs such as web-service support, to computation-intensive scientific calculations. In either case, the cluster may use a
high-availability
High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
Modernization has resulted in an increased reliance on these systems. Fo ...
approach. Note that the attributes described below are not exclusive and a "computer cluster" may also use a high-availability approach, etc.
"
Load-balancing" clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized.
However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster which may just use a simple
round-robin method by assigning each new request to a different node.
Computer clusters are used for computation-intensive purposes, rather than handling
IO-oriented operations such as web service or databases.
For instance, a computer cluster might support
computational simulations of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach "
supercomputing".
"
High-availability cluster
High-availability clusters (also known as HA clusters, fail-over clusters) are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability softwa ...
s" (also known as
failover clusters, or HA clusters) improve the availability of the cluster approach. They operate by having redundant
nodes, which are then used to provide service when system components fail. HA cluster implementations attempt to use redundancy of cluster components to eliminate
single points of failure. There are commercial implementations of High-Availability clusters for many operating systems. The
Linux-HA
The Linux-HA (High-Availability Linux) project provides a high-availability ( clustering) solution for Linux, FreeBSD, OpenBSD, Solaris and Mac OS X which promotes reliability, availability, and serviceability (RAS).Alan Robertson ''The Evolu ...
project is one commonly used
free software
Free software or libre software is computer software distributed under terms that allow users to run the software for any purpose as well as to study, change, and distribute it and any adapted versions. Free software is a matter of liberty, no ...
HA package for the
Linux
Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, w ...
operating system.
Benefits
Clusters are primarily designed with performance in mind, but installations are based on many other factors. Fault tolerance (''the ability for a system to continue working with a malfunctioning node'') allows for
scalability
Scalability is the property of a system to handle a growing amount of work by adding resources to the system.
In an economic context, a scalable business model implies that a company can increase sales given increased resources. For example, a ...
, and in high-performance situations, low frequency of maintenance routines, resource consolidation (e.g.
RAID
Raid, RAID or Raids may refer to:
Attack
* Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground
* Corporate raid, a type of hostile takeover in business
* Panty raid, a prankish raid by male college ...
), and centralized management. Advantages include enabling data recovery in the event of a disaster and providing parallel data processing and high processing capacity.
In terms of scalability, clusters provide this in their ability to add nodes horizontally. This means that more computers may be added to the cluster, to improve its performance, redundancy and fault tolerance. This can be an inexpensive solution for a higher performing cluster compared to scaling up a single node in the cluster. This property of computer clusters can allow for larger computational loads to be executed by a larger number of lower performing computers.
When adding a new node to a cluster, reliability increases because the entire cluster does not need to be taken down. A single node can be taken down for maintenance, while the rest of the cluster takes on the load of that individual node.
If you have a large number of computers clustered together, this lends itself to the use of
distributed file systems
A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system (only direct attached storage for ...
and
RAID
Raid, RAID or Raids may refer to:
Attack
* Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground
* Corporate raid, a type of hostile takeover in business
* Panty raid, a prankish raid by male college ...
, both of which can increase the reliability and speed of a cluster.
Design and configuration
One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For instance, a single computer job may require frequent communication among nodes: this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or few nodes, and needs little or no inter-node communication, approaching
grid computing.
In a
Beowulf cluster
A Beowulf cluster is a computer cluster of what are normally identical, commodity-grade computers networked into a small local area network with libraries and programs installed which allow processing to be shared among them. The result is a hi ...
, the application programs never see the computational nodes (also called slave computers) but only interact with the "Master" which is a specific computer handling the scheduling and management of the slaves.
In a typical implementation the Master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general purpose network of the organization.
The slave computers typically have their own version of the same operating system, and local memory and disk space. However, the private slave network may also have a large and shared file server that stores global persistent data, accessed by the slaves as needed.
A special purpose 144-node
DEGIMA cluster is tuned to running astrophysical N-body simulations using the Multiple-Walk parallel treecode, rather than general purpose scientific computations.
Due to the increasing computing power of each generation of
game consoles, a novel use has emerged where they are repurposed into
High-performance computing
High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems.
Overview
HPC integrates systems administration (including network and security knowledge) and parallel programming into a mult ...
(HPC) clusters. Some examples of game console clusters are
Sony PlayStation clusters and
Microsoft
Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
Xbox
Xbox is a video gaming brand created and owned by Microsoft. The brand consists of five video game consoles, as well as applications (games), streaming services, an online service by the name of Xbox network, and the development arm by the ...
clusters. Another example of consumer game product is the
Nvidia Tesla Personal Supercomputer The Tesla Personal Supercomputer is a desktop computer ( personal supercomputer) that is backed by Nvidia and built by various hardware vendors. It is meant to be a demonstration of the capabilities of Nvidia's Tesla GPGPU brand; it utilizes Nvidia ...
workstation, which uses multiple graphics accelerator processor chips. Besides game consoles, high-end graphics cards too can be used instead. The use of graphics cards (or rather their GPU's) to do calculations for grid computing is vastly more economical than using CPU's, despite being less precise. However, when using double-precision values, they become as precise to work with as CPU's and are still much less costly (purchase cost).
Computer clusters have historically run on separate physical
computers with the same
operating system
An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs.
Time-sharing operating systems schedule tasks for efficient use of the system and may also i ...
. With the advent of
virtualization
In computing, virtualization or virtualisation (sometimes abbreviated v12n, a numeronym) is the act of creating a virtual (rather than actual) version of something at the same abstraction level, including virtual computer hardware platforms, stor ...
, the cluster nodes may run on separate physical computers with different operating systems which are painted above with a virtual layer to look similar.
The cluster may also be virtualized on various configurations as maintenance takes place; an example implementation is
Xen as the virtualization manager with
Linux-HA
The Linux-HA (High-Availability Linux) project provides a high-availability ( clustering) solution for Linux, FreeBSD, OpenBSD, Solaris and Mac OS X which promotes reliability, availability, and serviceability (RAS).Alan Robertson ''The Evolu ...
.
Data sharing and communication
Data sharing
As the computer clusters were appearing during the 1980s, so were
supercomputers. One of the elements that distinguished the three classes at that time was that the early supercomputers relied on
shared memory
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between progr ...
. To date clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it.
However, the use of a
clustered file system is essential in modern computer clusters. Examples include the
IBM General Parallel File System
GPFS (General Parallel File System, brand name IBM Spectrum Scale) is high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes, or a combination of these. It ...
, Microsoft's
Cluster Shared Volumes
Cluster Shared Volumes (CSV) is a feature of Failover Clustering first introduced in Windows Server 2008 R2 for use with the Hyper-V role. A Cluster Shared Volume is a shared disk containing an NTFS or ReFS (ReFS: Windows Server 2012 R2 or newer) ...
or the
Oracle Cluster File System.
Message passing and communication
Two widely used approaches for communication between cluster nodes are MPI (
Message Passing Interface) and PVM (
Parallel Virtual Machine
Parallel Virtual Machine (PVM) is a software tool for parallel networking of computers. It is designed to allow a network of heterogeneous Unix and/or Windows machines to be used as a single distributed parallel processor. Thus large computatio ...
).
PVM was developed at the
Oak Ridge National Laboratory
Oak Ridge National Laboratory (ORNL) is a U.S. multiprogram science and technology national laboratory sponsored by the U.S. Department of Energy (DOE) and administered, managed, and operated by UT–Battelle as a federally funded research an ...
around 1989 before MPI was available. PVM must be directly installed on every cluster node and provides a set of software libraries that paint the node as a "parallel virtual machine". PVM provides a run-time environment for message-passing, task and resource management, and fault notification. PVM can be used by user programs written in C, C++, or Fortran, etc.
MPI emerged in the early 1990s out of discussions among 40 organizations. The initial effort was supported by
ARPA and
National Science Foundation
The National Science Foundation (NSF) is an independent agency of the United States government that supports fundamental research and education in all the non-medical fields of science and engineering. Its medical counterpart is the National ...
. Rather than starting anew, the design of MPI drew on various features available in commercial systems of the time. The MPI specifications then gave rise to specific implementations. MPI implementations typically use
TCP/IP
The Internet protocol suite, commonly known as TCP/IP, is a framework for organizing the set of communication protocols used in the Internet and similar computer networks according to functional criteria. The foundational protocols in the suit ...
and socket connections.
MPI is now a widely available communications model that enables parallel programs to be written in languages such as
C,
Fortran,
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
, etc.
Thus, unlike PVM which provides a concrete implementation, MPI is a specification which has been implemented in systems such as
MPICH
MPICH, formerly known as MPICH2, is a freely available, portable implementation of MPI, a standard for message-passing for distributed-memory applications used in parallel computing. MPICH is Free and open source software with some public domain ...
and
Open MPI
Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI). It is used by many TOP500 supercomputers including Roadrunner, which was ...
.
Cluster management
One of the challenges in the use of a computer cluster is the cost of administrating it which can at times be as high as the cost of administrating N independent machines, if the cluster has N nodes.
In some cases this provides an advantage to
shared memory architecture
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between progr ...
s with lower administration costs.
This has also made
virtual machine
In computing, a virtual machine (VM) is the virtualization/ emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. Their implementations may involve specialized h ...
s popular, due to the ease of administration.
Task scheduling
When a large multi-user cluster needs to access very large amounts of data,
task scheduling
In computing, scheduling is the action of assigning ''resources'' to perform ''tasks''. The ''resources'' may be processors, network links or expansion cards. The ''tasks'' may be threads, processes or data flows.
The scheduling activity is ca ...
becomes a challenge. In a heterogeneous CPU-GPU cluster with a complex application environment, the performance of each job depends on the characteristics of the underlying cluster. Therefore, mapping tasks onto CPU cores and GPU devices provides significant challenges.
This is an area of ongoing research; algorithms that combine and extend
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
and
Hadoop
Apache Hadoop () is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage an ...
have been proposed and studied.
Node failure management
When a node in a cluster fails, strategies such as "
fencing
Fencing is a group of three related combat sports. The three disciplines in modern fencing are the foil, the épée, and the sabre (also ''saber''); winning points are made through the weapon's contact with an opponent. A fourth discipline, ...
" may be employed to keep the rest of the system operational.
Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods; one disables a node itself, and the other disallows access to resources such as shared disks.
The
STONITH method stands for "Shoot The Other Node In The Head", meaning that the suspected node is disabled or powered off. For instance, ''power fencing'' uses a power controller to turn off an inoperable node.
The ''resources fencing'' approach disallows access to resources without powering off the node. This may include ''persistent reservation fencing'' via the
SCSI3
Parallel SCSI (formally, SCSI Parallel Interface, or SPI) is the earliest of the interface implementations in the SCSI family. SPI is a parallel bus; there is one set of electrical connections stretching from one end of the SCSI bus to the oth ...
, fibre channel fencing to disable the
fibre channel port, or
global network block device (GNBD) fencing to disable access to the GNBD server.
Software development and administration
Parallel programming
Load balancing clusters such as web servers use cluster architectures to support a large number of users and typically each user request is routed to a specific node, achieving
task parallelism
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrent ...
without multi-node cooperation, given that the main goal of the system is providing rapid user access to shared data. However, "computer clusters" which perform complex computations for a small number of users need to take advantage of the parallel processing capabilities of the cluster and partition "the same computation" among several nodes.
Automatic parallelization
Automatic may refer to:
Music Bands
* Automatic (band), Australian rock band
* Automatic (American band), American rock band
* The Automatic, a Welsh alternative rock band
Albums
* ''Automatic'' (Jack Bruce album), a 1983 electronic rock ...
of programs remains a technical challenge, but
parallel programming model
In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its ''generality ...
s can be used to effectuate a higher
degree of parallelism via the simultaneous execution of separate portions of a program on different processors.
Debugging and monitoring
Developing and debugging parallel programs on a cluster requires parallel language primitives and suitable tools such as those discussed by the ''High Performance Debugging Forum'' (HPDF) which resulted in the HPD specifications.
Tools such as
TotalView were then developed to debug parallel implementations on computer clusters which use
Message Passing Interface (MPI) or
Parallel Virtual Machine
Parallel Virtual Machine (PVM) is a software tool for parallel networking of computers. It is designed to allow a network of heterogeneous Unix and/or Windows machines to be used as a single distributed parallel processor. Thus large computatio ...
(PVM) for message passing.
The
University of California, Berkeley
The University of California, Berkeley (UC Berkeley, Berkeley, Cal, or California) is a public land-grant research university in Berkeley, California. Established in 1868 as the University of California, it is the state's first land-grant u ...
''Network of Workstations'' (NOW) system gathers cluster data and stores them in a database, while a system such as PARMON, developed in India, allows visually observing and managing large clusters.
Application checkpointing
Checkpointing is a technique that provides fault tolerance for computing systems. It basically consists of saving a snapshot of the application's state, so that applications can restart from that point in case of failure. This is particularly im ...
can be used to restore a given state of the system when a node fails during a long multi-node computation.
This is essential in large clusters, given that as the number of nodes increases, so does the likelihood of node failure under heavy computational loads. Checkpointing can restore the system to a stable state so that processing can resume without needing to recompute results.
Implementations
The Linux world supports various cluster software; for application clustering, there is
distcc, and
MPICH
MPICH, formerly known as MPICH2, is a freely available, portable implementation of MPI, a standard for message-passing for distributed-memory applications used in parallel computing. MPICH is Free and open source software with some public domain ...
.
Linux Virtual Server
Linux Virtual Server (LVS) is load balancing software for Linux kernel–based operating systems.
LVS is a free and open-source project started by Wensong Zhang in May 1998, subject to the requirements of the GNU General Public License (GPL ...
,
Linux-HA
The Linux-HA (High-Availability Linux) project provides a high-availability ( clustering) solution for Linux, FreeBSD, OpenBSD, Solaris and Mac OS X which promotes reliability, availability, and serviceability (RAS).Alan Robertson ''The Evolu ...
- director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes.
MOSIX
MOSIX is a proprietary distributed operating system. Although early versions were based on older UNIX systems, since 1999 it focuses on Linux clusters and grids. In a MOSIX cluster/grid there is no need to modify or to link applications with an ...
,
LinuxPMI,
Kerrighed
Kerrighed is an open source single-system image (SSI) cluster software project. The project started in October 1998 at the Paris research group The French National Institute for Research in Computer Science and Control. From 2006 to 2011, the pro ...
,
OpenSSI
OpenSSI is an open-source single-system image clustering system. It allows a collection of computers to be treated as one large system, allowing applications running on any one machine access to the resources of all the machines in the cluster. ...
are full-blown clusters integrated into the
kernel
Kernel may refer to:
Computing
* Kernel (operating system), the central component of most operating systems
* Kernel (image processing), a matrix used for image convolution
* Compute kernel, in GPGPU programming
* Kernel method, in machine learn ...
that provide for automatic process migration among homogeneous nodes.
OpenSSI
OpenSSI is an open-source single-system image clustering system. It allows a collection of computers to be treated as one large system, allowing applications running on any one machine access to the resources of all the machines in the cluster. ...
,
openMosix
openMosix was a free cluster management system that provided single-system image (SSI) capabilities, e.g. automatic work distribution among nodes. It allowed program processes (not threads) to migrate to machines in the node's network that ...
and
Kerrighed
Kerrighed is an open source single-system image (SSI) cluster software project. The project started in October 1998 at the Paris research group The French National Institute for Research in Computer Science and Control. From 2006 to 2011, the pro ...
are
single-system image In distributed computing, a single system image (SSI) cluster is a cluster of machines that appears to be one single system. The concept is often considered synonymous with that of a distributed operating system, but a single image may be presented ...
implementations.
Microsoft Windows computer cluster Server 2003 based on the
Windows Server
Windows Server (formerly Windows NT Server) is a group of operating systems (OS) for servers that Microsoft has been developing since July 27, 1993. The first OS that was released for this platform was Windows NT 3.1 Advanced Server. With the r ...
platform provides pieces for High-Performance Computing like the Job Scheduler, MSMPI library and management tools.
gLite is a set of middleware technologies created by the
Enabling Grids for E-sciencE
In psychotherapy and mental health, enabling has a positive sense of empowering individuals, or a negative sense of encouraging dysfunctional behavior.[slurm
The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compu ...](_b ...<br></span></div> (EGEE) project.
<div class=)
is also used to schedule and manage some of the largest supercomputer clusters (see top500 list).
Other approaches
Although most computer clusters are permanent fixtures, attempts at
flash mob computing have been made to build short-lived clusters for specific computations. However, larger-scale
volunteer computing
Volunteer computing is a type of distributed computing in which people donate their computers' unused resources to a research-oriented project, and sometimes in exchange for credit points. The fundamental idea behind it is that a modern desktop co ...
systems such as
BOINC
The Berkeley Open Infrastructure for Network Computing (BOINC, pronounced – rhymes with "oink") is an open-source middleware system for volunteer computing (a type of distributed computing). Developed originally to support SETI@home, it beca ...
-based systems have had more followers.
See also
References
Further reading
*
*
*
*
*
External links
IEEE Technical Committee on Scalable Computing (TCSC)Tivoli System Automation WikiLarge-scale cluster management at Google with Borg April 2015, by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune and John Wilkes
{{DEFAULTSORT:Cluster (Computing)
Parallel computing
Concurrent computing
*Computer cluster
Local area networks
Classes of computers
Fault-tolerant computer systems
Server hardware