Supercomputer operating system

A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems toward some form of Linux, which ran on all of the supercomputers on the TOP500 list of November 2017. As of 2021, the top ten computers ran Red Hat Enterprise Linux (RHEL), some variant of it, or another Linux distribution such as Ubuntu.

Since modern massively parallel supercomputers typically separate computations from other services by using multiple types of nodes, they usually run different operating systems on different nodes, e.g., a small and efficient lightweight kernel such as Compute Node Kernel (CNK) or Compute Node Linux (CNL) on compute nodes, but a larger system such as a Linux derivative on server and input/output (I/O) nodes.''An Evaluation of the Oak Ridge National Laboratory Cray XT3'' by Sadaf R. Alam et al., International Journal of High Performance Computing Applications, February 2008, vol. 22, no. 1, pages 52–80.

While in a traditional multi-user computer system job scheduling is in effect a tasking problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources, as well as gracefully deal with the inevitable hardware failures when tens of thousands of processors are present.''Open Job Management Architecture for the Blue Gene/L Supercomputer'' by Yariv Aridor et al. in ''Job Scheduling Strategies for Parallel Processing'' by Dror G. Feitelson, 2005, pages 95–101.

Although most modern supercomputers use the Linux operating system, each manufacturer has made its own specific changes to the Linux derivative it uses, and no industry standard exists, partly because differences in hardware architectures require changes to optimize the operating system to each hardware design.


Context and overview

In the early days of supercomputing, the basic architectural concepts were evolving rapidly, and system software had to follow hardware innovations that usually took rapid turns. In the early systems, operating systems were custom tailored to each supercomputer to gain speed, yet in the rush to develop them, serious software quality challenges surfaced, and in many cases the cost and complexity of system software development became as much of an issue as that of the hardware. In the 1980s the cost of software development at Cray came to equal what was spent on hardware, and that trend was partly responsible for a move away from in-house operating systems toward the adaptation of generic software.''Knowing Machines: Essays on Technical Change'' by Donald MacKenzie, 1998, pages 149–151.

The first wave of operating system changes came in the mid-1980s, as vendor-specific operating systems were abandoned in favor of Unix. Despite early skepticism, this transition proved successful. By the early 1990s, major changes were occurring in supercomputing system software.''Encyclopedia of Parallel Computing'' by David Padua, 2011, pages 426–429. By this time, the growing use of Unix had begun to change the way system software was viewed. The use of a high-level language (C) to implement the operating system, and the reliance on standardized interfaces, stood in contrast to the assembly-language-oriented approaches of the past. As hardware vendors adapted Unix to their systems, new and useful features were added to Unix, e.g., fast file systems and tunable process schedulers. However, all the companies that adapted Unix made unique changes to it, rather than collaborating on an industry standard to create a "Unix for supercomputers", partly because differences in their architectures required these changes to optimize Unix to each architecture.

Thus, as general-purpose operating systems became stable, supercomputers began to borrow and adapt critical system code from them, relying on the rich set of secondary functions that came with them rather than reinventing the wheel. At the same time, however, the size of the code base of general-purpose operating systems was growing rapidly; by the time Unix-based code had reached 500,000 lines, its maintenance and use was a challenge. This resulted in a move toward microkernels, which use a minimal set of operating system functions. Systems such as Mach at Carnegie Mellon University and ChorusOS at INRIA were examples of early microkernels.

The separation of the operating system into separate components became necessary as supercomputers developed different types of nodes, e.g., compute nodes versus I/O nodes. Thus modern supercomputers usually run different operating systems on different nodes, e.g., a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but a larger system such as a Linux derivative on server and I/O nodes.


Early systems

The CDC 6600, generally considered the first supercomputer in the world, ran the Chippewa Operating System, which was then deployed on various other CDC 6000 series computers.''The Computer Revolution in Canada'' by John N. Vardalas, 2001, page 258. The Chippewa was a rather simple job-control-oriented system derived from the earlier CDC 3000, but it influenced the later KRONOS and SCOPE systems.

The first Cray-1 was delivered to the Los Alamos Lab with no operating system, or any other software.''Targeting the Computer: Government Support and International Competition'' by Kenneth Flamm, 1987, pages 81–83. Los Alamos developed the application software for it, as well as the operating system. The main timesharing system for the Cray-1, the Cray Time Sharing System (CTSS), was then developed at the Livermore Labs as a direct descendant of the Livermore Time Sharing System (LTSS) for the CDC 6600 operating system from twenty years earlier.

In developing supercomputers, rising software costs soon became dominant, as evidenced by the cost of software development at Cray in the 1980s growing to equal what was spent on hardware. That trend was partly responsible for the move away from the in-house Cray Operating System to the Unix-based UNICOS system. In 1985, the Cray-2 was the first system to ship with the UNICOS operating system.Lester T. Davis, ''The Balance of Power, a Brief History of Cray Research Hardware Architectures'' in ''High Performance Computing: Technology, Methods, and Applications'' by J. J. Dongarra, 1995, page 12.

Around the same time, the EOS operating system was developed by ETA Systems for use in their ETA10 supercomputers.Lloyd M. Thorndyke, ''The Demise of the ETA Systems'' in ''Frontiers of Supercomputing II'' by Karyn R. Ames and Alan Brenner, 1994, pages 489–497. Written in Cybil, a Pascal-like language from Control Data Corporation, EOS highlighted the difficulty of developing stable operating systems for supercomputers, and eventually a Unix-like system was offered on the same machine. The lessons learned from developing ETA system software included the high level of risk associated with developing a new supercomputer operating system and the advantages of using Unix, with its large extant base of system software libraries.

By the mid-1990s, despite the existing investment in older operating systems, the trend was toward the use of Unix-based systems, which also facilitated the use of interactive graphical user interfaces (GUIs) for scientific computing across multiple platforms. The move toward a ''commodity OS'' had opponents, who cited the fast pace and focus of Linux development as a major obstacle to adoption; as one author wrote, "Linux will likely catch up, but we have large-scale systems now". Nevertheless, that trend continued to gain momentum, and by 2005 virtually all supercomputers used some Unix-like OS.''Getting Up to Speed: The Future of Supercomputing'' by Susan L. Graham, Marc Snir, Cynthia A. Patterson, National Research Council, 2005, page 136. These variants of Unix included IBM AIX, the open-source Linux system, and other adaptations such as UNICOS from Cray. By the end of the 20th century, Linux was estimated to command the highest share of the supercomputing pie.


Modern approaches

The IBM Blue Gene supercomputer uses the CNK operating system on the compute nodes, but uses a modified Linux-based kernel called the I/O Node Kernel (INK) on the I/O nodes.''Euro-Par 2004 Parallel Processing: 10th International Euro-Par Conference'', 2004, by Marco Danelutto, Marco Vanneschi and Domenico Laforenza, page 835.''Euro-Par 2006 Parallel Processing: 12th International Euro-Par Conference'', 2006, by Wolfgang E. Nagel, Wolfgang V. Walter and Wolfgang Lehner. CNK is a lightweight kernel that runs on each node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal: physical memory is statically mapped, and CNK neither needs nor provides scheduling or context switching. CNK does not even implement file I/O on the compute node, but delegates that to dedicated I/O nodes. Given that on the Blue Gene multiple compute nodes share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of a Linux-based operating system.
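This split can be made concrete with a toy simulation of the "function shipping" of I/O. The sketch below is only an illustration of the idea, not Blue Gene's actual protocol or API: every class and name in it is hypothetical, and an in-process queue stands in for the network linking compute nodes to their shared I/O node.

import queue
import threading

class IONodeDaemon(threading.Thread):
    """Stands in for the I/O node's Linux-based kernel: it multiplexes
    requests from many compute nodes, which is why it must multi-task."""
    def __init__(self, requests):
        super().__init__(daemon=True)
        self.requests = requests

    def run(self):
        while True:
            node_id, path, data, reply = self.requests.get()
            with open(path, "a") as f:    # the only place real file I/O happens
                written = f.write(data)
            reply.put(written)            # ship the result back to the caller

class ComputeNodeStub:
    """Stands in for the lightweight kernel: write() is not implemented
    locally but forwarded ("shipped") to the shared I/O node."""
    def __init__(self, node_id, requests):
        self.node_id = node_id
        self.requests = requests

    def write(self, path, data):
        reply = queue.Queue(maxsize=1)
        self.requests.put((self.node_id, path, data, reply))
        return reply.get()                # block until the I/O node answers

requests = queue.Queue()
IONodeDaemon(requests).start()
for i in range(4):                        # four compute nodes share one I/O node
    ComputeNodeStub(i, requests).write("job.out", f"hello from node {i}\n")

Because each compute-node write() simply blocks on a reply, the stub side needs no scheduler at all, while the daemon serving many nodes at once is precisely the part that needs a multi-tasking kernel.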
While in traditional multi-user computer systems and early supercomputers job scheduling was in effect a task scheduling problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources. It is essential to tune task scheduling, and the operating system itself, for the different configurations of a supercomputer. A typical parallel job scheduler has a master scheduler which instructs some number of slave schedulers to launch, monitor, and control parallel jobs, and periodically receives reports from them about the status of job progress.
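As a minimal sketch of that master/slave arrangement, under assumed, hypothetical names and with simple round-robin dispatch standing in for a real placement policy:

from dataclasses import dataclass, field

@dataclass
class SlaveScheduler:
    """Controls one group of nodes: launches jobs there and reports progress."""
    partition: str
    running: dict = field(default_factory=dict)

    def launch(self, job_id, n_nodes):
        self.running[job_id] = {"nodes": n_nodes, "done": False}

    def report(self):
        # A real slave would poll the processes it launched; here every
        # job is simply reported as finished.
        for state in self.running.values():
            state["done"] = True
        return {job_id: s["done"] for job_id, s in self.running.items()}

class MasterScheduler:
    """Hands each submitted job to one slave scheduler, then periodically
    gathers status reports from all of them."""
    def __init__(self, slaves):
        self.slaves = list(slaves)
        self.turn = 0

    def submit(self, job_id, n_nodes):
        self.slaves[self.turn % len(self.slaves)].launch(job_id, n_nodes)
        self.turn += 1

    def poll(self):
        return {s.partition: s.report() for s in self.slaves}

master = MasterScheduler([SlaveScheduler("p0"), SlaveScheduler("p1")])
for i, size in enumerate([128, 256, 64]):
    master.submit(f"job{i}", size)
print(master.poll())  # {'p0': {'job0': True, 'job2': True}, 'p1': {'job1': True}}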
Some, but not all, supercomputer schedulers attempt to maintain locality of job execution. The PBS Pro scheduler used on the Cray XT3 and Cray XT4 systems does not attempt to optimize locality on their three-dimensional torus interconnect, but simply uses the first available processor. On the other hand, IBM's scheduler on the Blue Gene supercomputers aims to exploit locality and minimize network contention by assigning tasks from the same application to one or more midplanes of an 8×8×8 node group.''Job Scheduling Strategies for Parallel Processing'' by Eitan Frachtenberg and Uwe Schwiegelshohn, 2010, pages 138–144.
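The contrast between the two policies can be sketched with a toy model; every name in it is hypothetical, and only the 8×8×8 (i.e. 512-node) midplane grouping is taken from the description above:

MIDPLANE_NODES = 8 * 8 * 8          # one midplane: an 8x8x8 group of 512 nodes

def first_available(free_nodes, n_tasks):
    """PBS-Pro-style placement: take the first free nodes in numeric order,
    regardless of which midplane they sit on."""
    return free_nodes[:n_tasks]

def midplane_packed(free_nodes, n_tasks):
    """Blue-Gene-style placement: fill the emptiest midplanes first, so one
    application's tasks stay together and network contention is reduced."""
    by_midplane = {}
    for node in free_nodes:
        by_midplane.setdefault(node // MIDPLANE_NODES, []).append(node)
    chosen = []
    for nodes in sorted(by_midplane.values(), key=len, reverse=True):
        chosen.extend(nodes[:n_tasks - len(chosen)])
        if len(chosen) == n_tasks:
            break
    return chosen

# Four midplanes; every second node of midplane 0 is already busy.
free = [n for n in range(4 * MIDPLANE_NODES)
        if n >= MIDPLANE_NODES or n % 2 == 1]
spans = lambda alloc: sorted({n // MIDPLANE_NODES for n in alloc})
print(spans(first_available(free, 300)))  # [0, 1]: job scattered over two midplanes
print(spans(midplane_packed(free, 300)))  # [1]: job packed into one midplane

In this toy example the first-available policy scatters a 300-task job across two midplanes, while the packing policy keeps it within a single one.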
The Slurm Workload Manager scheduler uses a best-fit algorithm and performs Hilbert curve scheduling, which turns the multidimensional task allocation problem into a one-dimensional space-filling problem, to optimize the locality of task assignments. Several modern supercomputers, such as the Tianhe-2, use Slurm, which arbitrates contention for resources across the system. Slurm is open source, Linux-based, very scalable, and can manage thousands of nodes in a computer cluster with a sustained throughput of over 100,000 jobs per hour.Jette, M. and M. Grondona, ''SLURM: Simple Linux Utility for Resource Management'', in the Proceedings of the ClusterWorld Conference, San Jose, California, June 2003.
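The locality principle behind Hilbert curve scheduling can be illustrated with a small sketch. Slurm's actual implementation handles three-dimensional topologies and many further constraints; the toy version below simply orders the nodes of a two-dimensional 8×8 mesh along a Hilbert curve and then best-fits a job into the smallest contiguous run of free curve positions, so the chosen nodes end up clustered on the mesh:

def hilbert_index(n, x, y):
    """Position of grid point (x, y) along the Hilbert curve of an n-by-n
    grid (n a power of two); nearby indices are nearby on the grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def best_fit_run(free_sorted, k):
    """Split the free curve positions into maximal contiguous runs and pick
    the smallest run that still fits k tasks (best fit)."""
    runs, start = [], 0
    for i in range(1, len(free_sorted) + 1):
        if i == len(free_sorted) or free_sorted[i] != free_sorted[i - 1] + 1:
            runs.append(free_sorted[start:i])
            start = i
    fitting = [r for r in runs if len(r) >= k]
    return min(fitting, key=len)[:k] if fitting else None

n = 8
pos = {hilbert_index(n, x, y): (x, y) for x in range(n) for y in range(n)}
busy = {4, 5, 32, 33, 34}                 # curve positions already allocated
free = sorted(d for d in pos if d not in busy)
alloc = best_fit_run(free, 4)             # -> [0, 1, 2, 3], the tightest run
print([pos[d] for d in alloc])            # four neighbouring nodes in a 2x2 block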


See also

* Distributed operating system
* Supercomputer architecture
* Usage share of supercomputer operating systems

