Lustre is a type of parallel
distributed file system
, generally used for large-scale
cluster computing
. The name Lustre is a
portmanteau word
derived from
Linux
and
cluster
. Lustre file system software is available under the
GNU General Public License
(version 2 only) and provides high performance file systems for computer clusters ranging in size from small
workgroup clusters to large-scale, multi-site systems. Since June 2005, Lustre has consistently been used by at least half of the top ten, and more than 60 of the top 100 fastest
supercomputer
s in the world,
including the world's No. 1 ranked
TOP500
supercomputer in November 2022,
Frontier
,
as well as previous top supercomputers such as
Fugaku,
Titan
and
Sequoia.
Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of
client
nodes, hundreds of
petabyte
s (PB) of storage on hundreds of servers, and tens of
terabyte
s per second (TB/s) of aggregate
I/O throughput
. This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as
meteorology
,
simulation
,
artificial intelligence
and
machine learning
, oil and gas,
life science
,
rich media
, and finance. The I/O performance of Lustre has widespread impact on these applications and has attracted broad attention.
History
The Lustre file system architecture was started as a research project in 1999 by
Peter J. Braam, who was a staff member of
Carnegie Mellon University
(CMU) at the time. Braam went on to found his own company Cluster File Systems in 2001, starting from work on the
InterMezzo file system in the
Coda project at CMU.
Lustre was developed under the
Accelerated Strategic Computing Initiative Path Forward project funded by the
United States Department of Energy
, which included
Hewlett-Packard
and
Intel
.
In September 2007,
Sun Microsystems
acquired the assets of Cluster File Systems Inc. including its "intellectual property".
Sun included Lustre with its
high-performance computing
hardware offerings, with the intent to bring Lustre technologies to Sun's
ZFS
file system and the
Solaris
operating system
. In November 2008, Braam left Sun Microsystems, and Eric Barton and Andreas Dilger took control of the project.
In 2010
Oracle Corporation
, by way of its acquisition of Sun, began to manage and release Lustre.
In December 2010, Oracle announced that they would cease Lustre 2.x development and place Lustre 1.8 into maintenance-only support, creating uncertainty around the future development of the file system.
Following this announcement, several new organizations sprang up to provide support and development in an open community development model, including Whamcloud,
Open Scalable File Systems, Inc. (OpenSFS), European Open File Systems (EOFS) and others. By the end of 2010, most Lustre developers had left Oracle. Braam and several associates joined the hardware-oriented
Xyratex when it acquired the assets of ClusterStor,
while Barton, Dilger, and others formed software startup Whamcloud, where they continued to work on Lustre.
In August 2011,
OpenSFS
awarded a contract for Lustre feature development to Whamcloud. This contract covered the completion of features, including improved Single Server Metadata Performance scaling, which allows Lustre to better take advantage of many-core metadata servers; online Lustre distributed filesystem checking (LFSCK), which allows verification of the distributed filesystem state between data and metadata servers while the filesystem is mounted and in use; and Distributed Namespace Environment (DNE), formerly Clustered Metadata (CMD), which allows the Lustre metadata to be distributed across multiple servers. Development also continued on ZFS-based back-end
object storage
at
Lawrence Livermore National Laboratory
.
These features were in the Lustre 2.2 through 2.4 community release roadmap.
In November 2011, a separate contract was awarded to Whamcloud for the maintenance of the Lustre 2.x source code to ensure that the Lustre code would receive sufficient testing and bug fixing while new features were being developed.
In July 2012 Whamcloud was acquired by
Intel
, after Whamcloud won the FastForward DOE contract to prepare Lustre for use with
exascale computing
systems in the 2018 timeframe.
OpenSFS
then transitioned contracts for Lustre development to Intel.
In February 2013, Xyratex Ltd. announced that it had acquired the original Lustre trademark, logo, website and associated
intellectual property
from Oracle.
In June 2013, Intel began expanding Lustre usage beyond traditional HPC, such as within
Hadoop
. For 2013 as a whole,
OpenSFS
announced requests for proposals (RFPs) to cover Lustre feature development,
parallel file system
tools, addressing Lustre technical debt, and parallel file system incubators.
OpenSFS
also established the Lustre Community Portal, a technical site that provides a collection of information and documentation in one area for reference and guidance to support the Lustre
open source
community. On April 8, 2014, Ken Claffey announced that Xyratex/Seagate was donating the
lustre.org domain back to the user community, and this was completed in March 2015.
In June 2018, the Lustre team and assets were acquired from Intel by
DDN. DDN organized the new acquisition as an independent division, reviving the Whamcloud name for the new division.
In November 2019,
OpenSFS
and EOFS announced at the
SC19 Lustre BOF that the Lustre trademark had been transferred to them jointly from
Seagate.
Release history
Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at the
Lawrence Livermore National Laboratory
, the third-largest supercomputer in the Top500 list at the time.
Lustre 1.0.0 was released in December 2003,
and provided basic Lustre filesystem functionality, including server failover and recovery.
Lustre 1.2.0, released in March 2004, worked on
Linux kernel
2.6, and had a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side data write-back cache accounting (grant).
Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, could use
InfiniBand
networks, and could exploit extents/mballoc in the
ldiskfs on-disk filesystem.
Lustre 1.6.0, released in April 2007, introduced mount configuration ("mountconf"), which allows servers to be configured with "mkfs" and "mount", allowed dynamic addition of ''object storage targets'' (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on
symmetric multiprocessing
(SMP) servers, and provided free space management for object allocations.
Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, added basic heterogeneous storage management via OST Pools, adaptive network timeouts, and version-based recovery. It was a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0.
Lustre 2.0, released in August 2010, was based on significantly restructured internal code to prepare for major architectural advancements. Lustre 2.x ''clients'' cannot interoperate with 1.8 or earlier ''servers''. However, Lustre 1.8.6 and later clients can interoperate with Lustre 2.0 and later servers. The Metadata Target (MDT) and OST on-disk format from 1.8 can be upgraded to 2.0 and later without the need to reformat the filesystem.
Lustre 2.1, released in September 2011, was a community-wide initiative in response to Oracle suspending development on Lustre 2.x releases. It added the ability to run servers on
Red Hat Linux
6 and increased the maximum ext4-based OST size from 24 TB to 128 TB, as well as a number of performance and stability improvements. Lustre 2.1 servers remained inter-operable with 1.8.6 and later clients.
Lustre 2.2, released in March 2012, focused on providing metadata performance improvements and new features. It added parallel directory operations allowing multiple clients to traverse and modify a single large directory concurrently, faster recovery from server failures, increased stripe counts for a single file (across up to 2000 OSTs), and improved single-client directory traversal performance.
Lustre 2.3, released in October 2012, continued to improve the metadata server code to remove internal locking bottlenecks on nodes with many CPU cores (over 16). The object store added a preliminary ability to use
ZFS
as the backing file system. The Lustre File System ChecK (LFSCK) feature can verify and repair the MDS Object Index (OI) while the file system is in use, after a file-level backup/restore or in case of MDS corruption. The server-side IO statistics were enhanced to allow integration with batch job schedulers such as
SLURM to track per-job statistics. Client-side software was updated to work with Linux kernels up to version 3.0.
Lustre 2.4, released in May 2013, added a considerable number of major features, many funded directly through
OpenSFS
. Distributed Namespace Environment (DNE) allows horizontal metadata capacity and performance scaling for 2.4 clients, by allowing subdirectory trees of a single namespace to be located on separate MDTs.
ZFS
can now be used as the backing filesystem for both MDT and OST storage. The LFSCK feature added the ability to scan and verify the internal consistency of the MDT FID and LinkEA attributes. The Network Request Scheduler
(NRS) adds policies to optimize client request processing for disk ordering or fairness. Clients can optionally send bulk RPCs up to 4 MB in size. Client-side software was updated to work with Linux kernels up to version 3.6, and is still interoperable with 1.8 clients.
Lustre 2.5, released in October 2013, added the highly anticipated feature,
Hierarchical Storage Management
(HSM). A core requirement in enterprise environments, HSM allows customers to easily implement tiered storage solutions in their operational environment. This release is the current OpenSFS-designated Maintenance Release branch of Lustre.
The most recent maintenance version is 2.5.3 and was released in September 2014.
Lustre 2.6, released in July 2014,
was a more modest release feature-wise, adding LFSCK functionality to do local consistency checks on the OST as well as consistency checks between MDT and OST objects. The NRS Token Bucket Filter
(TBF) policy was added. Single-client IO performance was improved over the previous releases. This release also added a preview of DNE striped directories, allowing single large directories to be stored on multiple MDTs to improve performance and scalability.
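The idea behind a token bucket filter can be illustrated with a short sketch. This is the generic token-bucket algorithm, not Lustre's NRS implementation; the class name `TokenBucket` and the parameters `rate` and `burst` are assumptions made for illustration. A real scheduler would typically hold a request back until a token becomes available, whereas this sketch only reports whether a request could be dispatched at a given time.

```python
class TokenBucket:
    """Generic token bucket (illustrative only, not Lustre code).

    rate  -- tokens added per second (hypothetical parameter)
    burst -- maximum number of tokens the bucket can hold (hypothetical)
    """

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if one request may be dispatched at time `now`."""
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # each dispatched request costs one token
            return True
        return False


# Usage: a class of requests limited to 2 requests/second with a burst of 2.
bucket = TokenBucket(rate=2, burst=2)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
```

Time is passed in explicitly so that the behaviour is deterministic; a scheduler would normally keep one bucket per class of requests and read a monotonic clock.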
Lustre 2.7, released in March 2015,
added LFSCK functionality to verify DNE consistency of remote and striped directories between multiple MDTs. Dynamic LNet Config adds the ability to configure and modify LNet network interfaces, routes, and routers at runtime. A new evaluation feature was added for
UID/
GID mapping for clients with different administrative domains, along with improvements to the DNE striped directory functionality.
Lustre 2.8, released in March 2016,
finished the DNE striped directory feature, including support for migrating directories between MDTs, and cross-MDT
hard link
and rename. As well, it included improved support for
Security-Enhanced Linux
(
SELinux
) on the client,
Kerberos authentication and RPC encryption over the network, and performance improvements for LFSCK.
Lustre 2.9 was released in December 2016
and included a number of features related to security and performance. The Shared Secret Key security flavour uses the same
GSSAPI
mechanism as Kerberos to provide client and server node authentication, and RPC message integrity and security (encryption). The Nodemap feature allows categorizing client nodes into groups and then mapping the UID/GID for those clients, allowing remotely administered clients to transparently use a shared filesystem without having a single set of UID/GIDs for all client nodes. The subdirectory mount feature allows clients to mount a subset of the filesystem namespace from the MDS. This release also added support for up to 16 MiB RPCs for more efficient I/O submission to disk, and added the
ladvise
interface to allow clients to provide I/O hints to the servers to prefetch file data into server cache or flush file data from server cache. There was improved support for specifying filesystem-wide default OST pools, and improved inheritance of OST pools in conjunction with other file layout parameters.
Lustre 2.10 was released in July 2017
and has a number of significant improvements. The
LNet Multi-Rail (LMR) feature allows bonding multiple network interfaces (
InfiniBand
,
Omni-Path, and/or
Ethernet
) on a client and server to increase aggregate I/O bandwidth. Individual files can use composite file layouts that are constructed of multiple components. Each component is a file region defined by file offset and can have its own layout parameters, such as stripe count and OST pool/storage type.
Progressive File Layout (PFL) is the first feature to use composite layouts, but the implementation is flexible for use with other file layouts such as mirroring and erasure coding. The NRS Token Bucket Filter (TBF) server-side scheduler has implemented new rule types, including RPC-type scheduling and the ability to specify multiple parameters such as JobID and NID for rule matching. Tools for managing ZFS snapshots of Lustre filesystems have been added, to simplify the creation, mounting, and management of MDT and OST ZFS snapshots as separate Lustre
mountpoints.
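The composite layout described above can be pictured as an ordered list of offset ranges, each with its own striping parameters. The following is a minimal sketch of that idea, not Lustre's actual data structures or mapping code. The names `Component`, `find_component`, and `stripe_index` are hypothetical, and computing the stripe index relative to the start of the component is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

MIB = 1024 * 1024


@dataclass(frozen=True)
class Component:
    """One region of a composite layout (all fields are illustrative)."""
    start: int            # first byte offset covered (inclusive)
    end: Optional[int]    # first byte offset not covered; None means end of file
    stripe_count: int     # number of objects the region is striped over
    stripe_size: int      # bytes written to one object before moving to the next


def find_component(layout: List[Component], offset: int) -> Component:
    """Return the component whose offset range contains `offset`."""
    for comp in layout:
        if offset >= comp.start and (comp.end is None or offset < comp.end):
            return comp
    raise ValueError(f"offset {offset} is not covered by the layout")


def stripe_index(comp: Component, offset: int) -> int:
    """Round-robin striping inside one component (assumed relative to its start)."""
    return ((offset - comp.start) // comp.stripe_size) % comp.stripe_count


# A progressive layout: small files stay on one object, larger files spread out.
layout = [
    Component(start=0, end=4 * MIB, stripe_count=1, stripe_size=MIB),
    Component(start=4 * MIB, end=64 * MIB, stripe_count=4, stripe_size=MIB),
    Component(start=64 * MIB, end=None, stripe_count=16, stripe_size=4 * MIB),
]

comp = find_component(layout, 6 * MIB)
print(comp.stripe_count, stripe_index(comp, 6 * MIB))
```

The design point is that the stripe count grows with file size, so a small file touches a single object while the tail of a large file is spread across many.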
Lustre 2.11 was released in April 2018
and contains two significant new features and several smaller ones. The
File Level Redundancy (FLR) feature expands on the 2.10 PFL implementation, adding the ability to specify
mirrored file layouts for improved availability in case of storage or server failure and/or improved performance with highly concurrent reads. The
Data-on-MDT (DoM) feature allows small (few MiB) files to be stored on the MDT to leverage typical flash-based
RAID-10 storage for lower latency and reduced IO contention, instead of the typical
HDD RAID-6 storage used on OSTs. As well, the LNet Dynamic Discovery feature allows auto-configuration of LNet Multi-Rail between peers that share an LNet network. The LDLM Lock Ahead feature allows appropriately modified applications and libraries to pre-fetch DLM extent locks from the OSTs for files, if the application knows (or predicts) that this file extent will be modified in the near future, which can reduce
lock contention for multiple clients writing to the same file.
Lustre 2.12 was released on December 21, 2018
and focused on improving Lustre usability and stability, with improvements to the performance and functionality of the FLR and DoM features added in Lustre 2.11, as well as smaller changes to NRS
TBF,
HSM, and JobStats. It added
LNet Network Health to allow the LNet Multi-Rail feature from Lustre 2.10 to better handle network faults when a node has multiple network interfaces. The Lazy Size on MDT (LSOM) feature allows storing an estimate of the file size on the MDT for use by policy engines, filesystem scanners, and other management tools that can make decisions about files more efficiently without fully accurate file sizes or block counts, and so without having to query the OSTs for this information. This release also added the ability to manually ''restripe'' an existing directory across multiple MDTs, to allow migration of directories with large numbers of files to use the capacity and performance of several MDS nodes. The Lustre RPC data checksum added
SCSI T10-PI integrated data checksums from the client to the kernel block layer, SCSI
host adapter
, and T10-enabled
hard drive
s.
Lustre 2.13 was released on December 5, 2019
and added the new performance-related features Persistent Client Cache (PCC), which allows direct use of
NVMe
and
NVRAM
storage on the client nodes while keeping the files part of the global filesystem namespace, and OST Overstriping which allows files to store multiple stripes on a single OST to better utilize fast OSS hardware. As well, the LNet Multi-Rail Network Health functionality was improved to work with LNet RDMA router nodes. The PFL functionality was enhanced with Self-Extending Layouts (SEL) to allow file components to be dynamically sized, to better deal with flash OSTs that may be much smaller than disk OSTs within the same filesystem. The release also included a number of smaller improvements, such as balancing DNE remote directory creation across MDTs, using Lazy-size-on-MDT to reduce the overhead of "lfs find", directories with 10M files per shard for ldiskfs, and bulk RPC sizes up to 64 MB.
Lustre 2.14 was released on February 19, 2021 and includes three main features. Client Data Encryption implements fscrypt to allow file data to be encrypted on the client before network transfer and persistent storage on the OST and MDT. OST Pool Quotas extends the quota framework to allow the assignment and enforcement of quotas on the basis of OST storage pools. DNE Auto Restriping can now adjust how many MDTs a large directory is striped over based on size thresholds defined by the administrator, similar to Progressive File Layouts for directories.
Lustre 2.15 was released on June 16, 2022 and includes three main features. Client Directory Encryption expands on the fscrypt data encryption in the 2.14 release to also allow file and directory names to be encrypted on the client before network transfer and persistent storage on the MDT. DNE MDT space balancing automatically balances new directory creation across MDTs in the filesystem in round-robin order and/or based on available inodes and space, which in turn helps distribute client metadata workload over MDTs more evenly. For applications using the NVIDIA GPU Direct Storage interface (GDS), the Lustre client can do zero-copy RDMA read and write from the storage server directly into the GPU memory to avoid an extra data copy from CPU memory and extra processing overhead. User Defined Selection Policy (UDSP) allows setting interface selection policies for nodes with multiple network interfaces.
Lustre 2.16 was released on November 8, 2024 and includes three main features. Large network addressing support allows IPv6 and potentially other large address formats, such as InfiniBand GUIDs, to be used for client and server node addressing, in addition to the standard IPv4
addresses. The Unaligned and Hybrid Direct IO feature improves performance for applications doing large buffered and direct read/write operations by avoiding overhead in the client page cache. The Optimized Directory Traversal (batched statahead) feature improves application workloads that traverse directory hierarchies and access file attributes in a systematic access pattern by prefetching file attributes in parallel from the MDS(es) using bulk RPCs.
Architecture
A Lustre file system has three major functional units:
* One or more ''metadata server (MDS)'' nodes that have one or more ''metadata target (MDT)'' devices per Lustre filesystem, which store namespace metadata such as filenames, directories, access permissions, and file layout. The MDT data is stored in a local disk filesystem. However, unlike block-based distributed filesystems, such as GPFS and PanFS, where the metadata server controls all of the block allocation, the Lustre metadata server is only involved in pathname and permission checks, and is not involved in any file I/O operations, avoiding scalability bottlenecks on the metadata server. The ability to have multiple MDTs in a single filesystem was introduced in Lustre 2.4, and allows directory subtrees to reside on the secondary MDTs, while 2.7 and later allow large single directories to be distributed across multiple MDTs as well.
* One or more ''object storage server (OSS)'' nodes that store file data on one or more ''object storage target (OST)'' devices. Depending on the server's hardware, an OSS typically serves between two and eight OSTs, with each OST managing a single local disk filesystem. The capacity of a Lustre file system is the sum of the capacities provided by the OSTs.
* ''Client(s)'' that access and use the data. Lustre presents all clients with a unified namespace for all of the files and data in the filesystem, using standard POSIX semantics, and allows concurrent and coherent read and write access to the files in the filesystem.
The MDT, OST, and client may be on the same node (usually for testing purposes), but in typical production installations these devices are on separate nodes communicating over a network. Each MDT and OST may be part of only a single filesystem, though it is possible to have multiple MDTs or OSTs on a single node that are part of different filesystems. The ''Lustre Network (LNet)'' layer can use several types of network interconnects, including native InfiniBand verbs, Omni-Path, RoCE, and iWARP via OFED, TCP/IP on Ethernet, and other proprietary network technologies such as the Cray Gemini interconnect. In Lustre 2.3 and earlier, Myrinet, Quadrics, Cray SeaStar and RapidArray networks were also supported, but these network drivers were deprecated when these networks were no longer commercially available, and support was removed completely in Lustre 2.8. Lustre will take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is normally provided by hardware RAID devices, though it will work with any block device. Since Lustre 2.4, the MDT and OST can also use ZFS for the backing filesystem in addition to ext4, allowing them to effectively use JBOD storage instead of hardware RAID devices. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by the backing filesystem and return this data to the clients. This allows Lustre to take advantage of improvements and features in the underlying filesystem, such as compression and data checksums in ZFS. Clients do not have any direct access to the underlying storage, which ensures that a malfunctioning or malicious client cannot corrupt the filesystem structure.
An OST is a dedicated filesystem that exports an interface to byte ranges of file objects for read/write operations, with extent locks to protect data consistency. An MDT is a dedicated filesystem that stores inodes, directories, POSIX and extended file attributes, controls file access permissions/ACLs, and tells clients the layout of the object(s) that make up each regular file. MDTs and OSTs currently use either an enhanced version of ext4 called ''ldiskfs'', or ZFS/DMU (using the open source ZFS-on-Linux port), as the back-end storage for files/objects.
The client mounts the Lustre filesystem locally with a VFS driver for the Linux kernel that connects the client to the server(s). Upon initial mount, the client is provided a File Identifier (FID) for the root directory of the mountpoint. When the client accesses a file, it performs a filename lookup on the MDS. When the MDS filename lookup is complete and the user and client have permission to access and/or create the file, either the layout of an existing file is returned to the client or a new file is created on behalf of the client, if requested. For read or write operations, the client then interprets the file layout in the ''logical object volume (LOV)'' layer, which maps the file logical offset and size to one or more objects. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes that hold the data objects. With this approach, bottlenecks for client-to-OSS communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
After the initial lookup of the file layout, the MDS is not normally involved in file IO operations since all block allocation and data IO is managed internally by the OST. Clients do not directly modify the objects or data on the OST filesystems, but instead delegate this task to OSS nodes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as GPFS and OCFS allow direct access to the underlying storage by all of the clients in the filesystem, which requires a large back-end SAN attached to all clients, and increases the risk of filesystem corruption from misbehaving/defective clients.
Implementation
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the Blue Gene installation at Lawrence Livermore National Laboratory.
Another approach used in the early years of Lustre was the ''liblustre'' library on the Cray XT3 using the Catamount operating system on systems such as Sandia Red Storm, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allowed computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched was not a Linux client. Liblustre allowed data movement directly between application space and the Lustre OSSs without requiring an intervening data copy through the kernel, thus providing access from computational processors to the Lustre file system directly in a constrained operating environment. The liblustre functionality was deleted from Lustre 2.7.0 after having been disabled since Lustre 2.6.0, and had been untested since Lustre 2.3.0.
In Linux kernel version 4.18, the incomplete port of the Lustre client was removed from the kernel staging area in order to speed up development and porting to newer kernels. The out-of-tree Lustre client and server are still available for RHEL, SLES, and Ubuntu distro kernels, as well as vanilla kernels.
Data objects and file striping
In a traditional Unix disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks. These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object identifiers and their layout from the MDS to the client, so that the client can directly interact with the OSS node(s) that hold the object(s). This allows the client(s) to perform I/O in parallel across all of the OST objects in the file without further communication with the MDS, avoiding contention from centralized block and lock management.
If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is "striped" in a round-robin manner across the OST objects, similar to RAID 0, in chunks typically 1 MB or larger. Striping a file over multiple OST objects provides significant performance benefits if there is a need for high bandwidth access to a single large file. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (one per OST) scales the file I/O locking capacity of the file proportionately. Each file created in the filesystem may specify different layout parameters, such as the stripe count (number of OST objects making up that file), stripe size (unit of data stored on each OST before moving to the next), and OST selection, so that performance and capacity can be tuned optimally for each file. When many application threads are reading or writing to separate files in parallel, it is optimal to have a single stripe per file, since the application is providing its own parallelism. When there are many threads reading or writing a single large file concurrently, then it is optimal to have at least one stripe on each OST to maximize the performance and capacity of that file.
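The round-robin placement described above can be illustrated with a short sketch. It assumes a plain RAID 0 style layout with a fixed stripe size and stripe count; the function and type names are hypothetical and are not part of any Lustre API.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    stripe_index: int   # which of the file's OST objects holds the byte
    object_offset: int  # byte offset inside that object


def locate(file_offset: int, stripe_size: int, stripe_count: int) -> StripeLocation:
    """Map a logical file offset to (object index, offset within that object)."""
    if file_offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe size and count must be > 0")
    chunk = file_offset // stripe_size      # global chunk number within the file
    stripe_index = chunk % stripe_count     # chunks rotate round-robin over objects
    row = chunk // stripe_count             # full rounds completed before this chunk
    object_offset = row * stripe_size + file_offset % stripe_size
    return StripeLocation(stripe_index, object_offset)


MiB = 1 << 20
```

With a 1 MiB stripe size and four stripes, `locate(5 * MiB + 7, MiB, 4)` lands in the second object (index 1) at offset `MiB + 7`, because the first four chunks fill one row across all four objects.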
In the Lustre 2.10 release, the ability to specify ''composite layouts'' was added to allow files to have different layout parameters for different regions of the file. The Progressive File Layout (PFL) feature uses composite layouts to improve file IO performance over a wider range of workloads, as well as simplify usage and administration. For example, a small PFL file can have a single stripe on flash for low access overhead, while larger files can have many stripes for high aggregate bandwidth and better OST load balancing. The composite layouts are further enhanced in the 2.11 release with the File Level Redundancy (FLR) feature, which allows a file to have multiple overlapping layouts, providing RAID 0+1 redundancy for these files as well as improved read performance. The Lustre 2.11 release also added the Data-on-Metadata (DoM) feature, which allows the ''first'' component of a PFL file to be stored directly on the MDT with the inode. This reduces overhead for accessing small files, both in terms of space usage (no OST object is needed) as well as network usage (fewer RPCs needed to access the data). DoM also improves performance for small files if the MDT is SSD-based, while the OSTs are disk-based. In Lustre 2.13 the OST Overstriping feature allows a single component to have multiple stripes on one OST to further improve parallelism of locking, while the Self-Extending Layout feature allows the component size to be dynamic during write so that it can cope with individual (flash) OSTs running out of space before the whole filesystem is out of space.
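A composite layout can be pictured as an ordered list of components, each covering a byte range of the file with its own stripe count. The sketch below is only an illustration of that idea: the `Component` class, the example extents, and the stripe counts are hypothetical values chosen for the example, not defaults of any Lustre release.

```python
from dataclasses import dataclass
from typing import List, Optional

MiB = 1 << 20


@dataclass(frozen=True)
class Component:
    start: int              # first byte covered by this component
    end: Optional[int]      # one past the last byte; None means "to end of file"
    stripe_count: int


# Hypothetical progressive layout: few stripes for the start of the file,
# more stripes as the file grows.
EXAMPLE_LAYOUT: List[Component] = [
    Component(0, 1 * MiB, 1),
    Component(1 * MiB, 256 * MiB, 4),
    Component(256 * MiB, None, 16),
]


def component_for(layout: List[Component], offset: int) -> Component:
    """Return the component whose extent contains the given file offset."""
    for comp in layout:
        if offset >= comp.start and (comp.end is None or offset < comp.end):
            return comp
    raise ValueError(f"offset {offset} is not covered by the layout")
```

A file that never grows past the first extent only ever uses the single-stripe component, while I/O at large offsets is spread over the widest component.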
Metadata objects and DNE remote or striped directories
When a client initially mounts a filesystem, it is provided the 128-bit Lustre File Identifier (FID, composed of the 64-bit Sequence number, 32-bit Object ID, and 32-bit Version) of the root directory for the mountpoint. When doing a filename lookup, the client performs a lookup of each pathname component by mapping the parent directory FID Sequence number to a specific MDT via the FID Location Database (FLDB), and then does a lookup on the MDS managing this MDT using the parent FID and filename. The MDS will return the FID for the requested pathname component along with a DLM lock. Once the MDT of the last directory in the path is determined, further directory operations (for non-striped directories) will normally take place on that MDT, avoiding contention between MDTs.
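The field widths given above (64-bit Sequence, 32-bit Object ID, 32-bit Version) add up to 128 bits. The following sketch packs and unpacks such an identifier as a single integer. The ordering of the fields within the integer and the helper names are assumptions made for illustration; they do not describe Lustre's on-wire or on-disk encoding.

```python
SEQ_BITS, OID_BITS, VER_BITS = 64, 32, 32


def pack_fid(seq: int, oid: int, ver: int = 0) -> int:
    """Combine the three fields into one 128-bit integer (assumed order: seq, oid, ver)."""
    for name, value, bits in (("seq", seq, SEQ_BITS),
                              ("oid", oid, OID_BITS),
                              ("ver", ver, VER_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (seq << (OID_BITS + VER_BITS)) | (oid << VER_BITS) | ver


def unpack_fid(fid: int) -> tuple:
    """Split a packed 128-bit value back into (seq, oid, ver)."""
    ver = fid & ((1 << VER_BITS) - 1)
    oid = (fid >> VER_BITS) & ((1 << OID_BITS) - 1)
    seq = fid >> (OID_BITS + VER_BITS)
    return seq, oid, ver
```

Because the Sequence number occupies its own field, a client can extract it and use it alone to find the responsible MDT, without interpreting the Object ID.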
For DNE striped directories, the per-directory layout stored on the parent directory provides a hash function and a list of MDT directory FIDs across which the directory is distributed. The ''Logical Metadata Volume'' (LMV) on the client hashes the filename and maps it to a specific MDT directory shard, which will handle further operations on that file in an identical manner to a non-striped directory. For readdir() operations, the entries from each directory shard are returned to the client sorted in the local MDT directory hash order, and the client performs a merge sort to interleave the filenames in hash order so that a single 64-bit cookie can be used to determine the current offset within the directory.
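The two client-side steps in this paragraph, hashing a name to a shard and merging per-shard listings in hash order, can be sketched as follows. CRC32 is used purely as a stand-in hash; it is an assumption for the example and not the hash function Lustre uses, and all function names are hypothetical.

```python
import heapq
import zlib
from typing import Iterable, List


def name_hash(name: str) -> int:
    """Stand-in filename hash (CRC32); the real directory hash is different."""
    return zlib.crc32(name.encode("utf-8"))


def shard_for(name: str, shard_count: int) -> int:
    """Pick the directory shard responsible for a filename."""
    return name_hash(name) % shard_count


def build_shards(names: Iterable[str], shard_count: int) -> List[List[str]]:
    """Distribute names over shards; each shard keeps its entries in hash order."""
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    for name in names:
        shards[shard_for(name, shard_count)].append(name)
    for shard in shards:
        shard.sort(key=name_hash)
    return shards


def readdir_merged(shards: List[List[str]]) -> List[str]:
    """Merge already-sorted shard listings into one listing in global hash order."""
    return list(heapq.merge(*shards, key=name_hash))
```

Because every shard is already sorted by hash, the merge never has to buffer a whole shard, and a single hash value is enough to describe how far a directory listing has progressed.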
In Lustre 2.15, the client LMV implements round-robin and space balanced default directory layouts, so that clients can use a large number of MDTs in a single filesystem more effectively. When a new subdirectory is created near the root of the filesystem (the top three directory levels by default), it will automatically be created by the client as a remote directory on one of the available MDTs (selected in sequential order) to balance space usage and load across servers. If the free space on the MDTs becomes imbalanced (more than 5% difference in free space and inodes) then the client creating the directory will bias creation toward an MDT with more free space in order to restore balance.
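The selection policy described above can be approximated in a few lines. This is a deliberately simplified sketch: it collapses free space and free inodes into one free fraction per MDT, measures imbalance as the gap between the fullest and emptiest MDT, and always picks the emptiest MDT when unbalanced, whereas the paragraph only says creation is biased toward emptier MDTs. The class name and threshold constant are hypothetical.

```python
from typing import Sequence

IMBALANCE_THRESHOLD = 0.05  # the "more than 5% difference" from the text


class MdtChooser:
    """Round-robin MDT selection that switches to space-based selection when unbalanced."""

    def __init__(self, mdt_count: int) -> None:
        if mdt_count <= 0:
            raise ValueError("need at least one MDT")
        self.mdt_count = mdt_count
        self._next = 0  # next MDT index in sequential order

    def choose(self, free_fraction: Sequence[float]) -> int:
        if len(free_fraction) != self.mdt_count:
            raise ValueError("one free-space value per MDT is required")
        if max(free_fraction) - min(free_fraction) > IMBALANCE_THRESHOLD:
            # Unbalanced: simplified here to "always pick the emptiest MDT".
            return max(range(self.mdt_count), key=lambda i: free_fraction[i])
        # Balanced: plain round-robin.
        chosen = self._next
        self._next = (self._next + 1) % self.mdt_count
        return chosen
```

For three evenly filled MDTs, successive calls return 0, 1, 2, 0, and so on; once one MDT has noticeably more free space than another, new directories are steered to it until the gap closes.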
Locking
The Lustre distributed lock manager (LDLM), implemented in the OpenVMS style, protects the integrity of each file's data and metadata. Access and modification of a Lustre file are completely cache coherent among all of the clients. Metadata locks are managed by the MDT that stores the inode for the file, using FID as the resource name. The metadata locks are split into separate bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), layout (file striping, since Lustre 2.4), and extended attributes (xattrs, since Lustre 2.5). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently clients are only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, and/or non-overlapping write extent locks for independent regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file I/O. In practice, because Linux clients manage their data cache in units of pages, the clients will request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client is requesting an extent lock, the OST may grant a lock for a larger extent than originally requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently granted locks on that object, whether there are conflicting write locks for the requested lock extent, and the number of pending lock requests on that object. The granted lock is never smaller than the originally requested extent. OST extent locks use the Lustre FID of the object as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.
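Two of the rules in this paragraph lend themselves to a small sketch: rounding a requested byte range out to whole pages, and deciding whether two extent locks conflict (overlapping reads are compatible, anything overlapping a write is not). The half-open range convention, the `"r"`/`"w"` mode labels, and the function names are assumptions for illustration only.

```python
from typing import Tuple

PAGE_SIZE = 4096  # page size on most clients, per the text

Extent = Tuple[int, int, str]  # (start, end, mode) with end exclusive, mode "r" or "w"


def page_align(start: int, end: int) -> Tuple[int, int]:
    """Expand the half-open byte range [start, end) to page boundaries."""
    if start < 0 or end <= start:
        raise ValueError("need 0 <= start < end")
    aligned_start = (start // PAGE_SIZE) * PAGE_SIZE
    aligned_end = -(-end // PAGE_SIZE) * PAGE_SIZE  # ceiling division
    return aligned_start, aligned_end


def conflicts(a: Extent, b: Extent) -> bool:
    """Two locks conflict if their ranges overlap and at least one is a write lock."""
    a_start, a_end, a_mode = a
    b_start, b_end, b_mode = b
    overlap = a_start < b_end and b_start < a_end
    return overlap and "w" in (a_mode, b_mode)
```

A request for bytes 100 to 5000 is widened to `(0, 8192)`, so the lock always covers a whole number of pages and is never smaller than what was asked for.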
Networking
The communication between the Lustre clients and servers is implemented using Lustre Networking (LNet), which was originally based on the Sandia Portals network programming application programming interface. Disk storage is connected to the Lustre MDS and OSS server nodes using direct attached storage (SAS, FC, iSCSI) or traditional storage area network (SAN) technologies, which is independent of the client-to-server network.
LNet can use many commonly used network types, such as InfiniBand and TCP (commonly Ethernet) networks, and allows simultaneous availability across multiple network types with routing between them.
Remote Direct Memory Access (RDMA) is used for data and metadata transfer between nodes when provided by the underlying networks, such as InfiniBand, RoCE, iWARP, and Omni-Path, as well as proprietary high-speed networks such as Cray Aries and Gemini, and Atos BXI. High availability and recovery features enable transparent recovery in conjunction with failover servers.
Since Lustre 2.10 the LNet Multi-Rail (MR) feature allows link aggregation of two or more network interfaces between a client and server to improve bandwidth. The LNet interface types do not need to be the same network type. In 2.12 Multi-Rail was enhanced to improve fault tolerance if multiple network interfaces are available between peers.
LNet provides end-to-end throughput over Gigabit Ethernet networks in excess of 100 MB/s, throughput up to 11 GB/s using InfiniBand enhanced data rate (EDR) links, and throughput over 11 GB/s across 100 Gigabit Ethernet interfaces.
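To put these figures in context, they can be compared with the nominal line rate of each link. The short calculation below assumes decimal units (1 GB = 10^9 bytes) and nominal rates of 1 Gbit/s for Gigabit Ethernet and 100 Gbit/s for both InfiniBand EDR and 100 Gigabit Ethernet; the quoted throughputs are taken as the round numbers given above rather than as measurements.

```python
# Nominal link rates in Gbit/s (assumed nominal figures, not measurements).
NOMINAL_GBIT = {
    "Gigabit Ethernet": 1.0,
    "InfiniBand EDR": 100.0,
    "100 Gigabit Ethernet": 100.0,
}

# Throughput figures quoted in the text, in GB/s (decimal units assumed).
QUOTED_GBYTE = {
    "Gigabit Ethernet": 0.1,
    "InfiniBand EDR": 11.0,
    "100 Gigabit Ethernet": 11.0,
}


def line_rate_gbyte(gbit_per_s):
    """Convert a nominal rate in Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


def efficiency(link):
    """Fraction of the nominal line rate that the quoted throughput represents."""
    return QUOTED_GBYTE[link] / line_rate_gbyte(NOMINAL_GBIT[link])


for link in NOMINAL_GBIT:
    print(f"{link}: {efficiency(link):.0%} of nominal line rate")
```

Under these assumptions the quoted figures correspond to roughly 80% of the nominal line rate on Gigabit Ethernet and 88% on the 100 Gbit/s links.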
High availability
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, experiencing a delay while the backup server takes over the storage.
Lustre MDSes are configured as an active/passive pair exporting a single MDT, or one or more active/active MDS pairs with DNE exporting two or more separate MDTs, while OSSes are typically deployed in an active/active configuration exporting separate OSTs to provide redundancy without extra system overhead. In single-MDT filesystems, the standby MDS for one filesystem is the MGS and/or monitoring node, or the active MDS for another file system, so no nodes are idle in the cluster.
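The active/active arrangement can be pictured with a minimal model. The following sketch is an illustration only, not Lustre's failover implementation: the `ServerPair` class and the target names are hypothetical, and real deployments rely on external high-availability software and shared storage that are not modelled here.

```python
class ServerPair:
    """Toy active/active pair: each server normally exports its own targets."""

    def __init__(self, targets_a, targets_b):
        self.home = {"A": list(targets_a), "B": list(targets_b)}
        self.up = {"A": True, "B": True}

    def fail(self, server):
        self.up[server] = False

    def recover(self, server):
        self.up[server] = True

    def exports(self):
        """Return which targets each running server currently exports."""
        running = [s for s in ("A", "B") if self.up[s]]
        if not running:
            return {}
        result = {s: list(self.home[s]) for s in running}
        for s in ("A", "B"):
            if not self.up[s]:
                # The surviving partner takes over the failed server's targets.
                result[running[0]].extend(self.home[s])
        return result


pair = ServerPair(["OST0000", "OST0001"], ["OST0002", "OST0003"])
normal = pair.exports()        # both servers do useful work
pair.fail("A")
degraded = pair.exports()      # B now exports all four targets
pair.recover("A")
restored = pair.exports()      # targets return to their home servers
```

Because both servers export targets in normal operation, neither sits idle, which is the sense in which redundancy comes "without extra system overhead".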
HSM (Hierarchical Storage Management)
Lustre provides the capability to have multiple storage tiers within a single filesystem namespace. It allows traditional HSM functionality to copy (archive) files off the primary filesystem to a secondary archive storage tier. The archive tier is typically a tape-based system that is often fronted by a disk cache. Once a file is archived, it can be released from the main filesystem, leaving only a stub that references the archive copy. If a released file is opened, the Coordinator blocks the open, sends a restore request to a copytool, and then completes the open once the copytool has finished restoring the file.
In addition to external storage tiering, it is possible to have multiple storage tiers within a single filesystem namespace. OSTs of different types (e.g. HDD and SSD) can be declared in named storage pools. The OST pools can be selected when specifying file layouts, and different pools can be used within a single PFL file layout. Files can be migrated between storage tiers either manually or under control of the Policy Engine. Since Lustre 2.11, it is also possible to mirror a file to different OST pools with a FLR file layout, for example to pre-stage files into flash for a computing job.
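As a rough illustration of using different pools within one layout, the sketch below models a file layout as a list of extents, each assigned to a named pool. The data structure, the pool names `flash` and `disk`, and the 64 MiB boundary are assumptions made for this example; they do not reproduce Lustre's on-disk layout format or its command-line tools.

```python
MIB = 1024 * 1024

# Hypothetical layout: (extent_start, extent_end_or_None, pool_name).
# The first 64 MiB of the file live in a flash pool, the rest in a disk pool.
layout = [
    (0, 64 * MIB, "flash"),
    (64 * MIB, None, "disk"),   # None stands for end-of-file
]


def pool_for_offset(layout, offset):
    """Return the pool that holds the byte at the given file offset."""
    for start, end, pool in layout:
        if offset >= start and (end is None or offset < end):
            return pool
    raise ValueError("offset not covered by any layout component")


small_file_pool = pool_for_offset(layout, 4096)       # stays in flash
large_file_pool = pool_for_offset(layout, 200 * MIB)  # spills to disk
```

With a layout like this, small files stay entirely in the faster pool while the tail of a large file lands in the larger, slower one.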
HSM includes some additional Lustre components to manage the interface between the primary filesystem and the archive:
* Coordinator: receives archive and restore requests and dispatches them to agent nodes.
* Agent: runs a copytool to copy data from primary storage to the archive and vice versa.
* Copytool: handles data motion and metadata updates. There are different copytools to interface with different archive systems. A generic POSIX copytool is available for archives that provide a POSIX-like front-end interface. Copytools are also available for the High Performance Storage System (HPSS), Tivoli Storage Manager (TSM), Amazon S3, and Google Drive.
* Policy Engine: watches filesystem Changelogs for new files to archive, applies policies to release files based on age or space usage, and communicates with the MDT and Coordinator. The Policy Engine can also trigger actions such as migration between storage tiers, purge, and removal. The most commonly used policy engine is RobinHood, but other policy engines can also be used.
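A release policy based on age and space usage, as described for the Policy Engine, might look something like the following sketch. It is not RobinHood's algorithm or configuration syntax; the `FileInfo` record, the watermark parameters, and their default values are hypothetical and only illustrate the general shape of such a policy.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    size: int          # bytes occupied on the primary filesystem
    idle_days: int     # days since last access
    archived: bool     # a full copy exists in the archive
    released: bool = False


def select_for_release(files, used, capacity, high_water=0.9, low_water=0.8):
    """Pick archived files to release, least recently used first.

    Nothing is released until usage exceeds the high watermark; releasing
    then continues until usage falls to the low watermark.
    """
    if used <= high_water * capacity:
        return []
    candidates = sorted(
        (f for f in files if f.archived and not f.released),
        key=lambda f: f.idle_days,
        reverse=True,
    )
    chosen = []
    for f in candidates:
        if used <= low_water * capacity:
            break
        chosen.append(f)
        used -= f.size
    return chosen


files = [
    FileInfo("/a", 30, 100, archived=True),
    FileInfo("/b", 30, 10, archived=True),
    FileInfo("/c", 30, 200, archived=False),  # never released: no archive copy
]
to_release = [f.path for f in select_for_release(files, used=95, capacity=100)]
```

Files without an archive copy are never candidates, since releasing them would discard the only copy of the data.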
HSM also defines new states for files including:
* Exist: some copy, possibly incomplete, exists in an HSM.
* Archive: a full copy exists on the archive side of the HSM.
* Dirty: the primary copy of the file has been modified and differs from the archived copy.
* Released: a stub inode exists on an MDT, but the data objects have been removed and the only copy exists in the archive.
* Lost: the archive copy of the file has been lost and cannot be restored.
* No Release: the file should not be released from the filesystem.
* No Archive: the file should not be archived.
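The states listed above can be combined on a single file, which makes them a natural fit for a set of flags. The sketch below models them with Python's `enum.Flag` and adds a few transition functions. The transition rules are a simplified reading of the descriptions above, and the function names are hypothetical; this is not the Lustre API.

```python
from enum import Flag, auto


class HsmState(Flag):
    NONE = 0
    EXISTS = auto()
    ARCHIVED = auto()
    DIRTY = auto()
    RELEASED = auto()
    LOST = auto()
    NO_RELEASE = auto()
    NO_ARCHIVE = auto()


def archive(state):
    """Copy the file to the archive: it now exists there in full and is clean."""
    if state & HsmState.NO_ARCHIVE:
        raise PermissionError("file is flagged No Archive")
    return (state | HsmState.EXISTS | HsmState.ARCHIVED) & ~HsmState.DIRTY


def release(state):
    """Drop the data objects, leaving only the stub on the MDT."""
    if state & HsmState.NO_RELEASE:
        raise PermissionError("file is flagged No Release")
    if not (state & HsmState.ARCHIVED) or state & HsmState.DIRTY:
        raise ValueError("only a clean, fully archived file can be released")
    return state | HsmState.RELEASED


def restore(state):
    """Bring the data back from the archive to primary storage."""
    if state & HsmState.LOST:
        raise IOError("archive copy is lost and cannot be restored")
    return state & ~HsmState.RELEASED


def modify(state):
    """Write to the primary copy; an existing archive copy becomes stale."""
    if state & HsmState.RELEASED:
        raise ValueError("file must be restored before it can be modified")
    return state | HsmState.DIRTY if state & HsmState.EXISTS else state


s = archive(HsmState.NONE)
s = release(s)
s = restore(s)
s = modify(s)   # the archive copy now differs from the primary copy
```

After the final step the file is flagged both archived and dirty, so in this model it would need to be archived again before it could be released.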
Deployments
Lustre is used by many of the TOP500 supercomputers and large multi-cluster sites. Six of the top 10 and more than 60 of the top 100 supercomputers use Lustre file systems. These include the 700 PB, 13 TB/s Orion filesystem for the Frontier supercomputer at Oak Ridge National Laboratory (ORNL), Fugaku and K Computer at the RIKEN Advanced Institute for Computational Science, Tianhe-1A at the National Supercomputing Center in Tianjin, China, LUMI at CSC, Jaguar and Titan at ORNL, Blue Waters at the University of Illinois, and Sequoia and Blue Gene/L at Lawrence Livermore National Laboratory (LLNL).
There are also large Lustre filesystems at the National Energy Research Scientific Computing Center, Pacific Northwest National Laboratory, Texas Advanced Computing Center, Brazilian National Laboratory of Scientific Computing, and NASA in the Americas, in Asia at Tokyo Institute of Technology, in Europe at CEA, and many others.
Commercial technical support
Commercial technical support for Lustre is often bundled along with the computing system or storage hardware sold by the vendor. Some vendors include Hewlett-Packard (as the HP StorageWorks Scalable File Share, circa 2004 through 2008), Atos, and Fujitsu. Vendors selling storage hardware with bundled Lustre support include Hitachi Data Systems (2012), DataDirect Networks (DDN), Aeon Computing, and others. It is also possible to get software-only support for Lustre file systems from some vendors, including Whamcloud.
Amazon Web Services offers Amazon FSx for Lustre, a fully managed service for launching and running high-performance file systems in its cloud.
Microsoft Azure offers Azure Managed Lustre (AMLFS), a fully managed, pay-as-you-go file system for high-performance computing (HPC) and AI workloads in its cloud.
See also
* List of file systems, the distributed parallel fault-tolerant file system section
References
External links
Documentation
* Understanding Lustre Internals, Second Edition: internal workings of the Lustre file system and its core subsystems
Information wikis
* Lustre Community wiki
* Lustre (DDN) wiki
* Lustre (OpenSFS) wiki
Community foundations
* OpenSFS
* EOFS – European Open File System
Hardware/software vendors
* DataDirect Networks (DDN)
* Cray (including former Xyratex employees)
* NetApp
* Aeon Computing