Parallel Virtual File System

The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for use in large-scale cluster computing and focuses on high-performance access to large data sets. It consists of a server process and a client library, both written entirely as user-level code. A Linux kernel module and the pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides high-performance access via the Message Passing Interface (MPI). PVFS is developed jointly by the Parallel Architecture Research Laboratory at Clemson University, the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center. PVFS development has been funded by NASA Goddard Space Flight Center, the DOE Office of Science Advanced Scientific Computing Research program, the NSF PACI and HECURA programs, and other government and private agencies. In its newest development branch, PVFS is known as OrangeFS.


History

PVFS was first developed in 1993 by Walt Ligon and Eric Blumer as a parallel file system for Parallel Virtual Machine (PVM), as part of a NASA grant to study the I/O patterns of parallel programs. PVFS version 0 was based on Vesta, a parallel file system developed at the IBM T. J. Watson Research Center. Starting in 1994, Rob Ross rewrote PVFS to use TCP/IP and departed from many of the original Vesta design points. PVFS version 1 targeted a cluster of DEC Alpha workstations networked using switched FDDI. Like Vesta, PVFS striped data across multiple servers and allowed I/O requests based on a file view that described a strided access pattern. Unlike Vesta, the striping and view were not dependent on a common record size. Ross's research focused on scheduling disk I/O when multiple clients access the same file. Previous results had shown that scheduling according to the best possible disk access pattern was preferable. Ross showed that this depended on a number of factors, including the relative speed of the network and the details of the file view. In some cases scheduling based on network traffic was preferable, so a dynamically adaptable schedule provided the best overall performance.

In late 1994 Ligon met with Thomas Sterling and John Dorband at Goddard Space Flight Center (GSFC) and discussed their plans to build the first Beowulf computer. It was agreed that PVFS would be ported to Linux and featured on the new machine. Over the next several years Ligon and Ross worked with the GSFC group, including Donald Becker, Dan Ridge, and Eric Hendricks. In 1997, at a cluster meeting in Pasadena, CA, Sterling asked that PVFS be released as an open-source package.


PVFS2

In 1999 Ligon proposed the development of a new version of PVFS, initially dubbed PVFS2000 and later PVFS2. The design was initially developed by Ligon, Ross, and Phil Carns. Ross completed his PhD in 2000 and moved to Argonne National Laboratory, and the design and implementation were carried out by Ligon, Carns, Dale Witchurch, and Harish Ramachandran at Clemson University; Ross, Neil Miller, and Rob Latham at Argonne National Laboratory; and Pete Wyckoff at the Ohio Supercomputer Center. The new file system was released in 2003. The new design featured object servers, distributed metadata, views based on MPI, support for multiple network types, and a software architecture for easy experimentation and extensibility. PVFS version 1 was retired in 2005. PVFS version 2 is still supported by Clemson and Argonne. Carns completed his PhD in 2006 and joined Axicom, Inc., where PVFS was deployed on several thousand nodes for data mining. In 2008 Carns moved to Argonne and continues to work on PVFS along with Ross, Latham, and Sam Lang. Brad Settlemyer developed a mirroring subsystem at Clemson, and later a detailed simulation of PVFS used for researching new developments. Settlemyer is now at Oak Ridge National Laboratory.

In 2007 Argonne began porting PVFS for use on an IBM Blue Gene/P. In 2008 Clemson began developing extensions for supporting large directories of small files, security enhancements, and redundancy capabilities. As many of these goals conflicted with development for Blue Gene, a second branch of the CVS source tree was created and dubbed "Orange", and the original branch was dubbed "Blue". PVFS and OrangeFS track each other very closely, but represent two different groups of user requirements. Most patches and upgrades are applied to both branches. As of 2011 OrangeFS is the main development line.


Features

In a cluster using PVFS, each node is designated as one or more of: client, data server, metadata server. Data servers hold file data. Metadata servers hold metadata, including stat info, attributes, and datafile handles, as well as directory entries. Clients run applications that use the file system by sending requests to the servers over the network.


Object-based design

PVFS has an object-based design; that is, all PVFS server requests involve objects called dataspaces. A dataspace can be used to hold file data, file metadata, directory metadata, directory entries, or symbolic links. Every dataspace in a file system has a unique handle. Any client or server can look up which server holds the dataspace based on the handle. A dataspace has two components: a bytestream and a set of key/value pairs. The bytestream is an ordered sequence of bytes, typically used to hold file data, and the key/value pairs are typically used to hold metadata. The object-based design has become typical of many distributed file systems, including Lustre, Panasas, and pNFS.
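
The dataspace model described above can be sketched in a few lines of Python. This is an illustrative model only, not PVFS code: the `Dataspace` class, the `server_for_handle` function, the handle ranges, and the server names are all hypothetical, and the sketch assumes each server owns one contiguous range of handles.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Dataspace:
    """One object: a unique handle, a bytestream, and key/value pairs."""
    handle: int
    bytestream: bytearray = field(default_factory=bytearray)
    keyval: dict = field(default_factory=dict)


# Hypothetical layout: each server owns the handles from its start value
# up to (but not including) the next server's start value.
RANGE_STARTS = [0, 1000, 2000]
RANGE_SERVERS = ["server-a", "server-b", "server-c"]


def server_for_handle(handle: int) -> str:
    """Return the server whose handle range contains `handle`."""
    if handle < RANGE_STARTS[0]:
        raise ValueError("handle below the first range")
    return RANGE_SERVERS[bisect_right(RANGE_STARTS, handle) - 1]


# A metadata object holds key/value pairs; a data object holds bytes.
meta = Dataspace(handle=1005, keyval={"owner": "alice", "datafile_handles": [12, 2040]})
data = Dataspace(handle=12, bytestream=bytearray(b"file contents"))
```

Because the lookup needs only the handle and a static table, any client or server can resolve an object's location without asking another node.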


Separation of data and metadata

PVFS is designed so that a client can access a server for metadata once, and then can access the data servers without further interaction with the metadata servers. This removes a critical bottleneck from the system and allows much greater performance.
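
A minimal sketch of this access pattern follows, assuming a round-robin striping layout. The `Client` class, the `locate` function, the layout dictionary, and the strip size are hypothetical illustrations rather than the PVFS API; the point is only that the client contacts the metadata server once and then computes data-server targets locally.

```python
def locate(offset, strip_size, datafile_handles):
    """Map a logical file offset to (datafile handle, offset in that datafile)
    under round-robin striping."""
    strip = offset // strip_size
    n = len(datafile_handles)
    handle = datafile_handles[strip % n]
    local = (strip // n) * strip_size + offset % strip_size
    return handle, local


class Client:
    def __init__(self, metadata_server):
        self.metadata_server = metadata_server  # stand-in: dict of path -> layout
        self.metadata_requests = 0
        self.layouts = {}

    def open(self, path):
        # The only interaction with the metadata server: fetch the layout once.
        self.metadata_requests += 1
        self.layouts[path] = self.metadata_server[path]

    def plan_read(self, path, offset, length):
        """Split a read into (handle, local offset, length) pieces using
        only the cached layout; no metadata request is made here."""
        layout = self.layouts[path]
        strip_size = layout["strip_size"]
        pieces = []
        end = offset + length
        while offset < end:
            chunk = min(end - offset, strip_size - offset % strip_size)
            handle, local = locate(offset, strip_size, layout["datafiles"])
            pieces.append((handle, local, chunk))
            offset += chunk
        return pieces


metadata = {"/data/run1": {"strip_size": 4, "datafiles": [10, 20, 30]}}
client = Client(metadata)
client.open("/data/run1")
plan = client.plan_read("/data/run1", 2, 10)
```

Each piece in `plan` can be sent straight to the server that holds the named datafile, so the metadata server is not on the data path.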


MPI-based requests

When a client program requests data from PVFS it can supply a description of the data that is based on MPI_Datatypes. This facility allows MPI file views to be directly implemented by the file system. MPI_Datatypes can describe complex non-contiguous patterns of data. The PVFS server and data codes implement data flows that efficiently transfer data between multiple servers and clients.
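
To illustrate what a non-contiguous description conveys, the sketch below expands a strided pattern into byte regions. The parameters mirror those of MPI's `MPI_Type_vector` (count, blocklength, stride, all counted in elements), but `flatten_vector` itself is a hypothetical helper written for this example and is not part of MPI or PVFS.

```python
def flatten_vector(count, blocklength, stride, elem_size, base=0):
    """Expand a vector-style pattern into (byte offset, byte length) pairs.

    The pattern is `count` blocks of `blocklength` elements each, with
    consecutive blocks starting `stride` elements apart.
    """
    regions = []
    length = blocklength * elem_size
    for i in range(count):
        start = base + i * stride * elem_size
        if regions and regions[-1][0] + regions[-1][1] == start:
            # Adjacent to the previous region: merge into one larger region.
            regions[-1] = (regions[-1][0], regions[-1][1] + length)
        else:
            regions.append((start, length))
    return regions


# Three blocks of two 8-byte elements, one block every five elements.
strided = flatten_vector(count=3, blocklength=2, stride=5, elem_size=8)

# When stride equals blocklength the pattern is contiguous and collapses.
contiguous = flatten_vector(count=4, blocklength=2, stride=2, elem_size=4)
```

A compact description like (count, blocklength, stride) lets a single request stand for many small regions, which is what allows the file system to service an MPI file view without one request per region.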


Multiple network support

PVFS uses a networking layer named BMI, which provides a non-blocking message interface designed specifically for file systems. BMI has multiple implementation modules for a number of different networks used in high-performance computing, including TCP/IP, Myrinet, InfiniBand, and Portals.


Stateless (lockless) servers

PVFS servers are designed so that they do not share any state with each other or with clients. If a server crashes, another can easily be restarted in its place. Updates are performed without using locks.


User-level implementation

PVFS clients and servers run at user level. Kernel modifications are not needed. There is an optional kernel module that allows a PVFS file system to be mounted like any other file system, or programs can link directly to a user interface such as MPI-IO or a POSIX-like interface. This feature makes PVFS easy to install and less prone to causing system crashes.


System-level interface

The PVFS interface is designed to integrate at the system level. It has similarities with the Linux VFS, making it easy to implement as a mountable file system, but is equally adaptable to user-level interfaces such as MPI-IO or POSIX-like interfaces. It exposes many of the features of the underlying file system so that interfaces can take advantage of them if desired.


Architecture

PVFS consists of four main components and a number of utility programs. The components are the PVFS2-server, the pvfslib, the PVFS-client-core, and the PVFS kernel module. Utilities include the karma management tool and command-line tools (e.g., pvfs-ping, pvfs-ls, pvfs-cp) that operate directly on the file system without using the kernel module, primarily for maintenance and testing. Another key design point is the PVFS protocol, which describes the messages passed between client and server, though it is not strictly a component.


PVFS2-server

The PVFS server runs as a process on a node designated as an I/O node. I/O nodes are often dedicated nodes but can be regular nodes that run application tasks as well. The PVFS server usually runs as root, but can be run as a user if preferred. Each server can manage multiple distinct file systems and is designated to run as a metadata server, data server, or both. All configuration is controlled by a configuration file specified on the command line, and all servers managing a given file system use the same configuration file. The server receives a request over the network, carries it out (which may involve disk I/O), and responds to the original requester. Requests normally come from client nodes running application tasks but can come from other servers. The server is composed of the request processor, the job layer, Trove, BMI, and flow layers.


Request processor

The request processor consists of the server process's main loop and a number of state machines. State machines are written in a simple language developed for PVFS that manages concurrency within the server and client. A state machine consists of a number of states, each of which either runs a C state action function or calls a nested (subroutine) state machine. In either case, return codes select which state to go to next. State action functions typically submit a job via the job layer, which performs some kind of I/O via Trove or BMI. Jobs are non-blocking, so once a job is issued the state machine's execution is deferred and another state machine can run to service another request. When jobs are completed, the main loop restarts the associated state machine. The request processor has state machines for each of the various request types defined in the PVFS request protocol, plus a number of nested state machines used internally. The state machine architecture makes it relatively easy to add new requests to the server in order to add features or optimize for specific situations.
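
The control flow described above can be sketched as a transition table driven by return codes. This is a simplified Python model, not the PVFS state machine language: the `GETATTR` table, the action functions, and the `Instance` and `main_loop` names are hypothetical, nested state machines are omitted, and each instance yields after every state rather than only when it has issued a job.

```python
from collections import deque

OK, NOT_FOUND = 0, 1  # return codes used to select the next state


def read_attrs(ctx):
    ctx["trace"].append((ctx["id"], "read_attrs"))
    return OK if ctx["handle"] in ctx["store"] else NOT_FOUND


def send_reply(ctx):
    ctx["trace"].append((ctx["id"], "send_reply"))
    return OK


def send_error(ctx):
    ctx["trace"].append((ctx["id"], "send_error"))
    return OK


# state name -> (action function, {return code: next state, or None to finish})
GETATTR = {
    "read_attrs": (read_attrs, {OK: "send_reply", NOT_FOUND: "send_error"}),
    "send_reply": (send_reply, {OK: None}),
    "send_error": (send_error, {OK: None}),
}


class Instance:
    """One running state machine servicing one request."""

    def __init__(self, machine, start, ctx):
        self.machine, self.state, self.ctx = machine, start, ctx

    def step(self):
        """Run the current state's action; return True while still active."""
        action, transitions = self.machine[self.state]
        self.state = transitions[action(self.ctx)]
        return self.state is not None


def main_loop(instances):
    """Interleave instances: after each state the instance is deferred
    and the next ready instance runs."""
    ready = deque(instances)
    while ready:
        inst = ready.popleft()
        if inst.step():
            ready.append(inst)


def demo():
    trace, store = [], {7: {"size": 128}}
    main_loop([
        Instance(GETATTR, "read_attrs", {"id": 1, "handle": 7, "store": store, "trace": trace}),
        Instance(GETATTR, "read_attrs", {"id": 2, "handle": 9, "store": store, "trace": trace}),
    ])
    return trace
```

In this model, adding a new request type means adding one table and its action functions; the main loop is unchanged.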


Job layer

The Job layer provides a common interface for submitting Trove, BMI, and flow jobs and reporting their completion. It also implements the request scheduler as a non-blocking job that records what kind of requests are in progress on which objects and prevents consistency errors due to simultaneously operating on the same file data.
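
One way to picture the request scheduler is as a per-object queue. The sketch below is an assumption-laden simplification, not the PVFS implementation: `RequestScheduler`, `post`, and `release` are hypothetical names, and the sketch serializes every request on the same object, whereas the text above does not say which kinds of requests the real scheduler treats as conflicting.

```python
from collections import defaultdict, deque


class RequestScheduler:
    def __init__(self):
        # handle -> queue of request ids; the head of each queue is running.
        self.queues = defaultdict(deque)

    def post(self, handle, request_id):
        """Register a request on an object.

        Returns True if it may run immediately, False if it must wait
        behind an earlier request on the same object.
        """
        queue = self.queues[handle]
        queue.append(request_id)
        return len(queue) == 1

    def release(self, handle, request_id):
        """Mark the running request as finished.

        Returns the id of the next request now allowed to run on this
        object, or None if nothing is waiting.
        """
        queue = self.queues[handle]
        if not queue or queue[0] != request_id:
            raise ValueError("request is not the one running on this handle")
        queue.popleft()
        if queue:
            return queue[0]
        del self.queues[handle]
        return None
```

Requests on different handles never wait on each other here, so unrelated objects are serviced concurrently while operations on one object are kept in order.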


Trove

Trove manages I/O to the objects stored on the local server. Trove operates on collections of data spaces. A collection has its own independent handle space and is used to implement distinct PVFS file systems. A data space is a PVFS object; it has its own handle, unique within the collection, and is stored on one server. Handles are mapped to servers through a table in the configuration file. A data space consists of two parts: a bytestream and a set of key/value pairs. A bytestream is a sequence of bytes of indeterminate length and is used to store file data, typically in a file on the local file system. Key/value pairs are used to store metadata, attributes, and directory entries. Trove has a well-defined interface and can be implemented in various ways. To date the only implementation has been Trove-dbfs, which stores bytestreams in files and key/value pairs in a Berkeley DB database (RCE 35: PVFS Parallel Virtual FileSystem). Trove operations are non-blocking; the API provides post functions to read or write the various components and functions to check or wait for completion.
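
The post-then-wait style can be sketched as follows. This is a toy in-memory model with hypothetical names (`TroveSketch`, `post_bstream_write`, `post_bstream_read`, `post_keyval_write`, `wait`); it does not reproduce the real Trove function names or signatures, and it stands in for asynchronous completion by running the deferred work when `wait` is called.

```python
class TroveSketch:
    def __init__(self):
        self.bytestreams = {}  # handle -> bytearray
        self.keyvals = {}      # handle -> dict
        self.pending = {}      # operation id -> deferred work
        self.next_op = 0

    def _post(self, work):
        """Queue work and return an operation id without doing any I/O."""
        op_id = self.next_op
        self.next_op += 1
        self.pending[op_id] = work
        return op_id

    def post_bstream_write(self, handle, offset, data):
        def work():
            buf = self.bytestreams.setdefault(handle, bytearray())
            end = offset + len(data)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))  # zero-fill any gap
            buf[offset:end] = data
            return len(data)
        return self._post(work)

    def post_bstream_read(self, handle, offset, length):
        def work():
            buf = self.bytestreams.get(handle, b"")
            return bytes(buf[offset:offset + length])
        return self._post(work)

    def post_keyval_write(self, handle, key, value):
        def work():
            self.keyvals.setdefault(handle, {})[key] = value
            return True
        return self._post(work)

    def wait(self, op_id):
        """Complete a posted operation and return its result."""
        return self.pending.pop(op_id)()
```

A caller posts an operation, continues with other work, and later calls `wait` with the returned id, which matches the way state action functions submit a job and are resumed on completion.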


BMI


Flows


pvfslib


PVFS-client-core


PVFS kernel module


See also

* List of file systems, the distributed parallel file system section


References


External links

* Orange File System - A branch of the Parallel Virtual File System
* Architecture of a Next-Generation Parallel File System (video archive)