Parallel Virtual File System

Original author(s): Clemson University, Argonne National Laboratory, Ohio Supercomputer Center
Developer(s): Walt Ligon, Rob Ross, Phil Carns, Pete Wyckoff, Neil Miller, Rob Latham, Sam Lang, Brad Settlemyer
Initial release: 2003
Stable release: 2.8.2 / January 1, 2010
Written in: C
Operating system: Linux
License: LGPL
Website: web.archive.org/web/20160701052501/http://www.pvfs.org/

The Parallel Virtual File System (PVFS) is an open-source parallel file system. A parallel file system is a type of distributed file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks of a parallel application. PVFS was designed for use in large-scale cluster computing and focuses on high-performance access to large data sets. It consists of a server process and a client library, both written entirely as user-level code. A Linux kernel module and pvfs-client process allow the file system to be mounted and used with standard utilities. The client library provides for high-performance access via the Message Passing Interface (MPI). PVFS is jointly developed by the Parallel Architecture Research Laboratory at Clemson University, the Mathematics and Computer Science Division at Argonne National Laboratory, and the Ohio Supercomputer Center. PVFS development has been funded by NASA Goddard Space Flight Center, the DOE Office of Science Advanced Scientific Computing Research program, the NSF PACI and HECURA programs, and other government and private agencies. PVFS is now known as OrangeFS in its newest development branch.

History

PVFS was first developed in 1993 by Walt Ligon and Eric Blumer as a parallel file system for the Parallel Virtual Machine (PVM) [1] as part of a NASA grant to study the I/O patterns of parallel programs. PVFS version 0 was based on Vesta, a parallel file system developed at the IBM T. J. Watson Research Center. [2] Starting in 1994, Rob Ross rewrote PVFS to use TCP/IP and departed from many of the original Vesta design points. PVFS version 1 was targeted at a cluster of DEC Alpha workstations networked using switched FDDI. Like Vesta, PVFS striped data across multiple servers and allowed I/O requests based on a file view that described a strided access pattern. Unlike Vesta, the striping and view were not dependent on a common record size. Ross's research focused on the scheduling of disk I/O when multiple clients were accessing the same file. [3] Previous results had shown that scheduling according to the best possible disk access pattern was preferable. Ross showed that this depends on a number of factors, including the relative speed of the network and the details of the file view. In some cases scheduling based on network traffic was preferable, so a dynamically adaptable schedule provided the best overall performance. [4]

In late 1994 Ligon met with Thomas Sterling and John Dorband at Goddard Space Flight Center (GSFC) and discussed their plans to build the first Beowulf computer. [5] It was agreed that PVFS would be ported to Linux and featured on the new machine. Over the next several years Ligon and Ross worked with the GSFC group, including Donald Becker, Dan Ridge, and Eric Hendricks. In 1997, at a cluster meeting in Pasadena, CA, Sterling asked that PVFS be released as an open-source package. [6]

PVFS2

In 1999 Ligon proposed the development of a new version of PVFS, initially dubbed PVFS2000 and later PVFS2. The design was initially developed by Ligon, Ross, and Phil Carns. Ross completed his PhD in 2000 and moved to Argonne National Laboratory; the design and implementation were carried out by Ligon, Carns, Dale Witchurch, and Harish Ramachandran at Clemson University; Ross, Neil Miller, and Rob Latham at Argonne National Laboratory; and Pete Wyckoff at the Ohio Supercomputer Center. [7] The new file system was released in 2003. The new design featured object servers, distributed metadata, views based on MPI, support for multiple network types, and a software architecture for easy experimentation and extensibility.

PVFS version 1 was retired in 2005. PVFS version 2 is still supported by Clemson and Argonne. Carns completed his PhD in 2006 and joined Axicom, Inc., where PVFS was deployed on several thousand nodes for data mining. In 2008 Carns moved to Argonne and continues to work on PVFS along with Ross, Latham, and Sam Lang. Brad Settlemyer developed a mirroring subsystem at Clemson, and later a detailed simulation of PVFS used for researching new developments. Settlemyer is now at Oak Ridge National Laboratory. In 2007 Argonne began porting PVFS for use on an IBM Blue Gene/P. [8] In 2008 Clemson began developing extensions for supporting large directories of small files, security enhancements, and redundancy capabilities. As many of these goals conflicted with development for Blue Gene, a second branch of the CVS source tree was created and dubbed "Orange", and the original branch was dubbed "Blue". PVFS and OrangeFS track each other very closely, but represent two different groups of user requirements. Most patches and upgrades are applied to both branches. As of 2011, OrangeFS is the main development line.

Features

In a cluster using PVFS, nodes are designated as one or more of: client, data server, or metadata server. Data servers hold file data. Metadata servers hold metadata, including stat information, attributes, and datafile handles, as well as directory entries. Clients run applications that use the file system by sending requests to the servers over the network.

Object-based design

PVFS has an object-based design, which is to say that all PVFS server requests involve objects called dataspaces. A dataspace can be used to hold file data, file metadata, directory metadata, directory entries, or symbolic links. Every dataspace in a file system has a unique handle. Any client or server can look up which server holds the dataspace based on the handle. A dataspace has two components: a bytestream and a set of key/value pairs. The bytestream is an ordered sequence of bytes, typically used to hold file data, and the key/value pairs are typically used to hold metadata. The object-based design has become typical of many distributed file systems, including Lustre, Panasas, and pNFS.
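
A minimal C sketch of the dataspace concept follows; the type and field names here are illustrative assumptions rather than the actual PVFS data structures.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative sketch only: PVFS identifies every dataspace by a handle
     * and stores two components, a bytestream and a set of key/value pairs. */
    typedef uint64_t pvfs_handle_t;      /* unique within the file system */

    struct kv_pair {
        const char *key;                 /* e.g. an attribute name */
        const void *value;
        size_t      value_len;
    };

    struct dataspace {
        pvfs_handle_t   handle;          /* maps to exactly one server */
        unsigned char  *bytestream;      /* ordered bytes: file data */
        size_t          bytestream_len;
        struct kv_pair *keyvals;         /* metadata, directory entries, ... */
        size_t          num_keyvals;
    };

    int main(void)
    {
        struct kv_pair attr = { "owner", "uid:1000", 9 };
        struct dataspace ds = { 0x100001, NULL, 0, &attr, 1 };
        printf("dataspace %llx has %zu key/value pair(s)\n",
               (unsigned long long)ds.handle, ds.num_keyvals);
        return 0;
    }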

Separation of data and metadata

PVFS is designed so that a client can access a server for metadata once, and then can access the data servers without further interaction with the metadata servers. This removes a critical bottleneck from the system and allows much greater performance.

MPI-based requests

When a client program requests data from PVFS, it can supply a description of the data based on MPI_Datatypes. This facility allows MPI file views to be implemented directly by the file system. MPI_Datatypes can describe complex non-contiguous patterns of data. The PVFS server and client code implement data flows that efficiently transfer data between multiple servers and clients.
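
As a sketch of this facility, the MPI program below builds a strided datatype and installs it as a file view before a collective write; the pvfs2: file name prefix (which assumes an MPI-IO library built with PVFS support) and the block sizes are illustrative assumptions.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Datatype filetype;
        double buf[64];
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (int i = 0; i < 64; i++)
            buf[i] = rank + 0.001 * i;

        /* Each process sees 8 blocks of 8 doubles, strided by nprocs*8
         * doubles: a non-contiguous, interleaved file view. */
        MPI_Type_vector(8, 8, 8 * nprocs, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, rank * 8 * sizeof(double), MPI_DOUBLE,
                          filetype, "native", MPI_INFO_NULL);
        MPI_File_write_all(fh, buf, 64, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }

With such a view in place, one collective call conveys the entire non-contiguous request to the file system, rather than a stream of small reads or writes.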

Multiple network support

PVFS uses a networking layer named BMI which provides a non-blocking message interface designed specifically for file systems. BMI has multiple implementation modules for a number of different networks used in high-performance computing, including TCP/IP, Myrinet, InfiniBand, and Portals. [9]

Stateless (lockless) servers

PVFS servers are designed so that they do not share any state with each other or with clients. If a server crashes, another can easily be restarted in its place. Updates are performed without using locks.

User-level implementation

PVFS clients and servers run at user level. Kernel modifications are not needed. There is an optional kernel module that allows a PVFS file system to be mounted like any other file system, or programs can link directly to a user interface such as MPI-IO or a POSIX-like interface. This feature makes PVFS easy to install and less prone to causing system crashes.
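
As a sketch of the kernel-module path, ordinary POSIX code works unchanged once a PVFS volume is mounted; the mount point /mnt/pvfs2 used here is an assumption.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* /mnt/pvfs2 is an assumed mount point for a PVFS volume. */
        int fd = open("/mnt/pvfs2/example.txt",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, "hello pvfs\n", 11) != 11)
            perror("write");
        close(fd);
        return 0;
    }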

System-level interface

The PVFS interface is designed to integrate at the system level. It has similarities with the Linux VFS, making it easy to implement as a mountable file system, but it is equally adaptable to user-level interfaces such as MPI-IO or POSIX-like interfaces. It exposes many of the features of the underlying file system so that interfaces can take advantage of them if desired. [10] [11]

Architecture

PVFS consists of four main components and a number of utility programs. The components are the PVFS2-server, pvfslib, the PVFS-client-core, and the PVFS kernel module. The utilities include the karma management tool and command-line tools (e.g., pvfs-ping, pvfs-ls, pvfs-cp) that operate directly on the file system without using the kernel module, primarily for maintenance and testing. Another key design point is the PVFS protocol, which describes the messages passed between client and server, though this is not strictly a component.

PVFS2-server

The PVFS server runs as a process on a node designated as an I/O node. I/O nodes are often dedicated nodes, but can be regular nodes that run application tasks as well. The PVFS server usually runs as root, but can be run as an ordinary user if preferred. Each server can manage multiple distinct file systems and is designated to run as a metadata server, a data server, or both. All configuration is controlled by a configuration file specified on the command line, and all servers managing a given file system use the same configuration file. The server receives requests over the network, carries out each request, which may involve disk I/O, and responds to the original requester. Requests normally come from client nodes running application tasks, but can come from other servers. The server is composed of the request processor, the job layer, and the Trove, BMI, and flow layers.

Request processor

The request processor consists of the server process's main loop and a number of state machines. State machines are based on a simple language developed for PVFS that manages concurrency within the server and client. A state machine consists of a number of states, each of which either runs a C state action function or calls a nested (subroutine) state machine. In either case, return codes select which state to go to next. State action functions typically submit a job via the job layer, which performs some kind of I/O via Trove or BMI. Jobs are non-blocking, so once a job is issued the state machine's execution is deferred so that another state machine can run and service another request. When jobs complete, the main loop restarts the associated state machine. The request processor has state machines for each of the request types defined in the PVFS request protocol, plus a number of nested state machines used internally. The state machine architecture makes it relatively easy to add new requests to the server in order to add features or optimize for specific situations.
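
The following is a minimal C model of the state-machine idea, not the actual PVFS state-machine language: each state runs an action function whose return value selects the next state, and in PVFS the actions would post non-blocking jobs and yield until those jobs complete.

    #include <stdio.h>

    /* Illustrative model of a request state machine. */
    enum state { ST_DECODE, ST_DO_IO, ST_SEND_REPLY, ST_DONE };

    typedef enum state (*state_action)(void *request);

    static enum state decode_request(void *req) { (void)req; return ST_DO_IO; }
    static enum state do_io(void *req)          { (void)req; return ST_SEND_REPLY; }
    static enum state send_reply(void *req)     { (void)req; return ST_DONE; }

    static const state_action actions[] = {
        [ST_DECODE]     = decode_request,
        [ST_DO_IO]      = do_io,
        [ST_SEND_REPLY] = send_reply,
    };

    static void run_machine(void *req)
    {
        enum state s = ST_DECODE;
        while (s != ST_DONE)
            s = actions[s](req);   /* in PVFS, execution yields here
                                      while the posted job is in flight */
    }

    int main(void)
    {
        run_machine(NULL);
        puts("request complete");
        return 0;
    }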

Job layer

The job layer provides a common interface for submitting Trove, BMI, and flow jobs and reporting their completion. It also implements the request scheduler as a non-blocking job that records which kinds of requests are in progress on which objects and prevents consistency errors caused by simultaneous operations on the same file data.

Trove

Trove manages I/O to the objects stored on the local server. Trove operates on collections of data spaces. A collection has its own independent handle space and is used to implement a distinct PVFS file system. A data space is a PVFS object; it has its own unique (within the collection) handle and is stored on one server. Handles are mapped to servers through a table in the configuration file. A data space consists of two parts: a bytestream and a set of key/value pairs. A bytestream is a sequence of bytes of indeterminate length and is used to store file data, typically in a file on the local file system. Key/value pairs are used to store metadata, attributes, and directory entries. Trove has a well-defined interface and can be implemented in various ways. To date the only implementation has been the Trove-dbfs implementation, which stores bytestreams in files and key/value pairs in a Berkeley DB database. [12] Trove operations are non-blocking; the API provides post functions to read or write the various components, and functions to check or wait for completion.
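
A toy C model of the post/test shape of such a non-blocking API is sketched below; the function names and in-memory "storage" are hypothetical and stand in for Trove's real interface and on-disk backend.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical model of a Trove-style non-blocking API: operations are
     * posted, return an id immediately, and are later tested for completion. */
    typedef int op_id_t;

    static char storage[4096];           /* stands in for a bytestream */
    static int  completed[16];

    static op_id_t post_bstream_write(const void *buf, size_t len, size_t off)
    {
        static op_id_t next_id = 0;
        memcpy(storage + off, buf, len); /* a real backend would do disk I/O */
        completed[next_id] = 1;          /* mark complete for the test call */
        return next_id++;
    }

    static int test_completion(op_id_t id)  /* 1 = done, 0 = still pending */
    {
        return completed[id];
    }

    int main(void)
    {
        op_id_t id = post_bstream_write("file data", 9, 0);
        while (!test_completion(id))
            ;                            /* caller could service other work here */
        printf("op %d complete: %s\n", id, storage);
        return 0;
    }

The real API differs in names and arguments; the point is only that posting returns immediately and completion is discovered later, which lets a server overlap many requests.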

BMI

Flows

pvfslib

PVFS-client-core

PVFS kernel module

See also

Beowulf cluster
Message Passing Interface
Application checkpointing
GFS2
Google File System
Lustre (file system)
GPFS
Apache Hadoop
Clustered file system
Ceph (software)
Network block device
Computer cluster
BeeGFS
Blue Whale Clustered file system (BWFS)
OrangeFS
Distributed file system for cloud
Elliptics
Hierarchical Cluster Engine Project
LizardFS

References

  1. A. Blumer and W. B. Ligon, "The Parallel Virtual File System," 1994 PVM Users Group Meeting, 1994.
  2. Peter F. Corbett, Dror G. Feitelson, "The Vesta Parallel File System," ACM Transactions on Computer Systems (TOCS), v.14 n.3, p.225-264, Aug. 1996.
  3. W. B. Ligon, III, and R. B. Ross, "Implementation and Performance of a Parallel File System for High Performance Distributed Applications", 5th IEEE Symposium on High Performance Distributed Computing, August, 1996.
  4. W. B. Ligon, III, and R. B. Ross, "Server-Side Scheduling in Cluster Parallel I/O Systems," Parallel I/O for Cluster Computing, Christophe Cèrin and Hai Jin editors, pages 157-177, Kogan Page Science, September, 2003.
  5. W. B. Ligon III, R. B. Ross, D. Becker, P. Merkey, "Beowulf: Low-Cost Supercomputing Using Linux," IEEE Software magazine special issue on Linux, Volume 16, Number 1, page 79, January, 1999.
  6. Walt Ligon and Rob Ross, "Parallel I/O and the Parallel Virtual File System," Beowulf Cluster Computing with Linux, 2nd Edition, William Gropp, Ewing Lusk, and Thomas Sterling, editors, pages 489-530, MIT Press, November, 2003.
  7. P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A Parallel File System For Linux Clusters," Extreme Linux Workshop, Atlanta, October, 2000. Best paper of conference award.
  8. Samuel Lang, Philip Carns, Robert Latham, Robert Ross, Kevin Harms, William Allcock, "I/O Performance Challenges at Leadership Scale," Proceedings of Supercomputing, 2009
  9. Philip H. Carns, Walter B. Ligon III, Robert Ross, Pete Wyckoff, "BMI: a network abstraction layer for parallel I/O," Proceedings of IPDPS '05, 2005
  10. M. Vilayannur, S. Lang, R. Ross, R. Klundt, L. Ward, "Extending the POSIX I/O Interface: A Parallel File System Perspective," Technical Memorandum ANL/MCS-TM-302, 2008.
  11. Swapnil A. Patil, Garth A. Gibson, Gregory R. Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiroj, Lin Xiao, "In Search of an API for Scalable File Systems: Under the table or above it?," USENIX HotCloud Workshop 2009.
  12. RCE 35: PVFS Parallel Virtual FileSystem