HOME

TheInfoList



OR:

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations (create, delete, modify, read, write) on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured.
Confidentiality Confidentiality involves a set of rules or a promise usually executed through confidentiality agreements that limits the access or places restrictions on certain types of information. Legal confidentiality By law, lawyers are often required ...
,
availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
and
integrity Integrity is the practice of being honest and showing a consistent and uncompromising adherence to strong moral and ethical principles and values. In ethics, integrity is regarded as the honesty and truthfulness or accuracy of one's actions. In ...
are the main keys for a secure system. Users can share computing resources through the
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
thanks to
cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
which is typically characterized by scalable and
elastic Elastic is a word often used to describe or identify certain types of elastomer, elastic used in garments or stretchable fabrics. Elastic may also refer to: Alternative name * Rubber band, ring-shaped band of rubber used to hold objects togeth ...
resources – such as physical servers, applications and any services that are
virtualized In computing, virtualization or virtualisation (sometimes abbreviated v12n, a numeronym) is the act of creating a virtual (rather than actual) version of something at the same abstraction level, including virtual computer hardware platforms, stor ...
and allocated dynamically. Synchronization is required to make sure that all devices are up-to-date. Distributed file systems enable many big, medium, and small enterprises to store and access their remote data as they do local data, facilitating the use of variable resources.


Overview


History

Today, there are many implementations of distributed file systems. The first file servers were developed by researchers in the 1970s. Sun Microsystem's
Network File System Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed. NFS, lik ...
became available in the 1980s. Before that, people who wanted to share files used the
sneakernet Sneakernet, also called sneaker net, is an informal term for the transfer of electronic information by physically moving media such as magnetic tape, floppy disks, optical discs, USB flash drives or external hard drives between computers, rather ...
method, physically transporting files on storage media from place to place. Once computer networks started to proliferate, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments. Users initially used
FTP The File Transfer Protocol (FTP) is a standard communication protocol used for the transfer of computer files from a server to a client on a computer network. FTP is built on a client–server model architecture using separate control and data ...
to share files. FTP first ran on the
PDP-10 Digital Equipment Corporation (DEC)'s PDP-10, later marketed as the DECsystem-10, is a mainframe computer family manufactured beginning in 1966 and discontinued in 1983. 1970s models and beyond were marketed under the DECsystem-10 name, espec ...
at the end of 1973. Even with FTP, files needed to be copied from the source computer onto a server and then from the server onto the destination computer. Users were required to know the physical addresses of all computers involved with the file sharing.


Supporting techniques

Modern data centers must support large, heterogenous environments, consisting of large numbers of computers of varying capacities. Cloud computing coordinates the operation of all such systems, with techniques such as data center networking (DCN), the
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
framework, which supports
data-intensive computing Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which ...
applications in parallel and distributed systems, and
virtualization In computing, virtualization or virtualisation (sometimes abbreviated v12n, a numeronym) is the act of creating a virtual (rather than actual) version of something at the same abstraction level, including virtual computer hardware platforms, stor ...
techniques that provide dynamic resource allocation, allowing multiple operating systems to coexist on the same physical server.


Applications

Cloud computing Cloud computing is the on-demand availability of computer system resources, especially data storage ( cloud storage) and computing power, without direct active management by the user. Large clouds often have functions distributed over mu ...
provides large-scale computing thanks to its ability to provide the needed CPU and storage resources to the user with complete transparency. This makes cloud computing particularly suited to support different types of applications that require large-scale distributed processing. This
data-intensive computing Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which ...
needs a high performance file system that can share data between
virtual machines In computing, a virtual machine (VM) is the virtualization/ emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. Their implementations may involve specialized hard ...
(VM). Cloud computing dynamically allocates the needed resources, releasing them once a task is finished, requiring users to pay only for needed services, often via a
service-level agreement A service-level agreement (SLA) is a commitment between a service provider and a customer. Particular aspects of the service – quality, availability, responsibilities – are agreed between the service provider and the service user. T ...
. Cloud computing and
cluster computing A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The compo ...
paradigms are becoming increasingly important to industrial data processing and scientific applications such as
astronomy Astronomy () is a natural science that studies celestial objects and phenomena. It uses mathematics, physics, and chemistry in order to explain their origin and evolution. Objects of interest include planets, moons, stars, nebulae, g ...
and physics, which frequently require the availability of large numbers of computers to carry out experiments.


Architectures

Most distributed file systems are built on the client-server architecture, but other, decentralized, solutions exist as well.


Client-server architecture

Network File System Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed. NFS, lik ...
(NFS) uses a client-server architecture, which allows sharing files between a number of machines on a network as if they were located locally, providing a standardized view. The NFS protocol allows heterogeneous clients' processes, probably running on different machines and under different operating systems, to access files on a distant server, ignoring the actual location of files. Relying on a single server results in the NFS protocol suffering from potentially low availability and poor scalability. Using multiple servers does not solve the availability problem since each server is working independently. The model of NFS is a remote file service. This model is also called the remote access model, which is in contrast with the upload/download model: * Remote access model: Provides transparency, the client has access to a file. He send requests to the remote file (while the file remains on the server). * Upload/download model: The client can access the file only locally. It means that the client has to download the file, make modifications, and upload it again, to be used by others' clients. The file system used by NFS is almost the same as the one used by
Unix Unix (; trademarked as UNIX) is a family of multitasking, multiuser computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, an ...
systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes.


Cluster-based architectures

A cluster-based architecture ameliorates some of the issues in client-server architectures, improving the execution of applications in parallel. The technique used here is file-striping: a file is split into multiple chunks, which are "striped" across several storage servers. The goal is to allow access to different parts of a file in parallel. If the application does not benefit from this technique, then it would be more convenient to store different files on different servers. However, when it comes to organizing a distributed file system for large data centers, such as Amazon and Google, that offer services to web clients allowing multiple operations (reading, updating, deleting,...) to a large number of files distributed among a large number of computers, then cluster-based solutions become more beneficial. Note that having a large number of computers may mean more hardware failures. Two of the most widely used distributed file systems (DFS) of this type are the
Google File System Google File System (GFS or GoogleFS, not to be confused with the GFS Linux file system) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. Goo ...
(GFS) and the Hadoop Distributed File System (HDFS). The file systems of both are implemented by user level processes running on top of a standard operating system (
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, w ...
in the case of GFS).


Design principles


= Goals

=
Google File System Google File System (GFS or GoogleFS, not to be confused with the GFS Linux file system) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. Goo ...
(GFS) and Hadoop Distributed File System (HDFS) are specifically built for handling
batch processing Computerized batch processing is a method of running software programs called jobs in batches automatically. While users are required to submit the jobs, no other interaction by the user is required to process the batch. Batches may automatically ...
on very large data sets. For that, the following hypotheses must be taken into account: * High availability: the
cluster may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Asteroid cluster, a small asteroid family * Cluster II (spacecraft), a European Space Agency mission to study t ...
can contain thousands of file servers and some of them can be down at any time * A server belongs to a rack, a room, a data center, a country, and a continent, in order to precisely identify its geographical location * The size of a file can vary from many gigabytes to many terabytes. The file system should be able to support a massive number of files * The need to support append operations and allow file contents to be visible even while a file is being written * Communication is reliable among working machines:
TCP/IP The Internet protocol suite, commonly known as TCP/IP, is a framework for organizing the set of communication protocols used in the Internet and similar computer networks according to functional criteria. The foundational protocols in the suit ...
is used with a remote procedure call RPC communication abstraction. TCP allows the client to know almost immediately when there is a problem and a need to make a new connection.


= Load balancing

= Load balancing is essential for efficient operation in distributed environments. It means distributing work among different servers, fairly, in order to get more work done in the same amount of time and to serve clients faster. In a system containing N chunkservers in a cloud (N being 1000, 10000, or more), where a certain number of files are stored, each file is split into several parts or chunks of fixed size (for example, 64 megabytes), the load of each chunkserver being proportional to the number of chunks hosted by the server. In a load-balanced cloud, resources can be efficiently used while maximizing the performance of MapReduce-based applications.


= Load rebalancing

= In a cloud computing environment, failure is the norm, and chunkservers may be upgraded, replaced, and added to the system. Files can also be dynamically created, deleted, and appended. That leads to load imbalance in a distributed file system, meaning that the file chunks are not distributed equitably between the servers. Distributed file systems in clouds such as GFS and HDFS rely on central or master servers or nodes (Master for GFS and NameNode for HDFS) to manage the metadata and the load balancing. The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if free space on the first server falls below a certain threshold. However, this centralized approach can become a bottleneck for those master servers, if they become unable to manage a large number of file accesses, as it increases their already heavy loads. The load rebalance problem is NP-hard. In order to get large number of chunkservers to work in collaboration, and to solve the problem of load balancing in distributed file systems, several approaches have been proposed, such as reallocating file chunks so that the chunks can be distributed as uniformly as possible while reducing the movement cost as much as possible.


Google file system


= Description

= Google, one of the biggest internet companies, has created its own distributed file system, named Google File System (GFS), to meet the rapidly growing demands of Google's data processing needs, and it is used for all cloud services. GFS is a scalable distributed file system for data-intensive applications. It provides fault-tolerant, high-performance data storage a large number of clients accessing it simultaneously. GFS uses
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filtering ...
, which allows users to create programs and run them on multiple machines without thinking about parallelization and load-balancing issues. GFS architecture is based on having a single master server for multiple chunkservers and multiple clients. The master server running in dedicated node is responsible for coordinating storage resources and managing files's metadata (the equivalent of, for example, inodes in classical file systems). Each file is split to multiple chunks of 64 megabytes. Each chunk is stored in a chunk server. A chunk is identified by a chunk handle, which is a globally unique 64-bit number that is assigned by the master when the chunk is first created. The master maintains all of the files's metadata, including file names, directories, and the mapping of files to the list of chunks that contain each file's data. The metadata is kept in the master server's main memory, along with the mapping of files to chunks. Updates to this data are logged to an operation log on disk. This operation log is replicated onto remote machines. When the log become too large, a checkpoint is made and the main-memory data is stored in a
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...
structure to facilitate mapping back into main memory.


= Fault tolerance

= To facilitate
fault tolerance Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of one or more faults within some of its components. If its operating quality decreases at all, the decrease is proportional to the ...
, each chunk is replicated onto multiple (default, three) chunk servers. A chunk is available on at least one chunk server. The advantage of this scheme is simplicity. The master is responsible for allocating the chunk servers for each chunk and is contacted only for metadata information. For all other data, the client has to interact with the chunk servers. The master keeps track of where a chunk is located. However, it does not attempt to maintain the chunk locations precisely but only occasionally contacts the chunk servers to see which chunks they have stored. This allows for scalability, and helps prevent bottlenecks due to increased workload. In GFS, most files are modified by appending new data and not overwriting existing data. Once written, the files are usually only read sequentially rather than randomly, and that makes this DFS the most suitable for scenarios in which many large files are created once but read many times.


= File processing

= When a client wants to write-to/update a file, the master will assign a replica, which will be the primary replica if it is the first modification. The process of writing is composed of two steps: * Sending: First, and by far the most important, the client contacts the master to find out which chunk servers hold the data. The client is given a list of replicas identifying the primary and secondary chunk servers. The client then contacts the nearest replica chunk server, and sends the data to it. This server will send the data to the next closest one, which then forwards it to yet another replica, and so on. The data is then propagated and cached in memory but not yet written to a file. * Writing: When all the replicas have received the data, the client sends a write request to the primary chunk server, identifying the data that was sent in the sending phase. The primary server will then assign a sequence number to the write operations that it has received, apply the writes to the file in serial-number order, and forward the write requests in that order to the secondaries. Meanwhile, the master is kept out of the loop. Consequently, we can differentiate two types of flows: the data flow and the control flow. Data flow is associated with the sending phase and control flow is associated to the writing phase. This assures that the primary chunk server takes control of the write order. Note that when the master assigns the write operation to a replica, it increments the chunk version number and informs all of the replicas containing that chunk of the new version number. Chunk version numbers allow for update error-detection, if a replica wasn't updated because its chunk server was down. Some new Google applications did not work well with the 64-megabyte chunk size. To solve that problem, GFS started, in 2004, to implement the
Bigtable Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio. History Bigtable development began in 2004.. It is now used by a number of Googl ...
approach.


Hadoop distributed file system

, developed by the Apache Software Foundation, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes). Its architecture is similar to GFS, i.e. a master/slave architecture. The HDFS is normally installed on a cluster of computers. The design concept of Hadoop is informed by Google's, with Google File System, Google MapReduce and
Bigtable Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio. History Bigtable development began in 2004.. It is now used by a number of Googl ...
, being implemented by Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Hadoop Base (HBase) respectively. Like GFS, HDFS is suited for scenarios with write-once-read-many file access, and supports file appends and truncates in lieu of random reads and writes to simplify data coherency issues. An HDFS cluster consists of a single NameNode and several DataNode machines. The NameNode, a master server, manages and maintains the metadata of storage DataNodes in its RAM. DataNodes manage storage attached to the nodes that they run on. NameNode and DataNode are software designed to run on everyday-use machines, which typically run under a Linux OS. HDFS can be run on any machine that supports Java and therefore can run either a NameNode or the Datanode software. On an HDFS cluster, a file is split into one or more equal-size blocks, except for the possibility of the last block being smaller. Each block is stored on multiple DataNodes, and each may be replicated on multiple DataNodes to guarantee availability. By default, each block is replicated three times, a process called "Block Level Replication". The NameNode manages the file system namespace operations such as opening, closing, and renaming files and directories, and regulates file access. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for servicing read and write requests from the file system's clients, managing the block allocation or deletion, and replicating blocks. When a client wants to read or write data, it contacts the NameNode and the NameNode checks where the data should be read from or written to. After that, the client has the location of the DataNode and can send read or write requests to it. The HDFS is typically characterized by its compatibility with data rebalancing schemes. In general, managing the free space on a DataNode is very important. Data must be moved from one DataNode to another, if free space is not adequate; and in the case of creating additional replicas, data should be moved to assure system balance.


Other examples

Distributed file systems can be optimized for different purposes. Some, such as those designed for internet services, including GFS, are optimized for scalability. Other designs for distributed file systems support performance-intensive applications usually executed in parallel. Some examples include: MapR File System (MapR-FS), Ceph-FS, Fraunhofer File System (BeeGFS), Lustre File System,
IBM General Parallel File System GPFS (General Parallel File System, brand name IBM Spectrum Scale) is high-performance clustered file system software developed by IBM. It can be deployed in shared-disk or shared-nothing distributed parallel modes, or a combination of these. It ...
(GPFS), and Parallel Virtual File System. MapR-FS is a distributed file system that is the basis of the MapR Converged Platform, with capabilities for distributed file storage, a NoSQL database with multiple APIs, and an integrated message streaming system. MapR-FS is optimized for scalability, performance, reliability, and availability. Its file storage capability is compatible with the Apache Hadoop Distributed File System (HDFS) API but with several design characteristics that distinguish it from HDFS. Among the most notable differences are that MapR-FS is a fully read/write filesystem with metadata for files and directories distributed across the namespace, so there is no NameNode. Ceph-FS is a distributed file system that provides excellent performance and reliability. It answers the challenges of dealing with huge files and directories, coordinating the activity of thousands of disks, providing parallel access to metadata on a massive scale, manipulating both scientific and general-purpose workloads, authenticating and encrypting on a large scale, and increasing or decreasing dynamically due to frequent device decommissioning, device failures, and cluster expansions. BeeGFS is the high-performance parallel file system from the Fraunhofer Competence Centre for High Performance Computing. The distributed metadata architecture of BeeGFS has been designed to provide the scalability and flexibility needed to run HPC and similar applications with high I/O demands. Lustre File System has been designed and implemented to deal with the issue of bottlenecks traditionally found in distributed systems. Lustre is characterized by its efficiency, scalability, and redundancy. GPFS was also designed with the goal of removing such bottlenecks.


Communication

High performance of distributed file systems requires efficient communication between computing nodes and fast access to the storage systems. Operations such as open, close, read, write, send, and receive need to be fast, to ensure that performance. For example, each read or write request accesses disk storage, which introduces seek, rotational, and network latencies. The data communication (send/receive) operations transfer data from the application buffer to the machine kernel, TCP controlling the process and being implemented in the kernel. However, in case of network congestion or errors, TCP may not send the data directly. While transferring data from a buffer in the
kernel Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine learn ...
to the application, the machine does not read the byte stream from the remote machine. In fact, TCP is responsible for buffering the data for the application. Choosing the buffer-size, for file reading and writing, or file sending and receiving, is done at the application level. The buffer is maintained using a circular linked list. It consists of a set of BufferNodes. Each BufferNode has a DataField. The DataField contains the data and a pointer called NextBufferNode that points to the next BufferNode. To find the current position, two
pointers Pointer may refer to: Places * Pointer, Kentucky * Pointers, New Jersey * Pointers Airport, Wasco County, Oregon, United States * The Pointers, a pair of rocks off Antarctica People with the name * Pointer (surname), a surname (including a lis ...
are used: CurrentBufferNode and EndBufferNode, that represent the position in the BufferNode for the last write and read positions. If the BufferNode has no free space, it will send a wait signal to the client to wait until there is available space.


Cloud-based Synchronization of Distributed File System

More and more users have multiple devices with ad hoc connectivity. The data sets replicated on these devices need to be synchronized among an arbitrary number of servers. This is useful for backups and also for offline operation. Indeed, when user network conditions are not good, then the user device will selectively replicate a part of data that will be modified later and off-line. Once the network conditions become good, the device is synchronized. Two approaches exist to tackle the distributed synchronization issue: user-controlled peer-to-peer synchronization and cloud master-replica synchronization. * user-controlled peer-to-peer: software such as
rsync rsync is a utility for efficiently transferring and synchronizing files between a computer and a storage drive and across networked computers by comparing the modification times and sizes of files. It is commonly found on Unix-like opera ...
must be installed in all users' computers that contain their data. The files are synchronized by peer-to-peer synchronization where users must specify network addresses and synchronization parameters, and is thus a manual process. * cloud master-replica synchronization: widely used by cloud services, in which a master replica is maintained in the cloud, and all updates and synchronization operations are to this master copy, offering a high level of availability and reliability in case of failures.


Security keys

In cloud computing, the most important
security" \n\n\nsecurity.txt is a proposed standard for websites' security information that is meant to allow security researchers to easily report security vulnerabilities. The standard prescribes a text file called \"security.txt\" in the well known locat ...
concepts are
confidentiality Confidentiality involves a set of rules or a promise usually executed through confidentiality agreements that limits the access or places restrictions on certain types of information. Legal confidentiality By law, lawyers are often required ...
,
integrity Integrity is the practice of being honest and showing a consistent and uncompromising adherence to strong moral and ethical principles and values. In ethics, integrity is regarded as the honesty and truthfulness or accuracy of one's actions. In ...
, and
availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
("
CIA The Central Intelligence Agency (CIA ), known informally as the Agency and historically as the Company, is a civilian foreign intelligence service of the federal government of the United States, officially tasked with gathering, processing, ...
"). Confidentiality becomes indispensable in order to keep private data from being disclosed. Integrity ensures that data is not corrupted.


Confidentiality

Confidentiality Confidentiality involves a set of rules or a promise usually executed through confidentiality agreements that limits the access or places restrictions on certain types of information. Legal confidentiality By law, lawyers are often required ...
means that data and computation tasks are confidential: neither cloud provider nor other clients can access the client's data. Much research has been done about confidentiality, because it is one of the crucial points that still presents challenges for cloud computing. A lack of trust in the cloud providers is also a related issue. The infrastructure of the cloud must ensure that customers' data will not be accessed by unauthorized parties. The environment becomes insecure if the service provider can do all of the following: * locate the consumer's data in the cloud * access and retrieve consumer's data * understand the meaning of the data (types of data, functionalities and interfaces of the application and format of the data). The geographic location of data helps determine privacy and confidentiality. The location of clients should be taken into account. For example, clients in Europe won't be interested in using datacenters located in United States, because that affects the guarantee of the confidentiality of data. In order to deal with that problem, some cloud computing vendors have included the geographic location of the host as a parameter of the service-level agreement made with the customer, allowing users to choose themselves the locations of the servers that will host their data. Another approach to confidentiality involves data encryption. Otherwise, there will be serious risk of unauthorized use. A variety of solutions exists, such as encrypting only sensitive data, and supporting only some operations, in order to simplify computation. Furthermore, cryptographic techniques and tools as FHE, are used to preserve privacy in the cloud.


Integrity

Integrity in cloud computing implies
data integrity Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, processes, or retrieves data. The ter ...
as well as computing integrity. Such integrity means that data has to be stored correctly on cloud servers and, in case of failures or incorrect computing, that problems have to be detected. Data integrity can be affected by malicious events or from administration errors (e.g. during backup and restore,
data migration Data migration is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. Additionally, the validation of migrated data for completeness and the decommis ...
, or changing memberships in
P2P P2P may refer to: * Pay to play, where money is exchanged for services * Peer-to-peer, a distributed application architecture in computing or networking ** List of P2P protocols * Phenylacetone, an organic compound commonly known as P2P * Poin ...
systems). Integrity is easy to achieve using cryptography (typically through
message-authentication code In cryptography, a message authentication code (MAC), sometimes known as a ''tag'', is a short piece of information used for authenticating a message. In other words, to confirm that the message came from the stated sender (its authenticity) and ...
, or MACs, on data blocks). There exist checking mechanisms that effect data integrity. For instance: * HAIL (High-Availability and Integrity Layer) is a distributed cryptographic system that allows a set of servers to prove to a client that a stored file is intact and retrievable. * Hach PORs (proofs of retrievability for large files) is based on a symmetric cryptographic system, where there is only one verification key that must be stored in a file to improve its integrity. This method serves to encrypt a file F and then generate a random string named "sentinel" that must be added at the end of the encrypted file. The server cannot locate the sentinel, which is impossible differentiate from other blocks, so a small change would indicate whether the file has been changed or not. * PDP (provable data possession) checking is a class of efficient and practical methods that provide an efficient way to check data integrity on untrusted servers: ** PDP: Before storing the data on a server, the client must store, locally, some meta-data. At a later time, and without downloading data, the client is able to ask the server to check that the data has not been falsified. This approach is used for static data. ** Scalable PDP: This approach is premised upon a symmetric-key, which is more efficient than public-key encryption. It supports some dynamic operations (modification, deletion, and append) but it cannot be used for public verification. ** Dynamic PDP: This approach extends the PDP model to support several update operations such as append, insert, modify, and delete, which is well suited for intensive computation.


Availability

Availability In reliability engineering, the term availability has the following meanings: * The degree to which a system, subsystem or equipment is in a specified operable and committable state at the start of a mission, when the mission is called for at ...
is generally effected by replication. Meanwhile, consistency must be guaranteed. However, consistency and availability cannot be achieved at the same time; each is prioritized at some sacrifice of the other. A balance must be struck. Data must have an identity to be accessible. For instance, Skute is a mechanism based on key/value storage that allows dynamic data allocation in an efficient way. Each server must be identified by a label in the form continent-country-datacenter-room-rack-server. The server can reference multiple virtual nodes, with each node having a selection of data (or multiple partitions of multiple data). Each piece of data is identified by a key space which is generated by a one-way cryptographic hash function (e.g. MD5) and is localised by the hash function value of this key. The key space may be partitioned into multiple partitions with each partition referring to a piece of data. To perform replication, virtual nodes must be replicated and referenced by other servers. To maximize data durability and data availability, the replicas must be placed on different servers and every server should be in a different geographical location, because data availability increases with geographical diversity. The process of replication includes an evaluation of space availability, which must be above a certain minimum thresh-hold on each chunk server. Otherwise, data are replicated to another chunk server. Each partition, i, has an availability value represented by the following formula: avail_i=\sum_^\sum_^ conf_i.conf_j.diversity(s_i,s_j) where s_ are the servers hosting the replicas, conf_ and conf_ are the confidence of servers _ and _ (relying on technical factors such as hardware components and non-technical ones like the economic and political situation of a country) and the diversity is the geographical distance between s_ and s_ . Replication is a great solution to ensure data availability, but it costs too much in terms of memory space. DiskReduce is a modified version of HDFS that's based on
RAID Raid, RAID or Raids may refer to: Attack * Raid (military), a sudden attack behind the enemy's lines without the intention of holding ground * Corporate raid, a type of hostile takeover in business * Panty raid, a prankish raid by male college ...
technology (RAID-5 and RAID-6) and allows asynchronous encoding of replicated data. Indeed, there is a background process which looks for widely replicated data and deletes extra copies after encoding it. Another approach is to replace replication with erasure coding. In addition, to ensure data availability there are many approaches that allow for data recovery. In fact, data must be coded, and if it is lost, it can be recovered from fragments which were constructed during the coding phase. Some other approaches that apply different mechanisms to guarantee availability are: Reed-Solomon code of Microsoft Azure and RaidNode for HDFS. Also Google is still working on a new approach based on an erasure-coding mechanism. There is no RAID implementation for cloud storage.


Economic aspects

The cloud computing economy is growing rapidly. The US government has decided to spend 40% of its
compound annual growth rate Compound annual growth rate (CAGR) is a business and investing specific term for the geometric progression ratio that provides a constant rate of return over the time period. CAGR is not an accounting term, but it is often used to describe some ele ...
(CAGR), expected to be 7 billion dollars by 2015. More and more companies have been utilizing cloud computing to manage the massive amount of data and to overcome the lack of storage capacity, and because it enables them to use such resources as a service, ensuring that their computing needs will be met without having to invest in infrastructure (Pay-as-you-go model). Every application provider has to periodically pay the cost of each server where replicas of data are stored. The cost of a server is determined by the quality of the hardware, the storage capacities, and its query-processing and communication overhead. Cloud computing allows providers to scale their services according to client demands. The pay-as-you-go model has also eased the burden on startup companies that wish to benefit from compute-intensive business. Cloud computing also offers an opportunity to many third-world countries that wouldn't have such computing resources otherwise. Cloud computing can lower IT barriers to innovation. Despite the wide utilization of cloud computing, efficient sharing of large volumes of data in an untrusted cloud is still a challenge.


References


Bibliography

* * * * * # Architecture, structure, and design: #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* # Security #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* #* # Synchronization #* # Economic aspects #* #* #* {{Cloud computing Cloud storage