Google File System: The Mother of Hadoop Distributed File System (HDFS), CephFS, and GlusterFS.
Introduction & Background:
The Google File System (GFS) was designed and implemented by a team of engineers at Google led by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
Sanjay Ghemawat received his Ph.D. from MIT and joined Google in 1999. He has made significant contributions to several of Google’s core infrastructure technologies, including MapReduce, BigTable, and Spanner.
Howard Gobioff, who received his Ph.D. from the University of California, Berkeley, joined Google in 2002. He has also worked on several other Google infrastructure projects, including Colossus (Google’s successor to GFS) and TensorFlow.
Shun-Tak Leung received his Ph.D. from Carnegie Mellon University and joined Google in 2002. He has since worked on various Google projects, including AdWords and Google Cloud Platform.
The development of GFS was driven by the need for a scalable and fault-tolerant file system that could handle the massive amounts of data generated by Google’s growing search engine and other applications. The team recognized that traditional file systems were not suitable for Google’s needs, as they were designed to run on small-scale systems with a limited number of machines.
To address this issue, the GFS team designed a new distributed file system that could scale to handle petabytes of data across thousands of machines. The system was designed to be fault-tolerant, meaning that it could continue to operate even in the face of hardware failures.
The team also implemented a number of other features to make the system more efficient and reliable, such as the use of a master server to coordinate access to the file system and the use of data replication to ensure that data is not lost in the event of a failure.
The GFS was first deployed internally at Google in 2003, and it quickly proved to be a success. The system was able to handle the massive amounts of data generated by Google’s search engine and other applications, and it was also able to scale to handle the growing demand for storage capacity.
Key Takeaways/Findings from the paper:
‘File Blocks are stored on the hard disks, and GFS is a manager. Hence, the level of GFS is higher than the level of the underlying file system and virtual file system’
Goals of GFS:
Like other distributed file systems it has the following goals:
- Performance
- Reliability
- Scalability
- Recoverability
However, it was driven by the key observations of application workloads, and the technical environment — both actual and anticipated — have influenced its design, which deviates significantly from several earlier file system design presumptions.
- Component Failure
- Huge files
- Frequent Mutations
- Co-designing the applications and the file system API benefits the overall system by increasing flexibility
The GFS system mainly stores 2 types of data
- File metadata
- File data
Design Assumptions:
- Inexpensive commodity hardware
- Modest Number of large files
- Large streaming reads, small random reads(MapReduce)
- Mostly appends
- Consistent concurrent execution
- High throughput and low latency
Architecture Building Blocks:
- Single GFS master and multiple GFS chunk servers accessed by multiple GFS clients
- GFS Files are divided into fixed-size chunks(64MB)
- Each chunk has unique ID(chunk handle)
- For reliability each chunk is replicated on multiple chunkservers(default=3)
Source: http://mytechroad.com/client-server-design-issues/
SYSTEM COMPONENTS:
Master Server:
The master contains metadata such as
- Namespace — A lookup table mapping full pathnames to metadata
- Filename — Array of chunk handles (where to find the data)
- Chunk handle — list of all chunk servers available with the master
- List of chunk servers
- Primary — Mapping the server as primary during the lease time.
- Version number
- Lease expiration time
- Log & checkpoints on disk
Role of master includes:
Chunk management:
- Each chunk handler is assigned a unique identifier by the master server.
- The master server polls each chunk server for chunk information upon startup or addition of a new server.
- The master server establishes a lease for a replica of the chunk.
- The master server communicates with each chunk server periodically via Heartbeat.
- The metadata is scanned periodically by the master server for garbage collection, re-copying data, load balancing, and chunk migration.
Operation log:
- The operation log contains a record of key metadata changes.
- The log serves as a timeline that defines the sequence of concurrent operations.
- Files and blocks are identified by the logical time of their creation.
- The log is copied to multiple remote devices after local and remote writes.
- The master server restores its filesystem state by repeating the operation log.
- A small log size is required to minimize startup time, and a checkpoint is generated after a certain size to limit recovery to a smaller number of disk log records.
Snapshot:
- Snapshot operations are quick and cause minimal disruption to other operations.
- Copy-on-write technique is used in snapshot operations.
- When a snapshot request is received, the master server cancels all chunk leases for files requiring snapshots.
- The master server records the operation in the log and saves the log record into memory via the metadata of the replication source.
- The master server requires all chunk servers owning the chunk copy to create a new copy, set lease on one of the three new copies, and return it to the client when the reference count of the chunk is greater than one.
- Canceling the chunk of the file that needs a snapshot ensures that the client triggers the creation of a new copy of the chunk if it cannot find the primary.
- Copy-on-write technique reduces unnecessary copying because not all chunks are modified during snapshot creation.
Namespace lock and garbage collection(explained later in features)
Chunk server:
- Each chunk copy is saved as a typical Linux file in the chunk server.
- Chunk Server writes and reads chunks in accordance with the chunk handle and byte-range and stores chunks as Linux files on the local hard drive.
- To maintain reliability, each chunk is replicated over three chunk servers.
- Because it only requires one initial request to the master, a high chunk size can decrease the client’s interaction with the master. With a higher chunk size, the client can carry out more operations on a chunk and keep the TCP connection to the chunk server open for longer.
Chunk Copy Location
- To increase data reliability, multiple copies of a chunk are kept in storage. As a result, choosing a replica location is a rather important decision. A decent replica location definition method meets the following criteria. Chunk replica placement policy serves two purposes
- Make ensuring data is reliable and accessible — In addition to preventing hard disc loss and server downtime, it can also stop network and power outages, making it more reliable than simply deploying on different servers.
- Making sure network capacity is being used: To decrease write delay and boost read and write performance, many copies are placed as near together as possible.
Client server:
- In the form of a library, the client code is linked to the client programme.
- The programme communicates with the master and the chunk server, implements the API interface functions of the GFS file system, and reads and writes data.
- The client will make a request to have the metadata sent by the master server. Additionally, it will inquire about the chunk server to contact from the master. The remaining activities will use the chunk server to read and write data directly.
Data Flow & System Interactions:
Source: https://en.wikipedia.org/wiki/Google_File_System
Process:
Reading:
- To read a chunk of data, the client-server sends the filename and chunk index to the master server.
- The master server responds with the chunk handle and replica location for the latest version.
- The client-server caches this data and sends a request to the chunk server, usually the closest one.
- The request includes the chunk handle and byte-range, and the chunk server reads the file and returns the data.
- The client-server receives the data and performs padding removal and deduplication by unique ID if needed
- The client-server can avoid communicating with the master until the cache information expires or the file is reopened.
- Typically, the client-server queries multiple chunks in a single request.
Write:
- When writing data, the client-server first asks the master server for the primary replica and the chunk’s location.
- If there is no primary replica or if all chunk servers lack the most recent version number, an error is returned.
- The master server selects the primary replica with the latest version number and increments it, writing it to the log on the hard disk.
- The primary and secondary replicas’ new version numbers are then written to the chunk server’s local hard disk.
- The master server returns the primary replica’s identifier and the secondary replica’s location to the client, which caches the information for later use.
- The client-server pushes the data to all replicas, and the chunk server saves it in the LRU cache.
- After all replicas confirm receipt of the data, the client-server sends the write request to the primary replica, which executes it locally in sequence number order.
- The primary replica then passes the write request to all secondary replicas, which operate according to the same sequence number.
- Each secondary replica informs the primary replica when it has completed the operation, and the primary replica informs the client-server.
Record Append:
The Google File System (GFS) has a feature called record append, which is an atomic append operation. With record append, clients only need to specify the data they want to append, and GFS appends it to the file at least once atomically at an offset of GFS’s choosing. This ensures that multiple writes to the same region are serializable, and the region contains data fragments from only one client. Record append is heavily used in distributed applications where many clients on different machines append to the same file concurrently. GFS guarantees that the data is written at least once as an atomic unit, but it does not guarantee that all replicas are bytewise identical. Successful record append operations define consistent regions, whereas intervening regions are undefined.
Lease:
- A primary copy is leased by the master server from among several secondary copies of the chunk. In order to maintain an update order that is consistent throughout all copies, the master copy coordinates client writing. In other words, all modifications to the chunk are serialised in the primary copy, and they are replicated in the same order.
- It is to minimize the administrative burden of the master server.
- Issues created:
- Assuming that S1 is the primary of a chunk, and the network connection between the master and S1 has been disconnected, the master will establish a lease for S2 as the primary of a chunk after finding that S1 has no response. However, S1 suddenly can be connected.
- Solution?
- Set the lease timeout to 60 seconds to resolve this issue. S2 can only be set to primary after S1 expires if the master determines that S1 has no response and allocates a new lease after the lease expires.
Snapshots:
These operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations. The users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled backeasily.
Consistency model:
- File namespace mutations are atomic and handled by the master namespace locking.
- The master’s operation log defines a global total order of file namespace mutations.
- The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
- File regions are consistent if all clients will see the same data, and are defined after a successful mutation.
- Concurrent successful mutations leave the region undefined but consistent, while failed mutations make the region inconsistent.
- Data mutations may be writes or record appends, with record appends being appended atomically at an offset chosen by GFS.
- After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation.
- GFS identifies failed chunk servers by regular handshakes and detects data corruption by check summing.
- A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes.
Features of GFS:
Consistent:
Explained in the consistency model
Namespace Management and Locking:
Using namespace locks ensures that concurrent activities on the master are performed in the correct sequence. Before starting, each master operation obtains a number of locks.
On the same directory, multiple modifications may be made simultaneously. One such instance is the simultaneous creation of numerous files in the same directory. A read lock on the directory name and a write lock on the file name are required for each new file creation. When a directory name has a read lock, it cannot be deleted, renamed, or snapshotted. When a file name has a write lock, efforts to create the same file twice are serialised.
They are lexicographically sorted at the same level as the namespace tree level in order to prevent deadlocks.
Replicas: Placement, Creation, Re-replication, Rebalancing:
Chunk creation:
The master will decide where to put the first empty replica when it produces a chunk, mostly taking into account the following factors:
- Keep fresh replicas on chunk servers with minimal disc utilization.
- The maximum number of chunk creation operations per chunk server is to be avoided.
Re-copy
When the number of valid replicas < 3, the master re-replicates it, because
- the chunk server is unavailable,
- the replica on the chunk server is corrupt,
- the chunk server disk is malfunctioning
- The chunk replica’s replication factor rises.
- The master chooses the chunk with the highest priority, instructing the chunk server to replicate from the nearest replica.
Balanced chunk reloading
Periodically, the master server balances the number of replicas. To make better use of available disc space, it first evaluates the distribution of the present replicas before moving them. The master will progressively fill up any new chunk servers that are added.
Garbage Collection:
The master server promptly logs the deletion record of a file when it is deleted by an application. Although the filename is changed to a concealed name that includes the deletion timestamp, the master server does not immediately reclaim the resource.
The files will be deleted before the set time (the default is three days) when the master node routinely examines the file namespace.
Just rename the files using their normal names when you restore them.
The associated metadata kept in the master server won’t be erased until the hidden file is removed from the namespace, which will lead to a real disassociation from its associated chunk server.
Thus, it offers a consistent and trustworthy method of eliminating unnecessary copies.
High availability:
- GFS cluster has hundreds of servers, and some are bound to be unavailable at any time.
- The system stays highly available through fast recovery and replication strategies.
- Fast recovery: Master and chunk servers can restore their state and start within seconds, and clients and other servers can reconnect and retry, no matter how they are terminated.
- Chunk replication: Each chunk is replicated on multiple chunk servers on different racks, with users able to specify different replication levels. The master clones existing replicas as needed to keep chunks fully replicated.
- Master replication: The master state is replicated for reliability, with operation logs and checkpoints replicated on multiple machines. Shadow masters provide read-only access to the file system, enhancing read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results.
- Shadow master may become single point of failure.
- It may lag slightly.
- Operation logs replicated remotely.
- DNS change can change master.
Data Integrity:
- Each chunkserver in GFS uses checksumming to detect corruption of stored data.
- GFS cluster regularly experiences disk failures that can cause data corruption or loss.
- Each chunk is divided into 64 KB blocks, and each block has a corresponding 32-bit checksum.
- Checksums are kept in memory and stored persistently with logging, separate from user data.
- Checksums are verified before returning data to a requester, and any mismatches are reported to the master.
- Checksumming has little effect on read performance, and computation is optimized for writes that append to the end of a chunk.
- During idle periods, chunkservers can scan and verify the contents of inactive chunks to detect corruption.
Diagnostic tools:
- Diagnostic logging has greatly helped in problem isolation, debugging, and performance analysis with minimal cost.
- Without logs, it’s hard to understand non-repeatable interactions between machines.
- GFS servers generate diagnostic logs that record significant events and all RPC requests and replies.
- RPC logs include exact requests and responses, except for file data being read or written.
- By matching requests with replies and collating RPC records on different machines, the entire interaction history can be reconstructed for diagnosis.
- The logs also serve as traces for load testing and performance analysis.
- The performance impact of logging is minimal and far outweighed by the benefits.
- The most recent events are kept in memory and available for continuous online monitoring
Fault tolerance
High availability along with data integrity and diagnostic tools help in google file system being fault tolerance.
Microbenchmarks and Actual work performance metrics:
The authors conducted micro-benchmarks on a GFS cluster with one master, two master replicas, 16 chunkservers, and 16 clients.
Each machine had dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch.
Reads:
- The aggregate read rate for N clients peaks at 125 MB/s, and the observed read rate is 10 MB/s for one client and 94 MB/s for 16 clients.
- When multiple clients read from the file system simultaneously, the aggregate read rate is limited by the network bandwidth and the performance of individual clients is limited by their network interface.
- The efficiency drops as the number of clients increases due to the increased probability of multiple clients simultaneously reading from the same chunkserver.
Writes:
- The aggregate write rate plateaus at 67 MB/s for N clients.
- When multiple clients write simultaneously to distinct files, the aggregate write rate is limited by the need to write each byte to multiple chunk servers.
- The write rate for one client is about half of the theoretical limit due to delays in propagating data from one replica to another.
- The efficiency drops as the number of clients increases due to the increased probability of multiple clients writing concurrently to the same chunkserver.
Record Appends:
- The aggregate record append performance starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients.
- Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients.
- The efficiency drops as the number of clients increases due to congestion and variances in network transfer rates seen by different clients.
Overall, GFS performs well in read-heavy workloads but suffers from slower writes and record appends. However, in practice, this has not been a major problem as it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients. The network congestion seen in the experiments is also not a significant issue in practice since clients can make progress on writing one file while the chunkservers for another file are busy.
Details of the working model:
Real world clusters:
From the experiments performed by the author they provide various results:
The author provides a statistical analytic summary of two GFS clusters, A and B, used by Google for research and development and production data processing, respectively. The clusters have hundreds of chunkservers, store many TBs of data, and have similar numbers of files, although B has a larger proportion of dead files and more chunks because its files tend to be larger. The chunkservers store tens of GBs of metadata, mostly checksums for 64 KB blocks of user data, and the master stores only tens of MBs of metadata.
- The read rates of both clusters were higher than the write rates, with A sustaining a read rate of 580 MB/s for the preceding week, and both clusters were in the middle of heavy read activity. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peakread rates of 1300 MB/s, but its applications were using just 380 MB/s.
- The average write rate was less than 30 MB/s since the restart, with B in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.
- The paper discusses the performance of GFS and improvements made to overcome previous bottlenecks. The master can handle a rate of 200 to 500 operations per second without being a bottleneck. The data structures have been changed to allow efficient binary searches through the namespace, which has increased file access support to thousands per second.
- The recovery time after a chunkserver fails depends on the number of resources available. In an experiment where a single chunkserver failed, all chunks were restored in 23.2 minutes, while in another experiment where two chunkservers failed, the 266 chunks that were under-replicated were restored to at least 2x replication within 2 minutes.
Workload Breakdown
The paper presents a detailed breakdown of the workloads on two Google File System (GFS) clusters: Cluster X for research and development, and Cluster Y for production data processing. The workload breakdown statistics are based on client-originated requests only, to reflect the workload generated by the applications for the file system as a whole. The paper cautions that the workload breakdown should not be overly generalized since Google completely controls both GFS and its applications, and the applications are tuned for GFS, and conversely GFS is designed for these applications.
The small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. The large reads (over 512 KB) come from long sequential reads through entire files. Write sizes also exhibit a bimodal distribution. The large writes (over 256 KB) typically result from significant buffering within the writers. Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the smaller writes (under 64 KB). As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does because the production systems, which use cluster Y, are more aggressively tuned for GFS.
The original paper of GFS shows that for all kinds of operations, the larger operations (over 256 KB) generally account for most of the bytes transferred. Small reads (under 64 KB) do transfer a small but significant portion of the read data because of the random seek workload.
The paper notes that record appends are heavily used, especially in production systems. The ratio of writes to record appends for cluster X is 108:1 by bytes transferred and 8:1 by operation counts. For cluster Y, the ratios are 3.7:1 and 2.5:1, respectively. The paper also states that data mutation workload is dominated by appending rather than overwriting.
Issues during development:
GFS experienced operational and technical issues during development and deployment
- GFS evolved from a backend file system for production systems to include research and development tasks
- More infrastructure was required to prevent users from interfering with each other
- Disk and Linux-related issues were among the biggest problems
- Disks claimed to support certain IDE protocol versions but only reliably responded to newer ones, leading to occasional data corruption
- Checksums were used to detect data corruption, and the kernel was modified to handle protocol mismatches
- Linux 2.2 kernels had issues with the cost of fsync(), which was proportional to the file size and problematic for large operation logs
- A single reader-writer lock caused timeouts in the system under light load
- Replacing mmap() with pread() worked around the issue
- Despite occasional problems, Linux code helped developers understand system behavior and improve the kernel when appropriate.
Critical Analysis of the proposed method by the paper:
The paper very well presented the novel file system developed by google for managing its vast files and it’s data.
Strengths:
- High accessibility Data is still accessible even if a few nodes fail. (replication) Component failures are more common than not, as the saying goes.
- Excessive throughput. many nodes operating concurrently.
- Dependable storing. Data that has been corrupted can be found and duplicated.
Weakness:
- Not the best fit for small files.
- Master may act as a bottleneck.
- unable to type at random.
- Suitable for procedures or data that are written once and only read (appended) later.
- Limited compatibility: GFS is designed specifically for Google’s infrastructure, making it less accessible for external use.
Opportunities:
- Continuous improvement: GFS can be further improved with new innovations and advancements in the field of distributed file systems.
- Adaptability: GFS can be adapted for different use cases and environments, expanding its potential applications.
- Increased adoption: GFS can be adopted by more organizations, expanding its user base and increasing its impact.
Threats
- Competition: There are other distributed file systems available in the market that may pose a threat to GFS.
- Emerging technologies: New technologies may emerge that can offer more efficient and effective solutions for handling large amounts of data.
- Security risks: With the increasing amount of data being stored, there is a greater risk of security breaches and data theft.
Comparison with Related Studies and current relevance:
The authors of the paper comparing GFS to other large distributed file systems, such as AFS, xFS, Swift, Frangipani, Intermezzo, GPFS, and Lustre.
- GFS provides a location-independent namespace for transparent data movement and spreads a file’s data across storage servers to deliver aggregate performance and increased fault tolerance.
- GFS does not provide any caching below the file system interface and uses replication for redundancy, consuming more raw storage than xFS or Swift.
- The centralized approach was chosen to simplify design, increase reliability, and gain flexibility. Fault tolerance is addressed by keeping the master state small and fully replicated on other machines, and the shadow master mechanism provides scalability and high availability for read.
- GFS is compared to Lustre in terms of delivering aggregate performance to a large number of clients but focuses on the needs of its applications and assumes a large number of unreliable components, making fault tolerance central to the design.
- GFS is most similar to the NASD architecture, but it uses commodity machines as chunkservers and implements features such as rebalancing, replication, and recovery required in a production environment.
- GFS does not seek to alter the storage device model but addresses day-to-day data processing needs for complicated distributed systems with existing commodity components. The producer-consumer queues enabled by atomic record appends address a similar problem as the distributed queues in River, but GFS uses a persistent file that can be appended to concurrently by many producers.
GFS was first introduced in 2003 and since then has been used by many large-scale applications at Google, such as Google Search, Google Maps, and Gmail.
While GFS was groundbreaking in its time and had a significant impact on the development of distributed file systems, its relevance today is somewhat limited. Since the introduction of GFS, several other distributed file systems, such as Hadoop Distributed File System (HDFS) and Apache Cassandra, have been developed, which provide similar capabilities and have gained popularity.
That being said, GFS remains relevant in some specific use cases. For example, organizations that need to store and process large amounts of data and require a highly reliable and scalable file system may still find GFS to be a useful solution. Additionally, researchers and students studying distributed file systems may still find GFS to be an important reference point due to its historical significance and impact on the field.
Conclusion:
Designed to support large-scale data processing workloads on commodity hardware, the design decisions for GFS were made by re-examining traditional file system assumptions in the light of current and anticipated application workloads and technological environment. The system treats component failures as the norm, optimizes for huge files that are mostly appended to and then read, and extends and relaxes the standard file system interface. Fault tolerance is provided by constant monitoring, replicating crucial data, and fast and automatic recovery. The design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. GFS has successfully met Google’s storage needs and is widely used within the company as the storage platform for research and development as well as production data processing. The Google File System is a highly scalable, fault-tolerant, and efficient distributed file system that has been widely adopted by Google and other organizations.