A Survey of HDFS Optimization for Storage Media
HDFS, short for Hadoop Distributed File System, is the distributed file system within Hadoop and is now widely used for big-data storage. HDFS was originally designed for traditional disks, but with the rapid development of hardware devices, optimization and support for new storage media have become challenging topics for HDFS. This article summarizes HDFS optimization techniques for external storage media such as traditional disks, shingled magnetic recording disks, and solid-state drives, as well as for in-memory media such as conventional memory and non-volatile memory. It argues that future HDFS systems will necessarily be heterogeneous clusters that balance performance and cost, and it proposes four possible research directions for discussion.
1 Overview of the HDFS Storage System
HDFS, short for Hadoop Distributed File System, is the distributed file system within Hadoop and is now widely used for big-data storage. HDFS was originally designed to store massive datasets reliably on large clusters of inexpensive servers while providing high-throughput data reads and writes[1]. It divides stored files into large data blocks and distributes those blocks across cluster nodes, while the NameNode manages cluster metadata. To guard against NameNode failure, HDFS also maintains a backup of the NameNode. By default, HDFS adopts a rack-aware replica placement policy[2]. To reduce storage-space consumption, HDFS also supports erasure coding (EC) as a fault-tolerance mechanism, which improves storage utilization[3].
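The default rack-aware placement just described can be sketched as a toy function. The policy itself (first replica on the writer's node, second on a different rack, third on another node of that remote rack) follows HDFS's documented default; the rack and node names, and the function itself, are illustrative rather than Hadoop's actual API:

```python
import random

def place_replicas(writer_node, nodes_by_rack, num_replicas=3):
    """Toy sketch of HDFS's default rack-aware placement:
    replica 1 on the writer's node, replica 2 on a node in a
    different rack, replica 3 on another node of that remote rack."""
    placement = [writer_node]
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    # Second replica goes off-rack to survive a whole-rack failure.
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[remote_rack])
    placement.append(second)
    # Third replica shares the remote rack to limit cross-rack traffic.
    third = random.choice([n for n in nodes_by_rack[remote_rack] if n != second])
    placement.append(third)
    return placement

# Illustrative cluster topology.
nodes_by_rack = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2", "b3"]}
placement = place_replicas("a1", nodes_by_rack)
```

The trade-off encoded here is the one HDFS documents: one off-rack replica buys rack-failure tolerance, while keeping the third replica on the same remote rack avoids paying cross-rack bandwidth twice.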
HDFS offers high fault tolerance, high scalability, and high portability, matching the current needs of big-data storage for large capacity, rapid generation, and diverse data types, and it has therefore been widely adopted. However, HDFS was designed for traditional disks, and with the rapid development of hardware devices, optimization and support for different storage media have become challenging topics.
2 HDFS Optimization for External Storage Media
2.1 Traditional Disks
The advantages of traditional disks are low price, high storage density, and long lifetime. Their drawback is that reads and writes depend on disk rotation and head movement, so random I/O performance is poor. Therefore, optimization methods for traditional disks mainly focus on merging many small random reads and writes into large sequential reads and writes. Reference [4] proposed the LSM-tree, which buffers random writes in memory and flushes them to disk out of place as sorted, sequential runs, greatly improving write throughput; it is the representative structure based on this idea.
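The idea of absorbing small random writes in memory and emitting them as one large sequential write can be illustrated with a minimal in-memory LSM-style sketch. The class, its parameter names, and the flush policy are invented for illustration and omit everything a real LSM-tree needs (compaction, an on-disk format, bloom filters):

```python
class TinyLSM:
    """Toy illustration of the LSM-tree idea: absorb random writes in
    memory, then flush them as one sorted, sequential run."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # in-memory buffer for recent writes
        self.runs = []            # flushed sorted runs; newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value               # random write hits memory only
        if len(self.memtable) >= self.memtable_limit:
            # One sequential disk write replaces many random ones.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # freshest data first
            return self.memtable[key]
        for run in reversed(self.runs):          # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM(memtable_limit=2)
db.put("k1", "v1")
db.put("k2", "v2")       # reaching the limit triggers a sequential flush
db.put("k1", "v1-new")   # newer value shadows the flushed one
```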
2.2 SMR Disks
Adjacent tracks on shingled magnetic recording (SMR) disks overlap with each other. Compared with traditional disks, they therefore offer higher storage density and lower cost, but they also suffer from severe write amplification and poor random-write performance. In HDFS, much data is cold data that is rarely read or written, making SMR disks a good fit for storing it at lower cost. Optimization techniques for SMR disks mainly aim to improve their poor read-write performance.
Reference [5] proposed Manylogs, a filesystem logging scheme that turns many small random log writes into sequential writes. Reference [6] proposed SMaRT, a shingled translation layer (STL) with a garbage collection (GC) method suited to shingled disks: it applies in-place updates and out-of-place updates to different tracks and performs garbage collection on demand when free space becomes too fragmented. These optimizations allow shingled disks to be used more effectively in HDFS.
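The combination of out-of-place updates and on-demand GC behind an STL can be sketched with a toy single-zone model. The class, its capacity, and its compaction trigger are illustrative assumptions, not details taken from [6]:

```python
class ToyShingledZone:
    """Toy model of out-of-place updates on one SMR zone: every write
    appends at the zone tail, and compaction (GC) runs on demand,
    here only once the zone fills up."""

    def __init__(self, capacity=8):
        self.blocks = []          # (lba, data) pairs, written sequentially
        self.capacity = capacity

    def write(self, lba, data):
        if len(self.blocks) >= self.capacity:
            self._gc()            # reclaim stale versions on demand
        self.blocks.append((lba, data))   # out-of-place: always append

    def live_map(self):
        """Logical view: the last write to each LBA wins."""
        return dict(self.blocks)

    def _gc(self):
        # Rewrite only live data, sequentially, dropping stale versions.
        self.blocks = sorted(self.live_map().items())

zone = ToyShingledZone(capacity=4)
for lba, data in [(0, "a"), (1, "b"), (0, "c"), (1, "d"), (2, "e")]:
    zone.write(lba, data)   # the fifth write triggers GC first
```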
2.3 Solid-state Drives
Solid-state drives offer high bandwidth, high throughput, and fast random I/O, but they also have limited erase cycles, shorter lifetimes, and higher cost. As SSD technology advances and the unit cost of capacity falls, more and more enterprises use SSDs to accelerate data reads and writes.
Reference [10] points out that replacing HDDs directly with SSDs in current HDFS does not guarantee performance improvement, and it explores how to optimize performance, cost, and energy consumption in HDFS. The conclusion is that the advantages of SSDs can be realized in HDFS only when they are combined with high network bandwidth and parallel storage access. Reference [9] likewise argues that the current HDFS architecture cannot exploit all of the advantages of SSDs, and proposes a series of structural improvements: a simplified data-processing algorithm that leverages the high IOPS of SSD storage to shuffle Map output, a libaio-based prefetching model that reduces read latency, and a record-size-based reduce routine that overcomes data-offset problems in the reduce phase. With these changes, [9] achieved performance improvements of 30% and 18% on the TeraSort and DFSIO benchmarks, respectively.
Besides faster reads and writes than disks, SSDs can also serve as compute devices. Reference [8] uses the computational capability of solid-state drives to implement a MapReduce framework for ISC (in-storage computing), pushing Mapper tasks down to SSDs and improving MapReduce performance by 2.3 times.
To address the limited erase cycles of SSDs, reference [7] modified the NameNode block placement policy to balance SSD usage across DataNodes as much as possible. Reference [9] introduced a block replacement strategy based on disk-wear information.
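A wear-balancing placement decision like the one described in [7] might, in spirit, look like the following toy sketch. The wear metric, the function, and the node names are assumptions for illustration, not Hadoop's actual API:

```python
def pick_datanodes(wear_pct, num_replicas=3):
    """Toy wear-balancing placement: prefer the DataNodes whose SSDs
    report the lowest wear percentage, so erase cycles are spread
    evenly across the cluster instead of burning out a few drives."""
    ranked = sorted(wear_pct, key=wear_pct.get)   # least-worn first
    return ranked[:num_replicas]

# Illustrative per-node SSD wear percentages.
wear = {"dn1": 80, "dn2": 10, "dn3": 40, "dn4": 20}
targets = pick_datanodes(wear)
```

A real policy would combine this wear signal with the rack-awareness and free-space constraints the NameNode already enforces; the sketch isolates only the wear-balancing idea.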
3 HDFS Optimization for In-memory Media
3.1 In-memory Caching Mechanisms
Key-value stores built on HDFS use memory as a read-write buffer, transforming random I/O into sequential I/O. Beyond accelerating writes, memory is more importantly used to cache files, improving data-access performance for applications.
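The caching role of memory described above can be illustrated with a minimal LRU block cache. This is a generic sketch of the technique, not HDFS's actual centralized cache management; all names are illustrative:

```python
from collections import OrderedDict

class BlockCache:
    """Toy in-memory block cache with LRU eviction: hot blocks are
    served from memory, and only misses touch the (slow) disk loader."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.cache = OrderedDict()   # insertion order doubles as LRU order
        self.hits = self.misses = 0

    def read(self, block_id, load_from_disk):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)       # mark most recently used
        else:
            self.misses += 1
            self.cache[block_id] = load_from_disk(block_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)     # evict least recently used
        return self.cache[block_id]

cache = BlockCache(capacity=2)
load = lambda blk: f"data-{blk}"                   # stands in for a disk read
for blk in ["b1", "b2", "b1", "b3", "b2"]:
    cache.read(blk, load)
```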
3.2 Non-volatile Memory
Byte-addressable non-volatile memory (NVM) is a new kind of storage medium. Compared with flash, it is not only more durable, but also has lower access latency.
Reference [13] proposed a scheme for managing NameNode metadata in HDFS by using non-volatile memory, reducing the NameNode bottleneck caused by journal processing for filesystem metadata persistence. Reference [11] proposed NVFS (NVM- and RDMA-aware HDFS), which is based on existing HDFS and adds support for NVM and RDMA (remote direct memory access). In NVFS, NVM provides two access modes, block access and memory access, corresponding to the NVFS-BlkIO and NVFS-MemIO interfaces. These increase HDFS write and read throughput by up to 4x and 2x, respectively, and reduce execution time for the data-generation benchmark by up to 45%.
For NVM, reference [10] reached a conclusion similar to the one for SSDs, as discussed in Section 2.3. Reference [12] analyzed the data-recovery mechanism in HDFS and found that the configuration of replication tasks on DataNodes has a significant impact on data recovery. By optimizing the configuration for NVM, data-recovery performance improved by 17% to 71%.
4 Summary and Outlook
HDFS was originally designed and implemented for traditional disks, and optimization and application based on traditional disks and memory have already become relatively mature. But high-density, low-cost SMR disks, high-throughput but high-cost SSDs, and higher-performance yet higher-cost NVM have introduced new research questions for HDFS, and current research still has many limitations. Most studies currently focus on how much performance improvement a single medium can bring to HDFS, or on configuration optimization for a particular medium.
Future HDFS systems will certainly be heterogeneous clusters that balance performance and cost, fully utilizing existing and future storage media. That calls for four major capabilities at the architectural level:
- Standardized testing of I/O performance for any storage medium
- Intelligent classification and placement strategies for files and replicas based on item 1
- Dynamic data-migration strategies based on item 1 and usage history
- Consideration of the physical characteristics of newly integrated hardware, such as SMR write amplification and SSD lifetime
For issue 2, I believe that for a given cost-performance balance point, it should be possible to build a model that computes the optimal strategy. File-write classification can then be treated as a classification task and solved with machine learning models. For issue 3, a statistical model could estimate in real time the probability that already stored data will be accessed, and schedule its placement across different media. By using network-traffic cost and write cost as penalty functions, and reduced access latency and shorter transfer time as reward functions, the problem can be transformed into a reinforcement-learning task. For issue 4, hand-designed rules may still be appropriate.
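As a starting point for the cost-performance model suggested for issue 2, one could score each medium by its storage cost plus an expected access cost, and place data on the cheapest medium overall. The media, their prices, and their latency penalties below are purely illustrative assumptions, not measured values:

```python
def choose_medium(expected_reads_per_day, size_gb, media):
    """Toy placement model: total cost of a medium is its storage cost
    for this file plus a per-read latency penalty scaled by expected
    accesses. Cold data lands on cheap media, hot data on fast media."""
    def total_cost(m):
        return m["cost_per_gb"] * size_gb + m["latency_cost"] * expected_reads_per_day
    return min(media, key=lambda name: total_cost(media[name]))

# Illustrative numbers only: relative ordering, not real prices.
media = {
    "smr": {"cost_per_gb": 0.01, "latency_cost": 0.10},
    "hdd": {"cost_per_gb": 0.02, "latency_cost": 0.05},
    "ssd": {"cost_per_gb": 0.10, "latency_cost": 0.01},
    "nvm": {"cost_per_gb": 0.50, "latency_cost": 0.001},
}
```

A learned classifier, as proposed above, would effectively replace `expected_reads_per_day` with a prediction from file features and access history.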
References
[1] Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop distributed file system. In: Proc. of the MSST. 2010. 1-10.
[2] White T. Hadoop: The Definitive Guide. 4th ed., O'Reilly Media, Inc., 2015.
[3] Weatherspoon H, Kubiatowicz JD. Erasure coding vs. replication: A quantitative comparison. In: Proc. of the IPTPS Workshop. 2001. 328-338.
[4] O'Neil P, Cheng E, Gawlick D, O'Neil E. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996, 33(4): 351-385.
[5] Patana-anake T, Martin V, Sandler N, Wu C, Gunawi HS. Manylogs: Improved CMR/SMR disk bandwidth and faster durability with scattered logs. In: Proc. of the MSST. 2016. 1-16.
[6] He WP, Du DHC. SMaRT: An approach to shingled magnetic recording translation. In: Proc. of the FAST. 2017. 121-134.
[7] Hong J, Li L, Han C, Jin B, Yang Q, Yang Z. Optimizing Hadoop framework for solid state drives. In: Proc. of the IEEE Int'l Congress on Big Data. 2016. 9-17.
[8] Park D, Wang JG, Kee YS. In-storage computing for Hadoop MapReduce framework: Challenges and possibilities. IEEE Trans. on Computers, 2016, PP(99).
[9] Hong J, Li L, Han C, Jin B, Yang Q, Yang Z. Optimizing Hadoop framework for solid state drives. In: Proc. of the IEEE Int'l Congress on Big Data. 2016. 9-17. doi: 10.1109/BigDataCongress.2016.11.
[10] Moon S, Lee J, Sun X, et al. Optimizing the Hadoop MapReduce framework with high-performance storage devices. The Journal of Supercomputing, 2015, 71: 3525-3548. https://doi.org/10.1007/s11227-015-1447-3
[11] Islam NS, Wasi-ur-Rahman M, Lu XY, Panda DK. High performance design for HDFS with byte-addressability of NVM and RDMA. In: Proc. of the Int’l Conf. on Supercomputing. 2016.8.21
[12] Li H, Li X, Lu Y, Qin X. An experimental study on data recovery performance improvement for HDFS with NVM. In: Proc. of the ICCCN. 2020. 1-9. doi: 10.1109/ICCCN49398.2020.9209698.
[13] Choi WG, Park S. A write-friendly approach to manage namespace of Hadoop distributed file system by utilizing nonvolatile memory. The Journal of Supercomputing, 2019, 75: 6632-6662. https://doi.org/10.1007/s11227-019-02876-9