Apache Hadoop Ozone was designed to address the scale limitation of HDFS with respect to small files and the total number of file system objects. On current data center hardware, HDFS has a limit of about 350 million files and 700 million file system objects. Ozone’s architecture addresses these limitations[4]. This article compares the performance of Ozone with HDFS, the de-facto big data file system.
We chose a widely used benchmark, TPC-DS, for this test and a conventional Hadoop stack consisting of Hive, Tez, YARN, and HDFS side by side with Ozone. True to the current industry need for separation of compute and storage, which enables dense storage nodes and elastic compute, we run these tests with the datanodes and node managers segregated. The fundamental ambition of this endeavor, and the subsequent effort in optimizing the product, is to be comparable in terms of stability and performance to HDFS. To that end, we would like to call out the amazing amount of work put in by the community over the past several months towards this goal.
Ozone is currently scheduled for a Beta release along with Cloudera Data Platform – Data Center (CDP-DC) 7.1 release this year.
The following measurements were obtained by generating two independent datasets of 100GB and 1 TB on a cluster with 12 dedicated storage and 12 dedicated compute nodes. The last section of this article will provide information in greater detail about the setup.
The following charts show that considering the total runtime of our 99 benchmark queries Ozone outperformed HDFS by an average 3.5% margin on both datasets.
To have a finer grasp of the detailed results, we have categorized our queries into three groups:
In over 70% of the cases, queries run faster when the data resides in Ozone versus HDFS. The community effort put into stabilization and performance improvements seems to be paying off. But there is still room to grow further.
The following scatter plot maps the average runtime difference between Ozone vs HDFS of each individual TPC-DS query for each dataset. Every query on the plot that hovers around 0% has an insignificant performance difference between Ozone and HDFS. The numbers have been averaged out for each query over 10 consecutive runs to normalize any variance due to noise.
The values on the y-axis represent the proportion of the runtime difference compared to the runtime of the query on HDFS. So for example, 50% means the difference is half of the runtime on HDFS, effectively meaning that the query ran 2 times faster on Ozone while -50% (negative) means the query runtime on Ozone is 1.5x that of HDFS.
The test runs show that Ozone is faster by a small margin on slightly more than 70% of the TPC-DS queries. There are a few outliers that we are actively investigating to find bottlenecks and iron out the kinks with the highest priority.
Ozone is currently released as Tech Preview with CDP-DC and we are in the process of collecting feedback and continuing the evolution of the next-gen Big Data distributed storage system. As mentioned earlier, a Beta release should be coming out very soon, and a GA will follow soon after. The results that we have obtained here and the amount of work that is still ongoing into the stability and performance areas make this goal look like a very realistic target.
The tools we used to do the benchmark are also available and open source if you want to try it out in your environment. All the tools can be found in this repository.
This may have been caused by one of the following: