Over the past decade, the successful deployment of large-scale data platforms at our customers has acted as a big data flywheel, driving demand to bring in even more data, apply more sophisticated analytics, and onboard many new data practitioners, from business analysts to data scientists. This unprecedented level of big data workloads hasn't come without its fair share of challenges. The data architecture layer is one such area, where growing datasets have pushed the limits of scalability and performance. The data explosion has to be met with new solutions, which is why we are excited to introduce the next-generation table format for large-scale analytic datasets within Cloudera Data Platform (CDP): Apache Iceberg. Today, we are announcing a private technical preview (TP) release of Iceberg for CDP Data Services in the public cloud, including Cloudera Data Warehouse (CDW) and Cloudera Data Engineering (CDE).
Apache Iceberg is a new open table format designed for petabyte-scale analytic datasets. It has been developed as an open community standard to ensure compatibility across languages and implementations. Apache Iceberg is open source and is developed through the Apache Software Foundation. Companies such as Adobe, Expedia, LinkedIn, Tencent, and Netflix have published blogs about adopting Apache Iceberg to process their large-scale analytics datasets.
To satisfy multi-function analytics over large datasets with the flexibility offered by hybrid and multi-cloud deployments, we integrated Apache Iceberg with CDP to provide a unique solution that future-proofs the data architecture for our customers. By optimizing the various CDP Data Services, including CDW, CDE, and Cloudera Machine Learning (CML), for Iceberg, Cloudera customers can define and manipulate datasets with SQL commands, build complex data pipelines using features like Time Travel operations, and deploy machine learning models built from Iceberg tables. Along with CDP's enterprise features such as Shared Data Experience (SDX) and unified management and deployment across hybrid and multi-cloud environments, customers benefit from Cloudera's contributions to Apache Iceberg, the next-generation table format for large-scale analytic datasets.
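To make Time Travel concrete, here is a minimal PySpark sketch of reading an Iceberg table as of an earlier point in time, using Iceberg's `as-of-timestamp` read option. It assumes an active SparkSession configured with the Iceberg runtime; the table name and timestamp are illustrative placeholders, not part of this release's documentation.

```python
# A minimal time-travel read sketch: load the state of an Iceberg table
# as it existed at a given point in time.
# "db.events" and the timestamp value are illustrative placeholders.
df = (spark.read
      .option("as-of-timestamp", "1640995200000")  # milliseconds since epoch
      .format("iceberg")
      .load("db.events"))
df.show()
```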
As we set out to integrate Apache Iceberg with CDP, we not only wanted to incorporate the advantages of the new table format but also to expand its capabilities to meet the needs of modernizing enterprises, including security and multi-function analytics. That's why we set the following innovation goals to increase the scalability, performance, and ease of use of large-scale datasets across a multi-function analytics platform:
In the subsequent sections, we will take a closer look at how we are integrating Apache Iceberg within CDP to address these key challenges in the areas of performance and ease of use. We will also talk about what you can expect from the TP release as well as unique capabilities customers can benefit from.
Iceberg provides a well-defined, open table format that can be plugged into many different platforms. It includes a catalog that supports atomic changes to snapshots, which is required to ensure that changes to an Iceberg table either succeed or fail as a whole. In addition, the File I/O implementation provides a way to read, write, and delete files, which is required to access the data and metadata files through a well-defined API.
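As a hedged sketch of the snapshot model, the following PySpark query lists a table's committed snapshots through Iceberg's metadata tables; the table name is illustrative:

```python
# Every successful commit to an Iceberg table produces a new snapshot;
# the catalog swaps the table's current-snapshot pointer atomically,
# so readers see either the old state or the new one, never a mix.
# "db.events" is an illustrative table name.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM db.events.snapshots
""").show()
```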
These characteristics and their pre-existing implementations made it quite straightforward to integrate Iceberg into CDP. In CDP, we enable Iceberg tables side by side with the Hive table types, both of which are part of our SDX metadata and security framework. By leveraging SDX and its native metastore, only a small footprint of catalog information is registered to identify each Iceberg table, and keeping this interaction lightweight allows scaling to large tables without incurring the usual overhead of metadata storage and querying.
After the Iceberg tables become available in SDX, the next step is to enable the execution engines to leverage the new tables. The Apache Iceberg community has a sizable pool of seasoned Spark contributors who integrated that execution engine. Hive and Impala integration with Iceberg, on the other hand, was lacking, so Cloudera contributed this work back into the community.
During the last few months we have made good progress on enabling Hive writes (on top of the already available Hive reads) as well as both Impala reads and writes. With Iceberg tables, data can be partitioned much more aggressively. For example, after repartitioning, one of our customers found that Iceberg tables performed 10x better than their previously used Hive external tables for Impala queries. Such an aggressive partitioning strategy was not possible before with Metastore tables, because the high number of partitions would make the compilation of any query against those tables prohibitively slow. This is a perfect example of why Iceberg shines at such large scales.
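The sketch below shows what such fine-grained partitioning can look like in Spark SQL. Iceberg tracks partitions in its own metadata files rather than in the Metastore, so a high partition count does not slow down query compilation; the table and column names here are illustrative:

```python
# Illustrative DDL: partition by day of the timestamp and by hash buckets
# of the id. Iceberg's hidden partitioning applies these transforms
# automatically, so queries get pruning without referencing partition
# columns explicitly.
spark.sql("""
    CREATE TABLE db.events (
        id      BIGINT,
        ts      TIMESTAMP,
        payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts), bucket(64, id))
""")
```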
Integrating Iceberg tables into SDX has the added benefit of Ranger integration, which you get out of the box. Administrators can leverage Ranger's ability to restrict access to entire tables, columns, or rows for specific groups of users. They can also mask columns so that values are redacted, nullified, or hashed in both Hive and Impala. CDP provides unique fine-grained access control capabilities for Iceberg tables, satisfying enterprise customers' requirements for security and governance.
So that you can continue using your existing ORC, Parquet, and Avro datasets stored in external tables, we integrated and enhanced the existing support for migrating these tables to the Iceberg table format, adding Hive support on top of what is available today for Spark. The migration leaves all the data files in place, without creating any copies: it only generates the necessary Iceberg metadata files for them and publishes them in a single commit. Once the migration has completed successfully, all subsequent reads and writes for the table go through Iceberg, and your table changes start generating new commits.
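For the Spark path, a hedged sketch of such an in-place migration using Iceberg's stored procedures might look like the following; the catalog and table names are assumptions about your session configuration, not prescribed values:

```python
# In-place migration: the existing data files stay where they are; Iceberg
# metadata is generated for them and published in a single commit.
# "spark_catalog" and "db.legacy_events" are illustrative names.
spark.sql("CALL spark_catalog.system.migrate('db.legacy_events')")
```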
First, we will focus on additional performance testing to find and remove any bottlenecks we identify. This will span all the CDP Data Services, starting with CDE and CDW. As we move toward GA, we will target specific workload patterns, such as Spark ETL/ELT and Impala BI SQL analytics, using Apache Iceberg.
Beyond the initial GA release, we will expand support for other workload patterns to realize the vision we laid out earlier of multi-function analytics on this new data architecture. That's why we are keen on enhancing the integration of Apache Iceberg with CDP along the following capabilities:
If you are running into challenges with your large datasets, or want to take advantage of the latest innovations in managing datasets through snapshots and time travel, we highly recommend you try out CDP and see for yourself the benefits of Apache Iceberg within a multi-cloud, multi-function analytics platform. Please contact your account team if you are interested in learning more about Apache Iceberg integration with CDP.
To try out CDW and CDE, please sign up for a 60-day trial, or test drive CDP. As always, please provide your feedback in the comments section below.