The move into any new technology requires planning and coordinated effort to ensure a successful transition. This blog will describe the four paths to move from a legacy platform such as Cloudera CDH or HDP into CDP Public Cloud or CDP Private Cloud. The four paths are In-place Upgrade, Side-car Migration, Rolling Side-car Migration, and Migrate to Public Cloud.
All four paths from legacy distributions into CDP share common areas of work, risk mitigation steps, and expected outcomes. These include workload reviews, testing and validation, managing service-level agreements (SLAs), and minimizing workload unavailability during the move.
No matter which path you choose, a successful upgrade or migration requires a detailed planning effort. This statement is especially true in larger environments with many workloads, multiple tenants, and complex data dependencies.
An in-place upgrade is a process where you directly upgrade your existing legacy CDH or HDP clusters to CDP. This process involves planned downtime and requires coordination across all tenants, because every tenant must be upgrade-ready on the same day.
Depending on the legacy platform you are upgrading from, the upgrade process keeps some of your existing settings and configurations for various services. Cloudera replaces several legacy components in the transition to CDP Private Cloud Base. The upgrade process provides tools and automation where possible to assist with conversions from those legacy components into the CDP equivalents. For example, CDH users would convert their Apache Sentry implementation to Apache Ranger using an automated conversion tool, and HDP users would transition Ambari configurations to Cloudera Manager using AM2CM. However, Spark 1.6 users on either platform may still need to manually update code for compatibility with Spark 2 and Spark 3.
This diagram describes the logical phases and major areas of work for an In-place Upgrade as you progress from evaluation and discovery to upgrading the development, test, and production environments.
An in-place upgrade is best suited for larger clusters with more significant data footprints. Because the upgrade requires planned downtime, the applications' SLA and downtime requirements play an essential role in the decision. The age and hardware refresh cycle of the legacy clusters are another important consideration. If the cluster nodes are not due for a hardware refresh in the near term, an in-place upgrade might be the best-suited option to get to CDP.
Cloudera recently performed an in-place upgrade of its internal operations cluster, choosing this path due to:
For more information, read about Cloudera’s upgrade experience here.
The side-car migration method is the second path to CDP. A new, greenfield CDP Private Cloud Base cluster is configured on a second set of hardware. This process aims to minimize downtime on individual workloads while providing a straightforward roll-back mechanism on a per-workload basis. The side-car migration breaks down into three major phases.
First, build and configure the new CDP Private Cloud Base cluster. Second, configure a replication process to provide periodic and consistent snapshots of data, metadata, and accompanying governance policies. Third, deploy the workloads onto the new cluster, test them, and flip them into a production state once validated. Once a workload has moved, disable it on the legacy cluster. As a result, you will temporarily have production workloads running across multiple clusters during the migration period.
Cloudera provides tools to assist with this process, including DistCp and Replication Manager (previously called BDR) for data replication, and hms-mirror for Hive schema migration. Authzmigrator provides a Sentry-to-Ranger policy conversion path. FS2CS simplifies the switch from YARN FairScheduler to CapacityScheduler. Where conversions aren’t required, policy and configuration exports allow direct re-use by importing into their corresponding components in CDP.
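As a rough illustration of the data replication step, the sketch below builds a DistCp command line for an incremental copy from the legacy cluster to the new one. The NameNode addresses, the path, and the helper name build_distcp_command are hypothetical, and the sketch only assembles the command rather than running it. It uses the standard DistCp options -update and -m. A real migration would more likely schedule such copies through Replication Manager.

```python
from typing import List


def build_distcp_command(source: str, target: str, max_maps: int = 20) -> List[str]:
    """Build an argv list for an incremental DistCp copy between two clusters.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if max_maps < 1:
        raise ValueError("max_maps must be at least 1")
    return [
        "hadoop", "distcp",
        "-update",             # copy only files that are missing or differ on the target
        "-m", str(max_maps),   # cap the number of simultaneous map tasks
        source,
        target,
    ]


# Hypothetical NameNode addresses and path, for illustration only.
command = build_distcp_command(
    "hdfs://legacy-nn:8020/data/sales",
    "hdfs://cdp-nn:8020/data/sales",
)
print(" ".join(command))
```

Keeping the command as a list makes it straightforward to pass to a scheduler or to subprocess.run once the paths and limits have been reviewed.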
This mechanism is best employed when you have tighter service-level agreements that preclude an extended, multi-hour downtime for your workloads. Additionally, the upgrade journey is often an excellent time to implement a full hardware refresh to take advantage of newer, more capable equipment. Factors such as the age of the hardware, its refresh cycle, and the need for a data center relocation can play an important role when deciding on a side-car migration approach. Customers combining a hardware refresh and a data center relocation have used this mechanism to shorten the upgrade lifecycle and avoid multiple separate impacts to their business plans.
The Rolling Side-car Migration is a modification of the typical Side-car style. In this path, you drain capacity from the existing legacy cluster and repurpose it as a greenfield CDP cluster, much like the process in the traditional Side-car Migration path. Once the new cluster is running, the initial data, metadata, and workload migration occurs for an application or tenant.
After workload testing and validation, the workload is promoted on the new cluster and disabled on the legacy cluster. The workload and its data are then removed from the legacy cluster, freeing up new spare capacity. This capacity is drained from the legacy cluster and moved to the new CDP cluster.
This process iterates multiple times until all tenants, workloads, and data transition fully to the CDP environment.
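The iteration can be sketched as a small capacity simulation. This is a simplified model under stated assumptions, not a Cloudera tool: capacity is counted in whole nodes, each tenant occupies a fixed number of legacy nodes, and the function name plan_rolling_migration and the tenant names are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_rolling_migration(
    tenants: Dict[str, int], total_nodes: int
) -> List[Tuple[str, int, int]]:
    """Return (tenant, legacy_nodes, cdp_nodes) after each migration round.

    tenants maps a tenant name to the number of legacy nodes its workloads
    and data occupy (an assumption made for this illustration).
    """
    spare = total_nodes - sum(tenants.values())
    if spare <= 0:
        raise ValueError("a rolling migration needs spare capacity to start")

    # Drain the spare nodes from the legacy cluster to seed the new CDP cluster.
    legacy_nodes = total_nodes - spare
    cdp_nodes = spare
    cdp_used = 0

    remaining = dict(tenants)
    rounds: List[Tuple[str, int, int]] = []
    while remaining:
        cdp_free = cdp_nodes - cdp_used
        candidates = [name for name, size in remaining.items() if size <= cdp_free]
        if not candidates:
            raise ValueError("no remaining tenant fits in the free CDP capacity")
        # Migrate the largest tenant that currently fits.
        tenant = max(candidates, key=lambda name: remaining[name])
        size = remaining.pop(tenant)
        cdp_used += size
        # The nodes freed on the legacy cluster are repurposed for CDP.
        legacy_nodes -= size
        cdp_nodes += size
        rounds.append((tenant, legacy_nodes, cdp_nodes))
    return rounds


# Hypothetical 13-node legacy cluster with three tenants and 4 spare nodes.
plan = plan_rolling_migration({"etl": 4, "bi": 2, "ml": 3}, total_nodes=13)
for tenant, legacy, cdp in plan:
    print(f"migrated {tenant}: legacy={legacy} nodes, cdp={cdp} nodes")
```

In this simplified model, the free capacity on the new cluster always equals the initial spare capacity, so every tenant must fit within that spare capacity for the rolling approach to complete.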
The Rolling Side-car Migration is a good alternative for customers who have spare capacity but cannot tolerate the extended downtime of an In-place Upgrade. Because this path attempts to minimize the initial capital outlay incurred by a regular Side-car Migration, cost-conscious customers may want to consider this route.
Migrating to CDP Public Cloud from a legacy platform is very similar to the Side-car Migration path, with some minor modifications. In a Side-car Migration, you build the new CDP environment alongside the legacy environment and replicate data to the new HDFS. When migrating to the cloud, data replicates to a cloud object store, and you associate compute-focused CDP Data Hub clusters with those buckets.
This design enables you to scale the compute and storage independently. Additionally, shifting tenants into isolated clusters allows you to scale their resources and tune for their individual needs rather than peak usage in a multi-tenant environment.
Migrating to the cloud is a good option when your on-premises environment is at end-of-life and you wish to transition to a more flexible infrastructure model. This path also suits cases where you need more adaptability and control over resource allocation, which is often slowed down by long budget and hardware purchase cycles. In some cases, you may use a hybrid approach where specific tenants and workloads migrate to public cloud for better cost optimization opportunities, while your well-defined workloads stay on-premises and the cluster still goes through an in-place upgrade or side-car migration. Public cloud is also ideally suited to bursty workloads or those that are heavily CPU-intensive.
Choosing the right path may seem difficult at first glance. The decision ultimately depends on your required SLAs for availability, access to hardware, or interest in moving to the cloud. This decision tree can help point you in the right direction. We will discuss this in depth in a follow-on blog.
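The factors discussed above can be expressed as a small helper. This is only one possible ordering of the questions and is not the decision tree itself; the Environment fields and the suggest_path function are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical summary of the factors named in this post."""
    wants_public_cloud: bool
    tolerates_planned_downtime: bool
    has_new_hardware: bool
    has_spare_capacity: bool


def suggest_path(env: Environment) -> str:
    """Suggest a path to CDP; the ordering of the checks is an assumption."""
    if env.wants_public_cloud:
        return "Migrate to Public Cloud"
    if env.has_new_hardware:
        # A second set of hardware allows a greenfield cluster alongside the old one.
        return "Side-car Migration"
    if env.tolerates_planned_downtime:
        return "In-place Upgrade"
    if env.has_spare_capacity:
        # No new hardware and tight SLAs: repurpose drained capacity instead.
        return "Rolling Side-car Migration"
    return "No clear fit: revisit SLAs, hardware, or cloud options"


# Example: tight SLAs, no hardware refresh planned, some idle capacity.
example = Environment(
    wants_public_cloud=False,
    tolerates_planned_downtime=False,
    has_new_hardware=False,
    has_spare_capacity=True,
)
print(suggest_path(example))
```

A real assessment would weigh more than four yes-or-no answers, including cluster size, data footprint, and per-tenant requirements.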
To plan your upgrade or migration to CDP Private Cloud Base, please contact your Cloudera account team, who will set up some time to walk through the available options with you. Additionally, here are some helpful resources: