Cloudera Tutorials

 

Introduction

 

This tutorial covers the data enrichment stage of the data lifecycle. It walks you through running a simple PySpark job that enriches your data using an existing data warehouse. We will use Cloudera Data Engineering (CDE) on Cloudera Data Platform - Public Cloud (CDP-PC).

 

 

Prerequisites

 

  • Have access to Cloudera Data Platform (CDP) Public Cloud
  • Have access to a virtual warehouse for your environment. If you need to create one, refer to From 0 to Query with Cloudera Data Warehouse
  • Have created a CDP workload user
  • Ensure proper CDE role access
    • DEAdmin: enable CDE and create virtual clusters
    • DEUser: access virtual cluster and run jobs
  • Basic AWS CLI skills

 

 

Watch Video

 

The video below provides a brief overview of what is covered in this tutorial:

 

 

Download Assets

 

There are two options for getting the assets for this tutorial:

  1. Download a ZIP file

It contains only the files needed for this tutorial. Unzip tutorial-files.zip and remember its location.

  2. Clone our GitHub repository

It provides the assets used in this and other tutorials, organized by tutorial title.

 

Using the AWS CLI, copy the following files to the S3 bucket defined by your environment’s storage.location.base attribute:

car_installs.csv
car_sales.csv
customer_data.csv
experimental_motors.csv
postal_codes.csv

Note: You may need to ask your environment's administrator for the value of the storage.location.base property.

 

For example, if storage.location.base has the value s3a://usermarketing-cdp-demo, copy the files using the command below. Note that the AWS CLI expects the s3:// scheme rather than s3a://.

aws s3 cp . s3://usermarketing-cdp-demo --recursive --exclude "*" --include "*.csv"
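The copy command above can also be assembled programmatically, which helps avoid mixing up the s3a:// scheme used by storage.location.base with the s3:// scheme used by the AWS CLI. The following is a minimal sketch, not part of the tutorial assets; the helper names to_cli_uri and build_copy_command are hypothetical, and the bucket name is the example value from this tutorial.

```python
import shlex
from urllib.parse import urlparse


def to_cli_uri(storage_location_base: str) -> str:
    """Convert an s3a:// (or s3://) location into the s3:// form used by the AWS CLI."""
    parsed = urlparse(storage_location_base)
    if parsed.scheme not in ("s3", "s3a"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    # Drop any trailing slash so the destination is stable.
    return f"s3://{parsed.netloc}{parsed.path.rstrip('/')}"


def build_copy_command(storage_location_base: str, local_dir: str = ".") -> list[str]:
    """Build the argument list for the recursive CSV-only copy shown above."""
    return [
        "aws", "s3", "cp", local_dir, to_cli_uri(storage_location_base),
        "--recursive", "--exclude", "*", "--include", "*.csv",
    ]


if __name__ == "__main__":
    # Example value from this tutorial; replace it with your own environment's value.
    command = build_copy_command("s3a://usermarketing-cdp-demo")
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(command))
```

The sketch only prints the command. To run it, the argument list could be passed to subprocess.run from a directory containing the unzipped CSV files, assuming the AWS CLI is installed and configured.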

 

[Image: output-awscli-copyfiles]

 

Setup Cloudera Data Engineering (CDE)

 

Enable CDE Service

 

If you don’t already have the Cloudera Data Engineering (CDE) service enabled for your environment, enable it now.

Starting from the Cloudera Data Platform (CDP) home page, select Data Engineering:

 

[Image: cdp-homepage-data-engineering]

 

  1. Click the icon to enable a new Cloudera Data Engineering (CDE) service
  2. Name: data-engineering
  3. Environment: <your environment name>
  4. Workload Type: General - Small
  5. Make other changes (optional)
  6. Click Enable

 

[Image: cde-enable-service]

 

Create Data Engineering Virtual Cluster

 

If you don’t already have a CDE virtual cluster, create one.

Starting from Cloudera Data Engineering > Overview:

  1. Click the icon to create a cluster
  2. Cluster Name: data-engineering
  3. CDE Service: <your environment name>
  4. Autoscale Max Capacity: CPU: 4, Memory: 4 GB
  5. Click Create

[Image: cde-create-cluster]

 

Create and Run Jobs

 

We will use the GUI to run our jobs. If you would like to use the CLI instead, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.

In your virtual cluster, view jobs by selecting the jobs icon.

 

[Image: cde-view-jobs]

 

We will create and run two jobs:

  • Pre-SetupDW

As a prerequisite, this PySpark job creates a data warehouse with mock sales, factory, and customer data.

IMPORTANT: Before running the job, you need to modify one variable in Pre-SetupDW.py. Set the variable s3BucketName to the value of your environment’s storage.location.base attribute.


  • EnrichData_ETL

This job brings in data from Cloudera Data Warehouse (CDW), filters out non-representative data, and then joins the sales, factory, and customer data to create a new enriched table, which it stores back in CDW.
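To make the filter-then-join idea behind EnrichData_ETL concrete, here is a minimal sketch of the same logic. It is not the code in EnrichData_ETL.py: the actual job is written in PySpark, while this sketch uses plain Python lists so it can run without Spark. All column names, sample rows, and the "positive sale price" filter rule are hypothetical assumptions for illustration only.

```python
# Hypothetical rows standing in for the warehouse tables; the real tables
# and column names used by the tutorial's job may differ.
sales = [
    {"customer_id": 1, "model": "Model A", "sale_price": 30000.0},
    {"customer_id": 2, "model": "Model B", "sale_price": -1.0},  # non-representative
    {"customer_id": 3, "model": "Model A", "sale_price": 42000.0},
]
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Chen"},
]
factory = [
    {"model": "Model A", "factory": "Plant 1"},
    {"model": "Model B", "factory": "Plant 2"},
]


def filter_representative(rows):
    """Keep only rows with a positive sale price (an assumed filter rule)."""
    return [row for row in rows if row["sale_price"] > 0]


def inner_join(left, right, key):
    """Inner-join two lists of dicts on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def enrich(sales_rows, customer_rows, factory_rows):
    """Filter the sales rows, then join in customer and factory data."""
    clean = filter_representative(sales_rows)
    with_customers = inner_join(clean, customer_rows, "customer_id")
    return inner_join(with_customers, factory_rows, "model")


if __name__ == "__main__":
    for row in enrich(sales, customers, factory):
        print(row)
```

In PySpark, the same steps would typically be expressed with DataFrame filter and join operations, with the result written back to a warehouse table.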

 

In the Jobs section, select Create Job to create a new job:

  1. Name: Pre-SetupDW
  2. Upload File: Pre-SetupDW.py (provided in download assets)
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run
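Before uploading Pre-SetupDW.py, it can help to double-check the s3BucketName value you edited. The sketch below is a hypothetical illustration, not an excerpt from Pre-SetupDW.py: it assumes the value keeps the s3a:// scheme from storage.location.base and that the CSV files sit at the root of that location, as in the copy step earlier. The helper name csv_paths is made up for this example.

```python
# The five files copied to the bucket earlier in this tutorial.
CSV_FILES = [
    "car_installs.csv",
    "car_sales.csv",
    "customer_data.csv",
    "experimental_motors.csv",
    "postal_codes.csv",
]

# Example value from this tutorial; use your own storage.location.base value.
s3BucketName = "s3a://usermarketing-cdp-demo"


def csv_paths(bucket: str) -> dict[str, str]:
    """Return the full path each CSV would have under the given bucket location."""
    base = bucket.rstrip("/")
    # Assumption: the job expects the s3a:// scheme used by storage.location.base.
    if not base.startswith("s3a://"):
        raise ValueError(f"expected an s3a:// location, got {bucket!r}")
    return {name: f"{base}/{name}" for name in CSV_FILES}


if __name__ == "__main__":
    for name, path in csv_paths(s3BucketName).items():
        print(f"{name} -> {path}")
```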

 

Wait a minute for this job to complete, then create the next job:

  1. Name: EnrichData_ETL
  2. Upload File: EnrichData_ETL.py (provided in download assets)
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run

 

[Image: cde-create-job]

 

Review Job Output

 

Let’s take a look at the job output generated.

 

First, let’s take a look at the output for Pre-SetupDW:

Select the Job Runs tab.

  1. Select the Run ID number for your Job name
  2. Select Logs
  3. Select stdout

 

The results should look like this:

 

[Image: cde-jobrun-setupdw]

 

Next, let’s take a look at the job output for EnrichData_ETL:

Select the Job Runs tab.

  1. Select the Run ID number for your Job name
  2. Select Logs
  3. Select stdout

 

The results should look like this:

 

[Image: cde-jobrun-enrichdata]

 

Summary

 

Congratulations on completing the tutorial.

As you have now seen, Cloudera Data Engineering (CDE) provides an easy way for developers to run workloads.

 
