
Cloudera Tutorials


 

Introduction

 

This tutorial will walk you through running a simple Apache Spark ETL job using Cloudera Data Engineering on Cloudera Public Cloud.

 

 

Prerequisites

 

  • Access to Cloudera Public Cloud with a data lake running
  • Basic AWS CLI skills
  • Proper Data Engineering role access:
    • DEAdmin: enable Data Engineering and create virtual clusters
    • DEUser: access virtual cluster and run jobs

 

 


 

 

Download Assets

 

Download and unzip the tutorial files, and note the location where you extracted them.

Using the AWS CLI, copy the file access-log.txt to your S3 bucket under <storage.location>/tutorial-data/data-engineering, where <storage.location> is the value of your environment’s storage.location.base property. In this example, storage.location.base has the value s3a://usermarketing-cdp-demo. The AWS CLI uses the s3:// scheme rather than s3a://, so the command is:

 

aws s3 cp access-log.txt s3://usermarketing-cdp-demo/tutorial-data/data-engineering/access-log.txt
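
The mapping from the storage.location.base property to the upload command can be sketched in Python. This is a minimal illustration, not part of the tutorial files: the function name s3_upload_command is hypothetical, and it assumes the property value always starts with s3a:// or s3://.

```python
import os
import shlex


def s3_upload_command(storage_location_base: str, local_file: str) -> str:
    """Build the `aws s3 cp` command line for the tutorial's upload step.

    Hypothetical helper: it only formats the command, it does not run it.
    """
    # The environment property uses the Hadoop-style s3a:// scheme,
    # while the AWS CLI expects s3://, so the scheme is swapped here.
    scheme, sep, rest = storage_location_base.partition("://")
    if not sep or scheme not in ("s3a", "s3"):
        raise ValueError(
            f"expected an s3a:// or s3:// location, got {storage_location_base!r}"
        )
    bucket_path = rest.rstrip("/")
    file_name = os.path.basename(local_file)
    destination = (
        f"s3://{bucket_path}/tutorial-data/data-engineering/{file_name}"
    )
    parts = ["aws", "s3", "cp", local_file, destination]
    # Quote each part so paths containing spaces stay a single argument.
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    print(s3_upload_command("s3a://usermarketing-cdp-demo", "access-log.txt"))
```

Printing the command instead of executing it lets you check the destination path before anything is copied.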

 


 

Enable Cloudera Data Engineering

 

If the Cloudera Data Engineering service is not already enabled for your environment, enable it now.

Starting from the Cloudera Home Page, select Data Engineering:

 


 

  1. Click the plus icon to enable a new Cloudera Data Engineering service
  2. Provide the environment name: usermarketing
  3. Workload Type: General - Small
  4. Set Auto-Scale Range: Min 1, Max 20

 


 

Create Data Engineering Virtual Cluster

 

  1. Click the plus icon to create a cluster
  2. Cluster name: usermarketing-cde-demo
  3. Data Engineering Service: usermarketing
  4. Auto-Scale Range: CPU Max 4, Memory Max 4 GB
  5. Click Create

 


 

Create and Schedule a Job

 

You can schedule a job to run periodically or run it just once. We will look at both methods.

Click the View Jobs icon.

 


 

In the Jobs section, select Create Job to create a job named access-logs-ETL, then fill out the job details:

  1. Name: access-logs-ETL
  2. Upload File: access-logs-ETL.py, from tutorial files provided
  3. Select Python 3
  4. Turn off Schedule
  5. Create and Run

 


 

Next, look at the output the job generated. In the Job Runs section, select the Run ID for the job you are interested in. In this case, select Run ID 11, which is associated with the job access-logs-ETL.

 


 

Select Logs > stdout to see all of the output from the Spark job that has just run. The job prints some user-friendly segments of the data being processed so the data engineer can validate that the process is working correctly.
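
For comparison with what appears in stdout, the same kind of spot check can be done locally on a few lines of the input file. The sketch below is only an illustration and is not taken from access-logs-ETL.py: it assumes the log lines follow the Apache Common Log Format, and the names LOG_PATTERN, parse_line, and status_counts are hypothetical.

```python
import re
from collections import Counter

# Assumed layout (Apache Common Log Format), for example:
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def status_counts(lines):
    """Count HTTP status codes over all lines that parse successfully."""
    parsed = (parse_line(line) for line in lines)
    return Counter(rec["status"] for rec in parsed if rec is not None)


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing HTTP/1.0" 404 -',
        "not a log line",
    ]
    print(status_counts(sample))
```

If the real file uses a different layout, only the regular expression needs to change; unparseable lines are skipped rather than raising an error.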

 


 

Select Analysis to take a deeper look at the different stages of the job. The Spark job has been split into multiple stages, and you can zoom into each stage to get its utilization details. These details help the data engineer validate that the job is working correctly and using the right amount of resources.

You are encouraged to explore all the stages of the job.

 


 

Once you are satisfied with the application and its output, you can schedule it to run periodically at a fixed time interval.

In the Jobs section, select the three dots icon next to the job you’d like to schedule.

Select Add Schedule.

 


 

  1. Select Edit
  2. Set schedule: Every hour at 0 minute(s) past the hour 
  3. Add Schedule
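
To make the schedule concrete: "every hour at 0 minute(s) past the hour" corresponds to 0 * * * * in standard five-field cron notation. The short sketch below computes when such a schedule would next fire. It is an illustration only; next_hourly_run is a hypothetical helper, not a Cloudera Data Engineering API.

```python
from datetime import datetime, timedelta

# Standard five-field cron notation for "minute 0 of every hour".
HOURLY_CRON = "0 * * * *"


def next_hourly_run(now: datetime) -> datetime:
    """Return the next top-of-the-hour time strictly after `now`."""
    top_of_current_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_current_hour + timedelta(hours=1)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 10, 15)
    print(HOURLY_CRON, "->", next_hourly_run(now))
```

A job scheduled this way and created at 10:15 would therefore first run at 11:00, then at 12:00, and so on.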

 


 

Summary

 

Congratulations on completing the tutorial.

As you've now experienced, Cloudera Data Engineering provides an easy way for developers to run workloads and to schedule them to run periodically.

 

 
