Cloudera Tutorials

Introduction

This tutorial is inspired by the Kaggle competition RSNA-MICCAI Brain Tumor Radiogenomic Classification. We will use Cloudera Data Engineering on Cloudera to transform the DICOM files produced by an MRI into PNG images.

In a future tutorial, we will use the PNG images to train a machine learning model to detect the presence of a protein found in certain brain cancers.

Prerequisites

Have access to Cloudera Public Cloud
Have created a Cloudera workload User
Ensure proper Data Engineering role access
- DEAdmin: enable Data Engineering and create virtual clusters
- DEUser: access virtual cluster and run jobs
Have Data Engineering CLI configured. Take a look at Using CLI-API to Automate Access to Cloudera Data Engineering to learn how.
Basic AWS CLI skills

Outline

Download Assets
Setup Cloudera Data Engineering
Submit a Spark Job
Summary
Further Reading

Download Assets

There are two options for getting the assets for this tutorial:

Download a ZIP file

It contains only the files needed for this tutorial. Unzip tutorial-files.zip and note its location.

Clone our GitHub repository

It provides assets used in this and other tutorials, organized by tutorial title.

In addition to the files above, you will also need to download the test and train datasets from the Kaggle competition, RSNA-MICCAI Brain Tumor Radiogenomic Classification.

NOTE: The datasets use approximately 137 GB of storage. It will take some time to download and unzip the file.

Using the AWS CLI, copy the train directory to your S3 bucket, which is defined by your environment’s storage.location.base attribute.

For example, if the property storage.location.base has the value s3a://usermarketing-cdp-demo, copy the train folder using the command:

aws s3 cp train s3://usermarketing-cdp-demo/train --recursive
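Note that storage.location.base uses the s3a:// scheme, while the AWS CLI expects s3://. The following is a minimal sketch of that conversion, not part of the tutorial assets; the helper names s3_cli_uri and copy_command are hypothetical and only build the command string shown above.

```python
import shlex


def s3_cli_uri(storage_location_base: str, folder: str) -> str:
    """Turn an s3a:// (or s3://) base location into the s3:// URI the AWS CLI expects."""
    scheme, sep, rest = storage_location_base.partition("://")
    if not sep or scheme not in ("s3a", "s3"):
        raise ValueError(f"unexpected storage location: {storage_location_base!r}")
    return f"s3://{rest.rstrip('/')}/{folder.strip('/')}"


def copy_command(local_dir: str, storage_location_base: str) -> str:
    """Build the `aws s3 cp ... --recursive` command line as a single string."""
    target = s3_cli_uri(storage_location_base, local_dir)
    return shlex.join(["aws", "s3", "cp", local_dir, target, "--recursive"])


# Example bucket taken from this tutorial; replace it with your own value.
print(copy_command("train", "s3a://usermarketing-cdp-demo"))
```

The printed line can be pasted into a terminal as is.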

There are two variables in the file spark-etl.py that need to be updated. Their values are based on the S3 location where you stored the data:

S3_BUCKET = set to S3 bucket name. For example, usermarketing-cdp-demo
S3_INPUT_KEY_PREFIX = set to folder(s) location. For example, train
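As a quick sanity check before editing spark-etl.py, the sketch below shows how the two values relate to the S3 locations used in this tutorial. It assumes only what the tutorial states: the input lives under the bucket and prefix, and the output folder has _processed_images appended to the input folder name. The function resolve_locations is hypothetical and is not part of spark-etl.py.

```python
# Example values from this tutorial; replace them with your own.
S3_BUCKET = "usermarketing-cdp-demo"
S3_INPUT_KEY_PREFIX = "train"


def resolve_locations(bucket: str, input_prefix: str) -> dict:
    """Return the input and output S3 locations implied by the two settings."""
    if "://" in bucket or "/" in bucket:
        raise ValueError("S3_BUCKET should be the bare bucket name, without a scheme or path")
    prefix = input_prefix.strip("/")
    if not prefix:
        raise ValueError("S3_INPUT_KEY_PREFIX should name the folder that holds the data")
    return {
        "input": f"s3://{bucket}/{prefix}",
        # Assumption from the tutorial text: output folder = input folder + "_processed_images".
        "output": f"s3://{bucket}/{prefix}_processed_images",
    }


locations = resolve_locations(S3_BUCKET, S3_INPUT_KEY_PREFIX)
print(locations["input"])
print(locations["output"])
```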

Setup Cloudera Data Engineering

For this tutorial, we will use Cloudera Data Engineering.

Beginning from the Cloudera Home Page, select Data Engineering.

Enable Data Engineering Service

If your environment doesn’t already have a Data Engineering Service enabled, let’s enable it.

Click the option to enable a new Cloudera Data Engineering service
Name: data-engineering
Environment: <your environment name>
Workload Type: General - Medium

Make other configuration changes (optional)

Select Enable

Create Virtual Cluster

If you don’t already have a Data Engineering virtual cluster created, let’s create it.

Click the option to create a cluster
Cluster Name: DICOM-Spark-ETL
Data Engineering Service: data-engineering
Autoscale Max Capacity: CPU: 20, Memory: 160 GB
Spark Version: Spark 3.1.1
Select Restrict Access and add at least one user to the access list. Only provide access to users who will be running the job, as they will be able to see the AWS credentials.
Select Create

Submit a Spark Job

The prerequisites for this tutorial require you to have the Data Engineering CLI configured already. If you need help configuring it, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.

Create Data Engineering Resource

On the command line, issue the following commands to create a Data Engineering resource and upload the requirements.txt file to install required libraries in a new Python environment:

cde resource create --name rsna-etl --type python-env

cde resource upload --local-path requirements.txt --name rsna-etl

The Python environment will take a few minutes to build. You can issue the following command to check its status. Once the status is ready, the environment can be used and you can submit jobs.

cde resource list --filter 'name[rlike]rsna'
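If you prefer not to re-run the command by hand, the check can be scripted. The sketch below is an illustration under a stated assumption: that cde resource list prints a JSON array whose entries carry name and status fields. Verify this against the output of your CLI version. The function names resource_status and wait_until_ready are hypothetical.

```python
import json
import subprocess
import time


def resource_status(listing_json: str, name: str):
    """Return the status of the named resource, or None if it is not listed.

    Assumes the listing is a JSON array of objects with "name" and "status" keys.
    """
    for resource in json.loads(listing_json):
        if resource.get("name") == name:
            return resource.get("status")
    return None


def wait_until_ready(name: str, poll_seconds: int = 30, max_polls: int = 40) -> bool:
    """Poll `cde resource list` until the resource reports the status "ready"."""
    for _ in range(max_polls):
        result = subprocess.run(
            ["cde", "resource", "list", "--filter", f"name[rlike]{name}"],
            capture_output=True,
            text=True,
            check=True,
        )
        if resource_status(result.stdout, name) == "ready":
            return True
        time.sleep(poll_seconds)
    return False
```

Calling wait_until_ready("rsna-etl") would then block until the environment is built or the polling limit is reached.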

Run Spark Job

Now that we have our Python environment set up, let’s run the Spark job, spark-etl.py, to transform the DICOM files produced by the MRI into PNG images.

IMPORTANT: Restrict access to the virtual cluster only to users that are allowed to access the AWS credentials used in the job.

In the command prompt, create two (2) environment variables to hold your AWS credentials, which are needed to write the PNG images into S3.

AWS_ACCESS_KEY_ID='<your-AWS-access-key>'

AWS_SECRET_ACCESS_KEY='<your-AWS-secret-access-key>'

Run the job using the command:

cde spark submit --python-env-resource-name rsna-etl \
  --conf spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --conf spark.executorEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  spark-etl.py
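The same submission can be assembled from a script, which keeps the credentials out of the typed command line. This is a minimal sketch rather than an official wrapper: build_submit_command is a hypothetical helper that reproduces the flags of the command above from the two environment variables.

```python
import os
import subprocess

CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def build_submit_command(env, script="spark-etl.py", python_env="rsna-etl"):
    """Build the `cde spark submit` argument list, passing credentials to driver and executors."""
    missing = [var for var in CREDENTIAL_VARS if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    command = ["cde", "spark", "submit", "--python-env-resource-name", python_env]
    # Driver settings first, then executor settings, as in the command above.
    for prefix in ("spark.kubernetes.driverEnv", "spark.executorEnv"):
        for var in CREDENTIAL_VARS:
            command += ["--conf", f"{prefix}.{var}={env[var]}"]
    command.append(script)
    return command


if __name__ == "__main__" and all(os.environ.get(v) for v in CREDENTIAL_VARS):
    # Runs only when both variables are exported in the calling shell.
    subprocess.run(build_submit_command(os.environ), check=True)
```

For Python to see the two variables, they must be exported in the shell (for example with export AWS_ACCESS_KEY_ID), not just assigned.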

When the job completes, you can review the output using the following command, where # is the job ID:

cde run logs --type "driver/stdout" --id #


Finally, you can verify that the DICOM images have been transformed into PNG images using the following command. The files are located in the same S3 folder you specified, with _processed_images appended to the folder name.

aws s3 ls s3://usermarketing-cdp-demo/train_processed_images --recursive
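For a large dataset the listing is long, so it can help to count the PNG files instead of reading it. The sketch below parses the text printed by aws s3 ls --recursive, where each line holds a date, a time, a size, and a key. The function png_keys is hypothetical, and the sample keys are made up for illustration; they are not real output from this job.

```python
def png_keys(listing: str) -> list:
    """Return the object keys ending in .png from `aws s3 ls --recursive` output."""
    keys = []
    for line in listing.splitlines():
        # Each line: date, time, size, key. The key may itself contain spaces.
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[3].lower().endswith(".png"):
            keys.append(parts[3])
    return keys


# Hypothetical listing, used only to demonstrate the parsing.
SAMPLE_LISTING = """\
2021-08-01 10:15:00     123456 train_processed_images/00000/FLAIR/Image-1.png
2021-08-01 10:15:01     123789 train_processed_images/00000/FLAIR/Image-2.png
2021-08-01 10:15:02          0 train_processed_images/_SUCCESS
"""

print(len(png_keys(SAMPLE_LISTING)))
```

To use it on real data, capture the output of the aws s3 ls command above (for example with subprocess.run and capture_output=True) and pass its stdout to png_keys.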

Summary

Congratulations on completing the tutorial.