
Cloudera Tutorials


 

Introduction

 

This tutorial is inspired by the Kaggle competition RSNA-MICCAI Brain Tumor Radiogenomic Classification. We will use Cloudera Data Engineering on Cloudera to transform the DICOM files produced by an MRI into PNG images.

In a future tutorial, we will use the PNG images to train a machine learning model to detect the presence of a protein found in certain brain cancers.

 

 

Prerequisites

 

 


 

 

Download Assets

 

There are two (2) options for getting the assets for this tutorial:

  1. Download a ZIP file

It contains only the files necessary for this tutorial. Unzip tutorial-files.zip and remember its location.

  2. Clone our GitHub repository

It provides assets used in this and other tutorials, organized by tutorial title.

 

In addition to the files above, you will also need to download the test and train datasets from the Kaggle competition, RSNA-MICCAI Brain Tumor Radiogenomic Classification.

NOTE: The datasets use approximately 137 GB of storage. It will take some time to download and unzip the file.

 

Using the AWS CLI, copy the train directory to your S3 bucket, which is defined by your environment’s storage.location.base attribute.

For example, if the property storage.location.base has the value s3a://usermarketing-cdp-demo, copy the train folder using the command:

aws s3 cp train s3://usermarketing-cdp-demo/train --recursive
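
Note that the property value uses the s3a:// scheme, while the AWS CLI command uses s3://. As a minimal sketch of that mapping, the hypothetical helper below (build_copy_command is an illustrative name, not part of the tutorial assets) builds the copy command from a storage.location.base value:

```python
from urllib.parse import urlparse


def build_copy_command(storage_location_base: str, local_dir: str = "train") -> str:
    """Build the `aws s3 cp` command for a storage.location.base value.

    Hypothetical helper for illustration; it is not part of the tutorial assets.
    """
    parsed = urlparse(storage_location_base)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(f"expected an s3:// or s3a:// URI, got {storage_location_base!r}")
    # Join any base path with the local directory name, avoiding doubled slashes.
    base_path = parsed.path.strip("/")
    key = f"{base_path}/{local_dir}" if base_path else local_dir
    # The AWS CLI expects the s3:// scheme, so s3a:// is rewritten here.
    return f"aws s3 cp {local_dir} s3://{parsed.netloc}/{key} --recursive"


print(build_copy_command("s3a://usermarketing-cdp-demo"))
# prints: aws s3 cp train s3://usermarketing-cdp-demo/train --recursive
```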

 


 

There are two (2) variables in the file spark-etl.py that need to be updated. The values are based on the S3 location where you stored the data:

  • S3_BUCKET = set to S3 bucket name. For example, usermarketing-cdp-demo
  • S3_INPUT_KEY_PREFIX = set to folder(s) location. For example, train
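
To illustrate how the two values relate to the S3 location, here is a small sketch that splits an S3 URI into a bucket name and a key prefix. The function name split_s3_uri is an assumption made for this example and does not come from spark-etl.py:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, key_prefix).

    Illustrative helper only; spark-etl.py expects you to set the values by hand.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the host part; the prefix is the path without outer slashes.
    return parsed.netloc, parsed.path.strip("/")


S3_BUCKET, S3_INPUT_KEY_PREFIX = split_s3_uri("s3://usermarketing-cdp-demo/train")
print(S3_BUCKET)            # prints: usermarketing-cdp-demo
print(S3_INPUT_KEY_PREFIX)  # prints: train
```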

 


 

Set Up Cloudera Data Engineering

 

For this tutorial, we will use Cloudera Data Engineering.

Beginning from the Cloudera Home Page, select Data Engineering.

 


 

Enable Data Engineering Service

 

If your environment doesn’t already have a Data Engineering Service enabled, let’s enable it.

  1. Click the icon to enable a new Cloudera Data Engineering service
  2. Name: data-engineering
  3. Environment: <your environment name>
  4. Workload Type: General - Medium
  5. Make other configuration changes (optional)
  6. Select Enable

 


 

Create Virtual Cluster

 

If you don’t already have a Data Engineering virtual cluster created, let’s create it.

  1. Click the icon to create a cluster
  2. Cluster Name: DICOM-Spark-ETL
  3. Data Engineering Service: data-engineering
  4. Autoscale Max Capacity: CPU: 20, Memory: 160 GB
  5. Spark Version: Spark 3.1.1
  6. Select Restrict Access and add at least one (1) user to the access list. Only provide access to users who will be running the job, as they will be able to see the AWS credentials.
  7. Select Create

 


 

Submit a Spark Job

 

The prerequisites for this tutorial require that you already have the Data Engineering CLI configured. If you need help configuring it, take a look at Using CLI-API to Automate Access to Cloudera Data Engineering.

 

 

Create Data Engineering Resource

 

On the command line, issue the following commands to create a Data Engineering resource and upload the requirements.txt file to install required libraries in a new Python environment:

cde resource create --name rsna-etl --type python-env

cde resource upload --local-path requirements.txt --name rsna-etl

 

The Python environment will take a few minutes to build. You can issue the command below to check its status. Once the status becomes ready, you can submit jobs that use the environment.

cde resource list --filter 'name[rlike]rsna'
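
If you want to script this check, the sketch below shows one way to test for readiness. It assumes that the listing is available as a JSON array of objects with name and status fields; that shape, and the sample text, are assumptions made for illustration and are not taken from real CLI output, so compare them with what your CLI version prints:

```python
import json


def resource_is_ready(listing_json: str, name: str) -> bool:
    """Return True if the named resource reports a 'ready' status.

    Assumes a JSON array of objects with "name" and "status" keys
    (an assumption for this sketch, not a documented guarantee).
    """
    for resource in json.loads(listing_json):
        if resource.get("name") == name:
            return str(resource.get("status", "")).lower() == "ready"
    # The resource is not in the listing at all.
    return False


# Made-up sample text, used only to exercise the function.
sample = '[{"name": "rsna-etl", "type": "python-env", "status": "building"}]'
print(resource_is_ready(sample, "rsna-etl"))  # prints: False
```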

 


 

Run Spark Job

 

Now that we have our Python environment set up, let’s run the Spark job, spark-etl.py, to transform the DICOM files produced by the MRI into PNG images.
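
The contents of spark-etl.py are not reproduced in this tutorial. As background, one step that a DICOM-to-PNG conversion commonly needs is intensity rescaling, because MRI pixel data is usually stored with more than 8 bits per pixel, while an 8-bit grayscale PNG holds values from 0 to 255. The sketch below assumes simple min-max scaling on a plain list of rows; the actual script may use a different method and other libraries:

```python
from typing import List, Sequence


def rescale_to_uint8(pixels: Sequence[Sequence[float]]) -> List[List[int]]:
    """Min-max scale a 2-D grid of raw intensities into the range 0..255.

    Illustrative only; this is an assumed approach, not code from spark-etl.py.
    """
    flat = [value for row in pixels for value in row]
    if not flat:
        # Nothing to scale; keep the (empty) row structure.
        return [[] for _ in pixels]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image has no contrast to stretch, so map it to black.
        return [[0 for _ in row] for row in pixels]
    return [[int(round((value - lo) * 255 / (hi - lo))) for value in row] for row in pixels]


# A tiny 2x2 "slice" with 12-bit intensities.
print(rescale_to_uint8([[0, 2048], [1024, 4095]]))  # prints: [[0, 128], [64, 255]]
```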

IMPORTANT: Restrict access to the virtual cluster only to users that are allowed to access the AWS credentials used in the job.

 

At the command prompt, create two (2) environment variables to hold your AWS credentials, which are needed to write the PNG images to S3.

AWS_ACCESS_KEY_ID='<your-AWS-access-key>'

AWS_SECRET_ACCESS_KEY='<your-AWS-secret-access-key>'

Run the job using the command:

cde spark submit --python-env-resource-name rsna-etl \
  --conf spark.kubernetes.driverEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.driverEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  --conf spark.executorEnv.AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  spark-etl.py
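
If you prefer to launch the job from a script, the same command can be assembled as an argument list. This is a sketch under stated assumptions: build_submit_args is a hypothetical helper, and the credential values shown are placeholders. It only mirrors the options used in the command above:

```python
from typing import List, Mapping

CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def build_submit_args(env: Mapping[str, str],
                      script: str = "spark-etl.py",
                      python_env: str = "rsna-etl") -> List[str]:
    """Assemble the `cde spark submit` arguments used in this tutorial."""
    missing = [var for var in CREDENTIAL_VARS if not env.get(var)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    args = ["cde", "spark", "submit", "--python-env-resource-name", python_env]
    # Pass both credentials to the driver first, then to the executors.
    for prefix in ("spark.kubernetes.driverEnv", "spark.executorEnv"):
        for var in CREDENTIAL_VARS:
            args += ["--conf", f"{prefix}.{var}={env[var]}"]
    args.append(script)
    return args


# Placeholder credentials, for demonstration only.
demo_env = {"AWS_ACCESS_KEY_ID": "example-key-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(build_submit_args(demo_env)[:5])
# prints: ['cde', 'spark', 'submit', '--python-env-resource-name', 'rsna-etl']
```

To run it for real, you could pass the list to subprocess.run(build_submit_args(os.environ), check=True). In that case the two variables must be exported in your shell, since a plain assignment is not visible to child processes.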

 

When the job completes, you can review the output using the following command, replacing <job-id> with the job ID:

cde run logs --type "driver/stdout" --id <job-id>

 


 

Finally, you can verify that the DICOM images have been transformed into PNG images using the following command. The files are located in the same S3 folder you specified, with _processed_images appended to the folder name.

aws s3 ls s3://usermarketing-cdp-demo/train_processed_images --recursive
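
The naming rule described above can be written down as a short sketch. The helper names are hypothetical; only the _processed_images suffix and the command shape come from this tutorial:

```python
def processed_prefix(input_key_prefix: str) -> str:
    """Return the output folder name: the input prefix plus '_processed_images'."""
    # Drop a trailing slash so the suffix attaches to the folder name itself.
    return input_key_prefix.rstrip("/") + "_processed_images"


def build_verify_command(bucket: str, input_key_prefix: str) -> str:
    """Build the `aws s3 ls` command that lists the converted PNG files."""
    return f"aws s3 ls s3://{bucket}/{processed_prefix(input_key_prefix)} --recursive"


print(build_verify_command("usermarketing-cdp-demo", "train"))
# prints: aws s3 ls s3://usermarketing-cdp-demo/train_processed_images --recursive
```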

 

 

Summary

 

Congratulations on completing the tutorial.

In a future tutorial, we will use the PNG images to train a machine learning model to detect the presence of a protein found in certain brain cancers.

 
