
Cloudera Tutorials


 

Introduction

 

Experience the benefits of a hybrid cloud solution, which gives you access to many resources, including NVIDIA GPUs. Explore how you can leverage NVIDIA’s RAPIDS framework using Cloudera Machine Learning on Cloudera’s data platform. Harness GPU power and see significant speed improvements over commonly used machine learning libraries such as pandas, NumPy, and scikit-learn in both data preprocessing and model training.

 

Prerequisites

 

 


Download Assets

 

There are two options for getting the assets for this tutorial:

  1. Download a ZIP file

It contains only the files needed for this tutorial. Remember its location. There is no need to unzip the file.

  2. Clone our GitHub repository

It provides the assets used in this and other tutorials, organized by tutorial title.

 

 

Set Up Cloudera Machine Learning

 

Provision Machine Learning Workspace

 

If your environment doesn’t already have a Machine Learning Workspace provisioned, let’s provision it.

Select Machine Learning from the Cloudera home page.

 

In the ML Workspaces section, select Provision Workspace.

Only two pieces of information are needed to provision an ML workspace: the workspace name and the environment name. For example:

  1. Workspace Name: cml-tutorial
  2. Environment: <your environment name>
  3. Select Provision Workspace

NOTE: You may need to enable GPU usage under Advanced Options.

 


 

Create Resource Profile

 

Resource profiles define how many vCPUs and how much memory Cloudera AI will reserve for a particular workload (for example, session, job, model). You must have MLAdmin role access to create a new resource profile.

Let’s create a new resource profile.

Beginning from the ML Workspaces section, open your workspace by selecting its name, cml-tutorial.

In the Site Administration section, select Runtime/Engine.

Create a new resource profile using the following information:

  1. vCPU: 4
  2. Memory (GiB): 32
  3. Select Add

 


 

Create Project

 

Beginning from the ML Workspaces section, open your workspace by selecting its name, cml-tutorial.

Select New Project.

Complete the New Project form using:

  1. Project Name: Fare Prediction
  2. Project Description:
    A project showcasing speed improvements using RAPIDS framework.
  3. Initial Setup: Local Files
    Upload or drag and drop the tutorial-files.zip file you downloaded earlier
  4. Select Create Project

 


 

Create Session

 

Beginning from the Projects section, select the project name, Fare Prediction.

Select New Session and complete the session form:

  1. Session Name: cml-rapids
  2. Editor: JupyterLab
  3. Kernel: Python 3.7
  4. Edition: RAPIDS
  5. Resource Profile: 4 vCPU / 32 GiB Memory, 1 GPU
  6. Select Start Session
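Once the session is running, it can be useful to confirm that a GPU is actually visible before running the notebook. The sketch below is not part of the tutorial assets; it assumes the nvidia-smi command-line tool is available on the PATH of the RAPIDS runtime image, and the helper name gpu_summary is hypothetical.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per visible GPU (name, total memory), or None if none is found.

    Assumes the NVIDIA driver tools (nvidia-smi) are installed in the session image.
    """
    if shutil.which("nvidia-smi") is None:
        return None  # tool not installed, so no GPU can be reported
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code usually means the driver could not reach a GPU.
    return result.stdout.strip() if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary else "No GPU visible to this session")
```

If the message reports no GPU, check that the resource profile selected for the session includes 1 GPU.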

 


 

Run Program in Jupyter Notebook

 

Let’s open the file gpu_fare_prediction.ipynb by double-clicking the filename.

The default, mode = 'cpu', runs the program using only the CPU.

Without making any changes to the program, run all the cells by selecting Kernel > Restart Kernel and Run All Cells..., then click Restart.

The majority of the time is spent calculating the Haversine formula to determine the great-circle distance between pickup/drop-off locations.
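For readers unfamiliar with the Haversine formula, here is a minimal, CPU-only sketch of the calculation together with a simple timing wrapper. It is not the notebook's code: the function name haversine_km, the Earth radius constant, and the sample coordinates are assumptions chosen for illustration, and the notebook most likely applies the formula to whole DataFrame columns rather than one trip at a time.

```python
import math
import time

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon) rows.
trips = [
    (40.7580, -73.9855, 40.6413, -73.7781),
    (40.7128, -74.0060, 40.7128, -74.0060),  # identical points, distance 0
]

start = time.perf_counter()
distances = [haversine_km(*trip) for trip in trips]
elapsed = time.perf_counter() - start

print(distances)
print(f"elapsed: {elapsed:.6f} s")
```

Because the formula involves several trigonometric calls per row, it becomes the dominant cost when applied to millions of trips, which is why it is a good candidate for GPU acceleration.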

 


Now make one simple change to run the program on the GPU: set mode = 'gpu'. Then run all the cells again by selecting Kernel > Restart Kernel and Run All Cells..., then click Restart.

Notice that the time spent calculating the Haversine formula is now insignificant, because the RAPIDS framework runs the calculation in parallel on the GPU.

 


What changed? If you look at the import statements in the Program Initialization section, you’ll notice that we bound the GPU libraries to the same names as the CPU libraries.
For example, instead of import pandas as pd, we used import cudf as pd when mode = 'gpu'.
The syntax of the two libraries is so similar that in most cases you can simply swap out your common machine learning libraries and achieve up to 60x performance improvements.
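The aliasing idea can be sketched as follows. This is an illustration, not the notebook's actual initialization code: only the pandas to cudf swap is confirmed by the tutorial, while the numpy to cupy pairing and the helper names module_names and load_libraries are assumptions.

```python
import importlib

# Alias -> module name for each mode.
# Only "pd": pandas/cudf comes from the tutorial; the "np" pairing is an assumption.
LIBRARY_NAMES = {
    "cpu": {"pd": "pandas", "np": "numpy"},
    "gpu": {"pd": "cudf", "np": "cupy"},
}


def module_names(mode):
    """Return the alias-to-module mapping for 'cpu' or 'gpu'."""
    if mode not in LIBRARY_NAMES:
        raise ValueError(f"mode must be 'cpu' or 'gpu', got {mode!r}")
    return LIBRARY_NAMES[mode]


def load_libraries(mode):
    """Import the modules for the given mode and return them keyed by alias."""
    return {alias: importlib.import_module(name)
            for alias, name in module_names(mode).items()}


mode = "cpu"
print(module_names(mode))
```

In a session where the libraries are installed, calling libs = load_libraries(mode) and then pd = libs["pd"] has the same effect as the conditional import pandas as pd or import cudf as pd: the rest of the program keeps using the pd name regardless of which backend is loaded.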

The places where the syntax differs between the CPU libraries and the RAPIDS GPU libraries are marked by if statements in the notebook. There are only two of them: the first is when importing date-type columns from a CSV file, and the second is when applying a user-defined function.

 

 

Summary

 

Congratulations on completing the tutorial.

As you’ve now experienced, NVIDIA’s RAPIDS framework has a look and feel familiar from other common machine learning libraries. This is great because no major code rewrite is needed to obtain the benefits of the RAPIDS libraries. It allowed us to harness GPU power and see significant speed improvements compared to other commonly used machine learning libraries.
