DS-200 Study Guide

Begin Your Journey to Data Science

Recommended Cloudera Training Courses


Practice Test

DS-200 Practice Test Subscription


Online Resources

  • New to Data Science: tutorials, papers, meetups, books, etc.
  • Data Processing & Analytics: Hadoop resources and materials listed by function
  • New to Hadoop: introductory topics from Cloudera’s Developer Center
  • Quora.com: Data Science topic

Helpful Books


Useful Blogs


Exam Sections


Data Acquisition

Objectives

  • Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents.
  • Deploy a variety of data acquisition techniques, including database integration and working with APIs
  • Use command line tools such as wget and curl
  • Use Hadoop tools such as Sqoop and Flume

Study Resources


Data Evaluation

Objectives

  • Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
  • Methods for working with various file formats including binary files, JSON, XML, and .csv
  • Tools, techniques, and utilities for evaluating data from the command line and at scale
  • An understanding of sampling and filtering techniques
  • A familiarity with Hadoop SequenceFiles and serialization using Avro
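
The sampling objective above can be illustrated with reservoir sampling, which draws a uniform random sample from a stream whose length is not known in advance. This is a minimal sketch only: the function name and the seed parameter are illustrative and are not taken from the exam materials.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Only k records are held in memory, so the input can be a large file
    or a stream of unknown length.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

For example, reservoir_sample(open("events.log"), 1000) would keep 1,000 lines of a hypothetical log file without reading the whole file into memory.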

Study Resources


Data Transformation

Objectives

  • Write a map-only Hadoop Streaming job
  • Write a script that receives records on stdin and writes them to stdout
  • Invoke Unix tools to convert file formats
  • Join data sets
  • Write scripts to anonymize data sets
  • Write a Mapper using Python and invoke via Hadoop streaming
  • Write a custom subclass of FileOutputFormat
  • Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
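
Several of the objectives above (a script that reads stdin and writes stdout, a Python Mapper, anonymizing a data set) can be combined in one small sketch. It assumes tab-separated records whose first field is a user identifier; the salt value, the field position, and the function names are all hypothetical. Note that a salted hash is pseudonymization, which may not be sufficient anonymization for every data set.

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # hypothetical value; keep the real one private
ID_FIELD = 0  # assumption: the identifier is the first tab-separated field


def anonymize_record(line, salt=SALT, id_field=ID_FIELD):
    """Replace the identifier field of one tab-separated record with a salted hash."""
    fields = line.rstrip("\n").split("\t")
    if id_field < len(fields):
        digest = hashlib.sha256((salt + fields[id_field]).encode("utf-8")).hexdigest()
        fields[id_field] = digest[:16]  # shortened for readability
    return "\t".join(fields)


def map_stream(stdin, stdout):
    """Read records from stdin, skip blank lines, write anonymized records to stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(anonymize_record(line) + "\n")


# When saved as a standalone mapper script, finish with:
#     import sys
#     map_stream(sys.stdin, sys.stdout)
```

Hadoop Streaming passes each input record to the script on stdin and collects whatever the script writes to stdout. A job becomes map-only when the number of reduce tasks is set to zero, for example with the job property mapreduce.job.reduces=0 on recent Hadoop versions.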

Study Resources


Machine Learning Basics

Objectives

  • Understand how to use Mappers and Reducers to create predictive models
  • Understand the different kinds of machine learning, including supervised and unsupervised learning
  • Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems

Study Resources


Clustering

Objectives

  • Define clustering and identify appropriate use cases
  • Identify appropriate uses of various models including centroid, distribution, density, group, and graph
  • Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
  • Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
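
The three similarity metrics named above can be written out directly. The sketch below uses plain Python lists of equal length; the function names are illustrative. Block distance is also called city block or Manhattan distance.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """Sum of absolute coordinate differences (city block / Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation in [-1, 1]; returns 0.0 if either vector is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined here; 0.0 is a convention of this sketch
    return cov / math.sqrt(var_a * var_b)
```

The two distances grow as vectors move apart, whereas Pearson correlation measures whether two vectors rise and fall together regardless of scale, which is why it is often preferred when users rate on different scales.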

Study Resources

  • Programming Collective Intelligence: Chapter 3
  • Algorithms of the Intelligent Web: Chapter 4
  • Mahout In Action: Part 2

Classification

Objectives

  • Describe the steps for training a model on known (labeled) data in order to classify new data
  • Identify the use cases for logistic regression and Bayes' theorem
  • Define classification techniques and formulas
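
As a small worked case of Bayes' theorem in a classification setting, the sketch below computes the probability that a message belongs to a class given that a particular feature was observed. The function name and the numbers in the usage note are made up for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(class | feature) from Bayes' theorem for a two-class problem.

    prior                -- P(class)
    likelihood           -- P(feature | class)
    likelihood_given_not -- P(feature | not class)
    """
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the feature has zero probability under both classes")
    return likelihood * prior / evidence
```

With an assumed prior of 0.2 for spam, a word that appears in 50% of spam and 5% of non-spam gives posterior(0.2, 0.5, 0.05), which is 0.1 / 0.14, or roughly 0.71.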

Study Resources

  • Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
  • Algorithms of the Intelligent Web: Chapters 5, 6
  • Mahout In Action: Part 3

Collaborative Filtering

Objectives

  • Identify the use of user-based and item-based collaborative filtering techniques
  • Describe the limitations and strengths of collaborative filtering techniques
  • Given a scenario, determine the appropriate collaborative filtering implementation
  • Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
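
Two commonly used accuracy metrics for recommender systems are root mean squared error (for predicted ratings) and precision at k (for ranked recommendation lists). The sketch below is one plausible way to compute them; the function names are illustrative, and these are not necessarily the only metrics the exam covers.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user found relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k
```

RMSE suits scenarios where the system predicts a numeric rating, while precision at k suits scenarios where only the quality of a short list of suggestions matters.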

Study Resources


Model/Feature Selection

Objectives

  • Describe the role and function of feature selection
  • Analyze a scenario and determine the appropriate features and attributes to select
  • Analyze a scenario and determine the methods to deploy for optimal feature selection

Study Resources

  • Programming Collective Intelligence: Chapter 10
  • Pattern Recognition and Machine Learning: Chapter 1.3

Probability

Objectives

  • Analyze a scenario and determine the likelihood of a particular outcome
  • Determine sample percentiles
  • Determine a range of items based on a sample probability density function
  • Summarize a distribution of sample numbers
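
Sample percentiles and a basic summary of a distribution can be sketched with the standard library. The percentile function below uses the nearest-rank definition, which is one of several conventions in use; other tools interpolate between values and may give slightly different answers. The function names are illustrative.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Basic summary statistics; needs at least two values for the sample stdev."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }
```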

Study Resources


Visualization

Objectives

  • Determine the most effective visualization for a given problem
  • Analyze a data visualization and interpret its meaning

Study Resources


Optimization

Objectives

  • Understand optimization methods
  • Identify first-order and second-order optimization techniques
  • Determine the learning rate for a particular algorithm
  • Determine the sources of errors in a model
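
The difference between first-order and second-order methods, and the effect of the learning rate, can be seen on a one-dimensional toy problem. The sketch below minimizes f(x) = (x - 3)^2, a function chosen only for illustration. Gradient descent uses the first derivative and a learning rate, while Newton's method also uses the second derivative.

```python
def gradient(x):
    """First derivative of f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def second_derivative(x):
    """Second derivative of f(x) = (x - 3)^2, which is constant."""
    return 2.0


def gradient_descent(x0, learning_rate, steps):
    """First-order method: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


def newton_step(x):
    """Second-order method: scale the step by the inverse curvature."""
    return x - gradient(x) / second_derivative(x)
```

On this function, gradient_descent(0.0, 0.1, 100) ends very close to the minimum at 3, while a learning rate above 1.0 makes each step overshoot by more than the previous error, so the iterates diverge. A single newton_step(0.0) lands on the minimum because the function is exactly quadratic; on general functions Newton's method needs several steps and a well-behaved second derivative.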

Study Resources

  • Leon Bottou on stochastic learning from Advanced Lectures on Machine Learning
  • Leon Bottou on online algorithms and stochastic approximations
  • Programming Collective Intelligence: Chapter 5
  • Data-Intensive Text Processing with MapReduce: Chapter 6