Data Science Challenge 1
Web Analytics: Classification, Clustering, and Collaborative Filtering
Class of 2013
These five CCP: Data Scientists have demonstrated their skills in working with Big Data at an elite level. Candidates must prove their abilities under real-world conditions by designing and developing a production-ready data science solution that is peer-evaluated for its accuracy, scalability, and robustness.
Data Science Challenge Solution Kit
The Web Analytics Challenge Solution Kit is your best resource to get hands-on experience with a real-world data science challenge in a self-paced, learner-centric environment. It includes a live data set, a step-by-step tutorial, and a detailed explanation of the processes required to arrive at the correct outcomes.
The Web Analytics Challenge ran from July 15, 2013 to September 30, 2013 and is now closed. The next Challenge will launch in Q1 2014. You must pass Data Science Essentials (DS-200) prior to registering for the next Challenge.
Machine-generated data is one of the primary data sources classically labeled as Big Data, and the log files generated by web servers and web applications are a significant source of modern machine-generated data. Locked within these log files is a wealth of information on user behavior and preferences. For an online retailer, unlocking that information can be a significant competitive advantage. For companies that sell online services, like online game developers or video-on-demand providers, the data in those log files represent their lifeblood. Understanding their users and being able to predict their needs and actions can be the difference between success and failure.
Cloudera Movies is a fledgling video-on-demand site. It has recently grown from a few hundred users to tens of thousands of users, and the company is now at a critical juncture. Because the user base has grown beyond its initial, well-understood core, the company must understand its new users in order to market to them more effectively and adapt its services to their needs.
In this Challenge, you will have access to Cloudera Movies’s raw application log files. The company has tasked you with creating a picture of the user base and building a recommendation engine that accurately models its customers’ preferences. The challenge has three parts. First, based on only the log files, the Cloudera Movies legal team wants to understand which of its user accounts are used most often with parental controls enabled and what content those accounts actually view. Second, the Cloudera Movies product team wants to segment sessions based on the actions that users take in order to improve the site's usability. Third, the Cloudera Movies product team wants a recommendation engine they can deploy to their site to help drive users to the content they will like, in an effort to increase time on site and reduce churn.
This challenge consists of three parts, all using the same data set.
- A binary classification problem
- A sessionization problem
- A user prediction problem
While it is possible to complete the challenge using only local scripts and tools, submissions will also be evaluated against a scalability criterion. To get the best score, submissions should take advantage of a clustered execution environment.
Each submission must be a complete ‘data product,’ meaning that it should be a utility that accepts data in the formats specified for this challenge and produces results that conform to the challenge results requirements. (For this challenge, you may have a separate utility for each part of the challenge.) Cloudera must be able to run your utility (or utilities) in the Cloudera execution environment. The Cloudera execution environment will have all of the same components as the virtual machine image, installed in the same locations.
Submissions will be scored against three criteria:
- Accuracy: for each part of the challenge, each submission will be scored against the known correct results:
  - Classification: submissions will get 1 point for each correct classification.
  - Clustering: submissions will get 1 point for each pair of points correctly placed into the same cluster.
  - Recommendations: submissions will be scored by RMSE.
- Scalability: each submission will be evaluated on how well the implemented solution performs against very large data sets and in a clustered environment.
- Robustness: each submission will be evaluated on its tolerance for noisy or bad data inputs.
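The three accuracy measures above can be sketched in a few lines of Python. This is only an illustration of one plausible reading of the criteria, not the official scoring code: the function names, the dictionary-based inputs (item ID mapped to label, cluster, or rating), and the choice to score only items present in both inputs are all assumptions made for the example.

```python
import math
from itertools import combinations


def classification_points(predicted, actual):
    """One point per item whose predicted label equals the known label."""
    return sum(
        1
        for key, label in predicted.items()
        if key in actual and actual[key] == label
    )


def clustering_points(predicted, actual):
    """One point per pair of items that share a cluster in BOTH clusterings.

    Cluster names do not need to match between the two inputs; only
    co-membership of each pair is compared (an assumed interpretation).
    """
    shared = set(predicted) & set(actual)
    return sum(
        1
        for a, b in combinations(shared, 2)
        if predicted[a] == predicted[b] and actual[a] == actual[b]
    )


def rmse(predicted, actual):
    """Root mean squared error over items present in both inputs."""
    shared = set(predicted) & set(actual)
    if not shared:
        raise ValueError("no overlapping items to score")
    squared_error = sum((predicted[k] - actual[k]) ** 2 for k in shared)
    return math.sqrt(squared_error / len(shared))
```

Note that a lower RMSE is better, while the two point-based measures reward higher counts. The pairwise clustering count grows quadratically with the number of items, so a large-scale scorer would likely count pairs per cluster intersection rather than enumerate them.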
Full submission guidelines are given to each participant after entry. All submissions will be evaluated and scored between October 1 and October 28, when we will announce CCP: Data Scientist status to those who pass.
Challenge participants will be given:
- The data science challenge and objectives
- Submission guidelines
- Evaluation criteria
- Challenge data set: web application log files in JSON format
- Virtual Machine image with the following:
  - 64-bit CentOS 6.4 with 1 GB RAM and a maximum hard disk capacity of 128 GB (candidates may customize)
  - CDH4.3, Impala, Hive, Pig, Apache Crunch, ClouderaML, R, Octave, Scala, Python, NumPy, SciPy
- This virtual machine gives you a basic environment for the challenge. If you want to add other tools, or modify the environment to fit your way of working, you are completely free to do so. You have full root access. We do not provide support for customizing the environment, however, as we consider the ability to set up, customize, and maintain your working environment one of the tools of the data scientist’s trade.
- Participants are not required to use the VM or any specific environment. You may use whatever environment you choose. You will benefit from using Hadoop tools if you’re proficient with them.
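Because the data set consists of JSON log files and submissions are judged on their tolerance for bad input, a first processing step might look like the sketch below. It assumes one JSON object per line, which the challenge description does not confirm, and the function name `parse_log_lines` and the field names in the usage example are hypothetical.

```python
import json


def parse_log_lines(lines):
    """Parse an iterable of text lines into JSON objects, skipping bad input.

    Returns (records, skipped), where `skipped` counts lines that were not
    valid JSON or did not decode to a JSON object. Blank lines are ignored
    and not counted.
    """
    records = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        records.append(record)
    return records, skipped


if __name__ == "__main__":
    # Hypothetical sample lines; the real log schema is not described here.
    sample = ['{"user": "u1", "action": "play"}', "not json", '{"user": "u2"}']
    records, skipped = parse_log_lines(sample)
    print(len(records), "records parsed;", skipped, "lines skipped")
```

Counting skipped lines rather than failing on the first bad one keeps a long-running job alive while still making data quality visible. The same per-line logic could be moved into a mapper when running on a cluster.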
Individual Contributions Only
You must participate in this challenge only on an individual basis; teams are not permitted.
Any sharing of code or solutions or collaboration with another person or entity is strictly forbidden.
You may use any tools or software you desire to complete the challenge.