This is the documentation for CDH 4.7.0.

Managing Hadoop API Dependencies in CDH4

In CDH3, all of the Hadoop API implementations were confined to a single JAR file (hadoop-core) plus a few of its dependencies. It was relatively straightforward to make sure that classes from these JAR files were available at runtime.

CDH4 is more complex: it bundles both MRv1 and MRv2 (YARN). To simplify things, CDH4 provides a new, Maven-based way of managing client-side Hadoop API dependencies, so you no longer have to work out the exact names and locations of all the JAR files needed to provide the Hadoop APIs.

In CDH4, Cloudera recommends that you use a hadoop-client artifact for all clients, instead of managing JAR-file-based dependencies manually.

Flavors of the hadoop-client Artifact

There are two different flavors of the hadoop-client artifact: a Maven-based Project Object Model (POM) artifact and a Linux package, hadoop-client. The former lets you manage Hadoop API dependencies at both compile and run time for your Maven- or Ivy-based projects; the latter provides a familiar interface in the form of a collection of JAR files that can be added to your classpath directly.

Versions of the hadoop-client Artifact

CDH4 provides two distinct versions of the hadoop-client artifact: one for MRv1 and one for MRv2 (YARN). If you're using the Maven-based POM hadoop-client artifact, you can use the version string to distinguish between them: 2.0.0-mr1-cdh4.0.0 for MRv1 APIs and 2.0.0-cdh4.0.0 for YARN. If you're using the Linux package, you can distinguish by the location of the JAR files: /usr/lib/hadoop/client-0.20 for MRv1 APIs and /usr/lib/hadoop/client for YARN.
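
As a quick reference, the mapping above can be captured in a small shell function. This is only a sketch: hadoop_client_info is a hypothetical helper name, not something shipped with CDH, and the values it prints are simply the version strings and directories quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CDH): print the Maven version string and
# the Linux-package JAR directory for a hadoop-client flavor, "mr1" or "yarn".
# The values are the ones quoted in the surrounding text.
hadoop_client_info() {
    case "$1" in
        mr1)
            echo "version=2.0.0-mr1-cdh4.0.0"
            echo "jar_dir=/usr/lib/hadoop/client-0.20"
            ;;
        yarn)
            echo "version=2.0.0-cdh4.0.0"
            echo "jar_dir=/usr/lib/hadoop/client"
            ;;
        *)
            echo "usage: hadoop_client_info mr1|yarn" >&2
            return 1
            ;;
    esac
}

# Example: show the settings for the MRv1 flavor.
hadoop_client_info mr1
```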

  Important:

Make sure that one and only one version of the hadoop-client artifact is available to your project. Mixing MRv1 and YARN hadoop-client artifacts in the same application could lead to failures that are hard to debug.
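
If you use the Linux package, one way to catch an accidental mix early is to inspect the classpath string before launching the JVM. The following is a minimal sketch, assuming the classpath is colon-separated and refers to the package directories by their standard locations; check_hadoop_client_classpath is a hypothetical helper name, and the check does not detect JAR files that were copied elsewhere or pulled in through Maven or Ivy.

```shell
#!/bin/sh
# Hypothetical helper (not part of CDH): return non-zero if a colon-separated
# classpath references both the MRv1 and the YARN hadoop-client directories.
check_hadoop_client_classpath() {
    cp=$1
    has_mr1=0
    has_yarn=0
    # Wrap the string in ':' so every entry is delimited on both sides.
    case ":$cp:" in
        *:/usr/lib/hadoop/client-0.20:*|*:/usr/lib/hadoop/client-0.20/*) has_mr1=1 ;;
    esac
    case ":$cp:" in
        *:/usr/lib/hadoop/client:*|*:/usr/lib/hadoop/client/*) has_yarn=1 ;;
    esac
    if [ "$has_mr1" -eq 1 ] && [ "$has_yarn" -eq 1 ]; then
        echo "classpath mixes MRv1 and YARN hadoop-client JAR files" >&2
        return 1
    fi
    return 0
}

# Example: an MRv1-only classpath passes the check.
check_hadoop_client_classpath '/usr/lib/hadoop/client-0.20/*:/opt/myapp/app.jar'
```

The path /opt/myapp/app.jar is a placeholder for your own application JAR file.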

Using hadoop-client for Maven-based Java Projects

Make sure you add the following dependency specification to your pom.xml file:

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>VERSION</version>
  </dependency>

where VERSION is either 2.0.0-cdh4.0.0 for YARN APIs or 2.0.0-mr1-cdh4.0.0 for MRv1 APIs.

Using hadoop-client for Ivy-based Java Projects

Make sure you add the following dependency specification to your ivy.xml file:

  <dependency org="org.apache.hadoop" name="hadoop-client" rev="VERSION"/>

where VERSION is either 2.0.0-cdh4.0.0 for YARN APIs or 2.0.0-mr1-cdh4.0.0 for MRv1 APIs.

Using JAR Files Provided in the hadoop-client Package

Make sure you add to your project all of the JAR files provided under /usr/lib/hadoop/client-0.20 (for MRv1 APIs) or /usr/lib/hadoop/client (for YARN).

For example, you can add this location to the JVM classpath:

$ export CLASSPATH=/usr/lib/hadoop/client-0.20/\*
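
The trailing wildcard is expanded by the Java launcher itself, not by the shell. For a tool that expects every JAR file to be listed explicitly, the sketch below builds the same classpath by hand. It assumes the JAR files sit directly in the package directory; jar_classpath is a hypothetical helper name, not a command provided by the hadoop-client package.

```shell
#!/bin/sh
# Hypothetical helper (not part of CDH): print a colon-separated list of every
# .jar file found directly under the given directory.
jar_classpath() {
    dir=$1
    result=
    for jar in "$dir"/*.jar; do
        # Skip the unexpanded pattern when the directory has no JAR files.
        [ -e "$jar" ] || continue
        result="${result:+$result:}$jar"
    done
    printf '%s\n' "$result"
}

# Example for MRv1; use /usr/lib/hadoop/client instead for YARN.
CLASSPATH=$(jar_classpath /usr/lib/hadoop/client-0.20)
export CLASSPATH
```

If the directory does not exist or contains no JAR files, the function prints an empty line, so it is worth checking that CLASSPATH is non-empty before starting your application.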