This is the documentation for CDH 4.7.0.
Documentation for other versions is available at Cloudera Documentation.

Installing CDH4

This section describes the process for installing CDH4.

Ways To Install CDH4

You can install CDH4 in any of the following ways:

  • Automated method using Cloudera Manager; instructions here. Cloudera Manager automates the installation and configuration of CDH4 on an entire cluster if you have root or password-less sudo SSH access to your cluster's machines.
      Note: Cloudera recommends that you use the automated method if possible.
  • Manual methods described under Installing CDH4:
    • Download and install the CDH4 "1-click Install" package
    • Add the CDH4 repository
    • Build your own CDH4 repository

    If you use a manual method rather than Cloudera Manager, downloading and installing the "1-click Install" package is recommended in most cases because it is simpler than adding or building a repository.

  • Install from a CDH4 tarball — see How Packaging Affects CDH4 Deployment.

How Packaging Affects CDH4 Deployment

Installing from Packages

Installing from a Tarball

  Note: The instructions in this Installation Guide are tailored for a package installation, as described in the sections that follow, and do not cover installation or deployment from tarballs.
  • If you install CDH4 from a tarball, you will install YARN. Read the discussion of YARN under New Features before you proceed.
  • As of CDH4.3.0, there is no separate tarball for MRv1. Instead, the MRv1 binaries, examples, etc., are delivered in the Hadoop tarball itself. The scripts for running MRv1 are in the bin-mapreduce1 directory in the tarball, and the MRv1 examples are in the examples-mapreduce1 directory.

Before You Begin Installing CDH4 Manually

  Note: Running Services

When starting, stopping, and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d directly, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
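
The effect described above can be illustrated with a minimal sketch. It only approximates the behavior stated in this note (working directory /, only LANG and TERM passed through); the real service implementation differs between distributions. The helper name run_clean and the variable MY_VAR are hypothetical and are not part of CDH.

```shell
#!/bin/sh
# Approximation only: run a command from / with the environment reduced
# to LANG and TERM, similar to what this note says service does.
# run_clean and MY_VAR are illustrative names, not part of CDH.
run_clean() {
    ( cd / && env -i LANG="${LANG:-C}" TERM="${TERM:-dumb}" "$@" )
}

# A variable exported in the administrator's shell...
MY_VAR=leaky
export MY_VAR

# ...is visible to a directly executed script, but not under run_clean.
/bin/sh -c 'echo "direct:  cwd=$(pwd) MY_VAR=${MY_VAR:-unset}"'
run_clean /bin/sh -c 'echo "cleaned: cwd=$(pwd) MY_VAR=${MY_VAR:-unset}"'
```

In day-to-day use you would not need such a helper; you would run, for example, sudo service hadoop-hdfs-namenode start, assuming the init script carries the same name as the package that installed it.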

  Important:
  • Java Development Kit: if you have not already done so, install the Oracle Java Development Kit (JDK); see Java Development Kit Installation.
  • Scheduler defaults: note the following differences between MRv1 and MRv2 (YARN).
    • MRv1:
      • Cloudera Manager sets the default to FIFO.
      • CDH 4 sets the default to FIFO, with FIFO, Fair Scheduler, and Capacity Scheduler on the classpath by default.
    • MRv2 (YARN):
      • Cloudera Manager sets the default to Fair Scheduler.
      • CDH 4 sets the default to Fair Scheduler, with FIFO and Fair Scheduler on the classpath by default.
      • YARN does not support Capacity Scheduler.

Steps for Installing CDH4 Manually

Step 1: Add or Build the CDH4 Repository or Download the "1-click Install" package.

  • If you are installing CDH4 on a Red Hat system, you can download Cloudera packages using yum or your web browser.
  • If you are installing CDH4 on a SLES system, you can download the Cloudera packages using zypper or YaST or your web browser.
  • If you are installing CDH4 on an Ubuntu or Debian system, you can download the Cloudera packages using apt or your web browser.

On Red Hat-compatible Systems

Use one of the following methods to add or build the CDH4 repository or download the package on Red Hat-compatible systems:
  Note:

Use only one of the three methods.

Do this on all the systems in the cluster.

To download and install the CDH4 "1-click Install" package:

  1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

    For OS Version                      Click this Link
    Red Hat/CentOS/Oracle 5             Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS 6 (32-bit)           Red Hat/CentOS 6 link (32-bit)
    Red Hat/CentOS/Oracle 6 (64-bit)    Red Hat/CentOS/Oracle 6 link (64-bit)

  2. Install the RPM. For Red Hat/CentOS/Oracle 5:
    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

    For Red Hat/CentOS 6 (32-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
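
If you script this step across many hosts, the choice of RPM file can be derived from the machine architecture. The following is a sketch only: the function name cdh4_oneclick_rpm is hypothetical, it assumes the file names shown above, and it prints the install command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: map a machine architecture (as printed by
# `uname -m`) to the "1-click Install" RPM file name shown above.
cdh4_oneclick_rpm() {
    case "$1" in
        x86_64)
            echo "cloudera-cdh-4-0.x86_64.rpm" ;;
        i386|i486|i586|i686)
            echo "cloudera-cdh-4-0.i386.rpm" ;;
        *)
            echo "unsupported architecture: $1" >&2
            return 1 ;;
    esac
}

# Print (rather than run) the install command for this machine.
if rpm_file=$(cdh4_oneclick_rpm "$(uname -m)"); then
    echo "sudo yum --nogpgcheck localinstall $rpm_file"
fi
```

The RPM must already have been downloaded to the current directory, as described in step 1.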

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To add the CDH4 repository:

Click the entry in the table below that matches your Red Hat or CentOS system, navigate to the repo file for your system and save it in the /etc/yum.repos.d/ directory.

For OS Version                      Click this Link
Red Hat/CentOS/Oracle 5             Red Hat/CentOS/Oracle 5 link
Red Hat/CentOS 6 (32-bit)           Red Hat/CentOS 6 link
Red Hat/CentOS/Oracle 6 (64-bit)    Red Hat/CentOS/Oracle 6 link

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To build a Yum repository:

If you want to create your own yum repository, download the appropriate repo file, create the repo, distribute the repo file and set up a web server, as described under Creating a Local Yum Repository.

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

On SLES Systems

Use one of the following methods to download the CDH4 repository or package on SLES systems:
  Note:

Use only one of the three methods.

To download and install the CDH4 "1-click Install" package:

  1. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
  2. Install the RPM:
    $ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To add the CDH4 repository:

  1. Run the following command:
    $ sudo zypper addrepo -f http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/cloudera-cdh4.repo
  2. Update your system package index by running:
    $ sudo zypper refresh

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To build a SLES repository:

If you want to create your own SLES repository, create a mirror of the CDH SLES directory by following these instructions that explain how to create a SLES repository from the mirror.

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

On Ubuntu or Debian Systems

Use one of the following methods to download the CDH4 repository or package.
  Note:

Use only one of the three methods.

To download and install the CDH4 "1-click Install" package:

  1. Click one of the following: this link for a Squeeze system, or this link for a Lucid system, or this link for a Precise system.
  2. Install the package. Do one of the following:
    • Choose Open with in the download window to use the package manager.
    • Choose Save File, save the package to a directory to which you have write access (it can be your home directory), and install it from the command line, for example:
      sudo dpkg -i cdh4-repository_1.0_all.deb

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To add the CDH4 repository:

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:

  • For Ubuntu systems:
    deb [arch=amd64] http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
    deb-src http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
  • For Debian systems:
    deb http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
    deb-src http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib

where: <OS-release-arch> is debian/squeeze/amd64/cdh, ubuntu/lucid/amd64/cdh, or ubuntu/precise/amd64/cdh, and <RELEASE> is the name of your distribution, which you can find by running lsb_release -c.

For example, to install CDH4 for 64-bit Ubuntu Lucid:

deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib 
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
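
The same two lines can be generated instead of typed by hand. This is a minimal sketch, assuming only the 64-bit (amd64) paths and the release names listed above (squeeze, lucid, precise); the function name cdh4_apt_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the two cloudera.list lines for a given
# distribution (ubuntu or debian) and release code name.
# Assumes the amd64 repository layout described above.
cdh4_apt_lines() {
    distro=$1
    release=$2
    base="http://archive.cloudera.com/cdh4/${distro}/${release}/amd64/cdh"
    case "$distro" in
        ubuntu) echo "deb [arch=amd64] ${base} ${release}-cdh4 contrib" ;;
        debian) echo "deb ${base} ${release}-cdh4 contrib" ;;
        *)
            echo "unsupported distribution: $distro" >&2
            return 1 ;;
    esac
    echo "deb-src ${base} ${release}-cdh4 contrib"
}

# Reproduces the Ubuntu Lucid example.
cdh4_apt_lines ubuntu lucid
```

To write the file directly, the output could be piped to sudo tee /etc/apt/sources.list.d/cloudera.list, with the release name taken from lsb_release -cs.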

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

OR: To build a Debian repository:

If you want to create your own apt repository, create a mirror of the CDH Debian directory and then create an apt repository from the mirror.

Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.

Step 1a: Optionally Add a Repository Key

Before installing MRv1 or YARN: (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:

  • For Red Hat/CentOS/Oracle 5 systems:
    $ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    
  • For Red Hat/CentOS/Oracle 6 systems:
    $ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
    
  • For all SLES systems:
    $ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
    
  • For Ubuntu Lucid systems:
    $ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
  • For Ubuntu Precise systems:
    $ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
  • For Debian Squeeze systems:
    $ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -

This key enables you to verify that you are downloading genuine packages.
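
When the key has to be added on many hosts, the per-platform choice above can be captured in a small script. The sketch below uses only the URLs listed in this step; the function names and the platform labels (redhat5, redhat6, sles, lucid, precise, squeeze) are hypothetical, and the command is printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the Cloudera public key URL for a platform
# label, using only the URLs listed in this step.
cdh4_key_url() {
    base="http://archive.cloudera.com/cdh4"
    case "$1" in
        redhat5) echo "$base/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera" ;;
        redhat6) echo "$base/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera" ;;
        sles)    echo "$base/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera" ;;
        lucid)   echo "$base/ubuntu/lucid/amd64/cdh/archive.key" ;;
        precise) echo "$base/ubuntu/precise/amd64/cdh/archive.key" ;;
        squeeze) echo "$base/debian/squeeze/amd64/cdh/archive.key" ;;
        *)
            echo "unknown platform: $1" >&2
            return 1 ;;
    esac
}

# Print the matching import command: rpm for RPM-based systems,
# curl piped to apt-key for Ubuntu and Debian.
cdh4_key_command() {
    url=$(cdh4_key_url "$1") || return 1
    case "$1" in
        redhat5|redhat6|sles) echo "sudo rpm --import $url" ;;
        *)                    echo "curl -s $url | sudo apt-key add -" ;;
    esac
}

cdh4_key_command precise
```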

Step 2: Install CDH4 with MRv1

  Note:

Skip this step and go to Step 3 if you intend to use only YARN.

  Important:

Before proceeding, you need to decide:

  1. Whether to configure High Availability (HA) for the NameNode and/or JobTracker; see the CDH4 High Availability Guide for more information and instructions.
  2. Where to deploy the NameNode, Secondary NameNode, and JobTracker daemons. As a general rule:
    • The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode or TaskTracker services.
    • In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode.
    • Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.

If you configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode. After completing the software configuration for your chosen HA method, follow the installation instructions under HDFS High Availability Initial Deployment.

  1. Install and deploy ZooKeeper.
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows.

    Where to install, with the install commands for each OS:

    • JobTracker host running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-jobtracker
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-jobtracker

    • NameNode host running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    • Secondary NameNode host (if used) running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    • All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

    • All client hosts running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-client
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-client
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-client
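
For scripted installs, the mapping from host role to MRv1 packages listed in this step can be expressed as a lookup. This is a sketch under stated assumptions: the role labels (jobtracker, namenode, secondarynamenode, worker, client) and the function name mrv1_packages are hypothetical, and the yum command is printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: print the MRv1 packages for a host role,
# following the package lists in this step.
mrv1_packages() {
    case "$1" in
        jobtracker)        echo "hadoop-0.20-mapreduce-jobtracker" ;;
        namenode)          echo "hadoop-hdfs-namenode" ;;
        secondarynamenode) echo "hadoop-hdfs-secondarynamenode" ;;
        worker)            echo "hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode" ;;
        client)            echo "hadoop-client" ;;
        *)
            echo "unknown role: $1" >&2
            return 1 ;;
    esac
}

# Example: a small cluster whose master host runs both the JobTracker
# and the NameNode. Collect the packages for both roles.
pkgs=""
for role in jobtracker namenode; do
    pkgs="$pkgs $(mrv1_packages "$role")"
done
echo "sudo yum clean all; sudo yum install$pkgs"
```

On SLES or Ubuntu/Debian hosts, substitute the zypper or apt-get command forms shown in this step.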

Step 3: Install CDH4 with YARN

  Note:

Skip this step if you intend to use only MRv1. Directions for installing MRv1 are in Step 2.

To install CDH4 with YARN:

  1. Install and deploy ZooKeeper.
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows.

    Where to install, with the install commands for each OS:

    • Resource Manager host (analogous to MRv1 JobTracker) running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-yarn-resourcemanager
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-yarn-resourcemanager

    • NameNode host running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    • Secondary NameNode host (if used) running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    • All cluster hosts except the Resource Manager running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

    • One host in the cluster running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver

    • All client hosts running:
      • Red Hat/CentOS compatible:
        sudo yum clean all; sudo yum install hadoop-client
      • SLES:
        sudo zypper clean --all; sudo zypper install hadoop-client
      • Ubuntu or Debian:
        sudo apt-get update; sudo apt-get install hadoop-client

  Note:

The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.
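
Because every YARN install command in this step follows the same pattern (clean or refresh the package cache, then install), the command line can be assembled from a package manager name and a package list. The sketch below assumes the package names given in this step; the role labels and the function names yarn_packages and install_command are hypothetical, and commands are printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the YARN packages for a host role,
# following the package lists in this step.
yarn_packages() {
    case "$1" in
        resourcemanager)   echo "hadoop-yarn-resourcemanager" ;;
        namenode)          echo "hadoop-hdfs-namenode" ;;
        secondarynamenode) echo "hadoop-hdfs-secondarynamenode" ;;
        worker)            echo "hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce" ;;
        history)           echo "hadoop-mapreduce-historyserver hadoop-yarn-proxyserver" ;;
        client)            echo "hadoop-client" ;;
        *)
            echo "unknown role: $1" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: print the full install command for a package
# manager (yum, zypper, or apt-get) and one or more packages.
install_command() {
    manager=$1
    shift
    case "$manager" in
        yum)     echo "sudo yum clean all; sudo yum install $*" ;;
        zypper)  echo "sudo zypper clean --all; sudo zypper install $*" ;;
        apt-get) echo "sudo apt-get update; sudo apt-get install $*" ;;
        *)
            echo "unknown package manager: $manager" >&2
            return 1 ;;
    esac
}

# Word splitting of the package list is intentional here.
install_command zypper $(yarn_packages worker)
```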

Step 4: (Optional) Install LZO

If you decide to install LZO (Lempel–Ziv–Oberhumer compression), proceed as follows.

  1. Add the repository on each host in the cluster. Follow the instructions for your OS version:
    For each OS version, do the following:
    • Red Hat/CentOS/Oracle 5: Navigate to this link and save the file in the /etc/yum.repos.d/ directory.
    • Red Hat/CentOS 6: Navigate to this link and save the file in the /etc/yum.repos.d/ directory.
    • SLES:
      1. Run the following command:
         $ sudo zypper addrepo -f http://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo
      2. Update your system package index by running:
         $ sudo zypper refresh
    • Ubuntu or Debian: Navigate to this link and save the file as /etc/apt/sources.list.d/gplextras.list.
      Important: Make sure you do not let the file name default to cloudera.list, as that will overwrite your existing cloudera.list.
  2. Install the package on each host as follows:
    • Red Hat/CentOS compatible:
      sudo yum install hadoop-lzo-cdh4
    • SLES:
      sudo zypper install hadoop-lzo-cdh4
    • Ubuntu or Debian:
      sudo apt-get update; sudo apt-get install hadoop-lzo-cdh4
  3. Continue with installing and deploying CDH. As part of the deployment, you will need to do some additional configuration for LZO, as shown under Configuring LZO.
      Important: Make sure you do this configuration after you have copied the default configuration files to a custom location and set alternatives to point to it.

Step 5: Deploy CDH and Install Components

Now proceed with:

  Note:
To see what files and directories a given package installs on your system, and the permissions, run one of the following commands:
  • For RPMs:
    rpm -v -ql -p <package>.rpm
  • For Ubuntu/Debian packages:
    dpkg -c <package>.deb
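
If you inspect packages of both kinds, the choice of command can be made from the file extension. This is a small sketch; the function name list_package_command is hypothetical, and it prints the command to run without running it. It uses dpkg -c, which lists the contents of a .deb archive.

```shell
#!/bin/sh
# Hypothetical helper: print the command that lists the files (with
# permissions) contained in a package file, chosen by file extension.
list_package_command() {
    case "$1" in
        *.rpm) echo "rpm -v -ql -p $1" ;;
        *.deb) echo "dpkg -c $1" ;;
        *)
            echo "not an .rpm or .deb file: $1" >&2
            return 1 ;;
    esac
}

list_package_command cdh4-repository_1.0_all.deb
```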