
Single-node Installation

Running Hadoop on Ubuntu (single-node cluster setup)

Before we set up a Hadoop cluster, let us understand the meaning of the following terms:

DataNode:
A DataNode stores data in the Hadoop Distributed File System (HDFS). A functional file system has more than one DataNode, with the data replicated across them.

NameNode:
The NameNode is the centrepiece of an HDFS file system. It keeps the directory of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

JobTracker:
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.

TaskTracker:
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Secondary NameNode:
The Secondary NameNode's whole purpose is to perform periodic checkpoints of the HDFS metadata. It is just a helper node for the NameNode.

System Environment Setup

Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop.

$ sudo addgroup hdgroup
$ sudo adduser --ingroup hdgroup hduser1
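
To confirm that the account exists and belongs to hdgroup, we can check with the standard id command (optional):

$ id hduser1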

Configuring SSH

The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for starting and stopping all the daemons in the cluster. To work seamlessly, SSH needs to be set up to allow password-less login for the Hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair and share it across the cluster.

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser1 user we created earlier.

We have to generate an SSH key for the hduser1 user.

$ su - hduser1
$ ssh-keygen -t rsa -P ""

We then have to enable SSH access to our local machine with this newly created key, which is done with the following command:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
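
If SSH later rejects the key, overly permissive file modes on the .ssh directory are a common culprit. Tightening them as below is a standard fix (not part of the original guide, but safe to apply):

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys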

The final step is to test the SSH setup by connecting to the local machine as the hduser1 user. This step is also needed to save your local machine's host key fingerprint to the hduser1 user's known_hosts file.

$ ssh localhost

If the SSH connection fails, we can try the following (optional):

  • Enable debugging with ssh -vvv localhost and investigate the error in detail.
  • Check the SSH server configuration in /etc/ssh/sshd_config. If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
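
If sshd turns out not to be running at all (common on a fresh Ubuntu installation), installing and starting the OpenSSH server usually resolves the problem. This assumes a Debian-style init setup, matching the reload command above:

$ sudo apt-get install openssh-server
$ sudo /etc/init.d/ssh start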

Format the NameNode

$ hdfs namenode -format
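
Note that formatting assumes HDFS has already been configured. For a single-node setup, core-site.xml typically points fs.defaultFS at localhost and hdfs-site.xml sets the replication factor to 1, following the standard Hadoop 2.x documentation. The sketch below assumes $HADOOP_HOME points at your Hadoop installation:

$ cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

$ cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF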

Start the Hadoop cluster

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
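
Both scripts live under $HADOOP_HOME/sbin, so the commands above assume the Hadoop installation directory is your current directory. To verify that the daemons came up, jps (shipped with the JDK) should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:

$ jps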

Check Hadoop cluster information

$ hdfs dfsadmin -report
$ yarn node -list
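
As a quick smoke test, we can also create a home directory in HDFS for our user and list it. The path below assumes the hduser1 account created earlier:

$ hdfs dfs -mkdir -p /user/hduser1
$ hdfs dfs -ls /user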

Check cluster info through the web UI. The NameNode web interface listens on port 50070 by default in Hadoop 2.x:

http://localhost:50070/
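
The YARN ResourceManager serves its own web UI as well, by default on port 8088:

http://localhost:8088/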

Reference
http://doctuts.readthedocs.io/en/latest/hadoop.html