
Hadoop configurations

Once we have installed a Hadoop cluster, we need to set up its configuration before it can work properly.
One of the common tasks when using Hadoop is interacting with its runtime: whether it is a local setup or a remote cluster, we need to properly configure and bootstrap Hadoop in order to submit jobs.

1. Configuration files

All the Hadoop configuration files are located in $HADOOP_HOME/etc/hadoop. Hadoop also provides a number of environment variables; e.g. $HADOOP_CONF_DIR points to the etc/hadoop directory.

There are mainly four files: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

config file      parameters             content
core-site.xml    global settings        port used by the Hadoop instance, tmp dir, memory allocated for the file system, memory limit for storing data, and size of read/write buffers
hdfs-site.xml    HDFS parameters        replication factor, namenode path, and datanode paths on the local file system, i.e. where we want to store the Hadoop infrastructure
mapred-site.xml  MapReduce parameters   two parts: JobHistory Server and application parameters, e.g. number of reduce tasks, buffer sizes, etc.
yarn-site.xml    resource parameters    ports used by the ResourceManager, NodeManager, web monitor, etc.

2. Configuration settings

Hadoop configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or a Path: if named by a String, the classpath is examined for a file with that name; if named by a Path, the local file system is searched directly.
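A minimal sketch of how a client might add resources programmatically, assuming the Hadoop client library is on the classpath (the file path below is illustrative, not a fixed Hadoop location):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class LoadConf {
    public static void main(String[] args) {
        // Constructing a Configuration loads core-default.xml and
        // core-site.xml from the classpath automatically.
        Configuration conf = new Configuration();
        // A String resource is looked up on the classpath ...
        conf.addResource("hdfs-site.xml");
        // ... while a Path resource is read directly from the local file system.
        conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/yarn-site.xml"));
        System.out.println(conf.get("fs.defaultFS"));
    }
}

A typical resource file looks like this: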

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>the default file system URI (host and port) used by the Hadoop instance</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<final>true</final>
<description>IO stream buffer size: 131072 bytes (128KB); final means later resources cannot override it</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hduser/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>

The root element of a Hadoop configuration file is configuration; it contains property sub-elements, and each property is one config option.
Each config option contains a name, a value and an optional description. The name is a string; the value can be a boolean, int, long, float, string, file path, or comma-separated array.
The configuration above sets up the host/port, the IO stream buffer size, and the temporary directory for a Hadoop cluster.
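Values are read back through typed getters; here is a short sketch, where the second argument of each getter is an illustrative fallback used when the key is absent:

import org.apache.hadoop.conf.Configuration;

public class ReadConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);   // falls back to 4096
        String fsUri   = conf.get("fs.defaultFS", "file:///");       // falls back to the local FS
        long blockSize = conf.getLong("dfs.blocksize", 134217728L);  // falls back to 128MB
        System.out.println(fsUri + ", buffer=" + bufferSize + ", block=" + blockSize);
    }
}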

3. Core parameters

Here is a list of the most common parameters in the four config files.

core-site.xml

parameter name       default value             description
fs.defaultFS         file:///                  host and port of the default file system used by the Hadoop instance
io.file.buffer.size  4096                      buffer size of IO streams
hadoop.tmp.dir       /tmp/hadoop-${user.name}  base for temporary directories

hdfs-site.xml

parameter name         default value                      description
dfs.replication        3                                  number of block replicas
dfs.namenode.name.dir  file://${hadoop.tmp.dir}/dfs/name  namenode directory on the local file system
dfs.datanode.data.dir  file://${hadoop.tmp.dir}/dfs/data  datanode directories on the local file system
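Note that the HDFS defaults reference ${hadoop.tmp.dir}: Configuration expands such ${...} variables when a key is read. A small sketch, assuming an hdfs-site.xml on the classpath defines dfs.namenode.name.dir:

import org.apache.hadoop.conf.Configuration;

public class ExpandVars {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        // ${hadoop.tmp.dir} inside the stored value is substituted on read,
        // e.g. file:///tmp/hadoop-<user>/dfs/name under the defaults.
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}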

mapred-site.xml

parameter name                       default value  description
mapreduce.framework.name             local          one of local, classic or yarn; yarn is recommended
mapreduce.job.tracker                               host:port of the jobtracker
mapreduce.jobhistory.address         0.0.0.0:10020  host:port of the JobHistory server
mapreduce.jobhistory.webapp.address  0.0.0.0:19888  host:port of the JobHistory web UI
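For example, a client can choose the execution framework per job; a sketch (the job name here is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PickFramework {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // "local" would run the job in-process
        Job job = Job.getInstance(conf, "example-job");
        System.out.println(job.getJobName());
    }
}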

yarn-site.xml

parameter name                                 default value  description
yarn.resourcemanager.address                   0.0.0.0:8032   address through which clients submit or kill jobs
yarn.resourcemanager.scheduler.address         0.0.0.0:8030   address ApplicationMasters use to request or release resources
yarn.resourcemanager.resource-tracker.address  0.0.0.0:8031   address NodeManagers use to report heartbeats and receive tasks
yarn.resourcemanager.admin.address             0.0.0.0:8033   address through which administrators send commands
yarn.resourcemanager.webapp.address            0.0.0.0:8088   web UI address; users can check cluster info in a browser
yarn.nodemanager.aux-services                                 custom auxiliary services, e.g. the shuffle service for MapReduce
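Clients pick these addresses up from yarn-site.xml; a minimal sketch using the YarnClient API, assuming a ResourceManager is running:

import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration additionally loads yarn-default.xml and yarn-site.xml.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        System.out.println(yarn.getNodeReports()); // NodeManagers known to the RM
        yarn.stop();
    }
}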

4. Setting up a single-node Hadoop cluster in pseudo-distributed mode

core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
</property>
</configuration>

Note
Note that we must specify hadoop.tmp.dir here: if we don't, Hadoop falls back to the default /tmp/hadoop-${user.name}, and since /tmp is typically cleared when the machine restarts, we would lose the file system metadata and have to re-format HDFS (hdfs namenode -format).

hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8132</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8130</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8131</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8133</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>localhost:8188</value>
</property>
</configuration>
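Once the daemons are started, we can verify the setup programmatically; a small sketch that lists the HDFS root directory using the fs.defaultFS configured above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // matches core-site.xml above
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}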

5. More configurations

The above covers only the most common configurations; to tune a Hadoop cluster further, we need to understand the full set of parameters.
There are several ways to get more detailed configuration info:

(1) Get information from the official site:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

(2) Get information from the cluster itself
Once the cluster is running, we can fetch its effective configuration from http://192.168.75.101:8188/conf, where 192.168.75.101:8188 is the address we set in yarn.resourcemanager.webapp.address.
