
Hadoop configurations

Once we have installed a Hadoop cluster, we need to set up its configuration before it can work properly.
One of the common tasks when using Hadoop is interacting with its runtime: whether it is a local setup or a remote cluster, we need to properly configure and bootstrap Hadoop in order to submit jobs.

1. Configuration files

All the Hadoop configuration files are located in $HADOOP_HOME/etc/hadoop. Hadoop also provides a number of environment variables; e.g. $HADOOP_CONF_DIR points to the etc/hadoop directory.

There are mainly four files: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

config file      parameters             content
core-site.xml    global settings        port used by the Hadoop instance, tmp dir, memory allocated for the file system, memory limit for storing data, and size of read/write buffers
hdfs-site.xml    HDFS parameters        replication factor, namenode path, and datanode paths on the local file system, i.e. where we want to store the Hadoop infrastructure
mapred-site.xml  MapReduce parameters   two parts: JobHistory Server and application parameters, e.g. number of reduce tasks, buffer sizes, etc.
yarn-site.xml    resource parameters    ports used by the ResourceManager, NodeManager, web monitor, etc.

2. Configuration settings

Hadoop configurations are specified by resources. A resource contains a set of name/value pairs as XML data. Each resource is named by either a String or a Path: if named by a String, the classpath is examined for a file with that name; if named by a Path, the local file system is searched directly.
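A minimal sketch of how a client might add resources programmatically, assuming the Hadoop client library is on the classpath (the file path below is illustrative, not a fixed Hadoop location):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class LoadConf {
    public static void main(String[] args) {
        // Constructing a Configuration loads core-default.xml and
        // core-site.xml from the classpath automatically.
        Configuration conf = new Configuration();
        // A String resource is looked up on the classpath ...
        conf.addResource("hdfs-site.xml");
        // ... while a Path resource is read directly from the local file system.
        conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/yarn-site.xml"));
        System.out.println(conf.get("fs.defaultFS"));
    }
}

A typical resource file looks like this: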

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>the default file system URI (host and port) used by the Hadoop instance</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<final>true</final>
<description>IO stream buffer size: 131072 bytes (128KB); final means later resources cannot override it</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hduser/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>

The root element of a Hadoop configuration file is configuration; it contains property sub-elements, and each property is one config option.
Each config option contains a name, a value and an optional description. The name is a string; the value can be a boolean, int, long, float, string, file path, or comma-separated array.
The configuration above sets up the host/port, the IO stream buffer size, and the temporary directory for a Hadoop cluster.
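Values are read back through typed getters; here is a short sketch, where the second argument of each getter is an illustrative fallback used when the key is absent:

import org.apache.hadoop.conf.Configuration;

public class ReadConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);   // falls back to 4096
        String fsUri   = conf.get("fs.defaultFS", "file:///");       // falls back to the local FS
        long blockSize = conf.getLong("dfs.blocksize", 134217728L);  // falls back to 128MB
        System.out.println(fsUri + ", buffer=" + bufferSize + ", block=" + blockSize);
    }
}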

3. Core parameters

Here is a list of the most common parameters in the four config files.

core-site.xml

parameter name       default value             description
fs.defaultFS         file:///                  host and port of the default file system used by the Hadoop instance
io.file.buffer.size  4096                      buffer size of IO streams
hadoop.tmp.dir       /tmp/hadoop-${user.name}  base for temporary directories

hdfs-site.xml

parameter name         default value                      description
dfs.replication        3                                  number of block replicas
dfs.namenode.name.dir  file://${hadoop.tmp.dir}/dfs/name  namenode directory on the local file system
dfs.datanode.data.dir  file://${hadoop.tmp.dir}/dfs/data  datanode directories on the local file system
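Note that the HDFS defaults reference ${hadoop.tmp.dir}: Configuration expands such ${...} variables when a key is read. A small sketch, assuming an hdfs-site.xml on the classpath defines dfs.namenode.name.dir:

import org.apache.hadoop.conf.Configuration;

public class ExpandVars {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        // ${hadoop.tmp.dir} inside the stored value is substituted on read,
        // e.g. file:///tmp/hadoop-<user>/dfs/name under the defaults.
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}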

mapred-site.xml

parameter name                       default value  description
mapreduce.framework.name             local          one of local, classic or yarn; yarn is recommended
mapreduce.job.tracker                               host:port of the jobtracker
mapreduce.jobhistory.address         0.0.0.0:10020  host:port of the JobHistory server
mapreduce.jobhistory.webapp.address  0.0.0.0:19888  host:port of the JobHistory web UI
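For example, a client can choose the execution framework per job; a sketch (the job name here is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PickFramework {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn"); // "local" would run the job in-process
        Job job = Job.getInstance(conf, "example-job");
        System.out.println(job.getJobName());
    }
}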

yarn-site.xml

parameter name                                 default value  description
yarn.resourcemanager.address                   0.0.0.0:8032   address through which clients submit or kill jobs
yarn.resourcemanager.scheduler.address         0.0.0.0:8030   address ApplicationMasters use to request or release resources
yarn.resourcemanager.resource-tracker.address  0.0.0.0:8031   address NodeManagers use to report heartbeats and receive tasks
yarn.resourcemanager.admin.address             0.0.0.0:8033   address through which administrators send commands
yarn.resourcemanager.webapp.address            0.0.0.0:8088   web UI address; users can check cluster info in a browser
yarn.nodemanager.aux-services                                 custom auxiliary services, e.g. the shuffle service for MapReduce
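Clients pick these addresses up from yarn-site.xml; a minimal sketch using the YarnClient API, assuming a ResourceManager is running:

import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration additionally loads yarn-default.xml and yarn-site.xml.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        System.out.println(yarn.getNodeReports()); // NodeManagers known to the RM
        yarn.stop();
    }
}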

4. Setting up a single-node Hadoop cluster in pseudo-distributed mode

core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
</property>
</configuration>

Note
Note that we must specify hadoop.tmp.dir here: if we don't, Hadoop falls back to the default /tmp/hadoop-${user.name}, and since /tmp is typically cleared when the machine restarts, we would lose the file system metadata and have to re-format HDFS (hdfs namenode -format).

hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8132</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8130</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8131</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8133</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>localhost:8188</value>
</property>
</configuration>
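Once the daemons are started, we can verify the setup programmatically; a small sketch that lists the HDFS root directory using the fs.defaultFS configured above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // matches core-site.xml above
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}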

5. More configurations

The above covers only the most common configurations; to tune a Hadoop cluster further, we need to understand the full set of parameters.
There are several ways to get more detailed configuration info:

(1) Get information from the official site:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

(2) Get information from the cluster itself
Once the cluster is running, we can fetch its effective configuration from http://192.168.75.101:8188/conf, where 192.168.75.101:8188 is the address we set in yarn.resourcemanager.webapp.address.
