
Configuration parameters can be specified in the configuration files mapred-site.xml, hdfs-site.xml, and core-site.xml, or set dynamically per job with the -D option when the job is submitted.
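
For example, a cluster-wide default could be set in mapred-site.xml like this (a minimal sketch; the property value here is only an illustration):

<!-- mapred-site.xml: picked up by every job submitted to the cluster -->
<property>
    <name>mapreduce.job.reduces</name>
    <value>10</value>
</property>

A -D option passed on the command line overrides the value from the configuration files for that job only (unless the property is marked final):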

# Generic options such as -D must come before the streaming-specific
# options (-input, -output, -mapper, ...).
$HADOOP_BIN jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D mapred.map.tasks=1000 \
    -D mapred.job.name="$JOBNAME" \
    -D mapred.job.priority="HIGH" \
    -D mapred.ignore.badcompress=true \
    -D mapred.linerecordreader.maxlength=51200000 \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper "python2.7 $MAPRED -m"

Example:

#!/bin/bash

cwd=$(dirname "$0")
cd "$cwd"

date="${1}"
if [ -z "$date" ]; then
    date=$(date -d "1 day ago" +"%Y%m%d")
else
    echo "Date is: ${date}"
fi

MAPRED='mr_result.py'

PHP_CMD="/usr/bin/php"
HADOOP_HOME="/usr/local/hadoop-2.7.2"
HADOOP_BIN="$HADOOP_HOME/bin/hadoop"
HADOOP_GET="$HADOOP_BIN fs -get "
alarm="/usr/bin/php ${cwd}/../lib/alarm/Alarm_Cli.php"
JOBNAME="shichunhui_${date}@ub-hadoop"
INPUT="./output/part-00000"
OUTPUT="./output_sorted"
TASK_NUM=1


function my_log()
{
    dd=$(date +"%Y-%m-%d %H:%M:%S")
    echo "[${dd}] | ${1}" >> hdp.log
}

$HADOOP_BIN fs -rm -r -f ${OUTPUT}

my_log "Hadoop Begin: ${date}"

# hadoop
$HADOOP_BIN jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -D mapreduce.job.name="$JOBNAME" \
    -D mapreduce.job.priority="VERY_HIGH" \
    -D mapreduce.ignore.badcompress=true \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D stream.num.reduce.output.key.fields=2 \
    -D mapreduce.partition.keycomparator.options="-k1,1nr" \
    -input $INPUT \
    -output "$OUTPUT" \
    -mapper "python $MAPRED -m" \
    -reducer "python $MAPRED -r" \
    -file "$MAPRED"
    # Options that can be enabled if needed:
    # -D stream.num.map.output.key.fields=2 \
    # -D mapreduce.input.linerecordreader.line.maxlength=51200000 \

if [ $? -ne 0 ]; then
    my_log "Hadoop Error: Run hadoop error"
    exit 1
else
    if [ -f "/home/hadoop/data/output/sorted.txt" ]; then
        rm -f "/home/hadoop/data/output/sorted.txt"
    fi
    ${HADOOP_GET} "${OUTPUT}/part-00000" "/home/hadoop/data/output/sorted.txt"
fi

exit 0
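
The comparator options passed to KeyFieldBasedComparator above follow the same syntax as the Unix sort command, so the intended ordering can be previewed locally before the job is submitted. A minimal sketch with made-up tab-separated mapper output:

printf "3\ta\n10\tb\n2\tc\n" | sort -t$'\t' -k1,1nr
# 10    b
# 3     a
# 2     c
# -k1,1nr means: sort by field 1 only, numerically (n), in reverse (r) --
# the same meaning the options carry for KeyFieldBasedComparator.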

Explanation

(1) -input: the job's input path on HDFS
(2) -output: the job's output path on HDFS (it must not already exist)
(3) -mapper: the mapper application or command
(4) -reducer: the reducer application or command
(5) -file: files that need to be shipped with the job; they can be the mapper/reducer scripts themselves, config files, or other executables.
This is usually required because these files live on the local file system and must be distributed to the cluster so that each task can execute them.
(6) -partitioner: a user-defined partitioner class (see the sketch after this list)
(7) -D: job properties, given as key=value pairs
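
Besides a fully custom class, Hadoop ships a ready-made field-based partitioner that is often combined with the comparator used earlier. A hedged sketch (the two-field key layout is an assumption; cat mappers/reducers just pass data through):

$HADOOP_BIN jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
# Keys have two fields; all records sharing the first field go to the
# same reducer, while the full two-field key is still used for sorting.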

common properties:

  • mapred.map.tasks: the number of map tasks.
    This is only a hint: if the input is split into M parts and this property is set to a value larger than M, the actual number of map tasks is still M.
  • mapred.reduce.tasks: the number of reduce tasks, defaults to 1.
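
Setting mapred.reduce.tasks to 0 turns a streaming job into a map-only job: there is no shuffle/sort phase and each map task writes its output directly to the output directory. A minimal sketch reusing the variables from the script above:

$HADOOP_BIN jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
    -D mapred.job.name="$JOBNAME" \
    -D mapred.reduce.tasks=0 \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper "python $MAPRED -m" \
    -file "$MAPRED"
# No -reducer is given; with zero reduce tasks the mapper output
# becomes the job output as-is.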