Hadoop 2.7.4 distributed cluster installation configuration file

Cluster environment

  • Hadoop version is 2.7.4
  • JDK version 1.8.0_
  • Three virtual machines are installed and their names and IP addresses are set as follows
    Host name   IP address
    master      192.168.1.15
    slave01     192.168.1.16
    slave02     192.168.1.17
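
For these hostnames to resolve on every node, each machine's /etc/hosts should map them to the addresses above (assuming static addressing and no DNS, which matches this setup):

```
192.168.1.15 master
192.168.1.16 slave01
192.168.1.17 slave02
```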

  • Hadoop is installed on each server with the following directory structure:
    /home/<user name>/hadoop
    software: stores the downloaded installation packages
    app: stores the installation directories of all software
    Hadoop 2.7.4 lives in the app directory; the user on all of my machines is named null
  • The main Hadoop configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml
    You can browse the detailed default configurations on the official website at the following links:
    core-default.xml
    hdfs-default.xml
    mapred-default.xml
    yarn-default.xml
    You can also find these defaults by downloading and unpacking Hadoop, then searching the directory for *-default.xml

Install Hadoop

Download Hadoop and extract it to the app directory

tar -zxvf hadoop-2.7.4.tar.gz -C ~/hadoop/app

Configure Hadoop environment variables

Open /etc/profile with vim and append the following:

# Hadoop Env
export HADOOP_HOME=/home/null/hadoop/app/hadoop-2.7.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
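
Then reload the profile with source /etc/profile (or log in again). A quick sanity check that the variables are in place, written with $HOME instead of the literal /home/null so it works for any user:

```shell
# same two lines as in /etc/profile, using $HOME for portability
export HADOOP_HOME=$HOME/hadoop/app/hadoop-2.7.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# the Hadoop bin directory should now be on PATH
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "PATH updated"
```

This prints PATH updated when the variables are set correctly; once the tarball is unpacked, hadoop version should also resolve.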

Modify the configuration files under $HADOOP_HOME/etc/hadoop

$HADOOP_HOME is the Hadoop installation directory
Only the minimal configuration required for a distributed cluster is listed here; for more personalized configuration, please refer to the official documentation

Modify hadoop-env.sh file

# Modify JAVA_HOME to point to the JDK installation path
export JAVA_HOME=/home/null/hadoop/app/jdk1.8.0_144

Modify the yarn-env.sh file

# Modify JAVA_HOME to point to the JDK installation path
export JAVA_HOME=/home/null/hadoop/app/jdk1.8.0_144

Modify the slaves file

Here the master serves as both a NameNode and a DataNode

master  
slave01  
slave02
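
The same file can be written from the shell; run this from $HADOOP_HOME/etc/hadoop (a convenience sketch, equivalent to editing the file by hand):

```shell
# every host listed in slaves runs a DataNode; master is included on purpose
cat > slaves <<'EOF'
master
slave01
slave02
EOF
cat slaves   # verify the contents
```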

Modify the core-site.xml file

This file can override some of the default key configurations used to control the Hadoop core

    parameter        default                    explanation
    fs.defaultFS     file:///                   NameNode RPC endpoint
    fs.default.name  file:///                   Deprecated; replaced by fs.defaultFS
    hadoop.tmp.dir   /tmp/hadoop-${user.name}   Base for other temporary directories

First, create a tmp folder manually under $HADOOP_HOME and then point hadoop.tmp.dir at it. hadoop.tmp.dir is the base setting the Hadoop file system depends on, and many other paths derive from it; in particular, the default storage locations of the NameNode and DataNode in hdfs-site.xml are built on this address. On Linux, the contents of /tmp are cleared when the system restarts, so this must be moved to a persistent location.
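
For example (the install tree in this guide lives under the home directory, so ~ expands to /home/null here):

```shell
# hadoop.tmp.dir must survive reboots, unlike /tmp; create it inside the
# install tree before pointing the configuration at it
mkdir -p ~/hadoop/app/tmp
```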

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/null/hadoop/app/tmp</value>
    </property>
</configuration>

Modify the hdfs-site.xml file

This configuration file allows you to modify the default configuration of HDFS

    parameter                             default                             explanation
    dfs.replication                       3                                   Number of replicas kept for each file block
    dfs.namenode.secondary.http-address   0.0.0.0:50090                       Secondary NameNode service address and port
    dfs.namenode.name.dir                 file://${hadoop.tmp.dir}/dfs/name   Where the NameNode stores its fsimage on the local file system; a comma-separated list of directories is replicated to all of them for redundancy
    dfs.datanode.data.dir                 file://${hadoop.tmp.dir}/dfs/data   Where the DataNode stores its blocks on the local file system; directories that do not exist are created if permissions allow
<configuration>
    <property>    
        <name>dfs.namenode.secondary.http-address</name>    
        <value>master:50090</value>    
    </property> 
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

The number of replicas cannot exceed the number of DataNodes
hadoop.tmp.dir was already configured in core-site.xml, so the name and data directories keep their defaults here
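
If you later want the NameNode metadata and DataNode blocks stored outside hadoop.tmp.dir, the same file accepts explicit locations. The paths below are hypothetical examples, not part of this setup:

```xml
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/dfs/data</value>
</property>
```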


Modify mapred-site.xml file

The properties in this file can override the default property values used to control the execution of MapReduce tasks

    parameter                             default         explanation
    mapreduce.framework.name              local           Execution framework for MapReduce jobs
    mapreduce.jobhistory.address          0.0.0.0:10020   MapReduce history server IPC address
    mapreduce.jobhistory.webapp.address   0.0.0.0:19888   MapReduce history server web UI address
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

By default, the Hadoop history server is not started. We can start the Hadoop history server through the following command

sbin/mr-jobhistory-daemon.sh start historyserver
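
The history server addresses default to 0.0.0.0, as the table above shows; if you want it reachable by hostname, you can pin them in mapred-site.xml (shown here with the master host; adjust as needed):

```xml
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
</property>
```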

Modify the yarn-site.xml file

The configuration items in this file can override the default property values used to control the yarn component

    parameter                                 default                                  explanation
    yarn.nodemanager.aux-services             (none)                                   Auxiliary services run on the NodeManager; must be set to mapreduce_shuffle to run MapReduce
    yarn.resourcemanager.hostname             0.0.0.0                                  Hostname of the ResourceManager
    yarn.resourcemanager.address              ${yarn.resourcemanager.hostname}:8032    Address the ResourceManager exposes to clients; clients use it to submit applications, kill applications, and so on
    yarn.resourcemanager.scheduler.address    ${yarn.resourcemanager.hostname}:8030    Address the ResourceManager exposes to ApplicationMasters, which use it to request and release resources
    yarn.resourcemanager.webapp.address       ${yarn.resourcemanager.hostname}:8088    ResourceManager web UI address; users can view cluster information in a browser through it
    yarn.nodemanager.resource.memory-mb       8192                                     Total physical memory available to the NodeManager; once set it cannot be changed dynamically while the service runs. Note the default is 8192 MB: even if the machine has less than that, YARN will still schedule as if that memory were available
    yarn.nodemanager.resource.cpu-vcores      8                                        Total virtual CPUs available to the NodeManager
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>1</value>
     </property>
</configuration>

The virtual machines I set up here have 1 GB of memory and a single-core CPU. Before the last two properties were set, the NodeManager failed to start with an insufficient-memory error
Please refer to my other blog post for related issues


Start Hadoop cluster

Format file system

Execute in master

hdfs namenode -format

Start namenode and datanode

Execute under $HADOOP_HOME on master

sbin/start-dfs.sh

Use the jps command to view the processes on master; they should be as follows

DataNode
SecondaryNameNode
NameNode
Jps

Use the jps command to view the processes on slave01 and slave02; on each they should be as follows

Jps
DataNode

Start ResourceManager and nodemanager

Execute under $HADOOP_HOME on master

sbin/start-yarn.sh

Use the jps command to view the processes on master; they should be as follows

DataNode
NodeManager
ResourceManager
SecondaryNameNode
NameNode
Jps

Use the jps command to view the processes on slave01 and slave02; on each they should be as follows

Jps
NodeManager
DataNode

Done! The Hadoop cluster has started successfully
