Hadoop framework: building a pseudo-distributed cluster on a single server

Source code of this article: GitHub · click here || Gitee · click here

1、 Basic environment

1. Environment version

Environment: CentOS 7
Hadoop version: 2.7.2
JDK version: 1.8

2. Hadoop directory structure

  • bin directory: scripts for operating Hadoop's HDFS and YARN services
  • etc directory: Hadoop's configuration files
  • lib directory: Hadoop's native libraries, which provide data compression and decompression
  • sbin directory: scripts for starting and stopping Hadoop services
  • share directory: Hadoop's dependency JARs, documentation, and examples
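
For reference, listing a freshly extracted Hadoop 2.7.2 home directory typically shows the entries below (exact files may vary slightly by distribution):

[root@localhost hadoop2.7]# ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share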

3. Configuration loading

vim /etc/profile
#Add the environment variables
export JAVA_HOME=/opt/jdk1.8
export HADOOP_HOME=/opt/hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

#Save, exit, and reload the configuration
source /etc/profile
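
To confirm that the variables took effect, both commands below should print version information (assuming the JDK and Hadoop were unpacked to the paths above):

[root@localhost ~]# java -version
[root@localhost ~]# hadoop version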

2、 Pseudo-cluster configuration

The configuration files below are located under /opt/hadoop2.7/etc/hadoop. This is a Linux environment; the environment scripts use the .sh format and the site configurations are XML files.

1. Configure hadoop-env.sh

root# vim hadoop-env.sh
#Before modification
export JAVA_HOME=
#After modification
export JAVA_HOME=/opt/jdk1.8

2. Configure core-site.xml

Overview of the file structure

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>

Address of the NameNode:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
</property>

Data storage directory: where files generated at Hadoop runtime are stored.

<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop2.7/data/tmp</value>
</property>
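
Putting the pieces together, the complete core-site.xml for this setup looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://127.0.0.1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop2.7/data/tmp</value>
    </property>
</configuration>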

3. Configure hdfs-site.xml

The file skeleton is the same as above. Configure the number of HDFS replicas; for a pseudo-distributed environment, a single replica is enough.

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

4. Configure yarn-env.sh

export JAVA_HOME=/opt/jdk1.8

5. Configure yarn-site.xml

Specify the hostname of YARN's ResourceManager:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.72.132</value>
</property>

Specify shuffle as the auxiliary service that passes intermediate map output to the reduce phase:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

6. Configure mapred-env.sh

export JAVA_HOME=/opt/jdk1.8

7. Configure mapred-site.xml

Rename mapred-site.xml.template to mapred-site.xml.
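
The template ships with the distribution, so the rename is a single command:

[hadoop2.7]# mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml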

Specify YARN as the framework on which MapReduce programs run and get their resources. If this is not set to yarn, MapReduce jobs will only run locally rather than on the cluster.

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

3、 Environment Startup Test

1. Test the file system

HDFS

Format the NameNode

Do this only before the first start.

[hadoop2.7]# bin/hdfs namenode -format

Formatting the NameNode generates a new clusterID. If the DataNode still holds the old one, the cluster IDs of the NameNode and DataNode no longer match and the cluster cannot find its past data. Therefore, before reformatting a NameNode, you must stop the related processes and delete the data and log directories first. The clusterID is stored in the VERSION file under the directories below; you can compare them yourself.

/opt/hadoop2.7/data/tmp/dfs/name/current
/opt/hadoop2.7/data/tmp/dfs/data/current
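
To compare the two IDs, print both VERSION files and check that the clusterID lines match:

[hadoop2.7]# cat data/tmp/dfs/name/current/VERSION
[hadoop2.7]# cat data/tmp/dfs/data/current/VERSION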

Start the NameNode

[hadoop2.7]# sbin/hadoop-daemon.sh start namenode

Start the DataNode

[hadoop2.7]# sbin/hadoop-daemon.sh start datanode

Check status with jps

[root@localhost hadoop2.7]# jps
2450 Jps
2276 NameNode
2379 DataNode

View via the web interface

The Linux firewall and the related security enhancement (SELinux) must be turned off, otherwise the page will not be reachable (this is important).
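
On CentOS 7 this can be done as follows (a sketch; relax only what your security policy allows):

# Stop the firewall now and keep it off across reboots
systemctl stop firewalld
systemctl disable firewalld
# Put SELinux into permissive mode for the current session;
# set SELINUX=disabled in /etc/selinux/config to make it permanent
setenforce 0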

Visit http://<server-ip>:50070, for example http://192.168.72.132:50070.


YARN

Start the ResourceManager

[hadoop2.7]# sbin/yarn-daemon.sh start resourcemanager

Start the NodeManager

[hadoop2.7]# sbin/yarn-daemon.sh start nodemanager
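
Running jps again should now also list the two YARN daemons (PIDs will differ):

[root@localhost hadoop2.7]# jps
#Expected: NameNode, DataNode, ResourceManager, NodeManager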

View via the web interface

Visit http://<server-ip>:8088/cluster, for example http://192.168.72.132:8088/cluster.


MapReduce

File operation test

Create a test file directory and a test file

[root@localhost ~]# mkdir -p /opt/inputfile && cd /opt/inputfile
[root@localhost inputfile]# echo "hello word hadoop" > word.txt
[root@localhost inputfile]# pwd
/opt/inputfile

Create a directory on the HDFS file system

[hadoop2.7]# bin/hdfs dfs -mkdir -p /opt/upfile/input

Upload the file

[hadoop2.7]# bin/hdfs dfs -put /opt/inputfile/word.txt /opt/upfile/input

List the file

[hadoop2.7]# bin/hdfs dfs -ls /opt/upfile/input

2. View files on the web


Run the word count analysis

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /opt/upfile/input /opt/upfile/output

View analysis results

bin/hdfs dfs -cat /opt/upfile/output/*

Result: each word appears once.
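
Given the word.txt created above, the output should look roughly like this (wordcount prints each word and its count, tab-separated and sorted by key):

hadoop	1
hello	1
word	1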

Delete analysis results

bin/hdfs dfs -rm -r /opt/upfile/output

4、 History server

MapReduce's JobHistoryServer is an independent service that displays the logs of historical jobs through a web UI.

1. Modify mapred-site.xml

<!-- JobHistory server address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>192.168.72.132:10020</value>
</property>

<!-- JobHistory web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>192.168.72.132:19888</value>
</property>

2. Start the service

[hadoop2.7]# sbin/mr-jobhistory-daemon.sh start historyserver
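
jps should now additionally show a JobHistoryServer process:

[root@localhost hadoop2.7]# jps
#Expected: JobHistoryServer listed alongside the other daemons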

3. Web view

Visit http://<server-ip>:19888, for example http://192.168.72.132:19888.


4. Configure log aggregation

Log aggregation concept: after an application finishes running, its logs are uploaded to HDFS. This makes it easy to inspect the details of a run, which helps development and debugging.

After the log aggregation function is enabled, you need to restart the NodeManager, ResourceManager, and JobHistoryServer.

Stop the above services

[hadoop2.7]# sbin/yarn-daemon.sh stop resourcemanager
[hadoop2.7]# sbin/yarn-daemon.sh stop nodemanager
[hadoop2.7]# sbin/mr-jobhistory-daemon.sh stop historyserver

Modify yarn-site.xml

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Retain logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>

After the modification, start the services above again and rerun the file analysis task.
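
The restart mirrors the stop sequence above:

[hadoop2.7]# sbin/yarn-daemon.sh start resourcemanager
[hadoop2.7]# sbin/yarn-daemon.sh start nodemanager
[hadoop2.7]# sbin/mr-jobhistory-daemon.sh start historyserver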

View the logs on the web UI.

5、 Source code address

GitHub · address
https://github.com/cicadasmile/big-data-parent
Gitee · address
https://gitee.com/cicadasmile/big-data-parent

