Hadoop Configuration and Applications (Part 1)


Lab 1 – Single-Node Hadoop Setup

  • Environment

    Single host: 192.168.2.131
    OS: CentOS 7.2
    Hadoop: hadoop-2.6.0-cdh5.7.0

  • Preparation

  1. Set the hostname

hostnamectl set-hostname master
(on CentOS 6, edit /etc/sysconfig/network instead)

  2. Configure hosts

vi /etc/hosts

Add:

192.168.2.131 master
  3. Set up passwordless SSH
ssh-keygen -t rsa
cd .ssh/
touch authorized_keys
cat id_rsa.pub >>authorized_keys
Test:
ssh master
  4. Install and configure the JDK

Extract the JDK and add it to the environment variables:

tar -xvf jdk-8u171-linux-x64.tar -C /usr/local/

Add it to the environment variables:

vi ~/.bash_profile

Add:

# JDK configuration
export JAVA_HOME=/usr/local/jdk1.8.0_171
export PATH=$PATH:$JAVA_HOME/bin

Reload the environment variables:

source ~/.bash_profile

Verify:

java -version

If you see output like the following, it worked:

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

Configure pseudo-distributed mode

  1. Extract Hadoop and add it to the environment variables

tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/

Add it to the environment variables:

vi ~/.bash_profile

Add:

# Hadoop configuration
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh5.7.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Reload the environment variables:

source ~/.bash_profile

  2. Configure core-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi core-site.xml

Add:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hdfs/tmp</value>
</property>

Create the Hadoop data directory:

cd
mkdir -p hdfs/tmp

This setting defaults to /tmp, but /tmp is where Linux keeps temporary files and is not suitable for storing data.

  3. Configure hdfs-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi hdfs-site.xml

Add:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
  4. Configure hadoop-env.sh
cd $HADOOP_HOME/etc/hadoop/
vi hadoop-env.sh

Modify:

export JAVA_HOME=/usr/local/jdk1.8.0_171

  5. Edit the slaves file

cd $HADOOP_HOME/etc/hadoop/
vi slaves

Remove localhost and add:

master

  6. Before starting HDFS for the first time, format it
cd $HADOOP_HOME
bin/hdfs namenode -format

Start HDFS:

sbin/start-dfs.sh
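
To confirm the daemons came up, a quick check with jps (shipped with the JDK):

jps
# expect on this single node: NameNode, DataNode, SecondaryNameNode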

  7. Configure mapred-site.xml
cd $HADOOP_HOME/etc/hadoop/
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

Add:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
  8. Configure yarn-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi yarn-site.xml

Add:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
  9. Start YARN
cd $HADOOP_HOME/
sbin/start-yarn.sh

Verification

  1. Verify HDFS

Note:
1. Check that the firewall allows the relevant ports.
2. The HDFS service port is 8020, but the web UI listens on 50070.
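
A minimal smoke test from the shell (the /test path is just an example):

hadoop fs -mkdir -p /test
hadoop fs -ls /
# the web UI is at http://master:50070/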

  2. Verify YARN
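
With YARN running, jps should additionally show ResourceManager and NodeManager; the ResourceManager web UI listens on port 8088 by default in Hadoop 2.x:

jps
# expect, in addition: ResourceManager, NodeManager
# then browse http://master:8088/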

Lab 2 – Multi-Node Hadoop HA Cluster

Hadoop HA cluster, suitable for small-scale cluster environments.
Requirements:
OS: CentOS 7.2 minimal install
Hadoop: hadoop-2.6.0-cdh5.7.0

Host IP        Hostname  Role in the cluster
192.168.2.131  master    NameNode (active)
192.168.2.132  standby   NameNode (standby)
192.168.2.133  slave1    DataNode, JournalNode, ZooKeeper
192.168.2.134  slave2    DataNode, JournalNode, ZooKeeper
192.168.2.135  slave3    DataNode, JournalNode, ZooKeeper

Prepare the HA prerequisites

  1. Install and configure the JDK
  • The lab VMs already have Oracle Java installed and configured:

echo $JAVA_HOME

  2. Extract the hadoop and zookeeper archives to /usr/local on the corresponding hosts

tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/
tar -zxvf zookeeper-3.4.5-cdh5.7.0.tar.gz -C /usr/local/

Add Hadoop to the environment variables:

vi /etc/profile
# Hadoop configuration
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh5.7.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Reload the environment variables:

source /etc/profile

  3. Edit the hosts file

vi /etc/hosts

Add:

192.168.2.131 master
192.168.2.132 standby
192.168.2.133 slave1
192.168.2.134 slave2
192.168.2.135 slave3
  4. Disable the firewall
systemctl stop firewalld
systemctl disable firewalld

Configure the ZooKeeper cluster

  1. Edit the configuration file
cd /usr/local/zookeeper-3.4.5-cdh5.7.0/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg

Add:

dataDir=/root/zookeeper/data
dataLogDir=/root/zookeeper/logs
# the port at which the clients will connect
clientPort=2181
server.1=slave1:2888:3888
server.2=slave2:2888:3888
server.3=slave3:2888:3888
  2. Create the directories ZooKeeper needs
cd
mkdir zookeeper
cd zookeeper
mkdir data logs

  3. Configure myid

cd data
vi myid

Add:

1

Set myid on each slave host: 1 on slave1, 2 on slave2, and 3 on slave3.
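
If passwordless SSH to the slaves is already in place, a sketch that sets all three in one go (it assumes the data directory already exists on each host):

for i in 1 2 3; do ssh slave$i "echo $i > /root/zookeeper/data/myid"; done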

  4. Start the ZooKeeper cluster
  • Note: make sure the firewall allows the relevant ports.
    If the cluster fails to start, check the following output file for the cause:

cat bin/zookeeper.out

Start ZooKeeper on every server:

cd /usr/local/zookeeper-3.4.5-cdh5.7.0/bin/
./zkServer.sh start
  5. Check the result

./zkServer.sh status

You should see one leader and two followers.
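
An alternative check uses ZooKeeper's four-letter-word commands (this assumes nc is installed):

echo srvr | nc slave1 2181 | grep Mode   # prints "Mode: leader" or "Mode: follower"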

Configure HDFS HA

  1. Set up passwordless SSH
ssh-keygen -t rsa
cd .ssh/
touch authorized_keys
cat id_rsa.pub >> authorized_keys

In the same way, copy the public key generated on each of the other hosts into authorized_keys, one key per line;
then copy the merged authorized_keys file to every other machine, e.g.:

scp authorized_keys root@standby:/root/.ssh/   # repeat for slave1, slave2, slave3

Test:

ssh slave1

  2. Configure the Hadoop HDFS cluster
  • Note: configure on master first.
    Create the directories used to store HDFS data:
cd
mkdir hdfs
cd hdfs
mkdir tmp journal

Go to the Hadoop configuration directory:

cd /usr/local/hadoop-2.6.0-cdh5.7.0/
cd etc/hadoop

Configure core-site.xml

Add (the HDFS cluster itself is defined in hdfs-site.xml):

<property>
<name>fs.defaultFS</name>
<value>hdfs://cluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/hdfs/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>
<!-- configure the ZooKeeper cluster -->
<property>
<name>ha.zookeeper.quorum</name>
<value>slave1:2181,slave2:2181,slave3:2181</value>
</property>
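
A quick sanity check that the new configuration is picked up (hdfs getconf is part of the standard CLI):

hdfs getconf -confKey fs.defaultFS   # should print hdfs://cluster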

Configure hdfs-site.xml

Add:

<!-- HDFS cluster information -->
<property>
<name>dfs.nameservices</name>
<value>cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.cluster</name>
<value>node1,node2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.node1</name>
<value>master:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.node1</name>
<value>master:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.node2</name>
<value>standby:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.node2</name>
<value>standby:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://slave1:8485;slave2:8485;slave3:8485/cluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/root/hdfs/journal</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
<!-- basic HDFS settings -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

Configure mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>

Configure yarn-site.xml

<!-- configure the YARN cluster -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yrc</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>standby</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>slave1:2181,slave2:2181,slave3:2181</value>
</property>
<!-- enable automatic recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- store the ResourceManager state in the ZooKeeper cluster -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Configure hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.8.0_171

Configure yarn-env.sh

export JAVA_HOME=/usr/local/jdk1.8.0_171

Configure the slaves file

slave1
slave2
slave3
  3. Sync the configuration files to the other machines

scp -r /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop root@slave2:/usr/local/hadoop-2.6.0-cdh5.7.0/etc/

Sync the data directories as well (a loop covering every host follows below):

scp -r hdfs/ root@slave1:/root/
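
A sketch that pushes both the configuration and the data directories to every other node in one loop, using the hostnames defined in /etc/hosts:

for h in standby slave1 slave2 slave3; do
  scp -r /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop root@$h:/usr/local/hadoop-2.6.0-cdh5.7.0/etc/
  scp -r /root/hdfs root@$h:/root/
done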

Start the cluster and verify

  1. Start the Hadoop cluster
  • Note: run these on master first.

Start the JournalNodes:

sbin/hadoop-daemons.sh start journalnode

Run jps on each slave; you should see a JournalNode process.
Next, format the ZooKeeper failover state:

bin/hdfs zkfc -formatZK

Then format HDFS:

bin/hdfs namenode -format

Start the NameNode on master:

sbin/hadoop-daemon.sh start namenode

Then, on standby, run:

bin/hdfs namenode -bootstrapStandby

Now the whole cluster can be started. Back on master, run:

sbin/start-dfs.sh

  2. View the HDFS cluster information in a browser (http://master:50070/ and http://standby:50070/).
    Then stop the active NameNode and check whether ZooKeeper triggers an automatic failover:

sbin/hadoop-daemon.sh stop namenode
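
The state of each NameNode can also be checked from the command line; node1 and node2 are the IDs defined in dfs.ha.namenodes.cluster:

bin/hdfs haadmin -getServiceState node1
bin/hdfs haadmin -getServiceState node2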

  3. Start the YARN cluster.
    On master, run:

sbin/start-yarn.sh

Then, on standby, run:

sbin/yarn-daemon.sh start resourcemanager

You can now check the cluster state:

bin/yarn rmadmin -getServiceState rm1
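
Run the same for rm2; one ResourceManager should report active and the other standby:

bin/yarn rmadmin -getServiceState rm2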

Lab 3 – Deploying a Big Data Cluster with Ambari

Hadoop HA cluster, suitable for large-scale production environments.
Requirements:
OS: CentOS 7.2 minimal install
Hadoop: HDP-2.3
Ambari: ambari-2.1
Ambari has memory requirements for its hosts: ambari-server generally needs 8 GB or more and ambari-agent about 4 GB. Disks should not be too small either; aim for at least 50 GB.

Host IP        Hostname  Role in Ambari
192.168.2.131  master    ambari-server, ambari-agent
192.168.2.132  slave1    ambari-agent

Preparation

  1. Configure the hostnames
# master
hostnamectl set-hostname master
hostname

# slave1
hostnamectl set-hostname slave1
hostname
  2. Edit the hosts file
# master & slave1
vi /etc/hosts
Add:
192.168.2.131 master.hadoop master
192.168.2.132 slave1.hadoop slave1
  3. Configure the yum repositories
# master & slave1 
cd /etc/yum.repos.d/
rm -vf *
vi ambari.repo
Add:
[centos7]
baseurl=http://192.168.2.100/centos/
gpgcheck=0
enabled=1
name=centos
[ambari]
name=ambari
baseurl=http://192.168.2.100/ambari/centos7/2.x/updates/2.1.0
enabled=1
gpgcheck=0
  4. Configure NTP
# master
yum -y install ntp
vi /etc/ntp.conf
Comment out or delete the following four lines:
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
Add the following two lines:
server 127.127.1.0
fudge 127.127.1.0 stratum 10
Save and exit.

systemctl enable ntpd
systemctl start ntpd

# slave1
yum -y install ntpdate
ntpdate master
systemctl enable ntpdate
  5. Configure SSH
# master & slave1 
yum install openssh-clients
ssh-keygen -t rsa
ssh-copy-id master
ssh-copy-id slave1
  6. Disable Transparent Huge Pages

THP is a system-level memory-optimization feature. It directly affects a program's memory-access performance, and because the mechanism is transparent to applications and cannot be controlled at the application level, it can cause random performance drops for programs that are specifically tuned for large pages. Many database applications require it to be disabled.

# master & slave1 
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

These settings are lost after a reboot and must be applied again.
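
One common way to make this persistent on CentOS 7 (a sketch; rc.local must be executable for systemd to run it at boot):

cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local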

  7. Install the JDK
    The VMs provided for this lab already have the Oracle JDK installed and configured.

Configure ambari-server

  1. Install the MariaDB database
# master
yum install mariadb mariadb-server mysql-connector-java
Start the service:
systemctl enable mariadb
systemctl start mariadb

Configure MySQL:
mysql_secure_installation
Press Enter to confirm, then set the database root password; here we use "bigdata".
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y

Create the ambari database:
mysql -uroot -pbigdata
MariaDB [(none)]> create database ambari;
MariaDB [(none)]> grant all privileges on ambari.* to 'ambari'@'localhost' identified by 'bigdata';
MariaDB [(none)]> grant all privileges on ambari.* to 'ambari'@'%' identified by 'bigdata';
MariaDB [(none)]> use ambari;
MariaDB [ambari]> source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
MariaDB [ambari]> quit
  2. Install and configure ambari-server
# master
yum -y install ambari-server   # install the server package first
vi /etc/profile
export buildNumber=2.3.0.0
source /etc/profile

ambari-server setup
WARNING: SELinux is set to 'permissive' mode and temporarily disabled.
OK to continue [y/n] (y)?
Customize user account for ambari-server daemon [y/n] (n)? n
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Oracle JDK 1.7 + Java Cryptography Extension (JCE) Policy Files 7
[3] Custom JDK
====================================================================
Enter choice (1): 3
Path to JAVA_HOME: /usr/jdk64/jdk1.8.0_77
Validating JDK on Ambari Server...done.
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y
Configuring database...
====================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
====================================================================
Enter choice (1): 3
Hostname (localhost):
Port (3306):
Database name (ambari):
Username (ambari):
Enter Database Password (bigdata):
Proceed with configuring remote database connection properties [y/n] (y)?
Ambari Server 'setup' completed successfully.

ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar

Start the ambari-server service:
ambari-server start

Log in at http://192.168.2.131:8080/
with the default username/password admin/admin.

Configure ambari-agent

# master & slave1
yum -y install ambari-agent
vi /etc/ambari-agent/conf/ambari-agent.ini
Modify:
[server]
hostname=master

Restart the service:
ambari-agent restart

Deploy and manage the big data cluster

Log in at http://{ambari server IP address}:8080/ with username/password admin/admin. From there you can launch the install wizard, create a cluster, and install services.
The main change needed is to point the HDP yum repositories at the ones provided in the lab environment:

http://192.168.2.100/HDP/centos7/2.x/updates/2.3.0.0/
http://192.168.2.100/HDP-UTILS-1.1.0.20/repos/centos7/

Lab 4 – Basic HDFS Commands

Starting and stopping HDFS

  1. Start
cd $HADOOP_HOME
sbin/start-dfs.sh
sbin/start-yarn.sh
  2. Stop
sbin/stop-yarn.sh
sbin/stop-dfs.sh
  • Note:
  1. The commands below use $HADOOP_HOME/bin/hadoop.
  2. These operations can also be performed through the web UI.

ls - list directories on HDFS

hadoop fs -ls / # the HDFS directory to list
hadoop fs -ls -R / # recursive listing

mkdir - create directories on HDFS

hadoop fs -mkdir /a # the parent directory must already exist
hadoop fs -mkdir -p /a/b # create parent directories as needed

put - upload local files to HDFS

hadoop fs -put /path/local/file /path/on/hdfs # multiple local sources are separated by spaces; create the target directory first with -mkdir -p

get - download files from HDFS to the local filesystem

hadoop fs -get /path/on/hdfs/file /path/local/file # multiple sources are separated by spaces

text - display the contents of a file on HDFS (typically a text file)

hadoop fs -text /path/on/hdfs/file

rm - delete files or directories on HDFS

hadoop fs -rm /path/on/hdfs/file # delete a file
hadoop fs -rm -r /path/on/hdfs # delete a directory

Running a MapReduce job

  • Note:
  1. Start YARN before submitting a MapReduce job.
  2. This exercise uses the official example jar:
    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar
sbin/start-yarn.sh

hadoop fs -mkdir -p /input/wc
hadoop fs -put /path/for/data.txt /input/wc
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input/wc/data.txt /output/wc/
hadoop fs -text /output/wc/part-r-0000* # adjust part-r-0000* to the actual file name
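
To list the job output (the _SUCCESS marker indicates the job completed):

hadoop fs -ls /output/wc/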
