Hadoop Configuration and Applications — Lab 1: Single-Node Hadoop Setup
Environment
Single host: 192.168.2.131; OS: CentOS 7.2; Hadoop: hadoop-2.6.0-cdh5.7.0
Preparation
Set the hostname
hostnamectl set-hostname master
(On CentOS 6, edit /etc/sysconfig/network instead: vim /etc/sysconfig/network)
Configure hosts
vi /etc/hosts
Add:
192.168.2.131 master
Configure passwordless SSH
ssh-keygen -t rsa
cd .ssh/
touch authorized_keys
cat id_rsa.pub >> authorized_keys
Test:
ssh master
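If ssh master still prompts for a password, file permissions are the usual culprit; a common fix (assuming the default ~/.ssh location) is:
chmod 700 ~/.ssh                   # the .ssh directory must not be group/world accessible
chmod 600 ~/.ssh/authorized_keys   # restrict the key file to the owner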
Install and configure the JDK
Unpack the JDK and add it to the environment variables
tar -xvf jdk-8u171-linux-x64.tar -C /usr/local/
Add it to the environment variables:
vi ~/.bash_profile
Add:
# Configure the JDK
export JAVA_HOME=/usr/local/jdk1.8.0_171
export PATH=$PATH:$JAVA_HOME/bin
Apply the changes:
source ~/.bash_profile
Verify that it took effect: java -version
Output like the following indicates success:
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
Configure pseudo-distributed mode
Unpack Hadoop and add it to the environment variables
tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/
Add it to the environment variables:
vi ~/.bash_profile
Add:
# Configure Hadoop
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh5.7.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply the changes:
source ~/.bash_profile
Configure core-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi core-site.xml
Add:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:8020</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/root/hdfs/tmp</value>
</property>
Create the Hadoop data directory
By default hadoop.tmp.dir points to /tmp, but /tmp is where Linux keeps temporary files, so it is not a suitable place to store data.
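A minimal way to create the directory that hadoop.tmp.dir points to:
mkdir -p /root/hdfs/tmp   # matches the hadoop.tmp.dir value configured above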
Configure hdfs-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi hdfs-site.xml
Add:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Configure hadoop-env.sh
cd $HADOOP_HOME/etc/hadoop/
vi hadoop-env.sh
Change:
export JAVA_HOME=/usr/local/jdk1.8.0_171
Edit the slaves file
cd $HADOOP_HOME/etc/hadoop/
vi slaves
Delete localhost and add:
master
Before starting HDFS for the first time, format it:
cd $HADOOP_HOME
bin/hdfs namenode -format
Start HDFS:
sbin/start-dfs.sh
Configure mapred-site.xml
cd $HADOOP_HOME/etc/hadoop/
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
Add:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Configure yarn-site.xml
cd $HADOOP_HOME/etc/hadoop/
vi yarn-site.xml
Add:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Start YARN:
cd $HADOOP_HOME/
sbin/start-yarn.sh
Verification
Verify HDFS
Note: 1. make sure the firewall allows the relevant ports; 2. the HDFS service port is 8020, but the web UI listens on port 50070.
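A quick sanity check (a sketch; /test is just an example path):
jps                      # should list NameNode, DataNode and SecondaryNameNode
hadoop fs -mkdir /test   # create a test directory on HDFS
hadoop fs -ls /          # it should appear in the listing
You can also browse to http://master:50070 to inspect the NameNode status page.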
Verify YARN
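Similarly for YARN (the ResourceManager web UI listens on port 8088 by default):
jps   # should now also list ResourceManager and NodeManager
Then open http://master:8088 in a browser to see the cluster overview.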
Lab 2: Multi-Host Hadoop HA Cluster
A Hadoop HA cluster suited to small deployments. Requirements: OS CentOS 7.2 minimal install; Hadoop hadoop-2.6.0-cdh5.7.0.
Host IP         Hostname   Role in the cluster
192.168.2.131   master     NameNode (active)
192.168.2.132   standby    NameNode (standby)
192.168.2.133   slave1     DataNode, JournalNode, ZooKeeper
192.168.2.134   slave2     DataNode, JournalNode, ZooKeeper
192.168.2.135   slave3     DataNode, JournalNode, ZooKeeper
Prepare the HA prerequisites
Install and configure the JDK
The lab VMs already have Oracle Java installed and configured:
echo $JAVA_HOME
Unpack the Hadoop and ZooKeeper archives into /usr/local on the relevant hosts:
tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/
tar -zxvf zookeeper-3.4.5-cdh5.7.0.tar.gz -C /usr/local/
Add Hadoop to the environment variables:
vi /etc/profile
# Configure Hadoop
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh5.7.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Apply the changes:
source /etc/profile
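A quick way to confirm the PATH change took effect:
hadoop version   # should report Hadoop 2.6.0-cdh5.7.0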
Edit the hosts file
vi /etc/hosts
Add:
192.168.2.131 master
192.168.2.132 standby
192.168.2.133 slave1
192.168.2.134 slave2
192.168.2.135 slave3
Disable the firewall
systemctl stop firewalld
systemctl disable firewalld
Configure the ZooKeeper cluster
Edit the configuration file
cd /usr/local/zookeeper-3.4.5-cdh5.7.0/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
Add:
dataDir=/root/zookeeper/data
dataLogDir=/root/zookeeper/logs
# the port at which the clients will connect
clientPort=2181
server.1=slave1:2888:3888
server.2=slave2:2888:3888
server.3=slave3:2888:3888
Create the directories ZooKeeper needs
cd
mkdir zookeeper
cd zookeeper
mkdir data logs
Configure myid
Create a myid file in the ZooKeeper data directory and add:
1
Set myid on each of the slave hosts: 1 on slave1, 2 on slave2, and 3 on slave3.
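One way to write the files (myid must live in the dataDir configured in zoo.cfg; the paths here match the setup above):
echo 1 > /root/zookeeper/data/myid   # on slave1
echo 2 > /root/zookeeper/data/myid   # on slave2
echo 3 > /root/zookeeper/data/myid   # on slave3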
Start the ZooKeeper cluster
Note: make sure the firewall allows the relevant ports. If the cluster fails to start, check the log output for the cause:
cat bin/zookeeper.out
Start ZooKeeper on every server:
cd /usr/local/zookeeper-3.4.5-cdh5.7.0/bin/
./zkServer.sh start
Check the result:
./zkServer.sh status
You should see one leader and two followers.
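Typical status output on a follower looks roughly like this (abridged; the config path depends on your install):
JMX enabled by default
Using config: /usr/local/zookeeper-3.4.5-cdh5.7.0/bin/../conf/zoo.cfg
Mode: follower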
Configure HDFS HA
Prepare passwordless SSH
ssh-keygen -t rsa
cd .ssh/
touch authorized_keys
cat id_rsa.pub >> authorized_keys
In the same way, append the public key generated on each of the other hosts to authorized_keys, one per line; then copy the merged authorized_keys file to the other machines, for example:
scp authorized_keys root@standby:/root/.ssh/
Repeat for slave1, slave2 and slave3.
Test:
ssh slave1
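To check every host in one go (a small sketch using the hostnames from /etc/hosts):
for h in master standby slave1 slave2 slave3; do ssh $h hostname; done
Each login should print the remote hostname without asking for a password.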
Configure the Hadoop HDFS cluster
Note: configure master first. Create the directories that will hold the HDFS data:
cd
mkdir hdfs
cd hdfs
mkdir tmp journal
Go to the Hadoop home directory to edit the configuration:
cd /usr/local/hadoop-2.6.0-cdh5.7.0/
cd etc/hadoop
Configure core-site.xml
<!-- the detailed HDFS cluster settings are in hdfs-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cluster</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/root/hdfs/tmp</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
</property>
<!-- the ZooKeeper quorum -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>slave1:2181,slave2:2181,slave3:2181</value>
</property>
Configure hdfs-site.xml
<!-- HDFS HA cluster settings -->
<property>
  <name>dfs.nameservices</name>
  <value>cluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.cluster</name>
  <value>node1,node2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster.node1</name>
  <value>master:9000</value>
</property>
<property>
  <name>dfs.namenode.http-address.cluster.node1</name>
  <value>master:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster.node2</name>
  <value>standby:9000</value>
</property>
<property>
  <name>dfs.namenode.http-address.cluster.node2</name>
  <value>standby:50070</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://slave1:8485;slave2:8485;slave3:8485/cluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/root/hdfs/journal</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
<!-- basic HDFS settings -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
Configure mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>master:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>master:19888</value>
</property>
Configure yarn-site.xml
<!-- YARN HA cluster settings -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yrc</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>standby</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>slave1:2181,slave2:2181,slave3:2181</value>
</property>
<!-- enable automatic recovery -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- store the ResourceManager state in the ZooKeeper cluster -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Configure hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_171
Configure yarn-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_171
Configure the slaves file
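The slaves file lists the DataNode hosts, one per line; given the role table above, it would contain:
slave1
slave2
slave3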
Sync the configuration files to the other machines, for example:
scp -r /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop root@slave2:/usr/local/hadoop-2.6.0-cdh5.7.0/etc/
Repeat for the other hosts.
Sync the data directories to the other machines as well:
scp -r hdfs/ root@slave1:/root/
Repeat for the other hosts.
Start the cluster and verify
Start the Hadoop cluster
Start the JournalNodes:
sbin/hadoop-daemons.sh start journalnode
Run jps on each slave; you should see a JournalNode process. Next, format ZooKeeper:
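The format step uses the standard ZKFC command (run from $HADOOP_HOME on master):
bin/hdfs zkfc -formatZK   # initializes the HA state znode in ZooKeeper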
Then format HDFS:
bin/hadoop namenode -format
Start the NameNode on master:
sbin/hadoop-daemon.sh start namenode
Then, on standby, run:
bin/hdfs namenode -bootstrapStandby
Now the whole cluster can be started. Back on master, run:
sbin/start-dfs.sh
Check the HDFS cluster information in a browser. Then stop the active NameNode to see whether ZooKeeper performs automatic failover:
sbin/hadoop-daemon.sh stop namenode
YARN cluster: on master, run:
sbin/start-yarn.sh
Then, on standby, run:
sbin/yarn-daemon.sh start resourcemanager
You can now check the cluster state:
bin/yarn rmadmin -getServiceState rm1
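Checking both ResourceManagers (rm1 and rm2 as configured in yarn-site.xml), one should report active and the other standby:
bin/yarn rmadmin -getServiceState rm1   # e.g. active
bin/yarn rmadmin -getServiceState rm2   # e.g. standby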
Lab 3: Deploying a Big-Data Cluster with Ambari
A Hadoop HA cluster for large production environments. Requirements: OS CentOS 7.2 minimal install; Hadoop HDP-2.3; Ambari ambari-2.1. Ambari has memory requirements for its hosts: the ambari-server host generally needs 8 GB or more, ambari-agent hosts need about 4 GB, and disks should not be too small, preferably no less than 50 GB.
Host IP         Hostname   Ambari role
192.168.2.131   master     ambari-server, ambari-agent
192.168.2.132   slave1     ambari-agent
Preparation
Set the hostnames
# master
hostnamectl set-hostname master
hostname
# slave1
hostnamectl set-hostname slave1
hostname
Edit the hosts file
# master & slave1
vi /etc/hosts
Add:
192.168.2.131 master.hadoop master
192.168.2.132 slave1.hadoop slave1
Modify the yum repositories
# master & slave1
cd /etc/yum.repos.d/
rm -vf *
vi ambari.repo
Add:
[centos7]
baseurl=http://192.168.2.100/centos/
gpgcheck=0
enabled=1
name=centos
[ambari]
name=ambari
baseurl=http://192.168.2.100/ambari/centos7/2.x/updates/2.1.0
enabled=1
gpgcheck=0
Configure NTP
# master
yum -y install ntp
vi /etc/ntp.conf
Comment out or delete the following four lines:
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
Add these two lines:
server 127.127.1.0
fudge 127.127.1.0 stratum 10
Save and exit, then:
systemctl enable ntpd
systemctl start ntpd
# slave1
yum -y install ntpdate
ntpdate master
systemctl enable ntpdate
Configure SSH
# master & slave1
yum install openssh-clients
ssh-keygen -t rsa
ssh-copy-id master
ssh-copy-id slave1
Disable Transparent Huge Pages
THP is a system-level memory-optimization service. It directly affects memory-access performance, and because it is transparent to applications it cannot be controlled at the application level; for programs specifically optimized for large pages it can cause random performance drops. Many database applications therefore require it to be disabled.
# master & slave1
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Verify:
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
These settings do not survive a reboot and must be applied again after each restart.
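One common way to persist them (a sketch; it assumes /etc/rc.d/rc.local is executed at boot, which on CentOS 7 requires the file to be executable):
cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local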
Install the JDK: the VMs provided for this lab already have the Oracle JDK installed and configured.
Configure ambari-server
Install the MariaDB database
# master
yum install mariadb mariadb-server mysql-connector-java
Start the service:
systemctl enable mariadb
systemctl start mariadb
Configure MySQL:
mysql_secure_installation
Press Enter at the first prompt, then set the database root password (here: "bigdata"):
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y
Create the ambari database:
mysql -uroot -pbigdata
MariaDB [(none)]> create database ambari;
MariaDB [(none)]> grant all privileges on ambari.* to 'ambari'@'localhost' identified by 'bigdata';
MariaDB [(none)]> grant all privileges on ambari.* to 'ambari'@'%' identified by 'bigdata';
MariaDB [(none)]> use ambari;
MariaDB [ambari]> source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
MariaDB [ambari]> quit
Install and configure ambari-server
# master
vi /etc/profile
Add:
export buildNumber=2.3.0.0
Run the setup:
ambari-server setup
WARNING: SELinux is set to 'permissive' mode and temporarily disabled.
OK to continue [y/n] (y)?
Customize user account for ambari-server daemon [y/n] (n)? n
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Oracle JDK 1.7 + Java Cryptography Extension (JCE) Policy Files 7
[3] Custom JDK
====================================================================
Enter choice (1): 3
Path to JAVA_HOME: /usr/jdk64/jdk1.8.0_77
Validating JDK on Ambari Server...done.
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y
Configuring database...
====================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
====================================================================
Enter choice (1): 3
Hostname (localhost):
Port (3306):
Database name (ambari):
Username (ambari):
Enter Database Password (bigdata):
Proceed with configuring remote database connection properties [y/n] (y)?
Ambari Server 'setup' completed successfully.
Point Ambari at the MySQL JDBC driver:
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
Start the ambari-server service:
ambari-server start
Log in at http://192.168.2.131:8080/ with the default credentials admin/admin.
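To confirm the server came up cleanly, an optional check:
ambari-server status   # should report the Ambari Server as running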
Configure ambari-agent
# master & slave1
yum -y install ambari-agent
vi /etc/ambari-agent/conf/ambari-agent.ini
Change:
[server]
hostname=master
Restart the service:
ambari-agent restart
Deploy and manage the big-data cluster
Log in at http://{ambari server IP Address}:8080/ with the credentials admin/admin. From there you can launch the install wizard, create a cluster, and install services. The main change needed is to point the HDP yum repositories at the ones provided in the lab environment:
http://192.168.2.100/HDP/centos7/2.x/updates/2.3.0.0/
http://192.168.2.100/HDP-UTILS-1.1.0.20/repos/centos7/
Lab 4: Basic HDFS Commands
Starting and stopping HDFS
Start:
cd $HADOOP_HOME
sbin/start-dfs.sh
sbin/start-yarn.sh
Stop:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
The commands below use $HADOOP_HOME/bin/hadoop.
These operations can also be performed through the web interface.
ls — list a directory on HDFS
hadoop fs -ls /      # the HDFS directory to inspect
hadoop fs -ls -R /   # recursive listing
mkdir — create a directory on HDFS
hadoop fs -mkdir /a        # the parent directory must already exist
hadoop fs -mkdir -p /a/b   # create multi-level directories
put — upload local files to HDFS
hadoop fs -put /path/local/file /path/on/hdfs   # multiple sources are separated by spaces; the HDFS directory is created automatically
get — download files from HDFS to the local filesystem
hadoop fs -get /path/on/hdfs/file /path/local/file   # multiple sources are separated by spaces
text — print the contents of a file on HDFS (usually a text file)
hadoop fs -text /path/on/hdfs/file
rm — delete files or directories on HDFS
hadoop fs -rm /path/on/hdfs/file   # delete a file
hadoop fs -rm -r /path/on/hdfs     # delete a directory
Run a MapReduce job
YARN must be running before a MapReduce job is submitted.
This exercise uses the example program shipped with Hadoop: $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar
sbin/start-yarn.sh
hadoop fs -put /path/for/data.txt /input/wc
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input/wc/data.txt /output/wc/
hadoop fs -text /output/wc/part-r-0000*   # adjust part-r-0000* to the actual output file name
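As a hypothetical illustration, if data.txt contained the single line "hello world hello", the wordcount output would be tab-separated word/count pairs:
hello	2
world	1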