1. Environment preparation
- JDK 1.8 / Maven 3.6.0 / Hadoop in pseudo-distributed mode
- Install the required build libraries:

```shell
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
```
2. Install lzo
2.1. Download and extract lzo
Download the lzo source tarball.
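The concrete download command did not survive here; a sketch of a typical sequence, assuming lzo 2.06 (matching the `lzo-2.06` directory used in the next step) fetched from the upstream project site — verify the URL before use:

```shell
# Assumption: lzo 2.06, matching the lzo-2.06 build directory below.
# The URL points at the upstream project's download area; check it first.
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
```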
2.2. Build and install

```shell
[hadoop@hadoop000 lzo-2.06]$ pwd
```
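Only the `pwd` prompt survived here; the build itself is presumably the standard autotools flow. A sketch, where the `--prefix` location is an assumption — adjust it to your layout:

```shell
# Standard autotools build of lzo; the --prefix value is an assumption.
./configure --enable-shared --prefix=/usr/local/lzo-2.06
make
# Install as root (or via sudo) so the libraries land under the prefix.
sudo make install
```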
3. Install hadoop-lzo
3.1. Download and extract
Download the hadoop-lzo source.
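The download command was elided. hadoop-lzo is commonly built from the Twitter repository on GitHub; a sketch, assuming the master branch (which matches the `hadoop-lzo-master` directory used in the build step below):

```shell
# Assumption: building from the twitter/hadoop-lzo master branch,
# which unpacks to the hadoop-lzo-master directory used below.
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
unzip master.zip
cd hadoop-lzo-master
```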
3.2. Modify pom.xml and add configuration
Edit pom.xml to match your Hadoop version.
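The concrete pom.xml change was elided. The twitter/hadoop-lzo pom defines a `hadoop.current.version` property, and pointing it at the cluster's Hadoop version is the usual edit; the CDH version below is an assumption inferred from the examples jar used later (CDH artifacts also require Cloudera's Maven repository to be configured). You may additionally need `C_INCLUDE_PATH`/`LIBRARY_PATH` exported to your lzo install location so the native build can find it.

```xml
<!-- pom.xml: align the Hadoop version with the cluster (assumed value) -->
<properties>
  <hadoop.current.version>2.6.0-cdh5.15.1</hadoop.current.version>
</properties>
```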
3.3. Build

```shell
[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
```
4. Update the Hadoop configuration and restart
4.1. Edit the configuration files

```shell
[hadoop@hadoop000 hadoop]$ pwd
```
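The configuration edits themselves were elided. The standard hadoop-lzo setup registers the codecs in core-site.xml with the stock property names shown below; you typically also copy the built hadoop-lzo jar (and the native libraries) into Hadoop's lib directories, depending on your layout.

```xml
<!-- core-site.xml: register the lzo codecs (standard hadoop-lzo setup) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```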
4.2. Restart Hadoop

```shell
[hadoop@hadoop000 data]$ stop-dfs.sh
```
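Only the stop command survived; "restart" presumably means stopping and starting both HDFS and YARN (YARN is clearly running, given the ResourceManager log lines later), so the new config and jars are picked up:

```shell
# Assumption: pseudo-distributed cluster with both HDFS and YARN running.
stop-dfs.sh && stop-yarn.sh
start-dfs.sh && start-yarn.sh
```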
5. Test lzo compression
Before compressing, install lzop as root:

```shell
yum install lzop
```

Prepare the test data:

```shell
[hadoop@hadoop000 data]$ split -b 350m access.log
```

Compress the data:

```shell
[hadoop@hadoop000 data]$ lzop access_testlzo.log
[hadoop@hadoop000 data]$ ll -lh access_testlzo.log.lzo
-rw-rw-r--. 1 hadoop hadoop 160M Sep 30 13:11 access_testlzo.log.lzo
```

Run the Hadoop wordcount example.
Run wordcount and check the job log:

```shell
[hadoop@hadoop000 mapreduce]$ pwd
/home/hadoop/app/hadoop/share/hadoop/mapreduce
[hadoop@hadoop000 mapreduce]$ hadoop jar \
hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar wordcount \
/data/access_testlzo.log.lzo \
/out
19/09/30 13:25:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/30 13:25:18 INFO input.FileInputFormat: Total input paths to process : 1
19/09/30 13:25:18 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/09/30 13:25:18 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 5dbdddb8cfb544e58b4e0b9664b9d1b66657faf5]
19/09/30 13:25:18 INFO mapreduce.JobSubmitter: number of splits:1
19/09/30 13:25:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569814006008_0001
19/09/30 13:25:19 INFO impl.YarnClientImpl: Submitted application application_1569814006008_0001
19/09/30 13:25:19 INFO mapreduce.Job: The url to track the job: http://hadoop000:8088/proxy/application_1569814006008_0001/
```

`number of splits:1` shows that Hadoop does not split lzo files by default; fixing this requires adding an index and setting the input format.
⚠️ The two steps to make lzo splittable:

```shell
[hadoop@hadoop000 lib]$ pwd
/home/hadoop/app/hadoop/share/hadoop/mapreduce/lib
```

Build the index:

```shell
[hadoop@hadoop000 lib]$ hadoop jar \
hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/access_testlzo.log.lzo
[hadoop@hadoop000 lib]$ hdfs dfs -ls /data
Found 3 items
-rw-r--r-- 1 hadoop supergroup 167664788 2019-09-30 13:17 /data/access_testlzo.log.lzo
-rw-r--r-- 1 hadoop supergroup 11200 2019-09-30 13:31 /data/access_testlzo.log.lzo.index
```
Run wordcount again, setting the lzo-aware input format:

```shell
[hadoop@hadoop000 mapreduce]$ hadoop jar \
hadoop-mapreduce-examples-2.6.0-cdh5.15.1.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/data/access_testlzo.log.lzo \
/out
19/09/30 13:34:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/09/30 13:34:52 INFO input.FileInputFormat: Total input paths to process : 1
19/09/30 13:34:52 INFO mapreduce.JobSubmitter: number of splits:2
19/09/30 13:34:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1569814006008_0003
19/09/30 13:34:53 INFO impl.YarnClientImpl: Submitted application application_1569814006008_0003
19/09/30 13:34:53 INFO mapreduce.Job: The url to track the job: http://hadoop000:8088/proxy/application_1569814006008_0003/
```

`number of splits:2` shows that the input is now split in two.
- Author: cll
- Link: https://keeponcoding.github.io/2018/09/30/hadoop集成lzo压缩测试/
- Copyright: Unless otherwise stated, all articles on this blog are licensed under the Apache License 2.0. Please credit the source when reposting!