Chapter 2: Word Count (Hive Introductory Tutorial)
Implementing word count with a MapReduce program written in Java is also fairly simple; the code below implements this "hello world" program of big data: word count. The required pom.xml dependencies are:
<!-- version info -->
<properties>
    <log4j.version>2.5</log4j.version>
    <hadoop.version>2.7.2</hadoop.version>
    <scopeType>provided</scopeType>
</properties>

<dependencies>
    <dependency>
        <groupId>org.ikeguang</groupId>
        <artifactId>common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
        <scope>${scopeType}</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
        <scope>${scopeType}</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
        <scope>${scopeType}</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>${hadoop.version}</version>
        <scope>${scopeType}</scope>
    </dependency>
</dependencies>
XML
Code

1) The WordCountMapper.java program:
package org.ikeguang.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Created by keguang on 2019-12-07.
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // split each input line on spaces and emit (word, 1) for every word
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
Java
2) The WordCountReducer.java program:
package org.ikeguang.hadoop.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Created by keguang on 2019-12-07.
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // sum the 1s emitted by the mappers for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Java
3) The WordCountDriver.java program:
package org.ikeguang.hadoop.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.ikeguang.hadoop.util.HdfsUtil;

/**
 * Created by keguang on 2019-12-07.
 */
public class WordCountDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int ec = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
        System.exit(ec);
    }

    @Override
    public int run(String[] args) throws Exception {
        // use the configuration injected by ToolRunner and pass it to the job
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "wordcount");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // input and output paths
        String inpath = args[0];
        String output_path = args[1];
        FileInputFormat.addInputPath(job, new Path(inpath));

        // delete a pre-existing output directory so the job can be rerun
        if (HdfsUtil.existsFiles(conf, output_path)) {
            HdfsUtil.deleteFolder(conf, output_path);
        }

        // recurse into subdirectories of the input path
        FileInputFormat.setInputDirRecursive(job, true);

        // merge small input files into combined splits
        job.setInputFormatClass(CombineTextInputFormat.class);
        // each map task handles at least 128 MB of input
        CombineTextInputFormat.setMinInputSplitSize(job, 134217728);
        // and at most 256 MB
        CombineTextInputFormat.setMaxInputSplitSize(job, 268435456L);
        // job.setNumReduceTasks(10);

        // output path
        FileOutputFormat.setOutputPath(job, new Path(output_path));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Java
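The driver calls HdfsUtil.existsFiles and HdfsUtil.deleteFolder from the org.ikeguang common module, whose source is not shown in this chapter. Below is a minimal sketch of what those two helpers would need to look like, based on the standard Hadoop FileSystem API; only the method names come from the driver above, the bodies are an assumption.

package org.ikeguang.hadoop.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class HdfsUtil {

    // return true if the path already exists on HDFS
    public static boolean existsFiles(Configuration conf, String path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        return fs.exists(new Path(path));
    }

    // delete the path recursively, so the job can rewrite its output directory
    public static void deleteFolder(Configuration conf, String path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path(path), true);
    }
}
Java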
統(tǒng)計英文的單詞數(shù),啟動程序的命令是:
hadoop jar hadoop-1.0-SNAPSHOT.jar org.ikeguang.hadoop.mapreduce.wordcount.WordCountDriver /data/wordcount/input /data/wordcount/output
Bash
hadoop-1.0-SNAPSHOT.jar: the name of the final jar package;
org.ikeguang.hadoop.mapreduce.wordcount.WordCountDriver: the Java main class (program entry point);
/data/wordcount/input: the HDFS input directory;
/data/wordcount/output: the HDFS output directory.
結(jié)果:
Bingley 3
But 2
England; 1
Her 1
However 1
I 15
IT 1
Indeed 1
Jane, 1
Lady 1
Lizzy 2
But writing program code always has a barrier to entry. With Hive SQL (HQL for short), all it takes is this:
select word, count(1) from words group by word;
SQL
Note: this assumes a table named words whose word column holds a single word per row (the original used the placeholder name "table", which is a reserved word in Hive).
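In practice, raw data usually arrives as whole lines of text rather than one word per row. Here is a minimal, hypothetical HQL sketch of building such a words table from raw lines; the table names lines and words, the input path, and the space delimiter are assumptions for illustration, not from the original:

-- hypothetical staging table: one line of text per row
create table lines (line string);
load data inpath '/data/wordcount/input' into table lines;

-- split each line on spaces and explode the array into one word per row
create table words as
select explode(split(line, ' ')) as word
from lines;
SQL

After that, the one-line group-by query above produces the same counts as the MapReduce job.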
Anyone who knows basic SQL syntax can get started with Hive for big-data analysis, which is a merit beyond measure.
Author: 柯廣的網絡日志
WeChat official account: Java大數據與數據倉庫