
Hive's four sorting clauses explained: never mix them up again

The four sorting clauses in Hive

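Sorting is a very common operation, especially in data analysis, where we often need to order the data. Hive has four keywords related to sorting: order by, sort by, distribute by, and cluster by. This article looks at what each of them does.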

Data preparation

Below is a set of temperature data, tab-separated (year, temperature):

2008    32.0
2008    21.0
2008    31.5
2008    17.0
2013    34.0
2015    32.0
2015    33.0
2015    15.9
2015    31.0
2015    19.9
2015    27.0
2016    23.0
2016    39.9
2016    32.0

Create the table and load the data

create table ods_temperature(
    `year` int,
    temper float
)
row format delimited fields terminated by '\t';
load data local inpath '/Users/liuwenqiang/workspace/hive/temperature.data' overwrite into table ods_temperature;

1. order by (global sort)

order by sorts the entire input globally, so it runs with a single reducer (multiple reducers cannot guarantee a global order). With only one reducer, however, the computation takes a long time when the input is large.

Descending: desc
Ascending: asc, which does not need to be specified because ascending is the default
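
As a small illustration of the two directions, the queries below sort the ods_temperature table from this article. The mixed-direction query is an added example, not one the original ran.

```sql
-- Ascending is the default, so these two queries are equivalent.
select * from ods_temperature order by year;
select * from ods_temperature order by year asc;

-- Descending must be requested explicitly.
select * from ods_temperature order by year desc;

-- Each sort column can carry its own direction:
-- newest year first, and within a year the coldest reading first.
select * from ods_temperature order by year desc, temper asc;
```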

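Note that order by is affected by hive.mapred.mode. In strict mode you must use limit to restrict the amount of data being sorted, because with a large data set and a single reducer the job may run out of memory (OOM) or run for a very long time. So in strict mode, omitting limit raises the error below. For more details, see the article on Hive's strict mode and local mode.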

Error: Error while compiling statement: FAILED: SemanticException 1:39 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'year' (state=42000,code=40000)
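
The error message itself names the two ways out. A minimal sketch, assuming a Hive version that recognizes hive.strict.checks.orderby.no.limit (the property quoted in the message above); the limit value 10 is an arbitrary choice for illustration.

```sql
-- Option 1: keep strict mode and bound the sort with limit.
select * from ods_temperature order by year limit 10;

-- Option 2: relax the check for the current session, as the error message suggests.
set hive.strict.checks.orderby.no.limit=false;
set hive.mapred.mode=nonstrict;
select * from ods_temperature order by year;
```

Option 1 is usually the safer choice on large tables, since it keeps the protection that strict mode is meant to provide.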

Next, let's look at the result of sorting with order by:

select * from ods_temperature order by year;

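2. sort by (sort within a partition)

sort by is not a global sort. The sorting is completed before the data enters the reducer, that is, a sorted file is produced for each reducer before the data enters reduce. Therefore, if you sort with sort by and set mapreduce.job.reduces>1, sort by only guarantees that each reducer's output is ordered, not that the overall result is ordered.

It is not affected by the hive.mapred.mode property. sort by only guarantees that rows within the same reducer are sorted by the specified column. With sort by you can specify the number of reducers (via set mapred.reduce.tasks=n) and then merge-sort the outputs to obtain the complete sorted result.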

set mapred.reduce.tasks=3;
select * from ods_temperature sort by year;

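The output above does not reveal much, other than that it is not globally ordered. So let's try another approach and write the data to files. Since we set the number of reducers to 3, there should be three output files.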

set mapred.reduce.tasks=3;
insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' select * from ods_temperature sort by year;

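This makes things much clearer: the years within one partition are not uniform, and each file contains rows from several years.

Execution efficiency of sort by and order by

First, an observation. We usually expect sort by to be faster than order by, because order by can use only one reducer to sort everything. With a small data set this is not necessarily true, because the reducer startup time may be far longer than the data processing time. In this article's small example, order by is faster than sort by.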

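limit in sort by

You can use a limit clause with sort by to reduce the amount of data. With limit n, the number of records transferred to the reduce side drops to n * (number of maps). In other words, the limit in sort by actually limits the number of rows in each reducer; the remaining rows are then ordered by the sort by column with an order by, and finally n rows are returned to the client. So when you use limit with sort by, an order by still performs the final sort.

With order by, limit is applied to the already sorted result file, which is then handed to the reducer. The limit clause in sort by therefore reduces the amount of data that takes part in the sort, while the one in order by does not; it only limits how many rows are returned to the client.

Judging by the run times, sort by limit takes almost twice as long as order by limit, so we can guess that it involves an extra step.

Next, let's look at the execution plans of order by limit and sort by limit.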

explain select * from ods_temperature order by year limit 2;

explain select * from ods_temperature sort by year limit 2;

The two execution plans show that:

sort by limit has one more stage than order by limit (the order by plus limit stage)
sort by limit actually applies limit twice, which reduces the amount of data that takes part in the sort

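3. distribute by (data distribution)

distribute by controls how the map side splits the data among the reducers. It is similar to the partitioner in MapReduce, which partitions the data.

Hive sends each row to a reducer according to the columns listed after distribute by. By default it hashes those columns and takes the remainder modulo the number of reducers.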
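
To get a feel for the hash-and-modulo rule, the query below estimates the reducer index of each year when three reducers are used. It is only an approximation under stated assumptions: it uses Hive's built-in hash() and pmod() functions as a stand-in for the internal partitioner, and the hash actually used can differ between Hive versions and execution engines. The alias reducer_idx is made up for this sketch.

```sql
-- Assumed model: reducer index = hash(distribute by column) mod number of reducers.
-- pmod keeps the result non-negative even when the hash value is negative.
select distinct year, pmod(hash(year), 3) as reducer_idx
from ods_temperature;
```

If hash() of an int column returns the value itself, then under this model 2013 and 2016 both map to index 0 (2013 = 3 * 671, 2016 = 3 * 672), 2008 maps to 1, and 2015 maps to 2.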

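sort by produces one sorted file per reducer. In some cases you need to control which reducer particular rows go to, usually in preparation for a later aggregation. distribute by does exactly that, which is why distribute by is often used together with sort by.

For example, in the sort by example above, rows of the same year were spread across different files, that is, across different reducers. Let's see how to output rows of the same year together and then sort them by temperature in ascending order.

First, let's try a SQL statement without distribute by: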

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t'  select * from ods_temperature sort by temper ;

The result does not put rows of the same year together. Next, let's use distribute by:

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature distribute by year sort by temper ;

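Now rows of the same year are placed together. We can see that 2013 and 2016 ended up in the same file, but the years are in no particular order. In that case we can also sort by the distribute by column: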

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature distribute by year sort by year,temper ;

4. cluster by

cluster by combines the function of distribute by with that of sort by. However, the sort can only be ascending; you cannot specify ASC or DESC.
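
When a descending order is needed, one workaround is to spell out the two clauses that cluster by abbreviates, since sort by does accept a direction. A small sketch on the same table:

```sql
-- cluster by year is shorthand for: distribute by year sort by year (ascending).
-- To sort each reducer's rows in descending order, write the long form:
select * from ods_temperature distribute by year sort by year desc;
```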

When the distribute by column and the sort by column are the same, cluster by can replace the distribute by + sort by combination, which simplifies the SQL.

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature  distribute by year sort by year ;

insert overwrite local directory '/Users/liuwenqiang/workspace/hive/sort' row format delimited fields terminated by '\t' 
select * from ods_temperature cluster by year;

The two statements above produce the same output, which confirms that cluster by can replace distribute by and sort by when both use the same column.

If you try to give cluster by a sort direction, you get the following error.

Error: Error while compiling statement: FAILED: ParseException line 2:46 extraneous input 'desc' expecting EOF near '<EOF>' (state=42000,code=40000)

Summary

order by sorts globally, and its performance may be poor;
sort by orders rows within each partition, and is often combined with distribute by to determine which rows each partition holds;
distribute by defines the rule for distributing data, so rows that satisfy the same condition are sent to the same reducer;
cluster by can replace distribute by and sort by when both use the same column, but cluster by is always ascending and the sort direction cannot be specified;
sort by limit is equivalent to applying limit to each reducer's data, then running order by, then applying limit again.

Author: 柯廣的網絡日志
WeChat official account: Java大數據與數據倉庫
