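In real-world data warehouse development, multi-table joins are common, and with them comes the question of how to use ON and WHERE. If the roles of the two in a data warehouse are unclear to you, this article should clear things up.
First, the most visible differences between WHERE and ON in SQL:
When a condition is written in ON, an outer join still returns the rows of the preserved table that do not satisfy it (padded with NULL), whereas WHERE returns only the rows that satisfy the condition.
The effect of an ON condition changes with the join type (inner vs. outer), while a WHERE condition always behaves the same way: it simply filters the joined result.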
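These differences follow from standard SQL join semantics, so they can be checked without a Hive cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (an assumption made for illustration; the tables mirror the article's a and b but omit the dt partition column) and applies a filter on the right table once in ON and once in WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT, name TEXT);
    CREATE TABLE b (id TEXT, dept TEXT);
    INSERT INTO a VALUES ('1', 'Daniel'), ('2', 'Andy'), ('3', 'Marc');
    INSERT INTO b VALUES ('1', 'BD'), ('2', 'BE');
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Filter in ON: every row of a survives; rows without a match get NULL (None) for b.dept.
on_rows = run("""
    SELECT a.id, a.name, b.dept
    FROM a LEFT JOIN b ON a.id = b.id AND b.dept = 'BD'
    ORDER BY a.id
""")

# Filter in WHERE: applied after the join, so the NULL-padded rows are discarded.
where_rows = run("""
    SELECT a.id, a.name, b.dept
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.dept = 'BD'
    ORDER BY a.id
""")

# For an inner join, the placement does not change the result.
inner_on = run("SELECT a.id, b.dept FROM a JOIN b ON a.id = b.id AND b.dept = 'BD'")
inner_where = run("SELECT a.id, b.dept FROM a JOIN b ON a.id = b.id WHERE b.dept = 'BD'")

print(on_rows)     # three rows, two of them NULL-padded
print(where_rows)  # one row
```

The ON version returns all three rows of a, the WHERE version only the one matching row, and the two inner-join variants agree with each other.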
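This article focuses mainly on how WHERE and ON affect performance in Hive. If you have a Hive environment available, you can try it yourself with the following data: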
CREATE TABLE a (id string,name string) PARTITIONED BY (dt STRING);
CREATE TABLE b (id string,dept string) PARTITIONED BY (dt STRING);
INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("1","Daniel");
INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("2","Andy");
INSERT INTO TABLE a PARTITION(dt='2022-09-08')VALUES ("3","Marc");
INSERT INTO TABLE b PARTITION(dt='2022-09-08')VALUES ("1","BD");
INSERT INTO TABLE b PARTITION(dt='2022-09-08')VALUES ("2","BE");
SELECT * from a where dt = '2022-09-08';
SELECT * from b where dt = '2022-09-08';
Start with a practical requirement: join tables a and b, taking only the latest date's data from a.
SELECT *
FROM a
JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';
Most people would write it this way. The conclusion first: there is nothing wrong with it.
Some readers might try this instead:
SELECT *
FROM a
JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';
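For an inner join this is equivalent to the query above and is also fine. So where is the problem?
If a must be the driving table with b joined to it, that is, a left outer join, the two forms are no longer interchangeable: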
SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';
Form 1, efficient: Hive reads only the data for the specified date.
SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';
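Form 2, slow: Hive reads all of a for the join and evaluates the date condition only inside the join. Note that the result can also differ from form 1: because a is the preserved table, rows of a from other dates are still returned, with the columns of b set to NULL. With the sample data a has a single partition, so the difference does not show up.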
SELECT *
FROM
(SELECT *
FROM a
WHERE dt = '2022-09-08') t1
LEFT JOIN b ON t1.id = b.id;
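Form 3, efficient: Hive again reads only the data for the specified date. The subquery looks clumsy, but its effect is the same as form 1. To write SQL without the extra subquery, it helps to understand predicate pushdown in Hive first.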
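Before turning to the plans, the three forms can be compared on results alone. This is a minimal sketch using Python's sqlite3 as a stand-in engine (an assumption; SQLite has no partitions, so dt is a plain column here), with one hypothetical extra row of a in an older date added so that the difference becomes visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT, name TEXT, dt TEXT);
    CREATE TABLE b (id TEXT, dept TEXT, dt TEXT);
    INSERT INTO a VALUES ('1', 'Daniel', '2022-09-08'),
                         ('2', 'Andy',   '2022-09-08'),
                         ('3', 'Marc',   '2022-09-08'),
                         ('1', 'Daniel', '2022-09-07');  -- hypothetical older row
    INSERT INTO b VALUES ('1', 'BD', '2022-09-08'),
                         ('2', 'BE', '2022-09-08');
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Form 1: date condition in WHERE.
form1 = run("""
    SELECT a.id, a.dt, b.dept
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE a.dt = '2022-09-08'
    ORDER BY a.dt, a.id
""")

# Form 2: date condition in ON. Rows of a from other dates are kept, NULL-padded.
form2 = run("""
    SELECT a.id, a.dt, b.dept
    FROM a LEFT JOIN b ON a.id = b.id AND a.dt = '2022-09-08'
    ORDER BY a.dt, a.id
""")

# Form 3: date condition in a subquery on a.
form3 = run("""
    SELECT t1.id, t1.dt, b.dept
    FROM (SELECT * FROM a WHERE dt = '2022-09-08') t1
    LEFT JOIN b ON t1.id = b.id
    ORDER BY t1.dt, t1.id
""")

print(form1 == form3)  # forms 1 and 3 agree
print(form2)           # form 2 also contains the 2022-09-07 row
```

Forms 1 and 3 return the same three rows, while form 2 returns a fourth row for 2022-09-07 with a NULL dept, even though its id matches a row in b. Choosing between the forms is therefore a question of semantics first and performance second.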
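The execution plans of form 1 and form 2 illustrate the point. The engine used here is Hive on Spark; the first plan below belongs to form 1 and the second to form 2.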
Explain
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53374
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
// no filter needed here
Spark HashTable Sink Operator
keys:
0 id (type: string)
1 id (type: string)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53373
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
// the filter is applied during the table scan, so the HashTable Sink Operator needs no filter later
filterExpr: (dt = '2022-09-08') (type: boolean)
Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (dt = '2022-09-08') (type: boolean)
Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: string)
1 id (type: string)
outputColumnNames: _col0, _col1, _col6, _col7, _col8
input vertices:
1 Map 2
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), '2022-09-08' (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Explain
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53395
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Spark HashTable Sink Operator
// predicate evaluated once here
filter predicates:
0 {(dt = '2022-09-08')}
1
keys:
0 id (type: string)
1 id (type: string)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53394
Vertices:
Map 1
Map Operator Tree:
TableScan
// no filter at the table scan, so the predicate has to be evaluated in the HashTable Sink Operator of each stage
alias: a
Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
// predicate evaluated a second time here
filter predicates:
0 {(dt = '2022-09-08')}
1
keys:
0 id (type: string)
1 id (type: string)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
input vertices:
1 Map 2
Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
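As the annotations show, with the predicate pushed down in form 1, data is filtered as soon as the table is scanned. In form 2, where the predicate is not pushed down, all of the data is read and the predicate is then evaluated several times inside the join operators.
Predicate Pushdown (PPD): in short, apply filter conditions as early as possible without changing the result. After pushdown the filter runs on the map side, which reduces map output, lowers the volume of data moved across the cluster, saves cluster resources, and improves job performance.
PPD is controlled by the parameter hive.optimize.ppd, which is enabled by default.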
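The idea can be shown with a toy model. This is only a sketch of the principle, not of Hive's implementation: the join function, the generated data, and the row counter are all hypothetical, and the filter is a WHERE-style condition on the left (preserved) table of a left outer join.

```python
def left_join_on_id(left_rows, right_rows):
    """Toy hash left outer join on 'id'; also reports how many left rows it processed."""
    index = {}
    for row in right_rows:
        index.setdefault(row["id"], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left["id"], [None]):
            dept = right["dept"] if right is not None else None
            joined.append((left["id"], left["dt"], dept))
    return joined, len(left_rows)

# Hypothetical data: 8 daily partitions of a, 3 ids per day.
a = [{"id": str(i), "dt": f"2022-09-{day:02d}"} for day in range(1, 9) for i in (1, 2, 3)]
b = [{"id": "1", "dept": "BD"}, {"id": "2", "dept": "BE"}]
target = "2022-09-08"

# Without pushdown: join everything, then filter the joined rows.
joined_all, rows_without_ppd = left_join_on_id(a, b)
late = [row for row in joined_all if row[1] == target]

# With pushdown: filter first, then join only the surviving rows.
early, rows_with_ppd = left_join_on_id([r for r in a if r["dt"] == target], b)

print(late == early, rows_without_ppd, rows_with_ppd)  # True 24 3
```

Both orders produce the same rows, but the pushed-down version feeds 3 rows into the join instead of 24. In a distributed engine that gap translates into less data scanned and shuffled.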
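| Term | Short name | Explanation |
|---|---|---|
| Preserved Row table | Preserved table | The table in an outer join that must return all of its rows. In a left outer join the left table is the preserved table; in a right outer join the right table is; in a full outer join both tables must return all rows, so both are preserved tables. |
| Null Supplying table | Null-supplying table | The table in an outer join whose columns are filled with NULL for rows that have no match. In a left outer join all rows of the left table are returned and the columns of the right table are NULL where there is no match, so the right table is the null-supplying table; in a right outer join the left table is; in a full outer join both tables are Null Supplying tables, because both are padded with NULL for unmatched rows. |
| During Join predicate | Join predicate | A predicate in the JOIN ... ON clause. For example, in a join b on a.id=1, a.id=1 is a During Join predicate. |
| After Join predicate | Post-join predicate | A predicate in the WHERE clause. |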
The logic can be summarized by these two rules:
- During Join predicates cannot be pushed past Preserved Row tables. (A predicate on the preserved table written in ON cannot be pushed down.)
- After Join predicates cannot be pushed past Null Supplying tables. (A predicate on the null-supplying table written in WHERE cannot be pushed down.)
This is captured in the following table:
| | Preserved Row Table | Null Supplying Table |
|---|---|---|
| Join Predicate | Case J1: Not Pushed | Case J2: Pushed |
| Where Predicate | Case W1: Pushed | Case W2: Not Pushed |
For the individual cases, see the official wiki, which includes a fairly detailed analysis of the execution plans: https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior
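The two rules are compact enough to encode directly. The sketch below is a hypothetical helper (the names PRESERVED, NULL_SUPPLYING, and is_pushed are made up for illustration) that applies only those two rules; real plans can differ with the Hive version and optimizer settings, so EXPLAIN remains the authoritative check.

```python
# Which side(s) of the join play each role, per join type.
PRESERVED = {"inner": set(), "left": {"left"}, "right": {"right"}, "full": {"left", "right"}}
NULL_SUPPLYING = {"inner": set(), "left": {"right"}, "right": {"left"}, "full": {"left", "right"}}

def is_pushed(join_type, side, clause):
    """Return True if a predicate on `side` ('left'/'right') written in `clause`
    ('on'/'where') can be pushed down according to the two rules."""
    if clause == "on":
        # Rule 1: join predicates cannot be pushed past preserved row tables.
        return side not in PRESERVED[join_type]
    if clause == "where":
        # Rule 2: where predicates cannot be pushed past null-supplying tables.
        return side not in NULL_SUPPLYING[join_type]
    raise ValueError(f"unknown clause: {clause!r}")

# Print the full matrix of join type x table side x clause.
for clause in ("on", "where"):
    for join_type in ("inner", "left", "right", "full"):
        for side in ("left", "right"):
            verdict = "Pushed" if is_pushed(join_type, side, clause) else "Not Pushed"
            print(f"{clause:5} {join_type:5} join, {side:5} table: {verdict}")
```

For example, is_pushed("left", "left", "on") is False, which corresponds to the slow form 2 above, while is_pushed("left", "left", "where") is True, which corresponds to form 1.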
Concrete cases
| Pushed or Not | SQL |
|---|---|
| Pushed | select * from a join b on a.id = b.id and a.dt = '2022-09-08'; |
| Pushed | select * from a join b on a.id = b.id where a.dt = '2022-09-08'; |
| Pushed | select * from a join b on a.id = b.id and b.dt = '2022-09-08'; |
| Pushed | select * from a join b on a.id = b.id where b.dt = '2022-09-08'; |
| Not Pushed | select * from a left join b on a.id = b.id and a.dt = '2022-09-08'; |
| Pushed | select * from a left join b on a.id = b.id where a.dt = '2022-09-08'; |
| Pushed | select * from a left join b on a.id = b.id and b.dt = '2022-09-08'; |
| Not Pushed | select * from a left join b on a.id = b.id where b.dt = '2022-09-08'; |
| Pushed | select * from a right join b on a.id = b.id and a.dt = '2022-09-08'; |
| Not Pushed | select * from a right join b on a.id = b.id where a.dt = '2022-09-08'; |
| Not Pushed | select * from a right join b on a.id = b.id and b.dt = '2022-09-08'; |
| Pushed | select * from a right join b on a.id = b.id where b.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id and a.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id where a.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id and b.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id where b.dt = '2022-09-08'; |
| Predicate location | join (inner join): left table | join (inner join): right table | left outer join: left table | left outer join: right table | right outer join: left table | right outer join: right table | full outer join: left table | full outer join: right table |
|---|---|---|---|---|---|---|---|---|
| join | Pushed | Pushed | Not Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Not Pushed |
| where | Pushed | Pushed | Pushed | Not Pushed | Not Pushed | Pushed | Not Pushed | Not Pushed |
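Predicates that use non-deterministic functions such as rand() cannot be pushed down. unix_timestamp() appears to be an exception: in the execution plan below, the call has already been folded to a constant and the predicate is applied at the table scan, so it is pushed down.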
EXPLAIN
SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = unix_timestamp();
Explain
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53432
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Spark HashTable Sink Operator
// no filter needed here
keys:
0 id (type: string)
1 id (type: string)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53431
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
// already filtered during the table scan
filterExpr: (dt = 1662522398) (type: boolean)
Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (dt = 1662522398) (type: boolean)
Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: string)
1 id (type: string)
outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
input vertices:
1 Map 2
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
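1. For Join (Inner Join) and Full Outer Join, writing a condition in ON or in WHERE makes no difference to performance;
2. For Left Outer Join, performance improves when conditions on the right table are written in ON and conditions on the left table are written in WHERE;
3. For Right Outer Join, performance improves when conditions on the left table are written in ON and conditions on the right table are written in WHERE;
4. When conditions are spread across both tables, pushdown follows conclusions 2 and 3 for each condition independently, as shown below:
| SQL | When the filter is applied |
|---|---|
| select * from a left outer join b on ( a.id = b.id and a.dt='2022-09-08' and b.id = '2022-09-08'); | id is filtered on the map side, dt on the reduce side; inefficient |
| select * from a left outer join b on ( a.id = b.id and b.id = '2022-09-08') where a.dt='2022-09-08'; | id and dt are both filtered on the map side; efficient |
| select * from a left outer join b on ( a.id = b.id and a.dt='2022-09-08') where b.id = '2022-09-08'; | id and dt are both filtered on the reduce side; very inefficient |
| select * from a left outer join b on ( a.id = b.id ) where a.dt='2022-09-08' and b.id = '2022-09-08'; | id is filtered on the reduce side, dt on the map side; inefficient |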
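One way to see why the efficient combination can be pushed down is that it means the same thing as filtering each table before the join. The sketch below checks this with Python's sqlite3 as a stand-in engine (an assumption; dt is a plain column, the right-table filter is on b.dt rather than b.id, and the older-date rows are hypothetical). It also shows that moving the right-table condition to WHERE changes the result, so the four forms in the table are not interchangeable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id TEXT, name TEXT, dt TEXT);
    CREATE TABLE b (id TEXT, dept TEXT, dt TEXT);
    INSERT INTO a VALUES ('1', 'Daniel', '2022-09-08'),
                         ('2', 'Andy',   '2022-09-08'),
                         ('3', 'Marc',   '2022-09-08'),
                         ('1', 'Daniel', '2022-09-07');  -- hypothetical older row
    INSERT INTO b VALUES ('1', 'BD', '2022-09-08'),
                         ('2', 'BE', '2022-09-08'),
                         ('2', 'BE', '2022-09-07');      -- hypothetical older row
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Right-table condition in ON, left-table condition in WHERE (the efficient form).
mixed = run("""
    SELECT a.id, b.dept
    FROM a LEFT JOIN b ON a.id = b.id AND b.dt = '2022-09-08'
    WHERE a.dt = '2022-09-08'
    ORDER BY a.id
""")

# The same filters applied by hand before the join.
prefiltered = run("""
    SELECT t1.id, t2.dept
    FROM (SELECT * FROM a WHERE dt = '2022-09-08') t1
    LEFT JOIN (SELECT * FROM b WHERE dt = '2022-09-08') t2 ON t1.id = t2.id
    ORDER BY t1.id
""")

# Both conditions in WHERE: unmatched rows of a are dropped, so the result differs.
where_both = run("""
    SELECT a.id, b.dept
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE a.dt = '2022-09-08' AND b.dt = '2022-09-08'
    ORDER BY a.id
""")

print(mixed == prefiltered)  # the ON/WHERE form matches manual pre-filtering
print(mixed, where_both)     # the all-WHERE form loses the unmatched row
```

The first two queries return the same three rows, including id 3 with a NULL dept; the all-WHERE query returns only the two matched rows.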