
HIVE-19326: stats auto gather: incorrect aggregation during UNION queries (may lead to incorrect results)

Review Request #67126 - Created May 15, 2018 and updated

Information
Zoltan Haindrich
hive-git
HIVE-19326
468857c...
Reviewers
hive
ashutoshc, sershe

In queries like INSERT ... SELECT ... UNION ALL SELECT ...,
the stats are collected only for the first SELECT.

Two issues are fixed, both of which produced the same symptom:

  • The stats collectors overwrote each other's results, because the file name depended only on the name of the resulting table.
  • In the tez.merge.files case, the second task was not set up to collect statistics.
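
As a rough illustration of how both issues lead to the same under-counted statistics, here is a minimal Python sketch. It is not Hive's code: the `collect` function, the key functions, the `gather_stats` flag, and the row counts are all hypothetical, chosen only to model "one stats file per key" and "a sink that is not told to gather".

```python
def collect(sinks, key_fn):
    """Model each file sink publishing its row count under a stats key."""
    published = {}
    for sink in sinks:
        if not sink["gather_stats"]:
            continue  # a sink not set to gather stats publishes nothing
        # Two sinks that map to the same key overwrite each other here.
        published[key_fn(sink)] = sink["rows"]
    return sum(published.values())


def table_only_key(sink):
    # Hypothetical key that depends only on the target table name.
    return sink["table"]


def per_sink_key(sink):
    # Hypothetical key that also distinguishes the union branch.
    return f'{sink["table"]}/{sink["branch"]}'


# Two UNION ALL branches writing 500 rows each into the same table.
sinks = [
    {"table": "t1", "branch": 0, "rows": 500, "gather_stats": True},
    {"table": "t1", "branch": 1, "rows": 500, "gather_stats": True},
]

collided = collect(sinks, table_only_key)  # second branch overwrites the first
fixed = collect(sinks, per_sink_key)       # both branches are counted

# Second issue: the second sink is never asked to gather stats.
second_skipped = [sinks[0], dict(sinks[1], gather_stats=False)]
under_counted = collect(second_skipped, per_sink_key)

print(collided, fixed, under_counted)
```

In this model the first and third totals account for only one branch, which matches the symptom described above: stats that reflect only the first SELECT.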

  
Zoltan Haindrich

   

This change fixes the tez.merge.files case, because the only problem in that case is that the second file sink is not gathering stats.

My own opinion is that gathering statistics has no real overhead (it writes a file). I think enabling it only here and there mostly just adds complexity.

FS_7 is present twice in this plan.

Operator IDs are reused multiple times in queries like:

from (select * from src union all select * from src)s
insert overwrite table t1 select *
insert overwrite table t2 select *;

If I understand correctly, the file sink IDs are reused in every union branch to write the output.

HIVE-19237 should fix this, and probably also remove the indexInTezUnion setters etc.
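
To make the operator-ID reuse concrete, here is a small Python sketch. The plan contents are invented for illustration (only FS_7 comes from the comment above). The branch index is a hypothetical stand-in for whatever distinguishes the union branches, and is not meant to reproduce how indexInTezUnion is actually used.

```python
from collections import Counter

# Hypothetical flattened plan: (operator id, union branch index).
plan = [("TS_0", 0), ("TS_2", 1), ("FS_7", 0), ("FS_7", 1)]


def reused_ids(plan):
    """Return the operator ids that occur more than once in the plan."""
    counts = Counter(op_id for op_id, _ in plan)
    return sorted(op_id for op_id, n in counts.items() if n > 1)


# Keying by id alone silently drops one of the two FS_7 entries;
# keying by (id, branch) keeps them apart.
by_id = {op_id: branch for op_id, branch in plan}
by_id_and_branch = {(op_id, branch): branch for op_id, branch in plan}

print(reused_ids(plan), len(by_id), len(by_id_and_branch))
```

Anything that tracks per-operator state by ID alone, such as a stats file name, would see the two FS_7 sinks as one in this model.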
