HIVE-19326: stats auto gather: incorrect aggregation during UNION queries (may lead to incorrect results)
Review Request #67126 - Created May 15, 2018 and updated
| Information | |
|---|---|
| Zoltan Haindrich | |
| hive-git | |
| HIVE-19326 | |
| 468857c... | |
| Reviewers | |
| hive | |
| ashutoshc, sershe | |
In queries like: INSERT ... SELECT ... UNION ALL SELECT ...
the stats are only collected for the first select. There are 2 issues fixed here, both of which led to the same result:
- the stats collectors overwrote each other's results, because the filename depended only on the resulting table name
- in the tez.merge.files case, the 2nd task was not set to collect statistics
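To illustrate the first issue, here is a minimal sketch of how a key that depends only on the target table name makes one union branch overwrite the other's result. It is not code from the patch: the class `StatsKeySketch`, the methods `keyByTable`, `keyByTableAndBranch`, and `publish`, and the `table/branchIndex` key layout are all hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hive.
public class StatsKeySketch {

    // Key scheme resembling the bug: depends only on the target table name.
    static String keyByTable(String table) {
        return table;
    }

    // Assumed alternative: table name plus a per-branch discriminator.
    static String keyByTableAndBranch(String table, int branchIndex) {
        return table + "/" + branchIndex;
    }

    // Each union branch publishes its row count under a key.
    // Returns the total row count visible after all branches have published.
    static long publish(boolean perBranchKeys, long[] rowsPerBranch) {
        Map<String, Long> published = new HashMap<>();
        for (int i = 0; i < rowsPerBranch.length; i++) {
            String key = perBranchKeys
                    ? keyByTableAndBranch("t1", i)
                    : keyByTable("t1");
            // With a shared key, this put replaces the earlier branch's entry.
            published.put(key, rowsPerBranch[i]);
        }
        long total = 0;
        for (long rows : published.values()) {
            total += rows;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] rows = {500L, 500L}; // two union branches, 500 rows each
        System.out.println(publish(false, rows)); // prints 500: one branch is lost
        System.out.println(publish(true, rows));  // prints 1000: both branches counted
    }
}
```

With the shared key, only one branch's row count survives, which matches the symptom of stats covering a single select.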
ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java (Diff revision 1)
This change fixes the tez.merge.files case; the only problem in that case is that the second filesink is not gathering stats.
My own opinion is that gathering statistics has no real overhead (it writes a file)... I think enabling it only here and there just adds complexity.
ql/src/test/results/clientpositive/llap/union_stats.q.out (Diff revision 1)
FS_7 is present 2 times in this plan.
Operator ids are reused multiple times in queries like:
from (select * from src union all select * from src)s insert overwrite table t1 select * insert overwrite table t2 select *;
If I understand correctly, the file sink ids are actually reused in every union branch to write the output.
HIVE-19237 should fix this, and probably also remove the indexInTezUnion setters, etc.