HIVE-19326: stats auto gather: incorrect aggregation during UNION queries (may lead to incorrect results)

Information
Submitter:	Zoltan Haindrich
Repository:	hive-git
Branch:
Bugs:	HIVE-19326
Depends On:
Commit:	468857c...
Reviewers
Groups:	hive
People:	ashutoshc, sershe

Description

In queries like: INSERT ... SELECT ... UNION ALL SELECT ...

the stats are collected only for the first SELECT.
Two issues are fixed here, both of which produced the same symptom:

The stats collectors overwrote each other's results, because the filename depended only on the name of the target table.
In the tez.merge.files case, the second task was not set to collect statistics.
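
A minimal sketch of the first issue, not the actual Hive code: the class, the method names, and the key format ("default.t1/branch0") are all hypothetical and serve only to show why a key derived from the table name alone loses one branch's row count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hive.
public class StatsKeyCollision {

    // Stand-in for a stats store keyed by the collector's output file name.
    static Map<String, Long> publish(String[] keys, long[] rowCounts) {
        Map<String, Long> store = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // A writer that uses an already-present key replaces the earlier entry.
            store.put(keys[i], rowCounts[i]);
        }
        return store;
    }

    // Stand-in for the aggregation step: sum whatever entries survived.
    static long aggregate(Map<String, Long> store) {
        long total = 0;
        for (long rows : store.values()) {
            total += rows;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] rowsPerBranch = {500L, 500L};

        // Key depends only on the target table: both union branches collide.
        String[] tableOnly = {"default.t1", "default.t1"};
        // Key also carries a per-branch suffix (assumed format): no collision.
        String[] perBranch = {"default.t1/branch0", "default.t1/branch1"};

        System.out.println(aggregate(publish(tableOnly, rowsPerBranch))); // prints 500
        System.out.println(aggregate(publish(perBranch, rowsPerBranch))); // prints 1000
    }
}
```

With the table-only key, a single entry survives and the aggregated row count covers only one branch of the union, which matches the symptom described above.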

Testing Done

ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java (Diff revision 1)

This change fixes the tez.merge.files case; the only problem in that case is that the second file sink does not gather stats.

My own opinion is that gathering statistics has no real overhead (it writes a file). I think enabling it only here and there mostly just adds complexity.

ql/src/test/results/clientpositive/llap/union_stats.q.out (Diff revision 1)

FS_7 appears twice in this plan.

Operator ids are reused multiple times in queries like:

from (select * from src union all select * from src)s
insert overwrite table t1 select *
insert overwrite table t2 select *;

If I understand correctly, the file sink ids are reused in every union branch to write the output.

HIVE-19237 should fix this, and probably also remove the indexInTezUnion setters, etc.
