Description
Context / Goal
If the two queries expressed in a dataset return different numbers of columns, they cannot possibly produce matches when doing a hash-based comparison.
We should ideally fail fast, or at the very least warn the user clearly in the results somehow.
Currently we do not do any query parsing at startup, as this is entirely delegated to the runtime drivers for the relevant databases. We also do not want to introduce a startup connectivity dependency on any given datasource.
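To illustrate why a column-count mismatch rules out any match, here is a minimal sketch of a row hash. The hashing scheme (`row_hash`, SHA-256 over length-prefixed `repr` values) is an assumption made for illustration only and is not the project's actual implementation.

```python
import hashlib


def row_hash(row):
    """Hash every column value of a row, in order (hypothetical scheme)."""
    digest = hashlib.sha256()
    for value in row:
        encoded = repr(value).encode("utf-8")
        # Length-prefix each value so column boundaries stay unambiguous.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


source_row = (1, "alice", 10.5)
target_row_same = (1, "alice", 10.5)
target_row_fewer = (1, "alice")  # target query selects one column fewer

# Identical rows hash identically.
same = row_hash(source_row) == row_hash(target_row_same)
# A row with a different number of columns feeds different bytes into the hash.
different = row_hash(source_row) != row_hash(target_row_fewer)
```

Because every column contributes to the hash, a target row with fewer (or more) columns yields a different hash for every row, so the whole run ends up as mismatches.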
Expected Outcome
Evaluate and implement some approach to addressing this
- is there a simple way to do basic parsing of SQL queries at startup to at least count the number of columns?
- if people get fancy with database-specific things like pivots and collations (not sure why) this would likely break down
- easiest solution (but least performant) is to just run the whole rec, and check when calculating/storing metadata. Warnings could just be inferred and attached to the returned run from the API, e.g. something like the below (or at top level)
"summary": { "bothMatched": 0, "bothMismatched": 3, "sourceOnly": 0, "sourceTotal": 3, "targetOnly": 0, "targetTotal": 3, "total": 3, "warnings": [ "source query has different number of columns to target query. This is guaranteed to produce 100% mismatches." ] },
- next easiest would be to abort the rec after checking the first row of the target dataset. This would have implications for "Interleave persistence of source + target rows" (#41), however, and makes things a bit more complex, as the target parsing needs to understand something about the source parsing.
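The options above could be sketched roughly as follows. This is a naive illustration, not a proposed implementation: `split_select_list`, `count_select_columns` and `column_count_warnings` are hypothetical names, and the parsing only handles plain `SELECT ... FROM` lists with parentheses and quotes. It gives up (returns `None`) on `SELECT *`, and it makes no attempt to understand database-specific syntax. The warning helper takes plain counts, so the same check could equally be fed from result metadata at runtime instead of from startup parsing.

```python
import re


def split_select_list(sql):
    """Split the first SELECT list into top-level items (naive, best effort)."""
    match = re.search(r"\bselect\b", sql, re.IGNORECASE)
    if match is None:
        return None
    items, current, depth, quote = [], [], 0, None
    i = match.end()
    while i < len(sql):
        ch = sql[i]
        if quote:
            if ch == quote:
                quote = None  # closing quote
        elif ch in ("'", '"'):
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and ch == ",":
            # A top-level comma ends the current select item.
            items.append("".join(current).strip())
            current = []
            i += 1
            continue
        elif depth == 0 and ch in "fF" and re.match(r"from\b", sql[i:], re.IGNORECASE):
            # Only treat FROM as a keyword when it starts a new word,
            # so identifiers such as valid_from are not cut short.
            if not (current and (current[-1].isalnum() or current[-1] == "_")):
                break
        current.append(ch)
        i += 1
    items.append("".join(current).strip())
    return items


def count_select_columns(sql):
    """Return the number of selected columns, or None if it cannot be inferred."""
    items = split_select_list(sql)
    if not items or any(not item or item == "*" or item.endswith(".*") for item in items):
        return None  # wildcard or unparseable: only the database can tell
    return len(items)


def column_count_warnings(source_count, target_count):
    """Build run-level warnings from two column counts (either may be None)."""
    if source_count is None or target_count is None:
        return []
    if source_count != target_count:
        return [
            "source query has different number of columns to target query. "
            "This is guaranteed to produce 100% mismatches."
        ]
    return []


warnings = column_count_warnings(
    count_select_columns("SELECT id, name, COALESCE(a, b) AS c FROM src"),
    count_select_columns("SELECT id, name FROM tgt"),
)
```

Returning `None` rather than guessing keeps the check conservative: when the count is unknown no warning is raised at startup, and the runtime check remains the fallback.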
Out of Scope
Additional context / implementation notes
- Related to "Improve experience for comparing floating point/decimal numbers" (#74) if we take the "warning" approach there