Remove OutputSchemaVisitor, add schema() as a QueryExpr method #74

TedTed · 2025-10-13T15:10:06Z

This is one step towards getting rid of the visitor pattern in our code.

Halfway through this change, I suddenly realized that it would have been better to do this differently, and instead add schema as a field of QueryExpr, then add a rewriting pass that fills this value. But I still think this change is doing more good than harm, so here is a PR.
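For readers without the diff open, the shape of the change can be sketched roughly as follows. This is a simplified illustration, not the actual code in `_query_expr.py`: the `PrivateSource` and `Rename` classes, their fields, the `catalog` argument, and the dict-based schema are all stand-ins assumed for the example. The idea is that each expression type computes its own output schema in a `schema()` method, instead of a separate visitor class holding one `visit_*` method per expression type.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-ins: a schema maps column names to type names,
# and a catalog maps source IDs to schemas.
Schema = Dict[str, str]
Catalog = Dict[str, Schema]


class QueryExpr(ABC):
    """Base class: every expression knows how to compute its own schema."""

    @abstractmethod
    def schema(self, catalog: Catalog) -> Schema:
        """Return the output schema of this expression."""


@dataclass(frozen=True)
class PrivateSource(QueryExpr):
    source_id: str

    def schema(self, catalog: Catalog) -> Schema:
        if self.source_id not in catalog:
            raise ValueError(f"Unknown source: {self.source_id}")
        return dict(catalog[self.source_id])


@dataclass(frozen=True)
class Rename(QueryExpr):
    child: QueryExpr
    column_mapper: Dict[str, str]

    def schema(self, catalog: Catalog) -> Schema:
        # Validation and schema computation live next to the node they describe.
        child_schema = self.child.schema(catalog)
        missing = set(self.column_mapper) - set(child_schema)
        if missing:
            raise ValueError(f"Cannot rename missing columns: {sorted(missing)}")
        return {self.column_mapper.get(c, c): t for c, t in child_schema.items()}
```

With this layout, adding a new expression type means adding one class with its own `schema()` method, rather than touching both the expression definitions and a visitor.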

tmager

I need to read through the code change in more detail, but a couple of initial thoughts:

"add schema as a field of QueryExpr, then add a rewriting pass that fills this value": Could you expand on this a bit, and why you think it would be a better approach? My first thought is that having an additional field that may or may not be set at a given point in time will make things less clear.
Not something for this MR, but it seems like _query_expr.py is getting very long, and we haven't incorporated that much of the compilation into it yet. Should we consider breaking it up into a multi-file submodule?

src/tmlt/analytics/_query_expr.py

TedTed · 2025-10-13T21:37:06Z

"add schema as a field of QueryExpr, then add a rewriting pass that fills this value": Could you expand on this a bit, and why you think it would be a better approach? My first thought is that having an additional field that may or may not be set at a given point in time will make things less clear.

Yes, I initially proposed to have a separate intermediary object, but it was super hard to prototype this and start making progress in that direction. After discussing this with @Maegereg, I changed the proposal to instead use a series of rewrite rules, in a way similar to how KeySets now work. This does lead to a situation where some fields start out unset and then get set in rewrite rules, but I think we will still end up with much cleaner logic, with a clear separation between successive rewrite rules, and each one being simpler and self-contained. In that world, it would make sense to me to also have schema determination be one of the rewrite rules (but I'm not 100% sure about this; in particular, I'm not super clear on when this would need to happen in the sequence of rewrite rules). I would love your feedback on this!
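The "series of rewrite rules" idea can be illustrated with a minimal sketch. Everything here is hypothetical (the `CountQuery` node, its `mechanism` field, and both rules are invented for the example and are not taken from the codebase); it only shows the pattern of a field that starts unset and is filled in by one small, self-contained rule in a pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CountQuery:
    """Hypothetical query node; `mechanism` is left unset by the user."""

    column: str
    mechanism: Optional[str] = None  # filled in later by a rewrite rule


RewriteRule = Callable[[CountQuery], CountQuery]


def normalize_column(expr: CountQuery) -> CountQuery:
    """Rule 1: strip surrounding whitespace from the column name."""
    return replace(expr, column=expr.column.strip())


def select_mechanism(expr: CountQuery) -> CountQuery:
    """Rule 2: fill in a default mechanism if none was chosen."""
    if expr.mechanism is None:
        return replace(expr, mechanism="laplace")
    return expr


def apply_rules(expr: CountQuery, rules: Sequence[RewriteRule]) -> CountQuery:
    """Apply each rule in order, feeding the output of one into the next."""
    return reduce(lambda current, rule: rule(current), rules, expr)
```

Each rule returns a new immutable node rather than mutating its input, which keeps the rules independent of each other. The downside raised in the review is also visible: before `select_mechanism` runs, `mechanism` is `None`, so any code reading it has to know where it sits in the pipeline.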

Not something for this MR, but it seems like _query_expr.py is getting very long, and we haven't incorporated that much of the compilation into it yet. Should we consider breaking it up into a multi-file submodule?

Yes, I think this PR basically gets this file into "should be broken up if we add more logic" territory; I may do this as a follow-up issue if I need to add more stuff in there.

tmager · 2025-10-13T23:42:28Z

Ah, I see where you're coming from. For some properties I could see that, but for schemas specifically: are there situations where the rewrite of a particular expression changes the resulting schema? It seems to me that shouldn't happen for the expression end-to-end (though some of the intermediate pieces may have their schemas changed/rearranged in the rewrite). That might be why it's unclear when the rewrite rule that fills in the schema should run -- there's no obvious good time, because it should work equivalently at any point. On a related note, what would a rewrite rule that fills in the schemas look like? We presumably don't want to embed all the schema logic in that rewrite rule (that just gets us back to something like the schema visitor), which means calling some QueryExpr method... which is exactly what this MR is doing already, but now there's this extra step and the property isn't always defined.

(This is slightly complicated by the fact that rewrite rules need to be able to do things like "pull out this entire expression and turn it into a separate query to get groups from it", but I think the above idea about the rewrites not changing the schema still holds: we're replacing a general expression that does a groupby where the groups have a particular schema with a concrete groupby referencing a new expression with that same output schema, so the overall schema of the original expression doesn't change.)
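The argument that such a rewrite leaves the overall schema unchanged can be made concrete with a small sketch. The classes and the `extract_groups` rewrite below are hypothetical simplifications, not the real expression types: a group-by whose groups come from a general expression is rewritten into a group-by that references a new named source with that same schema, and `schema()` returns the same result before and after.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Tuple, Union

Schema = Dict[str, str]
Catalog = Dict[str, Schema]


@dataclass(frozen=True)
class Source:
    name: str

    def schema(self, catalog: Catalog) -> Schema:
        return dict(catalog[self.name])


@dataclass(frozen=True)
class SelectDistinct:
    child: Source
    columns: Tuple[str, ...]

    def schema(self, catalog: Catalog) -> Schema:
        child_schema = self.child.schema(catalog)
        return {c: child_schema[c] for c in self.columns}


@dataclass(frozen=True)
class GroupByCount:
    child: Source
    groups: Union[Source, SelectDistinct]  # expression whose rows define the groups

    def schema(self, catalog: Catalog) -> Schema:
        # Output columns: the group columns plus the count.
        return {**self.groups.schema(catalog), "count": "INTEGER"}


def extract_groups(
    expr: GroupByCount, catalog: Catalog, new_name: str
) -> Tuple[GroupByCount, Catalog]:
    """Hypothetical rewrite: turn the groups expression into a separate source.

    The new source is registered with the schema of the expression it replaces,
    so the schema of the rewritten group-by is unchanged.
    """
    new_catalog = {**catalog, new_name: expr.groups.schema(catalog)}
    return GroupByCount(expr.child, Source(new_name)), new_catalog
```

Because `schema()` is computed on demand from the current tree, it gives the same answer whether it is called before or after the rewrite, which matches the point that there is no single "right time" for a schema-filling rule to run.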

TedTed · 2025-10-14T07:29:19Z

Hmm, you're right, so maybe the way of computing the schema in this PR is the right approach.

Maegereg

I have some suggestions for improvements, but I'm open to the idea that this should be a clean refactor that doesn't change functionality (and thus improvements should be pushed to follow-on work).

src/tmlt/analytics/_query_expr.py

Maegereg · 2025-10-14T19:07:19Z

Also, I think I agree with Tom - doing schemas as a rewrite seems worse than this approach.

tmager

Found a few possible bugs, but I think they're all in the original code -- I would argue that we should try to keep this as a straight refactor, it's already a pretty big diff.

src/tmlt/analytics/_query_expr.py

src/tmlt/analytics/_query_expr.py

tmager

LGTM aside from the existing comments.

* Removes OutputSchemaVisitor, add schema() as a QueryExpr method
* make linters happy
* first batch of comments
* review comments, mostly splitting validation
* actually perform validation. otherwise validation is not performed.
* ah it was just because get_bounds has no transformation! doing it this way makes a lot more sense
* check the schema for *both* transformations and measurements

---------

Co-authored-by: Damien Desfontaines <TedTed@users.noreply.github.com>

Damien Desfontaines added 2 commits October 13, 2025 17:03

Removes OutputSchemaVisitor, add schema() as a QueryExpr method

6c340d5

make linters happy

2aee15e

tmager reviewed Oct 13, 2025

src/tmlt/analytics/_query_expr.py (outdated, resolved)

Maegereg reviewed Oct 14, 2025

tmager reviewed Oct 14, 2025

src/tmlt/analytics/_query_expr.py (resolved)

src/tmlt/analytics/_query_expr.py (outdated, resolved)

src/tmlt/analytics/_query_expr.py (outdated, resolved)

src/tmlt/analytics/_query_expr.py (outdated, resolved)

tmager mentioned this pull request Oct 15, 2025

Catch additional invalid replacement values in ReplaceNullAndNan/ReplaceInfinity #81

Open

TedTed changed the title ~~Removes OutputSchemaVisitor, add schema() as a QueryExpr method~~ Remove OutputSchemaVisitor, add schema() as a QueryExpr method Oct 15, 2025

first batch of comments

2f21dc6

TedTed mentioned this pull request Oct 15, 2025

Print a warning if the user tries to drop/replace infinities but there are no floating-point columns #88

Open

Maegereg reviewed Oct 15, 2025

src/tmlt/analytics/_query_expr.py (outdated, resolved)

Damien Desfontaines added 3 commits October 15, 2025 19:32

review comments, mostly splitting validation

568fc6d

actually perform validation. otherwise validation is not performed.

e63c6d2

ah it was just because get_bounds has no transformation! doing it this way makes a lot more sense

e76ceec

TedTed mentioned this pull request Oct 15, 2025

Fix the documentation of public join regarding column ordering #89

Open

TedTed requested review from Maegereg and tmager October 15, 2025 17:54

Maegereg approved these changes Oct 15, 2025

src/tmlt/analytics/_query_expr.py (outdated, resolved)

src/tmlt/analytics/_query_expr.py (outdated, resolved)

src/tmlt/analytics/_query_expr.py (outdated, resolved)

tmager approved these changes Oct 16, 2025

check the schema for *both* transformations and measurements

5d3e4b4

TedTed added this pull request to the merge queue Oct 16, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2025

lint

3ea3afd

TedTed enabled auto-merge October 16, 2025 09:01

TedTed added this pull request to the merge queue Oct 16, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2025

TedTed added this pull request to the merge queue Oct 16, 2025

TedTed removed this pull request from the merge queue due to a manual request Oct 16, 2025

TedTed added this pull request to the merge queue Oct 16, 2025

Merged via the queue into main with commit b967845 Oct 16, 2025
3 checks passed

TedTed deleted the nomoreschemavisitor branch October 16, 2025 12:27

4 participants
