这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@TedTed
Copy link
Contributor

@TedTed TedTed commented Oct 13, 2025

This is one step towards getting rid of the visitor pattern in our code.

Halfway through this change, I suddenly realized that it would have been better to do this differently, and instead add schema as a field of QueryExpr, then add a rewriting pass that fills this value. But I still think this change is doing more good than harm, so here is a PR.

Copy link
Contributor

@tmager tmager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to read through the code change in more detail, but a couple of initial thoughts:

  • "add schema as a field of QueryExpr, then add a rewriting pass that fills this value": Could you expand on this a bit, and why you think it would be a better approach? My first thought is that having an additional field that may or may not be set at a given point in time will make things less clear.
  • Not something for this MR, but it seems like _query_expr.py is getting very long, and we haven't incorporated that much of the compilation into it yet. Should we consider breaking it up into a multi-file submodule?

@TedTed
Copy link
Contributor Author

TedTed commented Oct 13, 2025

  • "add schema as a field of QueryExpr, then add a rewriting pass that fills this value": Could you expand on this a bit, and why you think it would be a better approach? My first thought is that having an additional field that may or may not be set at a given point in time will make things less clear.

Yes, I initially proposed to have a separate intermediary object but it was super hard to prototype this and start making progress in that direction. After discussing this with @Maegereg, I changed the proposal to instead use a series of rewrite rules, in a similar way than how KeySets now work. This does lead to a situation where some fields start being not set and then get set in rewrite rules, but I think we will still end up with much cleaner logic — with a clear separation between successive rewrite rules, and each one being simpler and self-contained. In that world, it would sense to me to also have schema determination be one of the rewrite rules (but I'm not 100% sure about this, in particular I'm not super clear on when this would need to happen in the sequence of rewrite rules). I would love your feedback on this!

  • Not something for this MR, but it seems like _query_expr.py is getting very long, and we haven't incorporated that much of the compilation into it yet. Should we consider breaking it up into a multi-file submodule?

Yes, I think this PR basically gets this file into "should be broken up if we add more logic" territory; I may do this as a follow-up issue if I need to add more stuff in there.

@tmager
Copy link
Contributor

tmager commented Oct 13, 2025

Ah, I see where you're coming from. For some properties I could see that, but for schemas specifically: are there situations where the rewrite of a particular expression changes the resulting schema? It seems to me that shouldn't happen for the expression end-to-end (though some of the intermediate pieces may have their schemas changed/rearranged in the rewrite). That might be why it's unclear when the rewrite rule that fills in the schema should run -- there's no obvious good time, because it should work equivalently at any point. On a related note, what would a rewrite rule that fills in the schemas look like? We presumably don't want to embed all the schema logic in that rewrite rule (that just gets us back to something like the schema visitor), which means calling some QueryExpr method... which is exactly what this MR is doing already, but now there's this extra step and the property isn't always defined.

(This is slightly complicated by the fact that rewrite rules need to be able to do things like "pull out this entire expression and turn it into a separate query to get groups from it", but I think the above idea about the rewrites not changing the schema still holds: we're replacing a general expression that does a groupby where the groups have a particular schema with a concrete groupby referencing a new expression with that same output schema, so the overall schema of the original expression doesn't change.)

@TedTed
Copy link
Contributor Author

TedTed commented Oct 14, 2025

Hmm, you're right, so maybe the way of computing the schema in this PR is the right approach.

Copy link
Contributor

@Maegereg Maegereg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some suggestions for improvements, but I'm open to the idea that this should be a clean refactor that doesn't change functionality (and thus improvements should be pushed to follow-on work).

@Maegereg
Copy link
Contributor

Maegereg commented Oct 14, 2025

Also, I think I agree with Tom - doing schemas as a rewrite seems worse than this approach.

Copy link
Contributor

@tmager tmager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a few possible bugs, but I think they're all in the original code -- I would argue that we should try to keep this as a straight refactor, it's already a pretty big diff.

@TedTed TedTed changed the title Removes OutputSchemaVisitor, add schema() as a QueryExpr method Remove OutputSchemaVisitor, add schema() as a QueryExpr method Oct 15, 2025
Copy link
Contributor

@tmager tmager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from the existing comments.

@TedTed TedTed added this pull request to the merge queue Oct 16, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 16, 2025
* Removes OutputSchemaVisitor, add schema() as a QueryExpr method

* make linters happy

* first batch of comments

* review comments, mostly splitting validation

* actually perform validation. otherwise validation is not performed.

* ah it was just because get_bounds has no transformation! doing it this way makes a lot more sense

* check the schema for *both* transformations and measurements

---------

Co-authored-by: Damien Desfontaines <TedTed@users.noreply.github.com*>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2025
@TedTed TedTed enabled auto-merge October 16, 2025 09:01
@TedTed TedTed added this pull request to the merge queue Oct 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 16, 2025
@TedTed TedTed added this pull request to the merge queue Oct 16, 2025
@TedTed TedTed removed this pull request from the merge queue due to a manual request Oct 16, 2025
@TedTed TedTed added this pull request to the merge queue Oct 16, 2025
Merged via the queue into main with commit b967845 Oct 16, 2025
3 checks passed
@TedTed TedTed deleted the nomoreschemavisitor branch October 16, 2025 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants