Refactor logical plan serialization. #905

mdwelsh · 2024-10-10T23:36:37Z

This PR changes how we serialize and deserialize Luna LogicalPlans so everything uses the same set of Pydantic BaseModel subclasses.

Removed LogicalOperator and folded its fields into Node.
The LLM is instructed to generate serialized LogicalPlans directly, so there is no need for an additional JSON format.
The inputs field of Node replaces dependencies and is always a list of integers.
There is a hidden _input_nodes field in Node that resolves inputs to the actual Node instances, with input_nodes() returning them.
Use a Pydantic model validator on LogicalPlan to ensure that _input_nodes is populated on every node.
Override the LogicalPlan constructor to ensure that the correct duck-typed Node subclass is used, based on the node_type field.
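
For readers who have not opened the diff, the shape described above might look roughly like the sketch below. It is not the code in this PR: it assumes Pydantic v2, and the names node_id, query, result_node, nodes, Count, QueryDatabase and NODE_TYPES are hypothetical, chosen only for illustration. It also picks the Node subclass in a mode="before" validator instead of an overridden constructor, which is a substitution made for brevity.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field, PrivateAttr, model_validator


class Node(BaseModel):
    """One step of a plan; `inputs` holds the integer IDs of upstream nodes."""

    node_type: str
    node_id: int  # hypothetical field name
    inputs: List[int] = Field(default_factory=list)

    # Private attribute: never validated and never serialized.
    _input_nodes: List["Node"] = PrivateAttr(default_factory=list)

    def input_nodes(self) -> List["Node"]:
        return self._input_nodes


class QueryDatabase(Node):  # hypothetical subclass
    pass


class Count(Node):  # hypothetical subclass
    pass


# Hypothetical registry mapping node_type strings to Node subclasses.
NODE_TYPES = {cls.__name__: cls for cls in (QueryDatabase, Count)}


class LogicalPlan(BaseModel):
    query: str
    nodes: Dict[int, Node]
    result_node: int

    @model_validator(mode="before")
    @classmethod
    def pick_node_subclasses(cls, data: Any) -> Any:
        # Turn each raw node dict into the subclass named by node_type.
        if isinstance(data, dict) and isinstance(data.get("nodes"), dict):
            nodes = {}
            for key, raw in data["nodes"].items():
                if isinstance(raw, dict):
                    node_cls = NODE_TYPES.get(raw.get("node_type"), Node)
                    raw = node_cls(**raw)
                nodes[key] = raw
            data = {**data, "nodes": nodes}
        return data

    @model_validator(mode="after")
    def resolve_inputs(self) -> "LogicalPlan":
        # Populate _input_nodes on every node from its integer inputs.
        for node in self.nodes.values():
            missing = [i for i in node.inputs if i not in self.nodes]
            if missing:
                raise ValueError(f"node {node.node_id} has unknown inputs {missing}")
            node._input_nodes = [self.nodes[i] for i in node.inputs]
        return self


# A serialized plan of the kind an LLM could be asked to emit directly.
raw_plan = {
    "query": "How many incidents were there?",
    "result_node": 1,
    "nodes": {
        "0": {"node_type": "QueryDatabase", "node_id": 0},
        "1": {"node_type": "Count", "node_id": 1, "inputs": [0]},
    },
}
plan = LogicalPlan.model_validate(raw_plan)
upstream = plan.nodes[1].input_nodes()
```

Because the serialized form carries only integer IDs, a plan can be validated straight from JSON-style data, and the object references are rebuilt after validation instead of being stored in the payload.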

Sorry for the length of this PR, but I think the result is more understandable and flexible.

baitsguy

mypy is failing.

Can you also validate that the query-ui works and doesn't need updates?


* Working on this.
* Working on refactoring.
* Tests pass - is such a thing even possible?
* Fix tests.
* Fix mypy.
* Cleanup.
* Fix NTSB examples.
* A few tweaks to the query planner prompt, and a workaround in queryui/util.py.
* Fix mypy.

* added ability to read schema from file
* small typo Co-authored-by: Matt Welsh <matt@aryn.ai>
* fixed two function refs that were modified
* reformatted file with black
* fixed schema file format (was json), added more exception handling
* Fix anonymous reading in materialize and add rate limited logging. (#898)
* Fix anonymous reading in materialize and add rate limited logging.
* In materialize, try reading using the credentials, but if it doesn't work, fall back to reading anonymously if that seems to be working.
* Add rate limited logging to reading via materialize in local mode.
* Check for no root before checking if a source since that makes more sense.
* switch ntsb_loader_materialized.py over to read in local mode, it was working (with the anonymous fix), but was very slow hence the logging.
* Bump version to v0.1.23. (#903)
* fix asdict in the reader too. duh (#907) Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
* Add text representation for empty tables (#909)
* Refactor logical plan serialization. (#905)
* Working on this.
* Working on refactoring.
* Tests pass - is such a thing even possible?
* Fix tests.
* Fix mypy.
* Cleanup.
* Fix NTSB examples.
* A few tweaks to the query planner prompt, and a workaround in queryui/util.py.
* Fix mypy.
* seriously small performance improvement that matters when you're processing tens of thousands of tables (from training code) (#906) Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
* Handle opensearch reader doc reconstruction when no parent doc in results (#908)
* Fix bug in entity extraction. (#911)
* Notebooks like default-prep-script.ipynb would fail because the wrong way of generating the prompt would be used.
* Rename test to match with name of file being tested.
* Fix existing tests to verify parameters on all branches -- the reason the tests were passing was that it was taking the default branch in the test cases
* Update all of the tests to directly call run rather than route everything through ray.
* Enable copying of the hash context. (#910)
* Enable copying of the hash context.
* Address comments.
* Add option to extract line-based bounding boxes from pdfminer. (#874)
  We have been using pdfminer's layout detection to group text into boxes. This can cause issues, especially with table extraction, when the boxes don't line up with cells or what we detect with the DETR model.
  This change adds support for an object_type parameter to the PdfMinerExtractor that can be set to "boxes" (the current behavior), or "lines", which groups characters into lines, but does not group them further.
  To avoid an explosion of options, we introduce a "text_extractor_options" dict as a parameter, and refactor the TextExtractor class hierarchy a bit to support it.
* Support random sample in local mode. (#913)
  This transform isn't widely used, but still worth supporting in local mode to bring it to parity.
* Opensearch kwargs fix (#914)
* Fix kwargs in opensearch reader
* simplify test assertion
* lint
* pr comments
* fix typo (#917)
* Update using_jupyter.md (#902)
* Update using_jupyter.md Update link
* Fixed path
---------
Co-authored-by: dtecuci <168428824+dtecuci@users.noreply.github.com>
* Rebased. Added ability to read schema from file
* rebased. small typo Co-authored-by: Matt Welsh <matt@aryn.ai>
* rebased. reformatted file with black
* resolved conflicts
* changed schema file format to yaml
* removed unused import
* small typos fixed
* fixed spacing
---------
Signed-off-by: Henry Lindeman <hmlindeman@yahoo.com>
Co-authored-by: Matt Welsh <matt@aryn.ai>
Co-authored-by: Eric Anderson <eric@aryn.ai>
Co-authored-by: Ben Sowell <ben@aryn.ai>
Co-authored-by: Henry Lindeman <hmlindeman@yahoo.com>
Co-authored-by: Dhruv Kaliraman <112497058+dhruvkaliraman7@users.noreply.github.com>
Co-authored-by: Vinayak Thapliyal <vinayak@aryn.ai>
Co-authored-by: Alex Meyer <144723289+alexaryn@users.noreply.github.com>
Co-authored-by: Karan Sampath <176953591+karanataryn@users.noreply.github.com>
Co-authored-by: jonfritz <134336691+jonfritz@users.noreply.github.com>

mdwelsh added 6 commits October 9, 2024 16:20

Working on this.

ad713e4

Working on refactoring.

61f5d9a

Tests pass - is such a thing even possible?

d4173be

Fix tests.

c91ced2

Fix mypy.

9619ca8

Cleanup.

574126d

mdwelsh requested a review from baitsguy October 10, 2024 23:36

baitsguy approved these changes Oct 11, 2024


mdwelsh added 3 commits October 10, 2024 20:25

Fix NTSB examples.

89176a8

A few tweaks to the query planner prompt, and a workaround in queryui/util.py.

1f0525a

Fix mypy.

88eb02d

mdwelsh enabled auto-merge (squash) October 11, 2024 05:44

Fix lint.

40cfe3e

mdwelsh merged commit 8736557 into main Oct 11, 2024
10 of 11 checks passed

HenryL27 deleted the matt/logical-plan-serialization branch August 30, 2025 00:03
