-
Notifications
You must be signed in to change notification settings - Fork 65
Refactor Process Batch #1357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Process Batch #1357
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the batch processing methods by splitting the original process_batch functionality into dedicated sub functions to improve modularity and reusability.
- Removed the original process_batch implementation
- Updated process_batch_inference to support optional extractor_inputs and text_extractor parameters
- Added a new process_batch that orchestrates inference and extraction based on provided flags
extracted_pages = [] | ||
with LogTime("text_extraction"): | ||
for i, page_data in enumerate(extractor_inputs): | ||
if isinstance(page_data, dict): | ||
width, height = page_data.get("dimensions") | ||
page = text_extractor.parse_output(page_data.get("data"), width, height) | ||
else: | ||
page = text_extractor.extract_page(page_data) | ||
extracted_pages.append(page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this is yer pdfminer caller? If you factor it to a method then it gets easier for the perf people to run it in a process pool or something right
Unfortunately it’s not that simple. It should be, but there is a separate path for calling pdfminer in managed-server.
________________________________
From: Henry Lindeman ***@***.***>
Sent: Thursday, June 19, 2025 11:17:37 AM
To: aryn-ai/sycamore ***@***.***>
Cc: Benjamin Sowell ***@***.***>; Review requested ***@***.***>
Subject: Re: [aryn-ai/sycamore] Refactor Process Batch (PR #1357)
@HenryL27 commented on this pull request.
________________________________
In lib/sycamore/sycamore/transforms/detr_partitioner.py<#1357 (comment)>:
+ extracted_pages = []
+ with LogTime("text_extraction"):
+ for i, page_data in enumerate(extractor_inputs):
+ if isinstance(page_data, dict):
+ width, height = page_data.get("dimensions")
+ page = text_extractor.parse_output(page_data.get("data"), width, height)
+ else:
+ page = text_extractor.extract_page(page_data)
+ extracted_pages.append(page)
looks like this is yer pdfminer caller? If you factor it to a method then it gets easier for the perf people to run it in a process pool or something right
—
Reply to this email directly, view it on GitHub<#1357 (review)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAJT7R3723PNO4GTSYZXXWD3EL5EDAVCNFSM6AAAAAB7ROTPNCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDSNBTG43TKOJVG4>.
You are receiving this because your review was requested.Message ID: ***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. not sure why you need it for the table thing but I'm sure all will be revealed shortly and it seems like a good refactor
Refactor to ensure it uses sub functions when possible.