Conversation

karanataryn
Contributor

Refactor to ensure it uses sub functions when possible.

@karanataryn karanataryn requested a review from Copilot June 17, 2025 22:38

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the batch processing methods by splitting the original process_batch functionality into dedicated sub functions to improve modularity and reusability.

  • Removed the original process_batch implementation
  • Updated process_batch_inference to support optional extractor_inputs and text_extractor parameters
  • Added a new process_batch that orchestrates inference and extraction based on provided flags
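
A minimal sketch of the orchestration pattern the overview describes, not the code from this PR. The `run_inference` flag, the `extract_page` callable, the dict return shape, and the `process_batch_extraction` helper are all hypothetical; the actual signatures in the PR may differ.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def process_batch_inference(batch: Sequence[Any], model: Callable[[Any], Any]) -> List[Any]:
    """Stand-in for the inference step: apply the model to every item."""
    return [model(item) for item in batch]


def process_batch_extraction(
    extractor_inputs: Sequence[Any], extract_page: Callable[[Any], Any]
) -> List[Any]:
    """Stand-in for the extraction step (hypothetical helper name)."""
    return [extract_page(page) for page in extractor_inputs]


def process_batch(
    batch: Sequence[Any],
    model: Callable[[Any], Any],
    *,
    run_inference: bool = True,  # hypothetical flag
    extractor_inputs: Optional[Sequence[Any]] = None,
    extract_page: Optional[Callable[[Any], Any]] = None,
) -> Dict[str, List[Any]]:
    """Orchestrate the two sub functions; each step runs only if requested."""
    results: Dict[str, List[Any]] = {}
    if run_inference:
        results["inference"] = process_batch_inference(batch, model)
    # Extraction is optional: it runs only when both inputs and an extractor are given.
    if extractor_inputs is not None and extract_page is not None:
        results["extraction"] = process_batch_extraction(extractor_inputs, extract_page)
    return results
```

Keeping each step in its own function means callers that need only inference, or only extraction, can call the sub function directly instead of going through the orchestrator.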

Comment on lines +497 to +505
extracted_pages = []
with LogTime("text_extraction"):
    for i, page_data in enumerate(extractor_inputs):
        if isinstance(page_data, dict):
            width, height = page_data.get("dimensions")
            page = text_extractor.parse_output(page_data.get("data"), width, height)
        else:
            page = text_extractor.extract_page(page_data)
        extracted_pages.append(page)
Collaborator

looks like this is yer pdfminer caller? If you factor it to a method then it gets easier for the perf people to run it in a process pool or something right
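
A rough sketch of what that factoring could look like, assuming the loop body from the snippet above moves into a module-level function. `extract_one_page`, `extract_pages`, and `EchoExtractor` are hypothetical names for illustration; only `parse_output` and `extract_page` come from the quoted code.

```python
from concurrent.futures import Executor
from functools import partial
from typing import Any, List, Optional, Sequence


class EchoExtractor:
    """Hypothetical stand-in for the real text extractor, for illustration only."""

    def parse_output(self, data: Any, width: int, height: int) -> tuple:
        return ("parsed", data, width, height)

    def extract_page(self, page_data: Any) -> tuple:
        return ("extracted", page_data)


def extract_one_page(text_extractor: Any, page_data: Any) -> Any:
    """Handle one page: the body of the loop in the quoted snippet."""
    if isinstance(page_data, dict):
        # Indexing (rather than .get) raises KeyError early if a key is missing.
        width, height = page_data["dimensions"]
        return text_extractor.parse_output(page_data["data"], width, height)
    return text_extractor.extract_page(page_data)


def extract_pages(
    text_extractor: Any,
    extractor_inputs: Sequence[Any],
    executor: Optional[Executor] = None,
) -> List[Any]:
    """Extract all pages, serially by default or through the given executor."""
    if executor is None:
        return [extract_one_page(text_extractor, p) for p in extractor_inputs]
    # Executor.map yields results in input order, so page order is preserved.
    worker = partial(extract_one_page, text_extractor)
    return list(executor.map(worker, extractor_inputs))
```

Passing a `concurrent.futures.ProcessPoolExecutor` as `executor` would run pages in worker processes, provided the extractor and the page data can be pickled; whether that holds for the real extractor is not something this thread confirms.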

@bsowell
Contributor

bsowell commented Jun 19, 2025 via email

Collaborator

@HenryL27 HenryL27 left a comment

lgtm. not sure why you need it for the table thing but I'm sure all will be revealed shortly and it seems like a good refactor

@karanataryn karanataryn merged commit 8c1a0ad into main Jun 19, 2025
11 of 15 checks passed
@karanataryn karanataryn deleted the ksampath/refactor-batch-process branch June 19, 2025 21:03

4 participants