Refactor Process Batch #1357

karanataryn · 2025-06-17T22:38:03Z

Refactor process_batch so that it uses sub-functions where possible.

Copilot

Pull Request Overview

This PR refactors the batch processing methods by splitting the original process_batch functionality into dedicated sub functions to improve modularity and reusability.

Removed the original process_batch implementation
Updated process_batch_inference to support optional extractor_inputs and text_extractor parameters
Added a new process_batch that orchestrates inference and extraction based on provided flags
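
As a rough illustration of the orchestration pattern described in the overview, here is a minimal sketch. It is not the code from this PR: `run_inference`, `run_extraction`, the `use_inference` flag, and the `extract` callable are hypothetical names chosen for the example, and the real `process_batch` signature in `detr_partitioner.py` may differ.

```python
from typing import Any, Callable, Optional


def run_inference(batch: list[Any]) -> list[str]:
    """Hypothetical stand-in for the model inference sub-function."""
    return [f"layout:{item}" for item in batch]


def run_extraction(extractor_inputs: list[Any], extract: Callable[[Any], str]) -> list[str]:
    """Hypothetical stand-in for the text extraction sub-function."""
    return [extract(item) for item in extractor_inputs]


def process_batch(
    batch: list[Any],
    *,
    use_inference: bool = True,
    extractor_inputs: Optional[list[Any]] = None,
    extract: Optional[Callable[[Any], str]] = None,
) -> dict[str, list[str]]:
    """Orchestrate the sub-steps; each runs only when its flag or inputs are provided."""
    results: dict[str, list[str]] = {}
    if use_inference:
        results["inference"] = run_inference(batch)
    # Extraction is optional: it needs both the inputs and something to extract with.
    if extractor_inputs is not None and extract is not None:
        results["extraction"] = run_extraction(extractor_inputs, extract)
    return results
```

Keeping each step in its own function means a caller can run inference alone, extraction alone, or both, without the orchestrator needing to know how either step works.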

lib/sycamore/sycamore/transforms/detr_partitioner.py

HenryL27 · 2025-06-19T18:17:15Z

lib/sycamore/sycamore/transforms/detr_partitioner.py

+            extracted_pages = []
+            with LogTime("text_extraction"):
+                for i, page_data in enumerate(extractor_inputs):
+                    if isinstance(page_data, dict):
+                        width, height = page_data.get("dimensions")
+                        page = text_extractor.parse_output(page_data.get("data"), width, height)
+                    else:
+                        page = text_extractor.extract_page(page_data)
+                    extracted_pages.append(page)


Looks like this is yer pdfminer caller? If you factor it out into a method, then it gets easier for the perf people to run it in a process pool or something, right?

bsowell · 2025-06-19T18:21:48Z

Unfortunately it’s not that simple. It should be, but there is a separate path for calling pdfminer in managed-server.
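
For reference, the factoring suggested above might look roughly like the sketch below. This is an assumption-laden illustration, not code from the PR: `extract_pages` and `FakeExtractor` are hypothetical names, and only `parse_output` and `extract_page` come from the quoted diff. It also does not address the separate managed-server path mentioned in the reply.

```python
from typing import Any


def extract_pages(text_extractor: Any, extractor_inputs: list[Any]) -> list[Any]:
    """Hypothetical helper wrapping the extraction loop from the quoted diff."""
    extracted_pages = []
    for page_data in extractor_inputs:
        if isinstance(page_data, dict):
            # The diff uses .get(); indexing is used here so that a missing key
            # fails with a KeyError instead of an unpacking error on None.
            width, height = page_data["dimensions"]
            page = text_extractor.parse_output(page_data["data"], width, height)
        else:
            page = text_extractor.extract_page(page_data)
        extracted_pages.append(page)
    return extracted_pages


class FakeExtractor:
    """Stand-in extractor for the example; not a real sycamore class."""

    def parse_output(self, data: Any, width: float, height: float) -> dict:
        return {"source": "parsed", "data": data, "size": (width, height)}

    def extract_page(self, page_data: Any) -> dict:
        return {"source": "extracted", "data": page_data}


pages = extract_pages(
    FakeExtractor(),
    [{"dimensions": (612, 792), "data": "raw"}, "page-object"],
)
```

A module-level function like this can in principle be submitted to a process pool, provided the extractor and its inputs are picklable. Whether that holds for the real extractor is not something this thread establishes.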


HenryL27

lgtm. not sure why you need it for the table thing but I'm sure all will be revealed shortly and it seems like a good refactor

refactor

aad49be

karanataryn requested a review from Copilot June 17, 2025 22:38

Copilot AI reviewed Jun 17, 2025


fix order

15e44e5

karanataryn requested review from MarkLindblad and bsowell June 18, 2025 21:05

alexaryn reviewed Jun 19, 2025


HenryL27 reviewed Jun 19, 2025


HenryL27 approved these changes Jun 19, 2025


karanataryn added 2 commits June 19, 2025 13:38

add threshold *

2c9ee8a

fix lint

3424b4c

karanataryn merged commit 8c1a0ad into main Jun 19, 2025
11 of 15 checks passed

karanataryn deleted the ksampath/refactor-batch-process branch June 19, 2025 21:03
