Improve bbox sorting by accounting for asymmetric page margins on multi-column pages #1273

MarkLindblad · 2025-04-25T22:12:50Z

No description provided.


lib/sycamore/sycamore/tests/unit/utils/test_margin.py

lib/sycamore/sycamore/utils/bbox_sort.py

lib/sycamore/sycamore/utils/margin.py

lib/sycamore/sycamore/utils/bbox_sort.py

alexaryn

This is heading in the right direction, but it still seems like the code wants to be organized into a lower-complexity state. Once that's attained, the code will be easier to understand, test, debug, maintain, and extend. It'll also be more efficient and likely smaller.

lib/sycamore/sycamore/utils/bbox_sort.py

alexaryn · 2025-05-06T04:57:34Z

lib/sycamore/sycamore/utils/bbox_sort.py

+    clear_cached_bboxes(elems)
+
+
+def bbox_margin_sort_page(elements: list[Element]) -> None:


There should be only one entry point for sorting a list of elements, and it should take a transformation matrix as an argument. Deciding whether the margin calculation was reasonable belongs in the other file (margin.py), for proper separation of concerns. If the margin calculation runs into a problem, it should bail out and return a safe default (likely the identity matrix).

Do you really want to force the consumers of bbox_sort.py who want margin based sorting to call find_transform_page manually like this?

bbox_sort_page(elements, find_transform_page(elements))

as opposed to this?

bbox_margin_sort_page(elements)

Or did you intend for users who want bbox sorting that accounts for margins to import bbox_margin_sort_page from margin.py?

I have no problem with those additional bits of code. The concern is that correctly calculating the transformation matrix could become complex and dependent on various options and document properties, and it may or may not vary from page to page. That is too much complexity for bbox_sort.py to sign up for. The cleaner approach is to put the caller in charge. This is an example of "separation of concerns". "Separability" and "modularity" are also involved, in the sense that we may end up with multiple alternative margin finders, and other places in the code may find margin detection useful too.
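A minimal sketch of the split described above, under stated assumptions: `Box` is a hypothetical stand-in for an element's bounding box, coordinates are assumed to be normalized to the page (0 to 1), the matrix is reduced to a plain `(sx, sy, tx, ty)` tuple instead of an `np.ndarray`, and the two-column key and the `MIN_CONTENT_SPAN` threshold are invented for illustration. The margin finder owns the sanity check and the identity fallback; the sorter only consumes whatever matrix the caller passes.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed layout: x' = sx * x + tx, y' = sy * y + ty.
Matrix = tuple[float, float, float, float]
IDENTITY: Matrix = (1.0, 1.0, 0.0, 0.0)
MIN_CONTENT_SPAN = 0.3  # hypothetical threshold for "margins too large"


@dataclass
class Box:  # hypothetical stand-in, not sycamore's BoundingBox
    name: str
    x1: float
    y1: float
    x2: float
    y2: float


def find_matrix_page(boxes: list[Box]) -> Matrix:
    """Margin finder (margin.py side): map the content area onto the unit square."""
    if not boxes:
        return IDENTITY
    left = min(b.x1 for b in boxes)
    top = min(b.y1 for b in boxes)
    width = max(b.x2 for b in boxes) - left
    height = max(b.y2 for b in boxes) - top
    # Bail out to the safe default when the detected margins look unreasonable.
    if width < MIN_CONTENT_SPAN or height < MIN_CONTENT_SPAN:
        return IDENTITY
    return (1.0 / width, 1.0 / height, -left / width, -top / height)


def bbox_sort_page(boxes: list[Box], matrix: Optional[Matrix] = None) -> None:
    """Single sorting entry point (bbox_sort.py side): knows nothing about margins."""
    sx, sy, tx, ty = IDENTITY if matrix is None else matrix

    def key(b: Box) -> tuple[int, float]:
        center_x = sx * (b.x1 + b.x2) / 2 + tx
        column = 0 if center_x < 0.5 else 1  # simplistic two-column split
        return (column, sy * b.y1 + ty)

    boxes.sort(key=key)
```

With this shape, a caller that wants margin-aware ordering writes `bbox_sort_page(boxes, find_matrix_page(boxes))`, and swapping in a different margin finder does not touch the sorter.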

lib/sycamore/sycamore/utils/margin.py


alexaryn · 2025-05-08T18:15:26Z

lib/sycamore/sycamore/tests/unit/utils/test_margin.py

+    expected_final_coordinates: list[tuple[float, float, float, float]],
+) -> None:
+    elements = [Element({"bbox": bbox}) for bbox in original_bboxes]
+    transform = find_matrix_page(elements)


Should we call this variable matrix or tmatrix so it won't get confused with a Sycamore transform? I have no problem with the word "transformed", but certain words have very specific meanings already, like "transformer" and "transform".

alexaryn · 2025-05-08T18:16:53Z

lib/sycamore/sycamore/utils/bbox_sort.py

-    if bbox:
-        return (bbox[1], bbox[0])
-    return (0.0, 0.0)
+cached_bbox_tag = "_matrixed_bbox"


I think _transformed_bbox is clearer. Also, the variable name probably wants to be ALL-CAPS or start with g_.

alexaryn · 2025-05-08T18:27:43Z

lib/sycamore/sycamore/utils/bbox_sort.py

+    bbox_sort_page(elements, matrix)
+
+
+def bbox_sort_page(elems: list[Element], matrix: Optional[np.ndarray] = None) -> None:


I feel like there's an opportunity to go full object-oriented here. It would go like:

bbs = BBoxSorter(tmatrix)
bbs.sort_page(elems)

It somewhat favors keeping the same matrix for every page. It also facilitates future behavior settings on bbs. If settings change per-page, we can either allow overrides with optional arguments to sort_page() or we can just allow changing settings in the object, or just create an object per page.
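One way the object-oriented shape could look, as a rough sketch only. `BBoxSorter`, `sort_page`, and `sort_document` are the names proposed in this review; `Elem`, the `(sx, sy, tx, ty)` tuple standing in for the matrix, and the column rule are hypothetical simplifications. The object holds a default matrix, and `sort_page()` accepts an optional per-page override.

```python
from dataclasses import dataclass
from typing import Optional

Matrix = tuple[float, float, float, float]  # assumed (sx, sy, tx, ty) layout
IDENTITY: Matrix = (1.0, 1.0, 0.0, 0.0)


@dataclass
class Elem:  # hypothetical stand-in for sycamore's Element
    name: str
    page: int
    x1: float
    y1: float


class BBoxSorter:
    def __init__(self, tmatrix: Matrix = IDENTITY) -> None:
        self.tmatrix = tmatrix  # default used for every page

    @staticmethod
    def _key(elem: Elem, m: Matrix) -> tuple[int, float]:
        sx, sy, tx, ty = m
        column = 0 if sx * elem.x1 + tx < 0.5 else 1  # simplistic column rule
        return (column, sy * elem.y1 + ty)

    def sort_page(self, elems: list[Elem], tmatrix: Optional[Matrix] = None) -> None:
        """Sort one page in place; tmatrix overrides the object's default."""
        m = self.tmatrix if tmatrix is None else tmatrix
        elems.sort(key=lambda e: self._key(e, m))

    def sort_document(self, elems: list[Elem]) -> list[Elem]:
        """Group by page number, sort each page, and return the concatenation."""
        pages: dict[int, list[Elem]] = {}
        for e in elems:
            pages.setdefault(e.page, []).append(e)
        ordered: list[Elem] = []
        for page_no in sorted(pages):
            self.sort_page(pages[page_no])
            ordered.extend(pages[page_no])
        return ordered
```

Future behavior settings would become additional constructor arguments or attributes, without changing the call sites.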

alexaryn · 2025-05-08T18:30:15Z

lib/sycamore/sycamore/utils/bbox_sort.py

+    bbox_sort_based_on_tags(elems)
+    for elem in elems:
+        elem.data.pop("_coltag", None)  # clean up tags
+    clear_cached_bboxes(elems)


Just add elem.data.pop(cached_bbox_tag, None) inside the for-loop.
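For illustration, the merged cleanup might look like the sketch below. `Elem` is a hypothetical stand-in that only models the `.data` dict, the helper name `clear_sort_tags` is invented, and the constant is written in upper case purely as an assumption about the eventual naming; the tag strings are the ones that appear in the diff.

```python
CACHED_BBOX_TAG = "_matrixed_bbox"  # tag string taken from the diff
COL_TAG = "_coltag"


class Elem:  # hypothetical stand-in exposing only .data
    def __init__(self, data: dict) -> None:
        self.data = data


def clear_sort_tags(elems: list[Elem]) -> None:
    """Remove both temporary sort tags in a single pass over the elements."""
    for elem in elems:
        elem.data.pop(COL_TAG, None)  # default None: no error if the tag is absent
        elem.data.pop(CACHED_BBOX_TAG, None)
```

A single loop avoids a second traversal and keeps all temporary-tag cleanup in one place.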

alexaryn · 2025-05-08T18:35:45Z

lib/sycamore/sycamore/utils/bbox_sort.py

+            left = bbox.x1
+            right = bbox.x2


You may not need left and right anymore, as bbox.x1 is clearer than bbox[0] was.

alexaryn · 2025-05-08T19:00:27Z

lib/sycamore/sycamore/utils/bbox_sort.py


-def bbox_sort_document(doc: Document, update_element_indexs: bool = True) -> None:
-    doc.elements = bbox_sorted_elements(doc.elements, update_element_indexs)
+def bbox_sort_document(doc: Document) -> None:


Probably BBoxSorter.sort_document()

alexaryn · 2025-05-08T19:00:45Z

lib/sycamore/sycamore/utils/bbox_sort.py

+    doc.elements = bbox_sorted_elements(doc.elements)
+
+
+def clear_cached_bboxes(elems: list[Element]) -> None:


alexaryn · 2025-05-08T19:02:26Z

lib/sycamore/sycamore/utils/bbox_sort.py

+
+
+def apply_matrix(bbox: BoundingBox, matrix: np.ndarray) -> BoundingBox:
+    x1, y1, x2, y2 = bbox.to_list()


You don't need to_list(), you can just reference bbox.x1 etc. when populating the matrix.
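A dependency-free sketch of reading the coordinates straight from the attributes. It assumes a 3x3 homogeneous matrix stored as nested lists (in place of the `np.ndarray` in the diff) that only scales and translates, so the two transformed corners still describe an axis-aligned box. `Box` is a hypothetical stand-in for `BoundingBox`.

```python
from dataclasses import dataclass


@dataclass
class Box:  # hypothetical stand-in for sycamore's BoundingBox
    x1: float
    y1: float
    x2: float
    y2: float


Matrix3 = list[list[float]]  # row-major 3x3 homogeneous matrix


def apply_matrix(bbox: Box, matrix: Matrix3) -> Box:
    """Transform the top-left and bottom-right corners of bbox."""
    (a, b, c), (d, e, f) = matrix[0], matrix[1]

    def point(x: float, y: float) -> tuple[float, float]:
        # Homogeneous multiply of (x, y, 1) by the first two matrix rows.
        return a * x + b * y + c, d * x + e * y + f

    nx1, ny1 = point(bbox.x1, bbox.y1)  # attributes used directly, no to_list()
    nx2, ny2 = point(bbox.x2, bbox.y2)
    return Box(nx1, ny1, nx2, ny2)
```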

alexaryn · 2025-05-08T19:04:01Z

lib/sycamore/sycamore/utils/bbox_sort.py

+            return None


 def elem_left_top(elem: Element) -> tuple:


Is this now dead code? Any other leftovers?

alexaryn · 2025-05-08T19:05:39Z

lib/sycamore/sycamore/utils/margin.py

+from sycamore.data import Element
+
+
+class Margins:


You could use the BoundingBox class to represent this same data. The is_resonable() logic could go in a free function, or you could subclass BoundingBox. I realize BoundingBox is a little large and bloated, but it seems silly to have multiple ways to represent 2 horizontal and 2 vertical coordinates.
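As a sketch of the free-function variant: the content area is held in a single box type and the reasonableness check lives outside the class. `Box` is a hypothetical stand-in for `BoundingBox`, coordinates are assumed to be normalized to the page (0 to 1), and both the function name `margins_are_reasonable` and the `MAX_MARGIN` threshold are invented for illustration.

```python
from dataclasses import dataclass

MAX_MARGIN = 0.25  # hypothetical upper bound for any single margin


@dataclass
class Box:  # hypothetical stand-in for sycamore's BoundingBox
    x1: float
    y1: float
    x2: float
    y2: float


def margins_are_reasonable(content: Box, max_margin: float = MAX_MARGIN) -> bool:
    """Check the margins implied by a content-area box on a unit page."""
    if content.x1 >= content.x2 or content.y1 >= content.y2:
        return False  # degenerate or inverted content area
    # Left, top, right, bottom margins measured from the page edges.
    margins = (content.x1, content.y1, 1.0 - content.x2, 1.0 - content.y2)
    return all(0.0 <= m <= max_margin for m in margins)
```

Keeping the check as a free function means no second class is needed to carry two horizontal and two vertical coordinates.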

MarkLindblad · 2025-05-23T17:02:00Z

Superseded by #1301

MarkLindblad mentioned this pull request Apr 25, 2025

Detect article-like pages entirely in two columns when generating markdown #1269

Closed

Improve bbox sorting by allowing center and tolerance to be custo…

e9e9897

…mized

MarkLindblad force-pushed the mark/imp-sort branch from f642d34 to e9e9897 Compare May 1, 2025 21:55

Change approach to use a transform

a00ffc3

MarkLindblad changed the title ~~Improve bbox sorting by allowing center and tolerance to be customized~~ Improve bbox sorting by accounting for asymmetric page margin May 2, 2025

MarkLindblad changed the title ~~Improve bbox sorting by accounting for asymmetric page margin~~ Improve bbox sorting by accounting for asymmetric page margins May 2, 2025

MarkLindblad changed the title ~~Improve bbox sorting by accounting for asymmetric page margins~~ Improve bbox sorting by accounting for asymmetric page margins on multi-column pages May 2, 2025

MarkLindblad added 7 commits May 4, 2025 11:04

Fix transformation

dc3bb2a

Adjust constant

d6fc74e

Add unit test for margin adjustment

148a3d1

Fix black linting

2bcc1a8

Switch to simpler col_tag again

3d5fef6

Fix mypy linting

ad1b786

Fix order of operations

5683657

MarkLindblad marked this pull request as ready for review May 4, 2025 19:39

alexaryn reviewed May 5, 2025


MarkLindblad added 8 commits May 5, 2025 14:20

Adjust top and bottom margins as well

28606e0

Simplify test

52cd476

On large detected margins, revert to identity transformation

b77545e

Stop changing actual .bbox field

0b55848

Vary max_width with detected margin reasonableness

0a45a4a

Fix black linting

498bc7b

Fix ruff linting

a10bd4f

Fix mypy linting

bcb1d99

alexaryn reviewed May 6, 2025


MarkLindblad added 5 commits May 7, 2025 14:19

Rename find_margin_page to find_margin_of_pages

49c90dc

Move some functions from utils/margin.py to utils/bbox_sort.py

9ad5cee

Refactor bbox_sort.py, margin.py

f9d171b

Use .x1 and .x2 instead of bbox[0] and bbox[2]

f1af3dd

Add comment around batched transform

920945c

MarkLindblad added 9 commits May 7, 2025 17:15

Assume apply_transform will always get a transform

3b21a01

Rename cache to result for clarity

ba8d892

Rename get_bbox_prefer_cached to get_transformed_bbox

ac27d99

Use word matrix instead of transform to avoid name/concept collision …

f346e3b

…with Sycamore transforms

Fix type signature of find_margin_of_pages

fefaf96

Remove word cache where possible

22f6dc7

Fix test_margin.py

9264a69

Fix test_margin.py again

54cd410

Fix test_bbox_sort.py

824236e

alexaryn reviewed May 8, 2025


MarkLindblad closed this May 23, 2025

MarkLindblad deleted the mark/imp-sort branch May 27, 2025 20:19

