+
Skip to content

Conversation

MarkLindblad
Copy link
Contributor

No description provided.

@MarkLindblad MarkLindblad changed the title Improve bbox sorting by allowing center and tolerance to be customized Improve bbox sorting by accounting for asymmetric page margin May 2, 2025
@MarkLindblad MarkLindblad changed the title Improve bbox sorting by accounting for asymmetric page margin Improve bbox sorting by accounting for asymmetric page margins May 2, 2025
@MarkLindblad MarkLindblad changed the title Improve bbox sorting by accounting for asymmetric page margins Improve bbox sorting by accounting for asymmetric page margins on multi-column pages May 2, 2025
@MarkLindblad MarkLindblad marked this pull request as ready for review May 4, 2025 19:39
Copy link
Collaborator

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is heading in the right direction, but it still seems like the code wants to be organized into a lower-complexity state. Once that's attained, the code will be easier to understand, test, debug, maintain, and extend. It'll also be more efficient and likely smaller.

clear_cached_bboxes(elems)


def bbox_margin_sort_page(elements: list[Element]) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should only be one entrypoint to sort a list of elements and it should take a transformation matrix as an argument. Determining if the margin calculation was reasonable should be in the other file for proper separation of concerns. It should bail out if it runs into a malfunction, and return the safe default (likely the identity matrix).

Copy link
Contributor Author

@MarkLindblad MarkLindblad May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really want to force the consumers of bbox_sort.py who want margin based sorting to call find_transform_page manually like this?

bbox_sort_page(elements, find_transform_page(elements))

as opposed to this?

bbox_margin_sort_page(elements)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend for users who want bbox sorting that accounts for margins to import a bbox_margin_sort_page from margin.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with those additional bits of code. The thought is that the correct calculation of the transformation matrix could become complex and dependent on various options and document properties. It may or may not vary from page to page. This is all too much complexity for bbox_sort.py to sign up for. The cleaner approach is to just make the caller in charge. This is an example of "separation of concerns". There are also aspects of "separability" and "modularity" involved in the sense that we may end up with multiple alternative margin finders. We may also have other places in the code where finding margins is useful.

expected_final_coordinates: list[tuple[float, float, float, float]],
) -> None:
elements = [Element({"bbox": bbox}) for bbox in original_bboxes]
transform = find_matrix_page(elements)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this variable matrix or tmatrix so it won't get confused with a Sycamore transform? I have no problem with the word "transformed", but certain words have very specific meanings already, like "transformer" and "transform".

if bbox:
return (bbox[1], bbox[0])
return (0.0, 0.0)
cached_bbox_tag = "_matrixed_bbox"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _transformed_bbox is clearer. Also, the variable name probably wants to be ALL-CAPS or start with g_.

bbox_sort_page(elements, matrix)


def bbox_sort_page(elems: list[Element], matrix: Optional[np.ndarray] = None) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's an opportunity to go full object-oriented here. It would go like:

bbs = BBoxSorter(tmatrix)
bbs.sort_page(elems)

It somewhat favors keeping the same matrix for every page. It also facilitates future behavior settings on bbs. If settings change per-page, we can either allow overrides with optional arguments to sort_page() or we can just allow changing settings in the object, or just create an object per page.

bbox_sort_based_on_tags(elems)
for elem in elems:
elem.data.pop("_coltag", None) # clean up tags
clear_cached_bboxes(elems)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add elem.data.pop(cached_bbox_tag, None) inside the for-loop.

Comment on lines +63 to +64
left = bbox.x1
right = bbox.x2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may not need left and right anymore, as bbox.x1 is clearer than bbox[0] was.


def bbox_sort_document(doc: Document, update_element_indexs: bool = True) -> None:
doc.elements = bbox_sorted_elements(doc.elements, update_element_indexs)
def bbox_sort_document(doc: Document) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably BBoxSorter.sort_document()

doc.elements = bbox_sorted_elements(doc.elements)


def clear_cached_bboxes(elems: list[Element]) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unneeded.



def apply_matrix(bbox: BoundingBox, matrix: np.ndarray) -> BoundingBox:
x1, y1, x2, y2 = bbox.to_list()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to_list(), you can just reference bbox.x1 etc. when populating the matrix.

return None


def elem_left_top(elem: Element) -> tuple:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this now dead code? Any other leftovers?

from sycamore.data import Element


class Margins:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the BoundingBox class to represent this same data. The is_resonable() logic could go in a free function, or you could subclass BoundingBox. I realize BoundingBox is a little large and bloated, but it seems silly to have multiple ways to represent 2 horizontal and 2 vertical coordinates.

@MarkLindblad
Copy link
Contributor Author

Superseded by #1301

@MarkLindblad MarkLindblad deleted the mark/imp-sort branch May 27, 2025 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载