-
Notifications
You must be signed in to change notification settings - Fork 65
Improve bbox sorting by accounting for asymmetric page margins on multi-column pages #1273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f642d34
to
e9e9897
Compare
center
and tolerance
to be customizedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is heading in the right direction, but it still seems like the code wants to be organized into a lower-complexity state. Once that's attained, the code will be easier to understand, test, debug, maintain, and extend. It'll also be more efficient and likely smaller.
clear_cached_bboxes(elems) | ||
|
||
|
||
def bbox_margin_sort_page(elements: list[Element]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should only be one entrypoint to sort a list of elements and it should take a transformation matrix as an argument. Determining if the margin calculation was reasonable should be in the other file for proper separation of concerns. It should bail out if it runs into a malfunction, and return the safe default (likely the identity matrix).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really want to force the consumers of bbox_sort.py
who want margin based sorting to call find_transform_page
manually like this?
bbox_sort_page(elements, find_transform_page(elements))
as opposed to this?
bbox_margin_sort_page(elements)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend for users who want bbox sorting that accounts for margins to import a bbox_margin_sort_page
from margin.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no problem with those additional bits of code. The thought is that the correct calculation of the transformation matrix could become complex and dependent on various options and document properties. It may or may not vary from page to page. This is all too much complexity for bbox_sort.py
to sign up for. The cleaner approach is to just make the caller in charge. This is an example of "separation of concerns". There are also aspects of "separability" and "modularity" involved in the sense that we may end up with multiple alternative margin finders. We may also have other places in the code where finding margins is useful.
…with Sycamore transforms
expected_final_coordinates: list[tuple[float, float, float, float]], | ||
) -> None: | ||
elements = [Element({"bbox": bbox}) for bbox in original_bboxes] | ||
transform = find_matrix_page(elements) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this variable matrix
or tmatrix
so it won't get confused with a Sycamore transform? I have no problem with the word "transformed", but certain words have very specific meanings already, like "transformer" and "transform".
if bbox: | ||
return (bbox[1], bbox[0]) | ||
return (0.0, 0.0) | ||
cached_bbox_tag = "_matrixed_bbox" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think _transformed_bbox
is clearer. Also, the variable name probably wants to be ALL-CAPS or start with g_
.
bbox_sort_page(elements, matrix) | ||
|
||
|
||
def bbox_sort_page(elems: list[Element], matrix: Optional[np.ndarray] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there's an opportunity to go full object-oriented here. It would go like:
bbs = BBoxSorter(tmatrix)
bbs.sort_page(elems)
It somewhat favors keeping the same matrix for every page. It also facilitates future behavior settings on bbs
. If settings change per-page, we can either allow overrides with optional arguments to sort_page()
or we can just allow changing settings in the object, or just create an object per page.
bbox_sort_based_on_tags(elems) | ||
for elem in elems: | ||
elem.data.pop("_coltag", None) # clean up tags | ||
clear_cached_bboxes(elems) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add elem.data.pop(cached_bbox_tag, None)
inside the for-loop.
left = bbox.x1 | ||
right = bbox.x2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may not need left
and right
anymore, as bbox.x1
is clearer than bbox[0]
was.
|
||
def bbox_sort_document(doc: Document, update_element_indexs: bool = True) -> None: | ||
doc.elements = bbox_sorted_elements(doc.elements, update_element_indexs) | ||
def bbox_sort_document(doc: Document) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably BBoxSorter.sort_document()
doc.elements = bbox_sorted_elements(doc.elements) | ||
|
||
|
||
def clear_cached_bboxes(elems: list[Element]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unneeded.
|
||
|
||
def apply_matrix(bbox: BoundingBox, matrix: np.ndarray) -> BoundingBox: | ||
x1, y1, x2, y2 = bbox.to_list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to_list()
, you can just reference bbox.x1
etc. when populating the matrix.
return None | ||
|
||
|
||
def elem_left_top(elem: Element) -> tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this now dead code? Any other leftovers?
from sycamore.data import Element | ||
|
||
|
||
class Margins: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use the BoundingBox class to represent this same data. The is_resonable()
logic could go in a free function, or you could subclass BoundingBox. I realize BoundingBox is a little large and bloated, but it seems silly to have multiple ways to represent 2 horizontal and 2 vertical coordinates.
Superseded by #1301 |
No description provided.