-
Notifications
You must be signed in to change notification settings - Fork 65
Initial draft of X-Y Cut reading order for Sycamore (Hackathon) #1301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
return dotted_lookup(self, field) | ||
|
||
def __lt__(self, other) -> bool: | ||
sidx = self.element_index or 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use a default of -1; Otherwise element 0 and element None sort to the same place.
I don't see an easy way to do that in python so I guess def must_element_index() = element_index if not None else -1.
0 or -1 == -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking...
sidx = -1 if self.element_index is None else self.element_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also works. It felt kinda repetitive to me so I thought of making a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither is great.
else: | ||
promote_title(page) | ||
bbox_sort_page(page) | ||
if sort_mode and sort_mode == "xycut": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from sycamore.utils.element_sorting import ELEMENT_SORT
ELEMENT_SORT = {
None: bbox_sort_page, # default; alternately fix code to do sort_mode = "bbox" and remove the optional
"xycut": xycut_sorted_page,
"bbox": bbox_sort_page
}
...
ELEMENT_SORT[(sort_mode](page) # or add the return value to bbox_sort_page, but the functional version implies that it isn't an inline change
This is more general and keeps the code the same size so I don't have to ask about refactoring to keep the function size down.
Up to you if you want to move the element sorting functions into a single file or leave them in separate files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look. The semantics of sorted() versus sort() are tricky if we want to keep efficiency.
self.regex_pattern = regex_pattern | ||
self.llm_prompt = llm_prompt | ||
self.llm = llm | ||
self.sort_mode = sort_mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.element_sort = ELEMENT_SORT[sort_mode]
other_elements.extend(new_table_elements) | ||
document.elements = other_elements | ||
bbox_sort_document(document) | ||
if self.sort_mode and self.sort_mode == "xycut": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.element_sort(document)
node.append(elem) | ||
if elem == cut_after: | ||
node.advance() | ||
return node.finalize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain in slack to me how we can need to finalize; If the width >= 0 then gen_overlap found a cut point before and after since on the last element it will generate a width of -1, which would fall under the choice is None branch.
Once I understand, add a comment explaining to future readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for finalize is so the NodeLeaf will end up with 0 elements if nothing is ever appended.
lib/sycamore/sycamore/utils/xycut.py
Outdated
inner = NodeInner() | ||
for elist in node.elists: | ||
subnode = cleave_elems(elist) | ||
# print("....................") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if False:
debugstuff
or remove.
if subnode.cansplit(): | ||
inner.append(divide_node(subnode)) # recursive step | ||
else: | ||
inner.append(subnode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bbox sort the remaining elements if you can't split; or document why we leave the elements in the original order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean a simple sort by Y if we can't split, or calling the algorithm we're replacing. Aside from the top-level call, the inputs are already sorted for the previous cut. If we can't make a single cut, the current code leaves the order alone. Absent any cuts, is it safe to sort arbitrarily by Y? I could sort by Y as a pre-step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sorting should define the order, and xycut with bbox fallback is a complete ordering. Otherwise the order is potentially random (and we've had problems in the past with failing to preserve element ordering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can't split at this point, it's because there aren't enough elements. This is the exit from the recursion.
lib/sycamore/sycamore/utils/xycut.py
Outdated
if len(elems) < 2: | ||
return elems | ||
flat = NodeLeaf().extend(elems) | ||
# print("VVVVVVVVVVVVVVVVVVVV") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if False: ...
lib/sycamore/sycamore/utils/xycut.py
Outdated
return tree.to_elems() | ||
|
||
|
||
def xycut_sorted_elements(elements: ElemList, update_indices: bool = True) -> ElemList: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document why you might not want to update indices. If you don't update indices then round-tripping the document through opensearch will revert the order.
I could imagine a preserve_original_indices, but I have no use case for that so still wouldn't include it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm copying behavior somebody added to bbox_sort. I have no idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ask them and then document in both places. Code where no one knows why its there doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused. Nuking. Easy to revive.
bbox = get_bbox(elem) | ||
aa = bbox[axis] | ||
bb = bbox[axis + 2] | ||
if bb < aa: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming that the fix would happen in the code that calls the ML model. I don't expect other users to handle bbox out of order, so I don't see a good reason to handle it here (other than assert).
if subnode.cansplit(): | ||
inner.append(divide_node(subnode)) # recursive step | ||
else: | ||
inner.append(subnode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sorting should define the order, and xycut with bbox fallback is a complete ordering. Otherwise the order is potentially random (and we've had problems in the past with failing to preserve element ordering)
lib/sycamore/sycamore/utils/xycut.py
Outdated
return tree.to_elems() | ||
|
||
|
||
def xycut_sorted_elements(elements: ElemList, update_indices: bool = True) -> ElemList: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ask them and then document in both places. Code where no one knows why its there doesn't make sense.
This isn't hooked up to live code yet. It'll need an option to select it (or kill-switch).