+
Skip to content

Conversation

alexaryn
Copy link
Collaborator

This isn't hooked up to live code yet. It'll need an option to select it (or kill-switch).

@alexaryn alexaryn marked this pull request as ready for review May 16, 2025 08:10
return dotted_lookup(self, field)

def __lt__(self, other) -> bool:
sidx = self.element_index or 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use a default of -1; Otherwise element 0 and element None sort to the same place.
I don't see an easy way to do that in python so I guess def must_element_index() = element_index if not None else -1.
0 or -1 == -1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking...

sidx = -1 if self.element_index is None else self.element_index

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also works. It felt kinda repetitive to me so I thought of making a function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither is great.

else:
promote_title(page)
bbox_sort_page(page)
if sort_mode and sort_mode == "xycut":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from sycamore.utils.element_sorting import ELEMENT_SORT

ELEMENT_SORT = {
   None: bbox_sort_page, # default; alternately fix code to do sort_mode = "bbox" and remove the optional
   "xycut": xycut_sorted_page,
   "bbox": bbox_sort_page
}
...
ELEMENT_SORT[(sort_mode](page) # or add the return value to bbox_sort_page, but the functional version implies that it isn't an inline change

This is more general and keeps the code the same size so I don't have to ask about refactoring to keep the function size down.

Up to you if you want to move the element sorting functions into a single file or leave them in separate files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look. The semantics of sorted() versus sort() are tricky if we want to keep efficiency.

self.regex_pattern = regex_pattern
self.llm_prompt = llm_prompt
self.llm = llm
self.sort_mode = sort_mode
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.element_sort = ELEMENT_SORT[sort_mode]

other_elements.extend(new_table_elements)
document.elements = other_elements
bbox_sort_document(document)
if self.sort_mode and self.sort_mode == "xycut":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.element_sort(document)

node.append(elem)
if elem == cut_after:
node.advance()
return node.finalize()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain in slack to me how we can need to finalize; If the width >= 0 then gen_overlap found a cut point before and after since on the last element it will generate a width of -1, which would fall under the choice is None branch.

Once I understand, add a comment explaining to future readers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for finalize is so the NodeLeaf will end up with 0 elements if nothing is ever appended.

inner = NodeInner()
for elist in node.elists:
subnode = cleave_elems(elist)
# print("....................")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if False:
   debugstuff

or remove.

if subnode.cansplit():
inner.append(divide_node(subnode)) # recursive step
else:
inner.append(subnode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bbox sort the remaining elements if you can't split; or document why we leave the elements in the original order.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a simple sort by Y if we can't split, or calling the algorithm we're replacing. Aside from the top-level call, the inputs are already sorted for the previous cut. If we can't make a single cut, the current code leaves the order alone. Absent any cuts, is it safe to sort arbitrarily by Y? I could sort by Y as a pre-step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sorting should define the order, and xycut with bbox fallback is a complete ordering. Otherwise the order is potentially random (and we've had problems in the past with failing to preserve element ordering)

Copy link
Collaborator Author

@alexaryn alexaryn May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't split at this point, it's because there aren't enough elements. This is the exit from the recursion.

if len(elems) < 2:
return elems
flat = NodeLeaf().extend(elems)
# print("VVVVVVVVVVVVVVVVVVVV")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if False: ...

return tree.to_elems()


def xycut_sorted_elements(elements: ElemList, update_indices: bool = True) -> ElemList:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document why you might not want to update indices. If you don't update indices then round-tripping the document through opensearch will revert the order.
I could imagine a preserve_original_indices, but I have no use case for that so still wouldn't include it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm copying behavior somebody added to bbox_sort. I have no idea.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask them and then document in both places. Code where no one knows why its there doesn't make sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused. Nuking. Easy to revive.

bbox = get_bbox(elem)
aa = bbox[axis]
bb = bbox[axis + 2]
if bb < aa:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was assuming that the fix would happen in the code that calls the ML model. I don't expect other users to handle bbox out of order, so I don't see a good reason to handle it here (other than assert).

if subnode.cansplit():
inner.append(divide_node(subnode)) # recursive step
else:
inner.append(subnode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sorting should define the order, and xycut with bbox fallback is a complete ordering. Otherwise the order is potentially random (and we've had problems in the past with failing to preserve element ordering)

return tree.to_elems()


def xycut_sorted_elements(elements: ElemList, update_indices: bool = True) -> ElemList:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask them and then document in both places. Code where no one knows why its there doesn't make sense.

@alexaryn alexaryn merged commit 86749b2 into main May 21, 2025
11 of 15 checks passed
@alexaryn alexaryn deleted the alex_syc_xycut branch May 21, 2025 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载