Add VLMTableStructureExtractor for table structure extraction. #1304
This calls an LLM to determine the cells of a table.
Pull Request Overview
This PR introduces the VLMTableStructureExtractor to enhance table structure extraction via an LLM, along with related utility functions and tests.
- Added a new _crop_bbox helper function and the VLMTableStructureExtractor class for processing table images.
- Extended unit and integration tests to validate new LLM-based table extraction, and updated deserialization methods in Gemini and Anthropic implementations.
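The deserialization change mentioned above appears to follow a common pickling pattern: `__reduce__` returns a module-level function plus its arguments, so an object that holds an unpicklable client can be rebuilt when it is loaded. The following is a minimal sketch of that pattern, not the code from this PR; `ExampleLLM`, `FakeClient`, and `llm_deserializer` are hypothetical names and do not correspond to the Gemini or Anthropic classes.

```python
import pickle


class FakeClient:
    """Hypothetical stand-in for an SDK client that refuses to be pickled."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def __reduce__(self):
        raise TypeError("FakeClient cannot be pickled")


def llm_deserializer(kwargs: dict) -> "ExampleLLM":
    """Module-level function, so pickle can find it by name when loading."""
    return ExampleLLM(**kwargs)


class ExampleLLM:
    """Hypothetical LLM wrapper; only its constructor arguments are serialized."""

    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.api_key = api_key
        self._client = FakeClient(api_key)  # recreated on load, never serialized

    def __reduce__(self):
        # Return (callable, args): pickle stores a reference to the callable
        # and calls it with args to rebuild the object.
        kwargs = {"model_name": self.model_name, "api_key": self.api_key}
        return llm_deserializer, (kwargs,)


restored = pickle.loads(pickle.dumps(ExampleLLM("example-model", "secret")))
print(type(restored).__name__, restored.model_name)
```

Keeping the deserializer at module level matters because pickle stores only a reference to it, which has to be importable on the worker that loads the object.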
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/sycamore/sycamore/transforms/table_structure/extract.py | Added `_crop_bbox` and `VLMTableStructureExtractor` to extract table structure using an LLM. |
| lib/sycamore/sycamore/tests/unit/llms/test_llms.py | Added a new test case for ensuring proper Gemini pickling. |
| lib/sycamore/sycamore/tests/integration/transforms/test_table_extraction.py | Added integration tests for table extraction using various LLMs. |
| lib/sycamore/sycamore/llms/gemini.py | Updated `__reduce__` to use a separate deserializer function. |
| lib/sycamore/sycamore/llms/anthropic.py | Updated `__reduce__` to use a separate deserializer function. |

"""Table structure extractor that uses a VLM model to extract the table structure."""

EXTRACT_TABLE_STRUCTURE_PROMPT = """You are given an image of a table from a document. Please convert this table into HTML. Be sure to include the table header and all rows. Use 'colspan' and 'rowspan' in the output to indicate merged cells. Return the HTML as a string. Do not include any other text in the response.
+"""
Copilot AI, May 16, 2025
There appears to be an extra '+' in the closing triple quotes for the prompt string in VLMTableStructureExtractor. Removing the extraneous '+' will prevent potential syntax errors.
Suggested change: replace `+"""` with `"""`.
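The prompt asks the model for HTML that uses `colspan` and `rowspan`. One way such a response could be turned into positioned cells is sketched below using only the standard library. This is an illustration under assumptions, not the parsing code from this PR (which is not shown in the conversation); `Cell`, `TableHTMLParser`, `parse_table`, and `EXAMPLE_HTML` are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple


@dataclass
class Cell:
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1
    is_header: bool = False


class TableHTMLParser(HTMLParser):
    """Collect <td>/<th> cells and resolve colspan/rowspan to grid positions."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Cell] = []
        self._occupied: Set[Tuple[int, int]] = set()
        self._row = -1
        self._current: Optional[Cell] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
        elif tag in ("td", "th") and self._row >= 0:
            attr = dict(attrs)
            rowspan = int(attr.get("rowspan") or 1)
            colspan = int(attr.get("colspan") or 1)
            # Skip grid slots already claimed by earlier cells, including
            # cells from previous rows that span downward.
            col = 0
            while (self._row, col) in self._occupied:
                col += 1
            for r in range(self._row, self._row + rowspan):
                for c in range(col, col + colspan):
                    self._occupied.add((r, c))
            self._current = Cell("", self._row, col, rowspan, colspan, tag == "th")

    def handle_data(self, data):
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current.text = self._current.text.strip()
            self.cells.append(self._current)
            self._current = None


def parse_table(html: str) -> List[Cell]:
    parser = TableHTMLParser()
    parser.feed(html)
    parser.close()
    return parser.cells


EXAMPLE_HTML = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ann</td><td>90</td><td>85</td></tr>
</table>
"""

for cell in parse_table(EXAMPLE_HTML):
    print(cell.row, cell.col, cell.rowspan, cell.colspan, repr(cell.text))
```

Note that cells recovered this way carry grid positions and text only; nothing in the HTML gives them pixel geometry.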
new_elem = extractor.extract(element=basic_table_element, doc_image=basic_table_image)
assert new_elem.table is not None

print(new_elem.table.to_html())
Copilot AI, May 16, 2025
[nitpick] Consider removing or replacing the print statement used for debugging in the test to maintain clean test outputs.
Suggested change: replace `print(new_elem.table.to_html())` with `logging.debug(new_elem.table.to_html())`.
1 q but lgtm
# Convert cell bounding boxes to be relative to the original image.
for cell in table.cells:
    if cell.bbox is None:
        continue
    cell.bbox.translate_self(crop_box[0], crop_box[1]).to_relative_self(width, height)
shouldn't all the cell bboxes be null?
Lol, yes. I guess I was just on autopilot. I'll go ahead and remove this. Part of me wanted to leave it in as a defense mechanism, but I can't think of a way an LLM could hallucinate bounding boxes in a form we would interpret.
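For context, the loop under discussion shifts each cell box by the crop's top-left corner and then normalizes it by the image size. A minimal sketch of that arithmetic follows, assuming cell boxes are given in pixels inside the crop. `Box` and `to_page_relative` are hypothetical stand-ins rather than Sycamore's classes, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Box:
    """Hypothetical bounding box; not Sycamore's BoundingBox."""

    x1: float
    y1: float
    x2: float
    y2: float

    def translate(self, dx: float, dy: float) -> "Box":
        return Box(self.x1 + dx, self.y1 + dy, self.x2 + dx, self.y2 + dy)

    def to_relative(self, width: float, height: float) -> "Box":
        return Box(self.x1 / width, self.y1 / height, self.x2 / width, self.y2 / height)


def to_page_relative(
    cell_box: Optional[Box], crop_box: Tuple[int, int, int, int], width: int, height: int
) -> Optional[Box]:
    """Map a pixel box inside the crop to coordinates relative to the full image."""
    if cell_box is None:
        # An HTML-only model response carries no geometry, so this is the
        # case the reviewer expects for every cell.
        return None
    return cell_box.translate(crop_box[0], crop_box[1]).to_relative(width, height)


# A cell at (10, 10, 30, 20) inside a crop whose top-left corner is (50, 50),
# on a 200 x 100 pixel image.
print(to_page_relative(Box(10, 10, 30, 20), (50, 50, 150, 100), 200, 100))
print(to_page_relative(None, (50, 50, 150, 100), 200, 100))
```

Since the model returns only HTML, every call would take the `None` branch, which is why the loop can be dropped.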