Sycamore: deal with rotated tables. #1336
Conversation
Can we have a few tests? It should be feasible to create a PDF with an otherwise straightforward table rotated a few different ways, then check that the extraction does what we expect.
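To make the suggestion concrete, here is a rough sketch of the shape such a test could take. It is not code from this PR: `rotate_grid_ccw`, `extract_upright`, and `check_rotated_extraction` are hypothetical names, and `extract_upright` is only a stand-in for the real extractor. A real test would render the table into a PDF or image and call the actual table extraction code instead.

```python
from typing import List

Grid = List[List[str]]


def rotate_grid_ccw(grid: Grid, quarter_turns: int) -> Grid:
    """Rotate a row-major grid of cell texts counterclockwise in 90-degree steps."""
    for _ in range(quarter_turns % 4):
        # Counterclockwise quarter turn: transpose, then reverse the row order.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def extract_upright(rotated: Grid, quarter_turns: int) -> Grid:
    """Hypothetical stand-in for the extractor under test: undo the rotation."""
    return rotate_grid_ccw(rotated, -quarter_turns)


def check_rotated_extraction() -> None:
    """Same table, four orientations; extraction should always match the original."""
    table = [["Name", "Qty"], ["Bolt", "4"], ["Nut", "7"]]
    for turns in range(4):
        rotated = rotate_grid_ccw(table, turns)
        assert extract_upright(rotated, turns) == table


check_rotated_extraction()
```

The useful part is the loop over orientations: one known table, several rotations, and a single expected result for all of them.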
Pull Request Overview
This PR introduces utilities and modifications to better handle rotated tables in image documents. Key changes include:
- New rotation helper functions in rotation.py to support image and coordinate rotations.
- Enhancements in table structure extraction and text extraction, including handling of font size and additional vector data.
- Improved TableElement handling with a new shallow copy method and updated rotation logic in table extraction.
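As an illustration of the coordinate rotation described above, the following is a minimal sketch of rotating points and bounding boxes by multiples of 90 degrees. It is not the contents of rotation.py: the function names, the `(x1, y1, x2, y2)` box layout, the top-left origin with y pointing down, and the counterclockwise convention are all assumptions made for this example.

```python
from typing import Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def rotate_point(pt: Point, width: float, height: float, degrees: int) -> Point:
    """Map a point in a width x height image to its position after the image
    is rotated counterclockwise by a multiple of 90 degrees (canvas expanded)."""
    x, y = pt
    d = degrees % 360
    if d == 0:
        return (x, y)
    if d == 90:
        return (y, width - x)
    if d == 180:
        return (width - x, height - y)
    if d == 270:
        return (height - y, x)
    raise ValueError("only multiples of 90 degrees are handled in this sketch")


def rotate_bbox(bbox: BBox, width: float, height: float, degrees: int) -> BBox:
    """Rotate two opposite corners, then re-normalize so x1 <= x2 and y1 <= y2."""
    x1, y1, x2, y2 = bbox
    ax, ay = rotate_point((x1, y1), width, height, degrees)
    bx, by = rotate_point((x2, y2), width, height, degrees)
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))


# A box in a 100 x 50 image, rotated 90 degrees counterclockwise.
rotated = rotate_bbox((10, 5, 30, 20), 100, 50, 90)
# The rotated image is 50 x 100, so the inverse is a 270-degree turn in that frame.
restored = rotate_bbox(rotated, 50, 100, 270)
```

Under these assumptions, rotating the box back in the rotated image's frame restores the original coordinates. That round trip is what lets cells detected on a de-rotated table image be mapped back onto the original page.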
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
lib/sycamore/sycamore/utils/rotation.py | Introduces rotation utility functions for images, coordinates, and bounding boxes. |
lib/sycamore/sycamore/transforms/text_extraction/text_extractor.py | Updates text extraction to incorporate font size and vector properties. |
lib/sycamore/sycamore/transforms/table_structure/extract.py | Adds rotated_table and modifications to apply table rotation adjustments during extraction. |
lib/sycamore/sycamore/transforms/detr_partitioner.py | Modifies token processing to include vector data and iterates page elements by index for in-place updates. |
lib/sycamore/sycamore/data/element.py | Introduces a new shallow copy method for TableElement. |
No description provided.