-
Notifications
You must be signed in to change notification settings - Fork 65
heuristics to get title from section headers #1033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
output_format: Optional[str] = None, | ||
text_extraction_options: dict[str, Any] = {}, | ||
source: str = "", | ||
promote_title: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check this from the text_extraction_options
? You'll also need to add documentation in the top level partition call to say that promote_title
is an accepted argument in text_extraction_options
and explain its functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding a 'output_label_options' option for driving all output label related functionalities.
output_format: Optional[str] = None, | ||
text_extraction_options: dict[str, Any] = {}, | ||
source: str = "", | ||
promote_title: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to text_extraction_options
output_format=self._output_format, | ||
text_extraction_options=self._text_extraction_options, | ||
source=self._source, | ||
promote_title=self._promote_title, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
def get_text(self, image: Image.Image) -> str: | ||
return self.tesseract.get_text(image) | ||
def get_text(self, image: Image.Image) -> tuple[str, float]: | ||
return self.tesseract.get_text(image), 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.tesseract.get_text(image)
returns a tuple.
def get_text(self, image: Image.Image) -> tuple[str, float]: | ||
val = self.pytesseract.image_to_string(image) | ||
return val | ||
return val, 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment that is not implemented? If it's not possible to calculate it from the output of tesseract, I would rather we return None
to indicate this is not a valid font size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return pages | ||
|
||
@staticmethod | ||
def _parse_obj(objs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this to get_font_size
or something like it? parse_obj
seems inaccurate for the functionality here.
display(HTML(e.text_representation)) | ||
|
||
|
||
def promote_sectionheader_to_title(elements: list[Element]) -> list[Element]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we modify this to promote any set of elements to a title and take it in as an argument? You can set the default to be ["Section-header", "Caption"]
and call the function promote_elements
. This would make it more widely applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, updating the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking better. We will need to add the documentation for this in this PR or a later one, I'll leave that to you.
return self.tesseract.get_text(image) | ||
def get_text(self, image: Image.Image) -> tuple[str, Optional[float]]: | ||
# font size calculation is not supported for tesseract | ||
return self.tesseract.get_text(image)[0], None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you just want to return the entire tuple? Just self.tesseract.get_text(image)
. If it is not handled, we will get None
from there anyways
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Could you add a small unit test to ensure we behave as intended? It should be relatively simple.
output_label_options: A dictionary for configuring output label behavior. It supports two options: | ||
promote_title, a boolean that specifies whether to add a title to partitioned elements if one is missing, and | ||
title_candidate_elements, a list of strings representing labels for potential titles. | ||
default: {"promote_title": True , "title_candidate_elements":["Section-header", "Caption"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it makes sense to have title_candidate_elements
as a top-level attribute given that we only care about it if promote_title
is true but this is similar to use_ocr
and ocr_images
so it's not worth blocking on this.
] | ||
|
||
result = promote_title(elements) | ||
print(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to remove this
If the first page of a document lacks a title, this heuristic identifies section headers or caption with the largest font size and promotes the most largest fontsize one to the title.
With the use_ocr flag enabled, the font size is determined based on the height of the bounding box. For Paddle OCR, each bounding box represents only a single line.
With the use_ocr flag disabled, the font size is obtained using PDFMiner.