
Conversation

Contributor

@Soeb-aryn Soeb-aryn commented Nov 21, 2024

If the first page of a document lacks a title, this heuristic finds the section header or caption with the largest font size on that page and promotes it to the title.

With the use_ocr flag enabled, the font size is determined based on the height of the bounding box. For Paddle OCR, each bounding box represents only a single line.
With the use_ocr flag disabled, the font size is obtained using PDFMiner.
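The heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `Element` dataclass and its `type`, `page_number`, and `font_size` fields are hypothetical stand-ins, and the sketch assumes the font size has already been attached to each element (from the bounding-box height with OCR, or from PDFMiner otherwise).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    type: str
    text: str
    page_number: int
    font_size: Optional[float] = None


def promote_title(elements, candidate_types=("Section-header", "Caption")):
    """Promote the largest-font candidate on page 1 to "Title" if no title exists."""
    first_page = [e for e in elements if e.page_number == 1]
    if any(e.type == "Title" for e in first_page):
        return elements  # a title is already present; nothing to do

    # Only consider candidate labels whose font size is actually known.
    candidates = [
        e for e in first_page
        if e.type in candidate_types and e.font_size is not None
    ]
    if candidates:
        # On ties, max() keeps the first element with the largest font size.
        max(candidates, key=lambda e: e.font_size).type = "Title"
    return elements
```

Elements without a known font size are skipped rather than treated as size zero, so a document with no usable font information is left unchanged.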

@Soeb-aryn Soeb-aryn marked this pull request as ready for review November 25, 2024 01:39
output_format: Optional[str] = None,
text_extraction_options: dict[str, Any] = {},
source: str = "",
promote_title: bool = False,
Contributor

Can you check this from the text_extraction_options? You'll also need to add documentation in the top level partition call to say that promote_title is an accepted argument in text_extraction_options and explain its functionality.

Contributor Author

done

Contributor Author

Adding an 'output_label_options' option to drive all output-label-related functionality.

output_format: Optional[str] = None,
text_extraction_options: dict[str, Any] = {},
source: str = "",
promote_title: bool = False,
Contributor

Move to text_extraction_options

output_format=self._output_format,
text_extraction_options=self._text_extraction_options,
source=self._source,
promote_title=self._promote_title,
Contributor

As above.

-    def get_text(self, image: Image.Image) -> str:
-        return self.tesseract.get_text(image)
+    def get_text(self, image: Image.Image) -> tuple[str, float]:
+        return self.tesseract.get_text(image), 0.0
Contributor

self.tesseract.get_text(image) returns a tuple.

+    def get_text(self, image: Image.Image) -> tuple[str, float]:
         val = self.pytesseract.image_to_string(image)
-        return val
+        return val, 0.0
Contributor

Can you add a comment saying that this is not implemented? If the font size can't be calculated from the tesseract output, I would rather we return None to indicate that there is no valid font size.

Contributor Author

done

return pages

@staticmethod
def _parse_obj(objs):
Contributor

Can you rename this to get_font_size or something like it? parse_obj seems inaccurate for the functionality here.

display(HTML(e.text_representation))


def promote_sectionheader_to_title(elements: list[Element]) -> list[Element]:
Contributor

Why don't we modify this to promote any set of elements to a title and take it in as an argument? You can set the default to be ["Section-header", "Caption"] and call the function promote_elements. This would make it more widely applicable.

Contributor Author

good call, updating the function

Contributor

@karanataryn karanataryn left a comment

Looking better. We will need to add the documentation for this in this PR or a later one, I'll leave that to you.

-        return self.tesseract.get_text(image)
+    def get_text(self, image: Image.Image) -> tuple[str, Optional[float]]:
+        # font size calculation is not supported for tesseract
+        return self.tesseract.get_text(image)[0], None
Contributor

Don't you just want to return the entire tuple? Just self.tesseract.get_text(image). If it is not handled, we will get None from there anyways
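To illustrate the contract under discussion, here is a small sketch of how a caller might consume `(text, font_size)` pairs when the font size can be None. The `OcrResult` alias and both helper functions are hypothetical names for this example and do not appear in the PR; the bounding-box estimate assumes one box per text line, as the description states for Paddle OCR.

```python
from typing import Optional

# (text, font_size) pairs; font_size is None when the engine cannot report one.
OcrResult = tuple[str, Optional[float]]


def font_size_from_bbox(bbox: tuple[float, float, float, float]) -> float:
    """Estimate font size as the height of a single-line box (x1, y1, x2, y2)."""
    _, y1, _, y2 = bbox
    return abs(y2 - y1)


def largest_font_size(results: list[OcrResult]) -> Optional[float]:
    """Return the largest known font size, ignoring results that report None."""
    sizes = [size for _, size in results if size is not None]
    return max(sizes) if sizes else None
```

Returning None instead of 0.0 lets downstream code tell "unknown size" apart from "very small text", which matters when picking the largest candidate.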

Contributor

@karanataryn karanataryn left a comment

Looks good to me. Could you add a small unit test to ensure we behave as intended? It should be relatively simple.

output_label_options: A dictionary for configuring output label behavior. It supports two options:
promote_title, a boolean that specifies whether to add a title to partitioned elements if one is missing, and
title_candidate_elements, a list of strings representing labels for potential titles.
default: {"promote_title": True , "title_candidate_elements":["Section-header", "Caption"]}
Contributor

I'm not sure it makes sense to have title_candidate_elements as a top-level attribute, given that we only care about it if promote_title is true. But this is similar to use_ocr and ocr_images, so it's not worth blocking on.
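One way to handle the two keys together is to merge user-supplied values over the documented defaults. The sketch below is only an illustration: `resolve_output_label_options` and `DEFAULT_OUTPUT_LABEL_OPTIONS` are hypothetical names, the default values are copied from the docstring quoted above, and rejecting unknown keys is an assumption of this example rather than confirmed behavior.

```python
import copy
from typing import Any, Optional

# Defaults as given in the docstring under review.
DEFAULT_OUTPUT_LABEL_OPTIONS: dict[str, Any] = {
    "promote_title": True,
    "title_candidate_elements": ["Section-header", "Caption"],
}


def resolve_output_label_options(
    user_options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge user options over the defaults, rejecting unknown keys."""
    user_options = user_options or {}
    unknown = set(user_options) - set(DEFAULT_OUTPUT_LABEL_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown output_label_options keys: {sorted(unknown)}")
    # Deep-copy so callers cannot mutate the shared default list.
    resolved = copy.deepcopy(DEFAULT_OUTPUT_LABEL_OPTIONS)
    resolved.update(user_options)
    return resolved
```

Using `None` as the parameter default, instead of a literal `{}`, avoids sharing one mutable dictionary across calls.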

]

result = promote_title(elements)
print(result)
Contributor

Need to remove this

@Soeb-aryn Soeb-aryn merged commit a9142a6 into main Nov 26, 2024
11 of 14 checks passed

3 participants
