heuristics to get title from section headers #1033

Soeb-aryn · 2024-11-21T20:35:38Z

If the first page of a document lacks a title, this heuristic finds the section header or caption with the largest font size and promotes it to the title.
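
A minimal sketch of the heuristic described above, not the code from this PR. The `Element` dataclass is a hypothetical stand-in for sycamore's element type, the field names are assumptions, and restricting candidates to the first page is also an assumption:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for sycamore's Element; field names are assumptions.
@dataclass
class Element:
    type: str
    text: str
    page_number: int = 1
    font_size: Optional[float] = None


CANDIDATE_TYPES = ("Section-header", "Caption")


def promote_largest_to_title(elements: list[Element]) -> list[Element]:
    """Relabel the largest-font candidate on page 1 as Title if page 1 has no title."""
    first_page = [e for e in elements if e.page_number == 1]
    if any(e.type == "Title" for e in first_page):
        return elements  # a title already exists, nothing to promote
    candidates = [
        e for e in first_page
        if e.type in CANDIDATE_TYPES and e.font_size is not None
    ]
    if not candidates:
        return elements
    # max() returns the first maximal item, so ties go to the earliest candidate.
    best = max(candidates, key=lambda e: e.font_size)
    best.type = "Title"
    return elements


doc = [
    Element("Section-header", "Introduction", 1, 14.0),
    Element("Caption", "Annual Report", 1, 18.0),
    Element("Text", "Body text", 1, 10.0),
]
promote_largest_to_title(doc)
```

Elements without a known font size are skipped rather than treated as size zero, so they can never win the comparison by accident.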

With the use_ocr flag enabled, the font size is determined based on the height of the bounding box. For Paddle OCR, each bounding box represents only a single line.
With the use_ocr flag disabled, the font size is obtained using PDFMiner.
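
As a rough illustration of the OCR path, the font size can be approximated from the box height. This is a sketch under assumptions: the `(x1, y1, x2, y2)` coordinate order and the helper name `font_size_from_bbox` are hypothetical, and the result is in image units, so it is only comparable between boxes from the same page render:

```python
from typing import Optional


def font_size_from_bbox(bbox: tuple[float, float, float, float]) -> Optional[float]:
    """Approximate font size as the height of a single-line box (x1, y1, x2, y2)."""
    _, y1, _, y2 = bbox
    height = abs(y2 - y1)
    # A degenerate box carries no size information.
    return height if height > 0 else None


size = font_size_from_bbox((0.0, 10.0, 100.0, 34.0))
```

The approximation relies on each box holding a single line of text. A box spanning several lines would overstate the font size.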

lib/sycamore/sycamore/transforms/detr_partitioner.py

karanataryn · 2024-11-25T20:24:32Z

lib/sycamore/sycamore/transforms/detr_partitioner.py

        output_format: Optional[str] = None,
        text_extraction_options: dict[str, Any] = {},
        source: str = "",
+        promote_title: bool = False,


Can you check this from the text_extraction_options? You'll also need to add documentation in the top level partition call to say that promote_title is an accepted argument in text_extraction_options and explain its functionality.

Adding an 'output_label_options' option to drive all output-label-related functionality.

karanataryn · 2024-11-25T20:24:50Z

lib/sycamore/sycamore/transforms/partition.py

        output_format: Optional[str] = None,
        text_extraction_options: dict[str, Any] = {},
        source: str = "",
+        promote_title: bool = False,


Move to text_extraction_options

karanataryn · 2024-11-25T20:25:01Z

lib/sycamore/sycamore/transforms/partition.py

                output_format=self._output_format,
                text_extraction_options=self._text_extraction_options,
                source=self._source,
+                promote_title=self._promote_title,


karanataryn · 2024-11-25T20:26:15Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

-    def get_text(self, image: Image.Image) -> str:
-        return self.tesseract.get_text(image)
+    def get_text(self, image: Image.Image) -> tuple[str, float]:
+        return self.tesseract.get_text(image), 0.0


self.tesseract.get_text(image) returns a tuple.

karanataryn · 2024-11-25T20:27:38Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+    def get_text(self, image: Image.Image) -> tuple[str, float]:
        val = self.pytesseract.image_to_string(image)
-        return val
+        return val, 0.0


Can you add a comment that this is not implemented? If it's not possible to calculate it from the output of tesseract, I would rather we return None to indicate this is not a valid font size.
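
A small sketch of why `None` is a safer signal than `0.0` here. The class `NoFontSizeOcr` and the helper `largest_font` are hypothetical names for illustration and are not part of sycamore:

```python
from typing import Optional


class NoFontSizeOcr:
    """Hypothetical OCR wrapper whose backend cannot report a font size."""

    def get_text(self, image: object) -> tuple[str, Optional[float]]:
        text = "recognized text"  # placeholder for a real OCR call
        # Font size is not implemented for this backend: return None, not 0.0.
        return text, None


def largest_font(sizes: list[Optional[float]]) -> Optional[float]:
    """Return the largest known font size, ignoring entries that are None."""
    known = [s for s in sizes if s is not None]
    return max(known) if known else None


text, font_size = NoFontSizeOcr().get_text(image=None)
```

With `0.0`, a caller cannot tell "unknown" apart from a measured size of zero. With `None`, the caller has to handle the missing value explicitly.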

karanataryn · 2024-11-25T20:28:51Z

lib/sycamore/sycamore/transforms/text_extraction/pdf_miner.py

            return pages

+    @staticmethod
+    def _parse_obj(objs):


Can you rename this to get_font_size or something like it? parse_obj seems inaccurate for the functionality here.

karanataryn · 2024-11-25T20:29:58Z

lib/sycamore/sycamore/utils/pdf_utils.py

                display(HTML(e.text_representation))
+
+
+def promote_sectionheader_to_title(elements: list[Element]) -> list[Element]:


Why don't we modify this to promote any set of elements to a title and take it in as an argument? You can set the default to be ["Section-header", "Caption"] and call the function promote_elements. This would make it more widely applicable.

good call, updating the function
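
One way the suggested generalization could look, as a sketch only. The `Element` dataclass is a hypothetical stand-in, the first-page check is omitted for brevity, and the signature actually merged may differ:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


# Hypothetical stand-in for sycamore's Element; field names are assumptions.
@dataclass
class Element:
    type: str
    font_size: Optional[float] = None


# A tuple default avoids sharing a mutable list between calls.
DEFAULT_CANDIDATES = ("Section-header", "Caption")


def promote_elements(
    elements: list[Element],
    candidate_labels: Sequence[str] = DEFAULT_CANDIDATES,
) -> list[Element]:
    """Relabel the largest-font element among candidate_labels as Title."""
    candidates = [
        e for e in elements
        if e.type in candidate_labels and e.font_size is not None
    ]
    if candidates:
        max(candidates, key=lambda e: e.font_size).type = "Title"
    return elements


default_run = promote_elements(
    [Element("Page-header", 20.0), Element("Section-header", 12.0)]
)
custom_run = promote_elements(
    [Element("Page-header", 20.0), Element("Section-header", 12.0)],
    candidate_labels=["Page-header"],
)
```

Taking the labels as an argument keeps the default behavior while letting callers promote other element types.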

karanataryn

Looking better. We will need to add the documentation for this in this PR or a later one, I'll leave that to you.

karanataryn · 2024-11-26T20:04:24Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

-        return self.tesseract.get_text(image)
+    def get_text(self, image: Image.Image) -> tuple[str, Optional[float]]:
+        # font size calculation is not supported for tesseract
+        return self.tesseract.get_text(image)[0], None


Don't you just want to return the entire tuple? Just self.tesseract.get_text(image). If it is not handled, we will get None from there anyways

karanataryn

Looks good to me. Could you add a small unit test to ensure we behave as intended? It should be relatively simple.

karanataryn · 2024-11-26T22:27:06Z

lib/sycamore/sycamore/transforms/partition.py

+        output_label_options: A dictionary for configuring output label behavior. It supports two options:
+            promote_title, a boolean that specifies whether to add a title to partitioned elements if one is missing, and
+            title_candidate_elements, a list of strings representing labels for potential titles.
+            default: {"promote_title": True, "title_candidate_elements": ["Section-header", "Caption"]}


I'm not sure it makes sense to have title_candidate_elements as a top-level attribute, given that we only care about it if promote_title is true, but this is similar to use_ocr and ocr_images, so it's not worth blocking on.
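
A sketch of how such an options dictionary could be resolved against its defaults. The default values are copied from the docstring in the diff above. The helper name `resolve_output_label_options` is hypothetical and not part of sycamore:

```python
from typing import Any, Optional

# Defaults as stated in the docstring under review.
DEFAULT_OUTPUT_LABEL_OPTIONS: dict[str, Any] = {
    "promote_title": True,
    "title_candidate_elements": ["Section-header", "Caption"],
}


def resolve_output_label_options(
    options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge caller-supplied options over the defaults."""
    resolved = {**DEFAULT_OUTPUT_LABEL_OPTIONS, **(options or {})}
    # Copy the list so callers cannot mutate the shared default.
    resolved["title_candidate_elements"] = list(resolved["title_candidate_elements"])
    return resolved


opts = resolve_output_label_options({"promote_title": False})
if opts["promote_title"]:
    labels = opts["title_candidate_elements"]  # only consulted when promotion is on
```

Keeping both keys at the same level means `title_candidate_elements` is always present but only read when `promote_title` is true, which is the trade-off the comment above points out.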

karanataryn · 2024-11-26T23:26:12Z

lib/sycamore/sycamore/tests/unit/utils/test_pdf_utils.py

+    ]
+
+    result = promote_title(elements)
+    print(result)


Need to remove this

Soeb-aryn added 10 commits November 21, 2024 20:13

heuristics to get title from section headers

facf2a7

updating get_text return

5789e32

rps unit test fix

88ff5f3

unit test fixes

248ff6b

mypy fixes

e99d115

lint fix

12fdc13

table element fixes

d4dcb83

reverting typo

1f6ee4d

linting

f6abc28

linting

81c81dc

Soeb-aryn marked this pull request as ready for review November 25, 2024 01:39

karanataryn reviewed Nov 25, 2024

Soeb-aryn added 2 commits November 26, 2024 00:26

updating function, adding comments and parameters

0d9d156

removing redundant checks

f577793

Soeb-aryn requested a review from karanataryn November 26, 2024 00:48

Soeb Hussain added 3 commits November 26, 2024 18:23

Merge branch 'main' into heuristics_to_promote_sectionheader_2_title

41ff140

more fixes for font sizes

5782305

Merge branch 'main' into heuristics_to_promote_sectionheader_2_title

16be4d3

karanataryn reviewed Nov 26, 2024

Soeb-aryn added 4 commits November 26, 2024 21:41

changing variable names, function definition and updating doc string

17c9151

typo handling

31e59c2

updating docs

3be9f3e

linting

9cc3003

Soeb-aryn requested a review from karanataryn November 26, 2024 22:21

karanataryn approved these changes Nov 26, 2024

updating unit tests

1723780

karanataryn reviewed Nov 26, 2024

removing print statements

8c04f81

Soeb-aryn merged commit a9142a6 into main Nov 26, 2024
11 of 14 checks passed

MarkLindblad reviewed Dec 4, 2024

lib/sycamore/sycamore/transforms/partition.py

MarkLindblad reviewed Dec 5, 2024

lib/sycamore/sycamore/transforms/partition.py

Soeb-aryn commented Nov 21, 2024 • edited by MarkLindblad

3 participants
