Add OCR Caching #1381

karanataryn · 2025-07-02T21:01:02Z

Adds the ability to have a configurable OCR Cache.

Copilot

Pull Request Overview

Adds a configurable, per-image OCR caching layer to Sycamore, refactors OCR model classes to use the new caching API, and provides an end-to-end example and tests.

- Introduces ocr_cache.py with OcrCacheManager and cache key generator
- Refactors OcrModel subclasses to wrap get_text/get_boxes_and_text with caching logic
- Updates tests to cover cache manager functionality and adds an example script
- Bumps paddleocr dependency and removes the old DiskCache import

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| lib/sycamore/.../text_extraction/ocr_models.py | Refactor OCR models to use new `OcrCacheManager`, wrap calls |
| lib/sycamore/.../text_extraction/ocr_cache.py | New cache manager, key generation, and global accessors |
| lib/sycamore/tests/.../text_extraction/test_ocr_cache.py | Tests for cache key gen, manager, global accessor |
| lib/sycamore/pyproject.toml | Updated `paddleocr` version |
| examples/ocr_caching_example.py | New example demonstrating caching modes |

Comments suppressed due to low confidence (6)

lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py:92

The docstring for __init__ says "If None, uses default local cache at ~/.sycamore/OcrCache", but the implementation disables caching when cache_path is None. Please update the docstring to match the actual behavior or change the behavior to use a default path.

        Args:

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:161

This comment refers to an old document-level cache that has been removed. Please remove or update the comment to avoid confusion.

        # Note: This method still uses the old document-level caching for backward compatibility

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:70

New caching logic is added in this wrapper, but there are no unit tests verifying that repeated calls hit the cache (e.g. mocking _get_text_impl). Please add tests to cover the cache-hit and cache-miss paths in OcrModel.get_text (and similarly for get_boxes_and_text).

    def get_text(self, image: Image.Image, **kwargs) -> tuple[str, Optional[float]]:

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:163

The code references tempfile but there is no import tempfile at the top of this file, which will cause a NameError at runtime. Please add import tempfile alongside the other imports.

        with tempfile.TemporaryDirectory() as tempdirname:  # type: ignore

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:185

Using a mutable default argument (lang_list=["en"]) can lead to unexpected shared state between instances. Consider using lang_list: Optional[list[str]] = None and then setting self._lang_list = lang_list or ["en"] inside the method.

        cache_path: Optional[str] = None,
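
The shared-state pitfall can be shown in isolation. This is a minimal sketch: `SharedDefault` and `SafeDefault` are hypothetical classes, not Sycamore's, and the safe variant copies the list rather than using the `lang_list or ["en"]` shorthand suggested above.

```python
from typing import Optional


class SharedDefault:
    """Hypothetical class showing the pitfall: one list object serves every instance."""

    def __init__(self, lang_list: list[str] = ["en"]):  # default is built once, at definition time
        self.lang_list = lang_list


class SafeDefault:
    """Hypothetical class using a None sentinel instead of a mutable default."""

    def __init__(self, lang_list: Optional[list[str]] = None):
        # Copy so that neither a default nor the caller's list is aliased.
        self.lang_list = list(lang_list) if lang_list is not None else ["en"]


a, b = SharedDefault(), SharedDefault()
a.lang_list.append("fr")  # also changes b.lang_list, since both alias the default list

c, d = SafeDefault(), SafeDefault()
c.lang_list.append("fr")  # d.lang_list is unaffected
```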

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:240

[nitpick] Overriding the special __name__ method on instances is unconventional and may cause confusion. Consider using a property or a class attribute (e.g. model_name) instead of redefining __name__.

    def __name__(self):

alexaryn

Will comment more later.

alexaryn · 2025-07-02T23:32:24Z

examples/ocr_caching_example.py

+    img = Image.new("RGB", size, color)
+    draw = ImageDraw.Draw(img)
+
+    # Try to use a default font, fallback to default if not available


The word "default" is used twice here, which is confusing.

alexaryn · 2025-07-02T23:37:03Z

examples/ocr_caching_example.py

+    logger.info("=== Local Caching Demo ===")
+
+    # Create temporary cache directory
+    with tempfile.TemporaryDirectory() as temp_dir:


It can be useful to set prefix= to avoid mysterious dirs in /tmp.
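
A minimal sketch of that suggestion, using the standard library's `prefix` argument. The prefix string `sycamore_ocr_cache_` is an illustrative choice, not one taken from the PR.

```python
import os
import tempfile

# The prefix makes the directory's origin obvious when listing the temp location.
# "sycamore_ocr_cache_" is an illustrative name, not one from the PR.
with tempfile.TemporaryDirectory(prefix="sycamore_ocr_cache_") as temp_dir:
    dir_name = os.path.basename(temp_dir)
    cache_path = f"{temp_dir}/ocr_cache"
    existed_inside = os.path.isdir(temp_dir)

# The directory and its contents are removed when the block exits.
removed_after = not os.path.exists(temp_dir)
```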

alexaryn · 2025-07-02T23:45:30Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+        cache_only: bool = False,
+        disable_caching: bool = True,


These parameters have confusing names. In essence, the controls we want are:

enable_cache_read: bool

enable_cache_write: bool

fail_on_cache_miss: bool
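
One way the three controls could fit together is sketched below. Only the three flag names come from the comment above; `CachePolicy`, `cached_call`, the dict-backed cache, and this stand-in `CacheMissError` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class CacheMissError(Exception):
    """Stand-in for the PR's error type; raised when a required entry is absent."""


@dataclass
class CachePolicy:
    """Hypothetical container for the three controls named in the review."""

    enable_cache_read: bool = True
    enable_cache_write: bool = True
    fail_on_cache_miss: bool = False


def cached_call(cache: dict, key: str, policy: CachePolicy, compute: Callable[[], str]) -> str:
    """Apply the policy around one computation; `cache` is a plain dict here."""
    if policy.enable_cache_read and key in cache:
        return cache[key]
    if policy.fail_on_cache_miss:
        # Never run the model when the caller demands a cached answer.
        raise CacheMissError(key)
    result = compute()
    if policy.enable_cache_write:
        cache[key] = result
    return result
```

Under this reading, `cache_only=True` would roughly correspond to `fail_on_cache_miss=True`, and `disable_caching=True` to turning both read and write off. That mapping is an interpretation, not something the thread states.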

alexaryn · 2025-07-02T23:48:30Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py

+        """Convert PIL image to a hash string."""
+        # Convert image to bytes for consistent hashing
+        img_bytes = io.BytesIO()
+        image.save(img_bytes, format="PNG")


I would use a more deterministic image format, perhaps zero compression. For PNG, saving with optimize off and compress_level zero sounds good.

alexaryn · 2025-07-02T23:54:34Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py

+        img_bytes = io.BytesIO()
+        image.save(img_bytes, format="PNG")
+        img_bytes.seek(0)
+        return hashlib.sha256(img_bytes.getvalue()).hexdigest()


You should accept a hash context object and update it here. That allows the caller to accumulate whatever they want into the hash. Also, hexdigest is a clumsy way to represent a hash, except perhaps for final rendering into text. Base64 (or even base 36) of the digest bytes is better.
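
A rough sketch of that shape, using only `hashlib` and `base64`. The function names are hypothetical, and plain bytes stand in for the encoded image so the snippet does not depend on Pillow.

```python
import base64
import hashlib


def update_hash_with_image(hash_ctx, image_bytes: bytes) -> None:
    """Fold the image bytes into a hash context owned by the caller."""
    hash_ctx.update(image_bytes)


def render_key(hash_ctx) -> str:
    """Encode the raw digest bytes as URL-safe Base64, dropping '=' padding."""
    return base64.urlsafe_b64encode(hash_ctx.digest()).rstrip(b"=").decode("ascii")


ctx = hashlib.sha256()
ctx.update(b"PaddleOcr/get_text")  # the caller accumulates whatever else matters
update_hash_with_image(ctx, b"stand-in for the encoded image bytes")
key = render_key(ctx)      # 43 characters for a SHA-256 digest
hex_key = ctx.hexdigest()  # 64 characters for the same digest
```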

alexaryn · 2025-07-02T23:56:21Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py

+            versions[package] = self._get_package_version(package)
+
+        # Create the cache key components
+        key_components = {


Just hash all of the things that matter in a deterministic order. No need to hash hashes. In fact, try to avoid that; it's mathematically unsound.
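
As a sketch of what "hash everything in a deterministic order" might look like: every component goes straight into one hash context, with mappings fed in sorted key order. The `make_cache_key` signature, the length-prefix framing, and the string-only values are assumptions, not the PR's code.

```python
import hashlib


def make_cache_key(
    image_bytes: bytes,
    model_name: str,
    function_name: str,
    kwargs: dict[str, str],
    versions: dict[str, str],
) -> bytes:
    """Hypothetical key builder: one hash context, no intermediate digests."""
    ctx = hashlib.sha256()

    def feed(data: bytes) -> None:
        # Length-prefix each field so adjacent fields cannot run together.
        ctx.update(len(data).to_bytes(8, "big"))
        ctx.update(data)

    feed(image_bytes)  # the bytes themselves, not a pre-computed image hash
    feed(model_name.encode("utf-8"))
    feed(function_name.encode("utf-8"))
    for mapping in (kwargs, versions):
        feed(len(mapping).to_bytes(8, "big"))
        for name in sorted(mapping):  # sorted, so insertion order is irrelevant
            feed(name.encode("utf-8"))
            feed(mapping[name].encode("utf-8"))
    return ctx.digest()
```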

alexaryn · 2025-07-03T04:27:48Z

examples/ocr_caching_example.py

+
+    # Create temporary cache directory
+    with tempfile.TemporaryDirectory() as temp_dir:
+        cache_path = str(Path(temp_dir) / "ocr_cache")


Seems like a roundabout way of saying f"{temp_dir}/ocr_cache"

alexaryn · 2025-07-03T04:31:21Z

examples/ocr_caching_example.py

+        try:
+            ocr_cache_only.get_text(new_img)
+            assert False, "Should have raised CacheMissError"
+        except Exception as e:


Why not catch CacheMissError specifically?
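
There is also a correctness reason: `AssertionError` is a subclass of `Exception`, so a bare `except Exception` swallows the `assert False` guard too. A minimal sketch of the narrower form follows; `CacheMissError` and `get_text_cache_only` are stand-ins here, not the PR's implementations.

```python
class CacheMissError(Exception):
    """Stand-in for the PR's CacheMissError."""


def get_text_cache_only(cache: dict, key: str) -> str:
    """Hypothetical cache-only lookup that never falls back to running OCR."""
    try:
        return cache[key]
    except KeyError:
        raise CacheMissError(key) from None


try:
    get_text_cache_only({}, "unseen-image")
except CacheMissError as e:
    outcome = f"miss: {e}"
else:
    # Reached only if no exception was raised, so nothing can swallow this failure.
    raise AssertionError("Should have raised CacheMissError")
```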

alexaryn · 2025-07-03T04:32:27Z

examples/ocr_caching_example.py

+        assert ocr_disabled.cache_manager is None
+        assert ocr_disabled.disable_caching


I think you should test behavior, not inner settings.

alexaryn · 2025-07-03T04:34:21Z

examples/ocr_caching_example.py

+        # Results should be identical (same implementation called)
+        assert result1 == result2
+        logger.info(f"Results are identical: {result1}")


Instead of this off-topic test, I'd rather see validation that the hit rate is zero.

alexaryn · 2025-07-03T04:36:05Z

examples/ocr_caching_example.py

+        logger.info("Checking that cache was not populated...")
+
+        # Should not find cached result
+        cached_result = ocr_normal.cache_manager.get(img, "PaddleOcr", "get_text", {}, ["paddleocr", "paddle"])


This seems too intimate. If we had a mode that allowed reading from cache without writing and without executing, we could use that via an official PaddleOcr object.

alexaryn · 2025-07-03T06:24:13Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+
+        return result
+
+    def get_boxes_and_text(self, image: Image.Image, **kwargs) -> list[dict[str, Any]]:


Is this mostly a copy of the above function? Can we make a single function that takes the function name and a callable and wraps caching around anything?
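
One possible shape for such a wrapper is sketched below. `CachingModel`, `_with_cache`, and the in-memory dict are hypothetical, images are represented as bytes, and keyword arguments are assumed to be hashable.

```python
from typing import Any, Callable


class CachingModel:
    """Hypothetical stand-in for OcrModel with one shared caching wrapper."""

    def __init__(self) -> None:
        self._cache: dict[tuple, Any] = {}
        self.impl_calls = 0  # counts real (uncached) executions

    def _with_cache(self, function_name: str, impl: Callable[..., Any], image: bytes, **kwargs: Any) -> Any:
        # The function name is part of the key, so the two methods never collide.
        key = (function_name, image, tuple(sorted(kwargs.items())))
        if key in self._cache:
            return self._cache[key]
        result = impl(image, **kwargs)
        self._cache[key] = result
        return result

    def get_text(self, image: bytes, **kwargs: Any) -> str:
        return self._with_cache("get_text", self._get_text_impl, image, **kwargs)

    def get_boxes_and_text(self, image: bytes, **kwargs: Any) -> list:
        return self._with_cache("get_boxes_and_text", self._get_boxes_and_text_impl, image, **kwargs)

    def _get_text_impl(self, image: bytes, **kwargs: Any) -> str:
        self.impl_calls += 1
        return f"text from {len(image)} bytes"

    def _get_boxes_and_text_impl(self, image: bytes, **kwargs: Any) -> list:
        self.impl_calls += 1
        return [{"bbox": (0, 0, 1, 1), "text": "placeholder"}]
```

Counting calls to the `_impl` methods, as `impl_calls` does here, is also one way to verify the cache-hit and cache-miss paths in a unit test.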

alexaryn · 2025-07-03T06:26:29Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+            )
+            if cached_result is not None:
+                logger.debug(f"Cache hit for {self._model_name}.get_boxes_and_text")
+                assert isinstance(cached_result, list), f"Cached result is not a list: {type(cached_result)}"


Do we really need this assert?

alexaryn · 2025-07-03T06:32:47Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+                cached_result = [
+                    {"bbox": BoundingBox(*dict_value["bbox"]), **{k: v for k, v in dict_value.items() if k != "bbox"}}
+                    for dict_value in cached_result
+                ]


It's kinda ugly when the object we store in the cache isn't directly what we can return to users. In this case, I think we can hide/abstract some of it. Basically, we just need a simplify/reconstruct helper. Since this varies per-function, I don't think we can push it down into the cache module.
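
A sketch of what a simplify/reconstruct pair might look like for this function. The `BoundingBox` below is a minimal stand-in with assumed field names, and JSON is used only to demonstrate that the simplified form is storable.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class BoundingBox:
    """Minimal stand-in for Sycamore's BoundingBox; field names are assumed."""

    x1: float
    y1: float
    x2: float
    y2: float


def simplify_boxes(result: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn each BoundingBox into a plain list so the result can be stored."""
    return [
        {**item, "bbox": [item["bbox"].x1, item["bbox"].y1, item["bbox"].x2, item["bbox"].y2]}
        for item in result
    ]


def reconstruct_boxes(stored: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Inverse of simplify_boxes: rebuild BoundingBox objects after loading."""
    return [{**item, "bbox": BoundingBox(*item["bbox"])} for item in stored]


original = [{"bbox": BoundingBox(0.0, 0.0, 10.0, 5.0), "text": "hello"}]
stored = json.dumps(simplify_boxes(original))
restored = reconstruct_boxes(json.loads(stored))
```

Keeping the pair next to the function it serves lets the cache module stay unaware of result types.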

alexaryn · 2025-07-03T06:41:14Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+        # Note: This method still uses the old document-level caching for backward compatibility
+        # The new per-image caching is handled in get_text and get_boxes_and_text methods


It looks like the document-level caching has simply been ripped out, notwithstanding what the comment says. I think document-level caching still has value. Since we haven't run it in production, we're free to tweak it with the new hashing stuff, but we should be able to enable both document- and image-level caching. In fact, it's a good test of the design to use the code for both purposes.

alexaryn · 2025-07-03T06:44:44Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

+        super().__init__(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)
+        self.tesseract = Tesseract(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)
+        self.easy_ocr = EasyOcr(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)


This would be less messy if we just passed a cache manager object around.
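
Roughly, the constructor could then shrink to something like the following sketch. All four classes are simplified stand-ins (`CompositeOcr` is a made-up name for the wrapping model), and the real `OcrCacheManager` has a richer interface than shown.

```python
from typing import Optional


class OcrCacheManager:
    """Hypothetical minimal manager; holds the settings in one place."""

    def __init__(self, cache_path: str):
        self.cache_path = cache_path


class Tesseract:
    def __init__(self, cache_manager: Optional[OcrCacheManager] = None):
        self.cache_manager = cache_manager


class EasyOcr:
    def __init__(self, cache_manager: Optional[OcrCacheManager] = None):
        self.cache_manager = cache_manager


class CompositeOcr:
    """Stand-in for the model that wraps both Tesseract and EasyOcr."""

    def __init__(self, cache_manager: Optional[OcrCacheManager] = None):
        self.cache_manager = cache_manager
        # One object is handed down instead of three separate keyword arguments.
        self.tesseract = Tesseract(cache_manager)
        self.easy_ocr = EasyOcr(cache_manager)


manager = OcrCacheManager("/tmp/ocr_cache")  # illustrative path
model = CompositeOcr(manager)
```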

update cache

79d93a3

karanataryn requested a review from Copilot July 2, 2025 21:01


karanataryn added 3 commits July 2, 2025 14:17

update mypy cache

a3cc8d4

rewrite tests and lint

41fed11

remove Path

ef7c600

karanataryn requested a review from Copilot July 2, 2025 21:55

Copilot AI reviewed Jul 2, 2025


karanataryn and others added 7 commits July 2, 2025 15:41

update with cacheable values

968e651

update with bug fix

51fdf63

fix

60626a0

format

cd86e48

update bbox

feb6e3e

merge

fb5e182

format

9054027

alexaryn reviewed Jul 2, 2025


alexaryn reviewed Jul 3, 2025


karanataryn added 2 commits July 7, 2025 07:31

merge main

f5af11a

merge main

6d1e344

