+
Skip to content

Conversation

karanataryn
Copy link
Contributor

Adds the ability to have a configurable OCR Cache.

@karanataryn karanataryn requested a review from Copilot July 2, 2025 21:01
Copilot

This comment was marked as outdated.

@karanataryn karanataryn requested a review from Copilot July 2, 2025 21:55
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a configurable, per-image OCR caching layer to Sycamore, refactors OCR model classes to use the new caching API, and provides an end-to-end example and tests.

  • Introduces ocr_cache.py with OcrCacheManager and cache key generator
  • Refactors OcrModel subclasses to wrap get_text/get_boxes_and_text with caching logic
  • Updates tests to cover cache manager functionality and adds an example script
  • Bumps paddleocr dependency and removes the old DiskCache import

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
lib/sycamore/.../text_extraction/ocr_models.py Refactor OCR models to use new OcrCacheManager, wrap calls
lib/sycamore/.../text_extraction/ocr_cache.py New cache manager, key generation, and global accessors
lib/sycamore/tests/.../text_extraction/test_ocr_cache.py Tests for cache key gen, manager, global accessor
lib/sycamore/pyproject.toml Updated paddleocr version
examples/ocr_caching_example.py New example demonstrating caching modes
Comments suppressed due to low confidence (6)

lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py:92

  • The docstring for __init__ says "If None, uses default local cache at ~/.sycamore/OcrCache", but the implementation disables caching when cache_path is None. Please update the docstring to match the actual behavior or change the behavior to use a default path.
        Args:

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:161

  • This comment refers to an old document-level cache that has been removed. Please remove or update the comment to avoid confusion.
        # Note: This method still uses the old document-level caching for backward compatibility

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:70

  • New caching logic is added in this wrapper, but there are no unit tests verifying that repeated calls hit the cache (e.g. mocking _get_text_impl). Please add tests to cover the cache-hit and cache-miss paths in OcrModel.get_text (and similarly for get_boxes_and_text).
    def get_text(self, image: Image.Image, **kwargs) -> tuple[str, Optional[float]]:

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:163

  • The code references tempfile but there is no import tempfile at the top of this file, which will cause a NameError at runtime. Please add import tempfile alongside the other imports.
        with tempfile.TemporaryDirectory() as tempdirname:  # type: ignore

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:185

  • Using a mutable default argument (lang_list=["en"]) can lead to unexpected shared state between instances. Consider using lang_list: Optional[list[str]] = None and then setting self._lang_list = lang_list or ["en"] inside the method.
        cache_path: Optional[str] = None,

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:240

  • [nitpick] Overriding the special __name__ method on instances is unconventional and may cause confusion. Consider using a property or a class attribute (e.g. model_name) instead of redefining __name__.
    def __name__(self):

Copy link
Collaborator

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will comment more later.

img = Image.new("RGB", size, color)
draw = ImageDraw.Draw(img)

# Try to use a default font, fallback to default if not available
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word default used twice here is confusing.

logger.info("=== Local Caching Demo ===")

# Create temporary cache directory
with tempfile.TemporaryDirectory() as temp_dir:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be useful to set prefix= to avoid mysterious dirs in /tmp.

Comment on lines +197 to +198
cache_only: bool = False,
disable_caching: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters have confusing names. In essence, the controls we want are:

  • enable_cache_read: bool
  • enable_cache_write: bool
  • fail_on_cache_miss: bool

"""Convert PIL image to a hash string."""
# Convert image to bytes for consistent hashing
img_bytes = io.BytesIO()
image.save(img_bytes, format="PNG")
Copy link
Collaborator

@alexaryn alexaryn Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a more deterministic image format, perhaps zero compression. For PNG, saving with optimize off and compress_level zero sounds good.

img_bytes = io.BytesIO()
image.save(img_bytes, format="PNG")
img_bytes.seek(0)
return hashlib.sha256(img_bytes.getvalue()).hexdigest()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should accept a hash context object and update it here. That allows the caller to accumulate whatever they want into the hash. Also, hexdigest is a clumsy way to represent a hash, except perhaps for final rendering into text. Base64 (or even 36) of` the digest bytes is better.

versions[package] = self._get_package_version(package)

# Create the cache key components
key_components = {
Copy link
Collaborator

@alexaryn alexaryn Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just hash all of the things that matter in a deterministic order. No need to hash hashes. In fact, try to avoid that; it's mathematically unsound.


# Create temporary cache directory
with tempfile.TemporaryDirectory() as temp_dir:
cache_path = str(Path(temp_dir) / "ocr_cache")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a roundabout way of saying f"{temp_dir}/ocr_cache"

try:
ocr_cache_only.get_text(new_img)
assert False, "Should have raised CacheMissError"
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not catch CacheMissError specifically?

Comment on lines +131 to +132
assert ocr_disabled.cache_manager is None
assert ocr_disabled.disable_caching
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should test behavior, not inner settings.

Comment on lines +139 to +141
# Results should be identical (same implementation called)
assert result1 == result2
logger.info(f"Results are identical: {result1}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this off-topic test, I'd rather see validation that the hit rate is zero.

logger.info("Checking that cache was not populated...")

# Should not find cached result
cached_result = ocr_normal.cache_manager.get(img, "PaddleOcr", "get_text", {}, ["paddleocr", "paddle"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too intimate. If we had a mode that allowed reading from cache without writing and without executing, we could use that via an official PaddleOcr object.


return result

def get_boxes_and_text(self, image: Image.Image, **kwargs) -> list[dict[str, Any]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mostly a copy of the above function? Can we make a single function that takes the function name and a callable and wraps caching around anything?

)
if cached_result is not None:
logger.debug(f"Cache hit for {self._model_name}.get_boxes_and_text")
assert isinstance(cached_result, list), f"Cached result is not a list: {type(cached_result)}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this assert?

Comment on lines +136 to +139
cached_result = [
{"bbox": BoundingBox(*dict_value["bbox"]), **{k: v for k, v in dict_value.items() if k != "bbox"}}
for dict_value in cached_result
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kinda ugly when the object we store in the cache isn't directly what we can return to users. In this case, I think we can hide/abstract some of it. Basically, we just need a simplify/reconstruct helper. Since this varies per-function, I don't think we can push it down into the cache module.

Comment on lines +174 to +175
# Note: This method still uses the old document-level caching for backward compatibility
# The new per-image caching is handled in get_text and get_boxes_and_text methods
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the document-level caching has simply been ripped out, notwithstanding what the comment says. I think document-level caching still has value. Since we haven't run it in production, we're free to tweak it with the new hashing stuff, but we should be able to enable both document- and image-level caching. In fact, it's a good test of the design to use the code for both purposes.

Comment on lines +297 to +299
super().__init__(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)
self.tesseract = Tesseract(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)
self.easy_ocr = EasyOcr(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be less messy if we just passed a cache manager object around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载