-
Notifications
You must be signed in to change notification settings - Fork 65
Add OCR Caching #1381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add OCR Caching #1381
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds a configurable, per-image OCR caching layer to Sycamore, refactors OCR model classes to use the new caching API, and provides an end-to-end example and tests.
- Introduces
ocr_cache.py
withOcrCacheManager
and cache key generator - Refactors
OcrModel
subclasses to wrapget_text
/get_boxes_and_text
with caching logic - Updates tests to cover cache manager functionality and adds an example script
- Bumps
paddleocr
dependency and removes the oldDiskCache
import
Reviewed Changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
lib/sycamore/.../text_extraction/ocr_models.py | Refactor OCR models to use new OcrCacheManager , wrap calls |
lib/sycamore/.../text_extraction/ocr_cache.py | New cache manager, key generation, and global accessors |
lib/sycamore/tests/.../text_extraction/test_ocr_cache.py | Tests for cache key gen, manager, global accessor |
lib/sycamore/pyproject.toml | Updated paddleocr version |
examples/ocr_caching_example.py | New example demonstrating caching modes |
Comments suppressed due to low confidence (6)
lib/sycamore/sycamore/transforms/text_extraction/ocr_cache.py:92
- The docstring for
__init__
says "If None, uses default local cache at ~/.sycamore/OcrCache", but the implementation disables caching whencache_path
is None. Please update the docstring to match the actual behavior or change the behavior to use a default path.
Args:
lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:161
- This comment refers to an old document-level cache that has been removed. Please remove or update the comment to avoid confusion.
# Note: This method still uses the old document-level caching for backward compatibility
lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:70
- New caching logic is added in this wrapper, but there are no unit tests verifying that repeated calls hit the cache (e.g. mocking
_get_text_impl
). Please add tests to cover the cache-hit and cache-miss paths inOcrModel.get_text
(and similarly forget_boxes_and_text
).
def get_text(self, image: Image.Image, **kwargs) -> tuple[str, Optional[float]]:
lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:163
- The code references
tempfile
but there is noimport tempfile
at the top of this file, which will cause aNameError
at runtime. Please addimport tempfile
alongside the other imports.
with tempfile.TemporaryDirectory() as tempdirname: # type: ignore
lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:185
- Using a mutable default argument (
lang_list=["en"]
) can lead to unexpected shared state between instances. Consider usinglang_list: Optional[list[str]] = None
and then settingself._lang_list = lang_list or ["en"]
inside the method.
cache_path: Optional[str] = None,
lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py:240
- [nitpick] Overriding the special
__name__
method on instances is unconventional and may cause confusion. Consider using a property or a class attribute (e.g.model_name
) instead of redefining__name__
.
def __name__(self):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will comment more later.
img = Image.new("RGB", size, color) | ||
draw = ImageDraw.Draw(img) | ||
|
||
# Try to use a default font, fallback to default if not available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word default used twice here is confusing.
logger.info("=== Local Caching Demo ===") | ||
|
||
# Create temporary cache directory | ||
with tempfile.TemporaryDirectory() as temp_dir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be useful to set prefix= to avoid mysterious dirs in /tmp.
cache_only: bool = False, | ||
disable_caching: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These parameters have confusing names. In essence, the controls we want are:
- enable_cache_read: bool
- enable_cache_write: bool
- fail_on_cache_miss: bool
"""Convert PIL image to a hash string.""" | ||
# Convert image to bytes for consistent hashing | ||
img_bytes = io.BytesIO() | ||
image.save(img_bytes, format="PNG") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use a more deterministic image format, perhaps zero compression. For PNG, saving with optimize off and compress_level zero sounds good.
img_bytes = io.BytesIO() | ||
image.save(img_bytes, format="PNG") | ||
img_bytes.seek(0) | ||
return hashlib.sha256(img_bytes.getvalue()).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should accept a hash context object and update it here. That allows the caller to accumulate whatever they want into the hash. Also, hexdigest is a clumsy way to represent a hash, except perhaps for final rendering into text. Base64 (or even 36) of` the digest bytes is better.
versions[package] = self._get_package_version(package) | ||
|
||
# Create the cache key components | ||
key_components = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just hash all of the things that matter in a deterministic order. No need to hash hashes. In fact, try to avoid that; it's mathematically unsound.
|
||
# Create temporary cache directory | ||
with tempfile.TemporaryDirectory() as temp_dir: | ||
cache_path = str(Path(temp_dir) / "ocr_cache") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a roundabout way of saying f"{temp_dir}/ocr_cache"
try: | ||
ocr_cache_only.get_text(new_img) | ||
assert False, "Should have raised CacheMissError" | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not catch CacheMissError specifically?
assert ocr_disabled.cache_manager is None | ||
assert ocr_disabled.disable_caching |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should test behavior, not inner settings.
# Results should be identical (same implementation called) | ||
assert result1 == result2 | ||
logger.info(f"Results are identical: {result1}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this off-topic test, I'd rather see validation that the hit rate is zero.
logger.info("Checking that cache was not populated...") | ||
|
||
# Should not find cached result | ||
cached_result = ocr_normal.cache_manager.get(img, "PaddleOcr", "get_text", {}, ["paddleocr", "paddle"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too intimate. If we had a mode that allowed reading from cache without writing and without executing, we could use that via an official PaddleOcr object.
|
||
return result | ||
|
||
def get_boxes_and_text(self, image: Image.Image, **kwargs) -> list[dict[str, Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this mostly a copy of the above function? Can we make a single function that takes the function name and a callable and wraps caching around anything?
) | ||
if cached_result is not None: | ||
logger.debug(f"Cache hit for {self._model_name}.get_boxes_and_text") | ||
assert isinstance(cached_result, list), f"Cached result is not a list: {type(cached_result)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this assert?
cached_result = [ | ||
{"bbox": BoundingBox(*dict_value["bbox"]), **{k: v for k, v in dict_value.items() if k != "bbox"}} | ||
for dict_value in cached_result | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kinda ugly when the object we store in the cache isn't directly what we can return to users. In this case, I think we can hide/abstract some of it. Basically, we just need a simplify/reconstruct helper. Since this varies per-function, I don't think we can push it down into the cache module.
# Note: This method still uses the old document-level caching for backward compatibility | ||
# The new per-image caching is handled in get_text and get_boxes_and_text methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the document-level caching has simply been ripped out, notwithstanding what the comment says. I think document-level caching still has value. Since we haven't run it in production, we're free to tweak it with the new hashing stuff, but we should be able to enable both document- and image-level caching. In fact, it's a good test of the design to use the code for both purposes.
super().__init__(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching) | ||
self.tesseract = Tesseract(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching) | ||
self.easy_ocr = EasyOcr(cache_path=cache_path, cache_only=cache_only, disable_caching=disable_caching) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be less messy if we just passed a cache manager object around.
Adds the ability to have a configurable OCR Cache.