dhruvkaliraman7
Contributor

1. MaterializeReadReliability: A new class that enables reliable batch processing of materialized files by:

  • Tracking already processed files

  • Limiting batch sizes

  • Supporting incremental processing through batch resets

  • Maintaining state between batch executions

2. Added utility functions:

  • name_from_docid: Custom naming function using path-based SHA-256 hashes

  • docid_from_path: Generates document IDs (SHA-256 hashes) from a path

  • doc_only_to_binary: Serialization helper
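
To illustrate the ideas listed above, here is a minimal sketch of path-based SHA-256 document IDs combined with a batch-limited filter that remembers which files it has already accepted. This is not the code added in this PR: every name below (`docid_from_path_sketch`, `name_from_docid_sketch`, `BatchTrackerSketch`, `run_in_batches`), the `doc-` prefix, and the `.pickle` suffix are hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Iterable, List, Set


def docid_from_path_sketch(path: str) -> str:
    """Hypothetical stand-in: derive a stable document ID from a path via SHA-256."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def name_from_docid_sketch(doc_id: str, suffix: str = ".pickle") -> str:
    """Hypothetical stand-in: build a file name from a document ID.

    The "doc-" prefix and ".pickle" suffix are assumptions, not the PR's format.
    """
    return f"doc-{doc_id}{suffix}"


class BatchTrackerSketch:
    """Accepts at most `max_batch` unseen paths per batch and remembers them."""

    def __init__(self, max_batch: int) -> None:
        if max_batch < 1:
            raise ValueError("max_batch must be at least 1")
        self.max_batch = max_batch
        self.seen: Set[str] = set()        # IDs accepted in any batch so far
        self.current_batch: List[str] = []  # paths accepted in the current batch

    def reset_batch(self) -> None:
        """Start a new batch while keeping the record of accepted files."""
        self.current_batch = []

    def filter(self, path: str) -> bool:
        """Return True if `path` should be read as part of the current batch."""
        doc_id = docid_from_path_sketch(path)
        if doc_id in self.seen:
            return False  # already handled in an earlier (or this) batch
        if len(self.current_batch) >= self.max_batch:
            return False  # batch is full; leave this file for a later batch
        # Simplification: a file counts as processed as soon as it is accepted.
        self.seen.add(doc_id)
        self.current_batch.append(path)
        return True


def run_in_batches(paths: Iterable[str], tracker: BatchTrackerSketch) -> List[List[str]]:
    """Re-scan `paths` repeatedly until no unseen file remains."""
    all_paths = list(paths)
    batches: List[List[str]] = []
    while True:
        tracker.reset_batch()
        batch = [p for p in all_paths if tracker.filter(p)]
        if not batch:
            break
        batches.append(batch)
    return batches


if __name__ == "__main__":
    tracker = BatchTrackerSketch(max_batch=2)
    print(run_in_batches(["a", "b", "c", "d", "e"], tracker))
```

In this sketch the state that survives a batch reset is the set of accepted IDs, so each pass over the same file listing picks up only files that earlier passes skipped. A real implementation would likely mark a file as processed only after its batch succeeds.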

* Initial dev

* Remove debugging code

* Remove old code which passed reliability object to context

* Add exception handling after all files processed on ray, lint fix

* Add unit tests

* Switch to using Path Partition Filter

* Add logging, fix assertions in tests

* lint

* refactor tests

* mypy fix, make tests efficient, uniform naming convention

* Remove print

* Change func call from merge

* Better docs and logging

* Yet better docs

* nits

* lint smh

* Address comments
@dhruvkaliraman7 dhruvkaliraman7 merged commit ccd78b7 into main Feb 6, 2025
12 of 15 checks passed
@dhruvkaliraman7 dhruvkaliraman7 deleted the Add-Materialize-Read-Reliability branch February 6, 2025 01:48
3 participants
