Initial attempt at straightforward document processing script. #731
This is a proposal/example, not meant to be checked in.
The basic idea here is not only to bypass Ray but also to avoid the lazy-evaluated pipeline abstraction. Instead, the code is written the way a typical programmer would expect to write it. This approach is synchronous rather than functional and allows different documents to be treated differently on the fly.
Instead of DocSet, we deal with a list of Document objects. DocSet confuses people because it's not a set of documents.
One finding is that most existing transforms would be easier to use as simple functions. Then they could be the target of "map", either directly or via DocSet.
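To illustrate the idea, here is a minimal sketch of transforms written as plain functions over a list of documents. It does not use the Sycamore API: `Doc`, `normalize_whitespace`, and `tag_length` are hypothetical stand-ins invented for this example, and the length threshold is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document; NOT the real Sycamore Document."""
    text: str
    properties: Dict[str, Any] = field(default_factory=dict)


def normalize_whitespace(doc: Doc) -> Doc:
    """Plain-function transform: collapse runs of whitespace."""
    doc.text = " ".join(doc.text.split())
    return doc


def tag_length(doc: Doc) -> Doc:
    """Plain-function transform: record the text length as a property."""
    doc.properties["length"] = len(doc.text)
    return doc


docs: List[Doc] = [
    Doc("  hello   world "),
    Doc("short"),
    Doc("a  much   longer   document body"),
]

# A plain function can be the direct target of the built-in map.
docs = list(map(normalize_whitespace, docs))

# Synchronous processing: each document can be handled differently on the fly.
processed: List[Doc] = []
for doc in docs:
    if len(doc.text) > 10:  # arbitrary threshold, for illustration only
        doc = tag_length(doc)
    processed.append(doc)
```

Because each transform takes one document and returns one document, the same function could in principle be handed to a DocSet-style "map" or called in an ordinary loop, which is the flexibility described above.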
This code represents the exercise of simplifying without modifying Sycamore. The function `iterInputs` would be intended as an addition to the Sycamore library. There are FIXME comments for how Sycamore could become easier to use directly.

The remaining piece would be a way to encapsulate common processing sequences into higher-level single calls. We could do this generally, or just provide some off-the-shelf. This may turn into an exercise in naming.
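One possible shape for the general version is a small helper that chains single-document functions into one callable. This is only a sketch under assumptions: `make_sequence`, `strip_text`, `add_word_count`, and `clean_and_count` are hypothetical names, documents are modeled as plain dicts, and nothing here exists in Sycamore.

```python
from functools import reduce
from typing import Any, Callable, Dict

DocDict = Dict[str, Any]  # hypothetical document representation
Transform = Callable[[DocDict], DocDict]


def make_sequence(*steps: Transform) -> Transform:
    """Combine several single-document transforms into one call."""
    def run(doc: DocDict) -> DocDict:
        # Apply each step in order, feeding the result to the next step.
        return reduce(lambda current, step: step(current), steps, doc)
    return run


def strip_text(doc: DocDict) -> DocDict:
    """Return a copy of the document with surrounding whitespace removed."""
    return {**doc, "text": doc["text"].strip()}


def add_word_count(doc: DocDict) -> DocDict:
    """Return a copy of the document with a word_count field added."""
    return {**doc, "word_count": len(doc["text"].split())}


# An "off-the-shelf" sequence is then just a named composition.
clean_and_count = make_sequence(strip_text, add_word_count)

result = clean_and_count({"text": "  two words "})
```

With this shape, the off-the-shelf option amounts to shipping a few named compositions, so most of the remaining design work would indeed be choosing good names.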