Initial attempt at straightforward document processing script. #731

alexaryn · 2024-08-27T18:48:20Z

This is a proposal/example, not meant to be checked in.

The basic idea here is not only to bypass Ray, but also to avoid the lazy-evaluated pipeline abstraction. Instead, it's coded the way a typical programmer would expect to write the code. This approach is synchronous rather than functional and allows different documents to be treated differently on the fly.

Instead of DocSet, we deal with a list of Document. DocSet confuses people because it's not a set of documents.

One finding is that most existing transforms would be easier to use as simple functions. Then they could be the target of "map", either directly or via DocSet.
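As a rough illustration of transforms as plain functions, here is a self-contained sketch. The `Doc` dataclass and the `normalize_text` and `tag_long` functions are hypothetical stand-ins, not Sycamore's `Document` or any existing transform. The only point is that a function taking and returning a document works with the built-in `map` and with an ordinary loop that branches per document.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Sycamore's Document class."""
    text: str
    properties: dict = field(default_factory=dict)


def normalize_text(doc: Doc) -> Doc:
    """Collapse runs of whitespace in the document text."""
    doc.text = " ".join(doc.text.split())
    return doc


def tag_long(doc: Doc, threshold: int = 20) -> Doc:
    """Record whether the document text exceeds a length threshold."""
    doc.properties["long"] = len(doc.text) > threshold
    return doc


docs = [Doc("  short   one "), Doc("a much   longer piece of   text here")]

# A plain function is directly usable as the target of map.
docs = list(map(normalize_text, docs))

# Synchronous code can treat documents differently on the fly.
for doc in docs:
    if doc.text.startswith("a much"):
        tag_long(doc)
```

Because each step is an ordinary function, it can be called on a single document in a debugger or a test without building a pipeline first.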

This code represents the exercise of simplifying without modifying Sycamore. The function iterInputs is intended as an addition to the Sycamore library. There are FIXME comments for how Sycamore could become easier to use directly.

The remaining piece would be a way to encapsulate common processing sequences into higher-level single calls. We could do this generally, or just provide some off-the-shelf ones. This may turn into an exercise in naming.
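One possible shape for the "general" option is a small helper that bundles per-item steps into a single callable. This is only a sketch: `sequence`, `strip_ws`, and `lower` are hypothetical names, and the steps operate on plain strings rather than Sycamore documents to keep the example self-contained.

```python
from functools import reduce
from typing import Callable, TypeVar

T = TypeVar("T")


def sequence(*steps: Callable[[T], T]) -> Callable[[T], T]:
    """Bundle several single-item steps into one callable, applied left to right."""
    def run(item: T) -> T:
        return reduce(lambda acc, step: step(acc), steps, item)
    return run


# Hypothetical steps on plain strings, for illustration only.
def strip_ws(s: str) -> str:
    return s.strip()


def lower(s: str) -> str:
    return s.lower()


# The bundled sequence is itself a plain function, so it can be named and reused.
clean = sequence(strip_ws, lower)
results = [clean(s) for s in ["  Hello ", "WORLD"]]
```

Off-the-shelf sequences would then just be named results of `sequence(...)`, which is where the naming exercise comes in.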

eric-anderson

In general, I prefer the local_mode (ctx = sycamore.init(exec_mode=ExecMode.LOCAL)) approach for three reasons:

1. If you need to scale up, it's easy to switch it over to ray mode
2. We can in the future add multiprocessing support to get more speed
3. It preserves all of the metadata so the reliability work will be able to happen

That said, there is clearly a need for some rayless thing as people are starting to use local mode before it's really ready, and you ended up writing this example.

eric-anderson · 2024-08-27T18:55:13Z

examples/direct.py

+
+###############################################################################
+
+def iterInputs(inputs: list[str], aws_sess = None) -> Iterator[BinaryIO]:


You can replace all of this with
docs = BinaryScan(paths=inputs).local_source()
once https://github.com/aryn-ai/sycamore/pull/712/files is in.
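For context, the body of `iterInputs` is not shown in this hunk. A minimal local-only sketch of a function with a similar signature might look like the following. It assumes every input is a local file or directory path and has no equivalent of `aws_sess`; the name `iter_local_inputs` is hypothetical and this is not the code from the PR.

```python
import os
from typing import BinaryIO, Iterator


def iter_local_inputs(inputs: list[str]) -> Iterator[BinaryIO]:
    """Yield an open binary file for each file found under the given paths.

    Hypothetical local-only analogue of iterInputs. Directories are walked
    recursively, and each file is closed once the consumer asks for the next.
    """
    for path in inputs:
        if os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                for name in sorted(files):
                    with open(os.path.join(root, name), "rb") as fh:
                        yield fh
        else:
            with open(path, "rb") as fh:
                yield fh
```

A caller would consume it with an ordinary loop, for example `for fh in iter_local_inputs(["docs/"]): data = fh.read()`, reading each handle before advancing since it is closed afterwards.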

jonfritz · 2024-08-28T18:34:12Z

Do we need to get this group together to make a call on the approach?

In examples/direct.py:

+import boto3
+import pyarrow.fs
+
+import aryn_sdk.partition
+import sycamore
+from sycamore.transforms.embed import SentenceTransformerEmbedder
+from sycamore.transforms.sketcher import Sketcher
+from sycamore.connectors.duckdb.duckdb_writer import (
+    DuckDBWriterClientParams,
+    DuckDBWriterTargetParams,
+    DuckDBWriter,
+)
+
+###############################################################################
+
+def iterInputs(inputs: list[str], aws_sess = None) -> Iterator[BinaryIO]:

You can replace all of this with
docs = BinaryScan(paths=inputs).local_source()
once https://github.com/aryn-ai/sycamore/pull/712/files is in.

eric-anderson · 2025-02-04T00:31:30Z

Going to close this after 2025-02-10 unless there is action on it.

Initial attempt at straightforward document processing script.

cb9875c

alexaryn requested review from HenryL27, bsowell, eric-anderson and jonfritz August 27, 2024 18:48

eric-anderson reviewed Aug 27, 2024

eric-anderson mentioned this pull request Feb 4, 2025

Add document for Docset readers and writers. #196

Open
