First part of reliable opensearch writing -- handles new items and missing metadata on source or destination. #1335
Conversation
Handles new items and missing metadata on source or destination.
* Minor refactoring of opensearch_writer.py and utils.py to enable re-use.
* Manual test to run against real opensearch and experiment.
* Unit test to exercise implementation.
Next part will handle deletion/update in the source.
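For intuition, here is a minimal sketch of the "handle new items" idea, written against opensearch-py. It is illustrative only, not the PR's implementation; the source_docs iterable and the doc_id field are assumptions for the example.

# Illustrative sketch only -- not the PR's code. Assumes each source
# document is a dict carrying a "doc_id" field (an assumption here).
from opensearchpy import OpenSearch, helpers

def sync_new_items(client: OpenSearch, index: str, source_docs) -> None:
    # Collect the ids already present in the destination index.
    existing = {
        hit["_id"] for hit in helpers.scan(client, index=index, _source=False)
    }
    # Bulk-index only the documents missing from the destination.
    actions = (
        {"_op_type": "index", "_index": index, "_id": d["doc_id"], "_source": d}
        for d in source_docs
        if d["doc_id"] not in existing
    )
    helpers.bulk(client, actions)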
I went through this. I think since it's a separate connector, I'm not super worried -- we will be able to adjust as we add the other features like delete.
I was a bit confused about how to use this with DocSets like the other connectors. Is that possible/coming later?
        to_be_loaded_groups[i].append(f)

        for i, g in enumerate(to_be_loaded_groups):
            root, splitter = self.sources[i]
I've spent a while trying to figure out exactly what a splitter is supposed to be in this context -- perhaps just because "splitter" is used elsewhere in sycamore. In this case, would "explode" be the canonical splitter in, e.g., our standard docstore ingestion pipeline?
Added an explanation of splitter to the class, and added a pointer where you asked.
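Since the thread doesn't show the added docstring, here is a hypothetical illustration of what a splitter could look like under these assumptions: a callable that turns one materialized document into the parts to be written, explode-style.

# Hypothetical illustration only -- the PR's actual splitter contract may
# differ. Explode-style: emit the parent, then each element as its own part.
from typing import Callable, Iterable

Splitter = Callable[[dict], Iterable[dict]]

def explode_splitter(doc: dict) -> Iterable[dict]:
    # Emit the parent document itself...
    yield doc
    # ...then each element as its own part, tagged with its parent's id.
    for elem in doc.get("elements", []):
        yield {**elem, "parent_id": doc["doc_id"]}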
# Todo accept sources as docset and require it end with materialize.
class OpenSearchSync:
Maybe this is what the todo is referring to, but I'm a bit confused about the interface here. This doesn't look like our other connectors. How do I use it if I have a DocSet and want to write to OpenSearch?
Right now, you would materialize out, execute and run the connector. In the future I expect to add something that verifies you materialized as the last step, and then executes that and runs the reliable write. So you'd be able to write:
sycamore.init().read.whatever().ops().materialize("/tmp/example").write.reliable_opensearch(params)
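Spelled out, the current two-step flow would look roughly like the sketch below. The import path for OpenSearchSync and its constructor and sync() method are assumptions about the new connector's interface, not the actual API.

# Sketch of the current two-step flow; OpenSearchSync's import path,
# constructor arguments, and sync() method are assumptions.
import sycamore
from sycamore.connectors.opensearch import OpenSearchSync  # hypothetical path

# Step 1: run the pipeline and materialize as the last step.
ctx = sycamore.init()
ds = ctx.read.binary("/data/docs", binary_format="pdf")
ds.materialize(path="/tmp/example").execute()

# Step 2: point the reliable writer at the materialized output.
sync = OpenSearchSync(
    sources=[("/tmp/example", lambda doc: [doc])],  # (root, splitter) pairs
    os_client_args={"hosts": [{"host": "localhost", "port": 9200}]},
    index_name="my_index",
)
sync.sync()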
        return pid_to_parts

    def os_client(self):
        assert False
Is this left over from debugging?
Yes, and none of the tests use real opensearch, so it wasn't caught.
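One way the placeholder could be replaced, assuming opensearch-py is the client library and that the connection arguments live on self (both assumptions, not the PR's code):

# Sketch: lazily build and cache a real client instead of `assert False`.
# Assumes self._os_client_args holds OpenSearch(...) keyword arguments.
from opensearchpy import OpenSearch

def os_client(self):
    if getattr(self, "_client", None) is None:
        self._client = OpenSearch(**self._os_client_args)
    return self._client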