First part of reliable opensearch writing -- handles new items and missing metadata on source or destination. #1335
Conversation
Handles new items and missing metadata on source or destination.
* Minor refactoring of opensearch_writer.py and utils.py to enable re-use.
* Manual test to run against real opensearch and experiment.
* Unit test to exercise implementation.
Next part will handle deletion/update in the source.
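For intuition, here is a minimal sketch of the "handle new items" idea, written against opensearch-py. It is illustrative only, not the PR's implementation; the source_docs iterable and the doc_id field are assumptions for the example.

# Illustrative sketch only -- not the PR's code. Assumes each source
# document is a dict carrying a "doc_id" field (an assumption here).
from opensearchpy import OpenSearch, helpers

def sync_new_items(client: OpenSearch, index: str, source_docs) -> None:
    # Collect the ids already present in the destination index.
    existing = {
        hit["_id"] for hit in helpers.scan(client, index=index, _source=False)
    }
    # Bulk-index only the documents missing from the destination.
    actions = (
        {"_op_type": "index", "_index": index, "_id": d["doc_id"], "_source": d}
        for d in source_docs
        if d["doc_id"] not in existing
    )
    helpers.bulk(client, actions)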
I went through this. I think since it's a separate connector, I'm not super worried -- we will be able to adjust as we add the other features like delete.
I was a bit confused about how to use this with DocSets like the other connectors. Is that possible/coming later?
        to_be_loaded_groups[i].append(f)

        for i, g in enumerate(to_be_loaded_groups):
            root, splitter = self.sources[i]
I've spent a while trying to figure out exactly what a splitter is supposed to be in this context -- perhaps just because "splitter" is used elsewhere in sycamore. In this case, would "explode" be the canonical splitter in, e.g., our standard docstore ingestion pipeline?
Added an explanation of splitter to the class, and added a pointer where you asked.
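Since the thread doesn't show the added docstring, here is a hypothetical illustration of what a splitter could look like under these assumptions: a callable that turns one materialized document into the parts to be written, explode-style.

# Hypothetical illustration only -- the PR's actual splitter contract may
# differ. Explode-style: emit the parent, then each element as its own part.
from typing import Callable, Iterable

Splitter = Callable[[dict], Iterable[dict]]

def explode_splitter(doc: dict) -> Iterable[dict]:
    # Emit the parent document itself...
    yield doc
    # ...then each element as its own part, tagged with its parent's id.
    for elem in doc.get("elements", []):
        yield {**elem, "parent_id": doc["doc_id"]}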
# Todo accept sources as docset and require it end with materialize.
class OpenSearchSync:
Maybe this is what the todo is referring to, but I'm a bit confused about the interface here. This doesn't look like our other connectors. How do I use it if I have a DocSet and want to write to OpenSearch?
Right now, you would materialize out, execute and run the connector. In the future I expect to add something that verifies you materialized as the last step, and then executes that and runs the reliable write. So you'd be able to write:
sycamore.init().read.whatever().ops().materialize("/tmp/example").write.reliable_opensearch(params)
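Spelled out, the current two-step flow would look roughly like the sketch below. The import path for OpenSearchSync and its constructor and sync() method are assumptions about the new connector's interface, not the actual API.

# Sketch of the current two-step flow; OpenSearchSync's import path,
# constructor arguments, and sync() method are assumptions.
import sycamore
from sycamore.connectors.opensearch import OpenSearchSync  # hypothetical path

# Step 1: run the pipeline and materialize as the last step.
ctx = sycamore.init()
ds = ctx.read.binary("/data/docs", binary_format="pdf")
ds.materialize(path="/tmp/example").execute()

# Step 2: point the reliable writer at the materialized output.
sync = OpenSearchSync(
    sources=[("/tmp/example", lambda doc: [doc])],  # (root, splitter) pairs
    os_client_args={"hosts": [{"host": "localhost", "port": 9200}]},
    index_name="my_index",
)
sync.sync()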
        return pid_to_parts

    def os_client(self):
        assert False
Is this left over from debugging?
Yes, and none of the tests use real opensearch, so it wasn't caught.
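One way the placeholder could be replaced, assuming opensearch-py is the client library and that the connection arguments live on self (both assumptions, not the PR's code):

# Sketch: lazily build and cache a real client instead of `assert False`.
# Assumes self._os_client_args holds OpenSearch(...) keyword arguments.
from opensearchpy import OpenSearch

def os_client(self):
    if getattr(self, "_client", None) is None:
        self._client = OpenSearch(**self._os_client_args)
    return self._client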