-
Notifications
You must be signed in to change notification settings - Fork 65
Adding clustering and groupby in luna planner part #1183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, some clarifying questions
if "lineage_id" not in self.data: | ||
self.update_lineage_id() | ||
|
||
def get_by_path(self, path: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the same as field_to_value(..)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems similar, would try to reuse that one.
lib/sycamore/sycamore/docset.py
Outdated
return {"vector": doc.embedding, "cluster": -1} if field_name is None else {"vector": doc[field_name], "cluster": -1} | ||
|
||
embeddings = self.plan.execute().map(init_embedding).materialize() | ||
embeddings = self.plan.execute().filter(filter_meta).map(init_embedding).materialize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the materialize for? also can we not do
self.plan.filter(filter_meta).map(init_embedding).execute()
I think that'll use ray versions of the ops right? so theoretically faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also just pull our the filter_meta method somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is for reuse the embeddings, otherwise, just compute twice from beginning, would have one option for configuring this.
dataset = self._docset.plan.execute() | ||
grouped = dataset.map(Document.from_row).groupby(self._key) | ||
|
||
def filter_meta(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pull out somewhere
raise Exception("New Top K not implemented for codegen") | ||
|
||
|
||
class SycamoreEmbed(SycamoreOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this for? don't think the planner uses it. My current thought is this (embedding) should just be an implementation details within group_by etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inside groupby just means every time run topk, it would do this embedding over all items. Not sure about what role the Luna plays, does it also write?
return result, [] | ||
|
||
|
||
class SycamoreNewTopK(SycamoreOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss the new operators we want to add.
I'd imagine we want: topK, groupBy
We should just be able to nuke the existing topK rather than adding an new one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call this GroupByCount
could be 'Form groups of different food'""" | ||
|
||
|
||
class NewTopK(Node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GroupByCount
4985dd6
to
6a1694f
Compare
No description provided.