Conversation

bohou-aryn
Collaborator

No description provided.

Contributor

baitsguy left a comment

Code looks good, some clarifying questions

if "lineage_id" not in self.data:
    self.update_lineage_id()

def get_by_path(self, path: str):

Contributor

is this the same as field_to_value(..)?

Collaborator Author

Seems similar; I'll try to reuse that one.
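
For context, a path-based lookup of the kind discussed here could look roughly like the sketch below. It is only an illustration: the name `get_by_path_sketch`, the dot-separated path convention, and the sample document are assumptions, not the actual `get_by_path` or `field_to_value` implementations.

```python
from collections.abc import Mapping
from typing import Any


def get_by_path_sketch(data: Mapping, path: str, default: Any = None) -> Any:
    """Walk a nested mapping along a dot-separated path (hypothetical helper)."""
    current: Any = data
    for key in path.split("."):
        # Stop early if we hit a non-mapping value or a missing key.
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Made-up document used only to exercise the helper.
doc = {"lineage_id": "abc", "properties": {"entity": {"state": "CA"}}}
state = get_by_path_sketch(doc, "properties.entity.state")
missing = get_by_path_sketch(doc, "properties.entity.zip")
```

If the existing helper already covers this behavior, reusing it avoids keeping two lookup conventions in sync.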

return {"vector": doc.embedding, "cluster": -1} if field_name is None else {"vector": doc[field_name], "cluster": -1}

- embeddings = self.plan.execute().map(init_embedding).materialize()
+ embeddings = self.plan.execute().filter(filter_meta).map(init_embedding).materialize()

Contributor

what's the materialize for? also can we not do

self.plan.filter(filter_meta).map(init_embedding).execute()

I think that will use the Ray versions of the ops, right? So it should theoretically be faster.

Contributor

You can also just pull out the filter_meta method somewhere

Collaborator Author

That is for reusing the embeddings; otherwise they would just be computed twice from the beginning. I will add an option for configuring this.
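
The trade-off in this thread (re-running a lazy plan versus keeping its result) can be sketched with a toy pipeline. Everything here is hypothetical: `LazyPlan` is a stand-in written for this example and is not the Sycamore or Ray API, and `has_embedding` is an assumed placeholder for `filter_meta`, whose body is not shown in the diff.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class LazyPlan:
    """Toy lazy pipeline (hypothetical): ops are recorded and run only on execute()."""

    def __init__(self, source: List[Row], ops: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._ops = ops

    def filter(self, pred: Callable[[Row], bool]) -> "LazyPlan":
        return LazyPlan(self._source, self._ops + (("filter", pred),))

    def map(self, fn: Callable[[Row], Row]) -> "LazyPlan":
        return LazyPlan(self._source, self._ops + (("map", fn),))

    def execute(self) -> List[Row]:
        # Every call replays all recorded ops from the source.
        rows = list(self._source)
        for kind, fn in self._ops:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


calls = {"embed": 0}  # counts how often the expensive step runs


def has_embedding(row: Row) -> bool:
    # Assumed stand-in for filter_meta.
    return row.get("embedding") is not None


def init_embedding(row: Row) -> Row:
    calls["embed"] += 1
    return {"vector": row["embedding"], "cluster": -1}


rows = [{"embedding": [0.1, 0.2]}, {"embedding": None}, {"embedding": [0.3, 0.4]}]
plan = LazyPlan(rows).filter(has_embedding).map(init_embedding)

# Without keeping the result, each consumer re-runs the whole plan.
plan.execute()
plan.execute()
calls_when_recomputing = calls["embed"]

# Keeping the executed result lets later steps reuse it.
calls["embed"] = 0
cached = plan.execute()
first_use = [r["vector"] for r in cached]
second_use = [r["cluster"] for r in cached]
calls_when_cached = calls["embed"]
```

In this toy version the filter is applied before the map, so rows without an embedding never reach the expensive step. Whether the real plan-level operators behave the same way is the question raised above.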

dataset = self._docset.plan.execute()
grouped = dataset.map(Document.from_row).groupby(self._key)

def filter_meta(row):

Contributor

pull out somewhere

raise Exception("New Top K not implemented for codegen")


class SycamoreEmbed(SycamoreOperator):

Contributor

What's this for? I don't think the planner uses it. My current thought is that this (embedding) should just be an implementation detail within group_by etc.?

Collaborator Author

Putting it inside groupby just means that every time topk runs, it would redo this embedding over all items. I'm not sure what role Luna plays; does it also write?

return result, []


class SycamoreNewTopK(SycamoreOperator):

Contributor

Let's discuss the new operators we want to add.

I'd imagine we want: topK, groupBy

We should just be able to nuke the existing topK rather than adding a new one

Contributor

let's call this GroupByCount
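
As a rough illustration of what a "group by, then count" operation computes, here is a minimal sketch. It assumes exact matching on a field value; the operator in this PR appears to group via embeddings and clusters, which this sketch does not attempt. The function name `group_by_count`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional, Tuple


def group_by_count(
    docs: Iterable[Dict[str, Any]], key: str, top_k: Optional[int] = None
) -> List[Tuple[Any, int]]:
    """Count documents per value of `key`, largest groups first (hypothetical helper)."""
    counts = Counter(doc[key] for doc in docs if key in doc)
    # most_common(None) returns every group; most_common(k) returns the k largest.
    return counts.most_common(top_k)


# Made-up data echoing the "groups of different food" example.
docs = [
    {"food": "pizza"},
    {"food": "sushi"},
    {"food": "pizza"},
    {"food": "taco"},
    {"food": "pizza"},
    {"food": "sushi"},
]
all_groups = group_by_count(docs, "food")
top_two = group_by_count(docs, "food", top_k=2)
```

Seen this way, top-K is just a truncated view of the grouped counts, which is consistent with the suggestion to replace the existing topK rather than keep two operators.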

could be 'Form groups of different food'"""


class NewTopK(Node):

Contributor

GroupByCount

bohou-aryn force-pushed the groupby branch 2 times, most recently from 4985dd6 to 6a1694f, on February 19, 2025 19:41
bohou-aryn merged commit 5c1ce95 into main on Feb 19, 2025
10 of 15 checks passed
2 participants