bohou-aryn
Collaborator

Implement groupby on top of Ray Dataset's groupby, and show how general
entity clustering can be combined with k-means clustering.

Collaborator

HenryL27 left a comment

lgtm. What would it take to abstract this such that it works in both ray and local modes?

def init_embedding(row):
    doc = Document.from_row(row)
    return {"vector": doc.embedding, "cluster": -1}

Can you add an "assert self.context.exec_mode == ExecMode.RAY" in here?

return context.read.document(doc_list)

def test_groupby_count(self, fruits_docset):
    aggregated = fruits_docset.groupby("text_representation").count()

What do the Documents in aggregated look like at this point?

Comment on lines +39 to +41
def init(embeddings, K, init_mode):
    if init_mode == "random":

supernit: init_mode could be an Enum, but str is fine too. I guess it would be nice to have the list of known init_modes in the exception?
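
One way the Enum idea could look, as a minimal sketch rather than the code in this PR. `InitMode` and `parse_init_mode` are hypothetical names, and only the "random" mode is taken from the diff above. An unknown value raises a `ValueError` that names the known modes.

```python
from enum import Enum


class InitMode(str, Enum):
    """Hypothetical enum of supported centroid initialisation modes."""

    RANDOM = "random"  # the only mode visible in the diff above


def parse_init_mode(init_mode):
    """Validate init_mode, naming the known modes when it is not recognised."""
    try:
        return InitMode(init_mode)
    except ValueError:
        known = ", ".join(repr(m.value) for m in InitMode)
        raise ValueError(
            f"Unknown init_mode {init_mode!r}; known init_modes: {known}"
        ) from None
```

Because `InitMode` subclasses `str`, callers that already pass plain strings such as "random" would keep working.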

Comment on lines 1 to 2
from ray.data._internal.aggregate import Count
from ray.data.aggregate import AggregateFn

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    <ray imports>
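
A minimal sketch of the pattern suggested here, assuming the goal is to avoid importing ray at module load time. `make_row_counter` is a hypothetical helper, and the `AggregateFn` keyword arguments follow the Ray 2.x documentation as far as I know, so treat them as an assumption.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers; never executed at runtime.
    from ray.data.aggregate import AggregateFn


def make_row_counter() -> "AggregateFn":
    """Hypothetical helper that builds a row-counting aggregate."""
    # Deferred import: ray is loaded only when this function is called.
    from ray.data.aggregate import AggregateFn

    return AggregateFn(
        init=lambda key: 0,                       # per-group starting count
        accumulate_row=lambda acc, row: acc + 1,  # add one for every row
        merge=lambda a, b: a + b,                 # combine partial counts
        name="row_count",
    )
```

The string annotation keeps the signature checkable without requiring ray to be importable when the module is loaded.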


return DocSet(self._docset.context, DatasetScan(serialized))

def count(self) -> DocSet:

Import Count here.


@staticmethod
def update(embeddings, centroids, iterations, epsilon):
    i = 0

Import AggregateFn here.

bohou-aryn force-pushed the clustering branch 2 times, most recently from 8a6aadd to 3653e2f on January 31, 2025 22:32
This generally involves three steps:
1. Materialize each document's embedding.
2. Initialize the centroids randomly.
3. Iterate the k-means process until convergence; this is based on Ray
   Dataset's map_groups and aggregate operators.

The resulting centroids can be used for downstream work.
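
The three steps above can be sketched on a single machine as follows. This is a plain NumPy stand-in, not the Ray Dataset implementation in this PR; the function names, the `seed` parameter, and the convergence test (largest centroid shift below `epsilon`) are assumptions for illustration.

```python
import numpy as np


def init_centroids(embeddings, k, seed=0):
    """Step 2: pick k distinct rows at random as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=k, replace=False)
    return embeddings[idx].copy()


def assign(embeddings, centroids):
    """Return the index of the nearest centroid for every embedding."""
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)


def kmeans(embeddings, centroids, iterations, epsilon):
    """Step 3: repeat assign/update until the centroids move less than epsilon."""
    for _ in range(iterations):
        clusters = assign(embeddings, centroids)
        # Update: mean of each group; an empty group keeps its old centroid.
        updated = centroids.copy()
        for c in range(len(centroids)):
            members = embeddings[clusters == c]
            if len(members):
                updated[c] = members.mean(axis=0)
        shift = np.linalg.norm(updated - centroids, axis=1).max()
        centroids = updated
        if shift < epsilon:
            break
    return centroids, assign(embeddings, centroids)


# Step 1: materialize the embeddings as a float array (toy data here).
embeddings = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, clusters = kmeans(
    embeddings, init_centroids(embeddings, k=2), iterations=20, epsilon=1e-6
)
```

In the Ray-based version described above, the per-cluster mean would presumably be computed by a grouped aggregate rather than the inner Python loop.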
bohou-aryn force-pushed the clustering branch 2 times, most recently from 907625d to fdfc705 on February 3, 2025 21:16
bohou-aryn enabled auto-merge (rebase) February 3, 2025 21:40
bohou-aryn disabled auto-merge February 3, 2025 21:40
bohou-aryn merged commit 3c8831d into main on Feb 3, 2025
12 of 15 checks passed