Add a groupby operator #1123

bohou-aryn · 2025-01-23T18:04:38Z

Implement groupby based on ray dataset groupby and show how a general
entity clustering could be used together with kmeans clustering.

HenryL27

lgtm. What would it take to abstract this such that it works in both ray and local modes?
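For the local-mode question, a minimal sketch of an in-memory groupby-count, assuming documents can be treated as plain dicts. The function name `local_groupby_count` and the output shape (a `"count"` field next to the key) are hypothetical and are not taken from this PR.

```python
from collections import Counter
from typing import Any, Iterable


def local_groupby_count(docs: Iterable[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Count documents per distinct value of `key`, entirely in memory."""
    counts = Counter(doc[key] for doc in docs)
    # Sort by key so the output order is deterministic.
    return [{key: value, "count": n} for value, n in sorted(counts.items())]


# Hypothetical sample data, loosely modeled on the fruits fixture in the tests.
fruits = [
    {"text_representation": "apple"},
    {"text_representation": "banana"},
    {"text_representation": "apple"},
]
result = local_groupby_count(fruits, "text_representation")
```

A shared interface would presumably dispatch to something like this in local mode and to the Ray Dataset groupby otherwise.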

HenryL27 · 2025-01-30T21:18:49Z

lib/sycamore/sycamore/docset.py

+        def init_embedding(row):
+            doc = Document.from_row(row)
+            return {"vector": doc.embedding, "cluster": -1}
+


can you add an "assert self.context.exec_mode == ExecMode.RAY" in here?
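A rough sketch of the requested guard. `ExecMode` and `Context` below are stand-ins defined only for illustration (their members and fields are assumed), and `require_ray` is a hypothetical helper name.

```python
from enum import Enum


class ExecMode(Enum):  # stand-in for the project's enum; members are assumed
    LOCAL = 1
    RAY = 2


class Context:  # hypothetical minimal context holding only the execution mode
    def __init__(self, exec_mode: ExecMode) -> None:
        self.exec_mode = exec_mode


def require_ray(context: Context) -> None:
    """Fail fast when a Ray-only code path is reached in another mode."""
    assert context.exec_mode == ExecMode.RAY, (
        f"this operation only supports ExecMode.RAY, got {context.exec_mode}"
    )
```

Note that `assert` statements are stripped when Python runs with `-O`, so an explicit `raise` may be preferable if the check must always hold.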

HenryL27 · 2025-01-30T21:20:49Z

lib/sycamore/sycamore/tests/unit/test_grouped_data.py

+        return context.read.document(doc_list)
+
+    def test_groupby_count(self, fruits_docset):
+        aggregated = fruits_docset.groupby("text_representation").count()


What do the Documents in aggregated look like at this point?

HenryL27 · 2025-01-30T21:25:09Z

lib/sycamore/sycamore/transforms/clustering.py

+    def init(embeddings, K, init_mode):
+        if init_mode == "random":


supernit: init_mode could be an Enum but str is fine too. I guess would be nice to have the list of known init_modes in the exception?
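One possible shape for this, as a sketch: a `str`-backed Enum plus a validator that lists the known modes in the exception. `InitMode` and `parse_init_mode` are hypothetical names, and only "random" is included because it is the only mode visible in the diff.

```python
from enum import Enum


class InitMode(str, Enum):
    # Only "random" appears in the diff; other modes would be added here.
    RANDOM = "random"


def parse_init_mode(init_mode: str) -> InitMode:
    """Validate init_mode and list the known modes in the error message."""
    try:
        return InitMode(init_mode)
    except ValueError:
        known = ", ".join(m.value for m in InitMode)
        raise ValueError(
            f"Unknown init_mode {init_mode!r}; known init_modes: {known}"
        ) from None
```

Because the Enum subclasses `str`, callers that already pass plain strings such as `"random"` keep working.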

HenryL27 · 2025-01-31T20:02:58Z

lib/sycamore/sycamore/grouped_data.py

+from ray.data._internal.aggregate import Count
+from ray.data.aggregate import AggregateFn


from typing import TYPE_CHECKING
if TYPE_CHECKING:
    <ray imports>
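A sketch of the pattern being suggested, with the standard-library `fractions` module standing in for the ray imports so the example runs without Ray installed. The function `half` is purely illustrative.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; this import never runs at runtime.
    from fractions import Fraction  # stand-in for the ray imports


def half(x: int) -> "Fraction":  # string annotation, so no runtime lookup
    # Deferred import: the dependency loads only when this function runs.
    from fractions import Fraction

    return Fraction(x, 2)
```

With this layout, importing the module does not pull in the heavy dependency; it is loaded only when a function that needs it is called.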

HenryL27 · 2025-01-31T20:03:34Z

lib/sycamore/sycamore/grouped_data.py

+
+        return DocSet(self._docset.context, DatasetScan(serialized))
+
+    def count(self) -> DocSet:


import Count here

HenryL27 · 2025-01-31T20:03:59Z

lib/sycamore/sycamore/transforms/clustering.py

+
+    @staticmethod
+    def update(embeddings, centroids, iterations, epsilon):
+        i = 0


import AggregateFn here

This generally involves three steps: 1. materialize each document's embedding; 2. initialize the centroids randomly; 3. iterate the k-means process until convergence. The iteration is built on Ray Dataset map-groups and aggregate operators. The resulting centroids can be used for downstream work.
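The steps above can be sketched in plain Python. This is not the Ray-based implementation from the PR: it assumes the embeddings are already materialized as lists of floats (step 1), and the helper names `nearest`, `mean`, and `kmeans`, along with the default values, are hypothetical. The `iterations` and `epsilon` parameters mirror the names in the `update` signature from the diff.

```python
import math
import random
from typing import Sequence

Vector = Sequence[float]


def nearest(point: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def mean(points: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(embeddings, k, iterations=20, epsilon=1e-4, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct embeddings as the initial centroids.
    centroids = [list(p) for p in rng.sample(list(embeddings), k)]
    for _ in range(iterations):
        # Step 3a: group embeddings by their nearest centroid (the "map groups" role).
        groups = {}
        for p in embeddings:
            groups.setdefault(nearest(p, centroids), []).append(p)
        # Step 3b: aggregate each group into its mean; an empty cluster keeps its centroid.
        updated = [mean(groups[c]) if c in groups else centroids[c] for c in range(k)]
        # Stop once no centroid moved by more than epsilon.
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < epsilon:
            break
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans(points, k=2)
```

In a distributed version, the grouping and the per-group mean would be the parts delegated to the dataset's group and aggregate operators.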

bohou-aryn force-pushed the clustering branch from 955cd3c to db0d219 Compare January 27, 2025 21:43

bohou-aryn mentioned this pull request Jan 30, 2025

add groupby #1115

Closed

bohou-aryn requested review from HenryL27 and baitsguy January 30, 2025 20:53

HenryL27 approved these changes Jan 30, 2025

HenryL27 reviewed Jan 31, 2025

bohou-aryn force-pushed the clustering branch 2 times, most recently from 8a6aadd to 3653e2f Compare January 31, 2025 22:32

bohou-aryn force-pushed the clustering branch 2 times, most recently from 907625d to fdfc705 Compare February 3, 2025 21:16

bohou-aryn enabled auto-merge (rebase) February 3, 2025 21:40

bohou-aryn disabled auto-merge February 3, 2025 21:40

bohou-aryn force-pushed the clustering branch from fdfc705 to 59c08ad Compare February 3, 2025 21:52

Add a groupby operator

b25ee78

Implement groupby based on ray dataset groupby and show how a general entity clustering could be used together with kmeans clustering.

bohou-aryn force-pushed the clustering branch from 59c08ad to b25ee78 Compare February 3, 2025 22:05

bohou-aryn merged commit 3c8831d into main Feb 3, 2025
12 of 15 checks passed
