Add entity name in grouped result, also add materialize in groupbycount operator #1204

bohou-aryn · 2025-02-28T21:16:57Z

No description provided.

HenryL27

What is this trying to do?

HenryL27 · 2025-02-28T22:11:02Z

lib/sycamore/sycamore/docset.py

        return DocSet(self.context, queries)

-    def groupby(self, key: Union[str, list[str]]) -> "GroupedData":
+    def groupby(self, grouped_key: Union[str, list[str]], entity: str=None) -> "GroupedData":


What does this entity do?

It is the entity name. We need it to map each group back to the field that was embedded; otherwise the result contains only the count and the cluster id, because our groupby runs indirectly on the cluster id.

HenryL27 · 2025-02-28T22:15:24Z

lib/sycamore/sycamore/grouped_data.py

+            return result
+
+        grouped = dataset.filter(filter_meta).map(Document.from_row).groupby(self._grouped_key)
+        aggregated = grouped.map_groups(group_udf)


what happens if a group is bigger than the batch size?

That would overflow.

HenryL27 · 2025-02-28T22:22:37Z

lib/sycamore/sycamore/query/execution/sycamore_operator.py

        centroids = docset.kmeans(K=K * 2, field_name=embed_name)
        clustered = docset.clustering(centroids=centroids, cluster_field_name=cluster_field_name, field_name=embed_name)
-        result = clustered.groupby(cluster_field_name).count().sort(descending, "properties.count", 0).limit(K)
+        result = clustered.groupby(cluster_field_name, entity_name.split(".")[-1]).count().sort(descending, "properties.count", 0).limit(K)


Why split the key and take only the last segment?

Our entity name lives inside properties.

so entity_name is something like "properties.field"? What about "properties.nested.field"?
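To make the nested-key question concrete, here is a minimal sketch in plain Python with no Sycamore imports. `last_segment` mirrors the `entity_name.split(".")[-1]` expression from the diff; `get_by_path` is a hypothetical helper, not something in the library, and the sample document is made up for illustration.

```python
def last_segment(entity_name: str) -> str:
    """Mirror the expression in the diff: keep only the final path segment."""
    return entity_name.split(".")[-1]


def get_by_path(doc: dict, dotted: str):
    """Hypothetical helper: walk a nested dict one dotted segment at a time."""
    current = doc
    for part in dotted.split("."):
        current = current[part]
    return current


# Made-up document with a top-level and a nested "field" under properties.
doc = {"properties": {"field": "a", "nested": {"field": "b"}}}

flat_key = last_segment("properties.field")
nested_key = last_segment("properties.nested.field")

# Both paths collapse to the same key, so a lookup directly in
# doc["properties"] returns the top-level value in both cases.
flat_value = doc["properties"][flat_key]
nested_lookup = doc["properties"][nested_key]

# Walking the full dotted path reaches the nested value instead.
full_walk = get_by_path(doc, "properties.nested.field")
```

With this input the last-segment lookup and the full walk disagree for the nested path, which is the case the question above is asking about.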

HenryL27 · 2025-02-28T22:26:31Z

lib/sycamore/sycamore/query/execution/sycamore_operator.py

+        import tempfile
+        temp_dir = tempfile.mkdtemp()
+
+        docset = self.inputs[0].embed(embedder).materialize(path=f"{temp_dir}", source_mode=MATERIALIZE_USE_STORED)


One of the trickinesses I faced using this the other day was getting the embedder to embed the field I wanted to cluster by - I had to sneak in a pre_process_document hook to the embedder object and also temporarily hide the element text representations and stick in a dummy document text representation to get it to do the right thing. Is that solved?

Yes, that is mitigated in the embedder: you can point it at some other field to embed instead of the text_representation.
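On the `tempfile.mkdtemp()` part of this hunk: the standard library leaves deletion of that directory to the caller. The sketch below uses only the standard library and shows one way the directory could be cleaned up once it is no longer needed. The file write is a placeholder for the materialize step, and the file name `doc.bin` is an assumption for illustration.

```python
import os
import shutil
import tempfile

# mkdtemp creates the directory and returns its path; nothing removes it later.
temp_dir = tempfile.mkdtemp()
created = os.path.isdir(temp_dir)

try:
    # Placeholder for the materialize step: write something into the directory.
    with open(os.path.join(temp_dir, "doc.bin"), "wb") as f:
        f.write(b"materialized")
    populated = os.listdir(temp_dir)
finally:
    # Explicit cleanup; without this the directory stays on disk.
    shutil.rmtree(temp_dir)

removed = not os.path.exists(temp_dir)
```

Where the cleanup belongs in the operator depends on when the docset is actually executed, which this sketch does not model.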

HenryL27 · 2025-02-28T22:27:54Z

lib/sycamore/sycamore/grouped_data.py

+        def group_udf(batch):
+            import numpy as np
+            result = {"count": np.array([len(batch["properties"])])}
+            if self._entity:
+                result[self._entity] = np.array([batch["properties"][0][self._entity]])
+            return result


This is putting the first instance of the entity in the result? Why is this useful?

For each group, we pick one entity from that group to represent it.
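A minimal sketch of that behavior in plain Python, assuming rows shaped like `{"properties": {...}}`. `count_groups` is a hypothetical stand-in for the `group_udf` in the diff, not the Sycamore or Ray implementation, and the sample rows are made up.

```python
from collections import defaultdict


def count_groups(rows, group_key, entity=None):
    """Count rows per group; optionally keep the first entity value seen."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["properties"][group_key]].append(row)

    results = []
    for key, members in groups.items():
        result = {group_key: key, "count": len(members)}
        if entity:
            # Same choice as group_udf: the first member stands in for the group.
            result[entity] = members[0]["properties"][entity]
        results.append(result)
    return results


rows = [
    {"properties": {"cluster": 0, "name": "apple"}},
    {"properties": {"cluster": 1, "name": "truck"}},
    {"properties": {"cluster": 0, "name": "pear"}},
]
counts = count_groups(rows, "cluster", entity="name")
```

Here cluster 0 is labeled "apple" only because that row comes first in the input; in a distributed run, which member counts as "first" may depend on row order.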

bohou-aryn requested review from HenryL27 and bsowell February 28, 2025 21:16

HenryL27 reviewed Feb 28, 2025

bohou-aryn force-pushed the groubycount branch 2 times, most recently from 545db54 to b96a17d Compare March 3, 2025 20:50

HenryL27 approved these changes Mar 3, 2025

bohou-aryn force-pushed the groubycount branch from b96a17d to a090865 Compare March 3, 2025 22:05

bohou-aryn enabled auto-merge (rebase) March 3, 2025 22:20

Add entity name in grouped result, also add materialize in groupbycount operator. (471cc2b)

bohou-aryn force-pushed the groubycount branch from a090865 to 471cc2b Compare March 3, 2025 22:35

bohou-aryn merged commit 21c9d8e into main Mar 3, 2025
12 of 15 checks passed

2 participants