Conversation

bohou-aryn
Collaborator

No description provided.

Collaborator

@HenryL27 HenryL27 left a comment

What is this trying to do?

return DocSet(self.context, queries)

- def groupby(self, key: Union[str, list[str]]) -> "GroupedData":
+ def groupby(self, grouped_key: Union[str, list[str]], entity: str=None) -> "GroupedData":
Collaborator

What does the entity parameter do?

Collaborator Author

It is the entity name. We need it to map each group back to the field that was embedded; otherwise the result contains only the count and the cluster ID, because our groupby operates indirectly on the cluster ID.
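
To illustrate the reply above, here is a minimal plain-Python sketch (not Sycamore code) of why grouping on the cluster ID alone yields only counts, and how carrying one entity value per group makes the result readable. The sample rows, the `vendor` field, and the helper `count_by_cluster` are all hypothetical.

```python
def count_by_cluster(rows, cluster_key, entity_key=None):
    """Group rows by cluster_key and count them.

    If entity_key is given, also keep the entity value of the first row
    seen in each group, so the group can be traced back to real data.
    """
    groups = {}
    for row in rows:
        cid = row[cluster_key]
        if cid not in groups:
            group = {cluster_key: cid, "count": 0}
            if entity_key is not None:
                group[entity_key] = row[entity_key]
            groups[cid] = group
        groups[cid]["count"] += 1
    return list(groups.values())


# Hypothetical clustered rows: similar vendor names share a cluster ID.
rows = [
    {"cluster_id": 0, "vendor": "Acme Corp"},
    {"cluster_id": 0, "vendor": "ACME Corporation"},
    {"cluster_id": 1, "vendor": "Globex"},
]

without_entity = count_by_cluster(rows, "cluster_id")
with_entity = count_by_cluster(rows, "cluster_id", "vendor")
```

Without the entity, each group is just a cluster ID and a count; with it, each group also carries one of the original values.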

return result

grouped = dataset.filter(filter_meta).map(Document.from_row).groupby(self._grouped_key)
aggregated = grouped.map_groups(group_udf)
Collaborator

What happens if a group is bigger than the batch size?

Collaborator Author

That would overflow.
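
The exchange above does not say exactly what overflows. If the concern is a group that does not fit in a single batch, one general alternative is to accumulate counts incrementally so that no group ever has to be held in memory at once. The sketch below is plain Python under that assumption; it is not what the PR does, and `in_batches` and `count_groups_in_batches` are hypothetical helpers.

```python
from collections import Counter


def in_batches(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def count_groups_in_batches(batches, key):
    """Count rows per group across batches.

    Only the running counts are kept, so a group larger than one batch
    is simply counted across several batches.
    """
    counts = Counter()
    for batch in batches:
        counts.update(row[key] for row in batch)
    return counts


# Cluster 0 has five rows, which is larger than the batch size of two.
rows = [{"cluster_id": 0}] * 5 + [{"cluster_id": 1}] * 2
counts = count_groups_in_batches(in_batches(rows, batch_size=2), "cluster_id")
```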

centroids = docset.kmeans(K=K * 2, field_name=embed_name)
clustered = docset.clustering(centroids=centroids, cluster_field_name=cluster_field_name, field_name=embed_name)
- result = clustered.groupby(cluster_field_name).count().sort(descending, "properties.count", 0).limit(K)
+ result = clustered.groupby(cluster_field_name, entity_name.split(".")[-1]).count().sort(descending, "properties.count", 0).limit(K)
Collaborator

Why split the key and take only the last segment?

Collaborator Author

Because our entity name is stored inside properties.

Collaborator

So entity_name is something like "properties.field"? What about "properties.nested.field"?
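
The nested-path question can be made concrete with a small plain-Python sketch. It assumes documents behave like nested dicts; `last_segment` mirrors the `split(".")[-1]` call in the diff, while `lookup` is a hypothetical helper that follows the whole dotted path.

```python
def last_segment(path):
    """Return only the final component of a dotted path."""
    return path.split(".")[-1]


def lookup(doc, path):
    """Follow every component of a dotted path through nested dicts."""
    value = doc
    for part in path.split("."):
        value = value[part]
    return value


# Hypothetical document with both a top-level and a nested "field".
doc = {"properties": {"field": "a", "nested": {"field": "b"}}}

flat_key = last_segment("properties.field")
nested_key = last_segment("properties.nested.field")

# Both paths collapse to the same key, so a lookup by the last segment
# inside properties returns the top-level value for the nested path too.
by_last_segment = doc["properties"][nested_key]
by_full_path = lookup(doc, "properties.nested.field")
```

In this sketch the last-segment lookup returns the top-level value for the nested path, which is the case the reviewer is asking about.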

import tempfile
temp_dir = tempfile.mkdtemp()

docset = self.inputs[0].embed(embedder).materialize(path=f"{temp_dir}", source_mode=MATERIALIZE_USE_STORED)
Collaborator

One tricky thing I ran into when using this the other day was getting the embedder to embed the field I wanted to cluster by. I had to sneak a pre_process_document hook into the embedder object, temporarily hide the element text representations, and stick in a dummy document text representation to get it to do the right thing. Is that solved?

Collaborator Author

Yes, that is mitigated in the embedder: you can point it at another field to embed instead of text_representation.

Comment on lines 18 to 31
def group_udf(batch):
    import numpy as np
    result = {"count": np.array([len(batch["properties"])])}
    if self._entity:
        result[self._entity] = np.array([batch["properties"][0][self._entity]])
    return result
Collaborator

So this puts the first instance of the entity in the result? Why is that useful?

Collaborator Author

For each group, we choose one entity from that group to represent it.
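
As a rough illustration of the trade-off raised in this thread, the plain-Python sketch below compares taking the first entity in a group (what the quoted group_udf does) with one possible alternative, taking the most frequent entity. The alternative is an assumption for comparison only, not something proposed in the PR, and the sample group is made up.

```python
from collections import Counter


def first_entity(entities):
    """Pick the first entity in the group, as the quoted group_udf does."""
    return entities[0]


def most_common_entity(entities):
    """Pick the entity that occurs most often in the group (an alternative)."""
    return Counter(entities).most_common(1)[0][0]


# Hypothetical cluster of near-duplicate vendor names.
group = ["ACME Corp.", "Acme", "Acme", "Acme Inc"]

picked_first = first_entity(group)
picked_common = most_common_entity(group)
```

Taking the first entity is cheap and needs no extra pass over the group, but the label depends on row order; the most-frequent choice is more stable at the cost of counting every value.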

@bohou-aryn bohou-aryn force-pushed the groubycount branch 2 times, most recently from 545db54 to b96a17d Compare March 3, 2025 20:50
@bohou-aryn bohou-aryn enabled auto-merge (rebase) March 3, 2025 22:20
@bohou-aryn bohou-aryn merged commit 21c9d8e into main Mar 3, 2025
12 of 15 checks passed
2 participants