-
Notifications
You must be signed in to change notification settings - Fork 65
Add entity name in grouped result, also add materialize in groupbycount operator #1204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this trying to do?
lib/sycamore/sycamore/docset.py
Outdated
return DocSet(self.context, queries) | ||
|
||
def groupby(self, key: Union[str, list[str]]) -> "GroupedData": | ||
def groupby(self, grouped_key: Union[str, list[str]], entity: str=None) -> "GroupedData": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this entity do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entity name, need map back to the field to do the embedding, otherwise, the result is only the count and cluster id, since our groupby is indirect on cluster id.
return result | ||
|
||
grouped = dataset.filter(filter_meta).map(Document.from_row).groupby(self._grouped_key) | ||
aggregated = grouped.map_groups(group_udf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if a group is bigger than the batch size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would overflow.
centroids = docset.kmeans(K=K * 2, field_name=embed_name) | ||
clustered = docset.clustering(centroids=centroids, cluster_field_name=cluster_field_name, field_name=embed_name) | ||
result = clustered.groupby(cluster_field_name).count().sort(descending, "properties.count", 0).limit(K) | ||
result = clustered.groupby(cluster_field_name, entity_name.split(".")[-1]).count().sort(descending, "properties.count", 0).limit(K) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the split and take last segment of key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our entity name is inside properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so entity_name is something like "properties.field"? What about "properties.nested.field"?
import tempfile | ||
temp_dir = tempfile.mkdtemp() | ||
|
||
docset = self.inputs[0].embed(embedder).materialize(path=f"{temp_dir}", source_mode=MATERIALIZE_USE_STORED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the trickinesses I faced using this the other day was getting the embedder to embed the field I wanted to cluster by - I had to sneak in a pre_process_document hook to the embedder object and also temporarily hide the element text representations and stick in a dummy document text representation to get it to do the right thing. Is that solved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is mitigated in the embedder, you could point to some other field to embed instead of the text_representation.
def group_udf(batch): | ||
import numpy as np | ||
result = {"count": np.array([len(batch["properties"])])} | ||
if self._entity: | ||
result[self._entity] = np.array([batch["properties"][0][self._entity]]) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is putting the first instance of the entity in the result? Why is this useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for each group, choose one entity in that group.
545db54
to
b96a17d
Compare
No description provided.