-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Is there an existing issue for this?
- I have searched the existing issues
Describe the issue
Currently the reviewers column in github__pull_requests
does not account for duplicate/removed reviewers. E.g. it just joins all reviewers based on the the string aggregated users from the stg_github__pull_request_review
model.
This aggregation seems to be inherited upstream through the int_github__pull_request_reviewers
model and therefore, this model should be updated to be like so instead. I don't actually think in the example branch that we even need to leverage pull_request_review
anymore but should just be able to just use pull_request_reviewer_history
.
Relevant error log or model output
This is the sample of the data as it currently stands:
split_reviewers as (
select
reviewers,
split_reviewers,
count(distinct issue_id)
FROM `dbt-package-testing.fivetran_transformations_github.github__pull_requests` ,
unnest(split(reviewers, ', ')) split_reviewers
group by 1,2
having count(distinct issue_id) > 1
)
select * from split_reviewers
Some rows literally look like this:
```fivetran-joemarkiewicz, fivetran-joemarkiewicz, fivetran-joemarkiewicz```
Presumably because the `pull_request_review` seems like an event based table so aggregating all the users there will yield duplicates.
Expected behavior
The aggregated reviewers column should be de-duplicated with only one instance per unique reviewer.
So instead of:
fivetran-joemarkiewicz, fivetran-joemarkiewicz, fivetran-joemarkiewicz
It would be
fivetran-joemarkiewicz
dbt Project configurations
n/a
Package versions
n/a
What database are you using dbt with?
bigquery
dbt Version
1.3
Additional Context
No response
Are you willing to open a PR to help address this issue?
- Yes.
- Yes, but I will need assistance and will schedule time during our office hours for guidance
- No.