-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Search: enable repo:has.topic() for GitLab
#57649
Conversation
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff a9f2ba5...feae35b.
|
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff a3dbd371eb14ca5a56e250d068e6dabad296b638...959d1f27a799b17c3c3fbe9744590b3e796fe0ef.
|
repo:has.topic() for GitLab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
Can you update the test plan for this PR in case there's anything specific the reviewers need to test?
Also, it seems you need to run the sg gen command to update the database schema docs - I believe that's what's causing the current CI failure.
d73f2b2 to
9a84dd3
Compare
camdencheek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
It would still be nice to add a SQL comment on the new index like was done with the previous index
COMMENT ON INDEX idx_repo_github_topics IS 'An index to speed up listing repos by topic. Intended to be used when TopicFilters are added to the RepoListOptions';
3fa939b to
e500952
Compare
e500952 to
b01a5cf
Compare
b01a5cf to
279e043
Compare
eseliger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for also caring about DB perf and generating a query plan for the test plan!
However, the plan generated might differ quite a lot from what is actually run in production for two reasons:
- We usually apply way more filters than just this one tested for, about visibility, deleted status, and so forth and it might be that with all these other conditions combined postgres no longer decides to actually use the index you would think it is using
- Your local dataset seemed to be a few rows only, to get real-world results it should be run against a database that has a ton of data, like sourcegraph.coms database. Postgres behaves very different when there's 10 rows vs when there's 10000 rows
Also left an inline comment about adding new indexes for every code host.
Don't see this as blocking, rather as some food for thought before this goes into production :)
migrations/frontend/1697558292_comment_on_gitlab_topic_idx/down.sql
Outdated
Show resolved
Hide resolved
migrations/frontend/1697558292_comment_on_gitlab_topic_idx/up.sql
Outdated
Show resolved
Hide resolved
| "idx_repo_github_topics" gin (((metadata -> 'RepositoryTopics'::text) -> 'Nodes'::text)) WHERE external_service_type = 'github'::text | ||
| "idx_repo_gitlab_topics" gin ((metadata -> 'topics'::text)) WHERE external_service_type = 'gitlab'::text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm the repo table has an absolute overload of indexes already, I wonder if we could simplify this by normalizing the DB a bit:
We introduce a new column topics text[] and the syncer decides what to write to that field. Then GitHub and GitLab can both do that, and in the future some other code host could also simply add to that proper column, vs us adding a new index every time.
That would also simplify the query we're making, because we can simply operate on a single column vs adding multiple OR-d together statements about the different ways topics can be retrieved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we should simplify it, but I expect the index to be quite small. The github topics index is 122MB on dotcom (select pg_table_size('idx_repo_github_topics');), and we have many more github repos than gitlab repos.
I'd be a bit concerned about denormalizing the DB. The syncer logic is a bit complex, and I honestly wouldn't feel comfortable changing it for a small feature like this, but what about a generated column? something like topics GENERATED STORED AS CASE WHEN external_service_type = 'gitlab' THEN ... WHEN external_service_type = 'github' THEN ... ELSE [] END
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work, too I assume! I haven't dealt with generated columns much in the past, but trust that they work well.
Co-authored-by: Erik Seliger <erikseliger@me.com>
Co-authored-by: Erik Seliger <erikseliger@me.com>
…n.sql Co-authored-by: Erik Seliger <erikseliger@me.com>
Co-authored-by: Erik Seliger <erikseliger@me.com>
|
@jasonhawkharris did you intend to merge this? |
add support to
has.topicfor gitlab topics. Previously only worked for github topics.Test plan
idx_repo_gitlab_topicsis being usedPreview 🤩
Preview Link