这是indexloc提供的服务,不要输入任何密码
Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Conversation

@jasonhawkharris
Copy link
Contributor

@jasonhawkharris jasonhawkharris commented Oct 16, 2023

add support to has.topic for gitlab topics. Previously only worked for github topics.

Test plan

  • Confirm new query does not break github topics
  • Confirm new index works for gitlab topics
  • Run EXPLAIN query locally to confirm idx_repo_gitlab_topics is being used
sourcegraph=# EXPLAIN SELECT *
FROM repo
WHERE external_service_type = 'gitlab'
AND metadata->'topics' @> '["test-topic"]';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on repo  (cost=12.69..300.47 rows=89 width=687)
   Recheck Cond: (((metadata -> 'topics'::text) @> '["test-topic"]'::jsonb) AND (external_service_type = 'gitlab'::text))
   ->  Bitmap Index Scan on idx_repo_gitlab_topics  (cost=0.00..12.67 rows=89 width=0)
         Index Cond: ((metadata -> 'topics'::text) @> '["test-topic"]'::jsonb)
(4 rows)

Preview 🤩

Preview Link

@jasonhawkharris jasonhawkharris requested a review from a team October 16, 2023 21:51
@cla-bot cla-bot bot added the cla-signed label Oct 16, 2023
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Oct 16, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff a9f2ba5...feae35b.

Notify File(s)
@eseliger internal/database/repos.go
internal/extsvc/gitlab/projects.go
internal/repos/gitlab_test.go
@fkling client/shared/src/search/query/predicates.ts
client/shared/src/search/query/predicates.ts

@sourcegraph-bot
Copy link
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff a3dbd371eb14ca5a56e250d068e6dabad296b638...959d1f27a799b17c3c3fbe9744590b3e796fe0ef.

Notify File(s)
@eseliger internal/database/repos.go
internal/extsvc/gitlab/projects.go

@camdencheek camdencheek changed the title Jhh/search gitlab topics Search: enable repo:has.topic() for GitLab Oct 16, 2023
Copy link
Contributor

@BolajiOlajide BolajiOlajide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

Can you update the test plan for this PR in case there's anything specific the reviewers need to test?

Also, it seems you need to run the sg gen command to update the database schema docs - I believe that's what's causing the current CI failure.

@jasonhawkharris jasonhawkharris force-pushed the jhh/search-gitlab-topics branch 2 times, most recently from d73f2b2 to 9a84dd3 Compare October 17, 2023 15:00
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Oct 17, 2023

📖 Storybook live preview

@jasonhawkharris jasonhawkharris enabled auto-merge (squash) October 17, 2023 15:45
Copy link
Member

@camdencheek camdencheek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

It would still be nice to add a SQL comment on the new index like was done with the previous index

COMMENT ON INDEX idx_repo_github_topics IS 'An index to speed up listing repos by topic. Intended to be used when TopicFilters are added to the RepoListOptions';

@jasonhawkharris jasonhawkharris force-pushed the jhh/search-gitlab-topics branch from e500952 to b01a5cf Compare October 19, 2023 14:08
@jasonhawkharris jasonhawkharris force-pushed the jhh/search-gitlab-topics branch from b01a5cf to 279e043 Compare October 19, 2023 14:32
Copy link
Member

@eseliger eseliger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for also caring about DB perf and generating a query plan for the test plan!

However, the plan generated might differ quite a lot from what is actually run in production for two reasons:

  • We usually apply way more filters than just this one tested for, about visibility, deleted status, and so forth and it might be that with all these other conditions combined postgres no longer decides to actually use the index you would think it is using
  • Your local dataset seemed to be a few rows only, to get real-world results it should be run against a database that has a ton of data, like sourcegraph.coms database. Postgres behaves very different when there's 10 rows vs when there's 10000 rows

Also left an inline comment about adding new indexes for every code host.

Don't see this as blocking, rather as some food for thought before this goes into production :)

Comment on lines 3546 to +3547
"idx_repo_github_topics" gin (((metadata -> 'RepositoryTopics'::text) -> 'Nodes'::text)) WHERE external_service_type = 'github'::text
"idx_repo_gitlab_topics" gin ((metadata -> 'topics'::text)) WHERE external_service_type = 'gitlab'::text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the repo table has an absolute overload of indexes already, I wonder if we could simplify this by normalizing the DB a bit:
We introduce a new column topics text[] and the syncer decides what to write to that field. Then GitHub and GitLab can both do that, and in the future some other code host could also simply add to that proper column, vs us adding a new index every time.

That would also simplify the query we're making, because we can simply operate on a single column vs adding multiple OR-d together statements about the different ways topics can be retrieved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that we should simplify it, but I expect the index to be quite small. The github topics index is 122MB on dotcom (select pg_table_size('idx_repo_github_topics');), and we have many more github repos than gitlab repos.

I'd be a bit concerned about denormalizing the DB. The syncer logic is a bit complex, and I honestly wouldn't feel comfortable changing it for a small feature like this, but what about a generated column? something like topics GENERATED STORED AS CASE WHEN external_service_type = 'gitlab' THEN ... WHEN external_service_type = 'github' THEN ... ELSE [] END

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work, too I assume! I haven't dealt with generated columns much in the past, but trust that they work well.

@jasonhawkharris jasonhawkharris merged commit 43bf93c into main Oct 19, 2023
@jasonhawkharris jasonhawkharris deleted the jhh/search-gitlab-topics branch October 19, 2023 20:46
@camdencheek
Copy link
Member

@jasonhawkharris did you intend to merge this?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend repo:has.topic() Support to GitLab Support repo:has.topic for GitLab

6 participants