+
Skip to content

feat: use git database to fetch references and content instead of github api #801

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jamesc-grafana
Copy link

Works towards: #764 using git2-rs to talk directly to the Git Remote and bare cloning for the comparison operations

@jamesc-grafana jamesc-grafana marked this pull request as draft May 16, 2025 14:05
@jamesc-grafana jamesc-grafana marked this pull request as ready for review May 16, 2025 14:37
@funnelfiasco funnelfiasco added enhancement New feature or request performance labels May 16, 2025
@woodruffw
Copy link
Member

Thanks @jamesc-grafana! I'll be able to give this a review later today!

0 => None,
_ => Some(token.to_string()),
},
fetched_repos: RwLock::new(HashMap::new()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud: instead using a bool as a sentinel here, can we make this a HashMap<String, PathBuf> or similar to track the underlying cached directory? I think that would reduce the number of checks needed below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tracking the path is a good idea, however I'm unsure on which of the checks that you'd propose getting rid of? In most circumstances the existence of the directory will require an initial fetch to ensure nothing has changed since the prior run 🤔

One thing that I did think, was that changing the behaviour from a clone for fetching content and a remote for listing refs to just a clone and then checking the refs locally to avoid lots of polling of the GitHub services. I can experiment with that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tracking the path is a good idea, however I'm unsure on which of the checks that you'd propose getting rid of?

I was thinking of the two checks at the top of run_fetch -- if I'm reading that right (I might not be!) I believe those could be collapsed into a single check, with a preserved invariant that if fetched_repos contains a path then we can use it without checking for existence.

One thing that I did think, was that changing the behaviour from a clone for fetching content and a remote for listing refs to just a clone and then checking the refs locally to avoid lots of polling of the GitHub services. I can experiment with that

Sounds good! I've held off on local benchmarking for now, but let me know once you're at a fixpoint with experiments here and I can start doing that 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the get_repo and run_fetch to use the PathBuf in the HashMap if it exists to open the repo, we already know about it. Otherwise to check existence and then fetch or clone based on the status of the repo

Changes in 0fb174e

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

pageno += 1;
#[instrument(skip(self))]
pub(crate) async fn get_repo(&self, owner: &str, repo: &str) -> Result<Repository> {
let tmp = std::env::temp_dir().join(format!("zizmor/{owner}-{repo}"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: there's a top-level --cache-dir setting that gets plumbed into the AuditState (which defaults to an XDG cache dir). Might be good to plumb that further into Client and use it here rather than std::env::temp_dir() 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed in 0fb174e

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, had to rebase for Cargo.toml so fixed in ff539fd instead

jamesc-grafana and others added 4 commits June 2, 2025 13:52
Signed-off-by: William Woodruff <william@yossarian.net>
Signed-off-by: William Woodruff <william@yossarian.net>
@woodruffw
Copy link
Member

Hey @jamesc-grafana, sorry for the delay here! This popped a bit further down the stack after the fix APIs 😉

I'm taking another look at this now, and I think the overall approach looks great! One outstanding concern I have is with performance, however -- on local tests this is a lot slower than the current REST API calls, probably because git2 ends up pulling a lot more data over the wire. Do you have any thoughts on how we can optimize this?

(Normally I'd reach for shallow clones + sparse checkouts here, but I'm not super familiar with libgit2 or git2 for Rust 🙂)

@jamesc-grafana
Copy link
Author

One outstanding concern I have is with performance, however -- on local tests this is a lot slower than the current REST API calls

Yes, I also find this on an initial run too, however, once the initial clone is complete I was finding it quicker than the API calls. It's currently doing a "bare", so this could be the problem, I think that your suggestion of shallow/sparse clones might help only then fetching more data when required.

(Normally I'd reach for shallow clones + sparse checkouts here, but I'm not super familiar with libgit2 or git2 for Rust 🙂)

I will look into git2-rs for shallow and sparse checkouts

@mostafa mostafa mentioned this pull request Jul 9, 2025
11 tasks
@jamesc-grafana jamesc-grafana marked this pull request as draft July 11, 2025 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载