-
-
Notifications
You must be signed in to change notification settings - Fork 90
feat: use git database to fetch references and content instead of github api #801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks @jamesc-grafana! I'll be able to give this a review later today! |
0 => None, | ||
_ => Some(token.to_string()), | ||
}, | ||
fetched_repos: RwLock::new(HashMap::new()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud: instead using a bool
as a sentinel here, can we make this a HashMap<String, PathBuf>
or similar to track the underlying cached directory? I think that would reduce the number of checks needed below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tracking the path is a good idea, however I'm unsure on which of the checks that you'd propose getting rid of? In most circumstances the existence of the directory will require an initial fetch to ensure nothing has changed since the prior run 🤔
One thing that I did think, was that changing the behaviour from a clone
for fetching content and a remote
for listing refs to just a clone
and then checking the refs locally to avoid lots of polling of the GitHub services. I can experiment with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tracking the path is a good idea, however I'm unsure on which of the checks that you'd propose getting rid of?
I was thinking of the two checks at the top of run_fetch
-- if I'm reading that right (I might not be!) I believe those could be collapsed into a single check, with a preserved invariant that if fetched_repos
contains a path then we can use it without checking for existence.
One thing that I did think, was that changing the behaviour from a clone for fetching content and a remote for listing refs to just a clone and then checking the refs locally to avoid lots of polling of the GitHub services. I can experiment with that
Sounds good! I've held off on local benchmarking for now, but let me know once you're at a fixpoint with experiments here and I can start doing that 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the get_repo
and run_fetch
to use the PathBuf in the HashMap if it exists to open the repo, we already know about it. Otherwise to check existence and then fetch or clone based on the status of the repo
Changes in 0fb174e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
crates/zizmor/src/github_api.rs
Outdated
pageno += 1; | ||
#[instrument(skip(self))] | ||
pub(crate) async fn get_repo(&self, owner: &str, repo: &str) -> Result<Repository> { | ||
let tmp = std::env::temp_dir().join(format!("zizmor/{owner}-{repo}")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: there's a top-level --cache-dir
setting that gets plumbed into the AuditState
(which defaults to an XDG cache dir). Might be good to plumb that further into Client
and use it here rather than std::env::temp_dir()
🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, fixed in 0fb174e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, had to rebase for Cargo.toml
so fixed in ff539fd instead
Signed-off-by: William Woodruff <william@yossarian.net>
Signed-off-by: William Woodruff <william@yossarian.net>
Hey @jamesc-grafana, sorry for the delay here! This popped a bit further down the stack after the fix APIs 😉 I'm taking another look at this now, and I think the overall approach looks great! One outstanding concern I have is with performance, however -- on local tests this is a lot slower than the current REST API calls, probably because (Normally I'd reach for shallow clones + sparse checkouts here, but I'm not super familiar with libgit2 or |
Yes, I also find this on an initial run too, however, once the initial clone is complete I was finding it quicker than the API calls. It's currently doing a "bare", so this could be the problem, I think that your suggestion of shallow/sparse clones might help only then fetching more data when required.
I will look into |
Works towards: #764 using
git2-rs
to talk directly to the Git Remote and bare cloning for the comparison operations