-
Notifications
You must be signed in to change notification settings - Fork 145
sparse-checkout: add 'clean' command #1941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
sparse-checkout: add 'clean' command #1941
Conversation
The logic for the 'git sparse-checkout' builtin uses the_repository all over the place, despite some use of a repository struct in different method parameters. Complete this removal of the_repository by using 'repo' when possible. In one place, there was already a local variable 'r' that was set to the_repository, so move that to a method parameter. We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are still using global constants for the state of the sparse-checkout. Signed-off-by: Derrick Stolee <stolee@gmail.com>
When users change their sparse-checkout definitions to add new directories and remove old ones, there may be a few reasons why directories no longer in scope remain (ignored or excluded files still exist, Windows handles are still open, etc.). When these files still exist, the sparse index feature notices that a tracked, but sparse, directory still exists on disk and thus the index expands. This causes a performance hit _and_ the advice printed isn't very helpful. Using 'git clean' isn't enough (generally '-dfx' may be needed) but also this may not be sufficient. Add a new subcommand to 'git sparse-checkout' that removes these tracked-but-sparse directories, including any excluded or ignored files underneath. This is the most extreme method for doing this, but it works when the sparse-checkout is in cone mode and is expected to rescope based on directories, not files. Be sure to add a --dry-run option so users can predict what will be deleted. In general, output the directories that are being removed so users can know what was removed. Note that untracked directories remain. Further, directories that contain staged changes are not deleted. This is a detail that is partly hidden by the implementation which relies on collapsing the index to a sparse index in-memory and only deleting directories that are listed as sparse in the index. If a staged change exists, then that entry is not stored as a sparse tree entry and thus remains on-disk until committed or reset. Signed-off-by: Derrick Stolee <stolee@gmail.com>
In my experience, the most-common reason that the sparse index must expand to a full one is because there is some leftover file in a tracked directory that is now outside of the sparse-checkout. The new 'git sparse-checkout clean' command will find and delete these directories, so point users to it when they hit the sparse index expansion advice. Signed-off-by: Derrick Stolee <stolee@gmail.com>
/submit |
Submitted as pull.1941.git.1751973594.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Patrick Steinhardt wrote (reply to this): On Tue, Jul 08, 2025 at 11:19:50AM +0000, Derrick Stolee via GitGitGadget wrote:
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
> * The user has .gitignore or .git/info/exclude files that tell Git to not
> remove files of a certain type.
> * Some filesystem blocker prevented the removal of a tracked file. This is
> usually more of an issue on Windows where a read handle will block file
> deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
As usual for you, this is a nicely-written summary of how we got here
and why the current mechanisms are insufficient for mere mortals.
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.
>
> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.
> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.
>
> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
> * 'git sparse-checkout reapply -f -x' or similar augmentations of
> 'reapply'.
> * 'git clean --sparse' to focus the clean operation on things outside of
> the sparse-checkout.
>
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
One of the benefits of your new command is that we can extend it in the
future as necessary if we ever notice that there are other things that
we need to do to bring the sparse checkout up to date again. So without
yet having had a look at the implementation I think this direction is
quite sensible.
Ideally it would of course be great if we could automatically fix the
issue for our users. But as we have to prune potentially-ignored data it
is very much a no-go to do that in the automatically.
Patrick |
User |
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Patrick Steinhardt wrote (reply to this):
On Tue, Jul 08, 2025 at 11:19:52AM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
> index 529a8edd9c1e..21ba6f759905 100644
> --- a/Documentation/git-sparse-checkout.adoc
> +++ b/Documentation/git-sparse-checkout.adoc
> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
> to change which sparsity mode you are using without needing to also respecify
> all sparsity paths.
>
> +'clean'::
> + Remove all files in tracked directories that are outside of the
> + sparse-checkout definition. This subcommand requires cone-mode
> + sparse-checkout to be sure that we know which directories are
> + both tracked and all contained paths are not in the sparse-checkout.
> + This command can be used to be sure the sparse index works
> + efficiently.
> ++
> +The `clean` command can also take the `--dry-run` (`-n`) option to list
> +the directories it would remove without performing any filesystem changes.
> +
Hm. This is somewhat different from `git clean`, where you have to pass
`-f` to make it delete any data. I'm not particularly a fan of that
mode, but should we maybe retain it regardless to ensure that things are
at least a tiny bit more consistent?
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 8b70d0c6a441..6d2843827367 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -924,6 +924,76 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> return update_working_directory(repo, NULL);
> }
>
> +static char const * const builtin_sparse_checkout_clean_usage[] = {
> + "git sparse-checkout clean [-n|--dry-run]",
> + NULL
> +};
> +
> +static struct sparse_checkout_clean_opts {
> + int dry_run;
> +} clean_opts;
> +
> +static int sparse_checkout_clean(int argc, const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + struct strbuf full_path = STRBUF_INIT;
> + size_t worktree_len;
> + static struct option builtin_sparse_checkout_clean_options[] = {
> + OPT_BOOL('n', "dry-run", &clean_opts.dry_run,
> + N_("list the directories that would be removed without making filesystem changes")),
> + OPT_END(),
> + };
> +
> + setup_work_tree();
> + if (!core_apply_sparse_checkout)
> + die(_("must be in a sparse-checkout to clean directories"));
> + if (!core_sparse_checkout_cone)
> + die(_("must be in a cone-mode sparse-checkout to clean directories"));
> +
> + argc = parse_options(argc, argv, prefix,
> + builtin_sparse_checkout_clean_options,
> + builtin_sparse_checkout_clean_usage, 0);
> +
> + if (repo_read_index(repo) < 0)
> + die(_("failed to read index"));
> +
> + if (convert_to_sparse(repo->index, SPARSE_INDEX_MEMORY_ONLY))
> + die(_("failed to convert index to a sparse index"));
I noticed that there are several cases in `convert_to_sparse()` where we
simply do nothing. Should we check whether `repo->index->sparse_index`
matches `INDEX_COLLAPSED` after the operation?
> + strbuf_addstr(&full_path, repo->worktree);
> + strbuf_addch(&full_path, '/');
> + worktree_len = full_path.len;
> +
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
Nit: the `*` goes with the variable, not the type.
> + struct cache_entry *ce = repo->index->cache[i];
> + if (!S_ISSPARSEDIR(ce->ce_mode))
> + continue;
Okay, we only need to handle sparse directories.
> + strbuf_setlen(&full_path, worktree_len);
> + strbuf_add(&full_path, ce->name, ce->ce_namelen);
> +
> + dir = opendir(full_path.buf);
Shouldn't it be sufficient to use `is_directory()`?
> + if (!dir)
> + continue;
This is the good and expected case, right? The entry is sparse, so
ideally it doesn't exist. If it does we have to recurse into to end up
with the full index.
> + else if (ENOENT != errno) {
Nit: style. If one branches requires curly braces, all branches should
use them.
> + warning_errno(_("failed to check for existence of '%s'"), ce->name);
> + continue;
> + }
> +
> + closedir(dir);
> +
> + printf("%s\n", ce->name);
git-clean(1) says "Removing %s\n". Should we do the same here?
> + if (!clean_opts.dry_run) {
> + if (remove_dir_recursively(&full_path, 0))
> + warning_errno(_("failed to remove '%s'"), ce->name);
> + }
> + }
> +
> + strbuf_release(&full_path);
> + return 0;
> +}
> +
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> "git sparse-checkout disable",
> NULL
Patrick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Patrick Steinhardt <ps@pks.im> writes:
> On Tue, Jul 08, 2025 at 11:19:52AM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
>> index 529a8edd9c1e..21ba6f759905 100644
>> --- a/Documentation/git-sparse-checkout.adoc
>> +++ b/Documentation/git-sparse-checkout.adoc
>> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
>> to change which sparsity mode you are using without needing to also respecify
>> all sparsity paths.
>>
>> +'clean'::
>> + Remove all files in tracked directories that are outside of the
>> + sparse-checkout definition. This subcommand requires cone-mode
>> + sparse-checkout to be sure that we know which directories are
>> + both tracked and all contained paths are not in the sparse-checkout.
>> + This command can be used to be sure the sparse index works
>> + efficiently.
>> ++
>> +The `clean` command can also take the `--dry-run` (`-n`) option to list
>> +the directories it would remove without performing any filesystem changes.
>> +
>
> Hm. This is somewhat different from `git clean`, where you have to pass
> `-f` to make it delete any data. I'm not particularly a fan of that
> mode, but should we maybe retain it regardless to ensure that things are
> at least a tiny bit more consistent?
Ah, it reminds me of my favorite "regret". We may want to consider
making the --force/--dry-run used in "git clean" saner at a major
version boundary. I am not particulary a fan of that mode, and
would oppose a patch made as a part of regular "let's change this,
as I do not like it" exercise, but as a known-breaking change, I do
not mind it at all. Essentially the change to propose would be to
deprecate clean.requireForce and internally make it a constant
false.
But that is a tangent ;-)
On the Git mailing list, Elijah Newren wrote (reply to this): On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> When using cone-mode sparse-checkout, users specify which tracked
> directories they want (recursively) and any directory not part of the parent
> paths for those directories are considered "out of scope". When changing
> sparse-checkouts, there are a variety of reasons why these "out of scope"
> directories could remain, including:
>
> * The user has .gitignore or .git/info/exclude files that tell Git to not
> remove files of a certain type.
> * Some filesystem blocker prevented the removal of a tracked file. This is
> usually more of an issue on Windows where a read handle will block file
> deletion.
>
> Typically, this would not mean too much for the user experience. A few extra
> filesystem checks might be required to satisfy git status commands, but the
> scope of the performance hit is relative to how many cruft files are left
> over in this situation.
>
> However, when using the sparse index, these tracked sparse directories cause
> significant performance issues. When noticing that the index contains a
> sparse directory but that directory exists on disk, Git needs to expand that
> sparse directory to determine which files are tracked or untracked. The
> current mechanism expands the entire index to a full one, an expensive
> operation that scales with the total number of paths at HEAD and not just
> the number of cruft files left over.
>
> Advice was added in 9479a31d603 (advice: warn when sparse index expands,
> 2024-07-08) to help users determine that they were in this state. However,
> the advice doesn't actually recommend helpful ways to get out of this state.
> Recommending "git clean" on its own is incomplete, as typically users
> actually need 'git clean -dfx' to clear out the ignored or excluded files.
> Even then, they may need 'git sparse-checkout reapply' afterwards to clear
> the sparse directories.
>
> The advice was successful in helping to alert users to the problem, which is
> how I got wind of many of these cases for how users get into this state.
> It's now time to give them a tool that helps them out of this state.
>
> This series adds a new 'git sparse-checkout clean' command that currently
> only works for cone-mode sparse-checkouts. The only thing it does is
> collapse the index to a sparse index (as much as possible) and make sure
> that any sparse directories are removed. These directories are listed to
> stdout.
But what does it clean up?
- untracked files?
- ignored files?
- tracked-but-unmodified files?
- tracked-and-modified files?
- tracked-and-conflicted files? (which is probably a subset of
tracked-and-modified, but thought I'd call it out)
Note: "tracked" probably has a slightly ambiguous connotation here
since we sometimes mean "is it in the index", and there's a difference
between "would it be in the sparse index" and "would it be in the
fully expanded index". Here, by "tracked" I mean the latter -- "is it
in the fully expanded index".
> A --dry-run option is available to list the directories that would be
> removed without actually deleting the directories.
>
> This option would be preferred to something like 'git clean -dfx' since it
> does not clear the excluded files that are still within the sparse-checkout.
This seems to suggest you are only interested in untracked and ignored
files. I'm sure that's by far the most common case, but I'm curious
about the others. Are you expecting users to sometimes need to run
both 'git sparse-checkout clean' and 'git sparse-checkout reapply'?
> Instead, it performs the exact filesystem operations required to refresh the
> sparse index performance back to what is expected.
But what operations are those and what is expected?
As you mentioned above, for untracked or ignored files, the
expectation is that those would be removed.
I think if there are tracked-but-unmodified files, I'd expect those to
be removed as well.
If only the above filetypes exist, then we'd expect the directory to
be nuked and sparse index performance to be improved back to "normal".
However, if there are tracked-and-modified files, I'd expect an error
and for the sparse index performance to continue to suffer until those
paths are resolved. (Or, pie-in-sky spitballing:maybe we could
attempt to do something smarter like make sibling directories to the
tracked-and-modified path be treated as sparse directories, so that
performance only suffers a little).
> I spent a few weeks debating with myself about whether or not this was the
> right interface, so please suggest alternatives if you have better ideas.
> Among my rejected ideas include:
>
> * 'git sparse-checkout reapply -f -x' or similar augmentations of
> 'reapply'.
The connection to sparse-checkout reapply at least would make it
clearer what you are doing with tracked files, since its explanation
explicitly mentions those. However, reapply doesn't say anything
about untracked or ignored files, which we'd need to start explaining
and perhaps isn't as clean a fit, especially since your new usecase is
predominantly about untracked and ignored files. I don't have a
strong opinion here, but I think I also like your choice of a separate
'clean' subcommand better.
> * 'git clean --sparse' to focus the clean operation on things outside of
> the sparse-checkout.
Yeah, this choice would have likely prevented you from cleaning up
tracked files, and required users to run both 'clean --sparse' and
'sparse-checkout reapply'. And this command feels more tightly
connected to sparse-checkouts to me, so I wouldn't have liked this
choice either.
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
I'm also curious what happens when (1) you are in cone mode and there
is no sparse index, or (2) when you are not in cone mode. I suspect
those and the questions above will be answered as I read the
individual patches, so I'll keep going...
> Thanks, -Stolee
>
> Derrick Stolee (3):
> sparse-checkout: remove use of the_repository
> sparse-checkout: add 'clean' command
> sparse-index: point users to new 'clean' action
>
> Documentation/git-sparse-checkout.adoc | 13 +-
> builtin/sparse-checkout.c | 192 +++++++++++++++++--------
> sparse-index.c | 3 +-
> t/t1091-sparse-checkout-builtin.sh | 48 +++++++
> 4 files changed, 197 insertions(+), 59 deletions(-)
>
>
> base-commit: 8b6f19ccfc3aefbd0f22f6b7d56ad6a3fc5e4f37
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1941%2Fderrickstolee%2Fgit-sparse-checkout-clean-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1941/derrickstolee/git-sparse-checkout-clean-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1941
> --
> gitgitgadget |
@@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r) | |||
ensure_full_index(r->index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The logic for the 'git sparse-checkout' builtin uses the_repository all
> over the place, despite some use of a repository struct in different
> method parameters. Complete this removal of the_repository by using
> 'repo' when possible.
>
> In one place, there was already a local variable 'r' that was set to
> the_repository, so move that to a method parameter.
Always nice to see these cleanups.
> We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are
> still using global constants for the state of the sparse-checkout.
Thanks for calling this out and explaining it.
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> builtin/sparse-checkout.c | 119 ++++++++++++++++++++------------------
> 1 file changed, 63 insertions(+), 56 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 1bf01591b275..8b70d0c6a441 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r)
> ensure_full_index(r->index);
> }
>
> -static int update_working_directory(struct pattern_list *pl)
> +static int update_working_directory(struct repository *r,
> + struct pattern_list *pl)
> {
> enum update_sparsity_result result;
> struct unpack_trees_options o;
> struct lock_file lock_file = LOCK_INIT;
> - struct repository *r = the_repository;
> struct pattern_list *old_pl;
>
> /* If no branch has been checked out, there are no updates to make. */
> @@ -327,7 +327,8 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
> string_list_clear(&sl, 0);
> }
>
> -static int write_patterns_and_update(struct pattern_list *pl)
> +static int write_patterns_and_update(struct repository *repo,
> + struct pattern_list *pl)
> {
> char *sparse_filename;
> FILE *fp;
> @@ -336,15 +337,15 @@ static int write_patterns_and_update(struct pattern_list *pl)
>
> sparse_filename = get_sparse_checkout_filename();
>
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("failed to create directory for sparse-checkout file"));
>
> hold_lock_file_for_update(&lk, sparse_filename, LOCK_DIE_ON_ERROR);
>
> - result = update_working_directory(pl);
> + result = update_working_directory(repo, pl);
> if (result) {
> rollback_lock_file(&lk);
> - update_working_directory(NULL);
> + update_working_directory(repo, NULL);
> goto out;
> }
>
> @@ -372,25 +373,26 @@ enum sparse_checkout_mode {
> MODE_CONE_PATTERNS = 2,
> };
>
> -static int set_config(enum sparse_checkout_mode mode)
> +static int set_config(struct repository *repo,
> + enum sparse_checkout_mode mode)
> {
> /* Update to use worktree config, if not already. */
> - if (init_worktree_config(the_repository)) {
> + if (init_worktree_config(repo)) {
> error(_("failed to initialize worktree config"));
> return 1;
> }
>
> - if (repo_config_set_worktree_gently(the_repository,
> + if (repo_config_set_worktree_gently(repo,
> "core.sparseCheckout",
> mode ? "true" : "false") ||
> - repo_config_set_worktree_gently(the_repository,
> + repo_config_set_worktree_gently(repo,
> "core.sparseCheckoutCone",
> mode == MODE_CONE_PATTERNS ?
> "true" : "false"))
> return 1;
>
> if (mode == MODE_NO_PATTERNS)
> - return set_sparse_index_config(the_repository, 0);
> + return set_sparse_index_config(repo, 0);
>
> return 0;
> }
> @@ -410,7 +412,7 @@ static enum sparse_checkout_mode update_cone_mode(int *cone_mode) {
> return MODE_ALL_PATTERNS;
> }
>
> -static int update_modes(int *cone_mode, int *sparse_index)
> +static int update_modes(struct repository *repo, int *cone_mode, int *sparse_index)
> {
> int mode, record_mode;
>
> @@ -418,20 +420,20 @@ static int update_modes(int *cone_mode, int *sparse_index)
> record_mode = (*cone_mode != -1) || !core_apply_sparse_checkout;
>
> mode = update_cone_mode(cone_mode);
> - if (record_mode && set_config(mode))
> + if (record_mode && set_config(repo, mode))
> return 1;
>
> /* Set sparse-index/non-sparse-index mode if specified */
> if (*sparse_index >= 0) {
> - if (set_sparse_index_config(the_repository, *sparse_index) < 0)
> + if (set_sparse_index_config(repo, *sparse_index) < 0)
> die(_("failed to modify sparse-index config"));
>
> /* force an index rewrite */
> - repo_read_index(the_repository);
> - the_repository->index->updated_workdir = 1;
> + repo_read_index(repo);
> + repo->index->updated_workdir = 1;
>
> if (!*sparse_index)
> - ensure_full_index(the_repository->index);
> + ensure_full_index(repo->index);
> }
>
> return 0;
> @@ -448,7 +450,7 @@ static struct sparse_checkout_init_opts {
> } init_opts;
>
> static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> struct pattern_list pl;
> char *sparse_filename;
> @@ -464,7 +466,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> };
>
> setup_work_tree();
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> init_opts.cone_mode = -1;
> init_opts.sparse_index = -1;
> @@ -473,7 +475,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> builtin_sparse_checkout_init_options,
> builtin_sparse_checkout_init_usage, 0);
>
> - if (update_modes(&init_opts.cone_mode, &init_opts.sparse_index))
> + if (update_modes(repo, &init_opts.cone_mode, &init_opts.sparse_index))
> return 1;
>
> memset(&pl, 0, sizeof(pl));
> @@ -485,14 +487,14 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> if (res >= 0) {
> free(sparse_filename);
> clear_pattern_list(&pl);
> - return update_working_directory(NULL);
> + return update_working_directory(repo, NULL);
> }
>
> - if (repo_get_oid(the_repository, "HEAD", &oid)) {
> + if (repo_get_oid(repo, "HEAD", &oid)) {
> FILE *fp;
>
> /* assume we are in a fresh repo, but update the sparse-checkout file */
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("unable to create leading directories of %s"),
> sparse_filename);
> fp = xfopen(sparse_filename, "w");
> @@ -511,7 +513,7 @@ static int sparse_checkout_init(int argc, const char **argv, const char *prefix,
> add_pattern("!/*/", empty_base, 0, &pl, 0);
> pl.use_cone_patterns = init_opts.cone_mode;
>
> - return write_patterns_and_update(&pl);
> + return write_patterns_and_update(repo, &pl);
> }
>
> static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *path)
> @@ -674,7 +676,8 @@ static void add_patterns_literal(int argc, const char **argv,
> add_patterns_from_input(pl, argc, argv, use_stdin ? stdin : NULL);
> }
>
> -static int modify_pattern_list(struct strvec *args, int use_stdin,
> +static int modify_pattern_list(struct repository *repo,
> + struct strvec *args, int use_stdin,
> enum modify_type m)
> {
> int result;
> @@ -696,22 +699,23 @@ static int modify_pattern_list(struct strvec *args, int use_stdin,
> }
>
> if (!core_apply_sparse_checkout) {
> - set_config(MODE_ALL_PATTERNS);
> + set_config(repo, MODE_ALL_PATTERNS);
> core_apply_sparse_checkout = 1;
> changed_config = 1;
> }
>
> - result = write_patterns_and_update(pl);
> + result = write_patterns_and_update(repo, pl);
>
> if (result && changed_config)
> - set_config(MODE_NO_PATTERNS);
> + set_config(repo, MODE_NO_PATTERNS);
>
> clear_pattern_list(pl);
> free(pl);
> return result;
> }
>
> -static void sanitize_paths(struct strvec *args,
> +static void sanitize_paths(struct repository *repo,
> + struct strvec *args,
> const char *prefix, int skip_checks)
> {
> int i;
> @@ -752,7 +756,7 @@ static void sanitize_paths(struct strvec *args,
>
> for (i = 0; i < args->nr; i++) {
> struct cache_entry *ce;
> - struct index_state *index = the_repository->index;
> + struct index_state *index = repo->index;
> int pos = index_name_pos(index, args->v[i], strlen(args->v[i]));
>
> if (pos < 0)
> @@ -779,7 +783,7 @@ static struct sparse_checkout_add_opts {
> } add_opts;
>
> static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_add_options[] = {
> OPT_BOOL_F(0, "skip-checks", &add_opts.skip_checks,
> @@ -796,7 +800,7 @@ static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
> if (!core_apply_sparse_checkout)
> die(_("no sparse-checkout to add to"));
>
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> argc = parse_options(argc, argv, prefix,
> builtin_sparse_checkout_add_options,
> @@ -804,9 +808,9 @@ static int sparse_checkout_add(int argc, const char **argv, const char *prefix,
>
> for (int i = 0; i < argc; i++)
> strvec_push(&patterns, argv[i]);
> - sanitize_paths(&patterns, prefix, add_opts.skip_checks);
> + sanitize_paths(repo, &patterns, prefix, add_opts.skip_checks);
>
> - ret = modify_pattern_list(&patterns, add_opts.use_stdin, ADD);
> + ret = modify_pattern_list(repo, &patterns, add_opts.use_stdin, ADD);
>
> strvec_clear(&patterns);
> return ret;
> @@ -825,7 +829,7 @@ static struct sparse_checkout_set_opts {
> } set_opts;
>
> static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> int default_patterns_nr = 2;
> const char *default_patterns[] = {"/*", "!/*/", NULL};
> @@ -847,7 +851,7 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> int ret;
>
> setup_work_tree();
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> set_opts.cone_mode = -1;
> set_opts.sparse_index = -1;
> @@ -856,7 +860,7 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> builtin_sparse_checkout_set_options,
> builtin_sparse_checkout_set_usage, 0);
>
> - if (update_modes(&set_opts.cone_mode, &set_opts.sparse_index))
> + if (update_modes(repo, &set_opts.cone_mode, &set_opts.sparse_index))
> return 1;
>
> /*
> @@ -870,10 +874,10 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix,
> } else {
> for (int i = 0; i < argc; i++)
> strvec_push(&patterns, argv[i]);
> - sanitize_paths(&patterns, prefix, set_opts.skip_checks);
> + sanitize_paths(repo, &patterns, prefix, set_opts.skip_checks);
> }
>
> - ret = modify_pattern_list(&patterns, set_opts.use_stdin, REPLACE);
> + ret = modify_pattern_list(repo, &patterns, set_opts.use_stdin, REPLACE);
>
> strvec_clear(&patterns);
> return ret;
> @@ -891,7 +895,7 @@ static struct sparse_checkout_reapply_opts {
>
> static int sparse_checkout_reapply(int argc, const char **argv,
> const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_reapply_options[] = {
> OPT_BOOL(0, "cone", &reapply_opts.cone_mode,
> @@ -912,12 +916,12 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> builtin_sparse_checkout_reapply_options,
> builtin_sparse_checkout_reapply_usage, 0);
>
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> - if (update_modes(&reapply_opts.cone_mode, &reapply_opts.sparse_index))
> + if (update_modes(repo, &reapply_opts.cone_mode, &reapply_opts.sparse_index))
> return 1;
>
> - return update_working_directory(NULL);
> + return update_working_directory(repo, NULL);
> }
>
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> @@ -927,7 +931,7 @@ static char const * const builtin_sparse_checkout_disable_usage[] = {
>
> static int sparse_checkout_disable(int argc, const char **argv,
> const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_disable_options[] = {
> OPT_END(),
> @@ -955,7 +959,7 @@ static int sparse_checkout_disable(int argc, const char **argv,
> * are expecting to do that when disabling sparse-checkout.
> */
> give_advice_on_expansion = 0;
> - repo_read_index(the_repository);
> + repo_read_index(repo);
>
> memset(&pl, 0, sizeof(pl));
> hashmap_init(&pl.recursive_hashmap, pl_hashmap_cmp, NULL, 0);
> @@ -965,14 +969,14 @@ static int sparse_checkout_disable(int argc, const char **argv,
>
> add_pattern("/*", empty_base, 0, &pl, 0);
>
> - prepare_repo_settings(the_repository);
> - the_repository->settings.sparse_index = 0;
> + prepare_repo_settings(repo);
> + repo->settings.sparse_index = 0;
>
> - if (update_working_directory(&pl))
> + if (update_working_directory(repo, &pl))
> die(_("error while refreshing working directory"));
>
> clear_pattern_list(&pl);
> - return set_config(MODE_NO_PATTERNS);
> + return set_config(repo, MODE_NO_PATTERNS);
> }
>
> static char const * const builtin_sparse_checkout_check_rules_usage[] = {
> @@ -987,14 +991,17 @@ static struct sparse_checkout_check_rules_opts {
> char *rules_file;
> } check_rules_opts;
>
> -static int check_rules(struct pattern_list *pl, int null_terminated) {
> +static int check_rules(struct repository *repo,
> + struct pattern_list *pl,
> + int null_terminated)
> +{
> struct strbuf line = STRBUF_INIT;
> struct strbuf unquoted = STRBUF_INIT;
> char *path;
> int line_terminator = null_terminated ? 0 : '\n';
> strbuf_getline_fn getline_fn = null_terminated ? strbuf_getline_nul
> : strbuf_getline;
> - the_repository->index->sparse_checkout_patterns = pl;
> + repo->index->sparse_checkout_patterns = pl;
> while (!getline_fn(&line, stdin)) {
> path = line.buf;
> if (!null_terminated && line.buf[0] == '"') {
> @@ -1006,7 +1013,7 @@ static int check_rules(struct pattern_list *pl, int null_terminated) {
> path = unquoted.buf;
> }
>
> - if (path_in_sparse_checkout(path, the_repository->index))
> + if (path_in_sparse_checkout(path, repo->index))
> write_name_quoted(path, stdout, line_terminator);
> }
> strbuf_release(&line);
> @@ -1016,7 +1023,7 @@ static int check_rules(struct pattern_list *pl, int null_terminated) {
> }
>
> static int sparse_checkout_check_rules(int argc, const char **argv, const char *prefix,
> - struct repository *repo UNUSED)
> + struct repository *repo)
> {
> static struct option builtin_sparse_checkout_check_rules_options[] = {
> OPT_BOOL('z', NULL, &check_rules_opts.null_termination,
> @@ -1055,7 +1062,7 @@ static int sparse_checkout_check_rules(int argc, const char **argv, const char *
> free(sparse_filename);
> }
>
> - ret = check_rules(&pl, check_rules_opts.null_termination);
> + ret = check_rules(repo, &pl, check_rules_opts.null_termination);
> clear_pattern_list(&pl);
> free(check_rules_opts.rules_file);
> return ret;
> @@ -1084,8 +1091,8 @@ int cmd_sparse_checkout(int argc,
>
> git_config(git_default_config, NULL);
>
> - prepare_repo_settings(the_repository);
> - the_repository->settings.command_requires_full_index = 0;
> + prepare_repo_settings(repo);
> + repo->settings.command_requires_full_index = 0;
>
> return fn(argc, argv, prefix, repo);
> }
> --
> gitgitgadget
Patch looks good to me.
@@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r) | |||
ensure_full_index(r->index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The logic for the 'git sparse-checkout' builtin uses the_repository all
> over the place, despite some use of a repository struct in different
> method parameters. Complete this removal of the_repository by using
> 'repo' when possible.
>
> In one place, there was already a local variable 'r' that was set to
> the_repository, so move that to a method parameter.
>
> We cannot remove the USE_THE_REPOSITORY_VARIABLE declaration as we are
> still using global constants for the state of the sparse-checkout.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> builtin/sparse-checkout.c | 119 ++++++++++++++++++++------------------
> 1 file changed, 63 insertions(+), 56 deletions(-)
OK. The damage is not too bad for a partial update ;-).
As the file-scope static functions in builtin/sparse-checkout.c are
not going to be called by anybody else, it does not really matter if
they internally pass an extra parameter around or use the_repository
since the end result is the same. But doing this may hopefully help
those that may want to move some of these functions to a more
library-ish part of the system outside builtin/ hierarchy.
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 1bf01591b275..8b70d0c6a441 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -204,12 +204,12 @@ static void clean_tracked_sparse_directories(struct repository *r)
> ensure_full_index(r->index);
> }
>
> -static int update_working_directory(struct pattern_list *pl)
> +static int update_working_directory(struct repository *r,
> + struct pattern_list *pl)
> {
> enum update_sparsity_result result;
> struct unpack_trees_options o;
> struct lock_file lock_file = LOCK_INIT;
> - struct repository *r = the_repository;
> struct pattern_list *old_pl;
As this already used short-and-sweet 'r', we just follow suit to
minimize the damage, which is fine.
> @@ -327,7 +327,8 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
> string_list_clear(&sl, 0);
> }
>
> -static int write_patterns_and_update(struct pattern_list *pl)
> +static int write_patterns_and_update(struct repository *repo,
> + struct pattern_list *pl)
> {
> char *sparse_filename;
> FILE *fp;
> @@ -336,15 +337,15 @@ static int write_patterns_and_update(struct pattern_list *pl)
>
> sparse_filename = get_sparse_checkout_filename();
>
> - if (safe_create_leading_directories(the_repository, sparse_filename))
> + if (safe_create_leading_directories(repo, sparse_filename))
> die(_("failed to create directory for sparse-checkout file"));
>
> hold_lock_file_for_update(&lk, sparse_filename, LOCK_DIE_ON_ERROR);
>
> - result = update_working_directory(pl);
> + result = update_working_directory(repo, pl);
> if (result) {
> rollback_lock_file(&lk);
> - update_working_directory(NULL);
> + update_working_directory(repo, NULL);
> goto out;
> }
But this introduces a new parameter. Both of two instances of
repository struct used in the existing code in this function, other
than references to struct repository *UNUSED, use "r", and with this
patch, the name "repo" becomes more prevanent.
We would probably want to rename "r" to "repo" for consistency in
clean_tracked_sparse_repositories() and update_working_directory(),
but that is better done later after the dust settles and the code
around here becomes quiescent again, not as part of this topic as an
extra churn.
Looking good. Thanks. Will queue.
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> When users change their sparse-checkout definitions to add new
> directories and remove old ones, there may be a few reasons why
> directories no longer in scope remain (ignored or excluded files still
> exist, Windows handles are still open, etc.). When these files still
> exist, the sparse index feature notices that a tracked, but sparse,
> directory still exists on disk and thus the index expands. This causes a
> performance hit _and_ the advice printed isn't very helpful. Using 'git
> clean' isn't enough (generally '-dfx' may be needed) but also this may
> not be sufficient.
>
> Add a new subcommand to 'git sparse-checkout' that removes these
> tracked-but-sparse directories, including any excluded or ignored files
Are excluded files and ignored files form two separate sets, or are
they one and the same? Do files that users forgot to add (e.g. new
source file that would not match any patterns listed in .gitignore)
and object files left over from the previous compilation (most
likely match *.o in .gitignore) treated the same way for the purpose
of determining if the directory that is no longer in the cone can be
removed?
> underneath. This is the most extreme method for doing this, but it works
> when the sparse-checkout is in cone mode and is expected to rescope
> based on directories, not files.
>
> Be sure to add a --dry-run option so users can predict what will be
> deleted. In general, output the directories that are being removed so
> users can know what was removed.
Hmph. It would be safer to show not just the directories but which
excluded files are about to be lost, wouldn't it, especially when
the user is trying to play safe and see what potential damage they
are looking at?
Also even though ignored files are "ignored and expendable", nobody
marks their temporary file as "ignored but precious" (yet), so "it
is listed in .gitignore so we can safely remove it" may not be a
safe assumption for us to be making (yet). Shouldn't we at least be
listing these ignored files in --dry-run output, next to those files
that the user may have forgotten to add?
> Note that untracked directories remain. Further, directories that
> contain staged changes are not deleted. This is a detail that is partly
> hidden by the implementation which relies on collapsing the index to a
> sparse index in-memory and only deleting directories that are listed as
> sparse in the index. If a staged change exists, then that entry is not
> stored as a sparse tree entry and thus remains on-disk until committed
> or reset.
Removing untracked directories is a job for "clean -d", so it makes
sense for this new command not to touch them. Not losing changes
that have already been added is just a bad as losing new files that
the user forgot to add, so it does make sense not to remove them.
I wonder if we need "-x" and/or "-X" options "clean" has (and
perhaps "-d" that is a no-op, as the whole point of this subcommand
is about removing directories from the working tree) to control its
operation a bit finer-grained way.
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
The asterisk sticks to the variable, not the type, i.e.
DIR *dir;
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 7/8/2025 5:20 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> When users change their sparse-checkout definitions to add new
>> directories and remove old ones, there may be a few reasons why
>> directories no longer in scope remain (ignored or excluded files still
>> exist, Windows handles are still open, etc.). When these files still
>> exist, the sparse index feature notices that a tracked, but sparse,
>> directory still exists on disk and thus the index expands. This causes a
>> performance hit _and_ the advice printed isn't very helpful. Using 'git
>> clean' isn't enough (generally '-dfx' may be needed) but also this may
>> not be sufficient.
>>
>> Add a new subcommand to 'git sparse-checkout' that removes these
>> tracked-but-sparse directories, including any excluded or ignored files
>
> Are excluded files and ignored files form two separate sets, or are
> they one and the same? Do files that users forgot to add (e.g. new
> source file that would not match any patterns listed in .gitignore)
> and object files left over from the previous compilation (most
> likely match *.o in .gitignore) treated the same way for the purpose
> of determining if the directory that is no longer in the cone can be
> removed?
I think of them as separate in my head because:
* .gitignore is committed to the repo, and is common to all users of
the repo.
* .git/info/exclude is custom to each user, so users are choosing to
ignore extra files that are atypical from most users.
In the monorepo I'm thinking about, .gitignore files are rather small
because all build output has already been redirected out of the
worktree for performance reasons. Thus, _most_ users don't have this
problem. However, some users add extra excludes for things like vim
files and those get leftover, causing invisible (to 'git status') pain.
>> underneath. This is the most extreme method for doing this, but it works
>> when the sparse-checkout is in cone mode and is expected to rescope
>> based on directories, not files.
>>
>> Be sure to add a --dry-run option so users can predict what will be
>> deleted. In general, output the directories that are being removed so
>> users can know what was removed.
>
> Hmph. It would be safer to show not just the directories but which
> excluded files are about to be lost, wouldn't it, especially when
> the user is trying to play safe and see what potential damage they
> are looking at?
> > Also even though ignored files are "ignored and expendable", nobody
> marks their temporary file as "ignored but precious" (yet), so "it
> is listed in .gitignore so we can safely remove it" may not be a
> safe assumption for us to be making (yet). Shouldn't we at least be
> listing these ignored files in --dry-run output, next to those files
> that the user may have forgotten to add?
I considered this, but mostly behind a potential --verbose option to
list the files that are leftover. Much of the design here is that
these _directories_ are out of scope, skipping over any details about
the contained files, so I thought this directory-based output would
communicate enough information.
A curious user may want to know "why are these directories still
around?" and the more verbose output would assist.
>> Note that untracked directories remain. Further, directories that
>> contain staged changes are not deleted. This is a detail that is partly
>> hidden by the implementation which relies on collapsing the index to a
>> sparse index in-memory and only deleting directories that are listed as
>> sparse in the index. If a staged change exists, then that entry is not
>> stored as a sparse tree entry and thus remains on-disk until committed
>> or reset.
>
> Removing untracked directories is a job for "clean -d", so it makes
> sense for this new command not to touch them. Not losing changes
> that have already been added is just a bad as losing new files that
> the user forgot to add, so it does make sense not to remove them.
>
> I wonder if we need "-x" and/or "-X" options "clean" has (and
> perhaps "-d" that is a no-op, as the whole point of this subcommand
> is about removing directories from the working tree) to control its
> operation a bit finer-grained way.
I'm of two minds here.
My first inclination is "we already have 'git clean' for fine-grained
control of removing ignored/excluded files".
My second inclination is "'git clean' would remove these ignored files
even when they are within the sparse-checkout, so that's too big of a
hammer".
There are a lot of ways to filter the files that would be removed,
but I think that in this case most users are wanting a one-command way
to get their sparse-checkout into a better state.
I'm not making any final statements here. I appreciate all of the
thoughts around which options should be default and which should be
hidden behind options.
Thanks,
-Stolee
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Derrick Stolee <stolee@gmail.com> writes:
> A curious user may want to know "why are these directories still
> around?" and the more verbose output would assist.
Understood. That one is what I was primarily after, as opposed to
"These directories will be gone, as there is nothing interesting or
worth saving", which I find much less interesting (and perhaps
should only be shown with --verbose, as opposed to "this will be
kept even though it is out of cone, as it contains these things that
may worth saving", which I think is something the user would care
more).
>> I wonder if we need "-x" and/or "-X" options "clean" has (and
>> perhaps "-d" that is a no-op, as the whole point of this subcommand
>> is about removing directories from the working tree) to control its
>> operation a bit finer-grained way.
> I'm of two minds here.
Same here, and that is why I said "I wonder" ;-)
@@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files | |||
SYNOPSIS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> When users change their sparse-checkout definitions to add new
> directories and remove old ones, there may be a few reasons why
> directories no longer in scope remain (ignored or excluded files still
> exist, Windows handles are still open, etc.).
Good background; I am still particularly interested in the "etc." part...
> When these files still
> exist, the sparse index feature notices that a tracked, but sparse,
> directory still exists on disk and thus the index expands. This causes a
> performance hit _and_ the advice printed isn't very helpful. Using 'git
> clean' isn't enough (generally '-dfx' may be needed) but also this may
> not be sufficient.
Very well motivated.
> Add a new subcommand to 'git sparse-checkout' that removes these
> tracked-but-sparse directories, including any excluded or ignored files
> underneath.
"including"?
> This is the most extreme method for doing this, but it works
> when the sparse-checkout is in cone mode and is expected to rescope
> based on directories, not files.
So is this also meant for cone mode without sparse index turned on?
What about non-cone mode?
> Be sure to add a --dry-run option so users can predict what will be
> deleted. In general, output the directories that are being removed so
> users can know what was removed.
Is greater fidelity of interest when there are multiple different
types of files contained? For example, "git status" lists individual
files within a directory, unless it find an ignored directory and then
it simply lists the directory. That means we get more fidelity when
it's warranted, and less when it's not. I'm not sure if that's a
perfect analogy, though; it may well be that we don't need the same
kind of fidelity that `git status` provides. (And I'm kind of
guessing it isn't needed, except in error cases, but I'm just asking.)
> Note that untracked directories remain.
What does this mean? If the sparse directory had an untracked
directory within it then it'll be left on disk, you will only clean up
untracked files at a depth of 1 within the sparse directory?
Or that untracked directories not contained within a sparse directory
will be left alone?
> Further, directories that
> contain staged changes are not deleted.
Shouldn't those be safe to delete? When a sparse directory has files
underneath it with staged changes, those roll-up into a staged
sparse-directory tree value, and so we should be able to delete the
file.
In contrast, the files under the sparse directory with unstaged
changes would be problematic to simply remove.
> This is a detail that is partly
> hidden by the implementation which relies on collapsing the index to a
> sparse index in-memory and only deleting directories that are listed as
> sparse in the index. If a staged change exists, then that entry is not
> stored as a sparse tree entry and thus remains on-disk until committed
> or reset.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> Documentation/git-sparse-checkout.adoc | 13 ++++-
> builtin/sparse-checkout.c | 73 +++++++++++++++++++++++++-
> t/t1091-sparse-checkout-builtin.sh | 48 +++++++++++++++++
> 3 files changed, 132 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-sparse-checkout.adoc b/Documentation/git-sparse-checkout.adoc
> index 529a8edd9c1e..21ba6f759905 100644
> --- a/Documentation/git-sparse-checkout.adoc
> +++ b/Documentation/git-sparse-checkout.adoc
> @@ -9,7 +9,7 @@ git-sparse-checkout - Reduce your working tree to a subset of tracked files
> SYNOPSIS
> --------
> [verse]
> -'git sparse-checkout' (init | list | set | add | reapply | disable | check-rules) [<options>]
> +'git sparse-checkout' (init | list | set | add | reapply | disable | check-rules | clean) [<options>]
>
>
> DESCRIPTION
> @@ -111,6 +111,17 @@ flags, with the same meaning as the flags from the `set` command, in order
> to change which sparsity mode you are using without needing to also respecify
> all sparsity paths.
>
> +'clean'::
> + Remove all files in tracked directories that are outside of the
> + sparse-checkout definition.
If literal, this sounds unsafe, particularly if run while resolving
merge or rebase conflicts (since those conflicts may occur in paths
outside the sparse checkout definition).
> + This subcommand requires cone-mode
> + sparse-checkout to be sure that we know which directories are
> + both tracked and all contained paths are not in the sparse-checkout.
> + This command can be used to be sure the sparse index works
> + efficiently.
So...what does it do when in cone mode and the sparse index is not enabled?
> ++
> +The `clean` command can also take the `--dry-run` (`-n`) option to list
> +the directories it would remove without performing any filesystem changes.
> +
> 'disable'::
> Disable the `core.sparseCheckout` config setting, and restore the
> working directory to include all files.
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 8b70d0c6a441..6d2843827367 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -23,7 +23,7 @@
> static const char *empty_base = "";
>
> static char const * const builtin_sparse_checkout_usage[] = {
> - N_("git sparse-checkout (init | list | set | add | reapply | disable | check-rules) [<options>]"),
> + N_("git sparse-checkout (init | list | set | add | reapply | disable | check-rules | clean) [<options>]"),
> NULL
> };
>
> @@ -924,6 +924,76 @@ static int sparse_checkout_reapply(int argc, const char **argv,
> return update_working_directory(repo, NULL);
> }
>
> +static char const * const builtin_sparse_checkout_clean_usage[] = {
> + "git sparse-checkout clean [-n|--dry-run]",
> + NULL
> +};
> +
> +static struct sparse_checkout_clean_opts {
> + int dry_run;
> +} clean_opts;
> +
> +static int sparse_checkout_clean(int argc, const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + struct strbuf full_path = STRBUF_INIT;
> + size_t worktree_len;
> + static struct option builtin_sparse_checkout_clean_options[] = {
> + OPT_BOOL('n', "dry-run", &clean_opts.dry_run,
> + N_("list the directories that would be removed without making filesystem changes")),
> + OPT_END(),
> + };
> +
> + setup_work_tree();
> + if (!core_apply_sparse_checkout)
> + die(_("must be in a sparse-checkout to clean directories"));
> + if (!core_sparse_checkout_cone)
> + die(_("must be in a cone-mode sparse-checkout to clean directories"));
> +
> + argc = parse_options(argc, argv, prefix,
> + builtin_sparse_checkout_clean_options,
> + builtin_sparse_checkout_clean_usage, 0);
> +
> + if (repo_read_index(repo) < 0)
> + die(_("failed to read index"));
> +
> + if (convert_to_sparse(repo->index, SPARSE_INDEX_MEMORY_ONLY))
> + die(_("failed to convert index to a sparse index"));
So, you make the in-memory index sparse; I don't remember the details
on this function so it might invalidate some things I say below...but
after this point you then...
> +
> + strbuf_addstr(&full_path, repo->worktree);
> + strbuf_addch(&full_path, '/');
> + worktree_len = full_path.len;
> +
> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
> + DIR* dir;
> + struct cache_entry *ce = repo->index->cache[i];
> + if (!S_ISSPARSEDIR(ce->ce_mode))
> + continue;
...skip the entries that aren't sparse directories.
> + strbuf_setlen(&full_path, worktree_len);
> + strbuf_add(&full_path, ce->name, ce->ce_namelen);
> +
> + dir = opendir(full_path.buf);
> + if (!dir)
> + continue;
...skip the sparse directories that, as expected, don't exist on disk.
> + else if (ENOENT != errno) {
> + warning_errno(_("failed to check for existence of '%s'"), ce->name);
> + continue;
> + }
> +
> + closedir(dir);
> +
> + printf("%s\n", ce->name);
> + if (!clean_opts.dry_run) {
> + if (remove_dir_recursively(&full_path, 0))
> + warning_errno(_("failed to remove '%s'"), ce->name);
> + }
...and then unconditionally remove the directory, as you stated in the
documentation for this clean option.
I'm worried whether this is safe; if someone does a merge or rebase,
there could be tracked-and-modified/conflicted files outside the
sparse specification in the working tree.
Even after resolving such a merge and committing, the paths may remain
around until the user does a 'git sparse-checkout reapply' (I don't
remember details here, but our documentation for reapply certainly
says so), and since the file might stick around, the user may make
further modifications to such a file.
...or will the convert_to_sparse() call above fail in all these cases?
If it does, should it give a better and more useful error message
than "failed to convert index to a sparse index" and rather e.g. "path
%s has modifications; please stage or revert first"?
> + }
> +
> + strbuf_release(&full_path);
> + return 0;
> +}
> +
> static char const * const builtin_sparse_checkout_disable_usage[] = {
> "git sparse-checkout disable",
> NULL
> @@ -1080,6 +1150,7 @@ int cmd_sparse_checkout(int argc,
> OPT_SUBCOMMAND("set", &fn, sparse_checkout_set),
> OPT_SUBCOMMAND("add", &fn, sparse_checkout_add),
> OPT_SUBCOMMAND("reapply", &fn, sparse_checkout_reapply),
> + OPT_SUBCOMMAND("clean", &fn, sparse_checkout_clean),
> OPT_SUBCOMMAND("disable", &fn, sparse_checkout_disable),
> OPT_SUBCOMMAND("check-rules", &fn, sparse_checkout_check_rules),
> OPT_END(),
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index ab3a105ffff2..7f8a444541f7 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -1050,5 +1050,53 @@ test_expect_success 'check-rules null termination' '
> test_cmp expect actual
> '
>
> +test_expect_success 'clean' '
> + git -C repo sparse-checkout set --cone deep/deeper1 &&
> + mkdir repo/deep/deeper2 repo/folder1 &&
> + touch repo/deep/deeper2/file &&
> + touch repo/folder1/file &&
> +
> + cat >expect <<-\EOF &&
> + deep/deeper2/
> + folder1/
> + EOF
> +
> + git -C repo sparse-checkout clean --dry-run >out &&
> + test_cmp expect out &&
> +
> + test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1 &&
> +
> + git -C repo sparse-checkout clean >out &&
> + test_cmp expect out &&
> +
> + ! test_path_exists repo/deep/deeper2 &&
> + ! test_path_exists repo/folder1
> +'
> +
> +test_expect_success 'clean with staged sparse change' '
> + git -C repo sparse-checkout set --cone deep/deeper1 &&
> + mkdir repo/deep/deeper2 repo/folder1 &&
> + touch repo/deep/deeper2/file &&
> + touch repo/folder1/file &&
> +
> + git -C repo add --sparse folder1/file &&
> +
> + cat >expect <<-\EOF &&
> + deep/deeper2/
> + EOF
> +
> + git -C repo sparse-checkout clean --dry-run >out &&
> + test_cmp expect out &&
> +
> + test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1 &&
> +
> + git -C repo sparse-checkout clean >out &&
> + test_cmp expect out &&
> +
> + ! test_path_exists repo/deep/deeper2 &&
> + test_path_exists repo/folder1
> +'
>
> test_done
> --
> gitgitgadget
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 7/8/2025 5:43 PM, Elijah Newren wrote:
> On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> When users change their sparse-checkout definitions to add new
>> directories and remove old ones, there may be a few reasons why
>> directories no longer in scope remain (ignored or excluded files still
>> exist, Windows handles are still open, etc.).
>
> Good background; I am still particularly interested in the "etc." part...
I listed the cases that I've confirmed to be problems. There are
perhaps some that I'm missing or overlap (such as "I had my terminal
window open on that directory" which is really a handle problem).
>> When these files still
>> exist, the sparse index feature notices that a tracked, but sparse,
>> directory still exists on disk and thus the index expands. This causes a
>> performance hit _and_ the advice printed isn't very helpful. Using 'git
>> clean' isn't enough (generally '-dfx' may be needed) but also this may
>> not be sufficient.
>
> Very well motivated.
>
>> Add a new subcommand to 'git sparse-checkout' that removes these
>> tracked-but-sparse directories, including any excluded or ignored files
>> underneath.
>
> "including"?
Yes. If we leave the ignored files then we have not accomplished our
goal in deleting the sparse directories.
>> This is the most extreme method for doing this, but it works
>> when the sparse-checkout is in cone mode and is expected to rescope
>> based on directories, not files.
>
> So is this also meant for cone mode without sparse index turned on?
> What about non-cone mode?
This command die()s if not in cone mode. We can consider future
changes that perform similar actions in non-cone mode, but I'm
not sure if there is a valuable need in that case.
>> Be sure to add a --dry-run option so users can predict what will be
>> deleted. In general, output the directories that are being removed so
>> users can know what was removed.
>
> Is greater fidelity of interest when there are multiple different
> types of files contained? For example, "git status" lists individual
> files within a directory, unless it find an ignored directory and then
> it simply lists the directory. That means we get more fidelity when
> it's warranted, and less when it's not. I'm not sure if that's a
> perfect analogy, though; it may well be that we don't need the same
> kind of fidelity that `git status` provides. (And I'm kind of
> guessing it isn't needed, except in error cases, but I'm just asking.)
>
>> Note that untracked directories remain.
>
> What does this mean? If the sparse directory had an untracked
> directory within it then it'll be left on disk, you will only clean up
> untracked files at a depth of 1 within the sparse directory?
>
> Or that untracked directories not contained within a sparse directory
> will be left alone?
This second part: "untracked directories not contained within
a sparse directory will remain". This is mostly to point out
that we are not saying "the only directories that remain are
tracked directories within the sparse-checkout" as that could
remove valuable temporary directories that are covered by
.gitignore or exclude files.
>> Further, directories that
>> contain staged changes are not deleted.
>
> Shouldn't those be safe to delete? When a sparse directory has files
> underneath it with staged changes, those roll-up into a staged
> sparse-directory tree value, and so we should be able to delete the
> file.
This is _mostly_ an implementation detail. The sparse index will
not represent this directory as a sparse directory, so it's not
deleted. (see the next paragraph:)
>> This is a detail that is partly
>> hidden by the implementation which relies on collapsing the index to a
>> sparse index in-memory and only deleting directories that are listed as
>> sparse in the index. If a staged change exists, then that entry is not
>> stored as a sparse tree entry and thus remains on-disk until committed
>> or reset.
> In contrast, the files under the sparse directory with unstaged
> changes would be problematic to simply remove.
Except that a user is only using this command when they want
files outside of the sparse-checkout to be deleted.
I'd like to find the right way to make it clear to users who
discover this command that they are asking for the following:
"I changed my sparse-checkout and some directories that I
expected to be deleted are still around. Delete them as I
don't care about them or the files inside anymore."
Some of the discussion around having a --verbose option (in
conjunction with --dry-run) would allow for the following
user scenario:
"I changed my sparse-checkout and some directories that I
expected to be deleted are still around. Which files are
preventing that deletion? I'd like to know what's in the
way so I can evaluate if those files are important to me."
>> +'clean'::
>> + Remove all files in tracked directories that are outside of the
>> + sparse-checkout definition.
>
> If literal, this sounds unsafe, particularly if run while resolving
> merge or rebase conflicts (since those conflicts may occur in paths
> outside the sparse checkout definition).
If we are in a merge-conflict state, the directory is not
collapsed in the sparse index..
>> + This subcommand requires cone-mode
>> + sparse-checkout to be sure that we know which directories are
>> + both tracked and all contained paths are not in the sparse-checkout.
>> + This command can be used to be sure the sparse index works
>> + efficiently.
>
> So...what does it do when in cone mode and the sparse index is not enabled?
It doesn't effect the behavior, since we don't care about the on-disk
format and instead use an in-memory sparse index to determine which
directories to delete.
There could be a benefit for users wanting to clean up extra files in
their worktree even if they are not using a sparse index. It is less
likely that they will discover that they are in that state if they
are not pestered by the index expansion advice message.
>> + if (convert_to_sparse(repo->index, SPARSE_INDEX_MEMORY_ONLY))
>> + die(_("failed to convert index to a sparse index"));
>
> So, you make the in-memory index sparse; I don't remember the details
> on this function so it might invalidate some things I say below...but
> after this point you then...
>
>> +
>> + strbuf_addstr(&full_path, repo->worktree);
>> + strbuf_addch(&full_path, '/');
>> + worktree_len = full_path.len;
>> +
>> + for (size_t i = 0; i < repo->index->cache_nr; i++) {
>> + DIR* dir;
>> + struct cache_entry *ce = repo->index->cache[i];
>> + if (!S_ISSPARSEDIR(ce->ce_mode))
>> + continue;
>
> ...skip the entries that aren't sparse directories.
>
>> + strbuf_setlen(&full_path, worktree_len);
>> + strbuf_add(&full_path, ce->name, ce->ce_namelen);
>> +
>> + dir = opendir(full_path.buf);
>> + if (!dir)
>> + continue;
>
> ...skip the sparse directories that, as expected, don't exist on disk.
>
>> + else if (ENOENT != errno) {
>> + warning_errno(_("failed to check for existence of '%s'"), ce->name);
>> + continue;
>> + }
>> +
>> + closedir(dir);
>> +
>> + printf("%s\n", ce->name);
>> + if (!clean_opts.dry_run) {
>> + if (remove_dir_recursively(&full_path, 0))
>> + warning_errno(_("failed to remove '%s'"), ce->name);
>> + }
>
> ...and then unconditionally remove the directory, as you stated in the
> documentation for this clean option.
>
> I'm worried whether this is safe; if someone does a merge or rebase,
> there could be tracked-and-modified/conflicted files outside the
> sparse specification in the working tree.
The conflicted files will not collapse to sparse directory entries.
Does that ease your concern on that front?
> Even after resolving such a merge and committing, the paths may remain
> around until the user does a 'git sparse-checkout reapply' (I don't
> remember details here, but our documentation for reapply certainly
> says so), and since the file might stick around, the user may make
> further modifications to such a file.
>
> ...or will the convert_to_sparse() call above fail in all these cases?
> If it does, should it give a better and more useful error message
> than "failed to convert index to a sparse index" and rather e.g. "path
> %s has modifications; please stage or revert first"?
It won't fail. It just won't collapse as far.
You do make a good point that there could be extra help messages to say
that there are uncollapsed directories (detectable by seeing a blob path
with the skip-worktree bit on, maybe). I will think on this.
Thanks,
-Stolee
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Wed, Jul 9, 2025 at 9:13 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/8/2025 5:43 PM, Elijah Newren wrote:
> > On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
[...]
> >> Add a new subcommand to 'git sparse-checkout' that removes these
> >> tracked-but-sparse directories, including any excluded or ignored files
> >> underneath.
> >
> > "including"?
>
> Yes. If we leave the ignored files then we have not accomplished our
> goal in deleting the sparse directories.
I understood that; my interest was more in what is being glossed over
with this word, i.e. what else would be removed.
> >> This is the most extreme method for doing this, but it works
> >> when the sparse-checkout is in cone mode and is expected to rescope
> >> based on directories, not files.
> >
> > So is this also meant for cone mode without sparse index turned on?
> > What about non-cone mode?
>
> This command die()s if not in cone mode. We can consider future
> changes that perform similar actions in non-cone mode, but I'm
> not sure if there is a valuable need in that case.
That answers the second question; what about the first -- cone mode
without a sparse index? (Edit: you discuss that below, so I'll add
more there.)
> >> Be sure to add a --dry-run option so users can predict what will be
> >> deleted. In general, output the directories that are being removed so
> >> users can know what was removed.
> >
> > Is greater fidelity of interest when there are multiple different
> > types of files contained? For example, "git status" lists individual
> > files within a directory, unless it find an ignored directory and then
> > it simply lists the directory. That means we get more fidelity when
> > it's warranted, and less when it's not. I'm not sure if that's a
> > perfect analogy, though; it may well be that we don't need the same
> > kind of fidelity that `git status` provides. (And I'm kind of
> > guessing it isn't needed, except in error cases, but I'm just asking.)
> >
> >> Note that untracked directories remain.
> >
> > What does this mean? If the sparse directory had an untracked
> > directory within it then it'll be left on disk, you will only clean up
> > untracked files at a depth of 1 within the sparse directory?
> >
> > Or that untracked directories not contained within a sparse directory
> > will be left alone?
>
> This second part: "untracked directories not contained within
> a sparse directory will remain". This is mostly to point out
> that we are not saying "the only directories that remain are
> tracked directories within the sparse-checkout" as that could
> remove valuable temporary directories that are covered by
> .gitignore or exclude files.
Thanks; could this be clarified in the commit message?
> >> Further, directories that
> >> contain staged changes are not deleted.
> >
> > Shouldn't those be safe to delete? When a sparse directory has files
> > underneath it with staged changes, those roll-up into a staged
> > sparse-directory tree value, and so we should be able to delete the
> > file.
>
> This is _mostly_ an implementation detail. The sparse index will
> not represent this directory as a sparse directory, so it's not
> deleted. (see the next paragraph:)
If there are files with staged changes underneath a directory, I don't
see why the sparse index would not be able to represent that directory
as a sparse directory.
From the implementation, I'm wondering if you meant something entirely
different by this sentence and that "contain" was perhaps a misleading
word choice; in particular, did you mean that when the directory
itself has been replaced in the index with some staged file, then you
leave that staged file/directory alone? If this was the intended
meaning, perhaps we just need some rewording to clarify the commit
message since "contain" makes me think about paths below the
directory, not another path to replace the directory.
And if neither of my guesses are what you meant by this sentence,
please do clue me in.
> >> This is a detail that is partly
> >> hidden by the implementation which relies on collapsing the index to a
> >> sparse index in-memory and only deleting directories that are listed as
> >> sparse in the index. If a staged change exists, then that entry is not
> >> stored as a sparse tree entry and thus remains on-disk until committed
> >> or reset.
> > In contrast, the files under the sparse directory with unstaged
> > changes would be problematic to simply remove.
> Except that a user is only using this command when they want
> files outside of the sparse-checkout to be deleted.
>
> I'd like to find the right way to make it clear to users who
> discover this command that they are asking for the following:
>
> "I changed my sparse-checkout and some directories that I
> expected to be deleted are still around. Delete them as I
> don't care about them or the files inside anymore."
>
> Some of the discussion around having a --verbose option (in
> conjunction with --dry-run) would allow for the following
> user scenario:
>
> "I changed my sparse-checkout and some directories that I
> expected to be deleted are still around. Which files are
> preventing that deletion? I'd like to know what's in the
> way so I can evaluate if those files are important to me."
Yes, and this might even be a status-like output, showing whether the
files are untracked, ignored, tracked-and-unmodified, or
tracked-and-modified.
> >> +'clean'::
> >> + Remove all files in tracked directories that are outside of the
> >> + sparse-checkout definition.
> >
> > If literal, this sounds unsafe, particularly if run while resolving
> > merge or rebase conflicts (since those conflicts may occur in paths
> > outside the sparse checkout definition).
>
> If we are in a merge-conflict state, the directory is not
> collapsed in the sparse index..
Ah, good to know.
> >> + This subcommand requires cone-mode
> >> + sparse-checkout to be sure that we know which directories are
> >> + both tracked and all contained paths are not in the sparse-checkout.
> >> + This command can be used to be sure the sparse index works
> >> + efficiently.
> >
> > So...what does it do when in cone mode and the sparse index is not enabled?
>
> It doesn't effect the behavior, since we don't care about the on-disk
> format and instead use an in-memory sparse index to determine which
> directories to delete.
>
> There could be a benefit for users wanting to clean up extra files in
> their worktree even if they are not using a sparse index. It is less
> likely that they will discover that they are in that state if they
> are not pestered by the index expansion advice message.
Right, for cone mode without the sparse index turned on, this new
subcommand seems to be a silent no-op (other than burning some
computation time), despite the fact that the wording in the manual
might lead users to believe it will do some nice tidying for them.
When the command spends some time working but doesn't do anything and
doesn't report anything, the user might then think the command is just
buggy. I think the command should probably either (a) do some tidying
for them, or (b) give them a warning that the command has no effect
when sparse index is not turned on. Thoughts?
[...]
> >> + printf("%s\n", ce->name);
> >> + if (!clean_opts.dry_run) {
> >> + if (remove_dir_recursively(&full_path, 0))
> >> + warning_errno(_("failed to remove '%s'"), ce->name);
> >> + }
> >
> > ...and then unconditionally remove the directory, as you stated in the
> > documentation for this clean option.
> >
> > I'm worried whether this is safe; if someone does a merge or rebase,
> > there could be tracked-and-modified/conflicted files outside the
> > sparse specification in the working tree.
>
> The conflicted files will not collapse to sparse directory entries.
>
> Does that ease your concern on that front?
Yes, that does ease my concerns...but it doesn't erase them.
If someone resolves the conflicted merge or rebase and commits (long
before running this `git sparse-checkout clean` command), what happens
to those paths? Do these materialized paths persist in the worktree
after the commit? I know they did at some point in our
implementation, and the current wording of `git sparse-checkout
reapply` in the manual certainly suggests such paths may stick around
until the user takes manual action:
Reapply the sparsity pattern rules to paths in the working tree.
Commands like merge or rebase can materialize paths to do their
work (e.g. in order to show you a conflict), and other
sparse-checkout commands might fail to sparsify an individual file
(e.g. because it has unstaged changes or conflicts). In such cases,
it can make sense to run git sparse-checkout reapply later after
cleaning up affected paths (e.g. resolving conflicts, undoing or
committing changes, etc.).
Now, if these paths stick around despite being outside the sparsity
specification, users may decide to modify them. And if they have
modified them and run your command, won't you succeed in collapsing to
a sparse directory tree? And then wouldn't this cause unstaged
changes to be discarded? (I know this is a rare case, because users
would probably only merge or rebase changes they made while in their
sparse-checkout, and thus conflicts would generally be limited to the
sparse-checkout, but there are at least three ways I can think of that
conflicts could be triggered outside their sparse checkout, so I think
it's a realistic scenario that we should think through.)
Throwing away unstaged changes is something that is usually gated
behind a forcing flag (e.g. `git reset --hard` or `git checkout
--force`). I know, we currently also gate removal of untracked files
in `git clean` behind a `--force` flag as well and I'm not concerned
with throwing away untracked files with your command without a forcing
flag, but I'm just wondering if we should be more careful with
unstaged changes than with untracked files.
Perhaps everyone is fine with also throwing away unstaged changes as
part of this command. I could be convinced of that. But if so, that
should at least be called out rather explicitly in the commit message,
the documentation, and the tests, whereas currently your patch is
silent on anything other than untracked and ignored files.
(Or, if my memory is out-of-date about materialized paths persisting
after their conflicts are resolved and a commit happens, then it'd
probably be worth calling that out in the commit message and perhaps
also updating some wording on the reapply subcommand.)
> > Even after resolving such a merge and committing, the paths may remain
> > around until the user does a 'git sparse-checkout reapply' (I don't
> > remember details here, but our documentation for reapply certainly
> > says so), and since the file might stick around, the user may make
> > further modifications to such a file.
> >
> > ...or will the convert_to_sparse() call above fail in all these cases?
> > If it does, should it give a better and more useful error message
> > than "failed to convert index to a sparse index" and rather e.g. "path
> > %s has modifications; please stage or revert first"?
>
> It won't fail. It just won't collapse as far.
Oh! Based on this hint, I went and looked up the code for this; it's
from convert_to_sparse_rec(), right? I see something interesting
there; does the present-despite-skipped checks (from 82386b44963f
(Merge branch 'en/present-despite-skipped', 2022-03-09)) cause this
collapsing to also fail for unstaged entries? I.e. this part of
convert_to_sparse_rec():
if (ce_stage(ce) ||
S_ISGITLINK(ce->ce_mode) ||
!(ce->ce_flags & CE_SKIP_WORKTREE))
can_convert = 0;
The `ce_stage(ce)` part of it is what prevent it from collapsing when
there are conflicts, and I think the `!(ce->ce_flags &
CE_SKIP_WORKTREE))` would prevent it from collapsing any tracked files
whatsoever, whether modified or not, due to the
present-despite-skipped checks. Does that sound right?
In other words, perhaps your clean command as implemented really does
only handle untracked and ignored files, and if the user also has
tracked-but-unmodified or tracked-with-unstaged-changes or
tracked-with-staged-changes then this command won't actually restore
performance for them until they _also_ run `git sparse-checkout
reapply` ?
> You do make a good point that there could be extra help messages to say
> that there are uncollapsed directories (detectable by seeing a blob path
> with the skip-worktree bit on, maybe). I will think on this.
In order to detect this by the skip-worktree bit, you'd probably need
to run it as part of the present-despite-skipped checks mentioned
above (or run it before those checks clear the skip-worktree bit for
present files).
@@ -32,7 +32,8 @@ int give_advice_on_expansion = 1; | |||
"Your working directory likely has contents that are outside of\n" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Jul 8, 2025 at 4:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> In my experience, the most-common reason that the sparse index must
> expand to a full one is because there is some leftover file in a tracked
> directory that is now outside of the sparse-checkout. The new 'git
> sparse-checkout clean' command will find and delete these directories,
> so point users to it when they hit the sparse index expansion advice.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> sparse-index.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index 5634abafaa07..5d14795063b5 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -32,7 +32,8 @@ int give_advice_on_expansion = 1;
> "Your working directory likely has contents that are outside of\n" \
> "your sparse-checkout patterns. Use 'git sparse-checkout list' to\n" \
> "see your sparse-checkout definition and compare it to your working\n" \
> - "directory contents. Running 'git clean' may assist in this cleanup."
> + "directory contents. Running 'git sparse-checkout clean' may assist\n" \
> + "in this cleanup."
>
> struct modify_index_context {
> struct index_state *write;
> --
> gitgitgadget
Makes sense, once we work out any wrinkles with `git sparse-checkout clean`.
On the Git mailing list, Elijah Newren wrote (reply to this): On Tue, Jul 8, 2025 at 1:36 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 4:19 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
[...]
> I'm also curious what happens when (1) you are in cone mode and there
> is no sparse index, or (2) when you are not in cone mode. I suspect
> those and the questions above will be answered as I read the
> individual patches, so I'll keep going...
After reading the series, I know the answer to (2). I think the
answer to (1) is that it effectively turns into a silent (but not
instantaneous) no-op, which may be confusing for users. We might want
to provide them with an alternative implementation, or at least a
warning or error that the mode doesn't (currently?) do anything when
sparse index isn't in use.
Anyway, I think the series is a good direction and you've explained
the motivation very well, but I'm a bit worried the current
implementation might be using too coarse of a hammer. |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> The implementation is rather simple with the current CLI. Future
> augmentations could include a --quiet option to silence the output and a
> --verbose option to list the files that exist within each directory and
> would/will be removed.
I liked the overall idea but this has some interactions with a topic
in flight. 2c5b5565 (environment: remove the global variable
'sparse_expect_files_outside_of_patterns', 2025-07-01). I may have
botched (semantic) conflict resolution but with both merged to
'seen', a few steps in the sparse test seem to fail.
For tonight's integration, I'll leave the topic out of 'seen' so
that we can pass other new topics that we acquired through the CI.
I may re-attempt merging this topic later, or I may eject the other
topic from 'seen' and queue this one first, asking the other topic
to be redone on top. We'll see.
Thanks. |
On the Git mailing list, Derrick Stolee wrote (reply to this): On 7/8/2025 7:41 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> The implementation is rather simple with the current CLI. Future
>> augmentations could include a --quiet option to silence the output and a
>> --verbose option to list the files that exist within each directory and
>> would/will be removed.
>
> I liked the overall idea but this has some interactions with a topic
> in flight. 2c5b5565 (environment: remove the global variable
> 'sparse_expect_files_outside_of_patterns', 2025-07-01). I may have
> botched (semantic) conflict resolution but with both merged to
> 'seen', a few steps in the sparse test seem to fail.
>
> For tonight's integration, I'll leave the topic out of 'seen' so
> that we can pass other new topics that we acquired through the CI.
>
> I may re-attempt merging this topic later, or I may eject the other
> topic from 'seen' and queue this one first, asking the other topic
> to be redone on top. We'll see.
I'll update my next version to be built on top of that one so
we don't need to worry about semantic merge resolutions.
Thanks,
-Stolee
|
When using cone-mode sparse-checkout, users specify which tracked directories they want (recursively) and any directory not part of the parent paths for those directories are considered "out of scope". When changing sparse-checkouts, there are a variety of reasons why these "out of scope" directories could remain, including:
.gitignore
or.git/info/exclude
files that tell Git to not remove files of a certain type.Typically, this would not mean too much for the user experience. A few extra filesystem checks might be required to satisfy
git status
commands, but the scope of the performance hit is relative to how many cruft files are left over in this situation.However, when using the sparse index, these tracked sparse directories cause significant performance issues. When noticing that the index contains a sparse directory but that directory exists on disk, Git needs to expand that sparse directory to determine which files are tracked or untracked. The current mechanism expands the entire index to a full one, an expensive operation that scales with the total number of paths at HEAD and not just the number of cruft files left over.
Advice was added in 9479a31 (advice: warn when sparse index expands, 2024-07-08) to help users determine that they were in this state. However, the advice doesn't actually recommend helpful ways to get out of this state. Recommending "git clean" on its own is incomplete, as typically users actually need 'git clean -dfx' to clear out the ignored or excluded files. Even then, they may need 'git sparse-checkout reapply' afterwards to clear the sparse directories.
The advice was successful in helping to alert users to the problem, which is how I got wind of many of these cases for how users get into this state. It's now time to give them a tool that helps them out of this state.
This series adds a new 'git sparse-checkout clean' command that currently only works for cone-mode sparse-checkouts. The only thing it does is collapse the index to a sparse index (as much as possible) and make sure that any sparse directories are removed. These directories are listed to stdout.
A --dry-run option is available to list the directories that would be removed without actually deleting the directories.
This option would be preferred to something like 'git clean -dfx' since it does not clear the excluded files that are still within the sparse-checkout. Instead, it performs the exact filesystem operations required to refresh the sparse index performance back to what is expected.
I spent a few weeks debating with myself about whether or not this was the right interface, so please suggest alternatives if you have better ideas. Among my rejected ideas include:
The implementation is rather simple with the current CLI. Future augmentations could include a
--quiet
option to silence the output and a--verbose
option to list the files that exist within each directory and would/will be removed.Thanks,
-Stolee
cc: gitster@pobox.com
cc: newren@gmail.com
cc: Patrick Steinhardt ps@pks.im