WIP: Refactor pipeline downloads command #3634

base: dev

Conversation
Current failing tests are not due to this PR, I am fixing them in a separate one: https://github.com/nf-core/tools/actions/runs/16113859276/job/45463246665?pr=3634

@mirpedrol Thanks for letting me know!
Thanks a lot for your tremendous work! This looks already very promising and I only had some minor remarks.
(I admittedly skipped over the `mock_pipeline_containers` files and only skimmed the changed tests. But it looks as if you have compiled an impressive test suite for all the new features, so I am convinced that the code works as expected.)
@@ -416,10 +416,10 @@ def command_pipelines_lint(
)
@click.option(
    "-d",
    "--parallel-downloads",
I think @mirpedrol and @mashehu are quite strict on renaming existing CLI arguments, since breaking changes would, according to SemVer, require a new major version release (4.0.0). But maybe Click would map a `--parallel-downloads` automatically to `--parallel`, so it is backwards compatible?

Apart from that, CLI parameters in download are mostly standardized in a way that the first letter of the last term is their short flag: `-d` / `--parallel-downloads`, `-s` / `--container-system`, `-i` / `--container-cache-index` and others.

What was the reason you wish to rename? As far as I remember, it used to be `-p` and `--parallel` in 2.x.x and was specifically changed because Seqera went from Tower to Platform, and we needed to free the short letter `-p`.
self.check_and_set_implementation()

@abstractmethod
Nice!
I didn't know about the `@abstractmethod` decorator before, but that is exactly what was missing in the `SyncedRepo` class to enforce a consistent API of the subclasses. Great that it is so cleanly implemented here!
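For readers who have not used it before, this is roughly how `abc.ABC` and `@abstractmethod` enforce a consistent subclass interface. The class and method names are only illustrative and do not reflect the actual API of this PR:

```python
from abc import ABC, abstractmethod


class ContainerFetcher(ABC):
    """Toy base class: every subclass must implement fetch()."""

    @abstractmethod
    def fetch(self, container: str, output_path: str) -> None:
        """Download `container` to `output_path`."""


class SingularityFetcher(ContainerFetcher):
    def fetch(self, container: str, output_path: str) -> None:
        print(f"singularity pull {output_path} {container}")


# Instantiating ContainerFetcher directly, or a subclass that forgot to
# implement fetch(), raises TypeError - that is what enforces the shared API.
SingularityFetcher().fetch("docker://ubuntu:22.04", "ubuntu.sif")
```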
continue

# Generate file paths for all three locations
output_path = os.path.join(output_dir, container_filename)
I remember that @mirpedrol wanted to switch to `pathlib` instead of `os.path`.
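For comparison, the `os.path.join` line above maps to `pathlib` roughly as follows; the values assigned here are made up for the sketch:

```python
from pathlib import Path

output_dir = Path("singularity-images")  # illustrative value
container_filename = "ubuntu-22.04.img"  # illustrative value

# pathlib equivalent of os.path.join(output_dir, container_filename)
output_path = output_dir / container_filename

# existence checks and directory creation move from os/os.path to Path methods
if not output_path.exists():
    output_path.parent.mkdir(parents=True, exist_ok=True)
```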
containers_copy.append((container, library_path, output_path))
# update the cache if needed
if cache_path and not self.amend_cachedir and not os.path.exists(cache_path):
    containers_copy.append((container, library_path, cache_path))
Please double-check this logic. If `not self.amend_cachedir`, I will copy the container from the `library_path` to the `cache_path`?

Also, we should ask @muffato, who is actually using a library directory, if it is a desirable action to copy images from there to the cache and thus duplicate them?
# download into the cache
containers_remote_fetch.append((container, cache_path))
# only copy to the output if we are not amending the cache
if not self.amend_cachedir:
Same here. Maybe I am misunderstanding how you use this variable, but to me, `self.amend_cachedir=True` means: "Yes, add new containers to the cache!", and also that the `output_path` is the `cache_path`, because the user wants to use their cache as the final location of the image.
    extension = ".sif"
    container_fn = container_fn.replace(".sif:", "-")
elif container_fn.endswith(".sif"):
    extension = ".sif"
extension = ".sif"
for container, output_path in containers_pull:
    # it is possible to try multiple registries / mirrors if multiple were specified.
    # Iteration happens over a copy of self.container_library[:], as I want to be able to remove failing registries for subsequent images.
    for library in self.container_library[:]:
Would using `itertools.product` here simplify the loop logic? Unless I am missing something, this here is also a `for...else` case, and if the new dependency is introduced for the `SingularityError`, one might as well use it here?
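One caveat: flattening the two loops with `itertools.product` would arguably make the per-image `for...else` fallback harder to express. A minimal sketch of just the `for...else` pattern mentioned here, with a hypothetical `pull_image` callback standing in for the real pull logic:

```python
from collections.abc import Callable, Iterable


def pull_with_fallback(
    containers_pull: Iterable[tuple[str, str]],
    container_library: list[str],
    pull_image: Callable[[str, str, str], None],  # hypothetical pull callback
) -> None:
    """Try each registry in turn; give up on an image only if every registry failed."""
    for container, output_path in containers_pull:
        # iterate over a copy so that failing registries can be dropped for later images
        for library in list(container_library):
            try:
                pull_image(container, output_path, library)
            except RuntimeError:
                container_library.remove(library)
                continue
            break  # success: stop trying further registries for this image
        else:
            # the inner loop never hit `break`: every registry failed for this image
            raise RuntimeError(f"Could not pull {container} from any registry")
```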
# Check that the Nextflow version >= the minimal version required
# This is used to ensure that we can run `nextflow inspect`
def check_nextflow_version(minimal_nxf_version: tuple[int, int, int], silent=False) -> bool:
I wonder if those functions should rather be in the generic and not the download-specific utils file? To me, it seems they are useful for other tools as well.
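For illustration, a generic check of this shape might look like the sketch below. This is not the PR's implementation, and the parsing of the `nextflow -version` output is an assumption:

```python
import re
import shutil
import subprocess


def check_nextflow_version(minimal_nxf_version: tuple[int, int, int], silent: bool = False) -> bool:
    """Return True if the locally installed Nextflow is at least `minimal_nxf_version`.

    Sketch only: assumes `nextflow -version` prints something containing e.g. '24.10.1'.
    """
    if shutil.which("nextflow") is None:
        if not silent:
            print("Nextflow not found on PATH")
        return False
    output = subprocess.run(["nextflow", "-version"], capture_output=True, text=True, check=False).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) >= minimal_nxf_version
```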
""" | ||


@contextlib.contextmanager
A somewhat confusing order in this file. At first, I thought that these would be methods of the `DownloadError` class.
self.remove_task(task)


class FileDownloader:
I feel that there is still too much heterogeneity between how the `SingularityFetcher` (using this class) and the `DockerFetcher` (with its own future handling) are built.

I am fine with not addressing this in this PR, as it is already such a tremendous improvement that I would love to see it merged to put the future cornerstones in place, but it would be good to create an issue about this then.
This is a draft PR for refactoring `nf-core pipelines download` for readability and to use the `nextflow inspect` command for container detection. It builds upon the excellent work of @MatthiasZepper, @muffato (#3509), @JulianFlesch (refactor/pipeline-download), and @nikhil (#3517).

Integrated changes from contributors:

- Container detection with `nextflow inspect`
- `SingularityFetcher` class for fetching singularity images
- `DockerFetcher` class for fetching docker images

Added changes
- `download.py`: the `WorkflowRepo` class. I have not made significant changes to the code (yet).
- `download/utils.py`: I will remove it entirely once we have tested the `nextflow inspect` command properly.
- `nextflow inspect`: The only previously supported case where `nextflow inspect` fails is when there is a variable in the string which is not currently available. This was used in the `star_align` module of rnaseq 3.7 (which is old and cannot be used with `nextflow inspect` anyway, since it requires input parameters). The output from `nextflow inspect` will then contain a `null`. Perhaps `nextflow inspect` should issue a warning here, rather than require that we capture this downstream?
- `SingularityFetcher` and `DockerFetcher`: I've created a superclass `ContainerFetcher` with a coherent interface and some code sharing.
- `DockerFetcher` for downloading docker images. See below.
- Renamed `--parallel-downloads` to `--parallel` to better reflect the processing of both `docker` and `singularity` images. See docker concurrency below.

Discussion points related to this
`nextflow inspect` works nicely overall. However, some older pipelines require that we specify input parameters, which we then also would have to do for `nextflow inspect`. I suspect that this change happened when the `lib/` folder was removed -- is this correct? It is in principle possible to pass (dummy or actual) input parameters with the `-params-file` flag, but I am not sure if it is desired and necessary for current pipelines.

Left to do

Tests
The `nextflow inspect` command is required to run on a full pipeline: the command uses a Nextflow file as the entry point by default and then checks what modules are imported in any workflows or subworkflows that are used. This means that our current tests, which are based on stripped modules, do not work. I have therefore added a pipeline skeleton in the test data "`mock_pipeline_containers`" with the following features:

- The modules tested with `nextflow inspect` are located in `modules/local/` -- I have structured it so that it mirrors the typical structure of local modules in an nf-core pipeline. The tested URIs are taken from the containers present in `mock_module_containers`.
- `passing` contains modules where the container strings are correctly captured by `nextflow inspect`.
- `failing` contains modules where the container strings are not captured correctly by `nextflow inspect`. It contains a single module at the moment, `mock_dsl2_variable.nf`, where the container string is not correctly resolved, presumably since `nextflow inspect` does not run any code in the `script` section, meaning that `container_id` is resolved to `null`. This is added to ensure that we issue nice error messages for weird container strings.
- `nextflow inspect` can be run on either of the entrypoints `main_passing_test.nf` or `main_failing_test.nf`.
- The tests are `tests/download/DownloadTest.test_containers_pipeline_<container>`, where `<container>` is either `singularity` or `docker`. In these tests the output from `nextflow inspect . -profile <container>` is compared to the true list of containers, which is kept in `mock_pipeline_containers/per_profile_output` as a JSON file.
- The skeleton is based on the `nf-core create` command -- we could definitely make this folder leaner if that is desired. Any pointers related to this are very welcome!

Once we are ready to remove the legacy regex code we will be able to remove the old test data folder for module container definitions, `mock_module_containers`.
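A rough sketch of how such a comparison could be written. This is not the PR's test code; the JSON structure returned by `nextflow inspect` (a top-level `processes` list with `container` fields) and the naming of the expected-output file are assumptions to verify against the real data:

```python
import json
import subprocess
from pathlib import Path


def containers_from_inspect(pipeline_dir: Path, profile: str) -> set[str]:
    """Collect the container URIs reported by `nextflow inspect` for one profile."""
    result = subprocess.run(
        ["nextflow", "inspect", ".", "-profile", profile],
        cwd=pipeline_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    report = json.loads(result.stdout)  # assumed structure, see note above
    return {proc["container"] for proc in report.get("processes", []) if proc.get("container")}


def test_containers_pipeline(pipeline_dir: Path, profile: str) -> None:
    expected_file = pipeline_dir / "per_profile_output" / f"{profile}.json"  # illustrative file name
    expected = set(json.loads(expected_file.read_text()))
    assert containers_from_inspect(pipeline_dir, profile) == expected
```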
Discussion points related to this:

- `nextflow inspect` will capture them if they are configured correctly, but I am not sure if it is desired behavior.

Downloading docker images
Support for `docker` containers is added in this PR. It works slightly differently from the support for `singularity` containers:

- Images are pulled with `docker image pull`
- Images are saved to a `tar` archive with `docker image save`, to be transferred to the offline machine

Concurrency
Parallel pulls do not make sense since there might be layer dependence between different images. However, parallel saves do make sense, since the images are only read when making the tar archive. In the code I use a single `ThreadPool` where the pulling of images is done sequentially, and the remaining threads are used for converting the images to `tar`. See how this is done in the following lines.
PR checklist

- `CHANGELOG.md` is updated
- `docs` is updated