WIP: Refactor pipeline downloads command #3634

base: dev

Conversation
Current failing tests are not due to this PR, I am fixing them in a separate one: https://github.com/nf-core/tools/actions/runs/16113859276/job/45463246665?pr=3634

@mirpedrol Thanks for letting me know!
Thanks a lot for your tremendous work! This looks already very promising and I only had some minor remarks.
(I admittedly skipped over the `mock_pipeline_containers` files and only skimmed the changed tests. But it looks as if you have compiled an impressive test suite for all the new features, so I am convinced that the code works as expected.)
@@ -416,10 +416,10 @@ def command_pipelines_lint(
)
@click.option(
    "-d",
    "--parallel-downloads",
I think @mirpedrol and @mashehu are quite strict on renaming existing CLI arguments, since breaking changes would, according to SemVer, require a new major version release (4.0.0). But maybe Click would map a `--parallel-downloads` automatically to `--parallel`, so it is backwards compatible?

Apart from that, CLI parameters in download are mostly standardized in a way that the first letter of the last term is their short flag: `-d` / `--parallel-downloads`, `-s` / `--container-system`, `-i` / `--container-cache-index` and others.

What was the reason you wish to rename? As far as I remember, it used to be `-p` and `--parallel` in 2.x.x and was specifically changed because Seqera went from Tower to Platform, and we needed to free the short letter `-p`.
self.check_and_set_implementation()

@abstractmethod
Nice!
I didn't know about the `@abstractmethod` decorator before, but that is exactly what was missing in the `SyncedRepo` class to enforce a consistent API of the subclasses. Great that it is so cleanly implemented here!
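For readers who have not used it before, this is roughly how `abc.ABC` and `@abstractmethod` enforce a consistent subclass interface. The class and method names are only illustrative and do not reflect the actual API of this PR:

```python
from abc import ABC, abstractmethod


class ContainerFetcher(ABC):
    """Toy base class: every subclass must implement fetch()."""

    @abstractmethod
    def fetch(self, container: str, output_path: str) -> None:
        """Download `container` to `output_path`."""


class SingularityFetcher(ContainerFetcher):
    def fetch(self, container: str, output_path: str) -> None:
        print(f"singularity pull {output_path} {container}")


# Instantiating ContainerFetcher directly, or a subclass that forgot to
# implement fetch(), raises TypeError - that is what enforces the shared API.
SingularityFetcher().fetch("docker://ubuntu:22.04", "ubuntu.sif")
```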
continue

# Generate file paths for all three locations
output_path = os.path.join(output_dir, container_filename)
I remember that @mirpedrol wanted to switch to `pathlib` instead of `os.path`.
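For comparison, the `os.path.join` line above maps to `pathlib` roughly as follows; the values assigned here are made up for the sketch:

```python
from pathlib import Path

output_dir = Path("singularity-images")  # illustrative value
container_filename = "ubuntu-22.04.img"  # illustrative value

# pathlib equivalent of os.path.join(output_dir, container_filename)
output_path = output_dir / container_filename

# existence checks and directory creation move from os/os.path to Path methods
if not output_path.exists():
    output_path.parent.mkdir(parents=True, exist_ok=True)
```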
containers_copy.append((container, library_path, output_path))
# update the cache if needed
if cache_path and not self.amend_cachedir and not os.path.exists(cache_path):
    containers_copy.append((container, library_path, cache_path))
Please double-check this logic. If `not self.amend_cachedir`, I will copy the container from the `library_path` to the `cache_path`?

Also, we should ask @muffato, who is actually using a library directory, if it is a desirable action to copy images from there to the cache and thus duplicate them?
# download into the cache
containers_remote_fetch.append((container, cache_path))
# only copy to the output if we are not amending the cache
if not self.amend_cachedir:
Same here. Maybe I am misunderstanding how you use this variable, but to me, `self.amend_cachedir=True` means: "Yes, add new containers to the cache!", and also that the `output_path` is the `cache_path`, because the user wants to use their cache as the final location of the image.
    extension = ".sif"
    container_fn = container_fn.replace(".sif:", "-")
elif container_fn.endswith(".sif"):
    extension = ".sif"
extension = ".sif"
for container, output_path in containers_pull:
    # it is possible to try multiple registries / mirrors if multiple were specified.
    # Iteration happens over a copy of self.container_library[:], as I want to be able to remove failing registries for subsequent images.
    for library in self.container_library[:]:
Would using `itertools.product` here simplify the loop logic? Unless I am missing something, this here is also a `for...else` case, and if the new dependency is introduced for the `SingularityError`, one might as well use it here?
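One caveat: flattening the two loops with `itertools.product` would arguably make the per-image `for...else` fallback harder to express. A minimal sketch of just the `for...else` pattern mentioned here, with a hypothetical `pull_image` callback standing in for the real pull logic:

```python
from collections.abc import Callable, Iterable


def pull_with_fallback(
    containers_pull: Iterable[tuple[str, str]],
    container_library: list[str],
    pull_image: Callable[[str, str, str], None],  # hypothetical pull callback
) -> None:
    """Try each registry in turn; give up on an image only if every registry failed."""
    for container, output_path in containers_pull:
        # iterate over a copy so that failing registries can be dropped for later images
        for library in list(container_library):
            try:
                pull_image(container, output_path, library)
            except RuntimeError:
                container_library.remove(library)
                continue
            break  # success: stop trying further registries for this image
        else:
            # the inner loop never hit `break`: every registry failed for this image
            raise RuntimeError(f"Could not pull {container} from any registry")
```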
# Check that the Nextflow version >= the minimal version required
# This is used to ensure that we can run `nextflow inspect`
def check_nextflow_version(minimal_nxf_version: tuple[int, int, int], silent=False) -> bool:
I wonder if those functions should rather be in the generic and not the download-specific utils file? To me, it seems they are useful for other tools as well.
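For illustration, a generic check of this shape might look like the sketch below. This is not the PR's implementation, and the parsing of the `nextflow -version` output is an assumption:

```python
import re
import shutil
import subprocess


def check_nextflow_version(minimal_nxf_version: tuple[int, int, int], silent: bool = False) -> bool:
    """Return True if the locally installed Nextflow is at least `minimal_nxf_version`.

    Sketch only: assumes `nextflow -version` prints something containing e.g. '24.10.1'.
    """
    if shutil.which("nextflow") is None:
        if not silent:
            print("Nextflow not found on PATH")
        return False
    output = subprocess.run(["nextflow", "-version"], capture_output=True, text=True, check=False).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return False
    return tuple(int(part) for part in match.groups()) >= minimal_nxf_version
```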
""" | ||


@contextlib.contextmanager
A somewhat confusing order in this file. At first, I thought that these would be methods of the `DownloadError` class.
self.remove_task(task)


class FileDownloader:
I feel that there is still too much heterogeneity between how the `SingularityFetcher` (using this class) and the `DockerFetcher` (with its own future handling) are built.

I am fine with not addressing this in this PR, as it is already such a tremendous improvement that I would love to see it merged to put the future cornerstones in place, but it would be good to create an issue about this then.
This is a draft PR for refactoring `nf-core pipelines download` for readability and to use the `nextflow inspect` command for container detection. It builds upon the excellent work of @MatthiasZepper, @muffato (#3509), @JulianFlesch (refactor/pipeline-download), and @nikhil (#3517).

Integrated changes from contributors:

- Container detection with `nextflow inspect`
- `SingularityFetcher` class for fetching singularity images
- `DockerFetcher` class for fetching docker images

Added changes
- `download.py`: the `WorkflowRepo` class. I have not made significant changes to the code (yet).
- `download/utils.py`: I will remove it entirely once we have tested the `nextflow inspect` command properly.
- `nextflow inspect`: The only previously supported case where `nextflow inspect` fails is when there is a variable in the string which is not currently available. This was used in the `star_align` module of rnaseq 3.7 (which is old and cannot be used with `nextflow inspect` anyway, since it requires input parameters). The output from `nextflow inspect` will then contain a `null`. Perhaps `nextflow inspect` should issue a warning here, rather than require that we capture this downstream?
- `SingularityFetcher` and `DockerFetcher`: I've created a superclass `ContainerFetcher` with a coherent interface and some code sharing.
- `DockerFetcher` for downloading docker images. See below.
- Renamed `--parallel-downloads` to `--parallel` to better reflect the processing of both `docker` and `singularity` images. See docker concurrency below.

Discussion points related to this
`nextflow inspect` works nicely overall. However, some older pipelines require that we specify input parameters, which we then also would have to do for `nextflow inspect`. I suspect that this change happened when the `lib/` folder was removed -- is this correct? It is in principle possible to pass (dummy or actual) input parameters with the `-params-file` flag, but I am not sure if it is desired and necessary for current pipelines.

Left to do

Tests
The `nextflow inspect` command is required to run on a full pipeline: the command uses a Nextflow file as the entry point by default and then checks what modules are imported in any workflows or subworkflows that are used. This means that our current tests, which are based on stripped modules, do not work. I have therefore added a pipeline skeleton in the test data "`mock_pipeline_containers`" with the following features:

- The modules tested with `nextflow inspect` are located in `modules/local/` -- I have structured it so that it mirrors the typical structure of local modules in an nf-core pipeline. The tested URIs are taken from the containers present in `mock_module_containers`.
- `passing` contains modules where the container strings are correctly captured by `nextflow inspect`.
- `failing` contains modules where the container strings are not captured correctly by `nextflow inspect`. It contains a single module at the moment, `mock_dsl2_variable.nf`, where the container string is not correctly resolved, presumably since `nextflow inspect` does not run any code in the `script` section, meaning that `container_id` is resolved to `null`. This is added to ensure that we issue nice error messages for weird container strings.
- `nextflow inspect` can be run on either of the entrypoints `main_passing_test.nf` or `main_failing_test.nf`.
- The tests are `tests/download/DownloadTest.test_containers_pipeline_<container>`, where `<container>` is either `singularity` or `docker`. In these tests the output from `nextflow inspect . -profile <container>` is compared to the true list of containers, which is kept in `mock_pipeline_containers/per_profile_output` as a JSON file.
- The skeleton is based on the `nf-core create` command -- we could definitely make this folder leaner if that is desired. Any pointers related to this are very welcome!

Once we are ready to remove the legacy regex code we will be able to remove the old test data folder for module container definitions, `mock_module_containers`.
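A rough sketch of how such a comparison could be written. This is not the PR's test code; the JSON structure returned by `nextflow inspect` (a top-level `processes` list with `container` fields) and the naming of the expected-output file are assumptions to verify against the real data:

```python
import json
import subprocess
from pathlib import Path


def containers_from_inspect(pipeline_dir: Path, profile: str) -> set[str]:
    """Collect the container URIs reported by `nextflow inspect` for one profile."""
    result = subprocess.run(
        ["nextflow", "inspect", ".", "-profile", profile],
        cwd=pipeline_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    report = json.loads(result.stdout)  # assumed structure, see note above
    return {proc["container"] for proc in report.get("processes", []) if proc.get("container")}


def test_containers_pipeline(pipeline_dir: Path, profile: str) -> None:
    expected_file = pipeline_dir / "per_profile_output" / f"{profile}.json"  # illustrative file name
    expected = set(json.loads(expected_file.read_text()))
    assert containers_from_inspect(pipeline_dir, profile) == expected
```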
Discussion points related to this:

- `nextflow inspect` will capture them if they are configured correctly, but I am not sure if it is desired behavior.

Downloading docker images
Support for `docker` containers is added in this PR. It works slightly differently from the support for `singularity` containers:

- Images are pulled with `docker image pull`
- Images are saved to a `tar` archive with `docker image save`, to be transferred to the offline machine

Concurrency
Parallel pulls do not make sense since there might be layer dependence between different images. However, parallel saves do make sense, since the images are only read when making the tar archive. In the code I use a single `ThreadPool` where the pulling of images is done sequentially, and the remaining threads are used for converting the images to `tar`. See how this is done in the following lines.
PR checklist

- `CHANGELOG.md` is updated
- `docs` is updated