Windows Version
Microsoft Windows [Version 10.0.20348.3807]
WSL Version
2.5.9.0
Are you using WSL 1 or WSL 2?
- WSL 2
Kernel Version
6.6.87.2-1
Distro Version
Fedora 42, but a custom image built from https://github.com/containers/podman-machine-os/
Other Software
Podman (main branch). We are seeing this in our own Podman upstream CI environment, running the tests on AWS z1d.metal instances (48 CPUs / 384 GB RAM) with Windows Server 2022.
Repro Steps
Unclear; we have not seen any users reporting this so far and have had no luck reproducing it outside CI.
In CI a full run of our WSL test suite takes about 20 minutes. We do at least 61 WSL imports with starts/stops, so that certainly stresses the system a fair amount more than a normal user would; a rough sketch of that workload is shown below.
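For illustration only, here is a minimal Go sketch that approximates the stress pattern of our suite (repeated import, run a command, terminate, unregister). The distro names, paths, and iteration count are placeholders and this is not our actual test code.

```go
// Hypothetical stress loop approximating the CI workload: repeatedly
// import a WSL distro from the machine image, run a command in it,
// then terminate and unregister it. Paths and names are placeholders.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"path/filepath"
)

// wsl runs wsl.exe with the given arguments and returns any error
// together with the combined output.
func wsl(args ...string) error {
	out, err := exec.Command(`C:\Windows\system32\wsl.exe`, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("wsl %v: %w: %s", args, err, out)
	}
	return nil
}

func main() {
	tarball := `Z:\podman-machine.x86_64.wsl.tar` // image used in CI
	baseDir := `Z:\wsl-stress`                    // placeholder work dir

	for i := 0; i < 61; i++ {
		name := fmt.Sprintf("podman-stress-%d", i)
		dir := filepath.Join(baseDir, name)

		// Import, start the distro by running a trivial command,
		// then stop and unregister it again.
		if err := wsl("--import", name, dir, tarball, "--version", "2"); err != nil {
			log.Fatal(err)
		}
		if err := wsl("-u", "root", "-d", name, "true"); err != nil {
			log.Fatal(err)
		}
		if err := wsl("--terminate", name); err != nil {
			log.Fatal(err)
		}
		if err := wsl("--unregister", name); err != nil {
			log.Fatal(err)
		}
	}
}
```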
Expected Behavior
WSL should run stably and not randomly fail with various different errors.
We have always had flaky WSL tests, but this has been getting much worse recently; we are tracking the full problem in containers/podman#26547.
As far as I can tell, it has been much worse since we updated from WSL 2.4.13.0 to 2.5.9.0.
Actual Behavior
Issues seen in our CI logs:
C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar foo1
Importing operating system into WSL (this may take a few minutes on a new WSL install)...
The operation completed successfully.
Configuring system...
The operation timed out because a response was not received from the virtual machine or container.
Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
Error: could not create root authorized keys on guest OS: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-foo1 sh -c mkdir -p /root/.ssh;cat >> /root/.ssh/authorized_keys; chmod 600 /root/.ssh/authorized_keys] failed: exit status 0xffffffff
C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar a46a1b20d95c
Importing operating system into WSL (this may take a few minutes on a new WSL install)...
Failed to attach disk '\\?\Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c\ext4.vhdx' to WSL2: Error: 0x80041001
Error code: Wsl/Service/RegisterDistro/MountDisk/HCS/VM_E_INVALID_STATE
Error: the WSL import of guest OS failed: command C:\Windows\system32\wsl.exe [wsl --import podman-a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\a46a1b20d95c-amd64 --version 2] failed: exit status 0xffffffff
https://api.cirrus-ci.com/v1/task/5559458543173632
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5559458543173632/wsl_logs/WslLogs2025-07-24_04-41-02.zip
The operation timed out because a response was not received from the virtual machine or container.
Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
Error: the WSL bootstrap script failed: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-9c15505e812e /root/bootstrap] failed: exit status 0xffffffff
https://cirrus-ci.com/task/6166253468909568
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/6166253468909568/wsl_logs/WslLogs2025-07-08_10-14-52.zip
Afterwards, all following tests just time out, as every command seems to hang indefinitely (we have a per-command timeout of 10 minutes).
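For context, this is roughly the kind of per-command timeout involved, sketched in Go with a context deadline; it is not the actual harness code, and the wsl.exe arguments are only an example. A hung invocation surfaces as a 10-minute timeout rather than a visible WSL error.

```go
// Minimal sketch of a per-command timeout around wsl.exe invocations.
// Placeholder distro name and arguments; not the real CI harness.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a command and kills it if it exceeds the timeout.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("command timed out after %s: %s %v", timeout, name, args)
	}
	if err != nil {
		return fmt.Errorf("%s %v failed: %w: %s", name, args, err, out)
	}
	return nil
}

func main() {
	// Any wsl.exe call that hangs indefinitely is killed after 10 minutes.
	err := runWithTimeout(10*time.Minute,
		`C:\Windows\system32\wsl.exe`, "-d", "podman-foo1", "true")
	if err != nil {
		fmt.Println(err)
	}
}
```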
We also often observe timeouts without any visible WSL error:
https://cirrus-ci.com/task/5782833148461056
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5782833148461056/wsl_logs/WslLogs2025-07-07_18-05-50.zip
Diagnostic Logs
See the WSL logs above; I have uploaded them here directly, as our CI logs are deleted after 90 days.
WslLogs2025-07-24_04-41-02.zip
WslLogs2025-07-08_10-14-52.zip
WslLogs2025-07-07_18-05-50.zip
Note that these capture the full CI runs. We had to tweak the log collection script to make it work in the CI environment; see containers/podman#26568.
Since we enabled these log captures, the failure rate seems to be much lower than before, so I think it is possible that the logging itself changes the timing enough to make the issues less likely.