
WSL flaky and unstable in CI environment #13301

@Luap99

Description

Windows Version

Microsoft Windows [Version 10.0.20348.3807]

WSL Version

2.5.9.0

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

6.6.87.2-1

Distro Version

Fedora 42, but a custom build from https://github.com/containers/podman-machine-os/

Other Software

Podman (main branch). We are seeing this in our own upstream Podman CI environment. The tests run on AWS z1d.metal instances (48 CPUs / 384 GB) with Windows Server 2022.

Repro Steps

Unclear. So far no users have reported this to us, and we have had no luck reproducing it outside CI.
In CI, a full run of our WSL test suite takes about 20 minutes. We do at least 61 WSL imports with starts/stops, so that certainly stresses the system a fair amount more than a normal user would.
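
For anyone trying to reproduce this outside our CI, the import/start/stop pattern described above can be approximated with a small loop. This is only a rough sketch, not our actual test suite: the `stress-N` distro names, the `cycle_commands` and `stress` helpers, and the choice of `true` as a no-op start command are assumptions made for illustration. Only the `wsl --import ... --version 2`, `wsl -u root -d ...`, `wsl --terminate` and `wsl --unregister` invocations are real `wsl.exe` commands.

```python
import subprocess
from typing import Callable, List, Tuple


def cycle_commands(name: str, install_dir: str, tarball: str) -> List[List[str]]:
    """Return the wsl.exe invocations for one import/start/stop/unregister cycle."""
    return [
        # Import the image as a WSL 2 distro, like `podman machine init` does.
        ["wsl", "--import", name, install_dir, tarball, "--version", "2"],
        # Start the distro by running a no-op command as root (assumed stand-in
        # for the real bootstrap steps).
        ["wsl", "-u", "root", "-d", name, "true"],
        # Stop the distro again.
        ["wsl", "--terminate", name],
        # Remove the distro and its disk.
        ["wsl", "--unregister", name],
    ]


def stress(
    iterations: int,
    tarball: str,
    base_dir: str,
    run: Callable = subprocess.run,
    timeout: int = 600,  # 10 minutes, matching the command timeout in our CI
) -> List[Tuple[int, List[str], object]]:
    """Run `iterations` cycles and return (iteration, command, reason) per failure."""
    failures = []
    for i in range(iterations):
        name = f"stress-{i}"  # hypothetical distro name
        for cmd in cycle_commands(name, f"{base_dir}\\{name}", tarball):
            try:
                result = run(cmd, capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                failures.append((i, cmd, "timeout"))
                break  # skip the rest of this cycle, move on to the next distro
            if result.returncode != 0:
                failures.append((i, cmd, result.returncode))
                break
    return failures
```

Calling something like `stress(61, r"Z:\podman-machine.x86_64.wsl.tar", r"Z:\stress")` would roughly match the number of imports in one CI run. Note that a failed cycle is abandoned without cleanup, so leftover distros may need to be unregistered by hand.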

Expected Behavior

WSL should run stably and not randomly fail with a variety of different errors.

We have always had flaky WSL tests, but this has gotten much worse recently. We are tracking the full problem in containers/podman#26547.

As far as I can tell, it has been much worse since we updated from WSL 2.4.13.0 to 2.5.9.0.

Actual Behavior

Issues seen in our CI logs:

  C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar foo1
  Importing operating system into WSL (this may take a few minutes on a new WSL install)...
  The operation completed successfully. 
  Configuring system...
  The operation timed out because a response was not received from the virtual machine or container. 
  Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
  Error: could not create root authorized keys on guest OS: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-foo1 sh -c mkdir -p /root/.ssh;cat >> /root/.ssh/authorized_keys; chmod 600 /root/.ssh/authorized_keys] failed: exit status 0xffffffff
  C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar a46a1b20d95c
  Importing operating system into WSL (this may take a few minutes on a new WSL install)...
  Failed to attach disk '\\?\Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c\ext4.vhdx' to WSL2: Error: 0x80041001
  Error code: Wsl/Service/RegisterDistro/MountDisk/HCS/VM_E_INVALID_STATE
  Error: the WSL import of guest OS failed: command C:\Windows\system32\wsl.exe [wsl --import podman-a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\a46a1b20d95c-amd64 --version 2] failed: exit status 0xffffffff

https://api.cirrus-ci.com/v1/task/5559458543173632
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5559458543173632/wsl_logs/WslLogs2025-07-24_04-41-02.zip


  The operation timed out because a response was not received from the virtual machine or container. 
  Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
  Error: the WSL bootstrap script failed: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-9c15505e812e /root/bootstrap] failed: exit status 0xffffffff

https://cirrus-ci.com/task/6166253468909568
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/6166253468909568/wsl_logs/WslLogs2025-07-08_10-14-52.zip

Afterwards, all following tests time out because every command seems to hang indefinitely (we have a command timeout of 10 minutes).


We also often see plain timeouts without any visible WSL error:

https://cirrus-ci.com/task/5782833148461056
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5782833148461056/wsl_logs/WslLogs2025-07-07_18-05-50.zip
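
To get a rough picture of how often each failure mode shows up, the `Error code: Wsl/...` lines in a CI log can be tallied. This is a minimal sketch that assumes the log has been saved as plain text and that WSL always prints the code in the `Error code: Wsl/...` form seen in the excerpts above; `tally_wsl_errors` is a hypothetical helper, not something in the Podman tree. Runs that only time out produce no such line and will not be counted.

```python
import re
from collections import Counter

# Matches lines such as
#   Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
ERROR_CODE = re.compile(r"^\s*Error code:\s*(Wsl/\S+)", re.MULTILINE)


def tally_wsl_errors(log_text: str) -> Counter:
    """Count how often each WSL error code appears in a CI log."""
    return Counter(ERROR_CODE.findall(log_text))
```

Printing `tally_wsl_errors(text).most_common()` for each task log would show which codes dominate across runs.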

Diagnostic Logs

See the WSL logs above. I also uploaded them here directly because our CI logs are deleted after 90 days.

WslLogs2025-07-24_04-41-02.zip

WslLogs2025-07-08_10-14-52.zip

WslLogs2025-07-07_18-05-50.zip

Note that these capture the full CI runs. We had to tweak the log collection script to make it work in the CI environment, see containers/podman#26568.
Since we enabled these log captures, the failure rate seems to be much lower than before, so I think it is possible that the logging itself changes the timing enough to make the issues less likely.
