WSL flaky and unstable in CI environment #13301

### Windows Version

Microsoft Windows [Version 10.0.20348.3807]

### WSL Version

2.5.9.0

### Are you using WSL 1 or WSL 2?

- [x] WSL 2
- [ ] WSL 1

### Kernel Version

6.6.87.2-1

### Distro Version

Fedora 42, but a custom build from https://github.com/containers/podman-machine-os/

### Other Software

Podman (main branch). We are seeing this in our own Podman upstream CI environment. We run the tests on AWS z1d.metal instances (48 CPUs / 384 GB) with Windows Server 2022.

### Repro Steps

Unclear. No users have reported this to us so far, and we have had no luck reproducing it outside CI.
In CI, a full run of our WSL test suite takes about 20 minutes. We do at least 61 WSL imports with starts/stops, which certainly stresses the system a fair amount more than a normal user would.
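
For anyone who wants to try this outside our CI, the workload can be approximated with a loop like the one below. This is only a rough sketch, not our actual test code, and we have not confirmed that it triggers the failures. The `stress-N` distro names, the `run_cycles` and `cycle_commands` helpers, and the paths passed in are made up for illustration; the `wsl.exe` flags are the standard `--import`, `-d`, `--terminate`, and `--unregister`.

```python
import subprocess
from typing import Callable, List, Optional, Tuple


def cycle_commands(name: str, install_dir: str, tarball: str) -> List[List[str]]:
    """Return the wsl.exe invocations for one import/start/stop/remove cycle."""
    return [
        ["wsl.exe", "--import", name, install_dir, tarball, "--version", "2"],
        ["wsl.exe", "-u", "root", "-d", name, "true"],  # starts the distro
        ["wsl.exe", "--terminate", name],
        ["wsl.exe", "--unregister", name],
    ]


def run_cycles(
    count: int,
    tarball: str,
    base_dir: str,
    run: Callable = subprocess.run,
    timeout: int = 600,  # mirrors the 10 minute command timeout in our CI
) -> List[Tuple[int, List[str], Optional[int]]]:
    """Run `count` cycles and return (cycle, command, exit code) per failure.

    An exit code of None means the command hit the timeout.
    """
    failures = []
    for i in range(count):
        name = f"stress-{i}"  # hypothetical distro name
        for cmd in cycle_commands(name, f"{base_dir}\\{name}", tarball):
            try:
                rc = run(cmd, timeout=timeout).returncode
            except subprocess.TimeoutExpired:
                rc = None
            if rc != 0:
                failures.append((i, cmd, rc))
                # Skip the rest of this cycle; the distro may stay registered.
                break
    return failures
```

Calling something like `run_cycles(61, r"Z:\podman-machine.x86_64.wsl.tar", r"Z:\wsl-stress")` on a Windows host would roughly match the number of imports in one CI run; the second path is a placeholder for any scratch directory.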

### Expected Behavior

WSL should run stably and not fail at random with a variety of different errors.

Our WSL tests have always been flaky, but this has been getting much worse recently. We are tracking the full problem here: https://github.com/containers/podman/issues/26547

As far as I can tell, it has been much worse since we updated from WSL 2.4.13.0 to 2.5.9.0.

### Actual Behavior

Issues seen in our CI logs:


```
  C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar foo1
  Importing operating system into WSL (this may take a few minutes on a new WSL install)...
  The operation completed successfully. 
  Configuring system...
  The operation timed out because a response was not received from the virtual machine or container. 
  Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
  Error: could not create root authorized keys on guest OS: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-foo1 sh -c mkdir -p /root/.ssh;cat >> /root/.ssh/authorized_keys; chmod 600 /root/.ssh/authorized_keys] failed: exit status 0xffffffff
```


```
  C:\Users\Administrator\AppData\Local\cirrus-ci-build\repo\bin\windows\podman.exe machine init --disk-size 11 --image Z:\podman-machine.x86_64.wsl.tar a46a1b20d95c
  Importing operating system into WSL (this may take a few minutes on a new WSL install)...
  Failed to attach disk '\\?\Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c\ext4.vhdx' to WSL2: Error: 0x80041001
  Error code: Wsl/Service/RegisterDistro/MountDisk/HCS/VM_E_INVALID_STATE
  Error: the WSL import of guest OS failed: command C:\Windows\system32\wsl.exe [wsl --import podman-a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\wsldist\a46a1b20d95c Z:\podman_test518083807\.local\share\containers\podman\machine\wsl\a46a1b20d95c-amd64 --version 2] failed: exit status 0xffffffff
```
https://api.cirrus-ci.com/v1/task/5559458543173632
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5559458543173632/wsl_logs/WslLogs2025-07-24_04-41-02.zip

---

```
  The operation timed out because a response was not received from the virtual machine or container. 
  Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
  Error: the WSL bootstrap script failed: command C:\Windows\system32\wsl.exe [wsl -u root -d podman-9c15505e812e /root/bootstrap] failed: exit status 0xffffffff
```
https://cirrus-ci.com/task/6166253468909568
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/6166253468909568/wsl_logs/WslLogs2025-07-08_10-14-52.zip

Afterwards, all following tests time out because every command seems to hang indefinitely (we have a command timeout of 10 minutes).

---

We also often observe plain timeouts without any visible WSL error:

https://cirrus-ci.com/task/5782833148461056
WSL logs: https://api.cirrus-ci.com/v1/artifact/task/5782833148461056/wsl_logs/WslLogs2025-07-07_18-05-50.zip
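
To get an overview of which failures dominate, the `Error code:` lines can be counted across saved CI output. The helper below is a minimal sketch of that idea, assuming the logs contain lines shaped like the ones quoted above; `tally_error_codes` is a made-up name and not part of Podman or WSL.

```python
import re
from collections import Counter

# Matches lines such as:
#   Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT
ERROR_CODE = re.compile(r"^\s*Error code:\s*(\S+)", re.MULTILINE)


def tally_error_codes(log_text: str) -> Counter:
    """Count how often each WSL error code appears in the given log text."""
    return Counter(ERROR_CODE.findall(log_text))
```

Runs that only time out, like the last task linked above, print no `Error code:` line, so they would not show up in such a count and have to be tracked separately.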


### Diagnostic Logs

See the WSL logs above. I have also uploaded them here directly, as our CI logs are deleted after 90 days.

[WslLogs2025-07-24_04-41-02.zip](https://github.com/user-attachments/files/21486455/WslLogs2025-07-24_04-41-02.zip)

[WslLogs2025-07-08_10-14-52.zip](https://github.com/user-attachments/files/21486445/WslLogs2025-07-08_10-14-52.zip)

[WslLogs2025-07-07_18-05-50.zip](https://github.com/user-attachments/files/21486441/WslLogs2025-07-07_18-05-50.zip)

Note that these capture the full CI runs. We had to tweak the log collection script to make it work in the CI environment, see https://github.com/containers/podman/pull/26568
Since we enabled these log captures, the failure rate seems to be much lower than before, so I think it is possible that the logging itself changes the timing enough to make the issues less likely.

