Hello,
About half of the time when I boot a VM with an H100 attached and CC mode enabled, I get the following error:
[ 24.369352] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2254
[ 24.372633] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2254
[ 31.231957] NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
[ 31.231963] NVRM: _kgspLogXid119: Note: Please also check logs above.
[ 31.231967] NVRM: nvAssertFailedNoLog: Assertion failed: expectedFunc == pHistoryEntry->function @ kernel_gsp.c:1993
[ 31.231985] NVRM: GPU at PCI:0000:01:00: GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f
[ 31.231988] NVRM: Xid (PCI:0000:01:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 47 (UNLOADING_GUEST_DRIVER) sequence 423 (0x0 0x0).
[ 31.231994] NVRM: GPU0 GSP RPC buffer contains function 4124 (GSP_LOCKDOWN_NOTICE) sequence 0 and data 0x0000000000000000 0x0000000000000000.
[ 31.231996] NVRM: GPU0 RPC history (CPU -> GSP):
[ 31.231997] NVRM: entry function sequence data0 data1 ts_start ts_end duration actively_polling
[ 31.231998] NVRM: 0 -388792779 Unknown 3579119361 0x0000000000000000 0x0000000000000000 0x000642c4f9273171 0x0000000000000000 y
[ 31.232002] NVRM: -1 -417704358 Unknown 1368943075 0x0000000000000000 0x0000000000000000 0x000642c4f9272caf 0x000642c4f92730a4 1013us
[ 31.232005] NVRM: -2 -1186186342 Unknown 750038700 0x0000000000000000 0x0000000000000000 0x000642c4f92728ae 0x000642c4f9272caa 1020us
[ 31.232007] NVRM: -3 -969190040 Unknown 1188055699 0x0000000000000000 0x0000000000000000 0x000642c4f92724c4 0x000642c4f92728a1 989us
[ 31.232009] NVRM: -4 84031309 Unknown 945494249 0x0000000000000000 0x0000000000000000 0x000642c4f9272075 0x000642c4f92724ba 1093us
[ 31.232011] NVRM: -5 -408017343 Unknown 1321406506 0x0000000000000000 0x0000000000000000 0x000642c4f9271c42 0x000642c4f927203b 1017us
[ 31.232013] NVRM: -6 1652902690 Unknown 3700590743 0x0000000000000000 0x0000000000000000 0x000642c4f92716e0 0x000642c4f9271ba2 1218us
[ 31.232015] NVRM: -7 464360894 Unknown 396536444 0x0000000000000000 0x0000000000000000 0x000642c4f92710b9 0x000642c4f927158c 1235us
[ 31.232017] NVRM: GPU0 RPC event history (CPU <- GSP):
[ 31.232019] NVRM: entry function sequence data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 31.232020] NVRM: 0 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000000 0x0000000000000000 0x000642c4f927b844 0x000642c4f927b844 y
[ 31.232022] NVRM: -1 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000001 0x0000000000000000 0x000642c4f927b208 0x000642c4f927b208 y
[ 31.232024] NVRM: -2 4108 UCODE_LIBOS_PRINT 0 0x0000000000000000 0x0000000000000000 0x000642c4f92795d8 0x000642c4f92795da 2us y
[ 31.232027] NVRM: -3 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000000 0x0000000000000000 0x000642c4f91d4b8e 0x000642c4f91d4b8e
[ 31.232029] NVRM: -4 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000001 0x0000000000000000 0x000642c4f91d47d2 0x000642c4f91d47d2
[ 31.232030] NVRM: -5 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000000 0x0000000000000000 0x000642c4f91d3502 0x000642c4f91d3502
[ 31.232032] NVRM: -6 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000001 0x0000000000000000 0x000642c4f91d3208 0x000642c4f91d3208
[ 31.232034] NVRM: -7 4124 GSP_LOCKDOWN_NOTICE 0 0x0000000000000000 0x0000000000000000 0x000642c4f91d2916 0x000642c4f91d2916
[ 31.232039] CPU: 13 UID: 0 PID: 626 Comm: nvidia-powerd Tainted: G OE 6.14.0-35-generic #35-Ubuntu
[ 31.232043] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 31.232045] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 31.232047] Call Trace:
[ 31.232049] <TASK>
[ 31.232055] show_stack+0x49/0x60
[ 31.232062] dump_stack_lvl+0x5f/0x90
[ 31.232066] dump_stack+0x10/0x18
[ 31.232101] os_dump_stack+0xe/0x20 [nvidia]
[ 31.232618] _kgspRpcRecvPoll+0x54b/0x750 [nvidia]
[ 31.232766] _issueRpcAndWait+0x7f/0x390 [nvidia]
[ 31.232875] kgspUnloadRm_IMPL+0x78/0x190 [nvidia]
[ 31.232937] gpuStateDestroy_IMPL+0x1a4/0x1b0 [nvidia]
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] RmShutdownAdapter+0x258/0x330 [nvidia]
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] rm_shutdown_adapter+0x58/0x60 [nvidia]
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] nv_shutdown_adapter+0xa9/0x1d0 [nvidia]
[ 31.232937] nv_close_device+0x132/0x180 [nvidia]
[ 31.232937] nvidia_close_callback+0x99/0x1a0 [nvidia]
[ 31.232937] nvidia_close+0x26b/0x290 [nvidia]
[ 31.232937] __fput+0xed/0x2d0
[ 31.232937] ____fput+0x15/0x20
[ 31.232937] task_work_run+0x60/0xa0
[ 31.232937] do_exit+0x276/0x4d0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? __audit_syscall_entry+0xca/0x170
[ 31.232937] do_group_exit+0x34/0x90
[ 31.232937] __x64_sys_exit_group+0x18/0x20
[ 31.232937] x64_sys_call+0x141e/0x2310
[ 31.232937] do_syscall_64+0x7e/0x170
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? do_read_fault+0xfd/0x200
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? do_fault+0x151/0x220
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? handle_pte_fault+0x157/0x200
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? __handle_mm_fault+0x3d2/0x7a0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? __count_memcg_events+0xd3/0x1a0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? count_memcg_events.constprop.0+0x2a/0x50
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? handle_mm_fault+0x1bb/0x2d0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? do_user_addr_fault+0x5e9/0x7e0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? arch_exit_to_user_mode_prepare.isra.0+0x22/0x120
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? irqentry_exit_to_user_mode+0x2d/0x1d0
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? irqentry_exit+0x43/0x50
[ 31.232937] ? srso_return_thunk+0x5/0x5f
[ 31.232937] ? exc_page_fault+0x96/0x1e0
[ 31.232937] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 31.232937] RIP: 0033:0x725d12334f8d
[ 31.232937] Code: Unable to access opcode bytes at 0x725d12334f63.
[ 31.232937] RSP: 002b:00007ffe0065a6a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[ 31.232937] RAX: ffffffffffffffda RBX: 0000725d12450fe8 RCX: 0000725d12334f8d
[ 31.232937] RDX: 00000000000000e7 RSI: ffffffffffffff70 RDI: 0000000000000000
[ 31.232937] RBP: 00007ffe0065a700 R08: 00007ffe0065a638 R09: 0000000000000000
[ 31.232937] R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000000005
[ 31.232937] R13: 0000000000000000 R14: 0000725d1244f680 R15: 0000725d12451000
[ 31.232937] </TASK>
[ 31.233870] NVRM: _kgspLogXid119: ********************************************************************************
[ 31.233873] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 47 sequence 423!
[ 31.233878] NVRM: kgspCheckGspRmCcCleanup_GH100: CC secret cleanup successful!
[ 31.233957] NVRM: nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:2174
[ 31.604682] fbcon: Taking over console
[ 33.171994] kauditd_printk_skb: 666 callbacks suppressed
[ 33.171999] audit: type=1131 audit(1762263629.321:373): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=nvidia-powerd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 33.240524] NVRM: gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 33.242205] NVRM: nvAssertFailedNoLog: Assertion failed: pVGpu != NULL @ objvgpu.c:148
[ 33.243729] NVRM: osInitNvMapping: *** Cannot attach gpu
[ 33.244777] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 33.246286] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:744)
[ 33.248919] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 33.305931] NVRM: gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[ 33.307419] NVRM: nvAssertFailedNoLog: Assertion failed: pVGpu != NULL @ objvgpu.c:148
[ 33.308794] NVRM: osInitNvMapping: *** Cannot attach gpu
[ 33.309749] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 33.311183] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:744)
[ 33.313837] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
The guest VM is running Ubuntu plucky with the following configuration:
- Driver: NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.195.03 Release Build (dvs-builder@U22-I3-H04-03-1) Sat Sep 20 00:39:49 UTC 2025
- Kernel: Linux version 6.14.0-35-generic (buildd@lcy02-amd64-078) (x86_64-linux-gnu-gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0, GNU ld (GNU Binutils for Ubuntu) 2.44) #35-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 11 10:06:31 UTC 2025 (Ubuntu 6.14.0-35.35-generic 6.14.11)
The kernel command line contains just console=ttyS0 pci=realloc,nocrs, as documented in the deployment guide.
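For anyone reproducing on a regular (non-live) Ubuntu guest, the equivalent GRUB configuration would look roughly like this (a sketch, not my exact setup, since I boot the OS in live mode):

# /etc/default/grub (relevant line only)
GRUB_CMDLINE_LINUX="console=ttyS0 pci=realloc,nocrs"

# regenerate the GRUB config, then reboot
sudo update-grub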
I'm running Ubuntu plucky because it ships with a more recent kernel that supports the functionality needed by COCONUT-SVSM.
Output of dmesg | grep -i sev on the guest:
[ 2.319769] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
[ 2.320704] SEV: Status: SEV SEV-ES SEV-SNP
[ 2.539717] SEV: APIC: wakeup_secondary_cpu() replaced with wakeup_cpu_via_vmgexit()
[ 3.677778] SEV: Using SNP CPUID table, 28 entries present.
[ 3.678703] SEV: SNP running at VMPL2.
[ 4.034162] SEV: SNP guest platform device initialized.
[ 6.156077] sev-guest sev-guest: Initialized SEV guest driver (using VMPCK2 communication key)
On the guest I've also removed the nouveau kernel module:
# modinfo nouveau
modinfo: ERROR: Module nouveau not found.
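I removed the module outright; as a sketch, the more common way to achieve the same effect is to blacklist it and rebuild the initramfs:

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# rebuild the initramfs so the blacklist takes effect at boot
sudo update-initramfs -u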
When the GPU is detected, I'm able to verify the GPU attestation on the guest and everything works fine:
Generating nonce in the local GPU Verifier ..
Number of GPUs available : 1
Fetching GPU 0 information from GPU driver.
All GPU Evidences fetched successfully
-----------------------------------
Verifying GPU: GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f
Driver version fetched : 570.195.03
VBIOS version fetched : 96.00.9f.00.01
Validating GPU certificate chains.
The firmware ID in the device certificate chain is matching with the one in the attestation report.
GPU attestation report certificate chain validation successful.
The certificate chain revocation status verification successful.
Authenticating attestation report
The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
Driver version fetched from the attestation report : 570.195.03
VBIOS version fetched from the attestation report : 96.00.9f.00.01
Attestation report signature verification successful.
Attestation report verification successful.
Authenticating the RIMs.
Authenticating Driver RIM
Fetching the driver RIM from the RIM service.
RIM Schema validation passed.
driver RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
driver RIM signature verification successful.
Driver RIM verification successful
Authenticating VBIOS RIM.
Fetching the VBIOS RIM from the RIM service.
RIM Schema validation passed.
vbios RIM certificate chain verification successful.
The certificate chain revocation status verification successful.
vbios RIM signature verification successful.
VBIOS RIM verification successful
Comparing measurements (runtime vs golden)
The runtime measurements are matching with the golden measurements.
GPU is in expected state.
GPU 0 with UUID GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f verified successfully.
Setting the GPU Ready State to READY
GPU Attestation is Successful.
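For reference, the output above comes from NVIDIA's local GPU verifier. Assuming the standard layout of the nvtrust repository, the invocation is roughly:

git clone https://github.com/NVIDIA/nvtrust
cd nvtrust/guest_tools/gpu_verifiers/local_gpu_verifier
pip3 install .
python3 -m verifier.cc_admin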
Output of nvidia-smi conf-compute -q:
==============NVSMI CONF-COMPUTE LOG==============
CC State : ON
Multi-GPU Mode : None
CPU CC Capabilities : AMD SEV-SNP
GPU CC Capabilities : CC Capable
CC GPUs Ready State : Ready
The host is configured following the instructions at https://coconut-svsm.github.io/svsm/installation/INSTALL/.
To summarize: I'm seeing intermittent GPU detection failures in my VM, which runs the OS in live mode, so every reboot starts from a clean OS image. Even so, I frequently need multiple reboots before the GPU is recognized. Apart from the stack trace reported above, the dmesg output appears identical in successful and failed boots.
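A quick check like the following is enough to classify a boot (a minimal sketch; the grep patterns just match the failure messages quoted above):

# the grep only matches on a failed boot
if sudo dmesg | grep -qE 'Xid \(PCI:.*\): 119|RmInitAdapter failed'; then
    echo "GPU init failed (Xid 119 / RmInitAdapter)"
else
    echo "GPU initialized"
fi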