
Driver won't load - AMD SEV-SNP #123

@ivanvalentini-h

Hello,
About half of the time, when I boot a VM with an H100 attached and CC mode enabled, the driver fails to load with the following error:

[   24.369352] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2254
[   24.372633] NVRM: nvAssertFailed: Assertion failed: 0 @ g_kern_gmmu_nvoc.h:2254
[   31.231957] NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
[   31.231963] NVRM: _kgspLogXid119: Note: Please also check logs above.
[   31.231967] NVRM: nvAssertFailedNoLog: Assertion failed: expectedFunc == pHistoryEntry->function @ kernel_gsp.c:1993
[   31.231985] NVRM: GPU at PCI:0000:01:00: GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f
[   31.231988] NVRM: Xid (PCI:0000:01:00): 119, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 47 (UNLOADING_GUEST_DRIVER) sequence 423 (0x0 0x0).
[   31.231994] NVRM: GPU0 GSP RPC buffer contains function 4124 (GSP_LOCKDOWN_NOTICE) sequence 0 and data 0x0000000000000000 0x0000000000000000.
[   31.231996] NVRM: GPU0 RPC history (CPU -> GSP):
[   31.231997] NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration actively_polling
[   31.231998] NVRM:      0    -388792779 Unknown               3579119361 0x0000000000000000 0x0000000000000000 0x000642c4f9273171 0x0000000000000000          y
[   31.232002] NVRM:     -1    -417704358 Unknown               1368943075 0x0000000000000000 0x0000000000000000 0x000642c4f9272caf 0x000642c4f92730a4   1013us  
[   31.232005] NVRM:     -2    -1186186342 Unknown                750038700 0x0000000000000000 0x0000000000000000 0x000642c4f92728ae 0x000642c4f9272caa   1020us  
[   31.232007] NVRM:     -3    -969190040 Unknown               1188055699 0x0000000000000000 0x0000000000000000 0x000642c4f92724c4 0x000642c4f92728a1    989us  
[   31.232009] NVRM:     -4    84031309 Unknown                945494249 0x0000000000000000 0x0000000000000000 0x000642c4f9272075 0x000642c4f92724ba   1093us  
[   31.232011] NVRM:     -5    -408017343 Unknown               1321406506 0x0000000000000000 0x0000000000000000 0x000642c4f9271c42 0x000642c4f927203b   1017us  
[   31.232013] NVRM:     -6    1652902690 Unknown               3700590743 0x0000000000000000 0x0000000000000000 0x000642c4f92716e0 0x000642c4f9271ba2   1218us  
[   31.232015] NVRM:     -7    464360894 Unknown                396536444 0x0000000000000000 0x0000000000000000 0x000642c4f92710b9 0x000642c4f927158c   1235us  
[   31.232017] NVRM: GPU0 RPC event history (CPU <- GSP):
[   31.232019] NVRM:     entry function                     sequence data0              data1              ts_start           ts_end             duration during_incomplete_rpc
[   31.232020] NVRM:      0    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000000 0x0000000000000000 0x000642c4f927b844 0x000642c4f927b844          y
[   31.232022] NVRM:     -1    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000001 0x0000000000000000 0x000642c4f927b208 0x000642c4f927b208          y
[   31.232024] NVRM:     -2    4108 UCODE_LIBOS_PRINT              0 0x0000000000000000 0x0000000000000000 0x000642c4f92795d8 0x000642c4f92795da      2us y
[   31.232027] NVRM:     -3    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000000 0x0000000000000000 0x000642c4f91d4b8e 0x000642c4f91d4b8e           
[   31.232029] NVRM:     -4    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000001 0x0000000000000000 0x000642c4f91d47d2 0x000642c4f91d47d2           
[   31.232030] NVRM:     -5    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000000 0x0000000000000000 0x000642c4f91d3502 0x000642c4f91d3502           
[   31.232032] NVRM:     -6    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000001 0x0000000000000000 0x000642c4f91d3208 0x000642c4f91d3208           
[   31.232034] NVRM:     -7    4124 GSP_LOCKDOWN_NOTICE            0 0x0000000000000000 0x0000000000000000 0x000642c4f91d2916 0x000642c4f91d2916           
[   31.232039] CPU: 13 UID: 0 PID: 626 Comm: nvidia-powerd Tainted: G           OE      6.14.0-35-generic #35-Ubuntu
[   31.232043] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   31.232045] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   31.232047] Call Trace:
[   31.232049]  <TASK>
[   31.232055]  show_stack+0x49/0x60
[   31.232062]  dump_stack_lvl+0x5f/0x90
[   31.232066]  dump_stack+0x10/0x18
[   31.232101]  os_dump_stack+0xe/0x20 [nvidia]
[   31.232618]  _kgspRpcRecvPoll+0x54b/0x750 [nvidia]
[   31.232766]  _issueRpcAndWait+0x7f/0x390 [nvidia]
[   31.232875]  kgspUnloadRm_IMPL+0x78/0x190 [nvidia]
[   31.232937]  gpuStateDestroy_IMPL+0x1a4/0x1b0 [nvidia]
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  RmShutdownAdapter+0x258/0x330 [nvidia]
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  rm_shutdown_adapter+0x58/0x60 [nvidia]
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  nv_shutdown_adapter+0xa9/0x1d0 [nvidia]
[   31.232937]  nv_close_device+0x132/0x180 [nvidia]
[   31.232937]  nvidia_close_callback+0x99/0x1a0 [nvidia]
[   31.232937]  nvidia_close+0x26b/0x290 [nvidia]
[   31.232937]  __fput+0xed/0x2d0
[   31.232937]  ____fput+0x15/0x20
[   31.232937]  task_work_run+0x60/0xa0
[   31.232937]  do_exit+0x276/0x4d0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? __audit_syscall_entry+0xca/0x170
[   31.232937]  do_group_exit+0x34/0x90
[   31.232937]  __x64_sys_exit_group+0x18/0x20
[   31.232937]  x64_sys_call+0x141e/0x2310
[   31.232937]  do_syscall_64+0x7e/0x170
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? do_read_fault+0xfd/0x200
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? do_fault+0x151/0x220
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? handle_pte_fault+0x157/0x200
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? __handle_mm_fault+0x3d2/0x7a0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? __count_memcg_events+0xd3/0x1a0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? count_memcg_events.constprop.0+0x2a/0x50
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? handle_mm_fault+0x1bb/0x2d0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? do_user_addr_fault+0x5e9/0x7e0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? arch_exit_to_user_mode_prepare.isra.0+0x22/0x120
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? irqentry_exit_to_user_mode+0x2d/0x1d0
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? irqentry_exit+0x43/0x50
[   31.232937]  ? srso_return_thunk+0x5/0x5f
[   31.232937]  ? exc_page_fault+0x96/0x1e0
[   31.232937]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   31.232937] RIP: 0033:0x725d12334f8d
[   31.232937] Code: Unable to access opcode bytes at 0x725d12334f63.
[   31.232937] RSP: 002b:00007ffe0065a6a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[   31.232937] RAX: ffffffffffffffda RBX: 0000725d12450fe8 RCX: 0000725d12334f8d
[   31.232937] RDX: 00000000000000e7 RSI: ffffffffffffff70 RDI: 0000000000000000
[   31.232937] RBP: 00007ffe0065a700 R08: 00007ffe0065a638 R09: 0000000000000000
[   31.232937] R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000000005
[   31.232937] R13: 0000000000000000 R14: 0000725d1244f680 R15: 0000725d12451000
[   31.232937]  </TASK>
[   31.233870] NVRM: _kgspLogXid119: ********************************************************************************
[   31.233873] NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 47 sequence 423!
[   31.233878] NVRM: kgspCheckGspRmCcCleanup_GH100: CC secret cleanup successful!
[   31.233957] NVRM: nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:2174
[   31.604682] fbcon: Taking over console
[   33.171994] kauditd_printk_skb: 666 callbacks suppressed
[   33.171999] audit: type=1131 audit(1762263629.321:373): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=nvidia-powerd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   33.240524] NVRM: gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   33.242205] NVRM: nvAssertFailedNoLog: Assertion failed: pVGpu != NULL @ objvgpu.c:148
[   33.243729] NVRM: osInitNvMapping: *** Cannot attach gpu
[   33.244777] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   33.246286] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:744)
[   33.248919] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[   33.305931] NVRM: gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[   33.307419] NVRM: nvAssertFailedNoLog: Assertion failed: pVGpu != NULL @ objvgpu.c:148
[   33.308794] NVRM: osInitNvMapping: *** Cannot attach gpu
[   33.309749] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[   33.311183] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:744)
[   33.313837] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
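Since the failure is intermittent, it can help to classify each boot automatically from its dmesg text. The following is a minimal Python sketch, not part of the driver or of any NVIDIA tool. The function name `classify_boot` is hypothetical, and the two patterns are assumptions taken only from the `Xid` and `RmInitAdapter failed!` lines in the log above; other failure modes may look different.

```python
import re

# Patterns derived from the log lines in this report (assumption: the
# message format is the same on other boots and driver builds).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
INIT_FAIL_RE = re.compile(r"NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed!")


def classify_boot(dmesg_text):
    """Return the distinct Xid codes seen and whether RmInitAdapter failed."""
    xids = sorted({int(m.group(2)) for m in XID_RE.finditer(dmesg_text)})
    init_failed = INIT_FAIL_RE.search(dmesg_text) is not None
    return {"xids": xids, "init_failed": init_failed}


if __name__ == "__main__":
    sample = (
        "[   31.231988] NVRM: Xid (PCI:0000:01:00): 119, Timeout after 6s\n"
        "[   33.246286] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:744)\n"
    )
    print(classify_boot(sample))
```

Running this over the saved dmesg of each boot would give a simple record of how often Xid 119 precedes the init failure.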

The guest VM is running Ubuntu plucky with the following configuration:

  • Driver: NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.195.03 Release Build (dvs-builder@U22-I3-H04-03-1) Sat Sep 20 00:39:49 UTC 2025
  • Kernel: Linux version 6.14.0-35-generic (buildd@lcy02-amd64-078) (x86_64-linux-gnu-gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0, GNU ld (GNU Binutils for Ubuntu) 2.44) #35-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 11 10:06:31 UTC 2025 (Ubuntu 6.14.0-35.35-generic 6.14.11)

The kernel command line contains only console=ttyS0 pci=realloc,nocrs, as documented in the deployment guide.

I'm running Ubuntu plucky because it ships with a more recent kernel version which supports the functionality needed by COCONUT-SVSM.

Output of dmesg | grep -i sev on the guest:

[    2.319769] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
[    2.320704] SEV: Status: SEV SEV-ES SEV-SNP 
[    2.539717] SEV: APIC: wakeup_secondary_cpu() replaced with wakeup_cpu_via_vmgexit()
[    3.677778] SEV: Using SNP CPUID table, 28 entries present.
[    3.678703] SEV: SNP running at VMPL2.
[    4.034162] SEV: SNP guest platform device initialized.
[    6.156077] sev-guest sev-guest: Initialized SEV guest driver (using VMPCK2 communication key)

On the guest I've also removed the nouveau kernel module:

# modinfo nouveau
modinfo: ERROR: Module nouveau not found.

When the GPU is detected, I can verify the GPU attestation on the guest and everything works fine:

Generating nonce in the local GPU Verifier ..
Number of GPUs available : 1
Fetching GPU 0 information from GPU driver.
All GPU Evidences fetched successfully
-----------------------------------
Verifying GPU: GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f
        Driver version fetched : 570.195.03
        VBIOS version fetched : 96.00.9f.00.01
        Validating GPU certificate chains.
                The firmware ID in the device certificate chain is matching with the one in the attestation report.
                GPU attestation report certificate chain validation successful.
                        The certificate chain revocation status verification successful.
        Authenticating attestation report
                The nonce in the SPDM GET MEASUREMENT request message is matching with the generated nonce.
                Driver version fetched from the attestation report : 570.195.03
                VBIOS version fetched from the attestation report : 96.00.9f.00.01
                Attestation report signature verification successful.
                Attestation report verification successful.
        Authenticating the RIMs.
                Authenticating Driver RIM
                        Fetching the driver RIM from the RIM service.
                        RIM Schema validation passed.
                        driver RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        driver RIM signature verification successful.
                        Driver RIM verification successful
                Authenticating VBIOS RIM.
                        Fetching the VBIOS RIM from the RIM service.
                        RIM Schema validation passed.
                        vbios RIM certificate chain verification successful.
                        The certificate chain revocation status verification successful.
                        vbios RIM signature verification successful.
                        VBIOS RIM verification successful
        Comparing measurements (runtime vs golden)
                        The runtime measurements are matching with the golden measurements.                            
                GPU is in expected state.
        GPU 0 with UUID GPU-6a77de48-cf1d-2048-5be0-a63557bf0e6f verified successfully.
        Setting the GPU Ready State to READY
GPU Attestation is Successful.

Output of nvidia-smi conf-compute -q:

==============NVSMI CONF-COMPUTE LOG==============

    CC State                   : ON
    Multi-GPU Mode             : None
    CPU CC Capabilities        : AMD SEV-SNP
    GPU CC Capabilities        : CC Capable
    CC GPUs Ready State        : Ready
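For scripted checks after each boot, the `key : value` report above can be turned into a dictionary. This is a small Python sketch under the assumption that the output keeps the layout shown here; the helper names `parse_conf_compute` and `gpu_cc_ready` are hypothetical, and the text is expected to have been captured from `nvidia-smi conf-compute -q` beforehand.

```python
def parse_conf_compute(text):
    """Parse 'key : value' lines into a dict, skipping banners and blanks."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def gpu_cc_ready(fields):
    """True when CC is on and the GPU ready state is 'Ready' (as in this report)."""
    return (
        fields.get("CC State") == "ON"
        and fields.get("CC GPUs Ready State") == "Ready"
    )


if __name__ == "__main__":
    sample = """==============NVSMI CONF-COMPUTE LOG==============

    CC State                   : ON
    Multi-GPU Mode             : None
    CPU CC Capabilities        : AMD SEV-SNP
    GPU CC Capabilities        : CC Capable
    CC GPUs Ready State        : Ready
"""
    print(gpu_cc_ready(parse_conf_compute(sample)))
```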

The host is configured following the instructions at https://coconut-svsm.github.io/svsm/installation/INSTALL/.

The GPU detection failures are intermittent. The VM runs the OS in live mode, so every reboot starts from a clean OS image, yet I still frequently need multiple reboots before the GPU is recognized. The dmesg output appears identical between successful and failed attempts, except for the stack trace and the NVRM errors shown above.
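One way to double-check that the two kinds of boot really differ only in the NVRM lines is to strip the kernel timestamps and diff the logs. The sketch below uses Python's standard `difflib`; the function names are hypothetical, and it assumes both logs were saved from `dmesg` with the default `[seconds.micros]` prefix. Lines that embed other varying values (audit IDs, PIDs, addresses) will still show up as differences.

```python
import difflib
import re

# Matches the leading "[   31.231957] " timestamp that dmesg prints by default.
TS_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(text):
    """Split a dmesg dump into lines with the timestamp prefix removed."""
    return [TS_RE.sub("", line) for line in text.splitlines()]


def diff_boots(good_text, bad_text):
    """Return (lines only in the good boot, lines only in the bad boot)."""
    a, b = strip_timestamps(good_text), strip_timestamps(bad_text)
    only_good, only_bad = [], []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            only_good.extend(a[i1:i2])
            only_bad.extend(b[j1:j2])
    return only_good, only_bad


if __name__ == "__main__":
    good = "[    2.32] SEV: Status: SEV SEV-ES SEV-SNP\n[   31.60] fbcon: Taking over console\n"
    bad = (
        "[    2.35] SEV: Status: SEV SEV-ES SEV-SNP\n"
        "[   31.23] NVRM: Xid (PCI:0000:01:00): 119, Timeout\n"
        "[   31.61] fbcon: Taking over console\n"
    )
    print(diff_boots(good, bad))
```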
