Notes about debugging a Windows guest hang issue.
Test case
If a QEMU guest needs to use an NVIDIA GPU, then according to https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Video_card_driver_virtualisation_detection a workaround needs to be set up in the domain XML:
<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
After hiding KVM from the guest, the GPU driver works as expected. But we hit an issue where a Windows UEFI guest with KVM hidden hangs after live migration.
After searching on Google, I found several directions for debugging this issue:
- OVMF live migration issue: the OVMF file size changed due to a library upgrade, and a flash length mismatch may cause the guest to hang
- Host CPU feature issue: host CPU features do not match, which may cause the guest to be paused
- QEMU/libvirt issue
- Windows issue: hiding KVM from the guest is not compatible with all Windows guests
So I ran some tests to figure out which component to suspect:
- Check the OVMF version: not changed
- Check host CPU features: not changed
- Check the QEMU/libvirt logs: no virtualization error or other error message
- Remove the hidden tag and retry live migration: the guest does not hang
We can see that KVM hidden seems to be to blame, but in order to use the GPU on a UEFI guest this issue has to be resolved. So tracing what happens during migration is the next step; enable the following logs for debugging:
OVMF debug log: add the following qemu:commandline block to the domain XML (this also requires the QEMU XML namespace on the domain element; see the note below):
6<qemu:commandline>
<qemu:arg value='-debugcon'/>
<qemu:arg value='file:/var/log/libvirt/qemu/debug.log'/>
<qemu:arg value='-global'/>
<qemu:arg value='isa-debugcon.iobase=0x402'/>
</qemu:commandline>

QEMU/libvirt debug: we already have the QEMU logs under /var/log/libvirt/qemu/.

Windows events: check the Windows event log after the guest reboots.
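Note that the qemu:commandline block above is only accepted by libvirt when the domain element declares the QEMU XML namespace. A minimal sketch of that declaration (the domain type and the rest of the definition are placeholders):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  ...
</domain>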
But before we start debugging, environment-related issues should be ruled out. Because we use nested virtualization by default, the following environment checks are required:
- Test on a bare-metal host
- Test with the latest QEMU and libvirt
- Test with the latest edk2
Combining test 1 and test 2, we found that the UEFI guest does not hang after live migration. So we tested the same scenario in the nested environment, and the guest hang did not occur after upgrading libvirt. Going through the diffs between the buggy version and upstream,
I found the following patch:
- if (!loader || !loader->nvram || virFileExists(loader->nvram))
which was submitted to solve an OVMF upgrade issue:
nvram: regenerate nvram mapping file from template when firmware being upgraded
After regenerating the NVRAM mapping, the guest can be migrated successfully. This discovery solves our problem in the short term, but I still want to find the root cause of the guest hang, and this patch is an important hint.
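For context, this is roughly how a UEFI domain references the firmware code and its per-domain variable store in the domain XML; the firmware paths below are distro-specific and only meant as an illustration:

<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <!-- read-only firmware executable (pflash) -->
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
  <!-- per-domain varstore, instantiated from the VARS template -->
  <nvram template='/usr/share/OVMF/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
</os>

The patch above makes libvirt regenerate the per-domain nvram file from the template when it detects that the firmware has been upgraded.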
How an OVMF guest performs live migration
How is live migration performed on an OVMF guest? I searched edk2.groups.io for the answer and found https://edk2.groups.io/g/devel/topic/71141681#55046, a topic that discusses live migration issues for OVMF guests and is quite helpful.
First of all, the topic owner could not perform live migration because OVMF.fd changed its size from 2MB to 4MB; QEMU checks the flash size and raises a length mismatch error like the following (I got a similar error in my test environment):
qemu-kvm: Length mismatch: system.flash1: 0x84000 in != 0x20000: Invalid argument
The flash size was extended in https://github.com/tianocore/edk2/commit/b24fca05751f for a Windows HCK requirement, and the commit declares it an incompatible change. So the solutions may be:
- Stick with the same version of the ROM between VMs you want to migrate
- Pad your ROM images to some larger size (e.g. 8MB) so that even if they grow a little bigger you don’t hit the problem.
Thinking about what live migration does: all of the guest’s memory is migrated to the target host, so the in-memory content of the firmware is copied over and whatever firmware is loaded on the target host gets overwritten in memory. So if we want to avoid this issue, keeping the firmware at the same edk2 version on both hosts is a good solution.
For a legacy guest, the BIOS uses fixed magic address ranges, but UEFI uses dynamically allocated memory, so there are no fixed addresses. When the firmware flash image size changes, the content layout changes as well, and compatibility cannot be maintained.
But during live migration the memory is not changed, so the NVRAM should not change either. I quote the answer about how OVMF works with live migration:
With live migration, the running guest doesn’t notice anything. This is a general requirement for live migration (regardless of UEFI or flash).

You are very correct to ask about “skipping” the NVRAM region. With the approach that OvmfPkg originally supported, live migration would simply be unfeasible. The “build” utility would produce a single (unified) OVMF.fd file, which would contain both NVRAM and executable regions, and the guest’s variable updates would modify the one file that would exist. This is inappropriate even without considering live migration, because OVMF binary upgrades (package updates) on the virtualization host would force guests to lose their private variable stores (NVRAMs).

Therefore, the “build” utility produces “split” files too, in addition to the unified OVMF.fd file. Namely, OVMF_CODE.fd and OVMF_VARS.fd. OVMF.fd is simply the concatenation of the latter two.

$ cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd
[prints nothing]

When you define a new domain (VM) on a virtualization host, the domain definition saves a reference (pathname) to the OVMF_CODE.fd file. However, the OVMF_VARS.fd file (the variable store template) is not directly referenced; instead, it is copied into a separate (private) file for the domain.

Furthermore, once booted, guest has two flash chips, one that maps the firmware executable OVMF_CODE.fd read-only, and another pflash chip that maps its private varstore file read-write.

This makes it possible to upgrade OVMF_CODE.fd and OVMF_VARS.fd (via package upgrades on the virt host) without messing with varstores that were earlier instantiated from OVMF_VARS.fd. What’s important here is that the various constants in the new (upgraded) OVMF_CODE.fd file remain compatible with the old OVMF_VARS.fd structure, across package upgrades.

If that’s not possible for introducing e.g. a new feature, then the package upgrade must not overwrite the OVMF_CODE.fd file in place, but must provide an additional firmware binary. This firmware binary can then only be used by freshly defined domains (old domains cannot be switched over). Old domains can be switched over manually – and only if the sysadmin decides it is OK to lose the current variable store contents. Then the old varstore file for the domain is deleted (manually), the domain definition is updated, and then a new (logically empty, pristine) varstore can be created from the new OVMF_2_VARS.fd that matches the new OVMF_2_CODE.fd.

During live migration, the “RAM-like” contents of both pflash chips are migrated (the guest-side view of both chips remains the same, including the case when the writeable chip happens to be in “programming mode”, i.e., during a UEFI variable write through the Fault Tolerant Write and Firmware Volume Block(2) protocols).

Once live migration completes, QEMU dumps the full contents of the writeable chip to the backing file (on the destination host). Going forward, flash writes from within the guest are reflected to said host-side file on-line, just like it happened on the source host before live migration. If the file backing the r/w pflash chip is on NFS (shared by both src and dst hosts), then this one-time dumping when the migration completes is superfluous, but it’s also harmless.

The interesting question is, what happens when you power down the VM on the destination host (= post migration), and launch it again there, from zero. In that case, the firmware executable file comes from the destination host (it was never persistently migrated from the source host, i.e. never written out on the dst). It simply comes from the OVMF package that had been installed on the destination host, by the sysadmin. However, the varstore pflash does reflect the permanent result of the previous migration. So this is where things can fall apart, if both firmware binaries (on the src host and on the dst host) don’t agree about the internal structure of the varstore pflash.
From this long reply, we can take away these points:
- Live migration should not be noticed by the guest
- edk2 separates the read-only executable code from the varstore to support firmware upgrades
- The new OVMF_CODE.fd must stay compatible with the original varstore layout
- QEMU keeps the varstore in the domain's nvram file, whose layout is not changed by the migration
- For new features, if OVMF_CODE.fd cannot stay compatible, a separate firmware binary (e.g. OVMF_2_CODE.fd) is used instead
- Once live migration completes, QEMU dumps the full contents of the writeable chip to the destination host
So for a QEMU guest, live migration just migrates memory to the destination host, and as long as we keep the same varstore and code as the source host it should be supported. Also, because the pflash content right after live migration actually lives in memory (it was migrated as RAM-like content), keeping the varstore unchanged keeps things compatible (no side effects at runtime).
Another fact: when we turn KVM hidden off, the migration works well and no errors appear in edk2's log during guest runtime.
Enable KVM trace
Because there is no obvious log showing any error from QEMU or libvirt, and it seems the guest hangs while QEMU and libvirt keep working, I decided to enable KVM tracing to get more clues, following https://www.reddit.com/r/VFIO/comments/80p1q7/high_kvmqemu_cpu_utilization_when_windows_10/ (a Windows performance topic):
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
and we can read the trace with:
cat /sys/kernel/debug/tracing/trace_pipe
which prints logs like the following:
<...>-41061 [003] .... 167992.130071: kvm_exit: reason MSR_READ rip 0xfffff8008f9d0c0d info 0 0
The vCPUs seem to loop on kvm_entry/kvm_exit forever doing MSR_READ. Combined with top:
top - 12:54:08 up 1 day, 22:42, 1 user, load average: 5.69, 5.82, 6.10
the guest's sys CPU usage is quite high. Use perf to get more details about this process:
perf kvm --host top -p `pidof qemu-kvm`
I see that:
Samples: 99K of event 'cycles', Event count (approx.): 10268746210
vmx_vcpu_run is high (on an Intel CPU); this is where the CPU switches into guest mode, so switching between guest mode and kernel mode is taking a lot of time. KVM tracing also shows many VM entries/exits, so the next step is to check why the VM exits happen (since the vCPU mode switches now take too much time).
Here is a small piece of the output:
[root@172-24-195-187 ~]# perf stat -e 'kvm:*' -a -- sleep 1
kvm_msr is the main reason for the VM exits, which matches the KVM tracing.
By collecting KVM events live:
perf kvm --host stat live
we can see that MSR_READ and EXTERNAL_INTERRUPT take up almost all of the time.
13:02:10.174121
From the KVM tracing:
<...>-41064 [000] .... 168733.114930: kvm_msr: msr_read 40000020 = 0xfb3c08a54
we can find 0x40000020 in the Linux kernel code:
/* MSR used to read the per-partition time reference counter */
#define HV_X64_MSR_TIME_REF_COUNT		0x40000020
It looks like a Hyper-V clocksource related issue, so I simply removed the Hyper-V clock setting from the libvirt XML, and the migration issue disappeared after that.
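For reference, the setting in question is libvirt's hypervclock timer, which is what turns on QEMU's hv-time enlightenment. A minimal sketch of how it can look in the domain XML (the clock offset and any other timers are placeholders; your configuration may differ):

<clock offset='localtime'>
  <!-- removing this timer (or setting present='no') disables hv-time -->
  <timer name='hypervclock' present='yes'/>
</clock>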
Try to find root cause
We can work around the issue by removing the clocksource from the VM configuration, but we still do not know the root cause; all we have is a VM exit on a read of the Hyper-V clocksource. So let's trace the kernel code for more details.
vmx.h defines the VM exit reasons; the one we keep hitting is:

#define EXIT_REASON_MSR_READ            31
and vmx.c registers the exit handlers in the kvm_vmx_exit_handlers array, where EXIT_REASON_MSR_READ is mapped to handle_rdmsr. So move on to handle_rdmsr:
static int handle_rdmsr(struct kvm_vcpu *vcpu)
Because we see the kvm_msr trace from KVM, the trace point in handle_rdmsr (trace_kvm_msr_read, or trace_kvm_msr_read_ex on failure) is executed and the handler returns 1, which means the VM can resume.
If you look back at the kvm process trace:
perf kvm --host top -p `pidof qemu-kvm`
we can see that inside vmx_vcpu_run the vmresume path is taken; according to the code, the handler finishes (after kvm_inject_gp(vcpu, 0) in the failure case) and the VM enters the guest again.
vmx_vcpu_run /proc/kcore
So that means the guest keeps exiting because of its own MSR_READ requests.
Checking the related kernel code, the call chain is the following:
vmx_get_msr -> kvm_get_msr_common -> kvm_hv_get_msr_common
Looking into kvm_hv_get_msr_common:
int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
because the MSR matches the partition-wide check:
static bool kvm_hv_msr_partition_wide(u32 msr)
the read falls into kvm_hv_get_msr_pw:
case HV_X64_MSR_TIME_REF_COUNT:
and finally returns 1 or reports:
vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n",
Coming to this point, it seems to be a guest issue, and according to https://msrc-blog.microsoft.com/2018/12/10/first-steps-in-hyper-v-research/ we can get some information about EXIT_REASON_MSR_READ:
Hyper-V handles MSR access (both read and write) in its VMEXIT loop handler. It’s easy to see it in IDA: it’s a large switch case over all the MSRs supported values, with the default case of falling back to rdmsr/wrmsr, if that MSR doesn’t have special treatment by the hypervisor. Note that there are authentication checks in the MSR read/write handlers, checking the current partition permissions. From there we can find the different MSRs Hyper-V supports, and the functions to handle read and write.
So it seems to be a Hyper-V feature to access the MSR time reference count.
Checking the QEMU documentation on ‘Hyper-V Enlightenments’, it explains the usage of:
hv-time
which is used to speed up all timestamp-related operations; but in our case the guest ends up not responding.
At this point I also noticed that when the guest is migrated, some lines appear in qemu.log:
2022-11-16T09:53:19.529516Z qemu-kvm: warning: TSC frequency mismatch between VM (2095020 kHz) and host (2095087 kHz), and TSC scaling unavailable
qemu-kvm warns that the TSC frequency mismatches, which normally does not happen for a guest.
Checking the QEMU code, we can see:
static int kvm_arch_set_tsc_khz(CPUState *cs)
QEMU tries KVM_SET_TSC_KHZ, and when it fails it prints those lines.
From the hypervisor functional specification:
The TscScale value is used to adjust the Virtual TSC value across migration events to mitigate TSC frequency changes from one platform to another.
It is used to mitigate TSC frequency changes for the guest across migration.
So looking at the QEMU code:
if (level == KVM_PUT_FULL_STATE) {
during live migration this warning is printed, but it does not abort the migration.
Indeed, newer NVIDIA drivers no longer require KVM hidden (https://www.heiko-sieger.info/passing-through-a-nvidia-rtx-2070-super-gpu/), so the concern about this use case may no longer be necessary.
I wrote a mail to the community to discuss whether there is a better way to solve the problem:
https://lists.nongnu.org/archive/html/qemu-discuss/2022-11/msg00028.html