Notes about debugging a Windows guest hang issue.
Test case
If a QEMU guest needs to use an NVIDIA GPU, then according to https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Video_card_driver_virtualisation_detection a workaround needs to be set up in the domain XML:
<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
After hiding KVM from the guest, the GPU driver works as expected. But we hit an issue where a Windows UEFI guest with KVM hidden hangs after live migration.
After searching on Google, I found several directions for debugging this issue:
- OVMF live migration issue: the OVMF file size changed due to a library upgrade, and a flash length mismatch may cause the guest to hang
- Host CPU feature issue: host CPU features do not match, which may cause the guest to be paused
- QEMU/libvirt issue
- Windows issue: hiding KVM from the guest is not compatible with all Windows guests
So I ran some tests to figure out which component to suspect:
- Check the OVMF version: not changed
- Check host CPU features: not changed
- Check the QEMU/libvirt logs: no virtualization error or other error message
- Remove the hidden tag and retry live migration: the guest does not hang
We can see that KVM hidden seems to be to blame, but in order to use the GPU on a UEFI guest this issue has to be resolved. So tracing what happens during migration is the next step; enable the following logs for debugging:
OVMF debug log: add the following qemu:commandline block to the domain XML (this also requires the QEMU XML namespace on the domain element; see the note below):
6<qemu:commandline>
<qemu:arg value='-debugcon'/>
<qemu:arg value='file:/var/log/libvirt/qemu/debug.log'/>
<qemu:arg value='-global'/>
<qemu:arg value='isa-debugcon.iobase=0x402'/>
</qemu:commandline>

QEMU/libvirt debug: we already have the QEMU logs under /var/log/libvirt/qemu/.

Windows events: check the Windows event log after the guest reboots.
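Note that the qemu:commandline block above is only accepted by libvirt when the domain element declares the QEMU XML namespace. A minimal sketch of that declaration (the domain type and the rest of the definition are placeholders):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  ...
</domain>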
But before we start debugging, environment-related issues should be ruled out. Because we use nested virtualization by default, the following environment checks are required:
- Test on a bare-metal host
- Test with the latest QEMU and libvirt
- Test with the latest edk2
Combining test 1 and test 2, we found that the UEFI guest does not hang after live migration. So we tested the same scenario in the nested environment, and the guest hang did not occur after upgrading libvirt. Going through the diffs between the buggy version and upstream,
I found the following patch:
- if (!loader || !loader->nvram || virFileExists(loader->nvram))
which was submitted to solve an OVMF upgrade issue:
nvram: regenerate nvram mapping file from template when firmware being upgraded
After regenerating the NVRAM mapping, the guest can be migrated successfully. This discovery solves our problem in the short term, but I still want to find the root cause of the guest hang, and this patch is an important hint.
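For context, this is roughly how a UEFI domain references the firmware code and its per-domain variable store in the domain XML; the firmware paths below are distro-specific and only meant as an illustration:

<os>
  <type arch='x86_64' machine='q35'>hvm</type>
  <!-- read-only firmware executable (pflash) -->
  <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
  <!-- per-domain varstore, instantiated from the VARS template -->
  <nvram template='/usr/share/OVMF/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/guest_VARS.fd</nvram>
</os>

The patch above makes libvirt regenerate the per-domain nvram file from the template when it detects that the firmware has been upgraded.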
How an OVMF guest performs live migration
How is live migration performed on an OVMF guest? I searched edk2.groups.io for the answer and found https://edk2.groups.io/g/devel/topic/71141681#55046, a topic that discusses live migration issues for OVMF guests and is quite helpful.
First of all, the topic owner could not perform live migration because OVMF.fd changed its size from 2MB to 4MB; QEMU checks the flash size and raises a length mismatch error like the following (I got a similar error in my test environment):
qemu-kvm: Length mismatch: system.flash1: 0x84000 in != 0x20000: Invalid argument
The flash size was extended in https://github.com/tianocore/edk2/commit/b24fca05751f for a Windows HCK requirement, and the commit declares it an incompatible change. So the solutions may be:
- Stick with the same version of the ROM between VMs you want to migrate
- Pad your ROM images to some larger size (e.g. 8MB) so that even if they grow a little bigger you don’t hit the problem.
Thinking about what live migration does: all of the guest’s memory is migrated to the target host, so the in-memory content of the firmware is copied over and whatever firmware is loaded on the target host gets overwritten in memory. So if we want to avoid this issue, keeping the firmware at the same edk2 version on both hosts is a good solution.
For a legacy guest, the BIOS uses fixed magic address ranges, but UEFI uses dynamically allocated memory, so there are no fixed addresses. When the firmware flash image size changes, the content layout changes as well, and compatibility cannot be maintained.
But during live migration the memory is not changed, so the NVRAM should not change either. I quote the answer about how OVMF works with live migration:
With live migration, the running guest doesn’t notice anything. This is a general requirement for live migration (regardless of UEFI or flash).

You are very correct to ask about “skipping” the NVRAM region. With the approach that OvmfPkg originally supported, live migration would simply be unfeasible. The “build” utility would produce a single (unified) OVMF.fd file, which would contain both NVRAM and executable regions, and the guest’s variable updates would modify the one file that would exist. This is inappropriate even without considering live migration, because OVMF binary upgrades (package updates) on the virtualization host would force guests to lose their private variable stores (NVRAMs).

Therefore, the “build” utility produces “split” files too, in addition to the unified OVMF.fd file. Namely, OVMF_CODE.fd and OVMF_VARS.fd. OVMF.fd is simply the concatenation of the latter two.

$ cat OVMF_VARS.fd OVMF_CODE.fd | cmp - OVMF.fd
[prints nothing]

When you define a new domain (VM) on a virtualization host, the domain definition saves a reference (pathname) to the OVMF_CODE.fd file. However, the OVMF_VARS.fd file (the variable store template) is not directly referenced; instead, it is copied into a separate (private) file for the domain.

Furthermore, once booted, guest has two flash chips, one that maps the firmware executable OVMF_CODE.fd read-only, and another pflash chip that maps its private varstore file read-write.

This makes it possible to upgrade OVMF_CODE.fd and OVMF_VARS.fd (via package upgrades on the virt host) without messing with varstores that were earlier instantiated from OVMF_VARS.fd. What’s important here is that the various constants in the new (upgraded) OVMF_CODE.fd file remain compatible with the old OVMF_VARS.fd structure, across package upgrades.

If that’s not possible for introducing e.g. a new feature, then the package upgrade must not overwrite the OVMF_CODE.fd file in place, but must provide an additional firmware binary. This firmware binary can then only be used by freshly defined domains (old domains cannot be switched over). Old domains can be switched over manually – and only if the sysadmin decides it is OK to lose the current variable store contents. Then the old varstore file for the domain is deleted (manually), the domain definition is updated, and then a new (logically empty, pristine) varstore can be created from the new OVMF_2_VARS.fd that matches the new OVMF_2_CODE.fd.

During live migration, the “RAM-like” contents of both pflash chips are migrated (the guest-side view of both chips remains the same, including the case when the writeable chip happens to be in “programming mode”, i.e., during a UEFI variable write through the Fault Tolerant Write and Firmware Volume Block(2) protocols).

Once live migration completes, QEMU dumps the full contents of the writeable chip to the backing file (on the destination host). Going forward, flash writes from within the guest are reflected to said host-side file on-line, just like it happened on the source host before live migration. If the file backing the r/w pflash chip is on NFS (shared by both src and dst hosts), then this one-time dumping when the migration completes is superfluous, but it’s also harmless.

The interesting question is, what happens when you power down the VM on the destination host (= post migration), and launch it again there, from zero. In that case, the firmware executable file comes from the destination host (it was never persistently migrated from the source host, i.e. never written out on the dst). It simply comes from the OVMF package that had been installed on the destination host, by the sysadmin. However, the varstore pflash does reflect the permanent result of the previous migration. So this is where things can fall apart, if both firmware binaries (on the src host and on the dst host) don’t agree about the internal structure of the varstore pflash.
From this long reply, we can take away these points:
- Live migration should not be noticed by the guest
- edk2 separates the read-only executable code from the varstore to support firmware upgrades
- The new OVMF_CODE.fd must stay compatible with the original varstore layout
- QEMU keeps the varstore in the domain's nvram file, whose layout is not changed by the migration
- For new features, if OVMF_CODE.fd cannot stay compatible, a separate firmware binary (e.g. OVMF_2_CODE.fd) is used instead
- Once live migration completes, QEMU dumps the full contents of the writeable chip to the destination host
So for a QEMU guest, live migration just migrates memory to the destination host, and as long as we keep the same varstore and code as the source host it should be supported. Also, because the pflash content right after live migration actually lives in memory (it was migrated as RAM-like content), keeping the varstore unchanged keeps things compatible (no side effects at runtime).
Another fact: when we turn KVM hidden off, the migration works well and no errors appear in edk2's log during guest runtime.
Enable KVM trace
Because there is no obvious log showing any error from QEMU or libvirt, and it seems the guest hangs while QEMU and libvirt keep working, I decided to enable KVM tracing to get more clues, following https://www.reddit.com/r/VFIO/comments/80p1q7/high_kvmqemu_cpu_utilization_when_windows_10/ (a Windows performance topic):
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
and we can read the trace with:
cat /sys/kernel/debug/tracing/trace_pipe
which prints logs like the following:
<...>-41061 [003] .... 167992.130071: kvm_exit: reason MSR_READ rip 0xfffff8008f9d0c0d info 0 0
The vCPUs seem to loop on kvm_entry/kvm_exit forever doing MSR_READ. Combined with top:
top - 12:54:08 up 1 day, 22:42, 1 user, load average: 5.69, 5.82, 6.10
the guest's sys CPU usage is quite high. Use perf to get more details about this process:
perf kvm --host top -p `pidof qemu-kvm`
I see that:
Samples: 99K of event 'cycles', Event count (approx.): 10268746210
vmx_vcpu_run is high (on an Intel CPU); this is where the CPU switches into guest mode, so switching between guest mode and kernel mode is taking a lot of time. KVM tracing also shows many VM entries/exits, so the next step is to check why the VM exits happen (since the vCPU mode switches now take too much time).
Here is a small piece of the output:
[root@172-24-195-187 ~]# perf stat -e 'kvm:*' -a -- sleep 1
kvm_msr is the main reason for the VM exits, which matches the KVM tracing.
By collecting KVM events live:
perf kvm --host stat live
we can see that MSR_READ and EXTERNAL_INTERRUPT take up almost all of the time.
13:02:10.174121
From the KVM tracing:
<...>-41064 [000] .... 168733.114930: kvm_msr: msr_read 40000020 = 0xfb3c08a54
we can find 0x40000020 in the Linux kernel code:
/* MSR used to read the per-partition time reference counter */
#define HV_X64_MSR_TIME_REF_COUNT		0x40000020
It looks like a Hyper-V clocksource related issue, so I simply removed the Hyper-V clock setting from the libvirt XML, and the migration issue disappeared after that.
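For reference, the setting in question is libvirt's hypervclock timer, which is what turns on QEMU's hv-time enlightenment. A minimal sketch of how it can look in the domain XML (the clock offset and any other timers are placeholders; your configuration may differ):

<clock offset='localtime'>
  <!-- removing this timer (or setting present='no') disables hv-time -->
  <timer name='hypervclock' present='yes'/>
</clock>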
Try to find root cause
We can work around the issue by removing the clocksource from the VM configuration, but we still do not know the root cause; all we have is a VM exit on a read of the Hyper-V clocksource. So let's trace the kernel code for more details.
vmx.h defines the VM exit reasons; the one we keep hitting is:

#define EXIT_REASON_MSR_READ            31
and vmx.c registers the exit handlers in the kvm_vmx_exit_handlers array, where EXIT_REASON_MSR_READ is mapped to handle_rdmsr. So move on to handle_rdmsr:
static int handle_rdmsr(struct kvm_vcpu *vcpu)
Because we see the kvm_msr trace from KVM, the trace point in handle_rdmsr (trace_kvm_msr_read, or trace_kvm_msr_read_ex on failure) is executed and the handler returns 1, which means the VM can resume.
If you look back at the kvm process trace:
perf kvm --host top -p `pidof qemu-kvm`
we can see that inside vmx_vcpu_run the vmresume path is taken; according to the code, the handler finishes (after kvm_inject_gp(vcpu, 0) in the failure case) and the VM enters the guest again.
vmx_vcpu_run /proc/kcore
So that means the guest keeps exiting because of its own MSR_READ requests.
Checking the related kernel code, the call chain is the following:
vmx_get_msr -> kvm_get_msr_common -> kvm_hv_get_msr_common
Looking into kvm_hv_get_msr_common:
int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
because the MSR matches the partition-wide check:
static bool kvm_hv_msr_partition_wide(u32 msr)
the read falls into kvm_hv_get_msr_pw:
case HV_X64_MSR_TIME_REF_COUNT:
and finally returns 1 or reports:
vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n",
Coming to this point, it seems to be a guest issue, and according to https://msrc-blog.microsoft.com/2018/12/10/first-steps-in-hyper-v-research/ we can get some information about EXIT_REASON_MSR_READ:
Hyper-V handles MSR access (both read and write) in its VMEXIT loop handler. It’s easy to see it in IDA: it’s a large switch case over all the MSRs supported values, with the default case of falling back to rdmsr/wrmsr, if that MSR doesn’t have special treatment by the hypervisor. Note that there are authentication checks in the MSR read/write handlers, checking the current partition permissions. From there we can find the different MSRs Hyper-V supports, and the functions to handle read and write.
So it seems to be a Hyper-V feature to access the MSR time reference count.
Checking the QEMU documentation on ‘Hyper-V Enlightenments’, it explains the usage of:
hv-time
which is used to speed up all timestamp-related operations; but in our case the guest ends up not responding.
At this point I also noticed that when the guest is migrated, some lines appear in qemu.log:
2022-11-16T09:53:19.529516Z qemu-kvm: warning: TSC frequency mismatch between VM (2095020 kHz) and host (2095087 kHz), and TSC scaling unavailable
qemu-kvm warns that the TSC frequency mismatches, which normally does not happen for a guest.
Checking the QEMU code, we can see:
static int kvm_arch_set_tsc_khz(CPUState *cs)
QEMU tries KVM_SET_TSC_KHZ, and when it fails it prints those lines.
From the hypervisor functional specification:
The TscScale value is used to adjust the Virtual TSC value across migration events to mitigate TSC frequency changes from one platform to another.
It is used to mitigate TSC frequency changes for the guest across migration.
So looking at the QEMU code:
if (level == KVM_PUT_FULL_STATE) {
during live migration this warning is printed, but it does not abort the migration.
Indeed, newer NVIDIA drivers no longer require KVM hidden (https://www.heiko-sieger.info/passing-through-a-nvidia-rtx-2070-super-gpu/), so the concern about this use case may no longer be necessary.
I wrote a mail to the community to discuss whether there is a better way to solve the problem:
https://lists.nongnu.org/archive/html/qemu-discuss/2022-11/msg00028.html