[TOC]
What kvm hidden does to qemu
In the last blog, we saw how libvirt's CPU feature configuration changes the QEMU CPUID, and what kind of influence disabling the hypervisor feature has.
Another feature recommended by libvirt is kvm hidden. In the same way as in the last blog, we can find that libvirt configures kvm=off
on -cpu.
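For reference, here is a minimal sketch of the libvirt side (only the relevant fragment of the domain XML; the rest is omitted):

```xml
<!-- hide the KVM hypervisor signature from the guest -->
<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
```

With this, libvirt renders the QEMU command line with something like -cpu host,kvm=off (the exact CPU model depends on the rest of the configuration).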
According to QEMU:
1 | DEFINE_PROP_BOOL("hv-relaxed", X86CPU, hyperv_relaxed_timing, false), |
these properties are defined in target/i386/cpu.c in the variable x86_cpu_properties.
kvm=off is treated as the "kvm" property being false, which sets this CPU's expose_kvm field to false.
1 | if (!kvm_enabled() || !cpu->expose_kvm) { |
x86_cpu_realizefn invokes x86_cpu_expand_features to expand features from the configuration; as a result, all features in the FEAT_KVM word are disabled when the features are realized.
1 | [FEAT_KVM] = { |
Checking its definition, almost all KVM-related features are disabled.
Moving on to the Linux kernel, arch/x86/include/uapi/asm/kvm_para.h defines those features from CPUID:
1 | /* This CPUID returns a feature bitmap in eax. Before enabling a particular |
Before we check all the feature details, let's first see how Linux detects KVM.
The kernel checks for KVM with kvm_para_available:
1 | bool kvm_para_available(void) |
which detects a KVM-based hypervisor by checking cpu_has_hypervisor:
1 | static noinline uint32_t __kvm_cpuid_base(void) |
and cpu_has_hypervisor is defined from the hypervisor feature we mentioned in the last post:
1 |
So we combine those two parts together to check the influence introduced by kvm hidden.
Note: here is a brief description of those features in CPUID:
1 | function: define KVM_CPUID_FEATURES (0x40000001) |
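As a side note, we can observe this from inside a guest with a small userspace probe. This is my own sketch (not kernel code), assuming x86 and GCC/Clang's cpuid.h; it mirrors what cpu_has_hypervisor, __kvm_cpuid_base, and the feature bitmap check do:

```c
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13];

    /* CPUID.1:ECX bit 31 is the "hypervisor present" bit
     * (what cpu_has_hypervisor tests). */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!(ecx & (1u << 31))) {
        puts("no hypervisor bit: bare metal (or hypervisor hidden)");
        return 0;
    }

    /* Hypervisor vendor leaf: KVM stores "KVMKVMKVM\0\0\0" in ebx/ecx/edx. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';

    if (strcmp(sig, "KVMKVMKVM") != 0) {
        printf("hypervisor signature is \"%s\", not KVM (e.g. kvm=off)\n", sig);
        return 0;
    }

    /* KVM_CPUID_FEATURES (0x40000001): feature bitmap in eax. */
    __cpuid(0x40000001, eax, ebx, ecx, edx);
    printf("KVM paravirt feature bitmap: 0x%08x\n", eax);
    return 0;
}
```

With kvm hidden, the signature check fails even though the hypervisor bit may still be set, which is exactly why kvm_para_available returns false in the guest.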
KVM_FEATURE_CLOCKSOURCE & KVM_FEATURE_CLOCKSOURCE2
This feature is used directly when implementing kvmclock_init:
1 | void __init kvmclock_init(void) |
KVM_FEATURE_NOP_IO_DELAY
During guest init, paravirt_ops_setup will use this feature:
1 | void __init kvm_guest_init(void) |
which changes the io_delay of the paravirt CPU ops to kvm_io_delay:
1 | static void __init paravirt_ops_setup(void) |
which simply means no I/O delay at all:
1 | /* |
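To make the override pattern concrete, here is a tiny standalone model of what paravirt_ops_setup does for this feature. It is a sketch of mine, not kernel code; the struct is trimmed to the single op we care about:

```c
#include <stdio.h>

/* Minimal model of the paravirt io_delay override; names mirror the
 * kernel but this is a standalone sketch. */
struct pv_cpu_ops {
    void (*io_delay)(void);
};

static void native_io_delay(void)
{
    /* native: a dummy port write used as a short delay; modeled as a print */
    puts("outb to port 0x80 (real delay)");
}

static void kvm_io_delay(void)
{
    /* paravirt: a KVM guest does not need the delay at all */
}

static struct pv_cpu_ops pv_cpu_ops = { .io_delay = native_io_delay };

int main(void)
{
    int has_nop_io_delay = 1; /* pretend CPUID reported KVM_FEATURE_NOP_IO_DELAY */

    if (has_nop_io_delay)
        pv_cpu_ops.io_delay = kvm_io_delay; /* what paravirt_ops_setup() does */

    pv_cpu_ops.io_delay(); /* now a no-op */
    return 0;
}
```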
KVM_FEATURE_MMU_OP
Deprecated.
KVM_FEATURE_ASYNC_PF
When initializing the KVM guest:
1 | void __init kvm_guest_init(void) |
kvm_apf_trap_init is installed as x86_init.irqs.trap_init, which registers the async_page_fault handler when traps are initialized:
1 | static void __init kvm_apf_trap_init(void) |
And then, when each KVM guest CPU is initialized, async page fault is enabled manually for the CPU:
1 | static void kvm_guest_cpu_init(void) |
The feature then enables async PF for this CPU.
Note: trap initialization is done by arch/x86/kernel/traps.c:
1 | void __init trap_init(void) |
and x86_init.irqs.trap_init(); is called after the standard traps are set up.
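As a rough standalone model of the async-PF handshake described above (my sketch, not kernel code; the real exchange goes through the per-cpu area registered via MSR_KVM_ASYNC_PF_EN):

```c
#include <stdint.h>
#include <stdio.h>

#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

/* per-cpu word the guest registers with the host via MSR_KVM_ASYNC_PF_EN */
static uint32_t apf_reason;

/* host side: set the reason before injecting the async page fault event */
static void host_inject(uint32_t reason) { apf_reason = reason; }

/* guest side: what the async_page_fault handler does - read and clear */
static uint32_t guest_read_and_clear(void)
{
    uint32_t reason = apf_reason;
    apf_reason = 0;
    return reason;
}

int main(void)
{
    host_inject(KVM_PV_REASON_PAGE_NOT_PRESENT);
    printf("guest sees reason %u (park this task until PAGE_READY)\n",
           guest_read_and_clear());
    host_inject(KVM_PV_REASON_PAGE_READY);
    printf("guest sees reason %u (wake the task)\n", guest_read_and_clear());
    return 0;
}
```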
KVM_FEATURE_STEAL_TIME
During KVM guest init:
1 | if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { |
The paravirt steal clock is replaced by KVM's:
1 | static u64 kvm_steal_clock(int cpu) |
which reads the steal time for the CPU directly from the shared area.
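The read itself is a lock-free version loop: the host bumps version to an odd value while updating steal, so the guest retries until it sees a stable even value. A standalone model of mine (the struct is trimmed from the uapi layout, and the fences stand in for the kernel's rmb()):

```c
#include <stdint.h>
#include <stdio.h>

struct kvm_steal_time {        /* trimmed; field order follows the uapi layout */
    uint64_t steal;            /* ns the vcpu spent not running */
    uint32_t version;          /* odd while the host is mid-update */
    uint32_t flags;
};

static uint64_t steal_clock_read(volatile struct kvm_steal_time *src)
{
    uint32_t version;
    uint64_t steal;

    do {
        version = src->version;
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* stands in for rmb() */
        steal = src->steal;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while ((version & 1) || (version != src->version));

    return steal;
}

int main(void)
{
    struct kvm_steal_time st = { .steal = 123456, .version = 2 };
    printf("steal time: %llu ns\n",
           (unsigned long long)steal_clock_read(&st));
    return 0;
}
```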
KVM_FEATURE_PV_EOI
From kvm guest init:
1 | if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) |
During kvm guest cpu init:
1 | if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) { |
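Conceptually, PV EOI lets the guest complete an EOI with a test-and-clear on shared memory instead of a trapping APIC register write. A standalone model (my sketch, not the kernel's actual code paths; the real byte is registered via MSR_KVM_PV_EOI_EN):

```c
#include <stdio.h>

static unsigned long kvm_pv_eoi_val; /* shared byte, bit 0 is the EOI flag */

/* host side: allow the next EOI to be completed in memory */
static void host_arm_pv_eoi(void) { kvm_pv_eoi_val |= 1UL; }

/* guest side: returns 1 if the EOI was done without an APIC write */
static int guest_eoi(void)
{
    if (kvm_pv_eoi_val & 1UL) {
        kvm_pv_eoi_val &= ~1UL;  /* test-and-clear, no VM exit needed */
        return 1;
    }
    /* otherwise fall back to writing the APIC EOI register (VM exit) */
    return 0;
}

int main(void)
{
    host_arm_pv_eoi();
    printf("EOI in memory: %d\n", guest_eoi()); /* 1: APIC write skipped */
    printf("EOI in memory: %d\n", guest_eoi()); /* 0: normal APIC write */
    return 0;
}
```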
Besides, these paravirt KVM features are used by the kernel itself, so they need to be disabled when the kernel changes (for example, when a new kernel is loaded via kexec). To avoid the features pointing at memory belonging to the old kernel, they are disabled by writing the MSRs manually:
1 | static void kvm_pv_guest_cpu_reboot(void *unused) |
KVM guest CPU offline does the same:
1 | static void kvm_guest_cpu_offline(void *dummy) |
That is all because paravirt uses shared memory between guest and host to implement these features.
KVM_FEATURE_PV_UNHALT
Allows the use of paravirtualized spinlocks:
1 | void __init kvm_spinlock_init(void) |
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT
kvmclock sets PVCLOCK_TSC_STABLE_BIT in pvclock:
1 | printk(KERN_INFO "kvm-clock: Using msrs %x and %x", |
When a stable source is detected:
1 | u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) |
the clocksource read returns directly.
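For reference, the arithmetic behind a pvclock read looks like the following standalone model (my sketch; field names follow pvclock_vcpu_time_info, the sample numbers are made up, and __uint128_t assumes GCC/Clang):

```c
#include <stdint.h>
#include <stdio.h>

/* trimmed copy of the guest/host shared structure */
struct pvclock_vcpu_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;     /* host TSC when system_time was sampled */
    uint64_t system_time;       /* ns at tsc_timestamp */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;             /* may carry PVCLOCK_TSC_STABLE_BIT */
};

/* the scaling pvclock reads are built on:
 * ns = system_time + (((tsc - tsc_timestamp) << shift) * mul >> 32) */
static uint64_t pvclock_ns(const struct pvclock_vcpu_time_info *src,
                           uint64_t tsc)
{
    uint64_t delta = tsc - src->tsc_timestamp;

    if (src->tsc_shift >= 0)
        delta <<= src->tsc_shift;
    else
        delta >>= -src->tsc_shift;

    return src->system_time +
           (uint64_t)(((__uint128_t)delta * src->tsc_to_system_mul) >> 32);
}

int main(void)
{
    /* made-up values: mul/shift chosen so one TSC tick maps to one ns */
    struct pvclock_vcpu_time_info info = {
        .tsc_timestamp = 1000, .system_time = 5000000,
        .tsc_to_system_mul = 1u << 31, .tsc_shift = 1,
    };
    printf("guest ns: %llu\n", (unsigned long long)pvclock_ns(&info, 3000));
    return 0;
}
```

When PVCLOCK_TSC_STABLE_BIT is set, the guest can trust this computation without extra cross-CPU monotonicity fixups, which is why the read can return directly.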
Hyper-V impact
Linux converts between the Hyper-V reference TSC page and kvmclock:
1 | static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock, |
but if no stable TSC is allowed, the Hyper-V clock and kvmclock computation is skipped.
The function chain is as follows:
kvm_guest_time_update
-> kvm_hv_setup_tsc_page
-> compute_tsc_page_parameters
And the source is a KVM request:
1 | if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) { |
We need to know more about KVM_REQ_CLOCK_UPDATE to figure out when this request is used. The clue is the usage of kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);.
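Conceptually, kvm_make_request and kvm_check_request are a set-bit / test-and-clear-bit pair on vcpu->requests. A standalone model (my sketch; it ignores the kernel's atomics, barriers, and the real request numbering):

```c
#include <stdio.h>

#define KVM_REQ_CLOCK_UPDATE 8 /* illustrative bit number, not the real value */

struct kvm_vcpu { unsigned long requests; };

static void kvm_make_request(int req, struct kvm_vcpu *vcpu)
{
    vcpu->requests |= 1UL << req;        /* kernel: set_bit() */
}

static int kvm_check_request(int req, struct kvm_vcpu *vcpu)
{
    if (vcpu->requests & (1UL << req)) {
        vcpu->requests &= ~(1UL << req); /* kernel: test-and-clear */
        return 1;
    }
    return 0;
}

int main(void)
{
    struct kvm_vcpu vcpu = { 0 };

    kvm_make_request(KVM_REQ_CLOCK_UPDATE, &vcpu);
    /* the vcpu run loop later consumes the request exactly once */
    printf("%d %d\n", kvm_check_request(KVM_REQ_CLOCK_UPDATE, &vcpu),
                      kvm_check_request(KVM_REQ_CLOCK_UPDATE, &vcpu)); /* 1 0 */
    return 0;
}
```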
The call sites are as follows:

The kvm clock set ioctl:
KVM_SET_CLOCK
-> kvm_gen_update_masterclock

kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu)
-> kvm_gen_update_masterclock

kvm_guest_time_update
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
The first update is from the KVM request handling:

```c
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
	r = kvm_guest_time_update(vcpu);
	if (unlikely(r))
		goto out;
}
```

Then interrupts are disabled to prevent changes to the clock:

```c
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
	local_irq_restore(flags);
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
	return 1;
}
```

INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
-> kvmclock_update_fn
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
kvmclock is updated by scheduled (delayed) work:

```c
/*
 * kvmclock updates which are isolated to a given vcpu, such as
 * vcpu->cpu migration, should not allow system_timestamp from
 * the rest of the vcpus to remain static. Otherwise ntp frequency
 * correction applies to one vcpu's system_timestamp but not
 * the others.
 *
 * So in those cases, request a kvmclock update for all vcpus.
 * We need to rate-limit these requests though, as they can
 * considerably slow guests that have a large number of vcpus.
 * The time for a remote vcpu to update its kvmclock is bound
 * by the delay we use to rate-limit the updates.
 */
```

and the kvmclock sync delays provide that rate limit.

kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu)
-> kvm_gen_kvmclock_update
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
MSR_KVM_SYSTEM_TIME (guest MSR write)
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

kvm_arch_vcpu_load
updates the clock if there is no master clock in use or the vcpu has no host CPU to sync with:

```c
/*
 * On a host with synchronized TSC, there is no need to update
 * kvmclock on vcpu->cpu migration
 */
if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
	kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
```

kvm_arch_vcpu_load
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
It adjusts the time if needed:

```c
/* Apply any externally detected TSC adjustments (due to suspend) */
if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
	adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
	vcpu->arch.tsc_offset_adjustment = 0;
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
```

kvm_set_guest_paused
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
If the guest kernel is stopped by the hypervisor, this is used to update the pv clock:

```c
/*
 * kvm_set_guest_paused() indicates to the guest kernel that it has been
 * stopped by the hypervisor. This function will be called from the host only.
 * EINVAL is returned when the host attempts to set the flag for a guest that
 * does not support pv clocks.
 */
static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.pv_time_enabled)
		return -EINVAL;
	vcpu->arch.pvclock_set_guest_stopped_request = true;
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
	return 0;
}
```

kvmclock_cpufreq_notifier
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
see the annotation in the code:

```c
/*
 * We allow guests to temporarily run on slowing clocks,
 * provided we notify them after, or to run on accelerating
 * clocks, provided we notify them before. Thus time never
 * goes backwards.
 *
 * However, we have a problem. We can't atomically update
 * the frequency of a given CPU from this function; it is
 * merely a notifier, which can be called from any CPU.
 * Changing the TSC frequency at arbitrary points in time
 * requires a recomputation of local variables related to
 * the TSC for each VCPU. We must flag these local variables
 * to be updated and be sure the update takes place with the
 * new frequency before any guests proceed.
 *
 * Unfortunately, the combination of hotplug CPU and frequency
 * change creates an intractable locking scenario; the order
 * of when these callouts happen is undefined with respect to
 * CPU hotplug, and they can race with each other. As such,
 * merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
 * undefined; you can actually have a CPU frequency change take
 * place in between the computation of X and the setting of the
 * variable. To protect against this problem, all updates of
 * the per_cpu tsc_khz variable are done in an interrupt
 * protected IPI, and all callers wishing to update the value
 * must wait for a synchronous IPI to complete (which is trivial
 * if the caller is on the CPU already). This establishes the
 * necessary total order on variable updates.
 *
 * Note that because a guest time update may take place
 * anytime after the setting of the VCPU's request bit, the
 * correct TSC value must be set before the request. However,
 * to ensure the update actually makes it to any guest which
 * starts running in hardware virtualization between the set
 * and the acquisition of the spinlock, we must also ping the
 * CPU after setting the request bit.
 *
 */
```

After kvm_guest_exit();, the clock is updated if the vcpu requires the clock to always be caught up:

```c
if (unlikely(vcpu->arch.tsc_always_catchup))
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
```

hardware_enable_nolock
-> kvm_arch_hardware_enable
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
Multiple functions access hardware_enable_nolock: kvm_cpu_hotplug:

```c
static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
			   void *v)
{
	val &= ~CPU_TASKS_FROZEN;
	switch (val) {
	case CPU_DYING:
		hardware_disable();
		break;
	case CPU_STARTING:
		hardware_enable();
		break;
	}
	return NOTIFY_OK;
}
```

and kvm_resume.
Note: for hv_stimer:
1 | /* |
this is done after the guest clock is brought up to date.
Hyper-V impact conclusion
With kvm hidden, the Hyper-V TSC page computation is skipped:
1 | static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock, |
which can be triggered by the KVM code above.
During migration, we know that the guest is stopped (paused) via KVM_KVMCLOCK_CTRL, and we can check the KVM userspace (QEMU) usage:
1 | static void kvmclock_vm_state_change(void *opaque, int running, |
When setting the guest to running, QEMU uses KVM_SET_CLOCK; otherwise it uses kvm_update_clock, which works as follows:
1 | static void kvm_update_clock(KVMClockState *s) |
But from the annotation in kvmclock_vm_state_change:
1 | /* |
QEMU seems to rely on the saved VM state to reset the guest clock when the VM is continued; we just keep an eye on that.
Combined with the QEMU guest state change hook:
1 | case KVM_SET_CLOCK: { |
this is used to update the guest clock.
Hands-on test to confirm clock updates
Enable KVM tracing with:
1 | echo 1 > /sys/kernel/debug/tracing/events/kvm/enable |
Then collect the output when a VM is migrated to this host:
1 | cat /sys/kernel/debug/tracing/trace_pipe > trace_migrated_vm |
We can see the following logs at first:
1 | <...>-89383 [001] .... 97852.765277: kvm_update_master_clock: masterclock 0 hostclock 0x2 offsetmatched 0 |
kvm_update_master_clock is used for VM migration.
And the TSC offset changed:
1 | <...>-89441 [002] d... 97852.785366: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446539041810541506 |
Following the trace, we can find the Linux kernel code:
kvm_vcpu_write_tsc_offset
-> kvm_x86_write_l1_tsc_offset
-> write_l1_tsc_offset
-> vmx_write_l1_tsc_offset
-> trace_kvm_write_tsc_offset
And there are multiple usages of kvm_vcpu_write_tsc_offset:
- kvm_synchronize_tsc
  - MSR_IA32_TSC -> kvm_synchronize_tsc
  - kvm_vm_ioctl_create_vcpu -> kvm_arch_vcpu_postcreate -> kvm_synchronize_tsc
- adjust_tsc_offset_guest
  - kvm_guest_time_update -> adjust_tsc_offset_guest, and kvm_hv_setup_tsc_page (this is the Hyper-V impacted case)
  - MSR_IA32_TSC -> adjust_tsc_offset_guest
  - MSR_IA32_TSC_ADJUST -> adjust_tsc_offset_guest
  - kvm_arch_vcpu_load -> adjust_tsc_offset_host -> adjust_tsc_offset_guest
- kvm_arch_vcpu_load, same as above
So the following three uses of kvm_vcpu_write_tsc_offset match guest creation:
- Create vcpu
- Load vcpu
- Adjust tsc offset
In the last guest hang post, we saw the Windows guest try to get the reference counter:
1 | static u64 get_time_ref_counter(struct kvm *kvm) |
But this is used by MSR read requests from the guest. Now we need to debug the usage of hv_tsc_page_status and kvm_hv_setup_tsc_page.
Without kvm hidden:
1 | <...>-114210 [002] d... 12255.411580: kvm_exit: vcpu 1 reason MSR_READ rip 0xfffff800ece454c5 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000 |
We can find kvm_hv_timer_state in the trace, and according to the Linux kernel code:
1 | TRACE_EVENT(kvm_hv_timer_state, |
There are two ways this trace shows up:

start_sw_timer
-> trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, false);
which is always false (shows 0 in the trace)

start_hv_timer
-> trace_kvm_hv_timer_state(vcpu->vcpu_id, ktimer->hv_timer_in_use);
which reports hv_timer_in_use from ktimer->hv_timer_in_use
Check the code of start_hv_timer:
1 | static bool start_hv_timer(struct kvm_lapic *apic) |
ktimer->hv_timer_in_use is set to true there, so we focus on start_sw_timer next.
There are several ways to get into restart_apic_timer:

restart_apic_timer
-> start_sw_timer

vmx_exit_handlers_fastpath or __vmx_handle_exit
-> handle_fastpath_preemption_timer
-> kvm_lapic_expired_hv_timer
-> restart_apic_timer

vcpu_block
-> post_block
-> vmx_post_block
-> kvm_lapic_switch_to_hv_timer
-> restart_apic_timer

MSR_IA32_TSC_DEADLINE
-> handle_fastpath_set_tscdeadline
-> kvm_set_lapic_tscdeadline_msr
-> __start_apic_timer
-> restart_apic_timer

APIC_TDCR
-> restart_apic_timer

vcpu_block
-> vmx_pre_block
-> kvm_lapic_switch_to_sw_timer
-> start_sw_timer
Because a trace we saw before shows:
1 | kvm_vcpu_wakeup: wait time 1759974 ns, polling valid |
which is in kvm_vcpu_block, this means vmx_post_block led to restart_apic_timer:
1 | trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu)); |
And because the code runs as:
1 | if (!start_hv_timer(apic)) |
start_hv_timer must return false:
1 | static bool start_hv_timer(struct kvm_lapic *apic) |
The kvm_can_use_hv_timer check seems to pass on an x86 machine as long as X86_FEATURE_MWAIT is supported.
From the trace we can see that when a vcpu exits and comes back to work, the timer is updated; take vcpu 3 as an example:
1 | <...>-114212 [002] d... 12297.437890: kvm_exit: vcpu 3 reason HLT rip 0xfffff800ecc2b36e info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000 |
vcpu 3 executes HLT, causing a kvm_exit.
Then it wakes up after 4774180 ns,
and hv_timer is traced as not in use.
1 | <...>-114212 [002] .... 12255.393408: kvm_vcpu_wakeup: wait time 4774180 ns, polling valid |
And the hv_timer is cancelled after live migration:
1 | if (apic->lapic_timer.hv_timer_in_use) |
Let’s check hv_timer before migration:
Can we resolve compatibility issues?
Looking at the QEMU code, it disables the FEAT_KVM features after all features are set up, so we cannot manually assign those features:
1 | for (l = plus_features; l; l = l->next) { |
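A toy model of why that ordering defeats any manual assignment (my sketch, not QEMU's actual flow): the user's +feature bits are applied first, and the expose_kvm mask is applied later during realize:

```c
#include <stdint.h>
#include <stdio.h>

enum { FEAT_KVM };               /* single feature word for the sketch */

int main(void)
{
    uint32_t features[1] = { 0 };
    int expose_kvm = 0;          /* kvm=off */

    /* step 1: apply user "+feature" requests (the plus_features loop) */
    features[FEAT_KVM] |= 1u << 3; /* e.g. trying to force a pv feature on */

    /* step 2: realize; kvm hidden zeroes the whole word afterwards */
    if (!expose_kvm)
        features[FEAT_KVM] = 0;

    printf("FEAT_KVM word: 0x%x\n", features[FEAT_KVM]); /* always 0x0 */
    return 0;
}
```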