[TOC]
What kvm hidden does to qemu
In the last blog, we saw how libvirt's CPU feature configuration changes the QEMU CPUID, and what kind of influence disabling the hypervisor feature has.
Another feature recommended by libvirt is kvm hidden. In the same way as in the last blog, we can find that libvirt configures kvm=off
on -cpu.
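For reference, here is a minimal sketch of the libvirt side (only the relevant fragment of the domain XML; the rest is omitted):

```xml
<!-- hide the KVM hypervisor signature from the guest -->
<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
```

With this, libvirt renders the QEMU command line with something like -cpu host,kvm=off (the exact CPU model depends on the rest of the configuration).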
According to QEMU:
1 | DEFINE_PROP_BOOL("hv-relaxed", X86CPU, hyperv_relaxed_timing, false), |
these properties are defined in target/i386/cpu.c in the variable x86_cpu_properties.
kvm=off is treated as the "kvm" property being false, which sets this CPU's expose_kvm field to false.
1 | if (!kvm_enabled() || !cpu->expose_kvm) { |
x86_cpu_realizefn invokes x86_cpu_expand_features to expand features from the configuration; as a result, all features in the FEAT_KVM word are disabled when the features are realized.
1 | [FEAT_KVM] = { |
Checking its definition, almost all KVM-related features are disabled.
Moving on to the Linux kernel, arch/x86/include/uapi/asm/kvm_para.h defines those features from CPUID:
1 | /* This CPUID returns a feature bitmap in eax. Before enabling a particular |
Before we check all the feature details, let's first see how Linux detects KVM.
The kernel checks for KVM with kvm_para_available:
1 | bool kvm_para_available(void) |
which detects a KVM-based hypervisor by checking cpu_has_hypervisor:
1 | static noinline uint32_t __kvm_cpuid_base(void) |
and cpu_has_hypervisor is defined from the hypervisor feature we mentioned in the last post:
1 |
So we combine those two parts together to check the influence introduced by kvm hidden.
Note: here is a brief description of those features in CPUID:
1 | function: define KVM_CPUID_FEATURES (0x40000001) |
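As a side note, we can observe this from inside a guest with a small userspace probe. This is my own sketch (not kernel code), assuming x86 and GCC/Clang's cpuid.h; it mirrors what cpu_has_hypervisor, __kvm_cpuid_base, and the feature bitmap check do:

```c
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13];

    /* CPUID.1:ECX bit 31 is the "hypervisor present" bit
     * (what cpu_has_hypervisor tests). */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!(ecx & (1u << 31))) {
        puts("no hypervisor bit: bare metal (or hypervisor hidden)");
        return 0;
    }

    /* Hypervisor vendor leaf: KVM stores "KVMKVMKVM\0\0\0" in ebx/ecx/edx. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(sig, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';

    if (strcmp(sig, "KVMKVMKVM") != 0) {
        printf("hypervisor signature is \"%s\", not KVM (e.g. kvm=off)\n", sig);
        return 0;
    }

    /* KVM_CPUID_FEATURES (0x40000001): feature bitmap in eax. */
    __cpuid(0x40000001, eax, ebx, ecx, edx);
    printf("KVM paravirt feature bitmap: 0x%08x\n", eax);
    return 0;
}
```

With kvm hidden, the signature check fails even though the hypervisor bit may still be set, which is exactly why kvm_para_available returns false in the guest.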
KVM_FEATURE_CLOCKSOURCE & KVM_FEATURE_CLOCKSOURCE2
This feature is used directly when implementing kvmclock_init:
1 | void __init kvmclock_init(void) |
KVM_FEATURE_NOP_IO_DELAY
During guest init, paravirt_ops_setup will use this feature:
1 | void __init kvm_guest_init(void) |
which changes the io_delay of the paravirt CPU ops to kvm_io_delay:
1 | static void __init paravirt_ops_setup(void) |
which simply means no I/O delay at all:
1 | /* |
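To make the override pattern concrete, here is a tiny standalone model of what paravirt_ops_setup does for this feature. It is a sketch of mine, not kernel code; the struct is trimmed to the single op we care about:

```c
#include <stdio.h>

/* Minimal model of the paravirt io_delay override; names mirror the
 * kernel but this is a standalone sketch. */
struct pv_cpu_ops {
    void (*io_delay)(void);
};

static void native_io_delay(void)
{
    /* native: a dummy port write used as a short delay; modeled as a print */
    puts("outb to port 0x80 (real delay)");
}

static void kvm_io_delay(void)
{
    /* paravirt: a KVM guest does not need the delay at all */
}

static struct pv_cpu_ops pv_cpu_ops = { .io_delay = native_io_delay };

int main(void)
{
    int has_nop_io_delay = 1; /* pretend CPUID reported KVM_FEATURE_NOP_IO_DELAY */

    if (has_nop_io_delay)
        pv_cpu_ops.io_delay = kvm_io_delay; /* what paravirt_ops_setup() does */

    pv_cpu_ops.io_delay(); /* now a no-op */
    return 0;
}
```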
KVM_FEATURE_MMU_OP
Deprecated.
KVM_FEATURE_ASYNC_PF
When initializing the KVM guest:
1 | void __init kvm_guest_init(void) |
kvm_apf_trap_init is installed as x86_init.irqs.trap_init, which registers the async_page_fault handler when traps are initialized:
1 | static void __init kvm_apf_trap_init(void) |
And then, when each KVM guest CPU is initialized, async page fault is enabled manually for the CPU:
1 | static void kvm_guest_cpu_init(void) |
The feature then enables async PF for this CPU.
Note: trap initialization is done by arch/x86/kernel/traps.c:
1 | void __init trap_init(void) |
and x86_init.irqs.trap_init(); is called after the standard traps are set up.
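As a rough standalone model of the async-PF handshake described above (my sketch, not kernel code; the real exchange goes through the per-cpu area registered via MSR_KVM_ASYNC_PF_EN):

```c
#include <stdint.h>
#include <stdio.h>

#define KVM_PV_REASON_PAGE_NOT_PRESENT 1
#define KVM_PV_REASON_PAGE_READY       2

/* per-cpu word the guest registers with the host via MSR_KVM_ASYNC_PF_EN */
static uint32_t apf_reason;

/* host side: set the reason before injecting the async page fault event */
static void host_inject(uint32_t reason) { apf_reason = reason; }

/* guest side: what the async_page_fault handler does - read and clear */
static uint32_t guest_read_and_clear(void)
{
    uint32_t reason = apf_reason;
    apf_reason = 0;
    return reason;
}

int main(void)
{
    host_inject(KVM_PV_REASON_PAGE_NOT_PRESENT);
    printf("guest sees reason %u (park this task until PAGE_READY)\n",
           guest_read_and_clear());
    host_inject(KVM_PV_REASON_PAGE_READY);
    printf("guest sees reason %u (wake the task)\n", guest_read_and_clear());
    return 0;
}
```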
KVM_FEATURE_STEAL_TIME
During KVM guest init:
1 | if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { |
The paravirt steal clock is replaced by KVM's:
1 | static u64 kvm_steal_clock(int cpu) |
which reads the steal time for the CPU directly from the shared area.
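The read itself is a lock-free version loop: the host bumps version to an odd value while updating steal, so the guest retries until it sees a stable even value. A standalone model of mine (the struct is trimmed from the uapi layout, and the fences stand in for the kernel's rmb()):

```c
#include <stdint.h>
#include <stdio.h>

struct kvm_steal_time {        /* trimmed; field order follows the uapi layout */
    uint64_t steal;            /* ns the vcpu spent not running */
    uint32_t version;          /* odd while the host is mid-update */
    uint32_t flags;
};

static uint64_t steal_clock_read(volatile struct kvm_steal_time *src)
{
    uint32_t version;
    uint64_t steal;

    do {
        version = src->version;
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* stands in for rmb() */
        steal = src->steal;
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
    } while ((version & 1) || (version != src->version));

    return steal;
}

int main(void)
{
    struct kvm_steal_time st = { .steal = 123456, .version = 2 };
    printf("steal time: %llu ns\n",
           (unsigned long long)steal_clock_read(&st));
    return 0;
}
```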
KVM_FEATURE_PV_EOI
From kvm guest init:
1 | if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) |
During kvm guest cpu init:
1 | if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) { |
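Conceptually, PV EOI lets the guest complete an EOI with a test-and-clear on shared memory instead of a trapping APIC register write. A standalone model (my sketch, not the kernel's actual code paths; the real byte is registered via MSR_KVM_PV_EOI_EN):

```c
#include <stdio.h>

static unsigned long kvm_pv_eoi_val; /* shared byte, bit 0 is the EOI flag */

/* host side: allow the next EOI to be completed in memory */
static void host_arm_pv_eoi(void) { kvm_pv_eoi_val |= 1UL; }

/* guest side: returns 1 if the EOI was done without an APIC write */
static int guest_eoi(void)
{
    if (kvm_pv_eoi_val & 1UL) {
        kvm_pv_eoi_val &= ~1UL;  /* test-and-clear, no VM exit needed */
        return 1;
    }
    /* otherwise fall back to writing the APIC EOI register (VM exit) */
    return 0;
}

int main(void)
{
    host_arm_pv_eoi();
    printf("EOI in memory: %d\n", guest_eoi()); /* 1: APIC write skipped */
    printf("EOI in memory: %d\n", guest_eoi()); /* 0: normal APIC write */
    return 0;
}
```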
Besides, these paravirt KVM features are used by the kernel itself, so they need to be disabled when the kernel changes (for example, when a new kernel is loaded via kexec). To avoid the features pointing at memory belonging to the old kernel, they are disabled by writing the MSRs manually:
1 | static void kvm_pv_guest_cpu_reboot(void *unused) |
KVM guest CPU offline does the same:
1 | static void kvm_guest_cpu_offline(void *dummy) |
That is all because paravirt uses shared memory between guest and host to implement these features.
KVM_FEATURE_PV_UNHALT
Allows the use of paravirtualized spinlocks:
1 | void __init kvm_spinlock_init(void) |
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT
kvmclock sets PVCLOCK_TSC_STABLE_BIT in pvclock:
1 | printk(KERN_INFO "kvm-clock: Using msrs %x and %x", |
When a stable source is detected:
1 | u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) |
the clocksource read returns directly.
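For reference, the arithmetic behind a pvclock read looks like the following standalone model (my sketch; field names follow pvclock_vcpu_time_info, the sample numbers are made up, and __uint128_t assumes GCC/Clang):

```c
#include <stdint.h>
#include <stdio.h>

/* trimmed copy of the guest/host shared structure */
struct pvclock_vcpu_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;     /* host TSC when system_time was sampled */
    uint64_t system_time;       /* ns at tsc_timestamp */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;             /* may carry PVCLOCK_TSC_STABLE_BIT */
};

/* the scaling pvclock reads are built on:
 * ns = system_time + (((tsc - tsc_timestamp) << shift) * mul >> 32) */
static uint64_t pvclock_ns(const struct pvclock_vcpu_time_info *src,
                           uint64_t tsc)
{
    uint64_t delta = tsc - src->tsc_timestamp;

    if (src->tsc_shift >= 0)
        delta <<= src->tsc_shift;
    else
        delta >>= -src->tsc_shift;

    return src->system_time +
           (uint64_t)(((__uint128_t)delta * src->tsc_to_system_mul) >> 32);
}

int main(void)
{
    /* made-up values: mul/shift chosen so one TSC tick maps to one ns */
    struct pvclock_vcpu_time_info info = {
        .tsc_timestamp = 1000, .system_time = 5000000,
        .tsc_to_system_mul = 1u << 31, .tsc_shift = 1,
    };
    printf("guest ns: %llu\n", (unsigned long long)pvclock_ns(&info, 3000));
    return 0;
}
```

When PVCLOCK_TSC_STABLE_BIT is set, the guest can trust this computation without extra cross-CPU monotonicity fixups, which is why the read can return directly.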
Hyper-V impact
Linux converts between the Hyper-V reference TSC page and kvmclock:
1 | static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock, |
but if no stable TSC is allowed, the Hyper-V clock and kvmclock computation is skipped.
The function chain is as follows:
kvm_guest_time_update
-> kvm_hv_setup_tsc_page
-> compute_tsc_page_parameters
And the source is a KVM request:
1 | if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) { |
We need to know more about KVM_REQ_CLOCK_UPDATE to figure out when this request is used. The clue is the usage of kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);.
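Conceptually, kvm_make_request and kvm_check_request are a set-bit / test-and-clear-bit pair on vcpu->requests. A standalone model (my sketch; it ignores the kernel's atomics, barriers, and the real request numbering):

```c
#include <stdio.h>

#define KVM_REQ_CLOCK_UPDATE 8 /* illustrative bit number, not the real value */

struct kvm_vcpu { unsigned long requests; };

static void kvm_make_request(int req, struct kvm_vcpu *vcpu)
{
    vcpu->requests |= 1UL << req;        /* kernel: set_bit() */
}

static int kvm_check_request(int req, struct kvm_vcpu *vcpu)
{
    if (vcpu->requests & (1UL << req)) {
        vcpu->requests &= ~(1UL << req); /* kernel: test-and-clear */
        return 1;
    }
    return 0;
}

int main(void)
{
    struct kvm_vcpu vcpu = { 0 };

    kvm_make_request(KVM_REQ_CLOCK_UPDATE, &vcpu);
    /* the vcpu run loop later consumes the request exactly once */
    printf("%d %d\n", kvm_check_request(KVM_REQ_CLOCK_UPDATE, &vcpu),
                      kvm_check_request(KVM_REQ_CLOCK_UPDATE, &vcpu)); /* 1 0 */
    return 0;
}
```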
The call sites are as follows:

The kvm clock set ioctl:
KVM_SET_CLOCK
-> kvm_gen_update_masterclock

kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu)
-> kvm_gen_update_masterclock

kvm_guest_time_update
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
The first update is from the KVM request handling:

```c
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
	r = kvm_guest_time_update(vcpu);
	if (unlikely(r))
		goto out;
}
```

Then interrupts are disabled to prevent changes to the clock:

```c
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
	local_irq_restore(flags);
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
	return 1;
}
```

INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
-> kvmclock_update_fn
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
kvmclock is updated by scheduled (delayed) work:

```c
/*
 * kvmclock updates which are isolated to a given vcpu, such as
 * vcpu->cpu migration, should not allow system_timestamp from
 * the rest of the vcpus to remain static. Otherwise ntp frequency
 * correction applies to one vcpu's system_timestamp but not
 * the others.
 *
 * So in those cases, request a kvmclock update for all vcpus.
 * We need to rate-limit these requests though, as they can
 * considerably slow guests that have a large number of vcpus.
 * The time for a remote vcpu to update its kvmclock is bound
 * by the delay we use to rate-limit the updates.
 */
```

and the kvmclock sync delays provide that rate limit.

kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu)
-> kvm_gen_kvmclock_update
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
MSR_KVM_SYSTEM_TIME (guest MSR write)
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

kvm_arch_vcpu_load
updates the clock if there is no master clock in use or the vcpu has no host CPU to sync with:

```c
/*
 * On a host with synchronized TSC, there is no need to update
 * kvmclock on vcpu->cpu migration
 */
if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
	kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
```

kvm_arch_vcpu_load
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
It adjusts the time if needed:

```c
/* Apply any externally detected TSC adjustments (due to suspend) */
if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
	adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
	vcpu->arch.tsc_offset_adjustment = 0;
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
```

kvm_set_guest_paused
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
If the guest kernel is stopped by the hypervisor, this is used to update the pv clock:

```c
/*
 * kvm_set_guest_paused() indicates to the guest kernel that it has been
 * stopped by the hypervisor. This function will be called from the host only.
 * EINVAL is returned when the host attempts to set the flag for a guest that
 * does not support pv clocks.
 */
static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.pv_time_enabled)
		return -EINVAL;
	vcpu->arch.pvclock_set_guest_stopped_request = true;
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
	return 0;
}
```

kvmclock_cpufreq_notifier
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
see the annotation in the code:

```c
/*
 * We allow guests to temporarily run on slowing clocks,
 * provided we notify them after, or to run on accelerating
 * clocks, provided we notify them before. Thus time never
 * goes backwards.
 *
 * However, we have a problem. We can't atomically update
 * the frequency of a given CPU from this function; it is
 * merely a notifier, which can be called from any CPU.
 * Changing the TSC frequency at arbitrary points in time
 * requires a recomputation of local variables related to
 * the TSC for each VCPU. We must flag these local variables
 * to be updated and be sure the update takes place with the
 * new frequency before any guests proceed.
 *
 * Unfortunately, the combination of hotplug CPU and frequency
 * change creates an intractable locking scenario; the order
 * of when these callouts happen is undefined with respect to
 * CPU hotplug, and they can race with each other. As such,
 * merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
 * undefined; you can actually have a CPU frequency change take
 * place in between the computation of X and the setting of the
 * variable. To protect against this problem, all updates of
 * the per_cpu tsc_khz variable are done in an interrupt
 * protected IPI, and all callers wishing to update the value
 * must wait for a synchronous IPI to complete (which is trivial
 * if the caller is on the CPU already). This establishes the
 * necessary total order on variable updates.
 *
 * Note that because a guest time update may take place
 * anytime after the setting of the VCPU's request bit, the
 * correct TSC value must be set before the request. However,
 * to ensure the update actually makes it to any guest which
 * starts running in hardware virtualization between the set
 * and the acquisition of the spinlock, we must also ping the
 * CPU after setting the request bit.
 *
 */
```

After kvm_guest_exit();, the clock is updated if the vcpu requires the clock to always be caught up:

```c
if (unlikely(vcpu->arch.tsc_always_catchup))
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
```

hardware_enable_nolock
-> kvm_arch_hardware_enable
-> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
Multiple functions access hardware_enable_nolock: kvm_cpu_hotplug:

```c
static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
			   void *v)
{
	val &= ~CPU_TASKS_FROZEN;
	switch (val) {
	case CPU_DYING:
		hardware_disable();
		break;
	case CPU_STARTING:
		hardware_enable();
		break;
	}
	return NOTIFY_OK;
}
```

and kvm_resume.
Note: for hv_stimer:
1 | /* |
this is done after the guest clock is brought up to date.
Hyper-V impact conclusion
With kvm hidden, the Hyper-V TSC page computation is skipped:
1 | static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock, |
which can be triggered by the KVM code above.
During migration, we know that the guest is stopped (paused) via KVM_KVMCLOCK_CTRL, and we can check the KVM userspace (QEMU) usage:
1 | static void kvmclock_vm_state_change(void *opaque, int running, |
When setting the guest to running, QEMU uses KVM_SET_CLOCK; otherwise it uses kvm_update_clock, which works as follows:
1 | static void kvm_update_clock(KVMClockState *s) |
But from the annotation in kvmclock_vm_state_change:
1 | /* |
QEMU seems to rely on the saved VM state to reset the guest clock when the VM is continued; we just keep an eye on that.
Combined with the QEMU guest state change hook:
1 | case KVM_SET_CLOCK: { |
this is used to update the guest clock.
Hands-on test to confirm clock updates
Enable KVM tracing with:
1 | echo 1 > /sys/kernel/debug/tracing/events/kvm/enable |
Then collect the output when a VM is migrated to this host:
1 | cat /sys/kernel/debug/tracing/trace_pipe > trace_migrated_vm |
We can see the following logs at first:
1 | <...>-89383 [001] .... 97852.765277: kvm_update_master_clock: masterclock 0 hostclock 0x2 offsetmatched 0 |
kvm_update_master_clock is used for VM migration.
And the TSC offset changed:
1 | <...>-89441 [002] d... 97852.785366: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446539041810541506 |
Following the trace, we can find the Linux kernel code:
kvm_vcpu_write_tsc_offset
-> kvm_x86_write_l1_tsc_offset
-> write_l1_tsc_offset
-> vmx_write_l1_tsc_offset
-> trace_kvm_write_tsc_offset
And there are multiple usages of kvm_vcpu_write_tsc_offset:
- kvm_synchronize_tsc
  - MSR_IA32_TSC -> kvm_synchronize_tsc
  - kvm_vm_ioctl_create_vcpu -> kvm_arch_vcpu_postcreate -> kvm_synchronize_tsc
- adjust_tsc_offset_guest
  - kvm_guest_time_update -> adjust_tsc_offset_guest, and kvm_hv_setup_tsc_page (this is the Hyper-V impacted case)
  - MSR_IA32_TSC -> adjust_tsc_offset_guest
  - MSR_IA32_TSC_ADJUST -> adjust_tsc_offset_guest
  - kvm_arch_vcpu_load -> adjust_tsc_offset_host -> adjust_tsc_offset_guest
- kvm_arch_vcpu_load, same as above
So the following three uses of kvm_vcpu_write_tsc_offset match guest creation:
- Create vcpu
- Load vcpu
- Adjust tsc offset
In the last guest hang post, we saw the Windows guest try to get the reference counter:
1 | static u64 get_time_ref_counter(struct kvm *kvm) |
But this is used by MSR read requests from the guest. Now we need to debug the usage of hv_tsc_page_status and kvm_hv_setup_tsc_page.
Without kvm hidden:
1 | <...>-114210 [002] d... 12255.411580: kvm_exit: vcpu 1 reason MSR_READ rip 0xfffff800ece454c5 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000 |
We can find kvm_hv_timer_state in the trace, and according to the Linux kernel code:
1 | TRACE_EVENT(kvm_hv_timer_state, |
There are two ways this trace shows up:

start_sw_timer
-> trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, false);
which is always false (shows 0 in the trace)

start_hv_timer
-> trace_kvm_hv_timer_state(vcpu->vcpu_id, ktimer->hv_timer_in_use);
which reports hv_timer_in_use from ktimer->hv_timer_in_use
Check the code of start_hv_timer:
1 | static bool start_hv_timer(struct kvm_lapic *apic) |
ktimer->hv_timer_in_use is set to true there, so we focus on start_sw_timer next.
There are several ways to get into restart_apic_timer:

restart_apic_timer
-> start_sw_timer

vmx_exit_handlers_fastpath or __vmx_handle_exit
-> handle_fastpath_preemption_timer
-> kvm_lapic_expired_hv_timer
-> restart_apic_timer

vcpu_block
-> post_block
-> vmx_post_block
-> kvm_lapic_switch_to_hv_timer
-> restart_apic_timer

MSR_IA32_TSC_DEADLINE
-> handle_fastpath_set_tscdeadline
-> kvm_set_lapic_tscdeadline_msr
-> __start_apic_timer
-> restart_apic_timer

APIC_TDCR
-> restart_apic_timer

vcpu_block
-> vmx_pre_block
-> kvm_lapic_switch_to_sw_timer
-> start_sw_timer
Because a trace we saw before shows:
1 | kvm_vcpu_wakeup: wait time 1759974 ns, polling valid |
which is in kvm_vcpu_block, this means vmx_post_block led to restart_apic_timer:
1 | trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu)); |
And because the code runs as:
1 | if (!start_hv_timer(apic)) |
start_hv_timer must return false:
1 | static bool start_hv_timer(struct kvm_lapic *apic) |
The kvm_can_use_hv_timer check seems to pass on an x86 machine as long as X86_FEATURE_MWAIT is supported.
From the trace we can see that when a vcpu exits and comes back to work, the timer is updated; take vcpu 3 as an example:
1 | <...>-114212 [002] d... 12297.437890: kvm_exit: vcpu 3 reason HLT rip 0xfffff800ecc2b36e info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000 |
vcpu 3 executes HLT, causing a kvm_exit.
Then it wakes up after 4774180 ns,
and hv_timer is traced as not in use.
1 | <...>-114212 [002] .... 12255.393408: kvm_vcpu_wakeup: wait time 4774180 ns, polling valid |
And the hv_timer is cancelled after live migration:
1 | if (apic->lapic_timer.hv_timer_in_use) |
Let’s check hv_timer before migration:
Can we resolve compatibility issues?
Looking at the QEMU code, it disables the FEAT_KVM features after all features are set up, so we cannot manually assign those features:
1 | for (l = plus_features; l; l = l->next) { |
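A toy model of why that ordering defeats any manual assignment (my sketch, not QEMU's actual flow): the user's +feature bits are applied first, and the expose_kvm mask is applied later during realize:

```c
#include <stdint.h>
#include <stdio.h>

enum { FEAT_KVM };               /* single feature word for the sketch */

int main(void)
{
    uint32_t features[1] = { 0 };
    int expose_kvm = 0;          /* kvm=off */

    /* step 1: apply user "+feature" requests (the plus_features loop) */
    features[FEAT_KVM] |= 1u << 3; /* e.g. trying to force a pv feature on */

    /* step 2: realize; kvm hidden zeroes the whole word afterwards */
    if (!expose_kvm)
        features[FEAT_KVM] = 0;

    printf("FEAT_KVM word: 0x%x\n", features[FEAT_KVM]); /* always 0x0 */
    return 0;
}
```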