Linux guest could not access the tsc clocksource

Phenomenon

Checking that the TSC is disabled in the guest:

[    0.000004] tsc: Detected 1999.999 MHz processor
[    0.251920] tsc: Marking TSC unstable due to TSCs unsynchronized

Steps

  1. Run lscpu | grep tsc in the guest to confirm the CPU supports TSC.
  2. Check via dmidecode that the VM is not running on IBM's Summit2 platform.
  3. Run lscpu | grep constant_tsc in both guest and host to confirm the flag is present on the host but missing in the guest.
  4. Change the kernel cmdline and add tsc=reliable if needed (check the current one with cat /proc/cmdline).
  5. Check the CPU vendor with lscpu to see whether it is an Intel CPU.
  6. Remove acpi from the libvirt XML; but this makes all hot-plug operations fail.

After changing to another AMD CPU, the tsc clocksource still could not be found.

Any CPU whose vendor is not Intel will hit this issue when running the Linux kernel as a guest.

Resolution

Method 1:

Add tsc=reliable to the guest OS kernel cmdline.

Method 2:

Remove acpi from the libvirt XML via virsh edit (note the hot-plug side effect above).

More information

The x86 guest disables constant_tsc in code; see the discussion: https://lore.kernel.org/qemu-devel/20210426185155.GA32439@fuller.cnet/t/

But the Linux guest kernel still lets Intel CPUs use the TSC, and as a result Intel and AMD CPUs behave differently, as shown below.

Going through the kernel code (arch/x86/kernel/tsc.c):

Code snippet 0

/*
 * TSC can be unstable due to cpufreq or due to unsynced TSCs
 */
static int __read_mostly tsc_unstable;

The code defines a static integer variable tsc_unstable. The static keyword means that this variable is only visible within the file it’s defined in, limiting its scope to tsc.c. This is a common practice when you want to keep the variable private to a particular file and prevent it from being accessed directly from other files.

The __read_mostly annotation is a Linux kernel macro indicating that this variable is expected to be read often but seldom written. The kernel groups such variables into a dedicated data section so that frequently-read data shares cache lines, optimizing for this access pattern.
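
For reference, the annotation boils down to a section attribute; roughly (simplified from the kernel's include/linux/cache.h):

/* place the variable in a section reserved for rarely-written data */
#define __read_mostly __attribute__((__section__(".data..read_mostly")))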

The variable tsc_unstable is used to indicate whether the TSC is unstable. The comment above the variable declaration explains that the TSC can be unstable due to CPU frequency scaling (cpufreq) or due to unsynchronized TSCs across different cores or processors.

In terms of improvements, the code is quite clear and concise. However, it might be helpful to expand the comment to explain what the implications of an unstable TSC are, and how the tsc_unstable variable is used in the system’s logic.

Code snippet 1

static int __init tsc_setup(char *str)
{
	if (!strcmp(str, "reliable"))
		tsc_clocksource_reliable = 1;
	if (!strncmp(str, "noirqtime", 9))
		no_sched_irq_time = 1;
	if (!strcmp(str, "unstable"))
		mark_tsc_unstable("boot parameter");
	if (!strcmp(str, "nowatchdog"))
		no_tsc_watchdog = 1;
	return 1;
}

The __init keyword before the function name is a macro used in the Linux kernel to indicate that this function is used only at initialization time. The memory for this function is freed after the initialization is complete to conserve the kernel’s memory footprint.

Inside the function, there are several if statements that compare the input string str to different string literals using the strcmp and strncmp functions. strcmp compares two strings and returns 0 if they are identical, while strncmp compares up to a specified number of characters from two strings.

If str is “reliable”, it sets the tsc_clocksource_reliable variable to 1. If str starts with “noirqtime”, it sets the no_sched_irq_time variable to 1. If str is “unstable”, it calls the mark_tsc_unstable function with “boot parameter” as an argument. If str is “nowatchdog”, it sets the no_tsc_watchdog variable to 1.

Each of these variables or functions presumably controls a different aspect of the TSC’s behavior. For example, tsc_clocksource_reliable might indicate whether the TSC is a reliable source of time, and no_sched_irq_time might control whether the scheduler uses the TSC for interrupt timing.

In terms of improvements, the function is quite clear and concise. However, adding comments to explain the purpose of each variable and what each string argument represents would improve readability. It would also be beneficial to add error handling to account for the case where str does not match any of the expected values.

So adding tsc=reliable to the kernel cmdline makes the kernel treat the TSC clocksource as reliable.
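
To verify the effect inside the guest, the selected and available clocksources can be read back from sysfs. A minimal sketch in C (the sysfs paths are standard; with tsc=reliable working, "tsc" should show up in both files):

#include <stdio.h>

static void dump(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	dump("/sys/devices/system/clocksource/clocksource0/available_clocksource");
	dump("/sys/devices/system/clocksource/clocksource0/current_clocksource");
	return 0;
}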

Code snippet 2

static void tsc_cs_mark_unstable(struct clocksource *cs)
{
	if (tsc_unstable)
		return;

	tsc_unstable = 1;
	if (using_native_sched_clock())
		clear_sched_clock_stable();
	disable_sched_clock_irqtime();
	pr_info("Marking TSC unstable due to clocksource watchdog\n");
}

The function first checks if the tsc_unstable variable is already set to 1. If it is, the function immediately returns, as the TSC has already been marked as unstable. This is a common pattern in C programming known as a “guard clause”, which is used to exit a function early when certain conditions are met.

If tsc_unstable is not set to 1, the function proceeds to mark the TSC as unstable. It does this by setting tsc_unstable to 1, and then calling two functions: clear_sched_clock_stable and disable_sched_clock_irqtime. These functions presumably perform some cleanup or configuration changes related to the TSC becoming unstable.

Finally, the function logs a message using the pr_info macro, which is a kernel print function that outputs a message to the system log. The message indicates that the TSC has been marked as unstable due to the clocksource watchdog.

In terms of improvements, the function is quite clear and concise. However, adding comments to explain the purpose of the clear_sched_clock_stable and disable_sched_clock_irqtime functions would improve readability. It would also be beneficial to add error handling to account for any potential issues that could occur when these functions are called.

Code snippet 3

/*
 * Make an educated guess if the TSC is trustworthy and synchronized
 * over all CPUs.
 */
int unsynchronized_tsc(void)
{
	if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_unstable)
		return 1;

#ifdef CONFIG_SMP
	if (apic_is_clustered_box())
		return 1;
#endif

	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
		return 0;

	if (tsc_clocksource_reliable)
		return 0;
	/*
	 * Intel systems are normally all synchronized.
	 * Exceptions must mark TSC as unstable:
	 */
	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
		/* assume multi socket systems are not synchronized: */
		if (num_possible_cpus() > 1)
			return 1;
	}

	return 0;
}


The function begins by checking whether the boot CPU lacks the TSC feature or the TSC has been marked unstable. If either condition is true, the function immediately returns 1, indicating that the TSC is unsynchronized.

Next, if the system is configured for symmetric multiprocessing (SMP), the function checks if the Advanced Programmable Interrupt Controller (APIC) is clustered. If it is, the function returns 1, again indicating that the TSC is unsynchronized.

The function then checks if the boot CPU has the constant TSC feature or if the TSC clocksource is reliable. If either of these conditions is true, the function returns 0, indicating that the TSC is synchronized.

Finally, the function checks the CPU vendor. If the vendor is not Intel and the system has more than one possible CPU, the function returns 1, indicating that the TSC is unsynchronized. If none of the previous conditions are met, the function returns 0, indicating that the TSC is synchronized.
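
To make the Intel vs. AMD difference concrete, here is a hypothetical stand-alone re-implementation of that decision flow (simplified: the SMP/APIC-clustering check is omitted, and all inputs are plain parameters rather than kernel symbols):

#include <stdio.h>

static int unsynchronized_tsc_sketch(int has_tsc, int tsc_unstable,
				     int constant_tsc, int tsc_reliable,
				     int vendor_intel, int ncpus)
{
	if (!has_tsc || tsc_unstable)
		return 1;
	if (constant_tsc || tsc_reliable)
		return 0;
	/* non-Intel guests with more than one CPU are assumed unsynchronized */
	if (!vendor_intel && ncpus > 1)
		return 1;
	return 0;
}

int main(void)
{
	/* AMD guest, 4 vCPUs, no constant_tsc, no tsc=reliable -> 1 (unsynchronized) */
	printf("%d\n", unsynchronized_tsc_sketch(1, 0, 0, 0, 0, 4));
	/* the same guest booted with tsc=reliable -> 0 (treated as synchronized) */
	printf("%d\n", unsynchronized_tsc_sketch(1, 0, 0, 1, 0, 4));
	return 0;
}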

More practice

SystemTap

Because of the above issue, I spent some more time checking whether the TSC values used by the guest and by the host CPU actually differ, using SystemTap and a small userspace test program.

observe rdtsc

result

The TSC clock values differ between guest and host: both the mean and the standard deviation are different.

The values read in the guest OS are not stable compared with the host.

During live migration the TSC delta is smaller than usual (I think this is because live migration has downtime, so the TSC has to be adjusted to tolerate it).

So even from this small test, it is not a good idea to rely on the guest TSC, which is not as precise as it is on the host.

Data from my test

The first version of the test program measures the mean and standard deviation.

In the guest:

TSC mean: 2000170717.800000, TSC std dev: 255861.233545
Time mean: 1000101683.030000, Time std dev: 162898.956256
TSC mean: 2000158595.200000, TSC std dev: 340159.343486
Time mean: 1000092746.020000, Time std dev: 170311.019513
TSC mean: 2000116749.600000, TSC std dev: 96417.905701
Time mean: 1000076448.860000, Time std dev: 102460.953979

In the guest during live migration:

TSC mean: 1990107194.600000, TSC std dev: 71113321.983521
Time mean: 1000129868.770000, Time std dev: 340417.298586
TSC mean: 1993829457.200000, TSC std dev: 47439246.502752
Time mean: 1001882162.230000, Time std dev: 16929541.16989

Samples from host:

TSC mean: 2000087563.600000, TSC std dev: 16626.290598
Time mean: 1000065114.610000, Time std dev: 8341.142215
TSC mean: 2000084499.400000, TSC std dev: 4760.334824
Time mean: 1000063541.340000, Time std dev: 2447.439965
TSC mean: 2000083391.800000, TSC std dev: 11786.744451
Time mean: 1000062911.800000, Time std dev: 5922.538748

The TSC average is lower than normal during migration.

Changing the program to print individual samples makes the abnormal sample visible; sample 57 below counts only ~1.54e9 ticks over one second, i.e. roughly 232 ms of TSC time "missing" at 2 GHz:

Sample 54, TSC diff: 1998893560, Time diff: 1000069363 ns
Sample 55, TSC diff: 1998887140, Time diff: 1000065570 ns
Sample 56, TSC diff: 1999007020, Time diff: 1000130480 ns
Sample 57, TSC diff: 1535293100, Time diff: 1000090774 ns
Sample 58, TSC diff: 1998899520, Time diff: 1000073836 ns
Sample 59, TSC diff: 2000588260, Time diff: 1001300447 ns
Sample 60, TSC diff: 1998899540, Time diff: 1000072444 ns

Here is my test code:

/* build: gcc -O2 tsc_test.c -o tsc_test -lm */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>
#include <time.h>
#include <math.h>

#define SAMPLES 100

/* read the Time Stamp Counter */
uint64_t rdtsc(void)
{
	unsigned int lo, hi;
	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

double calc_std_dev(uint64_t *data, double mean)
{
	double sum = 0.0;
	for (int i = 0; i < SAMPLES; i++)
		sum += pow(data[i] - mean, 2);
	return sqrt(sum / SAMPLES);
}

int main(void)
{
	struct timespec start, end;
	uint64_t tsc_start, tsc_end;
	uint64_t tsc_diffs[SAMPLES], time_diffs[SAMPLES];
	double tsc_sum = 0.0, time_sum = 0.0;

	for (int i = 0; i < SAMPLES; i++) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		tsc_start = rdtsc();

		sleep(1);

		tsc_end = rdtsc();
		clock_gettime(CLOCK_MONOTONIC, &end);

		tsc_diffs[i] = tsc_end - tsc_start;
		time_diffs[i] = (end.tv_sec - start.tv_sec) * 1000000000ULL
				+ (end.tv_nsec - start.tv_nsec);

		printf("Sample %d, TSC diff: %" PRIu64 ", Time diff: %" PRIu64 " ns\n",
		       i, tsc_diffs[i], time_diffs[i]);

		tsc_sum += tsc_diffs[i];
		time_sum += time_diffs[i];
	}

	double tsc_mean = tsc_sum / SAMPLES;
	double time_mean = time_sum / SAMPLES;

	double tsc_std_dev = calc_std_dev(tsc_diffs, tsc_mean);
	double time_std_dev = calc_std_dev(time_diffs, time_mean);

	printf("TSC mean: %f, TSC std dev: %f\n", tsc_mean, tsc_std_dev);
	printf("Time mean: %f, Time std dev: %f\n", time_mean, time_std_dev);

	return 0;
}

TSC

Time Stamp Counter (TSC): All 80x86 microprocessors include a CLK input pin, which receives the clock signal of an external oscillator. Starting with the Pentium, 80x86 microprocessors sport a counter that is increased at each clock signal, and is accessible through the TSC register which can be read by means of the rdtsc assembly instruction. When using this register the kernel has to take into consideration the frequency of the clock signal: if, for instance, the clock ticks at 1 GHz, the TSC is increased once every nanosecond. Linux may take advantage of this register to get much more accurate time measurements.

https://access.redhat.com/solutions/18627
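
As a quick sanity check of the numbers above: at the 1999.999 MHz reported in the boot log, one tick is about 0.5 ns, so the roughly 2e9-tick samples over one-second sleeps are consistent. A small sketch of the conversion:

#include <stdio.h>
#include <stdint.h>

static double tsc_to_ns(uint64_t ticks, double tsc_hz)
{
	return (double)ticks * 1e9 / tsc_hz;
}

int main(void)
{
	double hz = 1999.999e6;	/* frequency from the boot log above */

	/* ~2e9 ticks, as in the one-second samples: prints ~1000000500 ns */
	printf("%.0f ns\n", tsc_to_ns(2000000000ULL, hz));
	return 0;
}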

The disk in the guest OS is removed during the kernel startup process.

Phenomenon: When creating a new virtual machine, after it enters the “running” state (libvirt reports running and the qemu process has started), a disk is attached. During kernel startup the disk (vdb) is recognized, and then qemu reports a device-removal event, which is fed back to libvirt. Libvirt updates the XML, causing an inconsistency between the disk state recorded in the zstack database and the XML on the host.

The main issue is that the libvirt attach-device interface returns success and the XML for the device is added, yet the device is then deleted according to the event fed back from qemu.

Important log information: Here, let’s first analyze the system logs in the guest OS:

Here, we notice the logs related to pciehp because this virtual machine is UEFI-booted, leading to numerous pcie-related logs (due to UEFI boot requiring the q35 machine type, which defaults to pcie devices).

The initially observed logs include an error log from pcieport:

pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]

Followed by the recognition of the vdb device:

virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (53.7 GB/50.0 GiB)

An external interrupt is sent to the virtual machine:

pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed

Subsequently, through ausearch, it is identified that libvirt received a device deletion event, leading to the removal of the mentioned device:

libvirt received device deleted event, removing the device

Based on these scenarios, we summarized the steps to reproduce the issue:

  1. Start the VM; act while the kernel is still booting.
  2. Attach the data disk.
  3. Check for inconsistencies between the XML and the database.
  4. By repeatedly testing VM boot plus data-disk attach, the issue can be reproduced.

Regarding the error logs mentioned above, the explanation is as follows:

  1. pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]:

    • According to https://access.redhat.com/solutions/3144711, this error may occur because virtualized environments can have more PCIe ports than a real physical environment. However, it does not have any actual impact.
  2. virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (53.7 GB/50.0 GiB):

    • This indicates that the virtio-blk disk has indeed been successfully loaded, as it has been recognized within the virtual machine.
  3. pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed:

    • “Attention button pressed” indicates that when resetting the PCIe slot, QEMU sends the corresponding interrupt. When the guest receives this interrupt, the corresponding handling logic prints this log.

As for the key QEMU code: searching the codebase confirms that QEMU raises the corresponding interrupt when resetting the PCIe slot, and the guest prints the log in its handling logic.

Code from QEMU's hw/pci/pcie.c:

pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA,
                             PCI_EXP_SLTSTA_EIS | /* on reset, the lock is released */
                             PCI_EXP_SLTSTA_CC |
                             PCI_EXP_SLTSTA_PDC |
                             PCI_EXP_SLTSTA_ABP);

which is used in hw/core/qdev.c:

	QLIST_FOREACH(bus, &dev->child_bus, sibling) {
		object_property_set_bool(OBJECT(bus), true, "realized",
					 &local_err);
		if (local_err != NULL) {
			goto child_realize_fail;
		}
	}
	if (dev->hotplugged) {
		device_reset(dev);
	}
	dev->pending_deleted_event = false;

	if (hotplug_ctrl) {
		hotplug_handler_plug(hotplug_ctrl, dev, &local_err);
		if (local_err != NULL) {
			goto child_realize_fail;
		}
	}

During the device hotplug process, there will be a reset action.

Kernel-related code:

From the kernel's drivers/pci/hotplug/pciehp_hpc.c:

static int pciehp_poll(void *data)
{
	struct controller *ctrl = data;

	schedule_timeout_idle(10 * HZ); /* start with 10 sec delay */

	while (!kthread_should_stop()) {
		/* poll for interrupt events or user requests */
		while (pciehp_isr(IRQ_NOTCONNECTED, ctrl) == IRQ_WAKE_THREAD ||
		       atomic_read(&ctrl->pending_events))
			pciehp_ist(IRQ_NOTCONNECTED, ctrl);

		if (pciehp_poll_time <= 0 || pciehp_poll_time > 60)
			pciehp_poll_time = 2; /* clamp to sane value */

		schedule_timeout_idle(pciehp_poll_time * HZ);
	}

	return 0;
}

/* and, from pciehp_ist(): */
/* Check Attention Button Pressed */
if (events & PCI_EXP_SLTSTA_ABP) {
	ctrl_info(ctrl, "Slot(%s): Attention button pressed\n",
		  slot_name(ctrl));
	pciehp_handle_button_press(ctrl);
}

This code represents a kernel function for polling PCIe Hot Plug events. It uses a kernel thread (kthread) to continuously poll for interrupt events or user requests related to PCIe Hot Plug. The function includes a timeout mechanism with an initial delay of 10 seconds and then repeats the polling process based on the specified polling time. The function stops when the kernel thread should stop (kthread_should_stop() returns true).

And pciehp_handle_button_press is implemented as follows:

void pciehp_handle_button_press(struct controller *ctrl)
{
	mutex_lock(&ctrl->state_lock);
	switch (ctrl->state) {
	case OFF_STATE:
	case ON_STATE:
		if (ctrl->state == ON_STATE) {
			ctrl->state = BLINKINGOFF_STATE;
			ctrl_info(ctrl, "Slot(%s): Powering off due to button press\n",
				  slot_name(ctrl));
		} else {
			ctrl->state = BLINKINGON_STATE;
			ctrl_info(ctrl, "Slot(%s) Powering on due to button press\n",
				  slot_name(ctrl));
		}
		/* blink power indicator and turn off attention */
		pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_BLINK,
				      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		schedule_delayed_work(&ctrl->button_work, 5 * HZ);
		break;
	case BLINKINGOFF_STATE:
	case BLINKINGON_STATE:
		/*
		 * Cancel if we are still blinking; this means that we
		 * press the attention again before the 5 sec. limit
		 * expires to cancel hot-add or hot-remove
		 */
		ctrl_info(ctrl, "Slot(%s): Button cancel\n", slot_name(ctrl));
		cancel_delayed_work(&ctrl->button_work);
		if (ctrl->state == BLINKINGOFF_STATE) {
			ctrl->state = ON_STATE;
			pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_ON,
					      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		} else {
			ctrl->state = OFF_STATE;
			pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF,
					      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		}
		ctrl_info(ctrl, "Slot(%s): Action canceled due to button press\n",
			  slot_name(ctrl));
		break;
	default:
		ctrl_err(ctrl, "Slot(%s): Ignoring invalid state %#x\n",
			 slot_name(ctrl), ctrl->state);
		break;
	}
	mutex_unlock(&ctrl->state_lock);
}

Both OFF_STATE and ON_STATE can be toggled to the other by the same button-press request.

Based on the test results, we preliminarily conclude that there is a race condition between hot-plug operations and kernel boot, leading to unexpected changes in the PCIe slot's state from off → on → off. (Note: the crucial point here is that pciehp_handle_button_press(ctrl) handles both the on and off scenarios.)

With reference to the above keywords, we identified a related Bugzilla entry for QEMU version 4.2 by searching for ‘qemu pci device kernel boot race condition’:

https://bugzilla.kernel.org/show_bug.cgi?id=211691

“The document mentions a virtio-net failover mechanism introduced by QEMU 4.2, addressing the issue of hot-plugging network cards failing during the VM startup phase. This problem arises from a race condition in the QEMU code that sets the PCIe slot’s state. The provided QEMU patch resolves the issue:

QEMU Patch Link

The title of this patch is: ‘pcie: don’t set link state active if the slot is empty.’

Upon reviewing its content, it appears that during PCIe initialization and the hot-plug phase, the ‘reset’ is called, potentially causing inconsistencies in the slot’s state. This patch addresses the problem by preventing the setting of the link state to active if the slot is empty, eliminating the observed issue.”
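
The core idea of that patch (paraphrased here, not the verbatim QEMU change) is to tie the reported data-link-layer-active bit to presence detection. A sketch using the standard pci_regs.h bits:

#include <stdint.h>
#include <linux/pci_regs.h>

/* Hypothetical helper: only report the PCIe data link layer as active
 * when a device is actually present in the slot. */
static void update_link_state(uint16_t sltsta, uint16_t *lnksta)
{
	if (sltsta & PCI_EXP_SLTSTA_PDS)	/* presence detect state */
		*lnksta |= PCI_EXP_LNKSTA_DLLLA;	/* link active */
	else					/* empty slot: keep link down */
		*lnksta &= ~PCI_EXP_LNKSTA_DLLLA;
}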

Tips:

Search for changes related to the virtual machine process using ausearch:

ausearch -m "VIRT_RESOURCE" -p 63259

Libvirt’s XML and QEMU event update mechanism: Details can be found in TIC-1360 - Cloud VM disk does not exist, capacity inconsistency between UI interface and underlying view (Closed).

Quick reference for PCIe events:

From linux pci_regs.h:

#define PCI_EXP_SLTCTL      24  /* Slot Control */
#define PCI_EXP_SLTCTL_ABPE 0x0001 /* Attention Button Pressed Enable */
#define PCI_EXP_SLTCTL_PFDE 0x0002 /* Power Fault Detected Enable */
#define PCI_EXP_SLTCTL_MRLSCE 0x0004 /* MRL Sensor Changed Enable */
#define PCI_EXP_SLTCTL_PDCE 0x0008 /* Presence Detect Changed Enable */
#define PCI_EXP_SLTCTL_CCIE 0x0010 /* Command Completed Interrupt Enable */
#define PCI_EXP_SLTCTL_HPIE 0x0020 /* Hot-Plug Interrupt Enable */
#define PCI_EXP_SLTCTL_AIC 0x00c0 /* Attention Indicator Control */
#define PCI_EXP_SLTCTL_ATTN_IND_SHIFT 6 /* Attention Indicator shift */
#define PCI_EXP_SLTCTL_ATTN_IND_ON 0x0040 /* Attention Indicator on */
#define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking */
#define PCI_EXP_SLTCTL_ATTN_IND_OFF 0x00c0 /* Attention Indicator off */
#define PCI_EXP_SLTCTL_PIC 0x0300 /* Power Indicator Control */
#define PCI_EXP_SLTCTL_PWR_IND_ON 0x0100 /* Power Indicator on */
#define PCI_EXP_SLTCTL_PWR_IND_BLINK 0x0200 /* Power Indicator blinking */
#define PCI_EXP_SLTCTL_PWR_IND_OFF 0x0300 /* Power Indicator off */
#define PCI_EXP_SLTCTL_PCC 0x0400 /* Power Controller Control */
#define PCI_EXP_SLTCTL_PWR_ON 0x0000 /* Power On */
#define PCI_EXP_SLTCTL_PWR_OFF 0x0400 /* Power Off */
#define PCI_EXP_SLTCTL_EIC 0x0800 /* Electromechanical Interlock Control */
#define PCI_EXP_SLTCTL_DLLSCE 0x1000 /* Data Link Layer State Changed Enable */

QEMU systemtap trace, refer to: QEMU Tracing Documentation

/usr/bin/qemu-trace-stap run /usr/libexec/qemu-kvm pci_cfg_write

Guest OS pciehp trace analysis:

[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_on: SLOTCTRL 6c write cmd 100
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: Slot(0-7): Powering off due to button press
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_blink: SLOTCTRL 6c write cmd 200
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

[Mon Dec 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_get_power_status: SLOTCTRL 6c value read 2f1
[Mon Dec 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_unconfigure_device: domain:bus:dev = 0000:08:00
[Mon Dec 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[Mon Dec 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_power_off_slot: SLOTCTRL 6c write cmd 40

[Mon Dec 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0018 from Slot Status
[Mon Dec 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_off: SLOTCTRL 6c write cmd 300

[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_on: SLOTCTRL 6c write cmd 100

cmd 0x0100

#define  PCI_EXP_SLTCTL_PWR_IND_ON     0x0100 /* Power Indicator on */

[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

cmd 0x00c0
0000 0000 1100 0000 (both attention-indicator bits set)

#define  PCI_EXP_SLTCTL_ATTN_IND_OFF   0x00c0 /* Attention Indicator off */

[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_blink: SLOTCTRL 6c write cmd 200

cmd 0x0200

#define  PCI_EXP_SLTCTL_PWR_IND_BLINK  0x0200 /* Power Indicator blinking */

[Mon Dec 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

cmd 0x00c0
0000 0000 1100 0000 (both attention-indicator bits set)

#define  PCI_EXP_SLTCTL_ATTN_IND_OFF   0x00c0 /* Attention Indicator off */

[Mon Dec 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400

cmd 0x0400

#define  PCI_EXP_SLTCTL_PWR_OFF        0x0400 /* Power Off */

[Mon Dec 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_off: SLOTCTRL 6c write cmd 300

cmd 0x0300

#define  PCI_EXP_SLTCTL_PWR_IND_OFF    0x0300 /* Power Indicator off */
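
To double-check decodes like the ones above, a small helper can map a SLOTCTRL command value to its indicator and power-controller fields (a sketch reusing the pci_regs.h masks quoted earlier; the kernel only writes the field it is changing, so untouched fields simply decode as nothing here):

#include <stdio.h>
#include <stdint.h>
#include <linux/pci_regs.h>

static void decode_sltctl(uint16_t cmd)
{
	uint16_t pic = cmd & PCI_EXP_SLTCTL_PIC;	/* power indicator bits */
	uint16_t aic = cmd & PCI_EXP_SLTCTL_AIC;	/* attention indicator bits */

	printf("cmd 0x%04x:", cmd);
	if (pic == PCI_EXP_SLTCTL_PWR_IND_ON)
		printf(" power-ind on");
	if (pic == PCI_EXP_SLTCTL_PWR_IND_BLINK)
		printf(" power-ind blink");
	if (pic == PCI_EXP_SLTCTL_PWR_IND_OFF)
		printf(" power-ind off");
	if (aic == PCI_EXP_SLTCTL_ATTN_IND_ON)
		printf(" attn-ind on");
	if (aic == PCI_EXP_SLTCTL_ATTN_IND_BLINK)
		printf(" attn-ind blink");
	if (aic == PCI_EXP_SLTCTL_ATTN_IND_OFF)
		printf(" attn-ind off");
	if (cmd & PCI_EXP_SLTCTL_PCC)
		printf(" power-controller off");
	printf("\n");
}

int main(void)
{
	uint16_t cmds[] = { 0x0100, 0x00c0, 0x0200, 0x0400, 0x0300 };

	for (unsigned int i = 0; i < sizeof(cmds) / sizeof(cmds[0]); i++)
		decode_sltctl(cmds[i]);
	return 0;
}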

Understanding CPU Topology for Improved Performance

The physical layout of CPU cores in a system is known as CPU topology. Understanding CPU topology can significantly impact system performance, since the topology determines how work is spread across cores and how cores share caches and memory.

What is CPU Topology?

CPU topology comprises three primary levels:

  • Socket: A physical connector that holds a CPU. A system can have multiple sockets, each of which can hold multiple cores.
  • Core: A single processing unit within a CPU that can run multiple threads simultaneously.
  • Thread: A single flow of execution within a core.

The CPU topology can be described using a tree-like structure, with the socket level at the top and the thread level at the bottom. Cores in a socket communicate over a shared interconnect and typically share a last-level cache, while the hardware threads in a core share that core's execution resources and private caches.

Importance of CPU Topology

Understanding CPU topology is crucial for improving system performance. The topology can be used to optimize the performance of a system by assigning threads to cores in a way that minimizes the amount of communication between cores. This can enhance the performance of applications that are heavily multithreaded.

Additionally, the CPU topology can be used to troubleshoot performance issues. For example, if an application is running slowly, the CPU topology can be used to identify which cores are being used the most. This information can help identify the source of the performance problem and take appropriate steps to improve it.

Here are some benefits of understanding CPU topology:

  • It helps to optimize system performance by assigning tasks to the most suitable cores.
  • It helps to troubleshoot performance issues by identifying heavily used cores.
  • It helps to understand how the system will scale as more cores are added.
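
As a concrete example of assigning work to specific cores, a thread can be pinned with the Linux CPU-affinity API. A minimal sketch (CPU 2 is an arbitrary choice; build with -lpthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
	(void)arg;
	printf("worker running on CPU %d\n", sched_getcpu());
	return NULL;
}

int main(void)
{
	pthread_t t;
	pthread_attr_t attr;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* pin the worker to CPU 2 */

	pthread_attr_init(&attr);
	pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

	pthread_create(&t, &attr, worker, NULL);
	pthread_join(t, NULL);

	pthread_attr_destroy(&attr);
	return 0;
}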

Tools to Display CPU Topology

There are several tools available to display CPU topology, and one of the most commonly used tools is lscpu. Here is an example of using lscpu to display CPU topology:

[root@172-20-1-220 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2100.000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
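
The per-CPU IDs that lscpu aggregates can also be read directly from sysfs. A minimal sketch (standard Linux paths; only the first four CPUs, for brevity):

#include <stdio.h>

int main(void)
{
	char path[128], buf[64];

	for (int cpu = 0; cpu < 4; cpu++) {
		const char *files[] = { "physical_package_id", "core_id" };

		for (int i = 0; i < 2; i++) {
			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/topology/%s",
				 cpu, files[i]);
			FILE *f = fopen(path, "r");
			if (f && fgets(buf, sizeof(buf), f))
				printf("cpu%d %s = %s", cpu, files[i], buf);
			if (f)
				fclose(f);
		}
	}
	return 0;
}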

hwloc-ls

[root@172-20-1-220 ~]# hwloc-ls
Machine (767GB total)
NUMANode L#0 (P#0 383GB)
Package L#0 + L3 L#0 (28MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#40)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#41)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#42)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#43)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#44)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#45)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#46)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#47)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#48)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#49)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#50)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#51)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#52)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#53)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#54)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#55)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#56)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#57)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#58)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#59)
HostBridge L#0
PCI 8086:a1d2
PCI 8086:a182
PCIBridge
PCIBridge
PCI 1a03:2000
GPU L#0 "card0"
GPU L#1 "controlD64"
HostBridge L#3
PCIBridge
PCI 1000:0097
Block(Disk) L#2 "sda"
NUMANode L#1 (P#1 384GB)
Package L#1 + L3 L#1 (28MB)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#60)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#61)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#62)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#63)
L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#64)
L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#65)
L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#66)
L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#67)
L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
PU L#56 (P#28)
PU L#57 (P#68)
L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
PU L#58 (P#29)
PU L#59 (P#69)
L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
PU L#60 (P#30)
PU L#61 (P#70)
L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
PU L#62 (P#31)
PU L#63 (P#71)
L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
PU L#64 (P#32)
PU L#65 (P#72)
L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
PU L#66 (P#33)
PU L#67 (P#73)
L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
PU L#68 (P#34)
PU L#69 (P#74)
L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
PU L#70 (P#35)
PU L#71 (P#75)
L2 L#36 (1024KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36
PU L#72 (P#36)
PU L#73 (P#76)
L2 L#37 (1024KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37
PU L#74 (P#37)
PU L#75 (P#77)
L2 L#38 (1024KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
PU L#76 (P#38)
PU L#77 (P#78)
L2 L#39 (1024KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
PU L#78 (P#39)
PU L#79 (P#79)
HostBridge L#5
PCIBridge
PCI 8086:1521
Net L#3 "enp175s0f0"
PCI 8086:1521
Net L#4 "enp175s0f1"
PCI 8086:1521
Net L#5 "enp175s0f2"
PCI 8086:1521
Net L#6 "enp175s0f3"
PCIBridge
PCI 8086:10fb
Net L#7 "enp176s0f0"
PCI 8086:10fb
Net L#8 "enp176s0f1"
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)
Misc(MemoryModule)

Virtual Machines and CPU Topology

Virtual machines (VMs) are software programs that create an isolated environment for running operating systems and applications. VMs are often used to run various operating systems on the same physical machine or to run applications that require more resources than are available on the host machine.

When a VM is created, the hypervisor, which manages the VMs, typically exposes each vCPU to the guest as a single-threaded core. Assigning multiple threads per core to a VM can lead to performance issues: threads share the same resources on a core and can compete for them, leading to contention and slowdowns, and threads may interfere with each other, causing further slowdowns.

To optimize VM performance, it's generally best to expose a single thread per core to a VM. However, there are exceptions to this rule. For example, if a VM is running an application that is specifically designed to take advantage of multiple threads, it may be beneficial to assign multiple threads to the VM.

To take advantage of multiple threads in a virtual machine, it’s essential to use a hypervisor that supports thread pinning, an operating system that supports thread scheduling, and an application that is designed to take advantage of multiple threads. Multithreaded applications such as web servers, database servers, and media transcoders are good examples of applications that can take advantage of multiple threads.

Why threads per core in a CPU topology is always 1 or 2

There are two main reasons why the number of threads per core in a CPU topology is usually limited to 1 or 2:

  • Physical constraints: Without SMT, a core runs a single thread at a time. When two hardware threads run on the same core, they compete for the same execution resources, which can degrade per-thread performance.
  • Scheduling overhead: Scheduling threads on different cores can be expensive, as the operating system has to switch between threads, and this causes context switches. Context switches are costly: the operating system must save the state of the current thread and restore the state of the next.

In some cases, having more than two threads per core may be beneficial. For instance, heavily multithreaded applications may take advantage of the extra threads. However, in most cases, the costs of having more than two threads per core outweigh the benefits.

There are a few exceptions to the rule that the number of threads in a CPU topology is usually limited to 1 or 2. For example, some CPUs support hyper-threading, which allows a single core to run two threads simultaneously. However, hyper-threading is not always a good idea, as it can sometimes lead to performance degradation.

Overall, the number of threads in a CPU topology is usually limited to 1 or 2 due to physical constraints and scheduling overhead. While there are exceptions, in most cases, the costs of having more than two threads per core outweigh the benefits.

Sockets and cores with performance

Sockets and cores do have an impact on performance.

  • Sockets: A socket is a physical connector that holds a CPU. A system can have multiple sockets, each of which can hold multiple cores. The more sockets a system has, the more cores it can have, which can lead to better performance.
  • Cores: A core is a single processing unit within a CPU. A core can run multiple threads simultaneously. The more cores a system has, the more threads it can run, which can also lead to better performance.

However, it’s important to note that the number of sockets and cores is not the only factor that affects performance. Other factors, such as the clock speed of the CPU, the amount of cache memory, and the type of memory, can also have a significant impact.

In general, systems with more sockets and cores will have better performance than systems with fewer sockets and cores. However, it’s important to choose a system that has the right balance of sockets, cores, clock speed, cache memory, and memory type for your needs.

Here are some examples of how sockets and cores can impact performance:

  • A system with two sockets and four cores will have better performance than a system with one socket and two cores. This is because the system with two sockets can run more threads simultaneously.
  • A system with a higher clock speed will have better performance than a system with a lower clock speed. This is because the system with a higher clock speed can execute instructions faster.
  • A system with more cache memory will have better performance than a system with less cache memory. This is because the system with more cache memory can store more data in memory, which reduces the number of times the CPU has to access slower memory.
  • A system with faster memory will have better performance than a system with slower memory. This is because the system with faster memory can transfer data to the CPU faster, which reduces the amount of time the CPU has to wait for data.

Why does AWS only offer single-socket instances?

There are a few reasons why cloud providers like AWS do not offer multi-socket instances.

  • Cost: Multi-socket instances are more expensive than single-socket instances. This is because they require more hardware, such as more CPUs and more memory.
  • Complexity: Multi-socket instances are more complex to manage than single-socket instances. This is because they have more components, such as more CPUs, more memory, and more storage.
  • Performance: Multi-socket instances do not always offer better performance than single-socket instances. This is because the performance of a multi-socket instance can be limited by the speed of the interconnect between the sockets.

For these reasons, cloud providers like AWS choose to offer single-socket instances. Single-socket instances are less expensive, easier to manage, and offer the same or better performance than multi-socket instances.

However, there are some cases where multi-socket instances may be a good choice. For example, if you need a lot of CPU power, or if you need to run applications that are not well-optimized for multi-threading, then a multi-socket instance may be a good option.

If you are considering using a multi-socket instance, it is important to weigh the costs and benefits carefully. You should also make sure that your applications are well-optimized for multi-threading.

Understanding the virtio memory balloon

Introduction

Virtio memory ballooning is a technique that adjusts memory allocation in virtualized environments. The hypervisor can add or remove memory from a virtual machine based on demand, using a balloon driver in the guest operating system. When the host needs memory, the balloon inflates and the guest operating system releases memory; when pressure drops, the balloon deflates and the guest can use more memory.

This technique optimizes memory usage and reduces the risk of memory exhaustion, making it useful in cloud computing environments. However, it also has trade-offs to consider. Inflating the balloon can cause performance issues if the guest operating system can't release memory quickly enough, and it may struggle under high memory pressure. Understanding these limitations is key to making informed decisions about using virtio memory ballooning.

Overview of Virtio Memory Ballooning

Based on wiki memory ballooning is a technique used to eliminate the need to overprovision host memory used by a virtual machine. To implement it, the virtual machine’s kernel implements a “balloon driver” which allocates unused memory within the VM’s address space into a reserved memory pool (the “balloon”) so that it is unavailable to other processes on the VM. However, rather than being reserved for other uses within the VM, the physical memory mapped to those pages within the VM is actually unmapped from the VM by the host operating system’s hypervisor, making it available for other uses by the host machine. Depending on the amount of memory required by the VM, the size of the “balloon” may be increased or decreased dynamically, mapping and unmapping physical memory as required by the VM.

According to the Virtio v1.2 specification, the memory balloon device follows the Virtio protocol, including:

Feature bits

  • VIRTIO_BALLOON_F_MUST_TELL_HOST (0): Host must be notified before balloon pages are used.
  • VIRTIO_BALLOON_F_STATS_VQ (1): A virtqueue is present for reporting guest memory statistics.
  • VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2): Balloon deflates when guest is out of memory.
  • VIRTIO_BALLOON_F_FREE_PAGE_HINT (3): The device supports free page hinting. The configuration field free_page_hint_cmd_id is valid.
  • VIRTIO_BALLOON_F_PAGE_POISON (4): The driver will immediately write poison_val to pages after deflating them. The configuration field poison_val is valid.
  • VIRTIO_BALLOON_F_PAGE_REPORTING (5): The device supports free page reporting. A virtqueue is present for reporting free guest memory.

Memory Statistics Tags

  • VIRTIO_BALLOON_S_SWAP_IN (0): Amount of memory swapped in (in bytes).
  • VIRTIO_BALLOON_S_SWAP_OUT (1): Amount of memory swapped out to disk (in bytes).
  • VIRTIO_BALLOON_S_MAJFLT (2): Number of major page faults that have occurred.
  • VIRTIO_BALLOON_S_MINFLT (3): Number of minor page faults that have occurred.
  • VIRTIO_BALLOON_S_MEMFREE (4): Amount of memory not being used (in bytes).
  • VIRTIO_BALLOON_S_MEMTOT (5): Total amount of memory available (in bytes).
  • VIRTIO_BALLOON_S_AVAIL (6): Estimate of available memory (in bytes) for starting new applications.
  • VIRTIO_BALLOON_S_CACHES (7): Amount of memory (in bytes) that can be quickly reclaimed without I/O.
  • VIRTIO_BALLOON_S_HTLB_PGALLOC (8): Number of successful hugetlb page allocations in the guest.
  • VIRTIO_BALLOON_S_HTLB_PGFAIL (9): Number of failed hugetlb page allocations in the guest.
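
For reference, each statistic travels over the stats virtqueue as a little-endian tag/value pair. A sketch of the layout described by the spec (only the first two tag defines repeated here):

#include <stdint.h>

#define VIRTIO_BALLOON_S_SWAP_IN  0
#define VIRTIO_BALLOON_S_SWAP_OUT 1
/* ... remaining tags as listed above ... */

struct virtio_balloon_stat {
	uint16_t tag;	/* one of the VIRTIO_BALLOON_S_* tags */
	uint64_t val;	/* value for that tag, in the units listed above */
} __attribute__((packed));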

Free page hinting

Free page hinting is used during migration to determine which pages within the guest are not being used. These pages are then skipped over while migrating the guest. The device will indicate it is ready to start hinting by setting the free_page_hint_cmd_id to one of the non-reserved values that can be used as a command ID. The driver is notified of the following reserved values:

  • VIRTIO_BALLOON_CMD_ID_STOP (0): any previously supplied command ID is invalid. The driver should stop hinting free pages until a new command ID is supplied, but should not release any hinted pages for use by the guest.
  • VIRTIO_BALLOON_CMD_ID_DONE (1): any previously supplied command ID is invalid. The driver should stop hinting free pages and release all hinted pages for use by the guest.

When a hint is provided, it indicates that the data contained in the given page is no longer needed and can be discarded. If the driver writes to the page, this overrides the hint and the data will be retained. Any stale pages that have not been written to since the page was hinted may lose their content. If read, the contents of such pages will be uninitialized memory.

Page Poison

Page Poison is a feature that lets the host know when the guest is initializing free pages with poison_val. When enabled, the driver immediately writes to pages after deflating and pages reported as free will retain poison_val. If the guest is not initializing freed pages, the driver should reject the VIRTIO_BALLOON_F_PAGE_POISON feature. If the feature has been negotiated, the driver will place the initialization value into the poison_val configuration field data.

Free Page Reporting

Free Page Reporting is a method similar to balloon inflation, but without a deflation queue. Reported free pages can be reused by the driver after the request is acknowledged, without notifying the device.

The driver initiates reporting by gathering free pages into a scatter-gather list, which is then added to the reporting_vq. The exact timing and selection of free pages is determined by the driver.

Once the driver has enough pages available, it sends a reporting request to the device, which acknowledges the request using the reporting_vq descriptor. After acknowledgement, the driver can reuse the reported free pages by returning them to the free page lists in the guest operating system.

The driver can continue to gather and report free pages until it has reached the desired number of pages.

Comparison to Other Memory Management Techniques

Virtio memory ballooning is just one of several memory management techniques available in virtualized environments. Here are some other techniques that are commonly used:

Overcommitment

Overcommitment is a technique that allows virtual machines to use more memory than physically available. This is useful when memory demand is highly variable. However, overcommitment can cause performance issues if the host system runs out of memory and needs to swap memory pages to disk.

KVM hypervisor automatically overcommits CPUs and memory. This means that more virtualized CPUs and memory can be allocated to virtual machines than there are physical resources. This saves system resources, resulting in less power, cooling, and investment in server hardware while still allowing under-utilized virtualized servers or desktops to run on fewer hosts.

Memory Compression

Memory compression compresses memory pages to free up memory in high demand situations. However, this technique can lead to performance problems if the compression algorithm is slow or if memory demand is high.

Zram, zcache, and zswap advance in-kernel compression in different ways. Zram and zcache, both found in the staging tree, have improved in design and implementation, but they are not stable enough for promotion into the core kernel. Zswap proposes a simplified frontswap-only fork of zcache for direct merging into the MM subsystem. While simpler than zcache, zswap is entirely dependent on still-in-staging zsmalloc and has limitations. If zswap is merged, it remains to be seen if it will ever be extended adequately.

Hypervisor Swapping

Hypervisor swapping is a technique in which the hypervisor swaps memory pages between the host and guest operating systems in order to optimize memory usage. This can be useful in situations where there is a high demand for memory or when the host system is running low on memory. However, hypervisor swapping can also lead to performance issues if the guest operating system can’t release memory quickly enough.

Compared to these techniques, virtio memory ballooning has some unique advantages. It optimizes memory usage within the guest operating system itself, reducing the risk of memory exhaustion and improving performance. However, it also has some trade-offs to consider, such as the potential for performance issues if the guest operating system can’t release memory quickly enough.

How to use Virtio Memory Ballooning on Linux

Environment

On the host side we use libvirt to set up a VM.

The memory tag means: The maximum allocation of memory for the guest at boot time.

The currentMemory tag means: The actual allocation of memory for the guest.

<maxMemory slots='16' unit='KiB'>1524288</maxMemory>
<memory unit='KiB'>8388608</memory>
<currentMemory unit='KiB'>8388608</currentMemory>

And add a memballoon virtio device in the VM XML:

<memballoon model='virtio'/>

To use virtio memory ballooning in a Linux guest, you'll need to ensure that your kernel has support for the virtio_balloon driver. You can check for this by running the following command:

lsmod | grep virtio_balloon

If the virtio_balloon driver is not listed, you may need to load it manually by running the following command:

modprobe virtio_balloon

We can run some tests to confirm the balloon driver is working.

Basic usage

Explanation from libvirt/virsh.rst at master · libvirt/libvirt:

# virsh dommemstat YOUR_VM_NAME          
actual 8388608 # Current balloon value (in KB)
swap_in 7011156 # The amount of data read from swap space (in kB)
swap_out 664776 # The amount of memory written out to swap space (in kB)
major_fault 234565 # The number of page faults where disk IO was required
minor_fault 84722778 # The number of other page faults
unused 6291308 # The amount of memory left unused by the system (in kB)
available 8388044 # The amount of usable memory as seen by the domain (in kB)
usable 6349618 # The amount of memory which can be reclaimed by balloon without causing host swapping (in KB) *
last_update 1682566755 # Timestamp of the last update of statistics (in seconds)
disk_caches 116620 # The amount of memory that can be reclaimed without additional I/O, typically disk caches (in KiB)
rss 8529188 # Resident Set Size of the running domain's process (in kB)

With the memory balloon we can get details about guest usage, which match the Memory Statistics Tags mentioned above.

And from virsh dominfo we can see the memory usage directly:

# virsh dominfo YOUR_VM_NAME
Id: 7
Name: 1970b0ef25e44adc834767fe81f155d5
UUID: 1970b0ef-25e4-4adc-8347-67fe81f155d5
OS Type: hvm
State: running
CPU(s): 4
CPU time: 214084.1s
Max memory: 8388608 KiB
Used memory: 8388608 KiB
Persistent: yes
Autostart: disable
Managed save: no
Security model: none
Security DOI: 0

Shrinking memory

First, check the unused memory of your guest:

# virsh dommemstat YOUR_VM_NAME | grep unused
unused 2868704

Then we try to set the memory to a size we want. Simply:

actual - unused = 8388608 - 2868704 = 5519904 (KiB)

Then we use virsh setmem:

# virsh setmem YOUR_VM_NAME --size 5519904KiB --current

Check that the shrink takes effect:

# virsh dommemstat YOUR_VM_NAME
actual 5519904
swap_in 0
swap_out 2592
major_fault 6236
minor_fault 181380396
unused 140212
available 5139400
usable 3424496
last_update 1682567978
rss 5583008

actual changed to 5519904, and we check inside the guest on the other side:

# free -hm
              total        used        free      shared  buff/cache   available
Mem:           4.9G        862M        134M        299M        3.9G        3.3G
Swap:          7.9G        3.5M        7.9G

The guest's total memory changed, and it is even smaller than 5519904 KiB ≈ 5.26G: about 7% of the memory is missing, and the guest total is almost the same as the available value 5139400.

Expanding memory

To increase the memory allocation of a virtual machine using virtio memory ballooning, you can use the virsh setmem command. For example, to increase the memory allocation to 8GB, you would run:

virsh setmem YOUR_VM_NAME --size 8G --current

This will increase the memory allocation of the virtual machine to 8GB. However, it’s important to note that the guest operating system must have support for virtio memory ballooning in order to take advantage of this feature.

In addition, it’s important to monitor the memory usage of virtual machines to ensure that they have enough memory to operate effectively. This can be done using tools like virsh dommemstat to monitor memory usage statistics.

# virsh dommemstat YOUR_VM_NAME
actual 8388608
swap_in 0
swap_out 2592
major_fault 6236
minor_fault 181827159
unused 3008116
available 8008104
usable 6293140
last_update 1682571788
rss 7545844

Inside the guest:

# free -hm
              total        used        free      shared  buff/cache   available
Mem:           7.6G        862M        2.9G        299M        3.9G        6.0G
Swap:          7.9G        3.5M        7.9G

With 8 GB of memory on the QEMU side, the guest has 7.6G total. There is still about 5% missing.
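
The same shrink flow can be driven programmatically through the libvirt C API instead of virsh. A sketch with error handling omitted ("YOUR_VM_NAME" is a placeholder as above; build with -lvirt):

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
	virConnectPtr conn = virConnectOpen("qemu:///system");
	virDomainPtr dom = virDomainLookupByName(conn, "YOUR_VM_NAME");
	virDomainMemoryStatStruct stats[VIR_DOMAIN_MEMORY_STAT_NR];
	unsigned long long actual = 0, unused = 0;

	int n = virDomainMemoryStats(dom, stats, VIR_DOMAIN_MEMORY_STAT_NR, 0);
	for (int i = 0; i < n; i++) {
		if (stats[i].tag == VIR_DOMAIN_MEMORY_STAT_ACTUAL_BALLOON)
			actual = stats[i].val;	/* KiB */
		if (stats[i].tag == VIR_DOMAIN_MEMORY_STAT_UNUSED)
			unused = stats[i].val;	/* KiB */
	}

	printf("actual %llu KiB, unused %llu KiB\n", actual, unused);
	if (actual && unused)	/* same arithmetic as the virsh example above */
		virDomainSetMemoryFlags(dom, (unsigned long)(actual - unused),
					VIR_DOMAIN_AFFECT_LIVE);

	virDomainFree(dom);
	virConnectClose(conn);
	return 0;
}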

Industry Practices

Proxmox

Dynamic memory management shows that KSM and the memory balloon work on Windows and Linux guests: a memory range (min and max) is required, and the guest's memory is dynamically changed within that range to implement memory ballooning.

Google cloud

Dynamic resource management: Memory ballooning is an interface mechanism between host and guest to dynamically adjust the size of the reserved memory for the guest. A virtio memory balloon device is used to implement memory ballooning. Through the virtio memory balloon device, a host can explicitly ask a guest to yield a certain amount of free memory pages (also called memory balloon inflation), and reclaim the memory so that the host can use the free memory for other VMs. Likewise, the virtio memory balloon device can return memory pages back to the guest by deflating the memory balloon.

Compute Engine E2 VM instances that are based on a public image have a virtio memory balloon device, which monitors the guest operating system's memory use. The guest operating system communicates its available memory to the host system. The host reallocates any unused memory to other processes on demand, thereby using memory more effectively. Compute Engine collects and uses this data to make more accurate rightsizing recommendations.

In Linux kernels before 5.2, the Linux memory system sometimes mistakenly prevents large allocations when the balloon device is present. This is rarely an issue in practice, but we recommend changing the virtual memory overcommit_memory setting to 1 to prevent the issue from occurring. This change is already made by default in all Google-provided images published since February 9, 2021.

To fix the setting, use the following command to change the value from 0 to 1:

sudo /sbin/sysctl -w vm.overcommit_memory=1

To persist this change across reboots, add the following to your /etc/sysctl.conf file:

vm.overcommit_memory=1

Nutanix

Squeeze even more memory of your HCI

Memory overcommit allows more memory to be assigned to VMs than is physically present in the server hardware. Unused memory allocated to a VM can be reclaimed by the hypervisor and made available to other VMs on the host. AHV adjusts memory usage for each VM according to its usage, allowing the host to use excess memory to satisfy the requirements of other VMs. This reduces hardware costs for large deployments or increases the utilization of an existing environment that can’t be immediately expanded with new nodes. VMs without memory overcommit will operate with their pre-assigned memory, and can coexist with overcommit enabled VMs. Nutanix uses a multi-tier approach combining ballooning and hypervisor-level swap to optimize performance. Metrics are presented to the administrator in Prism Central to indicate the gains achieved through overcommit and its impact on VM performance. Memory overcommit may not be appropriate for performance-sensitive workloads due to its dynamic nature.

Limits of Memory Overcommit

Memory overcommit has the following limitations:

  • You can enable or disable Memory Overcommit only while the VM is powered off.
  • Power off the VM enabled with memory overcommit before you change the memory allocation for the VM.
    For example, you cannot update the memory of a VM that is enabled with memory overcommit when it is still running. The system displays the following alert: InvalidVmState: Cannot complete request in state on.
  • Memory overcommit is not supported with VMs that use GPU passthrough and vNUMA.
    For example, you cannot update a VM to a vNUMA VM when it is enabled with memory overcommit. The system displays the following alert: InvalidArgument: Cannot use memory overcommit feature for a vNUMA VM error.
  • The memory overcommit feature can reduce a VM's performance and make its performance less predictable.
    For example, migrating a VM enabled with Memory Overcommit takes longer than migrating a VM not enabled with Memory Overcommit.
  • There may be a temporary spike in the aggregate memory usage in the cluster during the migration of a VM enabled with Memory Overcommit from one node to another.
    For example, when you migrate a VM from Node A to Node B, the total memory used in the cluster during migration is greater than the memory usage before the migration.
    The memory usage of the cluster eventually drops back to pre-migration levels when the cluster reclaims the memory for other VM operations.
  • Using Memory Overcommit heavily can cause a spike in the disk space utilization in the cluster. This spike is caused because the Host Swap uses some of the disk space in the cluster.
    If the VMs do not have a swap disk, then in case of memory pressure, AHV uses space from the swap disk created on ADSF to provide memory to the VM. This can lead to an increase in disk space consumption on the cluster.
  • All DR operations except Cross Cluster Live Migration (CCLM) are supported
    On the destination side, if a VM fails when you enable Memory Overcommit, the failed VM fails over (creating the VM on the remote site) as a fixed size VM. You can enable Memory Overcommit on this VM after the failover is complete.

Limitations and Challenges

The guest must support virtio memory ballooning; if the balloon driver is not available, there is no effective way to reclaim guest memory.

| Distribution | No Balloon Driver | Partially Supported | Fully Supported |
| --- | --- | --- | --- |
| CentOS | 6.1, 6.2 | 6.3–6.9, 7.1, 7.2 | 7.3–7.7, 8.0–8.2 |
| Oracle | 7.3 | 7.4, 7.5 | 7.6, 7.7 |
| Ubuntu | See note. | 12.04 | 14.04 and newer |

Not all situations are suitable for memory ballooning: when guest memory usage changes rapidly, the frequent inflation and deflation of the balloon can itself be harmful.
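When experimenting with ballooning behavior, the balloon can also be driven by hand through libvirt. A minimal sketch, assuming a guest with the virtio balloon driver loaded (the domain name demo-vm is a placeholder):

# inflate the balloon: shrink the guest's usable memory to 2 GiB (value is in KiB)
virsh setmem demo-vm 2097152 --live

# read the balloon statistics reported by the guest driver
virsh dommemstat demo-vm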

Future Development

https://www.linux-kvm.org/page/Projects/auto-ballooning The auto-ballooning project was initiated in 2013. It requires changes to both the hypervisor and the Linux kernel, and it has not been upstreamed yet.

Real-World Implementation Case Study

Conclusion

Virtualization is important in modern computing for flexible and efficient resource allocation. Memory management is challenging in virtualized environments when multiple virtual machines run on a single physical server. Virtio memory ballooning optimizes memory usage by dynamically adjusting guest memory reservation. It improves performance and reduces the risk of memory exhaustion. This article explains how to use virtio memory ballooning on Linux, compares it to other memory management techniques, and discusses industry practices, limitations, and future developments.


Qemu Colo Details

qemu quorum block filter

Based on the code design of blkverify.c and blkmirror.c, its main purpose is to mirror write requests to all the qcow images attached to the quorum. For reads, it checks whether the number of occurrences of a given qiov version reaches the configured threshold, then returns the result with the highest number of occurrences; if that count is below the threshold, quorum reports an error and the read returns -EIO.

The main use of this feature is for people who use NFS devices affected by bitflip errors.

If you set the read-pattern to FIFO and the threshold to 1, you can construct a scenario where only the first disk is read.
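As a rough command-line sketch (the image file names are placeholders, and option spellings should be checked against your QEMU version), a two-child quorum with a FIFO read pattern and a threshold of 1 looks like:

-drive driver=quorum,read-pattern=fifo,vote-threshold=1,\
children.0.file.filename=primary.qcow2,children.0.driver=qcow2,\
children.1.file.filename=secondary.qcow2,children.1.driver=qcow2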

block-replication

The micro-checkpoint and COLO mechanisms mentioned in the introduction to the QEMU FT solution continuously create checkpoints; at the moment a checkpoint completes, the PVM and SVM are in identical states, but they diverge again until the next checkpoint.

To ensure consistency, the SVM changes need to be cached and discarded at the next checkpoint. To reduce the stress of network transfers between checkpoints, changes on the PVM disk are synchronized asynchronously to the SVM node.

For example, the first time VM1 takes a checkpoint, its state is recorded as C1, and VM2's state is then also C1. From this point VM2's disk changes are cached, and VM1's changes are written to VM2's node through this mechanism. How should an error at this point be handled?

Consider the simplest case, where VM1 dies. Because the next checkpoint has not yet been executed, VM2 has kept running from state C1 for some time with its disk changes cached. It is then only necessary to flush the cached data to VM2's disk for VM2 to continue running as a single point, or to wait for FT to be rebuilt. This is why SVM disk changes must be cached (the data consists of two copies: the cache needed to restore VM2 to the last checkpoint, and the cache of changes VM2 made after C1).

The following is the structure of block-replication:

  1. The block device on the primary node mounts two sub-devices via quorum, providing backup from the primary node to the secondary host. The read pattern (FIFO) is extended to cover the case where the primary node only reads the local disk (the threshold needs to be set to 1 so that read operations are performed locally only)
  2. A newly implemented filter called replication is responsible for controlling block replication
  3. The secondary node receives disk write requests from the primary node through the embedded NBD server
  4. The device on the secondary node is a custom block device we call the active disk. It should be an empty device at the beginning, and it needs to support bdrv_make_empty() and backing_file (see the command-line sketch after this list)
  5. The hidden disk is created automatically; this file caches the content that writes from the primary node would otherwise overwrite. It should also be an empty device at the beginning and support bdrv_make_empty() and backing_file
  6. The blockdev-backup job (sync=none) synchronizes into the hidden-disk cache all the content that is about to be overwritten by NBD-induced writes, so the primary and secondary nodes should have the same disk contents before replication starts
  7. The secondary node also has a quorum node, so that the secondary can become the new primary after a failover and continue the replication
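To make the structure concrete, here is a hedged sketch of the secondary node's drive definition, adapted from the QEMU COLO documentation (ids and file paths are placeholders): the active disk sits on top of the hidden disk, which in turn backs onto the real secondary image:

-drive if=none,id=secondary-disk0,driver=qcow2,file.filename=secondary.qcow2 \
-drive if=virtio,id=active-disk0,driver=replication,mode=secondary,top-id=active-disk0,\
file.driver=qcow2,file.file.filename=/mnt/ramfs/active_disk.img,\
file.backing.driver=qcow2,file.backing.file.filename=/mnt/ramfs/hidden_disk.img,\
file.backing.backing=secondary-disk0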

There are seven types of internal errors that can exist when block replication runs:

  1. Primary disk I/O errors
  2. Primary disk forwarding errors
  3. blockdev-backup error
  4. secondary disk I/O errors
  5. active disk I/O error
  6. Error clearing hidden disk and active disk
  7. failover failure

For errors 1 and 5, block-level errors are simply reported upwards.

Errors 2, 3, 4, and 6 need to be reported to the FT control plane to drive the failover process.

In case 7, if the active commit fails, a secondary-node write error is reported and the person performing the failover decides how to handle it.

colo checkpoint 

COLO uses the VM's live migration to implement the checkpoint function.

Disk synchronization is achieved with the block-replication mechanism above. The other part is synchronizing the virtual machines' runtime state; here COLO directly reuses the existing live migration (i.e. cloud-host hot migration). After each checkpoint the PVM and SVM disk and memory can be considered consistent, so the duration of this event depends on the live migration time.

First, let's walk through the checkpoint process, which is divided into two major parts.

Configuration phase

This part is executed mainly when COLO is first set up. By default we configure disk synchronization between PVM and SVM at the beginning, but memory is not actually synchronized yet, so the SVM is first paused after startup, and then two synchronization operations are submitted from the PVM side:

  1. Submit a drive-mirror job to mirror the contents of the disk from the PVM to the remote SVM's disk (embedded NBD is used here, and it is also the target disk of the later block replication), ensuring that the PVM's contents are consistent with the SVM's
  2. Submit a migration task to synchronize memory from PVM to SVM. Since both PVM and SVM are required to be paused at this point, you actually wait until both are synchronized, then cancel the drive-mirror job, start block replication, and resume running the VMs
    Of course, after Intel's improvements the paused state mentioned in step 2 now behaves much like an ordinary live migration: once the drive-mirror job is submitted, the job id and the block-replication disk information are passed as parameters to the COLO migration, the switch-over happens automatically during the migration, and when migration completes the drive-mirror job is cancelled and block replication is started automatically before the VM runs, which simplifies the steps a lot.

After the configuration, you need to manually issue a migrate command to the COLO PVM; after this first migrate, the checkpoint mechanism enters its monitoring loop.
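For reference, with the QEMU monitor (HMP) this step looks roughly as follows on the PVM, following the QEMU COLO documentation (the address and port are placeholders):

migrate_set_capability x-colo on
migrate -d tcp:secondary-host:8888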

Start the checkpoint

The checkpoint mechanism consists mainly of a loop; the QEMU code flow is illustrated by a diagram in the original post (figure omitted).

Combined with that picture, we explain the more important parts below.

Process phase

COLO-FT initializes some key variables such as the migration status and the FAILOVER status, and listens for the internal checkpoint notification (triggered from colo-compare).

After the first successful migrate, the failover state is initialized and the migration state changes to COLO.

After receiving a checkpoint request, the checkpoint-creation sequence is performed.

Colo Status

For a COLO-FT virtual machine there are two important states.

One is the MigrationState, which on a COLO-FT virtual machine is MIGRATION_STATUS_COLO, corresponding to the string "COLO". It is a prerequisite state for checkpointing: the VM enters this state and the main loop only after the COLO-FT mechanism has been established, that is, after completing the configuration phase above and the first checkpoint.

The other is failover_state, a global variable defined in colo_failover.c and accessed by colo.c through failover_get_state(). It is set to FAILOVER_STATUS_NONE at the start of the checkpoint loop, meaning failover is not needed. QEMU's bottom half installs the mechanism for modifying this state, so it can be triggered by user commands; you therefore need to watch whether failover has been triggered while actually doing a checkpoint.

Communication

COLO communicates through messages to get the status of the SVM, as well as to send and confirm the start and completion of a checkpoint. The message flow has the following main steps:

  1. The PVM sends COLO_MESSAGE_CHECKPOINT_REQUEST
  2. After the SVM receives the message, it pauses the SVM and sends COLO_MESSAGE_CHECKPOINT_READY
  3. The PVM starts saving and live-migrating the VMSTATE
  4. The SVM receives the migrated data and performs CPU synchronization and VM state load locally
  5. The SVM waits for a check message from the PVM, which the PVM sends once the live migration completes
  6. The PVM sends COLO_MESSAGE_VMSTATE_SIZE with the size of the VMSTATE sent via QIOChannelBuffer
  7. The SVM checks whether the size received locally matches the size sent; if so, it replies COLO_MESSAGE_VMSTATE_RECEIVED
  8. After confirming the VMSTATE transfer, the SVM does some migration follow-up synchronization and cleanup
  9. On completion, the SVM executes vm_start() and sends COLO_MESSAGE_VMSTATE_LOADED
  10. After the PVM receives the message that the SVM loaded successfully, the PVM also executes vm_start()

The logic of suspending, migrating, and resuming the PVM and SVM is realized through this message collaboration.

An existing problem: because checkpoints are coordinated through these messages, once a message is sent and no reply comes back, the subsequent wait can persist forever and cannot be cancelled; a request issued from the bottom half at that point does nothing to clean up the waiting state either.

Note that by default, once a checkpoint fails the VM exits directly and COLO-FT must be rebuilt, so a COLO-FT establishment failure needs to be analyzed in two parts:

Whether the configuration-phase migration failed
Whether the configuration completed (migration reached the COLO state) but a checkpoint failed (the process above), causing COLO-FT to exit

colo proxy

The COLO proxy is a core component of COLO-FT; this section focuses on its functionality within QEMU.

When QEMU implements the net module, it treats the actual device in the guest as the receiver, so the correspondence is as follows:

                 TX                                        RX
qemu side network device (sender) ────────────────→ guest inside driver (receiver)

As the code below shows, the filters are executed before the actual transmission, and only then is the packet queued to the sender's peer.

NetQueue *queue;
size_t size = iov_size(iov, iovcnt);
int ret;

if (size > NET_BUFSIZE) {
    return size;
}

if (sender->link_down || !sender->peer) {
    return size;
}

/* Let filters handle the packet first */
ret = filter_receive_iov(sender, NET_FILTER_DIRECTION_TX, sender,
                         QEMU_NET_PACKET_FLAG_NONE, iov, iovcnt, sent_cb);
if (ret) {
    return ret;
}

ret = filter_receive_iov(sender->peer, NET_FILTER_DIRECTION_RX, sender,
                         QEMU_NET_PACKET_FLAG_NONE, iov, iovcnt, sent_cb);
if (ret) {
    return ret;
}

queue = sender->peer->incoming_queue;

return qemu_net_queue_send_iov(queue, sender,
                               QEMU_NET_PACKET_FLAG_NONE,
                               iov, iovcnt, sent_cb);

The filter-mirror and filter-redirector network filters implemented for COLO provide the proxy's forwarding function.

The classic process for a network card is as follows:

The host device receives a network packet and sends it to the guest:

  1. The first filter-mirror executes first. For QEMU this is the TX direction, so the mirror action runs: a copy of the packet is sent out through outdev (a chardev), and then the next filter is called (because this is TX, no other filter executes on the PVM, so the packet is delivered directly to the guest)
  2. The SVM's indev connects to the PVM mirror's outdev (via a socket), so it receives the packet sent in step 1. This filter specifies no outdev, so after receiving the packet it calls the next filter directly
  3. The SVM calls filter-rewrite. This filter's direction is ALL, so packets both to and from the SVM are processed by it. For a packet going to the VM the direction is TX, and COLO records the various states of this TCP packet
  4. Because there is no next filter, the packet is handed to the QEMU network device and follows the usual path into the guest
  5. For packets from the guest to QEMU the direction is RX, so the filter order is reversed and the packet is sent to the rewrite filter first
  6. This time the SVM calls filter-rewrite in the RX direction, so it processes the TCP packets returned by the SVM, comparing packet input and output against the TCP packet table; if processing fails, the packet is queued and resent (note: this needs deeper analysis). Then comes the filter-redirector
  7. Likewise on the PVM, since the mirror filter has no subsequent TX filter, the packet goes directly to the QEMU net device and then to the PVM guest
  8. Packets coming out of the PVM guest are in the RX direction, so some filters run in reverse order, and filter-redirector sends them to the primary-in interface of colo-compare
  9. The SVM's redirector sends the SVM's replies via its outdev to the secondary-in interface of colo-compare on the PVM
    colo-compare receives the packets and starts the comparison to decide whether a checkpoint is needed
  10. The filter-redirector's indev receives the packets released by colo-compare after comparison and forwards them to the host net device via its outdev

This concludes one complete packet-processing pass.
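A trimmed sketch of the primary-side wiring described above, adapted from the QEMU COLO documentation (the ids and netdev names are placeholders, and the chardev socket plumbing they refer to is omitted):

-object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0
-object filter-redirector,id=redire0,netdev=hn0,queue=rx,indev=compare_out
-object filter-redirector,id=redire1,netdev=hn0,queue=rx,outdev=compare0
-object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0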

Since colo-compare is responsible for comparing PVM and SVM packets, a few terms need to be understood.

payload

payload_size = total size - header size, i.e. the size of the whole packet minus the length of the header.

The packet data is compared starting at the offset just past the header; two packets are considered consistent when their payload_size matches and those payload bytes are identical.

The comparison logic of colo-compare is summarized as follows:

  • TCP: compare whether the payloads are the same. If the payloads and the acks are both the same, the comparison is done. If the payloads are the same but the PVM's ack is larger than the SVM's, the packet sent to the SVM is dropped and the PVM's packet is not yet released (i.e. not sent back to the host NIC). The maximum ack in each queue (both PVM and SVM) is recorded, and the packet is held until the smaller of the two maximum ack values is exceeded, which ensures the packet payload has been acknowledged by both sides
  • UDP: only the payload is checked
  • ICMP: only the payload is checked
  • Other: only the packet size is checked

Possible reasons for network packet loss are therefore:

  1. colo-compare did not receive the packet correctly
  2. svm’s redirector did not successfully forward packets to rewrite
  3. mirror did not replicate the packet successfully
  4. pvm’s redirector did not successfully send pvm’s return to colo-compare
  5. svm’s rewrite did not send/receive packets successfully
  6. colo-compare is not sending packets correctly
  7. svm’s redirector did not successfully forward packets to colo-compare

Problem processing

Case 1 relies mainly on the colo-compare mechanism itself: for TCP packets, the acks of subsequent packets reveal whether an earlier packet was missed.

Case 2: if the packet is not successfully delivered to rewrite, it is never processed by the SVM, so colo-compare eventually sees a PVM packet with no SVM counterpart; handling is similar to case 1.

Case 3: if mirror fails to replicate the packet, the situation is again like case 1: the PVM has the packet and the SVM does not.

Case 4: if the PVM's redirector fails to send the packet, from colo-compare's point of view the PVM lost the packet. Handling is the same as case 1: colo-compare waits until the smaller of the PVM and SVM maximum acks is exceeded. In other words, whether the PVM or the SVM drops a packet, colo-compare waits for a newer packet to appear before releasing the current one; until then the packet is held back.

Case 5: if rewrite's send or receive fails, the SVM neither receives the packet nor replies, similar to case 1; but if a failover happens at this moment, the SVM-side packet is lost.

Case 6: this failure causes colo-compare itself to send and receive packets abnormally, i.e. general network anomalies; it is hard to handle well because colo-compare is the core component.

Case 7: similarly, the SVM appears to have replied or actively sent a packet, but colo-compare never received it, so from the outside the SVM seems not to have replied. The upside is that if a failover occurs later, rewrite recorded the packet and will resend it, so things appear to work again (needs testing).

Trigger checkpoint

The following conditions trigger a checkpoint from colo-compare. COLO-FT establishes a notification mechanism when it is set up, and colo-compare uses it to trigger checkpoints proactively:

  1. A checkpoint is triggered if the payloads of the compared TCP packets are inconsistent
  2. A periodic check triggers a checkpoint if, after a certain time, no matching reply packet has been received (i.e. the contents of the PVM and SVM packet lists are inconsistent)
  3. If a packet exists in the PVM list but not in the secondary list, the reply is merely late, and this is handled by condition 2; if a comparison finds non-TCP packets inconsistent, a checkpoint is triggered

KVM Virtualization Performance Analysis

KVM virtualization is a widely used virtualization technology: it partitions one physical server into multiple virtual machines, improving server utilization and flexibility. However, because virtualization adds overhead, KVM performance problems are a common challenge. To address them, we need performance diagnostic tools to analyze and optimize KVM virtualization performance.

Here are some commonly used KVM performance diagnostic tools:

Perf

Perf is a Linux performance analysis tool for monitoring system performance and debugging performance problems. It is built on the performance event interface provided by the Linux kernel and offers a command-line interface for monitoring metrics such as CPU usage, memory usage, and disk I/O.

Best practices for KVM performance analysis with Perf:

  1. Install Perf

To install Perf, use the following command:

sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

  2. Collect Perf data

To collect performance data with Perf, use the following command:

sudo perf record -g -p `pidof qemu-system-x86_64` -F 99

In this example, the -g option collects call graphs (used to generate a Flame Graph), the -p option attaches to the qemu-system-x86_64 process, and the -F option samples at a frequency of 99 Hz.

  3. Generate a Flame Graph

To generate a Flame Graph, use the following command:

sudo perf script | stackcollapse-perf.pl | flamegraph.pl > output.svg

In this example, perf script converts the Perf data into script output, stackcollapse-perf.pl folds the stacks, and flamegraph.pl turns the folded stacks into a Flame Graph, saved to output.svg.
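Note that stackcollapse-perf.pl and flamegraph.pl are not shipped with perf; they come from Brendan Gregg's FlameGraph repository. A typical setup (the install location is arbitrary):

git clone https://github.com/brendangregg/FlameGraph
export PATH=$PATH:$(pwd)/FlameGraph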

Sysstat

Sysstat is a Linux system performance monitoring tool that covers metrics such as CPU, memory, and disk I/O. In a KVM environment you can use Sysstat to monitor virtual machine performance. Best practices for KVM performance analysis with Sysstat:

  1. Install Sysstat

To install Sysstat, use the following command:

sudo apt-get install sysstat

  2. Configure Sysstat

To configure Sysstat, edit the /etc/default/sysstat file and change the following variables:

HISTORY=7
INTERVAL=60

In this example, Sysstat collects performance data once per minute and keeps the last 7 days of data.

  3. Analyze Sysstat data

The data collected by Sysstat is stored under /var/log/sysstat. You can inspect it with the following commands:

sar -u
sar -r
sar -b
sar -d

These commands display CPU usage, memory usage, and disk I/O statistics, respectively.

  4. Generate Sysstat reports

Sysstat also provides a report generator that builds reports from the collected data. To generate a report, run:

sar -A -o <outfile>
sadf -dh <outfile> > <reportfile>

This produces an output file containing all performance data, which sadf then converts into a report file.

Hopefully these best practices help you analyze KVM virtualization performance with Sysstat.

If you want to do trend analysis on Sysstat data, you can use a tool called ksar.

ksar is a Java application that turns Sysstat data into charts, making trend analysis much more convenient.

To use ksar, follow these steps:

  1. Install Java

ksar is a Java application, so Java is required to run it. You can download Java from Oracle's official website.

  2. Download and install ksar

Download the latest version from the ksar official website, then extract the archive into a directory of your choice.

  3. Run ksar

To run ksar, open a terminal, navigate to the ksar directory, and run:

java -jar ksar.jar

  4. Load a Sysstat data file

In the ksar window, click the "File" menu and choose "Open", then select the Sysstat data file to load.

  5. Generate charts

In the ksar window, click the "Graphs" menu and choose the chart type to generate. ksar will draw a chart showing the trend of the Sysstat data.

If you run into network performance problems with KVM virtualization, the following tools can help with diagnosis:

tcpdump

tcpdump is a widely used packet capture tool. In a KVM environment you can run tcpdump on the host to monitor a virtual machine's network traffic. An example command:

sudo tcpdump -i <interface> -w <output-file>

In this command, <interface> is the network interface to monitor and <output-file> is the file where the captured data is saved. Once started, tcpdump captures all traffic on the specified interface and writes it to the output file.

Wireshark

Wireshark is a network protocol analyzer that can be used to analyze network traffic. In a KVM environment you can use it on the host to analyze a virtual machine's traffic. An example command:

sudo tshark -i <interface> -w <output-file>

virt-top

A project that integrates KVM performance diagnostics is virt-top. virt-top is an ncurses-based interactive monitor for KVM virtual machines. Best practices for KVM performance analysis with virt-top:

  1. Install virt-top

To install virt-top, use the following command:

sudo apt-get install virt-top

  2. Run virt-top

To run virt-top, execute:

sudo virt-top

  3. Monitor VM performance

In the virt-top window, use the up and down arrow keys to select the virtual machine to monitor. You can then view metrics such as the VM's CPU usage, memory usage, and disk I/O.

Hopefully these best practices help you make better use of KVM performance diagnostic tools.

Cpu features about kvm hidden


What kvm hidden did to qemu

Based on the last blog post, we saw how libvirt's CPU feature configuration changes the QEMU CPUID, and what influence disabling the hypervisor feature has.

Another feature recommended by libvirt is KVM hidden. In the same way as the last post, we can see that libvirt passes kvm=off to -cpu, and according to QEMU:

DEFINE_PROP_BOOL("hv-relaxed", X86CPU, hyperv_relaxed_timing, false),
DEFINE_PROP_BOOL("hv-vapic", X86CPU, hyperv_vapic, false),
DEFINE_PROP_BOOL("hv-time", X86CPU, hyperv_time, false),
DEFINE_PROP_BOOL("hv-crash", X86CPU, hyperv_crash, false),
DEFINE_PROP_BOOL("hv-reset", X86CPU, hyperv_reset, false),
DEFINE_PROP_BOOL("hv-vpindex", X86CPU, hyperv_vpindex, false),
DEFINE_PROP_BOOL("hv-runtime", X86CPU, hyperv_runtime, false),
DEFINE_PROP_BOOL("hv-synic", X86CPU, hyperv_synic, false),
DEFINE_PROP_BOOL("hv-stimer", X86CPU, hyperv_stimer, false),
DEFINE_PROP_BOOL("hv-frequencies", X86CPU, hyperv_frequencies, false),
DEFINE_PROP_BOOL("check", X86CPU, check_cpuid, true),
DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),

These properties are defined in target/i386/cpu.c in the x86_cpu_properties array.

kvm=off is parsed as the "kvm" property being false, which sets this CPU's expose_kvm field to false.
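For context, the libvirt side that generates kvm=off is the KVM hidden toggle in the domain XML; a minimal sketch:

<features>
  <kvm>
    <!-- hide the KVM hypervisor signature from the guest -->
    <hidden state='on'/>
  </kvm>
</features>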

if (!kvm_enabled() || !cpu->expose_kvm) {
    env->features[FEAT_KVM] = 0;
}

x86_cpu_realizefn invokes x86_cpu_expand_features to expand features from the configuration; as a result, all FEAT_KVM features are cleared once the features are realized.

[FEAT_KVM] = {
.feat_names = {
"kvmclock", "kvm-nopiodelay", "kvm-mmu", "kvmclock",
"kvm-asyncpf", "kvm-steal-time", "kvm-pv-eoi", "kvm-pv-unhalt",
NULL, "kvm-pv-tlb-flush", NULL, NULL,
NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
"kvmclock-stable-bit", NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
},
.cpuid_eax = KVM_CPUID_FEATURES, .cpuid_reg = R_EAX,
.tcg_features = TCG_KVM_FEATURES,
},

Checking its definition, almost all KVM-related features are disabled.

Moving on to the Linux kernel, arch/x86/include/uapi/asm/kvm_para.h defines these CPUID features:

/* This CPUID returns a feature bitmap in eax.  Before enabling a particular
* paravirtualization, the appropriate feature bit should be checked.
*/
#define KVM_CPUID_FEATURES 0x40000001
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
/* This indicates that the new set of kvmclock msrs
* are available. The use of 0x11 and 0x12 is deprecated
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
#define KVM_FEATURE_STEAL_TIME 5
#define KVM_FEATURE_PV_EOI 6
#define KVM_FEATURE_PV_UNHALT 7

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
*/
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24

Before we check the details of each feature, let's first look at how Linux detects KVM.

The kernel checks for KVM via kvm_para_available:

bool kvm_para_available(void)
{
return kvm_cpuid_base() != 0;
}

which detects a KVM-based hypervisor by checking cpu_has_hypervisor:

static noinline uint32_t __kvm_cpuid_base(void)
{
if (boot_cpu_data.cpuid_level < 0)
return 0; /* So we don't blow up on old processors */

if (cpu_has_hypervisor)
return hypervisor_cpuid_base("KVMKVMKVM\0\0\0", 0);

return 0;
}

and cpu_has_hypervisor is defined from the hypervisor feature we mentioned in last post:

#define cpu_has_hypervisor	boot_cpu_has(X86_FEATURE_HYPERVISOR)

Now we can combine these two parts to examine the influence introduced by KVM hidden.
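From inside a guest, the hypervisor signature can be inspected directly; a quick sketch (assuming the cpuid utility is installed):

# leaf 0x40000000 carries the hypervisor signature; on KVM it contains "KVMKVMKVM"
cpuid -1 -l 0x40000000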

Note: here is a brief description of these CPUID features:

function: define KVM_CPUID_FEATURES (0x40000001)
returns : ebx, ecx, edx = 0
eax = and OR'ed group of (1 << flag), where each flags is:


flag || value || meaning
=============================================================================
KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs
|| || 0x11 and 0x12.
------------------------------------------------------------------------------
KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays
|| || on PIO operations.
------------------------------------------------------------------------------
KVM_FEATURE_MMU_OP || 2 || deprecated.
------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
|| || 0x4b564d00 and 0x4b564d01
------------------------------------------------------------------------------
KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by
|| || writing to msr 0x4b564d02
------------------------------------------------------------------------------
KVM_FEATURE_STEAL_TIME || 5 || steal time can be enabled by
|| || writing to msr 0x4b564d03.
------------------------------------------------------------------------------
KVM_FEATURE_PV_EOI || 6 || paravirtualized end of interrupt
|| || handler can be enabled by writing
|| || to msr 0x4b564d04.
------------------------------------------------------------------------------
KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit
|| || before enabling paravirtualized
|| || spinlock support.
------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
------------------------------------------------------------------------------

KVM_FEATURE_CLOCKSOURCE & KVM_FEATURE_CLOCKSOURCE2

These features are used directly in the implementation of kvmclock_init:

void __init kvmclock_init(void)
{
struct pvclock_vcpu_time_info *vcpu_time;
unsigned long mem, mem_wall_clock;
int size, cpu, wall_clock_size;
u8 flags;

size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);

if (!kvm_para_available())
return;

if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
return;
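A quick way to confirm from inside a Linux guest which clocksource was selected (standard sysfs path):

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# prints "kvm-clock" when kvmclock is in use, or e.g. "tsc" otherwise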

KVM_FEATURE_NOP_IO_DELAY

During guest init, paravirt_ops_setup uses this feature:

void __init kvm_guest_init(void)
{
int i;

if (!kvm_para_available())
return;

paravirt_ops_setup();

which changes the io_delay op of the paravirt CPU ops to kvm_io_delay:

static void __init paravirt_ops_setup(void)
{
pv_info.name = "KVM";
pv_info.paravirt_enabled = 1;

if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
pv_cpu_ops.io_delay = kvm_io_delay;

#ifdef CONFIG_X86_IO_APIC
no_timer_check = 1;
#endif
}

which simply means no I/O delay at all:

/*
* No need for any "IO delay" on KVM
*/
static void kvm_io_delay(void)
{
}

KVM_FEATURE_MMU_OP

Deprecated.

KVM_FEATURE_ASYNC_PF

During KVM guest init:

void __init kvm_guest_init(void)
{
// ...
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;

kvm_apf_trap_init is installed as x86_init.irqs.trap_init, which registers async_page_fault as the page-fault interrupt gate:

static void __init kvm_apf_trap_init(void)
{
set_intr_gate(14, async_page_fault);
}

Then, when a KVM guest CPU is initialized, it explicitly enables async page faults by writing the corresponding MSR:

static void kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
return;

if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));

#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
#endif
wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
__this_cpu_write(apf_reason.enabled, 1);
printk(KERN_INFO"KVM setup async PF for cpu %d\n",
smp_processor_id());
}

This enables async PF for the CPU.

Note: trap initialization is done in arch/x86/kernel/traps.c:

void __init trap_init(void)
{
int i;

#ifdef CONFIG_EISA
void __iomem *p = early_ioremap(0x0FFFD9, 4);

if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
EISA_bus = 1;
early_iounmap(p, 4);
#endif

set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
/* int4 can be called from all */
set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
set_intr_gate(X86_TRAP_NM, device_not_available);
#ifdef CONFIG_X86_32
set_task_gate(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS);
#else
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
#endif
set_intr_gate(X86_TRAP_OLD_MF, coprocessor_segment_overrun);
set_intr_gate(X86_TRAP_TS, invalid_TSS);
set_intr_gate(X86_TRAP_NP, segment_not_present);
set_intr_gate(X86_TRAP_SS, stack_segment);
set_intr_gate(X86_TRAP_GP, general_protection);
set_intr_gate(X86_TRAP_SPURIOUS, spurious_interrupt_bug);
set_intr_gate(X86_TRAP_MF, coprocessor_error);
set_intr_gate(X86_TRAP_AC, alignment_check);
#ifdef CONFIG_X86_MCE
set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif
set_intr_gate(X86_TRAP_XF, simd_coprocessor_error);

/* Reserve all the builtin and the syscall vector: */
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
set_bit(i, used_vectors);

#ifdef CONFIG_IA32_EMULATION
set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

#ifdef CONFIG_X86_32
set_system_trap_gate(SYSCALL_VECTOR, &system_call);
set_bit(SYSCALL_VECTOR, used_vectors);
#endif

/*
* Set the IDT descriptor to a fixed read-only location, so that the
* "sidt" instruction will not leak the location of the kernel, and
* to defend the IDT against arbitrary memory write vulnerabilities.
* It will be reloaded in cpu_init() */
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);

/*
* Should be a barrier for any external CPU state:
*/
cpu_init();

x86_init.irqs.trap_init();

#ifdef CONFIG_X86_64
memcpy(&debug_idt_table, &idt_table, IDT_ENTRIES * 16);
set_nmi_gate(X86_TRAP_DB, &debug);
set_nmi_gate(X86_TRAP_BP, &int3);
#endif
}

and x86_init.irqs.trap_init() is invoked after the built-in gates are set up.

KVM_FEATURE_STEAL_TIME

During KVM guest init:

if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
}

The paravirt steal-clock op is replaced by KVM's implementation:

static u64 kvm_steal_clock(int cpu)
{
u64 steal;
struct kvm_steal_time *src;
int version;

src = &per_cpu(steal_time, cpu);
do {
version = src->version;
rmb();
steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));

return steal;
}

which reads the steal time for the CPU directly from the shared per-CPU structure.
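The effect is visible from inside the guest: steal time shows up as the "st" column in top and vmstat, and as the eighth field of the cpu line in /proc/stat:

grep '^cpu ' /proc/stat   # fields: user nice system idle iowait irq softirq steal ...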

KVM_FEATURE_PV_EOI

From kvm guest init:

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);

During kvm guest cpu init:

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
unsigned long pa;
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__this_cpu_write(kvm_apic_eoi, 0);
pa = slow_virt_to_phys(this_cpu_ptr(&kvm_apic_eoi))
| KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}

Because these paravirt KVM features rely on memory shared between guest and host, they must be disabled when the kernel is replaced (for example, when loading a new kernel via kexec); otherwise they would keep pointing into the old kernel's memory. They are disabled by writing the MSRs manually:

static void kvm_pv_guest_cpu_reboot(void *unused)
{
/*
* We disable PV EOI before we load a new kernel by kexec,
* since MSR_KVM_PV_EOI_EN stores a pointer into old kernel's memory.
* New kernel can re-enable when it boots.
*/
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
kvm_disable_steal_time();
}

KVM guest CPU offlining does the same:

static void kvm_guest_cpu_offline(void *dummy)
{
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
apf_task_wake_all();
}

All of this is because paravirt features use memory shared between guest and host.

KVM_FEATURE_PV_UNHALT

Allows the use of para-virtualized spinlocks:

void __init kvm_spinlock_init(void)
{
if (!kvm_para_available())
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;

KVM_FEATURE_CLOCKSOURCE_STABLE_BIT

kvmclock sets the PVCLOCK_TSC_STABLE_BIT flag on pvclock:

printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);

if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);

When a stable clocksource is detected:

u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
unsigned version;
u64 ret;
u64 last;
u8 flags;

do {
version = pvclock_read_begin(src);
ret = __pvclock_read_cycles(src, rdtsc_ordered());
flags = src->flags;
} while (pvclock_read_retry(src, version));

if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
}

if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
(flags & PVCLOCK_TSC_STABLE_BIT))
return ret;

clocksource read will return directly.

Hyper-v impact

Linux converts between Hyper-V clock and kvmclock parameters:

static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
HV_REFERENCE_TSC_PAGE *tsc_ref)
{
u64 max_mul;

if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
return false;

but if no stable TSC is available, the Hyper-V clock / kvmclock computation is skipped.

The call chain is as follows:

kvm_guest_time_update -> kvm_hv_setup_tsc_page -> compute_tsc_page_parameters

And the source is a KVM request:

if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
r = kvm_guest_time_update(vcpu);
if (unlikely(r))
goto out;
}

We need to know more about KVM_REQ_CLOCK_UPDATE to figure out when this request is used.

The clue is the usage of kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);.

  • Ioctl kvm clock set -> KVM_SET_CLOCK -> kvm_gen_update_masterclock

  • kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu) -> kvm_gen_update_masterclock

  • kvm_guest_time_update -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
    the first update is from a KVM request:

    if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
    r = kvm_guest_time_update(vcpu);
    if (unlikely(r))
    goto out;
    }

    then interrupts are disabled to prevent changes to the clock:

    /* Keep irq disabled to prevent changes to the clock */
    local_irq_save(flags);
    this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
    if (unlikely(this_tsc_khz == 0)) {
    local_irq_restore(flags);
    kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
    return 1;
    }
  • INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn); -> kvmclock_update_fn -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
    kvmclock is updated by scheduled delayed work:

    /*
    * kvmclock updates which are isolated to a given vcpu, such as
    * vcpu->cpu migration, should not allow system_timestamp from
    * the rest of the vcpus to remain static. Otherwise ntp frequency
    * correction applies to one vcpu's system_timestamp but not
    * the others.
    *
    * So in those cases, request a kvmclock update for all vcpus.
    * We need to rate-limit these requests though, as they can
    * considerably slow guests that have a large number of vcpus.
    * The time for a remote vcpu to update its kvmclock is bound
    * by the delay we use to rate-limit the updates.
    */

    #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)

    and the kvmclock sync period is

    #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
  • kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu) -> kvm_gen_kvmclock_update -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);

    • MSR_KVM_SYSTEM_TIME

    • kvm_arch_vcpu_load
      update the clock if there is no master clock or no host CPU to sync with.

      /*
      * On a host with synchronized TSC, there is no need to update
      * kvmclock on vcpu->cpu migration
      */
      if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
      kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
      if (vcpu->cpu != cpu)
      kvm_migrate_timers(vcpu);
      vcpu->cpu = cpu;
  • kvm_arch_vcpu_load -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    Adjust the time if needed:

    /* Apply any externally detected TSC adjustments (due to suspend) */
    if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
    adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
    vcpu->arch.tsc_offset_adjustment = 0;
    kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    }
  • kvm_set_guest_paused -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    if the guest kernel is stopped by the hypervisor, this is used to update the pv clock.

    /*
    * kvm_set_guest_paused() indicates to the guest kernel that it has been
    * stopped by the hypervisor. This function will be called from the host only.
    * EINVAL is returned when the host attempts to set the flag for a guest that
    * does not support pv clocks.
    */
    static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
    {
    if (!vcpu->arch.pv_time_enabled)
    return -EINVAL;
    vcpu->arch.pvclock_set_guest_stopped_request = true;
    kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    return 0;
    }
  • kvmclock_cpufreq_notifier -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    see the comment in the code:

    /*
    * We allow guests to temporarily run on slowing clocks,
    * provided we notify them after, or to run on accelerating
    * clocks, provided we notify them before. Thus time never
    * goes backwards.
    *
    * However, we have a problem. We can't atomically update
    * the frequency of a given CPU from this function; it is
    * merely a notifier, which can be called from any CPU.
    * Changing the TSC frequency at arbitrary points in time
    * requires a recomputation of local variables related to
    * the TSC for each VCPU. We must flag these local variables
    * to be updated and be sure the update takes place with the
    * new frequency before any guests proceed.
    *
    * Unfortunately, the combination of hotplug CPU and frequency
    * change creates an intractable locking scenario; the order
    * of when these callouts happen is undefined with respect to
    * CPU hotplug, and they can race with each other. As such,
    * merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
    * undefined; you can actually have a CPU frequency change take
    * place in between the computation of X and the setting of the
    * variable. To protect against this problem, all updates of
    * the per_cpu tsc_khz variable are done in an interrupt
    * protected IPI, and all callers wishing to update the value
    * must wait for a synchronous IPI to complete (which is trivial
    * if the caller is on the CPU already). This establishes the
    * necessary total order on variable updates.
    *
    * Note that because a guest time update may take place
    * anytime after the setting of the VCPU's request bit, the
    * correct TSC value must be set before the request. However,
    * to ensure the update actually makes it to any guest which
    * starts running in hardware virtualization between the set
    * and the acquisition of the spinlock, we must also ping the
    * CPU after setting the request bit.
    *
    */
  • after kvm_guest_exit();
    update the clock if the vCPU requires the clock to always catch up.

    if (unlikely(vcpu->arch.tsc_always_catchup))
    kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
  • hardware_enable_nolock -> kvm_arch_hardware_enable -> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
    Multiple functions reach hardware_enable_nolock:

    • kvm_cpu_hotplug
    static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
    void *v)
    {
    val &= ~CPU_TASKS_FROZEN;
    switch (val) {
    case CPU_DYING:
    hardware_disable();
    break;
    case CPU_STARTING:
    hardware_enable();
    break;
    }
    return NOTIFY_OK;
    }
    • kvm_resume

Note: for hv_stimer

/*
* KVM_REQ_HV_STIMER has to be processed after
* KVM_REQ_CLOCK_UPDATE, because Hyper-V SynIC timers
* depend on the guest clock being up-to-date
*/
if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
kvm_hv_process_stimers(vcpu);

is processed after the guest clock is up to date.

Hyper-v impact conclusion

With KVM hidden, the Hyper-V TSC page computation is skipped:

static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
struct ms_hyperv_tsc_page *tsc_ref)
{
u64 max_mul;

if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
return false;

which is triggered by the KVM code above.

During migration the guest is stopped (paused); KVM_KVMCLOCK_CTRL is used to tell the guest about this, and we can check how KVM userspace (QEMU) uses it:

static void kvmclock_vm_state_change(void *opaque, int running,
RunState state)
{
KVMClockState *s = opaque;
CPUState *cpu;
int cap_clock_ctrl = kvm_check_extension(kvm_state, KVM_CAP_KVMCLOCK_CTRL);
int ret;

if (running) {
struct kvm_clock_data data = {};

/*
* If the host where s->clock was read did not support reliable
* KVM_GET_CLOCK, read kvmclock value from memory.
*/
if (!s->clock_is_reliable) {
uint64_t pvclock_via_mem = kvmclock_current_nsec(s);
/* We can't rely on the saved clock value, just discard it */
if (pvclock_via_mem) {
s->clock = pvclock_via_mem;
}
}

s->clock_valid = false;

data.clock = s->clock;
ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
if (ret < 0) {
fprintf(stderr, "KVM_SET_CLOCK failed: %s\n", strerror(ret));
abort();
}

if (!cap_clock_ctrl) {
return;
}
CPU_FOREACH(cpu) {
run_on_cpu(cpu, do_kvmclock_ctrl, RUN_ON_CPU_NULL);
}
} else {

if (s->clock_valid) {
return;
}

s->runstate_paused = runstate_check(RUN_STATE_PAUSED);

kvm_synchronize_all_tsc();

kvm_update_clock(s);
/*
* If the VM is stopped, declare the clock state valid to
* avoid re-reading it on next vmsave (which would return
* a different value). Will be reset when the VM is continued.
*/
s->clock_valid = true;
}
}

When setting the guest to running, QEMU uses KVM_SET_CLOCK; otherwise it uses kvm_update_clock, which works as follows:

static void kvm_update_clock(KVMClockState *s)
{
struct kvm_clock_data data;
int ret;

ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
if (ret < 0) {
fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
abort();
}
s->clock = data.clock;

/* If kvm_has_adjust_clock_stable() is false, KVM_GET_CLOCK returns
* essentially CLOCK_MONOTONIC plus a guest-specific adjustment. This
* can drift from the TSC-based value that is computed by the guest,
* so we need to go through kvmclock_current_nsec(). If
* kvm_has_adjust_clock_stable() is true, and the flags contain
* KVM_CLOCK_TSC_STABLE, then KVM_GET_CLOCK returns a TSC-based value
* and kvmclock_current_nsec() is not necessary.
*
* Here, however, we need not check KVM_CLOCK_TSC_STABLE. This is because:
*
* - if the host has disabled the kvmclock master clock, the guest already
* has protection against time going backwards. This "safety net" is only
* absent when kvmclock is stable;
*
* - therefore, we can replace a check like
*
* if last KVM_GET_CLOCK was not reliable then
* read from memory
*
* with
*
* if last KVM_GET_CLOCK was not reliable && masterclock is enabled
* read from memory
*
* However:
*
* - if kvm_has_adjust_clock_stable() returns false, the left side is
* always true (KVM_GET_CLOCK is never reliable), and the right side is
* unknown (because we don't have data.flags). We must assume it's true
* and read from memory.
*
* - if kvm_has_adjust_clock_stable() returns true, the result of the &&
* is always false (masterclock is enabled iff KVM_GET_CLOCK is reliable)
*
* So we can just use this instead:
*
* if !kvm_has_adjust_clock_stable() then
* read from memory
*/
s->clock_is_reliable = kvm_has_adjust_clock_stable();
}

But from the annotation in kvmclock_vm_state_change:

/*
* If the VM is stopped, declare the clock state valid to
* avoid re-reading it on next vmsave (which would return
* a different value). Will be reset when the VM is continued.
*/

QEMU seems to rely on vmsave to reset the clock state when the VM is continued; we just keep an eye on that.

Combined with the QEMU guest state change hook, the kernel's KVM_SET_CLOCK handler:

case KVM_SET_CLOCK: {
struct kvm_arch *ka = &kvm->arch;
struct kvm_clock_data user_ns;
u64 now_ns;

r = -EFAULT;
if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
goto out;

r = -EINVAL;
if (user_ns.flags)
goto out;

r = 0;
/*
* TODO: userspace has to take care of races with VCPU_RUN, so
* kvm_gen_update_masterclock() can be cut down to locked
* pvclock_update_vm_gtod_copy().
*/
kvm_gen_update_masterclock(kvm);

/*
* This pairs with kvm_guest_time_update(): when masterclock is
* in use, we use master_kernel_ns + kvmclock_offset to set
* unsigned 'system_time' so if we use get_kvmclock_ns() (which
* is slightly ahead) here we risk going negative on unsigned
* 'system_time' when 'user_ns.clock' is very small.
*/
spin_lock_irq(&ka->pvclock_gtod_sync_lock);
if (kvm->arch.use_master_clock)
now_ns = ka->master_kernel_ns;
else
now_ns = get_kvmclock_base_ns();
ka->kvmclock_offset = user_ns.clock - now_ns;
spin_unlock_irq(&ka->pvclock_gtod_sync_lock);

kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);

is used to update the guest clock.

Hands-on test to confirm clock updates

Enable KVM tracing:

echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

Then collect the output when a VM is migrated to this host:

cat /sys/kernel/debug/tracing/trace_pipe > trace_migrated_vm
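To focus on the clock-related events, the capture can be filtered afterwards (file name as above):

grep -E 'kvm_update_master_clock|kvm_write_tsc_offset|kvm_track_tsc' trace_migrated_vm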

We see the following logs first:

<...>-89383 [001] .... 97852.765277: kvm_update_master_clock: masterclock 0 hostclock 0x2 offsetmatched 0
<...>-89441 [002] d... 97852.785366: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446539041810541506
<...>-89441 [002] d... 97852.785402: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock 0x2
<...>-89442 [002] d... 97852.786522: kvm_write_tsc_offset: vcpu=1 prev=0 next=18446539041810541506
<...>-89442 [002] d... 97852.786533: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock 0x2
<...>-89443 [002] d... 97852.787341: kvm_write_tsc_offset: vcpu=2 prev=0 next=18446539041810541506
<...>-89443 [002] d... 97852.787348: kvm_track_tsc: vcpu_id 2 masterclock 0 offsetmatched 2 nr_online 3 hostclock 0x2
<...>-89444 [002] d... 97852.788099: kvm_write_tsc_offset: vcpu=3 prev=0 next=18446539041810541506
<...>-89444 [002] d... 97852.788120: kvm_track_tsc: vcpu_id 3 masterclock 0 offsetmatched 3 nr_online 4 hostclock 0x2

kvm_update_master_clock is used during VM migration.

And the TSC offset changed:

<...>-89441 [002] d... 97852.785366: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446539041810541506
<...>-89441 [002] d... 97852.785402: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock 0x2
<...>-89442 [002] d... 97852.786522: kvm_write_tsc_offset: vcpu=1 prev=0 next=18446539041810541506
<...>-89442 [002] d... 97852.786533: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock 0x2
<...>-89443 [002] d... 97852.787341: kvm_write_tsc_offset: vcpu=2 prev=0 next=18446539041810541506
<...>-89443 [002] d... 97852.787348: kvm_track_tsc: vcpu_id 2 masterclock 0 offsetmatched 2 nr_online 3 hostclock 0x2
<...>-89444 [002] d... 97852.788099: kvm_write_tsc_offset: vcpu=3 prev=0 next=18446539041810541506

<...>-89441 [003] d... 97852.872014: kvm_write_tsc_offset: vcpu=0 prev=18446539041810541506 next=18446539041810541506
<...>-89442 [003] d... 97852.872105: kvm_write_tsc_offset: vcpu=1 prev=18446539041810541506 next=18446539041810541506
<...>-89443 [003] d... 97852.872189: kvm_write_tsc_offset: vcpu=2 prev=18446539041810541506 next=18446539041810541506
<...>-89444 [003] d... 97852.872264: kvm_write_tsc_offset: vcpu=3 prev=18446539041810541506 next=18446539041810541506

<...>-89441 [000] d... 97856.399432: kvm_write_tsc_offset: vcpu=0 prev=18446539041810541506 next=18446562414330701094
<...>-89442 [000] d... 97856.403066: kvm_write_tsc_offset: vcpu=1 prev=18446539041810541506 next=18446562414330701094
<...>-89443 [000] d... 97856.403273: kvm_write_tsc_offset: vcpu=2 prev=18446539041810541506 next=18446562414330701094
<...>-89444 [000] d... 97856.403414: kvm_write_tsc_offset: vcpu=3 prev=18446539041810541506 next=18446562414330701094

Following the trace, we can find the Linux kernel code path:

kvm_vcpu_write_tsc_offset -> kvm_x86_write_l1_tsc_offset -> write_l1_tsc_offset -> vmx_write_l1_tsc_offset -> trace_kvm_write_tsc_offset

There are multiple call sites of kvm_vcpu_write_tsc_offset:

  • kvm_synchronize_tsc
    • MSR_IA32_TSC -> kvm_synchronize_tsc
    • kvm_vm_ioctl_create_vcpu -> kvm_arch_vcpu_postcreate -> kvm_synchronize_tsc
  • adjust_tsc_offset_guest
    • kvm_guest_time_update -> adjust_tsc_offset_guest, and kvm_hv_setup_tsc_page (this is the Hyper-V-impacted case)
    • MSR_IA32_TSC -> adjust_tsc_offset_guest
    • MSR_IA32_TSC_ADJUST -> adjust_tsc_offset_guest
    • kvm_arch_vcpu_load -> adjust_tsc_offset_host -> adjust_tsc_offset_guest
  • kvm_arch_vcpu_load same as above

So the following three uses of kvm_vcpu_write_tsc_offset match guest creation:

  • Create vcpu
  • Load vcpu
  • Adjust tsc offset

In the last guest-hang post, we saw the Windows guest try to read the reference counter:

static u64 get_time_ref_counter(struct kvm *kvm)
{
struct kvm_hv *hv = to_kvm_hv(kvm);
struct kvm_vcpu *vcpu;
u64 tsc;

/*
* Fall back to get_kvmclock_ns() when TSC page hasn't been set up,
* is broken, disabled or being updated.
*/
if (hv->hv_tsc_page_status != HV_TSC_PAGE_SET)
return div_u64(get_kvmclock_ns(kvm), 100);

vcpu = kvm_get_vcpu(kvm, 0);
tsc = kvm_read_l1_tsc(vcpu, rdtsc());
return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
+ hv->tsc_ref.tsc_offset;
}

But this is used for MSR read requests from the guest. Now we need to debug hv_tsc_page_status and the usage of kvm_hv_setup_tsc_page.

Without kvm hidden:

<...>-114210 [002] d... 12255.411580: kvm_exit: vcpu 1 reason MSR_READ rip 0xfffff800ece454c5 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
<...>-114210 [002] .... 12255.411581: kvm_msr: msr_read 40000020 = 0x6fac3c27
<...>-114210 [002] d... 12255.411582: kvm_entry: vcpu 1, rip 0xfffff800ece454c7
<...>-114211 [000] .... 12255.411585: kvm_vcpu_wakeup: wait time 1759974 ns, polling valid
<...>-114211 [000] .... 12255.411585: kvm_hv_timer_state: vcpu_id 2 hv_timer 0

We can find kvm_hv_timer_state in the trace, and according to the Linux kernel code:

TRACE_EVENT(kvm_hv_timer_state,
TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_in_use),

There are two paths that emit this trace:

  • start_sw_timer -> trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, false); which is always false (means 0 in trace)
  • start_hv_timer -> trace_kvm_hv_timer_state(vcpu->vcpu_id, ktimer->hv_timer_in_use); which returns hv_timer_in_use from ktimer->hv_timer_in_use

Check the code about start_hv_timer:

static bool start_hv_timer(struct kvm_lapic *apic)
{
struct kvm_timer *ktimer = &apic->lapic_timer;
struct kvm_vcpu *vcpu = apic->vcpu;
bool expired;

WARN_ON(preemptible());
if (!kvm_can_use_hv_timer(vcpu))
return false;

if (!ktimer->tscdeadline)
return false;

if (static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
return false;

ktimer->hv_timer_in_use = true;
hrtimer_cancel(&ktimer->timer);

/*
* To simplify handling the periodic timer, leave the hv timer running
* even if the deadline timer has expired, i.e. rely on the resulting
* VM-Exit to recompute the periodic timer's target expiration.
*/
if (!apic_lvtt_period(apic)) {
/*
* Cancel the hv timer if the sw timer fired while the hv timer
* was being programmed, or if the hv timer itself expired.
*/
if (atomic_read(&ktimer->pending)) {
cancel_hv_timer(apic);
} else if (expired) {
apic_timer_expired(apic, false);
cancel_hv_timer(apic);
}
}

trace_kvm_hv_timer_state(vcpu->vcpu_id, ktimer->hv_timer_in_use);

return true;
}

start_hv_timer sets ktimer->hv_timer_in_use to true, so the hv_timer 0 in our trace must come from start_sw_timer; we focus on that next.

There are several paths that reach start_sw_timer (most via restart_apic_timer):

  • restart_apic_timer -> start_sw_timer
    • vmx_exit_handlers_fastpath or __vmx_handle_exit -> handle_fastpath_preemption_timer -> kvm_lapic_expired_hv_timer -> restart_apic_timer
    • vcpu_block -> post_block -> vmx_post_block -> kvm_lapic_switch_to_hv_timer -> restart_apic_timer
    • MSR_IA32_TSC_DEADLINE -> handle_fastpath_set_tscdeadline -> kvm_set_lapic_tscdeadline_msr -> __start_apic_timer -> restart_apic_timer
    • APIC_TDCR -> restart_apic_timer
  • vcpu_block -> vmx_pre_block -> kvm_lapic_switch_to_sw_timer -> start_sw_timer

Because the trace shown earlier contains:

kvm_vcpu_wakeup: wait time 1759974 ns, polling valid

which is emitted in kvm_vcpu_block, restart_apic_timer must have been reached via vmx_post_block:

trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu));
kvm_arch_vcpu_block_finish(vcpu);

And because the code runs as:

if (!start_hv_timer(apic))
    start_sw_timer(apic);

start_hv_timer must return false:

static bool start_hv_timer(struct kvm_lapic *apic)
{
    struct kvm_timer *ktimer = &apic->lapic_timer;
    struct kvm_vcpu *vcpu = apic->vcpu;
    bool expired;

    WARN_ON(preemptible());
    if (!kvm_can_use_hv_timer(vcpu))
        return false;

    if (!ktimer->tscdeadline)
        return false;

    if (static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
        return false;

The kvm_can_use_hv_timer check seems to pass on x86 machines where X86_FEATURE_MWAIT is supported.

From the trace we can see that when a vCPU exits and comes back to work, the timer is updated. Take vcpu 3 as an example:

<...>-114212 [002] d... 12297.437890: kvm_exit: vcpu 3 reason HLT rip 0xfffff800ecc2b36e info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000

vcpu 3 executes HLT, causing a kvm_exit.

Then it wakes up after 4774180 ns and the hv_timer is traced as not in use.

<...>-114212 [002] .... 12255.393408: kvm_vcpu_wakeup: wait time 4774180 ns, polling valid
<...>-114212 [002] .... 12255.393410: kvm_hv_timer_state: vcpu_id 3 hv_timer 0

And the hv_timer will be cancelled after live migration:

if (apic->lapic_timer.hv_timer_in_use)
    cancel_hv_timer(apic);

Let's check the hv_timer state before migration.

Can we resolve compatibility issues?

Looking at the QEMU code: it clears FEAT_KVM after all features are set up, so we cannot manually assign those features:

for (l = plus_features; l; l = l->next) {
    const char *prop = l->data;
    object_property_set_bool(OBJECT(cpu), true, prop, &local_err);
    if (local_err) {
        goto out;
    }
}

for (l = minus_features; l; l = l->next) {
    const char *prop = l->data;
    object_property_set_bool(OBJECT(cpu), false, prop, &local_err);
    if (local_err) {
        goto out;
    }
}

if (!kvm_enabled() || !cpu->expose_kvm) {
    env->features[FEAT_KVM] = 0;
}

Virtio on Linux

Introduction

Virtio is an open standard that defines a protocol for communication between drivers and devices of different types, see Chapter 5 (“Device Types”) of the virtio spec ([1]). Originally developed as a standard for paravirtualized devices implemented by a hypervisor, it can be used to interface any compliant device (real or emulated) with a driver.


For illustrative purposes, this document will focus on the common case of a Linux kernel running in a virtual machine and using paravirtualized devices provided by the hypervisor, which exposes them as virtio devices via standard mechanisms such as PCI.


Device - Driver communication: virtqueues

Although the virtio devices are really an abstraction layer in the hypervisor, they’re exposed to the guest as if they are physical devices using a specific transport method – PCI, MMIO or CCW – that is orthogonal to the device itself. The virtio spec defines these transport methods in detail, including device discovery, capabilities and interrupt handling.


The communication between the driver in the guest OS and the device in the hypervisor is done through shared memory (that’s what makes virtio devices so efficient) using specialized data structures called virtqueues, which are actually ring buffers of buffer descriptors similar to the ones used in a network device:


struct vring_desc

Virtio ring descriptors, 16 bytes long. These can chain together via next.

Definition:

struct vring_desc {
    __virtio64 addr;
    __virtio32 len;
    __virtio16 flags;
    __virtio16 next;
};

Members

  • addr

    buffer address (guest-physical)

  • len

    buffer length

  • flags

    descriptor flags

  • next

    index of the next descriptor in the chain, if the VRING_DESC_F_NEXT flag is set. We chain unused descriptors via this, too.

All the buffers the descriptors point to are allocated by the guest and used by the host either for reading or for writing but not for both.

Refer to Chapter 2.5 (“Virtqueues”) of the virtio spec ([1]) for the reference definitions of virtqueues and “Virtqueues and virtio ring: How the data travels” blog post ([2]) for an illustrated overview of how the host device and the guest driver communicate.

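As a hedged illustration of descriptor chaining in the split ring, here is how a driver might link a device-readable header to a device-writable response buffer, using host-endian stand-ins for the __virtio fields (field widths match the struct above; addresses are hypothetical guest-physical values):

#include <stdint.h>

#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

/* Host-endian stand-in for the __virtioNN fields above. */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Chain a device-readable header with a device-writable response buffer. */
static void chain_two(struct vring_desc *ring,
                      uint64_t hdr_pa, uint32_t hdr_len,
                      uint64_t rsp_pa, uint32_t rsp_len)
{
    ring[0].addr  = hdr_pa;
    ring[0].len   = hdr_len;
    ring[0].flags = VRING_DESC_F_NEXT;   /* read-only for the device */
    ring[0].next  = 1;                   /* index of the chained descriptor */

    ring[1].addr  = rsp_pa;
    ring[1].len   = rsp_len;
    ring[1].flags = VRING_DESC_F_WRITE;  /* device writes the response */
    ring[1].next  = 0;                   /* no NEXT flag: end of chain */
}

Each buffer is used by the host either for reading (the header) or for writing (the response), never both, matching the rule above.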

The vring_virtqueue struct models a virtqueue, including the ring buffers and management data. Embedded in this struct is the virtqueue struct, which is the data structure that’s ultimately used by virtio drivers:

struct virtqueue

a queue to register buffers for sending or receiving.

Definition:

struct virtqueue {
    struct list_head list;
    void (*callback)(struct virtqueue *vq);
    const char *name;
    struct virtio_device *vdev;
    unsigned int index;
    unsigned int num_free;
    unsigned int num_max;
    void *priv;
    bool reset;
};

Members

  • list

    the chain of virtqueues for this device

  • callback

    the function to call when buffers are consumed (can be NULL).

  • name

    the name of this virtqueue (mainly for debugging)

  • vdev

    the virtio device this queue was created for.

  • index

    the zero-based ordinal number for this queue.

  • num_free

    number of elements we expect to be able to fit.

  • num_max

    the maximum number of elements supported by the device.

  • priv

    a pointer for the virtqueue implementation to use.

  • reset

    vq is in reset state or not.

Description

A note on num_free: with indirect buffers, each buffer needs one element in the queue, otherwise a buffer will need one element per sg element.

The callback function pointed by this struct is triggered when the device has consumed the buffers provided by the driver. More specifically, the trigger will be an interrupt issued by the hypervisor (see vring_interrupt()). Interrupt request handlers are registered for a virtqueue during the virtqueue setup process (transport-specific).


irqreturn_t vring_interrupt(int irq, void *_vq)

notify a virtqueue on an interrupt

Parameters

  • irq

    the interrupt number

  • _vq

    the virtqueue to notify

Description

Calls the callback function of _vq to process the virtqueue notification.

Device discovery and probing

In the kernel, the virtio core contains the virtio bus driver and transport-specific drivers like virtio-pci and virtio-mmio. Then there are individual virtio drivers for specific device types that are registered to the virtio bus driver.


How a virtio device is found and configured by the kernel depends on how the hypervisor defines it. Take the QEMU virtio-console device as an example: when using PCI as the transport method, the device will present itself on the PCI bus with vendor id 0x1af4 (Red Hat, Inc.) and device id 0x1003 (virtio console), as defined in the spec, so the kernel will detect it as it would any other PCI device.


During the PCI enumeration process, if a device is found to match the virtio-pci driver (according to the virtio-pci device table, any PCI device with vendor id = 0x1af4):


/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
static const struct pci_device_id virtio_pci_id_table[] = {
    { PCI_DEVICE(PCI_VENDOR_ID_REDHAT_QUMRANET, PCI_ANY_ID) },
    { 0 }
};

then the virtio-pci driver is probed and, if the probing goes well, the device is registered to the virtio bus:


static int virtio_pci_probe(struct pci_dev *pci_dev,
                            const struct pci_device_id *id)
{
    ...

    if (force_legacy) {
        rc = virtio_pci_legacy_probe(vp_dev);
        /* Also try modern mode if we can't map BAR0 (no IO space). */
        if (rc == -ENODEV || rc == -ENOMEM)
            rc = virtio_pci_modern_probe(vp_dev);
        if (rc)
            goto err_probe;
    } else {
        rc = virtio_pci_modern_probe(vp_dev);
        if (rc == -ENODEV)
            rc = virtio_pci_legacy_probe(vp_dev);
        if (rc)
            goto err_probe;
    }

    ...

    rc = register_virtio_device(&vp_dev->vdev);

When the device is registered to the virtio bus the kernel will look for a driver in the bus that can handle the device and call that driver’s probe method.

At this point, the virtqueues will be allocated and configured by calling the appropriate virtio_find helper function, such as virtio_find_single_vq() or virtio_find_vqs(), which will end up calling a transport-specific find_vqs method.

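To see how these pieces fit together, here is a hedged sketch of a minimal virtio driver skeleton; the device ID, names, and callback are hypothetical, and error handling and removal are trimmed:

#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/err.h>

#define VIRTIO_ID_EXAMPLE 42          /* hypothetical device type */

static void example_done(struct virtqueue *vq)
{
    /* Runs when vring_interrupt() sees the device returned buffers. */
}

static int example_probe(struct virtio_device *vdev)
{
    struct virtqueue *vq;

    /* Allocate and configure the single virtqueue of this device. */
    vq = virtio_find_single_vq(vdev, example_done, "requests");
    if (IS_ERR(vq))
        return PTR_ERR(vq);

    virtio_device_ready(vdev);
    return 0;
}

static const struct virtio_device_id example_id_table[] = {
    { VIRTIO_ID_EXAMPLE, VIRTIO_DEV_ANY_ID },
    { 0 },
};

static struct virtio_driver example_driver = {
    .driver.name = "virtio-example",
    .id_table    = example_id_table,
    .probe       = example_probe,
};
module_virtio_driver(example_driver);
MODULE_LICENSE("GPL");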

CPU feature configuration code diving

If we disable a feature in the libvirt domain XML configuration, what happens?

General code about libvirt CPU conf

Reading cpu_conf.c, the main entry point is virCPUDefFormatBuf.

Libvirt has two format modes:

  • CUSTOM: the user defines the model and features of the CPU conf
  • HOST_MODEL: matches the most suitable feature list against the host

And while handling the conf definition:

formatModel = (def->mode == VIR_CPU_MODE_CUSTOM ||
               def->mode == VIR_CPU_MODE_HOST_MODEL);
formatFallback = (def->type == VIR_CPU_TYPE_GUEST &&
                  (def->mode == VIR_CPU_MODE_HOST_MODEL ||
                   (def->mode == VIR_CPU_MODE_CUSTOM && def->model)));

see the enum:

typedef enum {
    VIR_CPU_TYPE_HOST,
    VIR_CPU_TYPE_GUEST,
    VIR_CPU_TYPE_AUTO,

    VIR_CPU_TYPE_LAST
} virCPUType;
  • VIR_CPU_TYPE_AUTO: detect from the input XML whether it is a guest or host CPU model definition
  • VIR_CPU_TYPE_GUEST: a guest CPU model, i.e. the CPU conf defined in the domain XML
  • VIR_CPU_TYPE_HOST: a host CPU model, i.e. the CPU conf loaded from the host capabilities XML

So we can focus on formatFallback.

Verification is required: using custom mode without a CPU model is not allowed, because custom means you must specify a CPU model and then customize a subset of its features.

if (!def->model && def->mode == VIR_CPU_MODE_CUSTOM && def->nfeatures) {
    virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                   _("Non-empty feature list specified without CPU model"));
    return -1;
}

While defining the model, a fallback value needs to be emitted for the guest CPU:

if ((formatModel && def->model) || formatFallback) {
    virBufferAddLit(buf, "<model");
    if (formatFallback) {
        const char *fallback;

        fallback = virCPUFallbackTypeToString(def->fallback);
        if (!fallback) {
            virReportError(VIR_ERR_INTERNAL_ERROR,
                           _("Unexpected CPU fallback value: %d"),
                           def->fallback);
            return -1;
        }
        virBufferAsprintf(buf, " fallback='%s'", fallback);
        if (def->vendor_id)
            virBufferEscapeString(buf, " vendor_id='%s'", def->vendor_id);
    }
    if (formatModel && def->model) {
        virBufferEscapeString(buf, ">%s</model>\n", def->model);
    } else {
        virBufferAddLit(buf, "/>\n");
    }
}

Fallback type:

typedef enum {
    VIR_CPU_FALLBACK_ALLOW,
    VIR_CPU_FALLBACK_FORBID,

    VIR_CPU_FALLBACK_LAST
} virCPUFallback;
  • VIR_CPU_FALLBACK_ALLOW: allow falling back to the closest CPU model the host supports
  • VIR_CPU_FALLBACK_FORBID: refuse to start the guest if the exact CPU model is not supported

The topology can also be defined from the XML:

if (def->sockets && def->cores && def->threads) {
    virBufferAddLit(buf, "<topology");
    virBufferAsprintf(buf, " sockets='%u'", def->sockets);
    virBufferAsprintf(buf, " cores='%u'", def->cores);
    virBufferAsprintf(buf, " threads='%u'", def->threads);
    virBufferAddLit(buf, "/>\n");
}


for (i = 0; i < def->nfeatures; i++) {
    virCPUFeatureDefPtr feature = def->features + i;

    if (!feature->name) {
        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                       _("Missing CPU feature name"));
        return -1;
    }

    if (def->type == VIR_CPU_TYPE_GUEST) {
        const char *policy;

        policy = virCPUFeaturePolicyTypeToString(feature->policy);
        if (!policy) {
            virReportError(VIR_ERR_INTERNAL_ERROR,
                           _("Unexpected CPU feature policy %d"),
                           feature->policy);
            return -1;
        }
        virBufferAsprintf(buf, "<feature policy='%s' name='%s'/>\n",
                          policy, feature->name);
    } else {
        virBufferAsprintf(buf, "<feature name='%s'/>\n",
                          feature->name);
    }
}

Features will follow policies:

VIR_ENUM_IMPL(virCPUFeaturePolicy, VIR_CPU_FEATURE_LAST,
              "force",
              "require",
              "optional",
              "disable",
              "forbid")

The following part explains those policies.

force

The virtual CPU will claim the feature is supported regardless of it being supported by host CPU.

require

Guest creation will fail unless the feature is supported by the host CPU or the hypervisor is able to emulate it.

optional

The feature will be supported by virtual CPU if and only if it is supported by host CPU.

disable

The feature will not be supported by virtual CPU.

forbid

Guest creation will fail if the feature is supported by host CPU.
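For instance, a domain XML fragment exercising these policies might look like the following; the model name and the chosen features are purely illustrative:

<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Skylake-Client</model>
  <topology sockets='1' cores='4' threads='1'/>
  <!-- fail to start unless tsc-deadline can be provided -->
  <feature policy='require' name='tsc-deadline'/>
  <!-- hide the hypervisor bit from the guest -->
  <feature policy='disable' name='hypervisor'/>
</cpu>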

virCPUDefFormatBuf is used by capabilities.c, which collects host features from the host capabilities XML. But now we need to check the code in domain_capabilities.c:

typedef virCPUDef *virCPUDefPtr;
struct _virCPUDef {
    int type;           /* enum virCPUType */
    int mode;           /* enum virCPUMode */
    int match;          /* enum virCPUMatch */
    virCPUCheck check;
    virArch arch;
    char *model;
    char *vendor_id;    /* vendor id returned by CPUID in the guest */
    int fallback;       /* enum virCPUFallback */
    char *vendor;
    unsigned int microcodeVersion;
    unsigned int sockets;
    unsigned int cores;
    unsigned int threads;
    size_t nfeatures;
    size_t nfeatures_max;
    virCPUFeatureDefPtr features;
    virCPUCacheDefPtr cache;
};

nfeatures is set in _virCPUDef, and the supported features are parsed from the domain XML:

/*
 * Parses CPU definition XML from a node pointed to by @xpath. If @xpath is
 * NULL, the current node of @ctxt is used (i.e., it is a shortcut to ".").
 *
 * Missing <cpu> element in the XML document is not considered an error unless
 * @xpath is NULL in which case the function expects it was provided with a
 * valid <cpu> element already. In other words, the function returns success
 * and sets @cpu to NULL if @xpath is not NULL and the node pointed to by
 * @xpath is not found.
 *
 * Returns 0 on success, -1 on error.
 */
int
virCPUDefParseXML(xmlXPathContextPtr ctxt,
                  const char *xpath,
                  virCPUType type,
                  virCPUDefPtr *cpu)

Finally, qemu_command.c turns those features into the QEMU command line:

for (i = 0; i < cpu->nfeatures; i++) {
    if (STREQ("rtm", cpu->features[i].name))
        rtm = true;
    if (STREQ("hle", cpu->features[i].name))
        hle = true;

    switch ((virCPUFeaturePolicy) cpu->features[i].policy) {
    case VIR_CPU_FEATURE_FORCE:
    case VIR_CPU_FEATURE_REQUIRE:
        if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_QUERY_CPU_MODEL_EXPANSION))
            virBufferAsprintf(buf, ",%s=on", cpu->features[i].name);
        else
            virBufferAsprintf(buf, ",+%s", cpu->features[i].name);
        break;

    case VIR_CPU_FEATURE_DISABLE:
    case VIR_CPU_FEATURE_FORBID:
        if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_QUERY_CPU_MODEL_EXPANSION))
            virBufferAsprintf(buf, ",%s=off", cpu->features[i].name);
        else
            virBufferAsprintf(buf, ",-%s", cpu->features[i].name);
        break;

    case VIR_CPU_FEATURE_OPTIONAL:
    case VIR_CPU_FEATURE_LAST:
        break;
    }
}

producing something like -cpu ...,feature1=on,feature2=off so those features take effect.

Turn to QEMU

First, QEMU parses the input -cpu ... string:

const char *parse_cpu_model(const char *cpu_model)
{
    ObjectClass *oc;
    CPUClass *cc;
    gchar **model_pieces;
    const char *cpu_type;

    model_pieces = g_strsplit(cpu_model, ",", 2);

    oc = cpu_class_by_name(CPU_RESOLVING_TYPE, model_pieces[0]);
    if (oc == NULL) {
        error_report("unable to find CPU model '%s'", model_pieces[0]);
        g_strfreev(model_pieces);
        exit(EXIT_FAILURE);
    }

    cpu_type = object_class_get_name(oc);
    cc = CPU_CLASS(oc);
    cc->parse_features(cpu_type, model_pieces[1], &error_fatal);
    g_strfreev(model_pieces);
    return cpu_type;
}

cpu_class_by_name(CPU_RESOLVING_TYPE, model_pieces[0]) returns a CPU object class that supports parsing features. See the following code:

static void cpu_common_parse_features(const char *typename, char *features,
                                      Error **errp)
{
    char *val;
    static bool cpu_globals_initialized;
    /* Single "key=value" string being parsed */
    char *featurestr = features ? strtok(features, ",") : NULL;

    /* should be called only once, catch invalid users */
    assert(!cpu_globals_initialized);
    cpu_globals_initialized = true;

    while (featurestr) {
        val = strchr(featurestr, '=');
        if (val) {
            GlobalProperty *prop = g_new0(typeof(*prop), 1);
            *val = 0;
            val++;
            prop->driver = typename;
            prop->property = g_strdup(featurestr);
            prop->value = g_strdup(val);
            prop->errp = &error_fatal;
            qdev_prop_register_global(prop);
        } else {
            error_setg(errp, "Expected key=value format, found %s.",
                       featurestr);
            return;
        }
        featurestr = strtok(NULL, ",");
    }
}

Each key=value pair is parsed and stored as a QEMU global property.

From: target/i386/cpu.c

QEMU defines #define CPUID_EXT_HYPERVISOR (1U << 31) in the CPUID ECX feature word to expose hypervisor information.

Then the x86 CPU uses those global properties to initialize the vCPU:

/* Parse "+feature,-feature,feature=foo" CPU feature string
 */
static void x86_cpu_parse_featurestr(const char *typename, char *features,
                                     Error **errp)
{
    char *featurestr; /* Single 'key=value" string being parsed */
    static bool cpu_globals_initialized;
    bool ambiguous = false;

    if (cpu_globals_initialized) {
        return;
    }
    cpu_globals_initialized = true;

    if (!features) {
        return;
    }

    for (featurestr = strtok(features, ",");
         featurestr;
         featurestr = strtok(NULL, ",")) {
        const char *name;
        const char *val = NULL;
        char *eq = NULL;
        char num[32];
        GlobalProperty *prop;

        /* Compatibility syntax: */
        if (featurestr[0] == '+') {
            plus_features = g_list_append(plus_features,
                                          g_strdup(featurestr + 1));
            continue;
        } else if (featurestr[0] == '-') {
            minus_features = g_list_append(minus_features,
                                           g_strdup(featurestr + 1));
            continue;
        }

        eq = strchr(featurestr, '=');
        if (eq) {
            *eq++ = 0;
            val = eq;
        } else {
            val = "on";
        }

        feat2prop(featurestr);
        name = featurestr;

        if (g_list_find_custom(plus_features, name, compare_string)) {
            warn_report("Ambiguous CPU model string. "
                        "Don't mix both \"+%s\" and \"%s=%s\"",
                        name, name, val);
            ambiguous = true;
        }
        if (g_list_find_custom(minus_features, name, compare_string)) {
            warn_report("Ambiguous CPU model string. "
                        "Don't mix both \"-%s\" and \"%s=%s\"",
                        name, name, val);
            ambiguous = true;
        }

        /* Special case: */
        if (!strcmp(name, "tsc-freq")) {
            int ret;
            uint64_t tsc_freq;

            ret = qemu_strtosz_metric(val, NULL, &tsc_freq);
            if (ret < 0 || tsc_freq > INT64_MAX) {
                error_setg(errp, "bad numerical value %s", val);
                return;
            }
            snprintf(num, sizeof(num), "%" PRId64, tsc_freq);
            val = num;
            name = "tsc-frequency";
        }

        prop = g_new0(typeof(*prop), 1);
        prop->driver = typename;
        prop->property = g_strdup(name);
        prop->value = g_strdup(val);
        prop->errp = &error_fatal;
        qdev_prop_register_global(prop);
    }

    if (ambiguous) {
        warn_report("Compatibility of ambiguous CPU model "
                    "strings won't be kept on future QEMU versions");
    }
}

which is registered as cc->parse_features = x86_cpu_parse_featurestr;.

Features from the QEMU command line are stored as global properties for the x86 CPU.

And before starting the virtual machine, QEMU ensures there are no unavailable or missing features:

static void x86_cpu_get_unavailable_features(Object *obj, Visitor *v,
                                             const char *name, void *opaque,
                                             Error **errp)
{
    X86CPU *xc = X86_CPU(obj);
    strList *result = NULL;

    x86_cpu_list_feature_names(xc->filtered_features, &result);
    visit_type_strList(v, "unavailable-features", &result, errp);
}

/* Check for missing features that may prevent the CPU class from
 * running using the current machine and accelerator.
 */
static void x86_cpu_class_check_missing_features(X86CPUClass *xcc,
                                                 strList **missing_feats)
{
    X86CPU *xc;
    Error *err = NULL;
    strList **next = missing_feats;

    if (xcc->host_cpuid_required && !accel_uses_host_cpuid()) {
        strList *new = g_new0(strList, 1);
        new->value = g_strdup("kvm");
        *missing_feats = new;
        return;
    }

    xc = X86_CPU(object_new(object_class_get_name(OBJECT_CLASS(xcc))));

    x86_cpu_expand_features(xc, &err);
    if (err) {
        /* Errors at x86_cpu_expand_features should never happen,
         * but in case it does, just report the model as not
         * runnable at all using the "type" property.
         */
        strList *new = g_new0(strList, 1);
        new->value = g_strdup("type");
        *next = new;
        next = &new->next;
    }

    x86_cpu_filter_features(xc, false);

    x86_cpu_list_feature_names(xc->filtered_features, next);

    object_unref(OBJECT(xc));
}

While QEMU initializes the CPU:

static void x86_cpu_realizefn(DeviceState *dev, Error **errp)

features are set on the CPU object:

if (!kvm_enabled() || !cpu->expose_kvm) {
    env->features[FEAT_KVM] = 0;
}

We can find the "hypervisor" CPU feature defined in FEAT_1_ECX:

[FEAT_1_ECX] = {
    .type = CPUID_FEATURE_WORD,
    .feat_names = {
        "pni" /* Intel,AMD sse3 */, "pclmulqdq", "dtes64", "monitor",
        "ds-cpl", "vmx", "smx", "est",
        "tm2", "ssse3", "cid", NULL,
        "fma", "cx16", "xtpr", "pdcm",
        NULL, "pcid", "dca", "sse4.1",
        "sse4.2", "x2apic", "movbe", "popcnt",
        "tsc-deadline", "aes", "xsave", "osxsave",
        "avx", "f16c", "rdrand", "hypervisor",
    },
    .cpuid = { .eax = 1, .reg = R_ECX, },
    .tcg_features = TCG_EXT_FEATURES,
}

Then the CPU filters those features word by word:

/*
 * Finishes initialization of CPUID data, filters CPU feature
 * words based on host availability of each feature.
 *
 * Returns: 0 if all flags are supported by the host, non-zero otherwise.
 */
static void x86_cpu_filter_features(X86CPU *cpu, bool verbose)
{
    CPUX86State *env = &cpu->env;
    FeatureWord w;
    const char *prefix = NULL;

    if (verbose) {
        prefix = accel_uses_host_cpuid()
                 ? "host doesn't support requested feature"
                 : "TCG doesn't support requested feature";
    }

    for (w = 0; w < FEATURE_WORDS; w++) {
        uint64_t host_feat =
            x86_cpu_get_supported_feature_word(w, false);
        uint64_t requested_features = env->features[w];
        uint64_t unavailable_features = requested_features & ~host_feat;
        mark_unavailable_features(cpu, w, unavailable_features, prefix);
    }

    if ((env->features[FEAT_7_0_EBX] & CPUID_7_0_EBX_INTEL_PT) &&
        kvm_enabled()) {
        KVMState *s = CPU(cpu)->kvm_state;
        uint32_t eax_0 = kvm_arch_get_supported_cpuid(s, 0x14, 0, R_EAX);
        uint32_t ebx_0 = kvm_arch_get_supported_cpuid(s, 0x14, 0, R_EBX);
        uint32_t ecx_0 = kvm_arch_get_supported_cpuid(s, 0x14, 0, R_ECX);
        uint32_t eax_1 = kvm_arch_get_supported_cpuid(s, 0x14, 1, R_EAX);
        uint32_t ebx_1 = kvm_arch_get_supported_cpuid(s, 0x14, 1, R_EBX);

        if (!eax_0 ||
            ((ebx_0 & INTEL_PT_MINIMAL_EBX) != INTEL_PT_MINIMAL_EBX) ||
            ((ecx_0 & INTEL_PT_MINIMAL_ECX) != INTEL_PT_MINIMAL_ECX) ||
            ((eax_1 & INTEL_PT_MTC_BITMAP) != INTEL_PT_MTC_BITMAP) ||
            ((eax_1 & INTEL_PT_ADDR_RANGES_NUM_MASK) <
             INTEL_PT_ADDR_RANGES_NUM) ||
            ((ebx_1 & (INTEL_PT_PSB_BITMAP | INTEL_PT_CYCLE_BITMAP)) !=
             (INTEL_PT_PSB_BITMAP | INTEL_PT_CYCLE_BITMAP)) ||
            (ecx_0 & INTEL_PT_IP_LIP)) {
            /*
             * Processor Trace capabilities aren't configurable, so if the
             * host can't emulate the capabilities we report on
             * cpu_x86_cpuid(), intel-pt can't be enabled on the current host.
             */
            mark_unavailable_features(cpu, FEAT_7_0_EBX, CPUID_7_0_EBX_INTEL_PT, prefix);
        }
    }
}

Mainly, the features are filtered by this part:

for (w = 0; w < FEATURE_WORDS; w++) {
    uint64_t host_feat =
        x86_cpu_get_supported_feature_word(w, false);
    uint64_t requested_features = env->features[w];
    uint64_t unavailable_features = requested_features & ~host_feat;
    mark_unavailable_features(cpu, w, unavailable_features, prefix);
}

In this loop an unsupported feature ends up cleared: requested_features & ~host_feat computes the requested features the host cannot provide, and mark_unavailable_features() masks them out.
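A tiny stand-alone illustration of that mask arithmetic (the bit values are made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Made-up feature word: bits 0, 1 and 3 requested, host supports 0 and 1. */
    uint64_t requested_features   = 0xB;  /* 0b1011 */
    uint64_t host_feat            = 0x3;  /* 0b0011 */
    uint64_t unavailable_features = requested_features & ~host_feat;

    /* Prints 0x8: bit 3 is the feature that gets filtered out. */
    printf("unavailable mask: %#llx\n",
           (unsigned long long)unavailable_features);
    return 0;
}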

We can dump those configurations from the QEMU vCPU to check their usage.

How the kernel uses it

Now we move to the Linux kernel to check how those features are used.

#define X86_FEATURE_HYPERVISOR (4*32+31) /* Running on a hypervisor */

The kernel uses X86_FEATURE_HYPERVISOR to tell whether it is running on a hypervisor.
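The same bit can be checked from user space. A hedged sketch using GCC's cpuid.h (CPUID leaf 1, ECX bit 31, matching the 4*32+31 position above):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: ECX bit 31 is the hypervisor bit. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("hypervisor bit: %s\n", (ecx & (1u << 31)) ? "set" : "clear");
    return 0;
}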

Hands-on test

Now we try to run a guest that detects the hypervisor and figure out how to bypass the detection through virtualization-level configuration.

Linux

Use the cpuid tool (http://www.etallen.com/cpuid.html) to dump the CPUID of a guest and check our configuration.

Running cpuid to dump features, we see the following output with our expected values:

feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = true
PCLMULDQ instruction = true
DTES64: 64-bit debug store = false
MONITOR/MWAIT = false
CPL-qualified debug store = false
VMX: virtual machine extensions = true
SMX: safer mode extensions = false
Enhanced Intel SpeedStep Technology = false
TM2: thermal monitor 2 = false
SSSE3 extensions = true
context ID: adaptive or shared L1 data = false
SDBG: IA32_DEBUG_INTERFACE = false
FMA instruction = true
CMPXCHG16B instruction = true
xTPR disable = false
PDCM: perfmon and debug = false
PCID: process context identifiers = true
DCA: direct cache access = false
SSE4.1 extensions = true
SSE4.2 extensions = true
x2APIC: extended xAPIC support = true
MOVBE instruction = true
POPCNT instruction = true
time stamp counter deadline = true
AES instruction = true
XSAVE/XSTOR states = true
OS-enabled XSAVE/XSTOR = true
AVX: advanced vector extensions = true
F16C half-precision convert instruction = true
RDRAND instruction = true
hypervisor guest status = true

The hypervisor guest status = true line matches the Linux kernel's definition.

With the hypervisor feature disabled, the output changes to:

feature information (1/edx):
x87 FPU on chip = true
VME: virtual-8086 mode enhancement = true
DE: debugging extensions = true
PSE: page size extensions = true
TSC: time stamp counter = true
RDMSR and WRMSR support = true
PAE: physical address extensions = true
MCE: machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSENTER and SYSEXIT = true
MTRR: memory type range registers = true
PTE global bit = true
MCA: machine check architecture = true
CMOV: conditional move/compare instr = true
PAT: page attribute table = true
PSE-36: page size extension = true
PSN: processor serial number = false
CLFLUSH instruction = true
DS: debug store = false
ACPI: thermal monitor and clock ctrl = false
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
SSE2 extensions = true
SS: self snoop = true
hyper-threading / multi-core supported = true
TM: therm. monitor = false
IA64 = false
PBE: pending break event = false
feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = true
PCLMULDQ instruction = true
DTES64: 64-bit debug store = false
MONITOR/MWAIT = false
CPL-qualified debug store = false
VMX: virtual machine extensions = true
SMX: safer mode extensions = false
Enhanced Intel SpeedStep Technology = false
TM2: thermal monitor 2 = false
SSSE3 extensions = true
context ID: adaptive or shared L1 data = false
SDBG: IA32_DEBUG_INTERFACE = false
FMA instruction = true
CMPXCHG16B instruction = true
xTPR disable = false
PDCM: perfmon and debug = false
PCID: process context identifiers = true
DCA: direct cache access = false
SSE4.1 extensions = true
SSE4.2 extensions = true
x2APIC: extended xAPIC support = true
MOVBE instruction = true
POPCNT instruction = true
time stamp counter deadline = true
AES instruction = true
XSAVE/XSTOR states = true
OS-enabled XSAVE/XSTOR = true
AVX: advanced vector extensions = true
F16C half-precision convert instruction = true
RDRAND instruction = true
hypervisor guest status = false

The hypervisor guest status = false value changed as expected.

Linux drawbacks

Reading the usage of X86_FEATURE_HYPERVISOR in the Linux kernel, some drawbacks can be found directly in the code.

From qspinlock.h:

/*
 * RHEL7 specific:
 * To provide backward compatibility with pre-7.4 kernel modules that
 * inlines the ticket spinlock unlock code. The virt_spin_lock() function
 * will have to recognize both a lock value of 0 or _Q_UNLOCKED_VAL as
 * being in an unlocked state.
 */
static inline bool virt_spin_lock(struct qspinlock *lock)
{
    int lockval;

    if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
        return false;
With the hypervisor bit hidden, this returns false and the virtualization-friendly spinlock path is skipped.

From paravirt-spinlocks.c:

static int __init queued_enable_pv_ticketlock(void)
{
    if (!static_cpu_has(X86_FEATURE_HYPERVISOR) ||
        (pv_lock_ops.queued_spin_lock_slowpath !=
         native_queued_spin_lock_slowpath))
        static_key_slow_inc(&paravirt_ticketlocks_enabled);
    return 0;
}

From tsc.c:

/*
 * Don't enable ART in a VM, non-stop TSC required,
 * and the TSC counter resets must not occur asynchronously.
 */
if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
    art_to_tsc_denominator < ART_MIN_DENOMINATOR ||
    tsc_async_resets)
    return;

With the hypervisor bit hidden, ART (the Always Running Timer) may be enabled inside the VM even though it should not be.

From apic.c:

if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER) ||
    boot_cpu_has(X86_FEATURE_HYPERVISOR))
    return;

From mshyperv.c:

if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
    return 0;

The kernel cannot detect that it is running on Hyper-V.

From radeon_device.c:

/**
 * radeon_device_is_virtual - check if we are running is a virtual environment
 *
 * Check if the asic has been passed through to a VM (all asics).
 * Used at driver startup.
 * Returns true if virtual or false if not.
 */
bool radeon_device_is_virtual(void)
{
#ifdef CONFIG_X86
    return boot_cpu_has(X86_FEATURE_HYPERVISOR);
#else
    return false;
#endif
}

The Radeon GPU driver will not detect that it is running in a guest.

In short, the kernel may fail to detect that it is running on a hypervisor, so the related virtualization-aware optimizations are not applied and those guests see a performance drop.

Likewise, userspace applications cannot adapt their behavior without knowing they run in a virtual machine.

Packed virtqueue: How to reduce overhead with virtio

This is the final post of a three-post series, the previous posts are “Virtio devices and drivers overview: The headjack and the phone,” and “Virtqueues and virtio ring: How the data travels.”


Split virtqueue issues: Too much spinning around

While the split virtqueue shines because of the simplicity of its design, it has a fundamental problem: The avail-used buffer cycle needs to use memory in a very sparse way. This puts pressure on the CPU cache utilization, and in the case of hardware means several PCI transactions for each descriptor.


Packed virtqueue amends it by merging the three rings in just one location in virtual environment guest memory. While this may seem complicated at first glance, it’s a natural step after the split version if we realize that the device can discard and overwrite the data it already has read from the driver, and the same happens the other way around.


Supplying descriptors to the device: How to fill device todo-list

After initialization in the same process as described in Virtio device initialization: feature bits, and after agreement on the RING_PACKED feature flag, the driver and the device start with a shared blank canvas of descriptors of an agreed length (up to 2^15 entries) at an agreed guest memory location. The layout of these descriptors is:

struct virtq_desc {
    le64 addr;
    le32 len;
    le16 id;
    le16 flags;
};

Listing: Memory layout of a packed virtqueue descriptor


This time, the id field is not an index for the device to look up the buffer: it is an opaque value to the device and only has meaning for the driver.

The driver also maintains an internal single-bit ring wrap counter initialized to 1. The driver will flip its value every time it makes available the last descriptor in the ring.

As with split descriptors, the first step is to write the different fields: address, length, id and flags. However, packed descriptors take into account two new flags: AVAIL (bit 7) and USED (bit 15). To mark a descriptor as available, the driver sets the AVAIL flag equal to its internal wrap counter, and the used flag to the inverse. While a plain binary avail/used flag would be easier to implement, it would prevent useful optimizations we will describe later.

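Putting that flag rule into code, here is a minimal sketch assuming host-endian fields and ignoring the memory barriers a real driver needs before publishing flags:

#include <stdint.h>
#include <stdbool.h>

#define VIRTQ_DESC_F_WRITE  2
#define VIRTQ_DESC_F_AVAIL  (1 << 7)
#define VIRTQ_DESC_F_USED   (1 << 15)

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Mark ring[idx] available: the avail flag equals the driver's wrap
 * counter, the used flag is its inverse. Real drivers write flags last,
 * behind a write barrier, so the device never sees a half-written entry. */
static void make_avail(struct virtq_desc *ring, uint16_t idx, bool wrap,
                       uint64_t addr, uint32_t len, uint16_t id)
{
    uint16_t flags = VIRTQ_DESC_F_WRITE;

    if (wrap)
        flags |= VIRTQ_DESC_F_AVAIL;   /* avail=1, used=0 */
    else
        flags |= VIRTQ_DESC_F_USED;    /* avail=0, used=1 */

    ring[idx].addr  = addr;
    ring[idx].len   = len;
    ring[idx].id    = id;
    ring[idx].flags = flags;           /* publish last */
}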

As an example, the driver allocates a 0x1000-byte write buffer at 0x80000000 (step 1 in the diagram) and makes it the first available descriptor by setting the AVAIL flag equal to its internal wrap counter (set) in step 2. The descriptor table then looks like this:

Avail idx   Address      Length   ID   Flags   Used idx
            0x80000000   0x1000   0    W|A

Figure: Descriptor table after adding the first buffer


Note that the avail and used idx columns are shown just for guidance; they don't exist in the descriptor table. Each side keeps an internal counter to know which position to poll or write next, and the device must also track the driver's wrap counter. Lastly, as with the split virtqueue, the driver notifies the device if the latter has notifications enabled (step 3 in the diagram).


And the usual diagram of the updates. Note the lack of the avail and used ring, as only the descriptor table is needed now.


Diagram: Driver makes available a descriptor using a packed queue

Returning used descriptors: How the device fills the “done” list

Like the driver, the device maintains an internal single-bit ring wrap counter initialized to 1, and it knows the driver's internal wrap counter also starts set. When the device first searches for a descriptor the driver has made available, it polls the first entry of the ring, looking for an avail flag equal to the driver's internal wrap flag (set in this case).

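The matching device-side check could look like this sketch (same host-endian assumption as the driver-side one above):

#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

/* flags is the flags word of a packed descriptor. A descriptor is available
 * when its avail flag matches the driver's wrap counter and its used flag
 * does not. */
static bool desc_is_avail(uint16_t flags, bool driver_wrap)
{
    bool avail = flags & VIRTQ_DESC_F_AVAIL;
    bool used  = flags & VIRTQ_DESC_F_USED;

    return avail == driver_wrap && used != driver_wrap;
}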

As with a used ring, the length of the written data (if any) is returned in the length field, together with the id of the used descriptor. Finally, the device sets the avail (A) and used (U) flags equal to its own internal wrap counter.

Following the example, the device leaves the descriptor table as in the figure below. The driver will know the buffer has been returned because the used flag now matches the avail flag, and both match the device's internal wrap counter at the moment it wrote the descriptor. The returned address is not important: only the ID.

Avail idx   Address      Length   ID   Flags     Used idx
            0x80000000   0x1000   0    W|A|U

Figure: Descriptor table after the device returns the first buffer


Diagram: Device marks a descriptor as used using a packed queue

Wrapping the descriptor ring: How do the lanes stay separate?

When the driver fills the complete descriptor table, it wraps and flips its internal Driver Ring Wrap Counter. So, in the second round, available descriptors will have the avail and used flags clear, and the device will have to poll for this condition once it wraps while reading descriptors. Let's walk through a full example of the different situations.


Say we have a descriptor table with only two entries and the Driver Ring Wrap Counter is set. The driver fills the descriptor table, making two buffers available at the start of the operation, and flips its internal wrap counter, which becomes clear (0). We get the next table:

Avail idx   Address      Length   ID   Flags   Used idx
            0x80000000   0x1000   0    W|A
            0x81000000   0x1000   1    W|A

Figure: Full two-entries descriptor table


After that, the device sees that both descriptors, id #0 and id #1, are available: it knows the driver's wrap counter was set when it wrote them, the avail flag is set on both, and the used flag is clear on both. If the device uses the descriptor with id #1 first, we get the following descriptor table. The buffer #0 still belongs to the device!

Avail idx   Address      Length   ID   Flags     Used idx
            0x80000000   0x1000   1    W|A|U
            0x81000000   0x1000   1    W|A

Figure: Using first buffer out of order


Now the driver sees that buffer #1 has been used, since its avail and used flags are equal (both set) and match the device's internal wrap counter at the moment it wrote it. If the device now uses buffer id #0, the table looks like this:

Avail idx   Address      Length   ID   Flags     Used idx
            0x80000000   0x1000   1    W|A|U
            0x81000000   0x1000   0    W|A|U

Figure: Using second buffer out of order


But there is a more interesting case: starting from the "first buffer out of order" situation, the driver makes buffer #1 available again. In that case, the descriptor table goes directly from the "first buffer" figure to the next one, "Full two-entries descriptor table."

Avail idx   Address      Length   ID   Flags       Used idx
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A

Figure: Full two-entries descriptor table


Chained descriptors: No more jumps

Chained descriptors work likewise: no need for the next field in the head (or subsequent) descriptor in the chain to search subsequent ones, since the latter always occupies the next position. However, while in the split used ring you only need to return as used the id of the head of the chain, in packed you only need to return the tail id.


Back to the used ring: every time we use chained descriptors, the used idx lags behind the avail idx, since more than one descriptor is marked available to the device but only one is returned as used to the driver. While this is not a problem in the split ring, it would cause descriptor entry exhaustion in the packed version.


The straightforward solution is to make the device mark as used every descriptor in the chain. However, this can be expensive, since we are modifying a shared area of memory, and could cause cache bounces.

However, the driver already knows the chain, so it can skip the whole chain given only the last id. This is why we need to compare the used/avail flag pair with the driver/device wrap counters: after such a jump, with only a binary available/used flag we wouldn't know whether the next descriptor was made available in this driver round or the next one.


For example, in a four-entry ring, the driver makes available a chain of three descriptors:

Avail idx   Address      Length   ID   Flags   Used idx
            0x80000000   0x1000   0    W|A
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A

Figure: Three chained descriptors available


After that, the device discovers the chain (polling position 0) and marks it as used, overwriting only position 0 and skipping positions 1 and 2 completely. When the driver polls for used descriptors, it skips them too, knowing that the chain was 3 descriptors long:

Avail idx   Address      Length   ID   Flags     Used idx
            0x80000000   0x1000   2    W|A|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A

Figure: Using the descriptor chain


Now the driver produces another chain, two descriptors long, and it has to take the wrapping into account:

Avail idx   Address      Length   ID   Flags       Used idx
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            0x80000000   0x1000   0    W|A

Figure: Make available another descriptor chain


And the device marks it as used, so only the first descriptor in the chain (4th in the table) needs to be updated.

Avail idx   Address      Length   ID   Flags       Used idx
            0x81000000   0x1000   1    W|(!A)|U
            0x81000000   0x1000   1    W|A
            0x82000000   0x1000   2    W|A
            0x80000000   0x1000   0    W|A|U

Figure: Using another descriptor chain

Although the next descriptor (2nd) seems available, since its avail flag differs from its used flag, the device knows it is not, because it tracks the driver's internal wrap counter: the right flag combination for this round is avail clear, used set.


Indirect descriptors: When chains are not enough

Indirect descriptors work like in the split case. First, the driver allocates, anywhere in memory, a table of indirect descriptors, each with the same layout as regular packed descriptors. After that, it sets each descriptor in this indirect table to a buffer it wants to make available to the device (steps 1-2), and inserts a descriptor in the virtqueue with the VIRTQ_DESC_F_INDIRECT (0x4) flag set (step 3). The descriptor's address and length correspond to those of the indirect table.


In packed layout buffers must come in order in the indirect table, and the ID field is completely ignored. Also, the only valid flag for them is VIRTQ_DESC_F_WRITE, others are reserved and ignored by the device. As usual, the driver will notify the device if the conditions for the notification are met (step 4).


Diagram: Driver makes available a descriptor using a packed queue

For example, the driver would need to allocate this 48-byte table for a 3-descriptor indirect table:

Address      Length   ID   Flags
0x80000000   0x1000        W
0x81000000   0x1000        W
0x82000000   0x1000        W

Figure: Three descriptor long indirect packed table

And it then inserts the indirect table as the first entry in the descriptor table; assuming the table is allocated at address 0x83000000:

Avail idx   Address      Length   ID   Flags   Used idx
            0x83000000   48       0    A|I

Figure: Driver makes an indirect table available

After indirect buffer consumption, the device needs to return the indirect buffer id (0 in the example) in its used descriptor. The table looks like the return of the first buffer, except for the indirect (I) flag set:

Avail idx   Address      Length   ID   Flags   Used idx
            0x83000000   48       0    A|U|I

Figure: Device makes an indirect table used

After that, the device cannot access the memory table anymore unless the driver makes it available again, so the latter can free or reuse it.
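As a hedged sketch under the same assumptions as before (host-endian fields, no barriers), making an indirect table available could look like:

#include <stdint.h>
#include <stdbool.h>

#define VIRTQ_DESC_F_WRITE    2
#define VIRTQ_DESC_F_INDIRECT 4
#define VIRTQ_DESC_F_AVAIL    (1 << 7)
#define VIRTQ_DESC_F_USED     (1 << 15)

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Point ring[idx] at a separately allocated indirect table. Buffers inside
 * the table must be in order; their id fields are ignored, and only
 * VIRTQ_DESC_F_WRITE is honored there. */
static void make_indirect_avail(struct virtq_desc *ring, uint16_t idx,
                                bool wrap, struct virtq_desc *table,
                                uint16_t nbufs, uint16_t id)
{
    uint16_t flags = VIRTQ_DESC_F_INDIRECT;

    flags |= wrap ? VIRTQ_DESC_F_AVAIL : VIRTQ_DESC_F_USED;

    ring[idx].addr  = (uint64_t)(uintptr_t)table; /* guest-physical in real drivers */
    ring[idx].len   = nbufs * sizeof(*table);     /* 48 bytes for 3 buffers */
    ring[idx].id    = id;
    ring[idx].flags = flags;                      /* publish last */
}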

Notifications: how to manage interruptions?

As with the split queue, each side of the communication maintains two identical structures used for controlling notifications between the device and the driver. The driver's one is read-only for the device, and the device's one is read-only for the driver.

The struct layout is:

struct pvirtq_event_suppress {
    le16 desc;
    le16 flags;
};

Listing: Event suppression struct notification


The member flags can take the values:

  • 0: Notifications are enabled
  • 1: Notifications are disabled
  • 2: Notifications are enabled only for a specific descriptor, specified by the desc member.

If the flags value is 2, the other side will notify when the wrap counter matches the most significant bit of desc and the descriptor at the position given by desc (discarding that bit) is made used/available. For this mode to work, the VIRTIO_F_RING_EVENT_IDX flag needs to be negotiated in Virtio device initialization: feature bits.
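A hedged sketch of how a driver might fill this struct for the flags == 2 case; the flag-name macros are illustrative:

#include <stdint.h>
#include <stdbool.h>

struct pvirtq_event_suppress {
    uint16_t desc;
    uint16_t flags;
};

#define RING_EVENT_FLAGS_ENABLE  0x0
#define RING_EVENT_FLAGS_DISABLE 0x1
#define RING_EVENT_FLAGS_DESC    0x2

/* Ask for a notification when descriptor `idx` of the round indicated by
 * `wrap` is made used/available: the MSB of desc carries the wrap counter,
 * the low 15 bits the ring position. Requires VIRTIO_F_RING_EVENT_IDX. */
static void request_event_at(struct pvirtq_event_suppress *es,
                             uint16_t idx, bool wrap)
{
    es->desc  = (uint16_t)((idx & 0x7fff) | ((uint16_t)wrap << 15));
    es->flags = RING_EVENT_FLAGS_DESC;
}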

None of these mechanisms is 100% reliable, since the other side could already have sent the notification when we set the values, so expect notifications even when they are disabled.

Note that, since the descriptor ring size is not being forced to be a power of two (comparing with the split version), the notification structure can fit in the same page as the descriptor table. This can be advantageous for some implementations.


Summary

In this series we have taken you through the different virtio data plane layouts and their virtqueue implementations. They are the means for virtio devices and virtio drivers to exchange information.

We started by covering the simpler and less optimized split virtqueue layout. This layout is relatively easy to implement and debug, so it's a good entry point for learning the virtio dataplane basics.

We then moved on to the packed virtqueue layout specified in virtio 1.1 which allows requests exchange using a more compact descriptor representation. This avoids all the overhead of scattering the data through memory, avoiding cache contention and reducing the PCI transactions in case of actual hardware.


We also covered a number of optimizations on top of both ring layouts which depends on the communication/device type or how each part is implemented. Mainly, they are oriented to reduce the communication overhead, both in notifications and in memory transactions. Virtio offers a simple protocol to communicate what features and optimizations support each side, so they can agree on how the data is going to be exchanged and is highly future-proof.


This series covered the essence of the virtio data plane and provided you with the tool to analyze and develop your own virtio device and drivers. It should be noted that this series summarizes the relevant sections from the virtio spec thus you should refer to the spec for additional information and see it as the source of truth.

In the next posts we will return to vDPA including the kernel framework, hands on blogs and vDPA in Kubernetes.
