Phenomenon: When creating a new virtual machine, after the virtual machine enters the “running” state (libvirt reports running, and the qemu process starts), a disk is loaded. During the kernel startup process, the disk (vdb) is recognized, and then qemu receives a device removal event, which is fed back to libvirt. Libvirt updates the XML, causing inconsistency between the disk state recorded in the zstack database and the XML on the host.
The main issue here is that the libvirt loading device interface returns success, and the XML corresponding to the device is also added. However, this device is deleted according to the event feedback from qemu.
Important log information: Here, let’s first analyze the system logs in the guest OS:
Here, we notice the logs related to pciehp because this virtual machine is UEFI-booted, leading to numerous pcie-related logs (due to UEFI boot requiring the q35 machine type, which defaults to pcie devices).
The initially observed logs include an error log from pcieport:
1 | pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000] |
Followed by the recognition of the vdb device:
1 | _virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (537 GB/500 GiB) |
An external interrupt is sent to the virtual machine:
1 | pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed |
Subsequently, through ausearch, it is identified that libvirt received a device deletion event, leading to the removal of the mentioned device:
libvirt received device deleted event, removing the device
Based on these scenarios, we have summarized the steps to reproduce the issue:
- During the kernel startup process
- Load the data disk.
- Check for inconsistencies between the XML and the database.
- Through repeated testing of VM boot and data disk loading, the issue can be reproduced.
Regarding the error logs mentioned above, the explanation is as follows:
pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]
:- According to https://access.redhat.com/solutions/3144711, this error may occur because in virtualized environments, there might be more PCIe ports than in a real physical environment, leading to this error. However, it does not have any actual impact.
_virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (537 GB/500 GiB)
:- This indicates that the virtio-blk disk has indeed been successfully loaded, as it has been recognized within the virtual machine.
pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed
:- “Attention button pressed” indicates that when resetting the PCIe slot, QEMU sends the corresponding interrupt. When the host receives this interrupt, the corresponding processing logic prints this log.
As for the key QEMU code, by searching the codebase, it has been confirmed that QEMU sends the corresponding interrupt when resetting the PCIe slot, and the host prints the log as part of the corresponding processing logic.
code from pcie.c
1 | pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA, |
which is used in qdev.c
1 | QLIST_FOREACH(bus, &dev->child_bus, sibling) { |
During the device hotplug process, there will be a reset action.
Kernel-related code:
from pciehp_hpc.c
1 | static int pciehp_poll(void *data) |
This code represents a kernel function for polling PCIe Hot Plug events. It uses a kernel thread (kthread) to continuously poll for interrupt events or user requests related to PCIe Hot Plug. The function includes a timeout mechanism with an initial delay of 10 seconds and then repeats the polling process based on the specified polling time. The function stops when the kernel thread should stop (kthread_should_stop() returns true).
And pciehp_handle_button_press is implemented as following:
1 | void pciehp_handle_button_press(struct controller *ctrl) |
Both OFF_STATE and ON_STATE could changed to each other by same press request.
Based on the test results, we preliminarily conclude that there is a race condition between hot-plug operations and kernel boot, leading to unexpected changes in the PCIe slot’s state from off → on → off. (Note: The crucial point here is that pciehp_handle_button_press(ctrl); simultaneously handles both on and off scenarios.)
With reference to the above keywords, we identified a related Bugzilla entry for QEMU version 4.2 by searching for ‘qemu pci device kernel boot race condition’:
https://bugzilla.kernel.org/show_bug.cgi?id=211691
“The document mentions a virtio-net failover mechanism introduced by QEMU 4.2, addressing the issue of hot-plugging network cards failing during the VM startup phase. This problem arises from a race condition in the QEMU code that sets the PCIe slot’s state. The provided QEMU patch resolves the issue:
The title of this patch is: ‘pcie: don’t set link state active if the slot is empty.’
Upon reviewing its content, it appears that during PCIe initialization and the hot-plug phase, the ‘reset’ is called, potentially causing inconsistencies in the slot’s state. This patch addresses the problem by preventing the setting of the link state to active if the slot is empty, eliminating the observed issue.”
TIPs:
Search for changes related to the virtual machine process using ausearch
:
1 | ausearch -m "VIRT_RESOURCE" -p 63259 |
Libvirt’s XML and QEMU event update mechanism: Details can be found in TIC-1360 - Cloud VM disk does not exist, capacity inconsistency between UI interface and underlying view (Closed).
Quick reference for PCIe events:
From linux pci_regs.h:
1 |
QEMU systemtap trace, refer to: QEMU Tracing Documentation
1 | /usr/bin/qemu-trace-stap run /usr/libexec/qemu-kvm pci_cfg_write |
Guest os pcie trace analysis:
1 | [一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status |
cmd 0x0100
1 | #define PCI_EXP_SLTCTL_PWR_IND_ON 0x0100 /* Power Indicator on */ |
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0
cmd 0x00c0
0000 0000 1100 0000
1 | #define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking */ |
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_blink: SLOTCTRL 6c write cmd 200
cmd 0x0200
1 | #define PCI_EXP_SLTCTL_PWR_IND_BLINK 0x0200 /* Power Indicator blinking */ |
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0
cmd 0x00c0
0000 0000 1100 0000
1 | #define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking */ |
[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400
cmd 0x0400
1 | #define PCI_EXP_SLTCTL_PWR_OFF 0x0400 /* Power Off */ |
[一 12月 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_off: SLOTCTRL 6c write cmd 300
cmd 0x0300
1 | #define PCI_EXP_SLTCTL_PWR_IND_OFF 0x0300 /* Power Indicator off */ |