2023-12-06

The disk in the guest OS is unmounted during the kernel startup process.

Phenomenon: When creating a new virtual machine, after the virtual machine enters the “running” state (libvirt reports running, and the qemu process starts), a disk is loaded. During the kernel startup process, the disk (vdb) is recognized, and then qemu receives a device removal event, which is fed back to libvirt. Libvirt updates the XML, causing inconsistency between the disk state recorded in the zstack database and the XML on the host.

The main issue here is that the libvirt loading device interface returns success, and the XML corresponding to the device is also added. However, this device is deleted according to the event feedback from qemu.

Important log information: Here, let’s first analyze the system logs in the guest OS:

Here, we notice the logs related to pciehp because this virtual machine is UEFI-booted, leading to numerous pcie-related logs (due to UEFI boot requiring the q35 machine type, which defaults to pcie devices).

The initially observed logs include an error log from pcieport:

1	pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]

Followed by the recognition of the vdb device:

1	_virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (537 GB/500 GiB)

An external interrupt is sent to the virtual machine:

1	pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed

Subsequently, through ausearch, it is identified that libvirt received a device deletion event, leading to the removal of the mentioned device:

libvirt received device deleted event, removing the device

Based on these scenarios, we have summarized the steps to reproduce the issue:

During the kernel startup process
Load the data disk.
Check for inconsistencies between the XML and the database.
Through repeated testing of VM boot and data disk loading, the issue can be reproduced.

Regarding the error logs mentioned above, the explanation is as follows:

pci 0000:00:02.7: BAR 13: failed to assign [io size 0x1000]:
- According to https://access.redhat.com/solutions/3144711, this error may occur because in virtualized environments, there might be more PCIe ports than in a real physical environment, leading to this error. However, it does not have any actual impact.
_virtio_blk virtio6: [vdb] 104857600 512-byte logical blocks (537 GB/500 GiB):
- This indicates that the virtio-blk disk has indeed been successfully loaded, as it has been recognized within the virtual machine.
pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed:
- “Attention button pressed” indicates that when resetting the PCIe slot, QEMU sends the corresponding interrupt. When the host receives this interrupt, the corresponding processing logic prints this log.

As for the key QEMU code, by searching the codebase, it has been confirmed that QEMU sends the corresponding interrupt when resetting the PCIe slot, and the host prints the log as part of the corresponding processing logic.

code from pcie.c

pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA,
                             PCI_EXP_SLTSTA_EIS |/* on reset,
                                                    the lock is released */
                             PCI_EXP_SLTSTA_CC |
                             PCI_EXP_SLTSTA_PDC |
                             PCI_EXP_SLTSTA_ABP);

which is used in qdev.c

QLIST_FOREACH(bus, &dev->child_bus, sibling) {
     object_property_set_bool(OBJECT(bus), true, "realized",
                                  &local_err);
     if (local_err != NULL) {
         goto child_realize_fail;
     }
 }
 if (dev->hotplugged) {
     device_reset(dev);
 }
 dev->pending_deleted_event = false;
 
 if (hotplug_ctrl) {
     hotplug_handler_plug(hotplug_ctrl, dev, &local_err);
     if (local_err != NULL) {
         goto child_realize_fail;
     }
}

During the device hotplug process, there will be a reset action.

Kernel-related code:

from pciehp_hpc.c

static int pciehp_poll(void *data)
{
    struct controller *ctrl = data;
 
    schedule_timeout_idle(10 * HZ); /* start with 10 sec delay */
 
    while (!kthread_should_stop()) {
        /* poll for interrupt events or user requests */
        while (pciehp_isr(IRQ_NOTCONNECTED, ctrl) == IRQ_WAKE_THREAD ||
               atomic_read(&ctrl->pending_events))
            pciehp_ist(IRQ_NOTCONNECTED, ctrl);
 
        if (pciehp_poll_time <= 0 || pciehp_poll_time > 60)
            pciehp_poll_time = 2; /* clamp to sane value */
 
        schedule_timeout_idle(pciehp_poll_time * HZ);
    }
 
    return 0;
}
 
    /* Check Attention Button Pressed */
    if (events & PCI_EXP_SLTSTA_ABP) {
        ctrl_info(ctrl, "Slot(%s): Attention button pressed\n",
              slot_name(ctrl));
        pciehp_handle_button_press(ctrl);
    }

This code represents a kernel function for polling PCIe Hot Plug events. It uses a kernel thread (kthread) to continuously poll for interrupt events or user requests related to PCIe Hot Plug. The function includes a timeout mechanism with an initial delay of 10 seconds and then repeats the polling process based on the specified polling time. The function stops when the kernel thread should stop (kthread_should_stop() returns true).

And pciehp_handle_button_press is implemented as following:

void pciehp_handle_button_press(struct controller *ctrl)
{
	mutex_lock(&ctrl->state_lock);
	switch (ctrl->state) {
	case OFF_STATE:
	case ON_STATE:
		if (ctrl->state == ON_STATE) {
			ctrl->state = BLINKINGOFF_STATE;
			ctrl_info(ctrl, "Slot(%s): Powering off due to button press\n",
				  slot_name(ctrl));
		} else {
			ctrl->state = BLINKINGON_STATE;
			ctrl_info(ctrl, "Slot(%s) Powering on due to button press\n",
				  slot_name(ctrl));
		}
		/* blink power indicator and turn off attention */
		pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_BLINK,
				      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		schedule_delayed_work(&ctrl->button_work, 5 * HZ);
		break;
	case BLINKINGOFF_STATE:
	case BLINKINGON_STATE:
		/*
		 * Cancel if we are still blinking; this means that we
		 * press the attention again before the 5 sec. limit
		 * expires to cancel hot-add or hot-remove
		 */
		ctrl_info(ctrl, "Slot(%s): Button cancel\n", slot_name(ctrl));
		cancel_delayed_work(&ctrl->button_work);
		if (ctrl->state == BLINKINGOFF_STATE) {
			ctrl->state = ON_STATE;
			pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_ON,
					      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		} else {
			ctrl->state = OFF_STATE;
			pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF,
					      PCI_EXP_SLTCTL_ATTN_IND_OFF);
		}
		ctrl_info(ctrl, "Slot(%s): Action canceled due to button press\n",
			  slot_name(ctrl));
		break;
	default:
		ctrl_err(ctrl, "Slot(%s): Ignoring invalid state %#x\n",
			 slot_name(ctrl), ctrl->state);
		break;
	}
	mutex_unlock(&ctrl->state_lock);
}

Both OFF_STATE and ON_STATE could changed to each other by same press request.

Based on the test results, we preliminarily conclude that there is a race condition between hot-plug operations and kernel boot, leading to unexpected changes in the PCIe slot’s state from off → on → off. (Note: The crucial point here is that pciehp_handle_button_press(ctrl); simultaneously handles both on and off scenarios.)

With reference to the above keywords, we identified a related Bugzilla entry for QEMU version 4.2 by searching for ‘qemu pci device kernel boot race condition’:

https://bugzilla.kernel.org/show_bug.cgi?id=211691

“The document mentions a virtio-net failover mechanism introduced by QEMU 4.2, addressing the issue of hot-plugging network cards failing during the VM startup phase. This problem arises from a race condition in the QEMU code that sets the PCIe slot’s state. The provided QEMU patch resolves the issue:

QEMU Patch Link

The title of this patch is: ‘pcie: don’t set link state active if the slot is empty.’

Upon reviewing its content, it appears that during PCIe initialization and the hot-plug phase, the ‘reset’ is called, potentially causing inconsistencies in the slot’s state. This patch addresses the problem by preventing the setting of the link state to active if the slot is empty, eliminating the observed issue.”

TIPs:

Search for changes related to the virtual machine process using ausearch:

1	ausearch -m "VIRT_RESOURCE" -p 63259

Libvirt’s XML and QEMU event update mechanism: Details can be found in TIC-1360 - Cloud VM disk does not exist, capacity inconsistency between UI interface and underlying view (Closed).

Quick reference for PCIe events:

From linux pci_regs.h:

#define PCI_EXP_SLTCTL      24  /* Slot Control */
#define  PCI_EXP_SLTCTL_ABPE    0x0001  /* Attention Button Pressed Enable */
#define  PCI_EXP_SLTCTL_PFDE    0x0002  /* Power Fault Detected Enable */
#define  PCI_EXP_SLTCTL_MRLSCE  0x0004  /* MRL Sensor Changed Enable */
#define  PCI_EXP_SLTCTL_PDCE    0x0008  /* Presence Detect Changed Enable */
#define  PCI_EXP_SLTCTL_CCIE    0x0010  /* Command Completed Interrupt Enable */
#define  PCI_EXP_SLTCTL_HPIE    0x0020  /* Hot-Plug Interrupt Enable */
#define  PCI_EXP_SLTCTL_AIC 0x00c0  /* Attention Indicator Control */
#define  PCI_EXP_SLTCTL_ATTN_IND_SHIFT 6      /* Attention Indicator shift */
#define  PCI_EXP_SLTCTL_ATTN_IND_ON    0x0040 /* Attention Indicator on */
#define  PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking */
#define  PCI_EXP_SLTCTL_ATTN_IND_OFF   0x00c0 /* Attention Indicator off */
#define  PCI_EXP_SLTCTL_PIC 0x0300  /* Power Indicator Control */
#define  PCI_EXP_SLTCTL_PWR_IND_ON     0x0100 /* Power Indicator on */
#define  PCI_EXP_SLTCTL_PWR_IND_BLINK  0x0200 /* Power Indicator blinking */
#define  PCI_EXP_SLTCTL_PWR_IND_OFF    0x0300 /* Power Indicator off */
#define  PCI_EXP_SLTCTL_PCC 0x0400  /* Power Controller Control */
#define  PCI_EXP_SLTCTL_PWR_ON         0x0000 /* Power On */
#define  PCI_EXP_SLTCTL_PWR_OFF        0x0400 /* Power Off */
#define  PCI_EXP_SLTCTL_EIC 0x0800  /* Electromechanical Interlock Control */
#define  PCI_EXP_SLTCTL_DLLSCE  0x1000  /* Data Link Layer State Changed Enable */

QEMU systemtap trace, refer to: QEMU Tracing Documentation

1	/usr/bin/qemu-trace-stap run /usr/libexec/qemu-kvm pci_cfg_write

Guest os pcie trace analysis:

[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_on: SLOTCTRL 6c write cmd 100
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: Slot(0-7): Attention button pressed
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: Slot(0-7): Powering off due to button press
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_blink: SLOTCTRL 6c write cmd 200
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_get_power_status: SLOTCTRL 6c value read 2f1
[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_unconfigure_device: domain:bus:dev = 0000:08:00
[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0010 from Slot Status
[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_power_off_slot: SLOTCTRL 6c write cmd 40

[一 12月 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pending interrupts 0x0018 from Slot Status
[一 12月 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_off: SLOTCTRL 6c write cmd 300

[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_on: SLOTCTRL 6c write cmd 100

cmd 0x0100

1	#define PCI_EXP_SLTCTL_PWR_IND_ON 0x0100 /* Power Indicator on */

[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

cmd 0x00c0
0000 0000 1100 0000

1 2	#define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking / #define PCI_EXP_SLTCTL_ATTN_IND_ON 0x0040 / Attention Indicator on */

[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_blink: SLOTCTRL 6c write cmd 200

cmd 0x0200

1	#define PCI_EXP_SLTCTL_PWR_IND_BLINK 0x0200 /* Power Indicator blinking */

[一 12月 4 15:06:48 2023] pciehp 0000:00:02.7:pcie004: pciehp_set_attention_status: SLOTCTRL 6c write cmd c0

cmd 0x00c0
0000 0000 1100 0000

1 2	#define PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator blinking / #define PCI_EXP_SLTCTL_ATTN_IND_ON 0x0040 / Attention Indicator on */

[一 12月 4 15:06:53 2023] pciehp 0000:00:02.7:pcie004: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400

cmd 0x0400

1	#define PCI_EXP_SLTCTL_PWR_OFF 0x0400 /* Power Off */

[一 12月 4 15:06:54 2023] pciehp 0000:00:02.7:pcie004: pciehp_green_led_off: SLOTCTRL 6c write cmd 300
cmd 0x0300

1	#define PCI_EXP_SLTCTL_PWR_IND_OFF 0x0300 /* Power Indicator off */