Virtio devices and drivers overview: The headjack and the phone

This three-part series will take you through the main virtio data plane layouts: the split virtqueue and the packed virtqueue. This is the basis for the communication between hosts and virtual environments like guests or containers.

这个由三部分组成的系列,将会带你了解virtio数据平面的布局,split virtuqueue 和 packed virtqueue。这是物理机与虚拟环境比如虚拟机或容器交流的基础。

One of the challenges when coming to explain these approaches is the lack of documentation and the many terms involved. This set of posts attempts to demystify the virtio data plane and provide you with a clear down to earth explanation of what is what.

在解释这些方法时,面临的挑战之一是缺乏文档和涉及的许多术语。这组文章试图揭开virtio数据平面的神秘面纱,并为你提供一个清晰的解释,说明什么是什么。

This is a technical deep dive and is relevant for those who are interested in the bits and bytes of things. It details the communication format between the different virtio parts and data plane protocols.

这是一个技术上的深入研究,与那些对事物的比特和字节感兴趣的人有关。它详细介绍了不同virtio部件和数据平面协议之间的通信格式。

While further extensions, optimizations and features are being added to both virtqueue versions, to improve performance and to simplify implementation, the core of the virtqueue operations remains the same. This is because it has been designed with extensibility in mind.

虽然两个版本的virtqueue都加入了进一步的扩展、优化和功能,以提高性能和简化实现,但virtqueue操作的核心仍然保持不变。这是因为它在设计时就考虑到了可扩展性。

Packed virtqueue, which complements the split virtqueue has been merged in the virtio 1.1 spec, and successfully implemented in both emulated devices (qemu, virtio_net, dpdk) and physical devices.

作为split virtqueue的补充,packed virtqueue已被合并到virtio 1.1规范中,并在模拟设备(qemu、virtio_net、dpdk)和物理设备中成功实现。

We’ll start with an overview of the virtio device, drivers and their data plane interaction. Then we’ll move on to explain the details of the split virtqueue ring layout. This is followed by an overview of the packed ring layout and the advantages it brings over the split virtqueue approach.

我们将首先概述virtio设备、驱动程序和它们的数据平面互动。然后,我们将继续解释split virtqueue ring layout的细节。随后,我们将概述packed ring layout以及它比split virtqueue方法带来的优势。

Virtio devices and drivers overview: who is who

This section provides a brief overview of the virtio devices, virtio drivers, examples of the different architectures you can use and the different components. If you’re already familiar with these topics or you have already followed the virtio networking series you can jump directly to the next section focusing on the virtio rings.

本节简要介绍了virtio设备、virtio驱动、你可以使用的不同架构的例子以及不同的组件。如果你已经熟悉了这些主题,或者你已经关注了virtio网络系列,你可以直接跳到下一节,重点介绍virtio rings。

Virtio devices: In and out the virtual world

A virtio device is a device that exposes a virtio interface for the software to manage and exchange information. It can be exposed to the emulated environment using PCI, Memory Mapping I/O (Just to expose the device in a region of memory) and S/390 Channel I/O. Part of the communication needs to be delegated to theses, like device discovery.

Virtio设备是一个暴露出virtio接口的设备,供软件管理和交换信息。它可以使用PCI、内存映射I/O(只是在内存的一个区域暴露设备)和S/390通道I/O暴露在仿真环境中。部分通信需要委托给这些设备,如设备发现。

Its main task is to convert the signal from the format they have outside of the virtual environment (the VM, the container, etc) to the format they need to be exchanged through the virtio dataplane and vice-versa. This signal could be real (for example the electricity or the light from a NIC) or already virtual (like the representation the host has from a network packet).

它的主要任务是将信号从它们在虚拟环境(虚拟机、容器等)之外的格式转换成它们需要通过virtio数据线交换的格式,反之亦然。这个信号可以是真实的(例如来自网卡的电或光),或者已经是虚拟的(如主机从网络包中得到的表示)。

The virtio interface consist of the following mandatory parts (virtio1.1 spec):

  • Device status field
  • Feature bits
  • Notifications
  • One or more virtqueues

Now we’ll provide additional details to each of these parts and how the device and driver starts communicating using these.

virtio接口由以下强制性部分组成(virtio1.1规范):

  • 设备状态字段
  • 特征位
  • 通知
  • 一个或多个虚拟队列

现在我们将提供这些部分的额外细节,以及设备和驱动如何使用这些部分开始通信。

Device status field: Is everything ok?

The device status field is a sequence of bits the device and the driver use to perform their initialization. We can imagine it as traffic lights on a console, each part set and clear each bit indicating their status.

设备状态字段是设备和驱动程序用来执行其初始化的一个比特序列。我们可以把它想象成控制台上的交通灯,每个部分设置和清除每个位,表示它们的状态。

The guest or the driver set the bit ACKNOWLEDGE (0x1) in the device status field to indicate that it acknowledges the device, and the bit DRIVER (0x2) to indicate an initialization in progress. After that, it starts a feature negotiation using the feature bits (more on this later), and sets bit DRIVER_OK (0x4) and FEATURES_OK (0x8) to acknowledge the features, so communication can start. If the device wants to indicate a fatal failure, it can set bit DEVICE_NEEDS_RESET (0x40), and the driver can do the same with bit FAILED (0x80).

访客或驱动程序在设备状态字段中设置位ACKNOWLEDGE(0x1)以表示它确认了设备,并设置位DRIVER(0x2)以表示初始化正在进行。之后,它开始使用功能位进行功能协商(后面会详细介绍),并设置位DRIVER_OK(0x4)和FEATURES_OK(0x8)来确认功能,这样就可以开始通信了。如果设备想指示一个致命的故障,它可以设置位DEVICE_NEEDS_RESET (0x40),而驱动程序可以用位FAILED (0x80)做同样的事情。

The device communicates the location of these bits using transport specific methods, like PCI scanning or knowing the address for MMIO.

设备使用传输的特定方法来传达这些位的位置,如PCI扫描或知道MMIO的地址。

Feature bits: Setting the communication agreement points

Device’s feature bits are used to communicate what features it supports, and to agree with the drivers about what of them will be used. These can be device-generic or device-specific. As an example of the first case, a bit can acknowledge if the device supports SR-IOV or what memory mode can be used. An example of the second case can be the different offloads it can perform, like checksumming or scatter-gather If the device is a network interface.

设备的功能位用于交流它支持哪些功能,并与驱动程序商定将使用其中哪些功能。这些位可以是设备通用的,也可以是设备特定的。作为第一种情况的一个例子,一个比特可以确认设备是否支持SR-IOV或者可以使用什么内存模式。第二种情况的一个例子是,如果设备是一个网络接口,它可以执行不同的卸载,如校验和或散点收集。

After the device initialization exposed in the previous section, the former reads the feature bits the device offers, and sends back the subset that it can handle. If they agree on them, the driver will allocate and inform about the virtqueues to the device, and all other configuration needed.

在上一节暴露的设备初始化之后,前者会读取设备提供的功能位,并发回它能处理的子集。如果他们达成一致,驱动程序将分配和通知设备的虚拟队列,以及所有其他需要的配置。

Notifications: You have work to do

Devices and drivers must notify that they have information to communicate using a notification. While the semantic of these is specified in the standard, the implementation of these are transport specific, like a PCI interruption or to write to a specific memory location. The device and the driver needs to expose at least one notification method. We will expand on this later in future sections.

设备和驱动程序必须通知他们有使用通知的信息进行通信。虽然这些的语义是在标准中规定的,但这些的实现是特定于传输的,比如PCI中断或写到一个特定的内存位置。设备和驱动程序需要公开至少一个通知方法。我们将在以后的章节中对此进行阐述。

One or more virtqueues: The communication vehicles

A virtqueue is just a queue of guest’s buffers that the host consumes, either reading them or writing to them, and returns to the guest. The current memory layout of a virtqueue implementation is a circular ring, so it is often called the virtring or vring.

They will be the main topic of the next section, Virtqueues and virtio ring, so at this moment is enough with that definition.

virtqueue只是一个guest缓冲区的队列,主机消耗这些缓冲区,要么读取它们,要么写入它们,然后返回给客体。目前virtqueue实现的内存布局是一个圆形的环,所以它通常被称为virtring或vring。

它们将是下一节的主要话题,即virtqueues和virtio ring,所以此刻有了这个定义就足够了。

Virtio drivers: The software avatar

The virtio driver is the software part in the virtual environment that talks with the virtio device using the relevant parts of the virtio spec.

Generally speaking, its virtio control plane tasks are:

  • Look for the device
  • To allocate shared memory in the guest for the communication

Start it using the protocol in Virtio devices.

virtio驱动是虚拟环境中的软件部分,它使用virtio规范的相关部分与virtio设备对话。

一般来说,其virtio控制面的任务是。

  • 寻找设备
  • 在客户中为通信分配共享内存

使用virtio设备中的协议启动它。

Devices and drivers interaction: The scenarios

In this section we are going to locate each virtio networking element (device, driver, and how the communication works) in three different architectures, to provide both a common frame to start explaining the virtio data plane and to show how adaptive it is. We have already presented these elements in past posts, so you can skip this section if you are a virtio-net series reader. On the other hand, if you have not read them, you can use them as a reference to understand this part better.

在这一节中,我们将把每个virtio网络元素(设备、驱动和通信如何工作)放在三个不同的架构中,以提供一个共同的框架来开始解释virtio数据平面,并展示它的适应性。我们已经在过去的文章中介绍了这些元素,所以如果你是virtio-net系列的读者,你可以跳过这一部分。另一方面,如果你没有读过这些文章,你可以把它们作为参考来更好地理解这一部分。

In Introduction to virtio-networking and vhost-net we showed the environment in which qemu created an emulated net device and offered it to the guest’s virtio-net driver. In this environment, the driver notifications are routed from whatever method is exposed to guests (usually, PCI) to KVM interruptions that stop the guest’s processor and return the control to the host (vmexit). Similarly, the device notifications are a special ioctl the host can send to the KVM device (vCPU IRQ). QEMU can access virtqueue information using the shared memory.

在介绍virtio-networking和vhost-net时,我们展示了qemu创建一个模拟的net设备并将其提供给客户的virtio-net驱动程序的环境。在这个环境中,驱动程序的通知从任何暴露给客体的方法(通常是PCI)被路由到KVM中断,停止客体的处理器并将控制权返回给主机(vmexit)。同样地,设备通知是主机可以向KVM设备发送的特殊ioctl(vCPU IRQ)。QEMU可以使用共享内存访问virtqueue信息。

Please note the implications of the virtio rings shared memory concept: The memory the driver and the device access is the same page in RAM, they are not two different regions that follow a protocol to synchronize.

请注意virtio环的共享内存概念的含义。驱动程序和设备访问的内存是RAM中的同一个页面,它们不是两个不同的区域,它们遵循一个协议来进行同步。

Figure 1: Qemu emulated device component diagram

Since the notification now needs to travel from the guest (KVM), to QEMU, and then to the kernel for the latter to forward the network frame, we can spawn a thread in the kernel with access to the guest’s shared memory mapping and then let it handle the virtio dataplane.

由于通知现在需要从guest(KVM)到QEMU,再到内核,以便后者转发网络帧,我们可以在内核中生成一个线程,访问客体的共享内存映射,然后让它处理virtio数据平面。

In that context, QEMU initiates the device using the virtio dataplane, and then forwards the virtio device status to vhost-net, delegating the data plane to it. In this scenario, KVM will use an event file descriptor (eventfd) to communicate the device interruptions, and expose another one to receive CPU interruptions. The guest does not need to be aware of this change, it will operate as the previous scenario.

在这种情况下,QEMU使用virtio数据平面启动设备,然后将virtio设备状态转发给vhost-net,将数据平面委托给它。在这种情况下,KVM将使用一个事件文件描述符(eventfd)来传达设备中断,并公开另一个文件描述符来接收CPU中断。guest不需要意识到这种变化,它将像之前的方案一样操作。

Also, in order to increase the performance, we created an in-kernel virtio-net device (called vhost-net) to offload the data plane directly to the kernel, where packet forwarding takes place:

另外,为了提高性能,我们创建了一个内核内的virtio-net设备(称为vhost-net),将数据平面直接卸载到内核,在那里进行数据包转发。

Figure 2: Virtio-net components diagram

Later on, we moved the virtio device from the kernel to an userspace process in the host (covered in the post “A journey to the vhost-users realm”) that can run a packet forwarding framework like DPDK. The protocol to set all this up is called virtio-user.

后来,我们把virtio设备从内核移到了主机的用户空间进程中(在 “通往vhost-users领域的旅程 “一文中有所涉及),该进程可以运行像DPDK这样的包转发框架。设置这一切的协议被称为virtio-user。

Figure 3: Virtio-user components diagram

It even allows guests to run virtio drivers in guest’s userland, instead of the kernel! In this case, virtio names driver the process that is managing the memory and the virtqueues, not the kernel code that runs in the guest.

它甚至允许客户在客户的用户区运行virtio驱动,而不是在内核中运行 在这种情况下,virtio将驱动程序命名为管理内存和virtqueues的进程,而不是在guest中运行的内核代码

Figure 4: Virtio-user with userland driver in guest

Lastly, we can directly do a virtio device passthrough with the proper hardware. If the NIC supports the virtio data plane, we can expose it directly to the guest with proper hardware (IOMMU device, able to translate between the guest’s and device’s memory addresses) and software (for example, VFIO linux driver, that enables the host to directly give the control of a PCI device to the guest). The device uses the typical hardware signals for notifications infrastructure, like PCI and CPU interruptions (IRQ).

最后,我们可以通过适当的硬件直接进行virtio设备透传。如果网卡支持virtio数据平面,我们可以通过适当的硬件(IOMMU设备,能够在guest和设备的内存地址之间进行转换)和软件(例如,VFIO linux驱动,使主机能够直接将PCI设备的控制权交给guest)将其直接暴露给guest。该设备使用典型的硬件信号来通知基础设施,如PCI和CPU中断(IRQ)。

If a hardware NIC wants to go this way, the easiest approach is to build its driver on top of vDPA, also explained in earlier posts of this series.

如果硬件网卡想走这条路,最简单的方法是在vDPA的基础上构建它的驱动程序,在本系列的早期文章中也有解释.

Figure 5: Virtio hardware passthrough components diagram

We will explain what happens inside of the dataplane communication in the rest of the posts.

我们将在接下来的文章中解释数据平面通信内部发生了什么。

Thanks to the deep investment in standardization, the virtio data plane is the same in whatever way we use across these scenarios, and whatever transport protocol we use. The format of the exchanged messages are the same, and different devices or drivers can negotiate different capabilities or features based on its characteristics using the feature bits, previously mentioned. This way, the virtqueues only act as a common thin layer of device-driver communication that allows to reduce the investment of development and deployment.

由于对标准化的深入投资,virtio数据平面在这些场景中,无论我们使用什么方式,无论我们使用什么传输协议,都是一样的。交换的消息的格式是相同的,不同的设备或驱动程序可以根据它的特点,使用前面提到的特征位,协商不同的能力或特征。这样一来,虚拟队列只是作为设备-驱动程序通信的一个普通薄层,可以减少开发和部署的投资。

As stated on previous blogs on this series, the interest of this standardization is to achieve a slim layer of communication with the virtual environment (instead of emulating a complete piece of hardware), that makes it easier to verify for correctness across different virtualization technologies or hardware.

正如本系列的前几篇博客所述,这种标准化的兴趣在于实现与虚拟环境的薄层通信(而不是模拟一个完整的硬件),这使得在不同的虚拟化技术或硬件之间验证正确性更加容易。