2023-02-23

Packed virtqueue: How to reduce overhead with virtio

This is the final post of a three-post series. The previous posts are “Virtio devices and drivers overview: The headjack and the phone” and “Virtqueues and virtio ring: How the data travels.”

Split virtqueue issues: Too much spinning around

While the split virtqueue shines because of the simplicity of its design, it has a fundamental problem: the avail-used buffer cycle uses memory in a very sparse way. This puts pressure on CPU cache utilization and, for hardware devices, means several PCI transactions for each descriptor.

The packed virtqueue addresses this by merging the three rings into a single location in guest memory. While this may seem complicated at first glance, it’s a natural step after the split version once we realize that the device can discard and overwrite the data it has already read from the driver, and the same happens the other way around.

Supplying descriptors to the device: How to fill device todo-list

After initialization, following the same process described in “Virtio device initialization: feature bits,” and after agreeing on the RING_PACKED feature flag (VIRTIO_F_RING_PACKED), the driver and the device start with a shared blank canvas of descriptors with an agreed length (up to 2^15 entries) at an agreed location in guest memory. The layout of these descriptors is:
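struct virtq_desc {
        le64 addr;   /* buffer address in guest memory */
        le32 len;    /* buffer length in bytes */
        le16 id;     /* buffer id, opaque to the device */
        le16 flags;  /* descriptor flags, including AVAIL and USED */
};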
Listing: Memory layout of a packed virtqueue descriptor

This time, the id field is not an index the device uses to look for the buffer: it is an opaque value for the device and only has meaning for the driver.

The driver also maintains an internal single-bit ring wrap counter initialized to 1. The driver will flip its value every time it makes available the last descriptor in the ring.

As with split descriptors, the first step is to write the different fields: address, length, id and flags. However, packed descriptors take into account two new flags: AVAIL (bit 7, 0x80) and USED (bit 15, 0x8000). To mark a descriptor as available, the driver makes the AVAIL flag the same as its internal wrap counter, and the USED flag the inverse. While a single binary avail/used flag would be easier to implement, it would prevent useful optimizations we will describe later.
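The flag rule can be sketched in a few lines of C. This is only an illustration: the helper name `avail_flags` is hypothetical, and the bit positions (WRITE is bit 1, AVAIL is bit 7, USED is bit 15) are taken from the virtio 1.1 specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits as defined in the virtio 1.1 specification. */
#define VIRTQ_DESC_F_WRITE (1u << 1)   /* the device writes into the buffer */
#define VIRTQ_DESC_F_AVAIL (1u << 7)
#define VIRTQ_DESC_F_USED  (1u << 15)

/*
 * Hypothetical helper: compute the flags of a descriptor the driver is
 * about to make available. AVAIL mirrors the driver's wrap counter and
 * USED is its inverse.
 */
static uint16_t avail_flags(bool driver_wrap, uint16_t other_flags)
{
    /* Start from the caller's flags with any stale AVAIL/USED bits removed. */
    uint16_t flags = (uint16_t)(other_flags &
                                ~(VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED));

    if (driver_wrap)
        flags |= VIRTQ_DESC_F_AVAIL;   /* AVAIL = 1, USED = 0 */
    else
        flags |= VIRTQ_DESC_F_USED;    /* AVAIL = 0, USED = 1 */
    return flags;
}
```

With the wrap counter set, a device-writable buffer gets W|A; once the counter flips, the same call yields W|U with AVAIL clear.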

As an example, suppose the driver allocates a write buffer of 0x1000 bytes at address 0x80000000 (step 1 in the diagram) and makes it the first available descriptor by setting the AVAIL flag to the same value as its internal wrap counter, which is set (step 2). The descriptor table would look like this:

Avail idx	Address	Length	ID	Flags	Used idx
	0x80000000	0x1000	0	W|A	←
→	…

Figure: Descriptor table after adding the first buffer

Note that the avail and used idx columns are in the table just for guidance; they don’t exist in the descriptor table. Each side must keep its own internal counter to know which position it needs to poll or write next, and the device must also track the driver’s wrap counter. Lastly, as with the split virtqueue, the driver notifies the device if the latter has notifications enabled (step 3 in the diagram).
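A minimal sketch of that driver-private bookkeeping could look as follows. The names `driver_ring` and `driver_add` are hypothetical, `struct pvirtq_desc` is the specification's name for the descriptor layout listed earlier, and the code assumes a little-endian host so plain integer types can stand in for `le16`/`le32`/`le64`.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1u << 7)
#define VIRTQ_DESC_F_USED  (1u << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Hypothetical driver-private state; none of it lives in the shared ring. */
struct driver_ring {
    struct pvirtq_desc *ring;
    uint16_t size;        /* number of entries in the ring */
    uint16_t next_avail;  /* position the driver writes next */
    bool wrap;            /* driver ring wrap counter, starts at 1 */
};

static void driver_add(struct driver_ring *r, uint64_t addr, uint32_t len,
                       uint16_t id, uint16_t extra_flags)
{
    struct pvirtq_desc *d = &r->ring[r->next_avail];
    uint16_t flags = (uint16_t)(extra_flags |
        (r->wrap ? VIRTQ_DESC_F_AVAIL : VIRTQ_DESC_F_USED));

    d->addr = addr;
    d->len = len;
    d->id = id;
    /* A real driver needs a write barrier here, so the device never
     * observes the new flags before the other fields. */
    d->flags = flags;

    /* After writing the last entry, go back to 0 and flip the counter. */
    if (++r->next_avail == r->size) {
        r->next_avail = 0;
        r->wrap = !r->wrap;
    }
}
```

Notifying the device is left out on purpose; it would happen after `driver_add` returns, and only if the device has notifications enabled.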

And the usual diagram of the updates. Note the lack of the avail and used ring, as only the descriptor table is needed now.

Diagram: Driver makes available a descriptor using a packed queue

Returning used descriptors: How the device fills the “done” list

Like the driver, the device maintains an internal single-bit ring wrap counter initialized to 1, and knows that the driver’s internal ring wrap counter is also set. When the device first searches for a descriptor the driver has made available, it polls the first entry of the ring, looking for an avail flag equal to the driver’s internal wrap counter (set in this case).

As with the used ring, the device returns the length of the written data (if any) in the “length” field, along with the id of the used descriptor. Finally, the device makes the avail (A) and used (U) flags the same as the device’s internal wrap counter.

Following the example, the device leaves the descriptor table as shown in the next figure. The driver will know that the buffer has been returned because the used flag matches the avail flag, and both match the device’s internal wrap counter at the moment it wrote the descriptor. The returned address is not important: only the ID is.

Avail idx	Address	Length	ID	Flags	Used idx
	0x80000000	0x1000	0	W|A|U
→	…				←

Figure: Descriptor table after the device uses the first buffer

Diagram: Device marks a descriptor as used using a packed queue
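Both halves of this exchange can be sketched in C. The function names are hypothetical, a little-endian host is assumed, and the flag bit positions come from the virtio 1.1 specification. Keeping the remaining flag bits (such as W) untouched is a simplification chosen to match the tables above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1u << 7)
#define VIRTQ_DESC_F_USED  (1u << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Device side: return buffer `id`, reporting `written` bytes. Both AVAIL
 * and USED end up equal to the device's wrap counter. */
static void device_mark_used(struct pvirtq_desc *d, uint16_t id,
                             uint32_t written, bool device_wrap)
{
    uint16_t flags = (uint16_t)(d->flags &
                                ~(VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED));

    if (device_wrap)
        flags |= VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED;

    d->id = id;
    d->len = written;
    /* A real device needs a write barrier before publishing the flags. */
    d->flags = flags;
}

/* Driver side: a descriptor has been returned when AVAIL equals USED and
 * both equal the wrap counter the driver expects the device to have. */
static bool desc_is_used(uint16_t flags, bool expected_device_wrap)
{
    bool avail = (flags & VIRTQ_DESC_F_AVAIL) != 0;
    bool used = (flags & VIRTQ_DESC_F_USED) != 0;

    return avail == used && used == expected_device_wrap;
}
```

Note that `desc_is_used` never looks at the address: as in the text, only the id and the flags carry meaning for the driver.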

Wrapping the descriptor ring: How do the lanes stay separated?

When the driver fills the complete descriptor table, it wraps and flips its internal Driver Ring Wrap Counter. So, in the second round, available descriptors will have the avail flag clear and the used flag set, and the device will have to poll for this condition once it wraps while reading descriptors. Let’s see a full example of the different situations.

Suppose we have a descriptor table with only two entries and the Driver Ring Wrap Counter is set. If the driver fills the descriptor table by making two buffers available at the beginning of the operation, it then flips its internal wrap counter, so it becomes clear (0). The table looks like this:

Avail idx	Address	Length	ID	Flags	Used idx
→	0x80000000	0x1000	0	W|A	←
	0x81000000	0x1000	1	W|A

Figure: Full two-entries descriptor table
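The device's view of this two-entry ring can be sketched as a small polling routine. The names `device_ring` and `device_poll` are hypothetical, and a little-endian host is assumed. The device keeps its own copy of the driver wrap counter and flips it whenever its read position wraps, so entries left over from a previous round no longer look available.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1u << 7)
#define VIRTQ_DESC_F_USED  (1u << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Hypothetical device-private state. */
struct device_ring {
    struct pvirtq_desc *ring;
    uint16_t size;
    uint16_t next;      /* next position to poll */
    bool driver_wrap;   /* the device's copy of the driver wrap counter */
};

/* Return the position of the next available descriptor, or -1 if the
 * driver has not made one available yet. */
static int device_poll(struct device_ring *r)
{
    uint16_t flags = r->ring[r->next].flags;
    bool avail = (flags & VIRTQ_DESC_F_AVAIL) != 0;
    bool used = (flags & VIRTQ_DESC_F_USED) != 0;

    /* Available means: AVAIL equals the driver wrap counter and USED is
     * its inverse. */
    if (avail != r->driver_wrap || used == r->driver_wrap)
        return -1;

    int pos = r->next;
    if (++r->next == r->size) {
        r->next = 0;
        r->driver_wrap = !r->driver_wrap;
    }
    return pos;
}
```

On the full table above, two calls return positions 0 and 1. A third call returns -1 even though entry 0 still shows W|A, because the device now expects AVAIL clear and USED set.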

After that, the device realizes that it has both descriptors, with ids #0 and #1, available: it knows that the driver had its wrap counter set when it wrote them, the avail flag is set on both, and the used flag is clear on both. If the device uses the descriptor with id #1 first, we get the descriptor table in the next figure. Buffer #0 still belongs to the device!

Avail idx	Address	Length	ID	Flags	Used idx
→	0x80000000	0x1000	1	W|A|U
	0x81000000	0x1000	1	W|A	←

Figure: Using first buffer out of order

Now the driver realizes that buffer #1 has been used, since the avail and used flags are the same (set) and match the device’s internal wrap counter at the moment it wrote the descriptor. If the device now uses buffer id #0, the table looks like this:

Avail idx	Address	Length	ID	Flags	Used idx
→	0x80000000	0x1000	1	W|A|U	←
	0x81000000	0x1000	0	W|A|U

Figure: Using second buffer out of order

But there is a more interesting case: starting from the “Using first buffer out of order” situation, the driver makes buffer #1 available again. In that case, the descriptor table goes directly from that figure to the next one, “Full two-entries descriptor table.”

Avail idx	Address	Length	ID	Flags	Used idx
	0x81000000	0x1000	1	W|(!A)|U	←
→	0x81000000	0x1000	1	W|A

Figure: Full two-entries descriptor table

Chained descriptors: No more jumps

Chained descriptors work similarly: there is no need for a next field in the head (or any subsequent) descriptor to locate the following ones, since each one always occupies the next position in the ring. However, while in the split used ring you return the id of the head of the chain as used, in the packed version you only need to return the tail id.

Back to the used ring: every time we use chained descriptors, the used idx lags behind the avail idx. More than one descriptor is marked as available to the device, but only one is sent back as used to the driver. While this is not a problem in the split ring, it would cause descriptor entry exhaustion in the packed version.

The straightforward solution is to make the device mark as used every descriptor in the chain. However, this can be expensive, since we are modifying a shared area of memory, and could cause cache bounces.

However, the driver already knows the chain, so it can skip the whole chain given only the last id. This is why we need to compare the used/avail pair with the driver/device Wrap Counter: after a jump, we wouldn’t know whether the next descriptor was made available in the driver’s current round or in the next one if we only had a binary available/used flag.
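One way the driver could implement that skip is sketched below. It assumes the driver recorded, for each buffer id, how many descriptors the chain occupied when it made the chain available; the `chain_len` array, its fixed size, and the function name are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IDS 4   /* illustrative limit on buffer ids */

/* Hypothetical driver-private state for reclaiming used buffers. */
struct driver_used_state {
    uint16_t size;                /* number of entries in the ring */
    uint16_t next_used;           /* position to poll for used next */
    bool used_wrap;               /* wrap counter expected from the device */
    uint16_t chain_len[MAX_IDS];  /* descriptors consumed per buffer id */
};

/* After seeing a used descriptor carrying `id`, jump over the whole chain
 * and flip the expected wrap counter if the jump passes the ring's end. */
static void driver_skip_chain(struct driver_used_state *s, uint16_t id)
{
    uint16_t pos = (uint16_t)(s->next_used + s->chain_len[id]);

    if (pos >= s->size) {
        pos = (uint16_t)(pos - s->size);
        s->used_wrap = !s->used_wrap;
    }
    s->next_used = pos;
}
```

In a four-entry ring, a three-descriptor chain starting at position 0 moves `next_used` to 3 without wrapping. A following two-descriptor chain moves it to 1 and flips `used_wrap`.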

For example, in a four-entry ring, the driver makes available a chain of three descriptors:

Avail idx	Address	Length	ID	Flags	Used idx
	0x80000000	0x1000	0	W|A	←
	0x81000000	0x1000	1	W|A
	0x82000000	0x1000	2	W|A
→				0

Figure: Three chained descriptors available

After that, the device discovers the chain (polling position 0) and marks it as used, overwriting only position 0. It completely skips positions 1 and 2. When the driver polls for used descriptors, it skips them too, knowing that the chain was 3 descriptors long:

Avail idx	Address	Length	ID	Flags	Used idx
	0x80000000	0x1000	2	W|A|U
	0x81000000	0x1000	1	W|A
	0x82000000	0x1000	2	W|A
→				0	←

Figure: Using the descriptor chain

Now the driver makes available another chain, two descriptors long, and it has to take the wrapping into account:

Avail idx	Address	Length	ID	Flags	Used idx
	0x81000000	0x1000	1	W|(!A)|U
→	0x81000000	0x1000	1	W|A
	0x82000000	0x1000	2	W|A
	0x80000000	0x1000	0	W|A	←

Figure: Make available another descriptor chain

And the device marks it as used, so only the first descriptor in the chain (4th in the table) needs to be updated.

Avail idx	Address	Length	ID	Flags	Used idx
	0x81000000	0x1000	1	W|(!A)|U
→	0x81000000	0x1000	1	W|A	←
	0x82000000	0x1000	2	W|A
	0x80000000	0x1000	0	W|A|U

Figure: Using another descriptor chain

Although the next descriptor (the 2nd) looks available because its avail flag differs from its used flag, the device knows that it is not, because it tracks the Driver Wrap Counter: at this point, the flag combination for an available descriptor is avail clear, used set.

Indirect descriptors: When chains are not enough

Indirect descriptors work as in the split case. First, the driver allocates, anywhere in memory, a table of indirect descriptors, each with the same layout as a regular packed descriptor. After that, it sets each descriptor in this indirect table to a buffer it wants to make available to the device (steps 1-2), and inserts a descriptor in the virtqueue with the flag VIRTQ_DESC_F_INDIRECT (0x4) set (step 3). That descriptor’s address and length correspond to those of the indirect table.

In the packed layout, buffers must come in order in the indirect table, and the ID field is completely ignored. Also, the only valid flag for them is VIRTQ_DESC_F_WRITE; the others are reserved and ignored by the device. As usual, the driver will notify the device if the conditions for the notification are met (step 4).

Diagram: Driver makes available a descriptor using a packed queue

For example, the driver would need to allocate this 48-byte table for a 3-descriptor indirect table:

Address	Length	ID	Flags
0x80000000	0x1000	…	W
0x81000000	0x1000	…	W
0x82000000	0x1000	…	W

Figure: Three descriptor long indirect packed table

And if it places the indirect table first in the descriptor table, assuming the indirect table is allocated at address 0x83000000:

Avail idx	Address	Length	ID	Flags	Used idx
	0x83000000	48	0	A|I	←
→	…

Figure: Driver makes an indirect table available
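The ring entry that points to an indirect table can be filled as in the sketch below. The helper name `fill_indirect` is hypothetical, a little-endian host is assumed, and `table_addr` stands for the guest address where the driver placed the indirect table. The length is the descriptor count times the 16-byte descriptor size, which gives the 48 bytes of the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_INDIRECT (1u << 2)   /* 0x4 */
#define VIRTQ_DESC_F_AVAIL    (1u << 7)
#define VIRTQ_DESC_F_USED     (1u << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Hypothetical helper: make ring entry `slot` refer to an indirect table
 * holding `n` descriptors. */
static void fill_indirect(struct pvirtq_desc *slot, uint64_t table_addr,
                          uint16_t n, uint16_t id, bool driver_wrap)
{
    slot->addr = table_addr;
    slot->len = (uint32_t)(n * sizeof(struct pvirtq_desc));
    slot->id = id;
    /* A real driver needs a write barrier before publishing the flags. */
    slot->flags = (uint16_t)(VIRTQ_DESC_F_INDIRECT |
        (driver_wrap ? VIRTQ_DESC_F_AVAIL : VIRTQ_DESC_F_USED));
}
```

Only the ring entry carries the id and the AVAIL/USED pair; the entries inside the indirect table need just address, length and, where applicable, the write flag.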

After indirect buffer consumption, the device needs to return the indirect buffer id (0 in the example) in its used descriptor. The table looks like the return of the first buffer, except for the indirect (I) flag set:

Avail idx	Address	Length	ID	Flags	Used idx
	0x83000000	48	0	A|U|I
→	…				←

Figure: Device makes an indirect table used

After that, the device cannot access the memory table anymore unless the driver makes it available again, so the latter can free or reuse it.

Notifications: how to manage interruptions?

As in the split virtqueue, two identical structures control notifications between the device and the driver, one maintained by each side. The driver’s structure is read-only for the device, and the device’s structure is read-only for the driver.

The struct layout is:
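struct pvirtq_event_suppress {
        le16 desc;   /* bits 0-14: descriptor position, bit 15: wrap counter */
        le16 flags;  /* notification mode, see the values below */
};
Listing: Event suppression notification struct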

The member flags can take the values:

0: Notifications are enabled

1: Notifications are disabled

2: Notifications are enabled for a specific descriptor, specified from the desc member.

If the flags value is 2, the other side will not notify until its wrap counter matches the most significant bit of desc and the descriptor at the position given by desc (discarding that bit) is made used/available. For this mode to work, the VIRTIO_F_RING_EVENT_IDX flag needs to be negotiated, as described in “Virtio device initialization: feature bits.”
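The three modes can be sketched as a single decision function. This is a deliberately simplified illustration: it only considers one descriptor just made available or used at position `pos`, while a real implementation also has to handle batches of descriptors processed since the last check. The function name is hypothetical, a little-endian host is assumed, and the `RING_EVENT_FLAGS_*` names follow the virtio specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_EVENT_FLAGS_ENABLE  0x0
#define RING_EVENT_FLAGS_DISABLE 0x1
#define RING_EVENT_FLAGS_DESC    0x2

struct pvirtq_event_suppress {
    uint16_t desc;
    uint16_t flags;
};

/* Hypothetical, simplified check: should we notify the other side after
 * touching the descriptor at `pos` while our wrap counter was `wrap`? */
static bool should_notify(const struct pvirtq_event_suppress *e,
                          uint16_t pos, bool wrap)
{
    switch (e->flags & 0x3) {
    case RING_EVENT_FLAGS_ENABLE:
        return true;
    case RING_EVENT_FLAGS_DESC: {
        uint16_t event_pos = e->desc & 0x7fff;     /* low 15 bits */
        bool event_wrap = (e->desc >> 15) != 0;    /* most significant bit */

        return event_pos == pos && event_wrap == wrap;
    }
    default:   /* disabled, or a reserved value */
        return false;
    }
}
```

For instance, with flags set to 2 and desc set to 0x8005, only a descriptor at position 5 handled while the wrap counter is set triggers a notification.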

None of these mechanisms is 100% reliable, since the other side could have already sent the notification when we set the values, so expect notifications even when they are disabled.

Note that, since the descriptor ring size is not forced to be a power of two (unlike in the split version), the notification structure can fit in the same page as the descriptor table. This can be advantageous for some implementations.

Summary

In this series we have taken you through the different virtio data plane layouts and their virtqueue implementations. They are the means for virtio devices and virtio drivers to exchange information.

We started by covering the simpler and less optimized split virtqueue layout. This layout is relatively easy to implement and debug, so it’s a good entry point for learning the virtio data plane basics.

We then moved on to the packed virtqueue layout specified in virtio 1.1, which allows requests to be exchanged using a more compact descriptor representation. This avoids the overhead of scattering the data through memory, reducing cache contention and, in the case of actual hardware, the number of PCI transactions.

We also covered a number of optimizations on top of both ring layouts, which depend on the communication/device type or on how each part is implemented. Mainly, they aim to reduce the communication overhead, both in notifications and in memory transactions. Virtio offers a simple protocol to communicate which features and optimizations each side supports, so both sides can agree on how the data is going to be exchanged, which makes it highly future-proof.

This series covered the essence of the virtio data plane and provided you with the tools to analyze and develop your own virtio devices and drivers. It should be noted that this series summarizes the relevant sections of the virtio spec, so you should refer to the spec for additional information and treat it as the source of truth.

In the next posts we will return to vDPA, including the kernel framework, hands-on blogs and vDPA in Kubernetes.
