2022-09-04

Guest Free Page Hinting notes 01

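Following the virtio 1.2 release announcement, I noticed a new feature under the virtio-balloon device: free page hints.

I didn't really know what it actually does, so I did some digging.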

KVM: Guest Free Page Hinting

There is a mailing-list post from February 2019: https://lwn.net/Articles/778432/

The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.

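It looks like the main goal is to optimize how free memory is handed back and forth between guest and host, so that a guest with no page cache can quickly free memory to the host and reclaim it again.

The post also mentions: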

Known code re-work:

Plan to re-use Wei’s work, which communicates the poison value to the host.
The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei’s Free Page Hint code.
Sorting based on zonenum, to avoid repetitive zone locks for the same zone.

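In other words, the naming used in virtio-balloon needs to change so that this code can be told apart from Wei's Free Page Hint code.

Put that way, it sounds as if the hinting code in this patch set and the Free Page Hint code in virtio-balloon are two separate pieces of work.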

Virtio-balloon: support free page reporting

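Following up on the material above, I found another article that covers the virtio-balloon changes directly: https://lwn.net/Articles/759413/

The VIRTIO_BALLOON_F_FREE_PAGE_HINT feature it adds matches what is in the virtio 1.2 spec: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.pdf

Quoting the description from that LWN article: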
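To make the "feature" part concrete, here is a minimal sketch of how a feature bit like this is typically checked. The bit numbers are the ones I believe the virtio-balloon section of the spec and Linux's `include/uapi/linux/virtio_balloon.h` use; the helper names `negotiated` and `has_feature` are my own and purely illustrative.

```python
# Feature bit numbers for the virtio-balloon device (assumed to match the
# virtio spec and include/uapi/linux/virtio_balloon.h).
VIRTIO_BALLOON_F_MUST_TELL_HOST = 0
VIRTIO_BALLOON_F_STATS_VQ = 1
VIRTIO_BALLOON_F_DEFLATE_ON_OOM = 2
VIRTIO_BALLOON_F_FREE_PAGE_HINT = 3
VIRTIO_BALLOON_F_PAGE_POISON = 4


def negotiated(device_features: int, driver_features: int) -> int:
    """A feature is in effect only if both device and driver offer it."""
    return device_features & driver_features


def has_feature(features: int, bit: int) -> bool:
    """Test a single feature bit in a feature bitmask."""
    return bool((features >> bit) & 1)


# Hypothetical example: the device offers stats + free page hints,
# the driver understands free page hints + page poison.
device = (1 << VIRTIO_BALLOON_F_STATS_VQ) | (1 << VIRTIO_BALLOON_F_FREE_PAGE_HINT)
driver = (1 << VIRTIO_BALLOON_F_FREE_PAGE_HINT) | (1 << VIRTIO_BALLOON_F_PAGE_POISON)
active = negotiated(device, driver)
print(has_feature(active, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
```

Free page hinting is only usable when the bit survives negotiation on both sides.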

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

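In the live migration case, memory is copied over and over. The first round copies all memory; later rounds only need to copy the memory the guest has written since.

So the hypervisor has to track which memory the guest has written, and then copy all of it again.

This feature optimizes the transfer of guest free pages: pages that have been hinted as free are skipped, and only if they are later used or written does the next round of copying pick them up.

From this description, the optimization needs a mechanism that supplies free page hints, and live migration is then optimized on top of that.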
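The round-by-round behaviour can be sketched as a toy simulation. This is only my own model of the description above, not QEMU or kernel code: `migrate`, its parameters, and the page numbers are all hypothetical, and "dirty tracking" is reduced to a set of page numbers written during each round.

```python
def migrate(num_pages, free_hints, writes_per_round):
    """Return the list of pages sent in each migration round.

    num_pages        -- total guest pages (numbered 0..num_pages-1)
    free_hints       -- pages the guest hinted as free before round 1
    writes_per_round -- writes_per_round[i] is the set of pages the guest
                        wrote while round i+1 was being transferred
    """
    # Round 1 would normally send everything; hinted free pages are skipped.
    dirty = set(range(num_pages)) - set(free_hints)
    rounds = []
    writes = iter(writes_per_round)
    while dirty:
        rounds.append(sorted(dirty))
        # Anything written during this round (including a page that was
        # hinted free earlier) is caught by dirty tracking and resent.
        dirty = set(next(writes, ()))
    return rounds


# Pages 4..7 are hinted free; during round 1 the guest writes pages 1 and 5.
with_hints = migrate(8, {4, 5, 6, 7}, [{1, 5}, set()])
without_hints = migrate(8, set(), [{1, 5}, set()])
print(with_hints)
print(without_hints)
```

Page 5 was hinted free and skipped in round 1, but because it was written afterwards it still shows up in round 2. That is why the hint does not have to stay accurate once it has been given.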

A digression

Before getting to the code, I came across something quite interesting:

- mm/get_from_free_page_list: The new implementation to get free page
  hints based on the suggestions from Linus:
  https://lkml.org/lkml/2018/6/11/764
  This avoids the complex call chain, and looks more prudent.

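On the get_from_free_page_list operation, Linus replied with a long list of suggestions.

One interesting bit in the thread is the question of whether to add a new GFP_NONE that makes any attempted allocation fail:

Maybe it will help to have GFP_NONE which will make any allocation fail if attempted. Linus, would this address your comment?

Linus's reply was that, instead of going through such a complex call chain that can end up allocating memory, it would be better to have a simple mechanism that avoids the problem in the first place: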

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

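In between there are many more suggestions for simplifying the code, including this sentence, which judges this part of the code to be too complex and too fragile:

The whole sequence of events really looks "this is too much complexity, and way too fragile" to me at so many levels.

This reminded me of patterns in how some features are currently implemented in ZStack:

Implementing a mechanism B to work around a problem in mechanism A
Reusing an unfamiliar mechanism A while overlooking the problems A itself has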

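To illustrate with a real feature: kernel panic detection for a VM has two prerequisites:

Add a pvpanic device to the VM's XML configuration
Enable the pvpanic kernel module inside the VM

So the guest and the host have to cooperate before we can decide whether the feature is usable.

Given that, the guest side has to report whether it supports pvpanic, and the host side has to read from the configuration whether pvpanic was configured, so the implementation queried these two pieces of information separately. Querying the configuration on the host is what ended up causing several control-plane bugs.

Looking back, the design started only from "read it from the host configuration" and overlooked the premise that the configuration does not change at runtime. The extra query was never necessary; it made the feature depend on the existing configuration-query mechanism and in the end produced even more complicated failures.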

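People in the kernel open-source world run into the same kind of problem, so having a well-organized methodology for feature design really matters: at the very least it can guide how to design things better, and raise the level of committers and coders.

Thinking about it the other way around:

The features the guest tool reports for a VM at runtime are always tied to the guest tool's version, so once they have been fetched there is no need to fetch them again.
Only when the guest tool version changes do they need to be fetched again.
Providing a way to actively refresh the guest tool features is enough.

Broken down this way, the VM should simply store this feature information itself instead of querying for it every time, which makes the mechanism much simpler.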
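The three points above can be sketched as a small version-keyed cache. This is only an illustration of the idea, not ZStack code: `GuestFeatureCache`, `query_guest_tool`, the VM id, and the version strings are all hypothetical names I made up for the example.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


class GuestFeatureCache:
    """Remember guest tool features per VM, keyed by guest tool version."""

    def __init__(self, query_guest_tool: Callable[[str], Iterable[str]]):
        # query_guest_tool is a stand-in for the real (expensive) runtime query.
        self._query = query_guest_tool
        self._cache: Dict[str, Tuple[str, FrozenSet[str]]] = {}

    def features(self, vm_id: str, tools_version: str) -> FrozenSet[str]:
        cached = self._cache.get(vm_id)
        # Query only on first use or when the guest tool version changed.
        if cached is None or cached[0] != tools_version:
            cached = (tools_version, frozenset(self._query(vm_id)))
            self._cache[vm_id] = cached
        return cached[1]

    def refresh(self, vm_id: str, tools_version: str) -> FrozenSet[str]:
        """Explicit, user-triggered update of the stored features."""
        self._cache.pop(vm_id, None)
        return self.features(vm_id, tools_version)


calls = []

def fake_query(vm_id: str):
    # Pretend every guest tool reports pvpanic support.
    calls.append(vm_id)
    return {"pvpanic"}

cache = GuestFeatureCache(fake_query)
cache.features("vm-1", "1.0")   # first use: queries the guest
cache.features("vm-1", "1.0")   # same version: served from the cache
cache.features("vm-1", "1.1")   # version changed: queries again
print(len(calls))
```

With this shape, the control plane reads stored state in the common path and only talks to the guest when the version changes or someone explicitly asks for a refresh.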

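The earlier design, by contrast, did not start from an understanding of the feature as a whole; it was much closer to the "add a GFP_NONE" way of solving a problem.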