kvm: mmu: Rework the x86 TDP direct mapped case

Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended upon shadow paging to run all guests, we now use
two dimensional paging (TDP). This RFC proposes and
demonstrates two major changes to the MMU. First, an iterator abstraction
that simplifies traversal of TDP paging structures when running an L1
guest. This abstraction takes advantage of the relative simplicity of TDP
to simplify the implementation of MMU functions. Second, this RFC changes
the synchronization model to enable more parallelism than the monolithic
MMU lock. This "direct mode" MMU is currently in use at Google and has
given us the performance necessary to live migrate our 416 vCPU, 12TiB
m2-ultramem-416 VMs.

Over the years, the demands on KVM's x86 MMU have grown from running small
guests to live migrating VMs with terabytes of memory and hundreds of vCPUs.
Where we previously relied on shadow paging to run guests, we now use two
dimensional paging (TDP). This proposal describes two major changes to the
MMU. First, it adds an iterator abstraction that simplifies traversal of the
TDP paging structures when running an L1 guest; the abstraction takes
advantage of the relative simplicity of TDP to simplify the implementation of
MMU functions. Second, the RFC changes the synchronization model to allow
parallelism instead of relying on the current monolithic MMU lock. This
"direct mode" MMU is in use inside Google and has already delivered the
performance needed to live migrate VMs with 416 vCPUs and 12 TiB of memory.

The primary motivation for this work was to handle page faults in
parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
MMU lock suffers from extreme contention, resulting in soft-lockups and
jitter in the guest. To demonstrate this, I have also written, and will
submit, a demand paging test to KVM selftests. The test creates N vCPUs, which
each touch disjoint regions of memory. Page faults are picked up by N
user fault FD handlers, one for each vCPU. Over a 1 second profile of
the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
execution time was spent waiting for the MMU lock! With this patch
series the total execution time for the test was reduced by 89% and the
execution was dominated by get_user_pages and the user fault FD ioctl.
As a secondary benefit, the iterator-based implementation does not use
the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
overheads.

The primary motivation for this work is to handle page faults in parallel.
When a VM has hundreds of vCPUs and terabytes of memory, contention on KVM's
MMU lock becomes extremely severe, causing soft lockups and jitter in the
guest. To demonstrate the effect of this change, I implemented a demand
paging test for KVM selftests. The test creates N vCPUs, each of which
touches a disjoint region of memory. Page faults are handled by N userfaultfd
handlers, one per vCPU. Over a 1 second profile of the test with 416 vCPUs
and 4 GiB per vCPU, 98% of the time was spent waiting on the MMU lock. With
this patch series, total execution time was reduced by 89%, and execution
became dominated by get_user_pages and the userfaultfd ioctl. As a further
benefit, the iterator-based implementation does not use the rmap or the
kvm_mmu_page structure, saving roughly 0.2% of guest memory in KVM overhead.
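
To make the shape of such a test concrete, here is a minimal userfaultfd
fault-handler sketch of the kind each handler thread in a demand paging test
could run. It is not the actual KVM selftest; the memory size, the
zero-filled staging page, and the single-threaded structure are assumptions
for illustration only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/* Resolve missing-page faults by copying in a zero-filled staging page. */
static void handle_faults(int uffd)
{
	void *staging = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	memset(staging, 0, PAGE_SIZE);
	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy = { 0 };

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			break;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		copy.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
		copy.src = (unsigned long)staging;
		copy.len = PAGE_SIZE;
		/* UFFDIO_COPY installs the page and wakes the faulting thread. */
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
}

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { 0 };
	unsigned long len = 64 * PAGE_SIZE;	/* stand-in for one vCPU's slice */
	void *mem;
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

	ioctl(uffd, UFFDIO_API, &api);

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	reg.range.start = (unsigned long)mem;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* In the real test, N vCPU threads would touch disjoint slices of
	 * guest memory while N handler threads run loops like this one. */
	handle_faults(uffd);
	return 0;
}
```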

The goal of this RFC is to demonstrate and gather feedback on the
iterator pattern, the memory savings it enables for the "direct case"
and the changes to the synchronization model. Though they are interwoven
in this series, I will separate the iterator from the synchronization
changes in a future series. I recognize that some feature work will be
needed to make this patch set ready for merging. That work is detailed
at the end of this cover letter.

This RFC aims to demonstrate, and gather feedback on, the iterator pattern,
the memory savings it enables for the "direct case", and the changes to the
synchronization model. Although the two are closely interwoven in this
series, I will separate the iterator from the synchronization changes in a
future series. I realize that more work is needed before this can be merged;
the details of that work are attached at the end of this cover letter.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and
programs the EPT/NPT paging structures with the GPA -> HPA mapping. The
guest has no way to change this mapping and only one version of the
paging structure is needed per L1 address space (normal execution or
system management mode, on x86).

The main purpose of the KVM MMU is to program the paging structures
(CR3/EPT/NPT) that encode the mapping from guest addresses to host physical
addresses, and to provide utilities for other KVM features such as dirty
logging. The mapping from an L1 guest physical address (GPA) to a host
physical address (HPA) is defined in two parts: KVM's memslots map GPA to
HVA, and the kernel MM/x86 host page tables map HVA to HPA. Without TDP, the
MMU must program the x86 page tables to encode the entire GVA to HPA
translation, which requires "shadowing" the guest's page tables to build a
composite x86 paging structure. That approach is complicated: it needs a
separate paging structure for each guest CR3 and has to emulate guest page
table changes. The TDP case is much simpler. There, KVM lets the guest
control CR3 and programs the EPT/NPT paging structures with the GPA to HPA
mapping. The guest has no way to change this mapping, and only one version of
the paging structure is needed per L1 address space (normal execution or
system management mode, on x86).
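
A minimal sketch of the two-part definition described above may help. The
struct and helper below are hypothetical and far simpler than KVM's real
memslot code; they only illustrate that GPA to HVA comes from the memslots,
while HVA to HPA is left to the host MM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

typedef uint64_t gpa_t;
typedef uint64_t hva_t;

/* A memslot maps a contiguous range of guest physical pages to host
 * virtual addresses backed by ordinary host memory. */
struct memslot {
	gpa_t    base_gpa;
	uint64_t npages;
	hva_t    userspace_addr;
};

/* Step 1: GPA -> HVA comes from the memslots. */
static bool gpa_to_hva(const struct memslot *slots, size_t nslots,
		       gpa_t gpa, hva_t *hva)
{
	for (size_t i = 0; i < nslots; i++) {
		gpa_t start = slots[i].base_gpa;
		gpa_t end = start + (slots[i].npages << PAGE_SHIFT);

		if (gpa >= start && gpa < end) {
			*hva = slots[i].userspace_addr + (gpa - start);
			return true;
		}
	}
	return false;	/* No memslot: the access is MMIO or invalid. */
}

/*
 * Step 2: HVA -> HPA is resolved by the host MM (KVM faults the page in,
 * e.g. via get_user_pages, and reads the host page tables); the MMU then
 * writes the resulting GPA -> HPA translation into the EPT/NPT.
 */
```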

This RFC implements a "direct MMU" through alternative implementations
of MMU functions for running L1 guests with TDP. The direct MMU gets its
name from the direct role bit in struct kvm_mmu_page in the existing MMU
implementation, which indicates that the PTEs in a page table (and their
children) map a linear range of L1 GPAs. Though the direct MMU does not
currently use struct kvm_mmu_page, all of its pages would implicitly
have that bit set. The direct MMU falls back to the existing shadow
paging implementation when TDP is not available, and interoperates with
the existing shadow paging implementation for nesting.

This RFC implements a "direct MMU": alternative implementations of the MMU
functions used to run L1 guests with TDP. The direct MMU takes its name from
the direct role bit in struct kvm_mmu_page in the existing MMU
implementation, which indicates that the PTEs of a page table (and of its
children) map a linear range of L1 GPAs. Although the direct MMU does not
currently use struct kvm_mmu_page, all of its pages would implicitly have
that bit set. When TDP is unavailable, the direct MMU falls back to the
existing shadow paging implementation, and for nested virtualization it
interoperates with that implementation.
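
One practical consequence of the "direct" property is that the GPA covered
by any PTE can be computed from its level and index alone, with no per-page
metadata. The helpers below sketch that calculation for the 4-level x86-64
layout; the names are made up for this example and are not from the RFC.

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PT64_LEVEL_BITS	9	/* 512 entries per page table page */

/* Lowest address bit translated by a PTE at the given level (1 = 4K leaf). */
static inline unsigned int level_shift(unsigned int level)
{
	return PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS;
}

/*
 * GPA of the first byte mapped by entry 'index' of a direct page table page
 * whose own linear range starts at 'table_base_gpa'.  A level 1 entry maps
 * 4K, level 2 maps 2M, level 3 maps 1G, and so on.
 */
static inline uint64_t direct_pte_gpa(uint64_t table_base_gpa,
				      unsigned int level, unsigned int index)
{
	return table_base_gpa + ((uint64_t)index << level_shift(level));
}
```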

In order to handle page faults in parallel, the MMU needs to allow a
variety of changes to PTEs concurrently. The first step in this series
is to replace the MMU lock with a read/write lock to enable multiple
threads to perform operations at the same time and interoperate with
functions that still need the monolithic lock. With threads handling
page faults in parallel, the functions operating on the page table
need to: a) ensure PTE modifications are atomic, and b) ensure that page
table memory is freed and accessed safely. Conveniently, the iterator
pattern introduced in this series handles both concerns.

In order to handle page faults in parallel, the MMU must allow concurrent
modifications to PTEs (page table entries). The first step in this series is
to replace the MMU lock with a read/write lock, enabling multiple threads to
perform operations at the same time while still interoperating with the
functions that need the monolithic lock. With threads handling page faults in
parallel, the functions that operate on the page tables need to:

  1. ensure that PTE modifications are atomic, and
  2. ensure that page table memory is freed and accessed safely.

Conveniently, the iterator pattern introduced in this series handles both
concerns.
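
For the first concern, a compare-and-exchange is the natural tool: with the
MMU lock held only for read, exactly one of several racing fault handlers
wins each PTE update. The sketch below models that in plain C11 atomics; it
is not KVM code, and the helper names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t tdp_pte_t;

/* Install 'new_pte' only if the entry still holds 'old_pte'.  On failure
 * another thread raced us; the caller re-reads the entry and decides
 * whether its change is still needed. */
static bool tdp_set_pte_atomic(tdp_pte_t *ptep, uint64_t old_pte,
			       uint64_t new_pte)
{
	return atomic_compare_exchange_strong(ptep, &old_pte, new_pte);
}

/* Typical use: retry until our update lands or becomes unnecessary. */
static void make_pte_writable(tdp_pte_t *ptep, uint64_t writable_bit)
{
	uint64_t old_pte, new_pte;

	do {
		old_pte = atomic_load(ptep);
		new_pte = old_pte | writable_bit;
		if (new_pte == old_pte)
			return;		/* Another thread already did it. */
	} while (!tdp_set_pte_atomic(ptep, old_pte, new_pte));
}
```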

The direct walk iterator implements a pre-order traversal of the TDP
paging structures. Threads are able to read and write page table memory
safely in this traversal through the use of RCU and page table memory is
freed in RCU callbacks, as part of a three step process. (More on that
below.) To ensure that PTEs are updated atomically, the iterator
provides a function for updating the current pte. If the update
succeeds, the iterator handles bookkeeping based on the current and
previous value of the PTE. If it fails, some other thread will have
succeeded, and the iterator repeats that PTE on the next iteration,
transparently retrying the operation. The iterator also handles yielding
and reacquiring the appropriate MMU lock, and flushing the TLB or
queuing work to be done on the next flush.

The direct walk iterator implements a pre-order traversal of the TDP paging
structures. Threads can safely read and write page table memory during this
traversal through the use of RCU, and page table memory is freed in RCU
callbacks as part of a three step process (more on that below). To guarantee
that PTEs are updated atomically, the iterator provides a function for
updating the current PTE. If the update succeeds, the iterator does the
bookkeeping based on the previous and current values of the PTE. If it fails,
some other thread will have succeeded, and the iterator revisits that PTE on
the next iteration, transparently retrying the operation. The iterator also
handles yielding and reacquiring the appropriate MMU lock, and flushing the
TLB or queuing the work to be done at the next flush.
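
The sketch below shows what a pre-order traversal of a direct paging
structure looks like: each PTE is visited before the lower-level table it
points to. This is not the RFC's direct_walk_iterator; the RCU protection,
atomic updates, yielding and TLB bookkeeping described above are omitted, and
in this toy model the address bits of a PTE hold a host pointer to the child
table rather than a physical address.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_ENTRIES	512
#define PAGE_SHIFT	12
#define PT_LEVEL_BITS	9
#define PTE_PRESENT	(1ull << 0)
#define PTE_HUGE	(1ull << 7)	/* leaf entry at level 2 or 3 */
#define PTE_ADDR_MASK	(~0xfffull)

static inline unsigned int level_shift(int level)
{
	return PAGE_SHIFT + (level - 1) * PT_LEVEL_BITS;
}

/* A present, non-huge PTE above level 1 points at a lower page table. */
static inline bool pte_is_lower_table(uint64_t pte, int level)
{
	return level > 1 && (pte & PTE_PRESENT) && !(pte & PTE_HUGE);
}

typedef void (*visit_fn)(uint64_t *ptep, int level, uint64_t gpa);

/* Pre-order walk: the PTE mapping a GPA range is visited before the PTEs
 * of the lower-level table that subdivides that range. */
static void walk_direct_pt(uint64_t *pt, int level, uint64_t base_gpa,
			   visit_fn visit)
{
	for (int i = 0; i < PT_ENTRIES; i++) {
		uint64_t gpa = base_gpa + ((uint64_t)i << level_shift(level));

		visit(&pt[i], level, gpa);
		if (pte_is_lower_table(pt[i], level))
			walk_direct_pt((uint64_t *)(uintptr_t)(pt[i] & PTE_ADDR_MASK),
				       level - 1, gpa, visit);
	}
}
```

A recursive walk is only for clarity here; an iterator that has to yield the
MMU lock mid-traversal and resume afterwards, as described above, would keep
its position explicitly rather than on the call stack.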

In order to minimize TLB flushes, we expand the tlbs_dirty count to
track unflushed changes made through the iterator, so that other threads
know that the in-memory page tables they traverse might not be what the
guest is using to access memory. Page table pages that have been
disconnected from the paging structure root are freed in a three step
process. First the pages are filled with special, nonpresent PTEs so
that guest accesses to them, through the paging structure caches, result
in TDP page faults. Second, the pages are added to a disconnected list,
a snapshot of which is transferred to a free list, after each TLB flush.
The TLB flush clears the paging structure caches, so the guest will no
longer use the disconnected pages. Lastly, the free list is processed
asynchronously to queue RCU callbacks which free the memory. The RCU
grace period ensures no kernel threads are using the disconnected pages.
This allows the MMU to leave the guest in an inconsistent, but safe,
state with respect to the in-memory paging structure. When functions
need to guarantee that the guest will use the in-memory state after a
traversal, they can either flush the TLBs unconditionally or, if using
the MMU lock in write mode, flush the TLBs under the lock only if the
tlbs_dirty count is elevated.

To minimize TLB flushes, the tlbs_dirty count is expanded to track unflushed
changes made through the iterator, so that other threads know that the
in-memory page tables they traverse may not be what the guest is currently
using to access memory (since the TLB has not been flushed yet). Page table
pages that have been disconnected from the paging structure are freed in
three steps:

  1. The pages are filled with special, non-present PTEs, so that guest
     accesses to them through the paging structure caches result in TDP page
     faults.
  2. The pages are added to a disconnected list; after each TLB flush, a
     snapshot of that list is transferred to a free list. The TLB flush
     clears the paging structure caches, so the guest can no longer use the
     disconnected pages.
  3. The free list is processed asynchronously, queuing RCU callbacks that
     free the memory.

The RCU grace period guarantees that no kernel thread is still using the
disconnected pages. This lets the MMU leave the guest in an inconsistent, but
safe, state with respect to the in-memory paging structure. When a function
needs to guarantee that the guest will use the in-memory state after a
traversal, it can either flush the TLBs unconditionally or, when holding the
MMU lock in write mode, flush the TLBs under the lock only if the tlbs_dirty
count is elevated.
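
The kernel-style sketch below models those three steps; it is illustrative
only, not the RFC's code. The pt_page structure, the REMOVED_PTE encoding and
the function names are hypothetical, and the asynchronous processing of the
free list is collapsed into a direct call after the flush.

```c
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define PT_ENTRIES	512
/* A non-present value (bit 0 clear) that concurrent walkers can recognize
 * as "this table has been disconnected"; the exact encoding is made up. */
#define REMOVED_PTE	((u64)1 << 59)

struct pt_page {
	u64 *table;			/* the 4K page of PTEs */
	struct list_head link;
	struct rcu_head rcu;
};

static LIST_HEAD(disconnected_pages);	/* zapped, awaiting a TLB flush */
static DEFINE_SPINLOCK(pt_free_lock);

/* Step 1: fill the disconnected table with the special non-present PTE and
 * park it until the next TLB flush. */
static void disconnect_pt_page(struct pt_page *pp)
{
	int i;

	for (i = 0; i < PT_ENTRIES; i++)
		WRITE_ONCE(pp->table[i], REMOVED_PTE);

	spin_lock(&pt_free_lock);
	list_add_tail(&pp->link, &disconnected_pages);
	spin_unlock(&pt_free_lock);
}

/* Step 3: the RCU grace period has passed, so no walker can still hold a
 * pointer to this table; it is finally safe to free the memory. */
static void pt_page_free_rcu(struct rcu_head *rcu)
{
	struct pt_page *pp = container_of(rcu, struct pt_page, rcu);

	free_page((unsigned long)pp->table);
	kfree(pp);
}

/* Step 2: after a TLB flush the paging-structure caches no longer reference
 * the disconnected pages, so snapshot the list and hand it to RCU. */
static void process_disconnected_pages_after_flush(void)
{
	struct pt_page *pp, *tmp;
	LIST_HEAD(snapshot);

	spin_lock(&pt_free_lock);
	list_splice_init(&disconnected_pages, &snapshot);
	spin_unlock(&pt_free_lock);

	list_for_each_entry_safe(pp, tmp, &snapshot, link) {
		list_del(&pp->link);
		call_rcu(&pp->rcu, pt_page_free_rcu);
	}
}
```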

The use of the direct MMU can be controlled by a module parameter which
is snapshotted on VM creation and follows the life of the VM. This
snapshot is used in many functions to decide whether or not to use
direct MMU handlers for a given operation. This is a maintenance burden
and in future versions of this series I will address that and remove
some of the code the direct MMU replaces. I am especially interested in
feedback from the community as to how this series can best be merged. I
see two broad approaches: replacement and integration or modularization.

The use of the direct MMU is controlled by a module parameter that is
snapshotted when the VM is created and follows the VM for its lifetime. That
snapshot is used in many functions to decide whether to take the direct MMU
path for a given operation. This is a maintenance burden, and future versions
will address it and remove some of the code the direct MMU replaces. I am
especially interested in feedback from the community on how this series can
best be merged. I currently see two broad approaches: replacement plus
integration, or modularization.
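
A sketch of that snapshot pattern is below; the parameter and field names are
hypothetical, not the RFC's. The point is only that the choice is latched
once per VM, so a later change to the module parameter cannot flip the MMU
implementation under a running guest.

```c
#include <linux/module.h>
#include <linux/types.h>

static bool direct_mmu_param = true;
module_param(direct_mmu_param, bool, 0644);
MODULE_PARM_DESC(direct_mmu_param, "Use the direct MMU for TDP guests");

struct my_vm {
	/* Latched once at VM creation; consulted by MMU entry points. */
	bool direct_mmu_enabled;
};

static void my_vm_init_mmu(struct my_vm *vm)
{
	vm->direct_mmu_enabled = READ_ONCE(direct_mmu_param);
}

static int my_handle_page_fault(struct my_vm *vm)
{
	if (vm->direct_mmu_enabled)
		return 0;	/* would call the direct MMU fault handler */
	return -1;		/* would fall back to the shadow MMU path */
}
```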

Replacement and integration would require amending the existing shadow
paging implementation to use a similar iterator pattern. This would mean
expanding the iterator to work with an rmap to support shadow paging and
reconciling the synchronization changes made to the direct case with the
complexities of shadow paging and nesting.

Replacement and integration would mean adding a similar iterator pattern to
the existing shadow paging implementation. That in turn means extending the
iterator to work with the rmap so it can support shadow paging, and
reconciling the synchronization changes made for the direct case with the
complexities of shadow paging and nesting.

The modularization approach would require factoring out the "direct MMU"
or "TDP MMU" and "shadow MMU(s)." The function pointers in the MMU
struct would need to be expanded to fully encompass the interface of the
MMU and multiple, simpler, implementations of those functions would be
needed. As it is, use of the module parameter snapshot gives us a rough
outline of the previously undocumented shape of the MMU interface, which
could facilitate modularization. Modularization could allow for the
separation of the shadow paging implementations for running guests
without TDP, and running nested guests with TDP, and the breakup of
paging_tmpl.h.

Modularization would mean factoring the code into separate parts: a "direct
MMU" or "TDP MMU" and one or more "shadow MMUs". The function pointers in the
MMU struct would have to be expanded to fully cover the MMU interface, with
multiple, simpler implementations of those functions behind it. As it stands,
the module parameter snapshot already gives a rough outline of the previously
undocumented shape of the MMU interface, which could serve as a basis for
modularization. Modularization would allow the shadow paging implementation
for guests running without TDP and the one for nested guests running with TDP
to be separated, and would allow paging_tmpl.h to be broken up.
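
A minimal sketch of what such a modular interface could look like is below: a
per-VM table of function pointers selected at creation time. The specific
operations listed are assumptions for illustration, not KVM's actual MMU
interface.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

struct vm;

/* One implementation of this table for the direct/TDP MMU, another (or
 * several) for shadow paging. */
struct mmu_ops {
	int  (*page_fault)(struct vm *vm, gpa_t gpa, uint32_t error_code);
	void (*zap_gfn_range)(struct vm *vm, gpa_t start, gpa_t end);
	bool (*age_gfn_range)(struct vm *vm, gpa_t start, gpa_t end);
	void (*enable_dirty_logging)(struct vm *vm, gpa_t start, gpa_t end);
};

struct vm {
	const struct mmu_ops *mmu;	/* chosen once, at VM creation */
};

/* Callers never branch on "direct vs. shadow"; they just call through. */
static int vm_handle_page_fault(struct vm *vm, gpa_t gpa, uint32_t err)
{
	return vm->mmu->page_fault(vm, gpa, err);
}
```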

In addition to the integration question, below are some of the work
items I plan to address before sending the series out again:

In addition to the integration question, here again are the work items I plan
to address before sending the series out:

Disentangle the iterator pattern from the synchronization changes
Currently the direct_walk_iterator is very closely tied to the use
of atomic operations, RCU, and a rwlock for MMU operations. This
does not need to be the case: instead I would like to see those
synchronization changes built on top of this iterator pattern.

Support 5 level paging and PAE
Currently the direct walk iterator only supports 4 level, 64bit
architectures.

Support MMU memory reclaim
Currently this patch series does not respect memory limits applied
through kvm_vm_ioctl_set_nr_mmu_pages.

Support nonpaging guests
Guests that are not using virtual addresses can be direct mapped,
even without TDP.

Implement fast invalidation of all PTEs
This series was prepared between when the fast invalidate_all
mechanism was removed and when it was re-added. Currently, there
is no fast path for invalidating all direct MMU PTEs.

Move more operations to execute concurrently
In this patch series, only page faults are able to execute
concurrently, however several other functions can also execute
concurrently, simply by changing the write lock acquisition to a
read lock.

Disentangle the iterator pattern from the synchronization changes
The direct_walk_iterator is currently tied closely to the use of atomic
operations, RCU, and a rwlock for MMU operations. This does not have to be
the case; those synchronization changes should instead be built on top of the
iterator pattern.

Support 5 level paging and PAE
The direct walk iterator currently only supports 4 level, 64-bit
architectures.

Support MMU memory reclaim
This patch series does not yet respect the memory limits applied through
kvm_vm_ioctl_set_nr_mmu_pages.

Support nonpaging guests
Guests that are not using virtual addresses can be direct mapped, even
without TDP.

Implement fast invalidation of all PTEs

Move more operations to execute concurrently

refer to:
https://www.spinics.net/lists/kvm/msg196464.html