Over the years, the needs of KVM's x86 MMU have grown from running small guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where we previously depended upon shadow paging to run all guests, we now have two dimensional paging (TDP). This RFC proposes and demonstrates two major changes to the MMU. First, an iterator abstraction that simplifies traversal of TDP paging structures when running an L1 guest. This abstraction takes advantage of the relative simplicity of TDP to streamline the implementation of MMU functions. Second, this RFC changes the synchronization model to enable more parallelism than the monolithic MMU lock. This "direct mode" MMU is currently in use at Google and has given us the performance necessary to live migrate our 416 vCPU, 12TiB m2-ultramem-416 VMs.
The primary motivation for this work was to handle page faults in parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's MMU lock suffers from extreme contention, resulting in soft lockups and jitter in the guest. To demonstrate this, I have also written, and will submit, a demand paging test for KVM selftests. The test creates N vCPUs, each of which touches a disjoint region of memory. Page faults are picked up by N userfaultfd handlers, one per vCPU. Over a 1 second profile of the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the execution time was spent waiting for the MMU lock! With this patch series the total execution time for the test was reduced by 89%, and the execution was dominated by get_user_pages and the userfaultfd ioctl. As a secondary benefit, the iterator-based implementation does not use the rmap or struct kvm_mmu_page, saving ~0.2% of guest memory in KVM overheads.
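For reference, below is a minimal sketch of the per-vCPU userfaultfd handler pattern the test relies on. The structure, the 4096-byte page size, and the source_page buffer are assumptions for illustration; the actual selftest is organized differently.

/* Handler thread for one vCPU's userfaultfd (illustrative sketch). */
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static char source_page[4096] __attribute__((aligned(4096)));

static void *uffd_handler_thread(void *arg)
{
	int uffd = *(int *)arg;	/* registered for one vCPU's memory region */
	struct uffd_msg msg;
	struct uffdio_copy copy;

	for (;;) {
		struct pollfd pfd = { .fd = uffd, .events = POLLIN };

		if (poll(&pfd, 1, -1) <= 0)
			break;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault by copying in a page of test data. */
		copy.dst = msg.arg.pagefault.address & ~(4096UL - 1);
		copy.src = (unsigned long)source_page;
		copy.len = 4096;
		copy.mode = 0;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}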
The goal of this RFC is to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case," and the changes to the synchronization model. Though they are interwoven in this series, I will separate the iterator from the synchronization changes in a future series. I recognize that some feature work will be needed to make this patch set ready for merging. That work is detailed at the end of this cover letter.
The overall purpose of the KVM MMU is to program paging structures (CR3/EPT/NPT) to encode the mapping of guest addresses to host physical addresses (HPAs), and to provide utilities for other KVM features, for example dirty logging. The definition of the L1 guest physical address (GPA) to HPA mapping comes in two parts: KVM's memslots map GPAs to host virtual addresses (HVAs), and the kernel MM / x86 host page tables map HVAs to HPAs. Without TDP, the MMU must program the x86 page tables to encode the full translation of guest virtual addresses (GVAs) to HPAs. This requires "shadowing" the guest's page tables to create a composite x86 paging structure. This solution is complicated, requires separate paging structures for each guest CR3, and requires emulating guest page table changes. The TDP case is much simpler. In this case, KVM lets the guest control CR3 and programs the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has no way to change this mapping, and only one version of the paging structure is needed per L1 address space (normal execution or system management mode, on x86).
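As a rough illustration of that two-part mapping (not KVM's actual fault path, which resolves PFNs internally), the translation of a GPA could be sketched as follows, using existing KVM and MM helpers:

#include <linux/kvm_host.h>
#include <linux/mm.h>

/*
 * Illustrative only: stage 1 uses the memslots to go from GPA to HVA,
 * stage 2 asks the host MM to resolve the HVA to a host physical page.
 */
static kvm_pfn_t gpa_to_hpa_sketch(struct kvm *kvm, gpa_t gpa)
{
	gfn_t gfn = gpa >> PAGE_SHIFT;
	unsigned long hva;
	struct page *page;

	/* GPA -> HVA via the memslots. */
	hva = gfn_to_hva(kvm, gfn);
	if (kvm_is_error_hva(hva))
		return KVM_PFN_NOSLOT;

	/* HVA -> HPA via the host page tables. */
	if (get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE) != 1)
		return KVM_PFN_ERR_FAULT;

	return page_to_pfn(page);
}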
This RFC implements a "direct MMU" through alternative implementations of MMU functions for running L1 guests with TDP. The direct MMU gets its name from the direct role bit in struct kvm_mmu_page in the existing MMU implementation, which indicates that the PTEs in a page table (and their children) map a linear range of L1 GPAs. Though the direct MMU does not currently use struct kvm_mmu_page, all of its pages would implicitly have that bit set. The direct MMU falls back to the existing shadow paging implementation when TDP is not available, and interoperates with the existing shadow paging implementation for nesting.
In order to handle page faults in parallel, the MMU needs to allow a variety of changes to PTEs concurrently. The first step in this series is to replace the MMU lock with a read/write lock, enabling multiple threads to perform operations at the same time while interoperating with functions that still need the monolithic lock. With threads handling page faults in parallel, the functions operating on the page tables need to: a) ensure PTE modifications are atomic, and b) ensure that page table memory is freed and accessed safely. Conveniently, the iterator pattern introduced in this series handles both concerns.
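Concretely, the locking change looks roughly like the sketch below; the struct and field names, and the split of which paths take the lock in read vs. write mode, are simplified illustrations rather than the series' actual code:

#include <linux/spinlock.h>

/* Illustrative stand-in for the arch struct holding the MMU lock. */
struct kvm_arch_sketch {
	rwlock_t mmu_lock;	/* replaces the monolithic lock */
};

static void mmu_lock_init_sketch(struct kvm_arch_sketch *arch)
{
	rwlock_init(&arch->mmu_lock);
}

/* Page fault handlers can run concurrently under the lock in read mode. */
static void handle_tdp_page_fault_sketch(struct kvm_arch_sketch *arch)
{
	read_lock(&arch->mmu_lock);
	/* ... walk the paging structures, updating PTEs atomically ... */
	read_unlock(&arch->mmu_lock);
}

/* Operations that still need exclusion take the lock in write mode. */
static void zap_all_sketch(struct kvm_arch_sketch *arch)
{
	write_lock(&arch->mmu_lock);
	/* ... operate on the paging structures exclusively ... */
	write_unlock(&arch->mmu_lock);
}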
The direct walk iterator implements a pre-order traversal of the TDP paging structures. Threads can safely read and write page table memory during this traversal through the use of RCU, and page table memory is freed in RCU callbacks as part of a three step process. (More on that below.) To ensure that PTEs are updated atomically, the iterator provides a function for updating the current PTE. If the update succeeds, the iterator handles bookkeeping based on the current and previous value of the PTE. If it fails, some other thread will have succeeded, and the iterator repeats that PTE on the next iteration, transparently retrying the operation. The iterator also handles yielding and reacquiring the appropriate MMU lock, and flushing the TLB or queuing work to be done on the next flush.
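At its core, the atomic update and retry behavior amounts to a compare-and-exchange on the PTE, roughly as sketched below; the real direct_walk_iterator wraps this in its PTE-setting helper along with the associated bookkeeping:

#include <linux/atomic.h>
#include <linux/types.h>

/*
 * Install new_pte only if the PTE still holds the value this thread
 * observed during its traversal. If another thread won the race,
 * return false so the caller re-reads the PTE and retries on the
 * next iteration.
 */
static bool set_pte_atomic_sketch(u64 *ptep, u64 old_pte, u64 new_pte)
{
	return cmpxchg64(ptep, old_pte, new_pte) == old_pte;
}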
In order to minimize TLB flushes, we expand the tlbs_dirty count to track unflushed changes made through the iterator, so that other threads know that the in-memory page tables they traverse might not be what the guest is using to access memory.

Page table pages that have been disconnected from the paging structure root are freed in a three step process. First, the pages are filled with special, non-present PTEs so that guest accesses to them through the paging structure caches result in TDP page faults. Second, the pages are added to a disconnected list; a snapshot of that list is transferred to a free list after each TLB flush. The TLB flush clears the paging structure caches, so the guest will no longer use the disconnected pages. Lastly, the free list is processed asynchronously to queue RCU callbacks which free the memory. The RCU grace period ensures no kernel threads are using the disconnected pages. This allows the MMU to leave the guest in an inconsistent, but safe, state with respect to the in-memory paging structure. When functions need to guarantee that the guest will use the in-memory state after a traversal, they can either flush the TLBs unconditionally or, if using the MMU lock in write mode, flush the TLBs under the lock only if the tlbs_dirty count is elevated.
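The three step freeing scheme could be sketched as follows; the struct and list names, the locking (omitted here), and the RCU callback are illustrative rather than the series' actual implementation:

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct disconnected_pt_page {
	struct list_head link;
	struct rcu_head rcu;
	void *pt_mem;		/* the disconnected page table page */
};

/* Step 3: after an RCU grace period, no kernel thread can still hold
 * a pointer into the page table memory, so it can finally be freed. */
static void free_pt_page_rcu(struct rcu_head *rcu)
{
	struct disconnected_pt_page *page =
		container_of(rcu, struct disconnected_pt_page, rcu);

	free_page((unsigned long)page->pt_mem);
	kfree(page);
}

/* Step 1: at disconnect time the page is filled with special
 * non-present PTEs (not shown) and added to the disconnected list. */
static void disconnect_pt_page(struct list_head *disconnected_list,
			       struct disconnected_pt_page *page)
{
	list_add_tail(&page->link, disconnected_list);
}

/* Step 2: after a TLB flush has cleared the paging structure caches,
 * the snapshot of disconnected pages can be queued for RCU freeing. */
static void handle_tlb_flush(struct list_head *disconnected_list)
{
	struct disconnected_pt_page *page, *tmp;

	list_for_each_entry_safe(page, tmp, disconnected_list, link) {
		list_del(&page->link);
		call_rcu(&page->rcu, free_pt_page_rcu);
	}
}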
The use of the direct MMU can be controlled by a module parameter which is snapshotted on VM creation and follows the life of the VM. This snapshot is used in many functions to decide whether or not to use the direct MMU handlers for a given operation. This is a maintenance burden, and in future versions of this series I will address that and remove some of the code the direct MMU replaces. I am especially interested in feedback from the community as to how this series can best be merged. I see two broad approaches: replacement and integration, or modularization.
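A sketch of the parameter snapshot, with illustrative names (the series' actual parameter, struct, and field names may differ):

#include <linux/module.h>

static bool direct_mmu_enabled = true;
module_param(direct_mmu_enabled, bool, 0444);

/* Illustrative stand-in for the VM-wide state holding the snapshot. */
struct kvm_sketch {
	bool direct_mmu;	/* snapshotted once, for the life of the VM */
};

static void kvm_create_vm_sketch(struct kvm_sketch *kvm)
{
	kvm->direct_mmu = READ_ONCE(direct_mmu_enabled);
}

static void direct_mmu_handle_op(void) { /* direct MMU path (hypothetical) */ }
static void shadow_mmu_handle_op(void) { /* shadow MMU path (hypothetical) */ }

/* MMU entry points dispatch on the snapshot, not the live parameter. */
static void kvm_mmu_op_sketch(struct kvm_sketch *kvm)
{
	if (kvm->direct_mmu)
		direct_mmu_handle_op();
	else
		shadow_mmu_handle_op();
}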
Replacement and integration would require amending the existing shadow paging implementation to use a similar iterator pattern. This would mean expanding the iterator to work with an rmap to support shadow paging and reconciling the synchronization changes made to the direct case with the complexities of shadow paging and nesting.
The modularization approach would require factoring out the "direct MMU" or "TDP MMU" and the "shadow MMU(s)." The function pointers in the MMU struct would need to be expanded to fully encompass the interface of the MMU, and multiple, simpler implementations of those functions would be needed. As it is, use of the module parameter snapshot gives us a rough outline of the previously undocumented shape of the MMU interface, which could facilitate modularization. Modularization could allow the shadow paging implementations for running guests without TDP and for running nested guests with TDP to be separated, and could enable the breakup of paging_tmpl.h.
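To make the modularization idea concrete, the expanded interface might take a shape like the following; none of these names come from the series, they only illustrate the kind of ops table that would be needed:

#include <linux/types.h>

struct kvm_vcpu_sketch;

/* Hypothetical ops table covering the currently implicit MMU interface. */
struct kvm_mmu_ops_sketch {
	int  (*page_fault)(struct kvm_vcpu_sketch *vcpu, u64 gpa, u32 error_code);
	void (*zap_all)(struct kvm_vcpu_sketch *vcpu);
	void (*enable_dirty_logging)(struct kvm_vcpu_sketch *vcpu);
	/* ... remaining MMU operations ... */
};

/* One, simpler, implementation per MMU flavor. */
extern const struct kvm_mmu_ops_sketch direct_mmu_ops;	/* TDP, L1 only */
extern const struct kvm_mmu_ops_sketch shadow_mmu_ops;	/* no TDP / nested */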
Disentangle the iterator pattern from the synchronization changes
Currently the direct_walk_iterator is very closely tied to the use of atomic operations, RCU, and a rwlock for MMU operations. This does not need to be the case: instead I would like to see those synchronization changes built on top of this iterator pattern.
Support 5 level paging and PAE
Currently the direct walk iterator only supports 4-level paging on 64-bit architectures.
Support MMU memory reclaim
Currently this patch series does not respect memory limits applied through kvm_vm_ioctl_set_nr_mmu_pages.
Support nonpaging guests
Guests that are not using virtual addresses can be direct mapped, even without TDP.
Implement fast invalidation of all PTEs
This series was prepared between when the fast invalidate_all mechanism was removed and when it was re-added. Currently, there is no fast path for invalidating all direct MMU PTEs.
Move more operations to execute concurrently
In this patch series, only page faults are able to execute concurrently; however, several other functions could also execute concurrently simply by changing the write lock acquisition to a read lock.