JVM metaspace memory leak analysis

Background

Our system became unavailable due to java.lang.OutOfMemoryError: Metaspace. Following a Stack Overflow answer (https://stackoverflow.com/questions/36051813/java8-java-lang-outofmemoryerror-metaspace), I started to figure out what was wrong with the system.

Before starting to analyze the JVM memory dump, we need to understand what a metaspace leak actually means. In this case we found that threads failed to execute because task creation failed due to the OOM. There are two typical causes:

  • Too many classes need to be loaded
  • A memory leak keeps some classes from being unloaded

Metaspace holds class metadata rather than per-object data: every time a class is loaded, metaspace is allocated to store its metadata, and that space is only reclaimed when the class is unloaded.

The following code reproduces the issue:

import javassist.ClassPool;

public class MetaspaceOOM {
    static ClassPool cp = ClassPool.getDefault();

    public static void main(String[] args) throws Exception {
        for (int i = 0; ; i++) {
            // each iteration defines and loads a brand-new class,
            // so class metadata keeps piling up in metaspace
            Class c = cp.makeClass("eu.plumbr.demo.Generated" + i).toClass();
        }
    }
}

pom.xml

<dependency>
    <groupId>org.javassist</groupId>
    <artifactId>javassist</artifactId>
    <version>3.27.0-GA</version>
</dependency>

So a dynamically created class adds new class metadata to metaspace, but if no class loader holds a reference to the class, it is unloaded and its metaspace is reclaimed after GC.

So it's clear that when metaspace is leaking, the application looks fine right after startup, and the OOM only appears after it has been running for a while.
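To reproduce the OOM quickly with the MetaspaceOOM class above, it helps to cap metaspace and turn on class-loading logging when launching the JVM. This is just a sketch of the command I would use; the 64 MB cap is an arbitrary value for the experiment and the classpath entries are assumptions about where the compiled class and the javassist jar live:

java -cp target/classes:javassist-3.27.0-GA.jar -XX:MaxMetaspaceSize=64m -verbose:class MetaspaceOOM

With -verbose:class you can watch the Generated classes being loaded one by one until the metaspace limit is hit.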

Tools

Several tools are available for memory leak analysis; I only cover the ones I used while investigating this issue.

JConsole

JConsole is a graphical monitoring tool for the Java Virtual Machine (JVM) and Java applications, on either a local or a remote machine.

JConsole uses underlying features of Java Virtual Machine to provide information on performance and resource consumption of applications running on the Java platform using Java Management Extensions (JMX) technology. JConsole comes as part of Java Development Kit (JDK) and the graphical console can be started using “jconsole” command.

As JConsole is part of the JDK, it is easy to find. On macOS you can launch it directly from the command line:

jconsole

or find your java_home by

/usr/libexec/java_home -V

and jconsole can be found at

/Library/Java/JavaVirtualMachines/jdk1.8.0_271.jdk/Contents/Home/bin/jconsole

Then connect to a local or remote application.

Note:

To use JConsole remotely, JMX needs to be enabled in your application's JVM options:

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=10000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=your_ip_address

Make sure the IP address and port are reachable for remote access.

From JConsole we can monitor the number of loaded and unloaded classes, trigger GC manually, and check whether metaspace keeps increasing.
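JConsole reads these numbers over JMX from the standard platform MXBeans, so the same counters can also be polled programmatically. Below is a minimal sketch (the class name ClassLoadWatcher and the one-minute interval are my own choices, not part of any tool) that prints the loaded/unloaded class counts and the metaspace usage of the JVM it runs in; for a remote application you would still go through JMX as described above:

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class ClassLoadWatcher {
    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        while (true) {
            // the same counters JConsole shows on its Classes tab
            System.out.printf("loaded=%d unloaded=%d totalLoaded=%d%n",
                    classes.getLoadedClassCount(),
                    classes.getUnloadedClassCount(),
                    classes.getTotalLoadedClassCount());
            // metaspace usage, same pool JConsole plots
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    System.out.println("metaspace used = " + pool.getUsage().getUsed() / 1024 / 1024 + " mb");
                }
            }
            Thread.sleep(60_000); // sample once a minute
        }
    }
}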

VisualVM

A replacement for JConsole that, besides JVM monitoring, also offers heap dumps and application snapshots, which makes it quite efficient for memory leak analysis.

Memory Analyzer (MAT)

The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps you find memory leaks and reduce memory consumption. Use the Memory Analyzer to analyze productive heap dumps with hundreds of millions of objects, quickly calculate the retained sizes of objects, see who is preventing the Garbage Collector from collecting objects, and run a report to automatically extract leak suspects.

My application uses OpenJDK 1.8, so only MemoryAnalyzer-1.11.0.20201202-macosx.cocoa.x86_64.dmg could be used; later MAT releases only support Java 11.

Start debugging

Combining the background and the tools, spotting continuously increasing memory is easy, but tracing it back to the code is more difficult.

We can split the debugging procedure into several steps:

  • Take heap dumps over a period of time to learn which classes are always in memory and which are created dynamically
  • Compare the heap dumps to figure out what is increasing metaspace
  • Find the operation that causes the leak

Use VisualVM to connect to the Java application and monitor it for a while.

The metaspace leak shows up clearly: even after manually triggering GC, usage does not decrease.

After a GC we have 72246 loaded classes.

About one minute later, another GC was performed, but the loaded class count had increased to 72262.


Obviously there is a leak, so, still using one minute as the interval, I collected two heap dumps for the next step of analysis.

Open the heap dumps with MAT and compare them, starting from the dominator tree.

Luckily, we found the growing classes at first glance: SessionFactoryImpl seems to be to blame.

Besides this, we found another issue with Groovy's GStringTemplateEngine, which makes groovy.reflection.ClassInfo grow; the details are covered in the next section.

Expand the suspects to get more details. Listing all objects shows that the query plan cache has grown, so let's just google that name.

from https://docs.jboss.org/hibernate/orm/5.0/javadocs/org/hibernate/engine/query/spi/QueryPlanCache.html: Acts as a cache for compiled query plans, as well as query-parameter metadata.

And check its source code:

/**
 * the cache of the actual plans...
 */
private final BoundedConcurrentHashMap queryPlanCache;

It is a bounded hash map, created like this:

queryPlanCache = new BoundedConcurrentHashMap( maxQueryPlanCount, 20, BoundedConcurrentHashMap.Eviction.LIRS );

Hibernate uses 2048 as the default maxQueryPlanCount and LIRS as the eviction strategy, so the cache is bounded and its growth is not something we need to fix.
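For completeness: if the query plan cache ever did need to be bounded more tightly, Hibernate 5 exposes the limit as the hibernate.query.plan_cache_max_size setting (which feeds maxQueryPlanCount). A minimal sketch of wiring it in when building the SessionFactory; the class name and the value 1024 are illustrative only:

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class SessionFactoryBuilderExample {
    public static SessionFactory build() {
        Configuration cfg = new Configuration().configure(); // reads hibernate.cfg.xml
        // cap the compiled query plan cache (the default is 2048 entries)
        cfg.setProperty("hibernate.query.plan_cache_max_size", "1024");
        return cfg.buildSessionFactory();
    }
}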

GStringTemplateEngine

Following the same debugging steps, we find that GStringTemplateScript classes occupy metaspace and thousands of them get created.

Checking its implementation,

groovyClass = loader.parseClass(new GroovyCodeSource(templateExpressions.toString(), "GStringTemplateScript" + GStringTemplateEngine.counter.incrementAndGet() + ".groovy", "x"));

A new GStringTemplateScript class is generated every time a GString template is created, so each template creation triggers dynamic class generation and eventually the JVM runs out of metaspace.

There is a related fix, https://issues.apache.org/jira/browse/GROOVY-7017, which uses a separate class loader so that GString template classes can be garbage collected from the heap, but metaspace can still stay occupied.

Another article, https://tigase.net/how-aws-helped-us-optimize-memory-usage-tigase-http-api/, describes how AWS helped Tigase optimize the HTTP API memory usage, and suggests the following fix:

  • Load all templates at once using a single GStringTemplateEngine and cache the generated templates. No more automatic reloading of templates.
  • When a manual reload of templates is initiated, release the old instance of GStringTemplateEngine and parse the templates using the new one.

So let's run some tests before changing our code:

import groovy.text.GStringTemplateEngine;
import javassist.ClassPool;

import java.io.IOException;
import java.io.StringReader;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.HashMap;

public class GroovyMetaspaceOOM {
    static ClassPool cp = ClassPool.getDefault();

    public static void main(String[] args) throws Exception {
        final String EVENT_CHINESE_TEMPLATE = "事件 发生了 事件详情: 名称:";

        GStringTemplateEngine engine = new GStringTemplateEngine();
        HashMap<Integer, groovy.text.Template> templateHashMap = new HashMap<>();

        while (true) {
            // reuse the cached template if it was created before
            groovy.text.Template template = templateHashMap.get(EVENT_CHINESE_TEMPLATE.hashCode());
            if (template == null) {
                try {
                    template = (groovy.text.Template) engine.createTemplate(new StringReader(EVENT_CHINESE_TEMPLATE));
                } catch (ClassNotFoundException | IOException e) {
                    throw new RuntimeException(e);
                }

                templateHashMap.put(EVENT_CHINESE_TEMPLATE.hashCode(), template);
            }

            System.out.println(template.make().toString());

            // print the current metaspace usage
            for (MemoryPoolMXBean memoryMXBean : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(memoryMXBean.getName())) {
                    System.out.println(memoryMXBean.getUsage().getUsed() / 1024 / 1024 + " mb");
                }
            }

            System.gc();
        }
    }
}

By using templateHashMap as a cache to avoid duplicate template creation, we can see that metaspace usage does not change.

But if we change the code to drop the templateHashMap cache:

import groovy.text.GStringTemplateEngine;
import javassist.ClassPool;

import java.io.IOException;
import java.io.StringReader;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GroovyMetaspaceOOM {
    static ClassPool cp = ClassPool.getDefault();

    public static void main(String[] args) throws Exception {
        final String EVENT_CHINESE_TEMPLATE = "事件 发生了 事件详情: 名称:";

        GStringTemplateEngine engine = new GStringTemplateEngine();

        while (true) {
            // no cache: a new template (and a new generated class) is created on every iteration
            groovy.text.Template template;
            try {
                template = (groovy.text.Template) engine.createTemplate(new StringReader(EVENT_CHINESE_TEMPLATE));
            } catch (ClassNotFoundException | IOException e) {
                throw new RuntimeException(e);
            }

            System.out.println(template.make().toString());

            for (MemoryPoolMXBean memoryMXBean : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(memoryMXBean.getName())) {
                    System.out.println(memoryMXBean.getUsage().getUsed() / 1024 / 1024 + " mb");
                }
            }

            System.gc();
        }
    }
}

metaspace keeps growing until the OOM occurs.
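Based on the Tigase suggestions above, the change we want in our code is roughly: parse each distinct template text once, cache the parsed Template, and on a manual reload drop the whole engine together with the cache so the old generated classes can become unreachable. The sketch below only illustrates the idea; the class name TemplateCache is mine and the reload path is simplified (it is not strictly race-free), so treat it as a starting point rather than the final implementation:

import groovy.text.GStringTemplateEngine;
import groovy.text.Template;

import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TemplateCache {
    private volatile GStringTemplateEngine engine = new GStringTemplateEngine();
    private final Map<String, Template> cache = new ConcurrentHashMap<>();

    /** Parse each distinct template text once; reuse the generated class afterwards. */
    public Template get(String templateText) {
        return cache.computeIfAbsent(templateText, text -> {
            try {
                return engine.createTemplate(new StringReader(text));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }

    /** On manual reload, drop the old engine and cached templates so their generated classes can be unloaded. */
    public synchronized void reload() {
        engine = new GStringTemplateEngine();
        cache.clear();
    }
}

With this in place, the number of GStringTemplateScript classes stays proportional to the number of distinct templates instead of the number of render calls.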

Input–output memory management unit

From Wikipedia, the free encyclopedia

In computing, an input–output memory management unit (IOMMU) is a memory management unit (MMU) connecting a direct-memory-access–capable (DMA-capable) I/O bus to the main memory. Like a traditional MMU, which translates CPU-visible virtual addresses to physical addresses, the IOMMU maps device-visible virtual addresses (also called device addresses or I/O addresses in this context) to physical addresses. Some units also provide memory protection from faulty or malicious devices.

In computing, an input–output memory management unit (IOMMU) is a memory management unit (MMU) that connects a DMA-capable (direct-memory-access-capable) I/O bus to main memory. Like a traditional MMU, which translates CPU-visible virtual addresses into physical addresses, the IOMMU maps device-visible virtual addresses (also called device addresses or I/O addresses) to physical addresses. Some units also provide memory protection against faulty or malicious devices.

An example IOMMU is the graphics address remapping table (GART) used by AGP and PCI Express graphics cards on Intel Architecture and AMD computers.

On the x86 architecture, prior to splitting the functionality of northbridge and southbridge between the CPU and Platform Controller Hub (PCH), I/O virtualization was not performed by the CPU but instead by the chipset.[1][2]

For example, the graphics address remapping table (GART) used by AGP and PCI Express graphics cards on Intel Architecture and AMD computers is an IOMMU.

On the x86 architecture, before the northbridge/southbridge functionality was split between the CPU and the Platform Controller Hub (PCH), I/O virtualization was not performed by the CPU but by the chipset.

The advantages of having an IOMMU, compared to direct physical addressing of the memory (DMA), include:

  • Large regions of memory can be allocated without the need to be contiguous in physical memory – the IOMMU maps contiguous virtual addresses to the underlying fragmented physical addresses. Thus, the use of vectored I/O (scatter-gather lists) can sometimes be avoided.
  • Devices that do not support memory addresses long enough to address the entire physical memory can still address the entire memory through the IOMMU, avoiding overheads associated with copying buffers to and from the peripheral’s addressable memory space.
  • For example, x86 computers can address more than 4 gigabytes of memory with the Physical Address Extension (PAE) feature in an x86 processor. Still, an ordinary 32-bit PCI device simply cannot address the memory above the 4 GiB boundary, and thus it cannot directly access it. Without an IOMMU, the operating system would have to implement time-consuming bounce buffers (also known as double buffers[3]).
  • Memory is protected from malicious devices that are attempting DMA attacks and faulty devices that are attempting errant memory transfers because a device cannot read or write to memory that has not been explicitly allocated (mapped) for it. The memory protection is based on the fact that OS running on the CPU (see figure) exclusively controls both the MMU and the IOMMU. The devices are physically unable to circumvent or corrupt configured memory management tables.
  • In virtualization, guest operating systems can use hardware that is not specifically made for virtualization. Higher performance hardware such as graphics cards use DMA to access memory directly; in a virtual environment all memory addresses are re-mapped by the virtual machine software, which causes DMA devices to fail. The IOMMU handles this re-mapping, allowing the native device drivers to be used in a guest operating system.
  • In some architectures IOMMU also performs hardware interrupt re-mapping, in a manner similar to standard memory address re-mapping.
  • Peripheral memory paging can be supported by an IOMMU. A peripheral using the PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension can detect and signal the need for memory manager services.

For system architectures in which port I/O is a distinct address space from the memory address space, an IOMMU is not used when the CPU communicates with devices via I/O ports. In system architectures in which port I/O and memory are mapped into a suitable address space, an IOMMU can translate port I/O accesses.

The advantages of an IOMMU, compared with direct physical addressing of the memory (DMA), include:

  • Large regions of memory can be allocated without having to be contiguous in physical memory: the IOMMU maps contiguous virtual addresses onto the underlying fragmented physical addresses, so the use of vectored I/O (scatter-gather lists) can sometimes be avoided.
  • Devices that cannot address the entire physical memory can still reach all of it through the IOMMU, avoiding the overhead of copying buffers to and from the peripheral's addressable memory space.
    • For example, x86 computers can address more than 4 GB of memory with the Physical Address Extension (PAE) feature, but an ordinary 32-bit PCI device simply cannot address memory above the 4 GiB boundary and therefore cannot access it directly. Without an IOMMU, the operating system has to implement time-consuming bounce buffers (also called double buffers).
  • Memory is protected from malicious devices attempting DMA attacks and from faulty devices attempting errant transfers, because a device cannot read or write memory that has not been explicitly allocated (mapped) for it. The protection relies on the fact that the OS running on the CPU exclusively controls both the MMU and the IOMMU; devices are physically unable to circumvent or corrupt the configured memory management tables.
    • In virtualization, guest operating systems can use hardware that was not specifically designed for virtualization. High-performance hardware such as graphics cards uses DMA to access memory directly; in a virtual environment all memory addresses are remapped by the virtual machine software, which would make DMA devices fail. The IOMMU handles this remapping, allowing native device drivers to be used inside a guest OS.
  • In some architectures the IOMMU also performs hardware interrupt remapping, in a manner similar to standard memory address remapping.
  • Peripheral memory paging can be supported by an IOMMU: a peripheral using the PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension can detect and signal its need for memory manager services.

For system architectures in which port I/O is a separate address space from the memory address space, an IOMMU is not used when the CPU communicates with devices via I/O ports; in architectures where port I/O and memory are mapped into a suitable address space, an IOMMU can translate port I/O accesses.

The disadvantages of having an IOMMU, compared to direct physical addressing of the memory, include:[4]

  • Some degradation of performance from translation and management overhead (e.g., page table walks).
  • Consumption of physical memory for the added I/O page (translation) tables. This can be mitigated if the tables can be shared with the processor.
  • In order to decrease the page table size the granularity of many IOMMUs is equal to the memory paging (often 4096 bytes), and hence each small buffer that needs protection against DMA attack has to be page aligned and zeroed before making visible to the device. Due to OS memory allocation complexity this means that the device driver needs to use bounce buffers for the sensitive data structures and hence decreasing overall performance.

The disadvantages of an IOMMU, compared with direct physical addressing of the memory, include:

  • Some degradation of performance from translation and management overhead (e.g., page table walks).
  • Consumption of physical memory for the added I/O page (translation) tables; this can be mitigated if the tables can be shared with the processor.
  • To keep the page tables small, the granularity of many IOMMUs equals the memory page size (often 4096 bytes), so every small buffer that needs protection against DMA attacks has to be page-aligned and zeroed before being made visible to the device. Because of the complexity of OS memory allocation, this means the device driver has to use bounce buffers for sensitive data structures (since they have to be allocated dynamically), which lowers overall performance.

When an operating system is running inside a virtual machine, including systems that use paravirtualization, such as Xen and KVM, it does not usually know the host-physical addresses of memory that it accesses. This makes providing direct access to the computer hardware difficult, because if the guest OS tried to instruct the hardware to perform a direct memory access (DMA) using guest-physical addresses, it would likely corrupt the memory, as the hardware does not know about the mapping between the guest-physical and host-physical addresses for the given virtual machine. The corruption can be avoided if the hypervisor or host OS intervenes in the I/O operation to apply the translations. However, this approach incurs a delay in the I/O operation.

An IOMMU solves this problem by re-mapping the addresses accessed by the hardware according to the same (or a compatible) translation table that is used to map guest-physical address to host-physical addresses.[5]

When an operating system runs inside a virtual machine, including systems that use paravirtualization such as Xen and KVM, it usually does not know the host-physical addresses of the memory it accesses. That makes providing direct access to the hardware difficult: if the guest OS instructed the hardware to perform DMA using guest-physical addresses, it would likely corrupt memory (because of virtualization, the guest's memory space is simply not the same thing as the host's), since the hardware knows nothing about the mapping between guest-physical and host-physical addresses for that virtual machine. The corruption can be avoided if the hypervisor or host OS intervenes in the I/O operation to apply the translation, but that approach slows the I/O down.

An IOMMU solves this problem by remapping the addresses accessed by the hardware according to the same (or a compatible) translation table that is used to map guest-physical to host-physical addresses.

kvm: mmu: Rework the x86 TDP direct mapped case

Over the years, the needs for KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended upon shadow paging to run all guests, we now have
the use of two dimensional paging (TDP). This RFC proposes and
demonstrates two major changes to the MMU. First, an iterator abstraction
that simplifies traversal of TDP paging structures when running an L1
guest. This abstraction takes advantage of the relative simplicity of TDP
to simplify the implementation of MMU functions. Second, this RFC changes
the synchronization model to enable more parallelism than the monolithic
MMU lock. This "direct mode" MMU is currently in use at Google and has
given us the performance necessary to live migrate our 416 vCPU, 12TiB
m2-ultramem-416 VMs.

Over the years, the needs for KVM's x86 MMU have grown from running small guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where we previously depended on shadow paging to run all guests, we now use two dimensional paging (TDP). This RFC proposes two major changes to the MMU. First, an iterator abstraction that simplifies traversal of the TDP paging structures when running an L1 guest; it takes advantage of the relative simplicity of TDP to simplify the implementation of the MMU functions. Second, the RFC changes the synchronization model to enable more parallelism than the monolithic MMU lock. This "direct mode" MMU is already in use at Google and has provided the performance needed to live migrate their 416-vCPU, 12 TiB m2-ultramem-416 VMs.

The primary motivation for this work was to handle page faults in
parallel. When VMs have hundreds of vCPUs and terabytes of memory, KVM's
MMU lock suffers from extreme contention, resulting in soft-lockups and
jitter in the guest. To demonstrate this I also written, and will submit
a demand paging test to KVM selftests. The test creates N vCPUs, which
each touch disjoint regions of memory. Page faults are picked up by N
user fault FD handlers, one for each vCPU. Over a 1 second profile of
the demand paging test, with 416 vCPUs and 4G per vCPU, 98% of the
execution time was spent waiting for the MMU lock! With this patch
series the total execution time for the test was reduced by 89% and the
execution was dominated by get_user_pages and the user fault FD ioctl.
As a secondary benefit, the iterator-based implementation does not use
the rmap or struct kvm_mmu_pages, saving ~0.2% of guest memory in KVM
overheads.

The primary motivation for this work is handling page faults in parallel. When a VM has hundreds of vCPUs and terabytes of memory, contention on KVM's MMU lock becomes extreme, causing soft lockups and jitter in the guest. To demonstrate the result, the author also wrote a demand paging test for the KVM selftests. The test creates N vCPUs, each touching disjoint regions of memory; page faults are picked up by N userfaultfd handlers, one per vCPU. Over a 1-second profile of the test, with 416 vCPUs and 4 GB per vCPU, 98% of the execution time was spent waiting for the MMU lock. With this patch series the total execution time was reduced by 89%, and execution was dominated by get_user_pages and the userfaultfd ioctl. As a secondary benefit, the iterator-based implementation does not use the rmap or struct kvm_mmu_page, saving about 0.2% of guest memory in KVM overhead.

The goal of this  RFC is to demonstrate and gather feedback on the
iterator pattern, the memory savings it enables for the "direct case"
and the changes to the synchronization model. Though they are interwoven
in this series, I will separate the iterator from the synchronization
changes in a future series. I recognize that some feature work will be
needed to make this patch set ready for merging. That work is detailed
at the end of this cover letter.

This RFC is meant to demonstrate and gather feedback on the iterator pattern, the memory savings it enables for the "direct case", and the changes to the synchronization model. Although they are interwoven in this series, the iterator will be separated from the synchronization changes in a future series. Some further work is still needed before this patch set is ready for merging; that work is detailed at the end of the cover letter.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and
programs the EPT/NPT paging structures with the GPA -> HPA mapping. The
guest has no way to change this mapping and only one version of the
paging structure is needed per L1 address space (normal execution or
system management mode, on x86).

The overall purpose of the KVM MMU is to program the paging structures (CR3/EPT/NPT) that encode the mapping from guest addresses to host physical addresses (HPA), and to provide utilities for other KVM features such as dirty logging. The L1 guest physical address (GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA, and the kernel MM/x86 host page tables map HVA to HPA. Without TDP, the MMU must program the x86 page tables to encode the full translation from guest virtual addresses (GVA) to HPA, which requires "shadowing" the guest's page tables to build a composite x86 paging structure. That solution is complicated, requires separate paging structures for each guest CR3, and requires emulating guest page table changes. The TDP case is much simpler: KVM lets the guest control CR3 and programs the EPT/NPT paging structures with the GPA-to-HPA mapping. The guest has no way to change this mapping, and only one version of the paging structure is needed per L1 address space (normal execution or system management mode, on x86).

This RFC implements a "direct MMU" through alternative implementations
of MMU functions for running L1 guests with TDP. The direct MMU gets its
name from the direct role bit in struct kvm_mmu_page in the existing MMU
implementation, which indicates that the PTEs in a page table (and their
children) map a linear range of L1 GPAs. Though the direct MMU does not
currently use struct kvm_mmu_page, all of its pages would implicitly
have that bit set. The direct MMU falls back to the existing shadow
paging implementation when TDP is not available, and interoperates with
the existing shadow paging implementation for nesting.

This RFC implements a "direct MMU" through alternative implementations of the MMU functions for running L1 guests with TDP. The direct MMU takes its name from the direct role bit in struct kvm_mmu_page in the existing MMU implementation, which indicates that the PTEs in a page table (and their children) map a linear range of L1 GPAs. Although the direct MMU does not currently use struct kvm_mmu_page, all of its pages would implicitly have that bit set. The direct MMU falls back to the existing shadow paging implementation when TDP is not available, and interoperates with the existing shadow paging implementation for nesting.

In order to handle page faults in parallel, the MMU needs to allow a
variety of changes to PTEs concurrently. The first step in this series
is to replace the MMU lock with a read/write lock to enable multiple
threads to perform operations at the same time and interoperate with
functions that still need the monolithic lock. With threads handling
page faults in parallel, the functions operating on the page table
need to: a) ensure PTE modifications are atomic, and b) ensure that page
table memory is freed and accessed safely Conveniently, the iterator
pattern introduced in this series handles both concerns.

To handle page faults in parallel, the MMU needs to allow a variety of concurrent changes to PTEs (page table entries). The first step in this series is to replace the MMU lock with a read/write lock, enabling multiple threads to operate at the same time while still interoperating with the functions that need the monolithic lock. With threads handling page faults in parallel, the functions operating on the page table need to:

  1. ensure that PTE modifications are atomic, and
  2. ensure that page table memory is freed and accessed safely.

Conveniently, the iterator pattern introduced in this series handles both concerns.

The direct walk iterator implements a pre-order traversal of the TDP
paging structures. Threads are able to read and write page table memory
safely in this traversal through the use of RCU and page table memory is
freed in RCU callbacks, as part of a three step process. (More on that
below.) To ensure that PTEs are updated atomically, the iterator
provides a function for updating the current pte. If the update
succeeds, the iterator handles bookkeeping based on the current and
previous value of the PTE. If it fails, some other thread will have
succeeded, and the iterator repeats that PTE on the next iteration,
transparently retrying the operation. The iterator also handles yielding
and reacquiring the appropriate MMU lock, and flushing the TLB or
queuing work to be done on the next flush.

The direct walk iterator implements a pre-order traversal of the TDP paging structures. Threads can read and write page table memory safely during this traversal through the use of RCU, and page table memory is freed in RCU callbacks as part of a three-step process (more on that below). To ensure PTEs are updated atomically, the iterator provides a function for updating the current PTE. If the update succeeds, the iterator does its bookkeeping based on the current and previous value of the PTE; if it fails, some other thread must have succeeded, and the iterator revisits that PTE on the next iteration, transparently retrying the operation. The iterator also handles yielding and reacquiring the appropriate MMU lock, and flushing the TLB or queuing work to be done on the next flush.

In order to minimize TLB flushes, we expand the tlbs_dirty count to
track unflushed changes made through the iterator, so that other threads
know that the in-memory page tables they traverse might not be what the
guest is using to access memory. Page table pages that have been
disconnected from the paging structure root are freed in a three step
process. First the pages are filled with special, nonpresent PTEs so
that guest accesses to them, through the paging structure caches result
in TDP page faults. Second, the pages are added to a disconnected list,
a snapshot of which is transferred to a free list, after each TLB flush.
The TLB flush clears the paging structure caches, so the guest will no
longer use the disconnected pages. Lastly, the free list is processed
asynchronously to queue RCU callbacks which free the memory. The RCU
grace period ensures no kernel threads are using the disconnected pages.
This allows the MMU to leave the guest in an inconsistent, but safe,
state with respect to the in-memory paging structure. When functions
need to guarantee that the guest will use the in-memory state after a
traversal, they can either flush the TLBs unconditionally or, if using
the MMU lock in write mode, flush the TLBs under the lock only if the
tlbs_dirty count is elevated.

To minimize TLB flushes, the tlbs_dirty count is expanded to track unflushed changes made through the iterator, so other threads know that the in-memory page tables they traverse might not be what the guest is actually using to access memory. Page table pages that have been disconnected from the paging structure root are freed in a three-step process:

  1. The pages are filled with special, non-present PTEs, so that guest accesses to them through the paging structure caches result in TDP page faults.
  2. The pages are added to a disconnected list; after each TLB flush, a snapshot of that list is transferred to a free list. The TLB flush clears the paging structure caches, so the guest will no longer use the disconnected pages.
  3. The free list is processed asynchronously, queuing RCU callbacks that free the memory.

The RCU grace period ensures that no kernel threads are using the disconnected pages. This allows the MMU to leave the guest in an inconsistent, but safe, state with respect to the in-memory paging structure. When functions need to guarantee that the guest will use the in-memory state after a traversal, they can either flush the TLBs unconditionally or, if holding the MMU lock in write mode, flush the TLBs under the lock only when the tlbs_dirty count is elevated.

The use of the direct MMU can be controlled by a module parameter which
is snapshotted on VM creation and follows the life of the VM. This
snapshot is used in many functions to decide whether or not to use
direct MMU handlers for a given operation. This is a maintenance burden
and in future versions of this series I will address that and remove
some of the code the direct MMU replaces. I am especially interested in
feedback from the community as to how this series can best be merged. I
see two broad approaches: replacement and integration or modularization.

The use of the direct MMU can be controlled by a module parameter that is snapshotted at VM creation and follows the life of the VM. This snapshot is used in many functions to decide whether to use the direct MMU handlers for a given operation. That is a maintenance burden, and future versions of the series will address it and remove some of the code the direct MMU replaces. The author is especially interested in community feedback on how the series can best be merged, and sees two broad approaches: replacement and integration, or modularization.

Replacement and integration would require amending the existing shadow
paging implementation to use a similar iterator pattern. This would mean
expanding the iterator to work with an rmap to support shadow paging and
reconciling the synchronization changes made to the direct case with the
complexities of shadow paging and nesting.

Replacement and integration would require amending the existing shadow paging implementation to use a similar iterator pattern. That means expanding the iterator to work with an rmap to support shadow paging, and reconciling the synchronization changes made for the direct case with the complexities of shadow paging and nesting.

The modularization approach would require factoring out the "direct MMU"
or "TDP MMU" and "shadow MMU(s)." The function pointers in the MMU
struct would need to be expanded to fully encompass the interface of the
MMU and multiple, simpler, implementations of those functions would be
needed. As it is, use of the module parameter snapshot gives us a rough
outline of the previously undocumented shape of the MMU interface, which
could facilitate modularization. Modularization could allow for the
separation of the shadow paging implementations for running guests
without TDP, and running nested guests with TDP, and the breakup of
paging_tmpl.h.

The modularization approach would require factoring out the "direct MMU" (or "TDP MMU") and the "shadow MMU(s)". The function pointers in the MMU struct would need to be expanded to fully cover the MMU interface, with multiple, simpler implementations of those functions. As it is, the module-parameter snapshot already gives a rough outline of the previously undocumented shape of the MMU interface, which could facilitate modularization. Modularization could allow separating the shadow paging implementations for running guests without TDP and for running nested guests with TDP, and breaking up paging_tmpl.h.

In addition to the integration question, below are some of the work
items I plan to address before sending the series out again:

In addition to the integration question, below are the work items the author plans to address before sending the series out again:

Disentangle the iterator pattern from the synchronization changes
Currently the direct_walk_iterator is very closely tied to the use
of atomic operations, RCU, and a rwlock for MMU operations. This
does not need to be the case: instead I would like to see those
synchronization changes built on top of this iterator pattern.

Support 5 level paging and PAE
Currently the direct walk iterator only supports 4 level, 64bit
architectures.

Support MMU memory reclaim
Currently this patch series does not respect memory limits applied
through kvm_vm_ioctl_set_nr_mmu_pages.

Support nonpaging guests
Guests that are not using virtual addresses can be direct mapped,
even without TDP.

Implement fast invalidation of all PTEs
This series was prepared between when the fast invalidate_all
mechanism was removed and when it was re-added. Currently, there
is no fast path for invalidating all direct MMU PTEs.

Move more operations to execute concurrently
In this patch series, only page faults are able to execute
concurrently, however several other functions can also execute
concurrently, simply by changing the write lock acquisition to a
read lock.

Disentangle the iterator pattern from the synchronization changes

Currently the direct_walk_iterator is tightly coupled to the atomic operations, RCU, and the read/write lock for MMU operations. It does not have to be: those synchronization changes should instead be built on top of the iterator pattern.

Support 5-level paging and PAE

Currently the direct walk iterator only supports 4-level, 64-bit architectures.

Support MMU memory reclaim

Currently this patch series does not respect the memory limits applied through kvm_vm_ioctl_set_nr_mmu_pages.

Support non-paging guests

Guests that are not using virtual addresses can be direct mapped, even without TDP.

Implement fast invalidation of all PTEs

Move more operations to execute concurrently

refer to:
https://www.spinics.net/lists/kvm/msg196464.html

virDomainDefParseXML -> virDomainDiskDefParseXML -> virDomainDeviceInfoParseXML -> virDomainDeviceAddressParseXML -> virPCIDeviceAddressParseXML

if (devaddr) {
    if (virDomainParseLegacyDeviceAddress(devaddr,
                                          &def->info.addr.pci) < 0) {
        virReportError(VIR_ERR_INTERNAL_ERROR,
                       _("Unable to parse devaddr parameter '%s'"),
                       devaddr);
        goto error;
    }
    def->info.type = VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI;
} else {

qemuDomainSaveInternal ->

if (!(cookie = qemuDomainSaveCookieNew(vm)))
    goto endjob;

if (!(data = virQEMUSaveDataNew(xml, cookie, was_running, compressed,
                                driver->xmlopt)))
    goto endjob;
xml = NULL;

ret = qemuDomainSaveMemory(driver, vm, path, data, compressedpath,
                           flags, QEMU_ASYNC_JOB_SAVE);
if (ret < 0)
    goto endjob;

/* Shut it down */
qemuProcessStop(driver, vm, VIR_DOMAIN_SHUTOFF_SAVED,
                QEMU_ASYNC_JOB_SAVE, 0);
        rc = qemuMonitorMigrateToFd(priv->mon,
                                    QEMU_MONITOR_MIGRATE_BACKGROUND,
                                    fd);

int
qemuMonitorMigrateToFd(qemuMonitorPtr mon,
                       unsigned int flags,
                       int fd)
{
    int ret;
    VIR_DEBUG("fd=%d flags=0x%x", fd, flags);

    QEMU_CHECK_MONITOR(mon);

    if (qemuMonitorSendFileHandle(mon, "migrate", fd) < 0)
        return -1;

    ret = qemuMonitorJSONMigrate(mon, flags, "fd:migrate");

    if (ret < 0) {
        if (qemuMonitorCloseFileHandle(mon, "migrate") < 0)
            VIR_WARN("failed to close migration handle");
    }

    return ret;
}

QEMU should support receiving an FD (QEMU should be using a UNIX socket monitor).

The QEMU monitor gets the fd via SCM_RIGHTS:

SCM_RIGHTS: Send or receive a set of open file descriptors from another process. The data portion contains an integer array of the file descriptors.

/* Perform the migration */
if (qemuMigrationSrcToFile(driver, vm, fd, compressedpath, asyncJob) < 0)
    goto cleanup;

qemuMonitorJSONMigrate

A Simplified TDP with Large Tables

Abstract. Among the performance bottlenecks for the virtual machine, memory comes next to the I/O as the second major source of overhead to be addressed. While the SPT and TDP have proved to be quite effective and mature solutions in memory virtualization, it is not yet guaranteed that they perform equally well for arbitrary kind of workloads, especially considering that the performance of HPC workloads is more sensitive to the virtual than to the native execution environment. We propose that based on the current TDP design, modification could be made to reduce the 2D page table walk with the help of large page table. By doing this, not only the guest and host context switching due to guest page fault could be avoided, but also the second dimension of paging could be potentially simplified, which will lead to better performance.

Among the performance bottlenecks of virtual machines, memory overhead comes right after I/O as the second major source to be addressed. Although SPT and TDP have proved to be effective and mature solutions for memory virtualization, there is no guarantee that they perform equally well for arbitrary workloads compared with native execution, especially HPC workloads, which are particularly performance sensitive. The paper proposes that, based on the current TDP design, modifications can be made to reduce the 2D page table walk with the help of a large page table. This not only avoids the guest/host context switches caused by guest page faults, but can also simplify the second dimension of paging (i.e. TDP itself), leading to better performance. (My note: I think the point is that after a context change you hit page faults; does the large page table increase the page size so there are fewer page table entries, meaning the table entries do not change even across a context change and page faults are reduced? Needs more reading.)

In the context of system virtualization, SPT (shadow page table) and TDP (two-dimensional paging) are the two mature solutions for memory virtualization in the current hypervisors. Both of them perform address translation transparently from the guest to the host. In dealing with the translation chain from GVA (guest virtual address) to HPA (host physical address), the SPT combines the three intermediate steps for each GVA→HPA into a single entry, which contains the wanted address and saves further efforts to walk through both of the guest and host page tables as long as the cached entry is not invalidated in any form. However, as SPT is a part of the hypervisor and must be kept as consistent as possible with the guest page table, the processor had to exit from the guest to host mode to update the SPT and make it accessible, during which a considerable number of CPU cycles could have been wasted. TDP comes as a remedy by keeping GVA→GPA translation in the guest, while shifting GPA→HPA translation from the hypervisor to the processor. The expensive vmexit and vmentry due to guest page fault are avoided by TDP. Unfortunately in case of TLB-miss (translation look-aside block) the multi-level page table must still be walked through to fetch the missing data from the memory. Because of this nature, the TLB is not quite helpful in preventing page table from being walked when running workloads with poor temporal locality or cache access behavior. As a result, the performance gain could be more or less offset by the overhead. We attempt to combine the merits of the two methods and meanwhile avoid the downsides of them by adapting the mmu code in the hypervisor. (Footnote: TDP is known as the AMD NPT and Intel EPT. For technical neutrality reasons, TDP is used to refer to the paging mechanism with hardware assistance.)

In system virtualization, SPT (shadow page table) and TDP (two-dimensional paging) are the two mature memory-virtualization solutions in current hypervisors (VMMs). Both perform guest-to-host address translation transparently, handling the translation chain from GVA (guest virtual address) to HPA (host physical address). SPT collapses the three intermediate steps (roughly: guest virtual address -> guest physical address -> host virtual address -> host physical address, each step normally having its own entry) for each GVA->HPA into a single cached entry containing the wanted address, saving the walks of both guest and host page tables as long as the cached entry is not invalidated. However, since the SPT is part of the hypervisor and must be kept consistent with the guest page table, the processor has to exit from guest to host mode to update the SPT (the kernel provides this; if a process switch happens on the host, the mapping may need to be updated), so a lot of CPU time is wasted on these switches. TDP came as the remedy: it keeps the GVA->GPA translation in the guest while shifting the GPA->HPA translation from the hypervisor (the KVM kernel) to the processor, so the expensive vmexit/vmentry on guest page faults is avoided (with SPT a guest page fault means exiting guest mode; if the processor can do the host-side translation, no exit is needed). Unfortunately, on a TLB miss (the TLB caches the active page mappings; if the entry is present the address comes straight from the cache, otherwise the normal multi-level page table walk is needed) the multi-level page table must still be walked to fetch the missing data from memory. Because of this, the TLB does not help much for workloads with poor temporal locality or cache-access behavior (temporal locality means a page is likely to be accessed again soon; randomly touching many different pages has none). Note: TDP is what AMD calls NPT and Intel calls EPT; for technical neutrality the paper uses TDP for the hardware-assisted paging mechanism. The performance gain can be more or less offset by the overhead, so the paper attempts to combine the merits of both methods while avoiding their downsides by adapting the MMU code in the hypervisor (KVM).

Comment:

This sums it up very accurately: the advantages, the disadvantages, and what they intend to do.

For TLB contains a number of the most recently accessed GPA→HPA mappings, the more likely these entries will be needed in future, the more time could be saved from the page table walk in subsequent operations. To the nature of the paging methods themselves, the efficiency of both SPT and TDP rely on to what extent the cached results of the previous page table walks could be reused. Since the SPT cannot be maintained without interrupting the guest execution and exit to the host kernel mode, there will be little chance other than reducing the occurrence of the page fault in the guest to improve the performance of SPT. This, however, largely depends on the memory access behavior of the individual workload and remains beyond the control of the hypervisor. TDP, on the hand, bears the hope for performance improvement. Currently the TDP is adopting the same paging mode as the host does, known as “multi-level page table walk-through”. It is an N-ary tree structure [1], where N could be 1024 in 32-bit mode, or 512 in 64-bit or 32-bit PAE modes. In the 32-bit mode only 2 level page table are involved for walking through, which poses minor overhead. However, the overhead grows quickly non-negligible as the paging level increases. In spite of the various paging modes adopted in the guest, only two modes - the 64-bit and the 32-bit PAE modes are available for the TDP. In the worst case if all TLB large missed, up to 24 memory accesses are possible for a single address translation.

The TLB holds a number of the most recently accessed GPA->HPA mappings; the more of these entries will be needed again in the future (the TLB is a cache, so the usual caching assumptions apply, as do the usual cache-unfriendly scenarios such as exceeding its capacity), the more time is saved on page table walks in subsequent operations. By their nature, the efficiency of both SPT and TDP depends on how far the cached results of previous page table walks can be reused. Since the SPT cannot be maintained without interrupting guest execution and exiting to host kernel mode, there is little room to improve SPT performance other than reducing guest page faults, which largely depends on the memory-access behavior of the individual workload and is beyond the hypervisor's control (e.g. with no context switches and only hot data being touched, it is fine anyway). TDP, on the other hand, carries the hope for improvement. Currently TDP adopts the same paging mode as the host, the "multi-level page table walk-through": an N-ary tree where N can be 1024 in 32-bit mode, or 512 in 64-bit and 32-bit PAE modes. In 32-bit mode only a 2-level page table is walked, which costs little, but the overhead quickly becomes non-negligible as the paging level increases. Regardless of the paging modes used inside the guest, only two modes, 64-bit and 32-bit PAE, are available for TDP. In the worst case, with all TLB lookups missing, up to 24 memory accesses may be needed for a single address translation. (TODO)

Comment:

This part lays out the problems in the current walk process.
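As a side note on the "up to 24 memory accesses" figure: this is the standard cost model for a two-dimensional walk, assuming 4-level paging on both the guest and the nested side. Each of the guest's four page-table references plus the final guest-physical access must itself be translated through the nested tables, giving

(g + 1)(h + 1) - 1 = (4 + 1)(4 + 1) - 1 = 24

memory references in the worst case, which is where the number quoted above comes from.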

Though undesirable, this has presumably been done for two reasons: 1. compatibility between the host and guest paging modes, and more importantly, 2. efficiency in memory utilization. However, as the TDP table is in the hypervisor and invisible to the guest, it is actually up to the hypervisor to adopt the paging mode without having to maintain this kind of compatibility [2]. For the tree structure forms a hierarchy of the 1-to-many mappings, mappings could be built in a “lazy” way only on demand, significant amount of memory space for entries could be saved compared with the 1-to-1 mapping based structure, say, an array. On the other hand, this structure also means more memory access and time cost while looking up an element within it. As performance rather than memory saving comes as the top concern, paging methods more efficient than the current one may exist. One candidate is naturally a 1-to-1 mapping based structure, such as an array, or a hash list. With more memory being invested to save all the possible GPPFN (guest physical page frame number) to HPPFN (host physical page frame number) mappings, fewer memory accesses suffice2 in the second dimension of the TDP. By doing this, the whole translation from GVA to HPA is expected to be effectively simplified and accelerated due to a reduced paging structure in TDP table. In addition, a simplified TDP method combines the merits of both TDP and SPT - to avoid the vmexit as well as to maintain the relatively short mapping chains from GVA to HPA. Until significant change is made available to the paging mode of the current processor, it could be a better practice merely by modifying the current hypervisor software.

Though undesirable, this has presumably been done for two reasons: 1. compatibility between host and guest paging modes (the important one), and 2. efficiency of memory utilization. However, since the TDP table lives in the hypervisor and is invisible to the guest, it is actually up to the hypervisor which paging mode to adopt, without having to maintain that compatibility. A tree structure forms a hierarchy of 1-to-many mappings; mappings can be built lazily on demand, saving a significant amount of entry memory compared with a 1-to-1 structure such as an array. On the other hand, this structure also means more memory accesses and more time to look up an element. Since performance rather than memory saving is the top concern, paging methods more efficient than the current one may exist; a natural candidate is a 1-to-1 structure such as an array or a hash list. By investing more memory to store all possible GPPFN (guest physical page frame number) to HPPFN (host physical page frame number) mappings, fewer memory accesses are needed in the second dimension of TDP. The whole GVA-to-HPA translation is then effectively simplified and accelerated thanks to the reduced paging structure in the TDP table. In addition, a simplified TDP combines the merits of TDP and SPT: avoiding vmexit while keeping the mapping chain from GVA to HPA relatively short. Until a significant change is made to the paging mode of current processors, merely modifying the current hypervisor software is the better practice.

Comment:

So in the end they still modify the original code; there is no way to improve the structure itself, so the gains are limited.

Major work focusing on improving the memory virtualization could be summarized as the following. To work around the unfavorable sides and combine the best qualities of SPT and TDP, one attempt is to enable the hypervisor to reconfigure its paging method at run-time as a response to the ever changing behavior of the workloads in memory accessing, which were implemented in the past in a few hypervisors such as Xen and Palacios according to [3,4]. Although not all workloads could be benefited from this, overall performance gain have been observed for the selected benchmarks. The downside, however, is that it adds further complexity to the hypervisor with the methods of performance metric sampling, paging method decision making, as well as the dynamic switching logic. Furthermore, such activities in the kernel could also do harm to the performance.

The major work on improving memory virtualization can be summarized as follows. To work around the unfavorable sides and combine the best qualities of SPT and TDP, one attempt is to let the hypervisor reconfigure its paging method at run time in response to the changing memory-access behavior of the workloads; this was implemented in the past in hypervisors such as Xen and Palacios. Although not all workloads benefit from it, overall performance gains were observed for the selected benchmarks. The downside is the added complexity in the hypervisor for performance-metric sampling, paging-method decision making, and the dynamic switching logic; such activity in the kernel can itself hurt performance.

Comment:

Suddenly getting conservative here; the earlier parts probably oversold it, since the model itself does not change and only the control plane can be changed.

To reduce the overhead for walking through the multi-level page tables in TDP, a hashed list is applied to provide direct address mapping for GPA [2]. In contrast with the O(n2) complexity of the conventional multi-level forward page tables for both GVA→GPA and GPA→HPA translations, the hashed page table has only one paging level and achieves a complexity of O(n) in theory. The performance is at least not worse due to the reduced page table walk and cache pressure, showed by the benchmark. Since the hash table is a data structure more capable in searching, inserting and deleting etc., and relatively easier to be implemented within the existing framework of the hardware and software, current TDP design could be simplified by applying it for better performance. As more reflections were cast on the current multi-level paging modes, a variety of changes have been prompted for a simplified paging work. Theoretically, a “flat nested page table” could be formed by combining the intermediate page levels from 4 to 1, which results in an 8 memory access for the translation from GVA to HPA, and a reduced overhead for 2D TDP walk [5]. By extending the processor and hypervisor with the “direct segment” function, the memory access for the GVA to HPA translation could even be further reduced to 4 or 0 [6].

To reduce the overhead of walking the multi-level page tables in TDP, a hashed list is applied to provide direct address mapping for GPA (GPA to HPA). Compared with the O(n^2) complexity of the conventional multi-level forward page tables for both the GVA->GPA and GPA->HPA translations, the hashed page table has only one paging level and in theory achieves O(n). The benchmarks show performance is at least no worse, thanks to the reduced page table walks and cache pressure. Since a hash table is a data structure good at searching, inserting and deleting, and relatively easy to implement within the existing hardware and software framework, the current TDP design can be simplified by applying it for better performance. With more reflection on the current multi-level paging modes, various changes have been proposed to simplify paging. Theoretically a "flat nested page table" can be formed by merging the intermediate page levels from 4 down to 1, which results in 8 memory accesses for a GVA-to-HPA translation and reduced overhead for the 2D TDP walk. By extending the processor and hypervisor with a "direct segment" function, the memory accesses for a GVA-to-HPA translation can be further reduced to 4 or even 0.

Comment:

The original multi-level page table is replaced with a single-level hash table.

For example, a page table of 1024 entries can, with multiple levels, cover 1024^n pages: lookups take longer, but far more addresses can be held, and without a cache the lookup time grows steeply.

Switching to a hash table makes lookups faster, but the number of entries is then limited by the page table size, so the proposed improvement really wants a very large hash table.

For the TLB plays a critical role in reducing the address translation overhead [7] and justifies the use of TDP, it becomes another concern besides the paging level. Specific to the AMD processor, a way is suggested in [8] to accelerate the TDP walk for guest by extending the existing page walk cache also to include the nested dimension of the 2D page walk, caching the nested page table translations, as well as skipping multiple page entry references. This technique has already gained its application in some AMD processors. Not limited to virtual cases, attention is paid in [9] to compare the effectiveness of five MMU cache organizations, which shows that two of the newly introduced structures - the variants of the translation outperform the existing structures in many situations.

The TLB plays a critical role in reducing address-translation overhead and justifies the use of TDP, so it becomes another concern besides the paging level. Specific to AMD processors, a way has been suggested to accelerate the TDP walk for the guest by extending the existing page walk cache to also cover the nested dimension of the 2D page walk, caching the nested page table translations and skipping multiple page entry references. This technique has already been applied in some AMD processors. Not limited to virtualization, a comparison of the effectiveness of five MMU cache organizations shows that two newly introduced structures, the translation variants, outperform the existing structures in many situations.

Comment:

If the cache were unlimited and fast enough, you would not need paging at all; just put everything in the cache.

As a potential technical breakthrough, TDP is different from SPT in many aspects. However, for compatibility reason, the main structure of SPT is still reused by TDP. This, though at first may seem quite misleading, enables the TDP to fit seamlessly into the current framework previously created for SPT. As far as TDP feature is available in the hardware, it is preferred to SPT for general better performance. While in the absence of TDP hardware feature, SPT may serve as a fall-back way and the only choice for the hypervisor to perform the guest-to-host address translation.

As a potential technical breakthrough, TDP differs from SPT in many aspects. For compatibility reasons, however, the main structure of SPT is still reused by TDP. Although this may look quite misleading at first, it lets TDP fit seamlessly into the framework previously created for SPT. Wherever the TDP feature is available in hardware, it is preferred over SPT for generally better performance; in the absence of the TDP hardware feature, SPT serves as the fallback and the only way for the hypervisor to perform guest-to-host address translation.

Comment:

They do not seem to have found a great solution either; the tone starts to turn into a sigh.

In KVM, SPT and TDP share the same data structure of the virtual MMU and page tables (surprisingly, both are named as shadow page table). The shadow page table is organized as shown by Fig. 1, of which kvm mmu page is the basic unit gluing all information about the shadow pages together. For each level of the shadow page table, a pageful of 64-bit sptes containing the translations for this page are pointed to by *spt, whose role regarding the paging mode, dirty and access bits, level etc. are defined by the corresponding bits in role. The page pointed to by spt will have its page->private pointing back at the shadow page structure. The sptes in spt point either at guest pages, or at lower-level shadow pages [10]. As the sptes contained in a shadow page may be either one level of the PML4, PDP, PD and PT, the pte parents provides the reverse mapping for the pte/ptes pointing at the current page’s spt. The bit 0 of parent ptes is used to differentiate this number from one to many. If bit 0 is zero, only one spte points at this pages and parent ptes points at this single spte, otherwise, multiple sptes are pointing at this page and the parent ptes & 0x1 points at a data structure with a list of parent ptes. spt array forms a directed acyclic graph structure, with the shadow page as a node, and guest pages as leaves [10].

In KVM, SPT and TDP share the same virtual MMU and page table data structures (surprisingly, both are called shadow page tables). The shadow page table is organized as shown in Fig. 1, where kvm_mmu_page is the basic unit gluing together all the information about the shadow pages (this reminds me of Python being called a glue language). For each level of the shadow page table, *spt points to a page full of 64-bit sptes containing the translations for this page; its paging mode, dirty and access bits, level and so on are defined by the corresponding bits in role. The page pointed to by spt has its page->private pointing back at the shadow page structure. The sptes in spt point either at guest pages or at lower-level shadow pages. Since the sptes contained in a shadow page may belong to any one level of PML4, PDP, PD and PT, parent_ptes provides the reverse mapping for the pte/ptes pointing at the current page's spt. Bit 0 of parent_ptes differentiates one from many: if bit 0 is zero, only one spte points at this page and parent_ptes points at that single spte; otherwise multiple sptes point at this page, and parent_ptes & 0x1 points at a data structure holding a list of parent ptes. The spt array forms a directed acyclic graph, with shadow pages as nodes and guest pages as leaves.

Comment:

I did not fully understand this part; will come back to it.

KVM MMU also maintains the minimal pieces of information to mark the current state and keep the sptes up to date. unsync indicates if the translations in the current page are still consistent with the guest’s translation. Inconsistence arises when the translation has been modified before the TLB is flushed, which has been read by the guest. unsync children counts the sptes in the page pointing at pages that are unsync or have unsynchronized children. unsync child bitmap is a bitmap indicating which sptes in spt point (directly or indirectly) at pages that may be unsynchronized. For more detailed description, the related Linux kernel documentation [10] is available for reference.

The KVM MMU also maintains a minimal set of information to mark the current state and keep the sptes up to date. A page is marked unsync when its translations are no longer consistent with the guest's translation, which happens when the translation is modified before the TLB that the guest has already read is flushed (the correct mapping is maintained through the shadow page; the guest's TLB is only a cache, so if it is not refreshed in time the shadow page and the TLB diverge).

unsync_children counts the sptes in the page pointing at pages that are unsync or have unsynchronized children.

unsync_child_bitmap is a bitmap indicating which sptes in spt point (directly or indirectly) at pages that may be unsynchronized.

See the related Linux kernel documentation for a more detailed description.

Comment:

I am not familiar enough with the OS code; time to go back to basics.

Multiple kvm mmu page instances are linked by an hlist node structure headed by hlist head, which form the elements in the hash list - mmu page hash pointed to by kvm->arch. Meanwhile it’s also linked to either the lists active mmu pages or zapped obsolete pages in the kvm->arch, depending on the current state of the entries contained by this page. Both SPT and TDP keep their “shadow page table” entries and other related information in the same structure. The major difference lies in the hypervisor configuration of the runtime behaviors upon paging-fault-related events in the guest. While the SPT relies on the mechanism of “guest page write-protecting” and “host kernel mode trapping” upon guest page fault for keeping the SPT synchronized with the guest page table, the TDP achieves the same result by a hardware mechanism. As VMCB (virtual machine control block) by AMD or VMCS (virtual machine control structure) by Intel is the basic hardware facility the TDP makes use of, it’s the key thing making difference. Code snippet in Fig. 2 shows the configuration of VMCB for TDP, and that the root address of the TDP page table is kept in the VMCB structure. Meanwhile the guest is configured as exitless for paging-fault exception, which means that the page fault events is handled by the processor. With this configuration, guest is left running undisturbed when the guest page fault occurs.

Multiple kvm_mmu_page instances are linked through an hlist_node structure headed by an hlist_head, forming the elements of the hash list mmu_page_hash pointed to by kvm->arch. Each instance is also linked to either the active_mmu_pages list or the zapped_obsolete_pages list in kvm->arch, depending on the state of the entries it contains. SPT and TDP keep their "shadow page table" entries and related information in the same structure; the major difference lies in how the hypervisor configures the runtime behavior for paging-fault-related events in the guest. While SPT relies on "guest page write-protecting" and "host kernel mode trapping" on guest page faults to keep the SPT synchronized with the guest page table, TDP achieves the same result with a hardware mechanism. The VMCB (virtual machine control block, AMD) or VMCS (virtual machine control structure, Intel) is the basic hardware facility TDP makes use of, and it is the key thing making the difference.

The code snippet in Fig. 2 shows the VMCB configuration for TDP; the root address of the TDP page table is kept in the VMCB structure. Meanwhile, the guest is configured not to exit on page-fault exceptions (i.e. it does not go through the shadow page table logic), meaning page faults are handled by the processor. With this configuration, the guest keeps running undisturbed when a guest page fault occurs (no exit from guest mode).

Besides, as SPT maps GVA to HPA, the spt entries are created and maintained in a per-process way, which leads to poor reusability hence higher memory consumption. These are obvious downsides especially when multiple processes are running in parallel. In contrast, the TDP maintains only the mappings from GPA to HPA, which effectively eliminated such problems associated with SPT. Guest page table is also accessed by the physical processor and forms the first dimension of the entire page table walk. In this way the TDP can not only eliminate the cost for frequent switching between the host and guest modes due to SPT synchronization, but also simplify the mappings and maintenance efforts the “shadow page tables” needs.

Besides, since SPT maps GVA to HPA, the spt entries are created and maintained per process, which leads to poor reusability and hence higher memory consumption (with process switches, most entries get swapped out and in again). These downsides are obvious especially when multiple processes run in parallel. In contrast, TDP maintains only the GPA-to-HPA mappings, which effectively eliminates such problems. The guest page table is also accessed by the physical processor and forms the first dimension of the entire page table walk. In this way TDP not only eliminates the cost of frequent host/guest mode switches caused by SPT synchronization, but also simplifies the mappings and the maintenance effort the "shadow page tables" need.

Comment:

This restates what TDP does; it feels a bit repetitive.

Two stages are involved in the buildup of the TDP table, namely, the creation of the virtual mmu, and the filling of TDP page tables upon guest page fault during the execution of the guest. As Fig. 3 depicts, in the context of the function kvm vm ioctl, the virtual mmu is created for the first time along with the guest VCPU. It is also when the VMCB is configured. One thing to be noticed is that, as the root address of the TDP page table, the root hpa of the kvm mmu is left without to be allocated a page table, which is deferred to the second stage.

Building the TDP table involves two stages: creating the virtual MMU, and filling the TDP page tables on guest page faults while the guest runs. As Fig. 3 depicts, in the context of the function kvm_vm_ioctl, the virtual MMU is created for the first time along with the guest VCPU; this is also when the VMCB is configured. One thing to notice is that the root_hpa of the kvm_mmu, the root address of the TDP page table, is left without a page table allocated; that is deferred to the second stage.

Figure 4 depicts the context function vcpu enter guest, in which operations related to the second stage take place. This function serves as an interface for the inner loop3 in the internal architecture of the QEMU-KVM, dealing with host-guest mode switching. Before the guest mode is entered by the processor, much preparation work needs to be done in this context, including the checking and handling of many events, exceptions, requests as well as mmu reloading or I/O emulation. The only thing needed for mmu reloading is to allocate a page for the TDP table and make the starting address of it known to the root hpa of the kvm mmu and the CR3 of the VCPU, which is performed by kvm mmu load.

Fig. 4 depicts the context of the function vcpu_enter_guest, where the second-stage operations take place. This function serves as the interface of the inner loop in the QEMU-KVM architecture, dealing with host/guest mode switching. Before the processor enters guest mode, a lot of preparation happens in this context, including checking and handling many events, exceptions and requests, as well as MMU reloading and I/O emulation. The only thing needed for MMU reloading is to allocate a page for the TDP table and make its starting address known to the root_hpa of the kvm_mmu and the CR3 of the VCPU, which is done by kvm_mmu_load.

Guest begins to execute until it can’t proceed any further due to some faulty conditions. More often than not, control flow had to be returned to the hypervisor or the host OS kernel to handle the events the guest encountered. Obviously too much vmexit are an interference and grave source of overhead for the guest. With TDP, however, guest is free from vmexit upon guest paging faults. As the guest enters for the first time into execution, the paging mode is enabled and the guest page tables are initialized, however, the TDP tables are still empty. Any fresh access to a page by the guest will first trigger a guest page fault. After the fault is fixed by the guest, another page fault in the second dimension of the TDP is triggered due to the missing entry in TDP table.

The guest executes until it cannot proceed because of some faulting condition. More often than not, control has to return to the hypervisor or the host OS kernel to handle the event the guest encountered; obviously, too many vmexits are an interference and a serious source of overhead for the guest. With TDP, however, the guest is free from vmexits on guest paging faults. When the guest enters execution for the first time, paging mode is enabled and the guest page tables are initialized, but the TDP tables are still empty. Any fresh access to a page by the guest first triggers a guest page fault; after the guest fixes it, another page fault is triggered in the second dimension of TDP because the entry is missing from the TDP table.

tdp page fault is the page fault handler in this case. As illustrated by Fig. 5, first the host page frame number - pfn is calculated for the faulting address through a chain of functions in try async pf. The pfn is then mapped one level after another into the corresponding positions of the TDP tables by the function direct map. In a predefined format, the entry for a faulting address is split into pieces of PML4E, PDPE, PDE, PTE as well as offset in a page. During the loop, iterator - an instance of the structure kvm shadow walk iterator is used to retrieve the latest physical, virtual addresses and position in the TDP tables for a given address, of which iterator.level determines the number of times for the mapping process.

tdp_page_fault is the page fault handler in this case. As Fig. 5 illustrates, first the host page frame number (pfn) for the faulting address is computed through a chain of functions in try_async_pf. The pfn is then mapped level by level into the corresponding positions of the TDP tables by direct_map. In a predefined format, the entry for a faulting address is split into PML4E, PDPE, PDE and PTE pieces plus the offset within a page. During the loop, iterator, an instance of struct kvm_shadow_walk_iterator, is used to retrieve the latest physical and virtual addresses and the position in the TDP tables for a given address, and iterator.level determines how many times the mapping process runs.

Although the conventional TDP shown in Fig. 6 is mature and the default configuration for better performance, for a certain kind of workloads the limitation is still obvious. They may suffer large overhead due to walking into the second dimension of multi-level page table upon heavy TLB-miss. It is ideal to have a “flat” TDP table by which the wanted pfn can be obtained with a single lookup. Unfortunately, there has long been a problem to allocate large chunk of physically continuous memory in the kernel space. Three functions, namely vmalloc(), kmalloc() and get free pages are used to allocate memory in the current Linux Kernel. The first allocates memory continuous only in virtual address, which is easier to perform but not desired dealing with performance. The second and the third allocate memory chunk continuous in both virtual and physical addresses, however, the maximum memory size allocated is quite limited, thus tends to fall short of the expectation for this purpose. In addition, kmalloc() is very likely to fail allocating large amount of memory, especially in low-memory situations [11]. The amount of memory get free pages can allocate is also limited within 2MAX ORDER−1, where MAX ORDER in the current Linux Kernel for x86 is 11, which means that each time at most 4MB memory can be obtained in the hypervisor. In this condition what we could do is to make the TDP table as “flat” as possible, and to reduce the number of paging with it. Here “flat” means large and physically continuous memory chunk for TDP table. Instead of having thousands of TDP tables managed by their own kvm mmu page instances, we want to merge as many TDP tables as possible into a larger table managed by fewer kvm mmu page instances.

Although the conventional TDP shown in Fig. 6 is mature and the default configuration for better performance, for certain workloads its limitation is still obvious: under heavy TLB misses, many multi-level page table walks cause large overhead. A “flat” TDP table that returns the pfn with a single lookup would be ideal. Unfortunately, there is no way to allocate a large, physically contiguous chunk of memory in kernel space to solve this. Three functions, vmalloc(), kmalloc() and __get_free_pages(), allocate memory in the current Linux kernel. The first allocates memory contiguous only in virtual addresses, which is simple but does not give the desired performance. The second and third allocate memory contiguous in both virtual and physical addresses, but the maximum size is quite limited and still falls short of expectations; kmalloc() in particular easily fails for large allocations, especially in low-memory situations. __get_free_pages() is likewise limited to 2^(MAX_ORDER−1) pages, and MAX_ORDER in the current x86 kernel is 11, which means at most 4MB can be obtained at a time in the hypervisor. Under this constraint, what we can do is make the TDP table as large and “flat” as possible to reduce the number of paging levels, where “flat” means a large, physically contiguous memory chunk for the TDP table. Instead of thousands of TDP tables each managed by its own kvm_mmu_page instance, we want to merge as many TDP tables as possible and manage them with fewer kvm_mmu_page instances.

There could be various ways to implement this, depending on how the indices of a page table entry are split. Two things are to be noted: 1. leave the indices for paging as long as possible, and 2. reuse the current KVM source code as much as we can. Consequently, we come up with a quite straightforward design by merging the bits of the currently used indices within a guest page table entry. As Figs. 7 and 8 depict, the former PML4 and PDP indices (the higher 18 bits) could be combined into a single index to the root of a TDP table segment, and similarly PD and PT (the lower 18 bits) into the index for a physical page. By filling the TDP table entries in linearly ascending order of the GPPFN, the HPPFN could be obtained conveniently by a single lookup into the table. As a result, for the currently used maximal address space of a 64-bit (48 bits effectively in use) guest, we may have 2^18 = 256K segments for the TDP tables, with the index within each segment ranging from 0 to 2^18 − 1 over the host physical pages. The TDP table size is enlarged by 2^9 times, while the number of table segments could be reduced to 1/2^9 of the former.

There are various ways to implement this, depending on how the indices of a page table entry are split.

Two points are worth noting:

  1. keep the indices used for paging as long as possible
  2. reuse the current KVM code as much as possible

We therefore chose a very straightforward design: merging the bits of the currently used indices within a guest page table entry.

As Figs. 7 and 8 show, the former PML4 and PDP indices (the higher 18 bits) can be combined into a single index to the root of a TDP table segment, and similarly PD and PT (the lower 18 bits) become the index of a physical page. By filling the TDP table entries in linearly ascending GPPFN order, the HPPFN can be obtained with a single lookup into the table. As a result, for the maximal address space of a 64-bit guest (48 bits effectively in use), we get roughly 2^18 = 256K segments for the TDP tables, with the index within each segment ranging from 0 to 2^18 − 1 over the host physical pages. The TDP table size is enlarged by 2^9 times, while the number of table segments is reduced to 1/2^9 of the former.
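To make the index merge concrete, here is a minimal Java sketch of the address split described above. The constants (48-bit guest-physical addresses, 4KB pages, two 18-bit indices) follow the paper’s description, while the class and variable names are my own and purely illustrative.

public class FlatTdpIndexDemo {
    // Illustrative only: split a 48-bit guest-physical address the way the
    // proposed "flat" TDP layout does. Bits 0-11 are the page offset,
    // bits 12-29 (former PT + PD indices) select an entry inside a segment,
    // and bits 30-47 (former PDP + PML4 indices) select one of the 2^18 segments.
    public static void main(String[] args) {
        long gpa = 0x00007f1234567890L;

        long pageOffset   = gpa & 0xFFFL;                    // 12 bits
        long entryIndex   = (gpa >>> 12) & ((1L << 18) - 1); // 18 bits
        long segmentIndex = (gpa >>> 30) & ((1L << 18) - 1); // 18 bits

        System.out.printf("segment=%d entry=%d offset=%d%n",
                segmentIndex, entryIndex, pageOffset);
    }
}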

This is actually a fundamental change to the current mmu implementation. Several data structures and functions oriented to operations on 4KB ∗ 2^9 = 2MB TDP page tables must be adapted to the 4KB ∗ 2^18 = 1GB type. For example, as depicted by Fig. 9, the data structure of kvm_mmu_page could be modified as follows to reflect the change: 1. since in a “flat” table there are only two levels and a single root table as parent, the parent-children relation is quite obvious; besides, all the first-level pages have a common parent but no children at all, so members such as unsync_children, parent_ptes and unsync_child_bitmap are not necessary; 2. members such as gfn, role, unsync etc. are multiplied by 512 to hold the information previously owned by an individual 4KB ∗ 2^9 = 2MB page table; 3. spt points to a table segment covering an area of 4KB ∗ 2^18 = 1GB; 4. link is moved to a newly introduced structure - page entity - to identify the 4KB ∗ 2^9 = 2MB pages that are either in the active or the zapped obsolete list. By modifying it this way, the depth of the TDP table hierarchy could be reduced from 4 to 2, while the width is expanded from 2^9 to 2^18.

This is in fact a fundamental change to the current mmu implementation. Operations that were based on the original 4KB ∗ 2^9 = 2MB TDP page tables now need to work on 4KB ∗ 2^18 = 1GB tables. For example, as shown in Fig. 9, the kvm_mmu_page data structure could be modified as follows:

  1. In a “flat” table there are only two levels and a single root table as parent, so the parent-child relation is very clear. Besides, all first-level pages share a common parent and have no children at all, so members such as unsync_children, parent_ptes and unsync_child_bitmap are unnecessary.
  2. Members such as gfn, role and unsync are expanded by 512 times to hold the information previously kept per 4KB ∗ 2^9 = 2MB page table.
  3. spt points to a table segment covering a 4KB ∗ 2^18 = 1GB area.
  4. link is moved to a newly introduced structure to identify whether a 4KB ∗ 2^9 = 2MB page is in the active or the zapped-obsolete list.

With these modifications, the depth of the TDP table hierarchy is reduced from 4 to 2, while the width grows from 2^9 to 2^18.

Since each kvm_mmu_page instance now contains 2^18 table entries, there will be fewer kvm_mmu_page instances in use, which means that 2^18 rather than 2^9 sptes need to be mapped to a single kvm_mmu_page instance. This could be achieved by masking out the lower 30 bits of an address and setting the obtained page descriptor’s private field to this kvm_mmu_page instance, as shown in Fig. 10. Other major affected functions include 1. shadow_walk_init, 2. kvm_mmu_get_page, 3. __direct_map, 4. kvm_mmu_prepare_zap_page, 5. kvm_mmu_commit_zap_page and 6. mmu_alloc_direct_roots.

Since each kvm_mmu_page instance now contains 2^18 entries, fewer kvm_mmu_page instances are needed, which means 2^18 rather than 2^9 sptes map to a single kvm_mmu_page instance. This is achieved by masking out the lower 30 bits of an address and storing the kvm_mmu_page instance in the obtained page descriptor’s private field, as shown in Fig. 10. Other major affected functions include:

  1. shadow_walk_init
  2. kvm_mmu_get_page
  3. __direct_map
  4. kvm_mmu_prepare_zap_page
  5. kvm_mmu_commit_zap_page
  6. mmu_alloc_direct_roots

Take a guest with a common 4GB of memory as an example. A page contains 4KB/8B = 512 entries, and for the 4GB, 4GB/4KB = 2^20 entries are needed, so 2^20/512 = 2048 pages of 4KB size should be used to save all the table entries. Altogether this makes a space of about 4KB ∗ 2048 = 8MB. Although this may be far more than in the conventional TDP case, it is a modest demand and acceptable compared with a host machine configured with dozens of GB of RAM.

Take a guest with the common 4GB of memory as an example: a page holds 4KB/8B = 512 entries (i.e. one page indexes 512 entries), and 4GB needs 4GB/4KB = 2^20 entries,

so 2^20 / 512 = 2048 pages of 4KB are used to store all the table entries, about 4KB ∗ 2048 = 8MB in total. Although this is far more than in the conventional TDP case, it is a modest and acceptable cost for a host configured with dozens of GB of RAM.
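As a quick sanity check of the arithmetic above, a few lines of Java reproduce the 8MB figure (illustrative only, not code from the paper):

public class TdpTableSizeDemo {
    public static void main(String[] args) {
        long guestMemory    = 4L << 30;                      // 4GB guest
        long pageSize       = 4L << 10;                      // 4KB pages
        long entriesPerPage = pageSize / 8;                  // 8-byte entries -> 512
        long totalEntries   = guestMemory / pageSize;        // 2^20 entries
        long tablePages     = totalEntries / entriesPerPage; // 2048 table pages
        long tableBytes     = tablePages * pageSize;         // 8MB of TDP table

        System.out.println(tableBytes / (1 << 20) + " MB");  // prints "8 MB"
    }
}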

On the other hand, with the 2MB TDP large pages, only 4 kvm mmu page instances are sufficient to cover the entire 4GB address space. Only 4 entries are filled in the root table, which poses no pressure at all to the TLB. For an arbitrary guest virtual address, at most 2 ∗ 5 + 4 = 14 (10 in hypervisor, 4 in guest) memory accesses are enough to get the host physical address - far less than that of the current translation scheme (20 in hypervisor, 4 in guest). With a flatter TDP page table and reduced number of memory access, the KVM guest is expected to be less sensitive to workloads and yield higher performance.

On the other hand, with 2MB TDP large pages, only 4 kvm_mmu_page instances are enough to cover the whole 4GB space, and only 4 entries are filled into the root table, which puts no pressure on the TLB. For an arbitrary guest virtual address, at most 2 ∗ 5 + 4 = 14 memory accesses (10 in the hypervisor, 4 in the guest) are enough to obtain the host physical address, far fewer than in the current translation scheme (20 in the hypervisor, 4 in the guest). With a flatter TDP page table and fewer memory accesses, the KVM guest is expected to be less sensitive to workloads and deliver better performance.

We studied the current implementation of the SPT and TDP for KVM, and attempted to simplify the second dimension of paging of the TDP based on a change of the table structure and the related functions in the hypervisor. With this change in software, the current TDP paging level could be reduced and the overall guest performance should improve. We have implemented a part of this design and found that the large TDP page table could be allocated without problem as long as the amount is less than 4MB. However, as it is a relatively radical change to the traditional mainstream KVM source code, many functions within the mmu code are affected, so an executable implementation as well as benchmark results are unfortunately not yet available. In the future we will keep engaging with this task and work out a concrete solution based on this design.

We studied the current SPT and TDP implementation in KVM and planned to simplify the second dimension of TDP paging based on a change to the TDP table structure and the related hypervisor functions. With this software-level change, the TDP paging level can be reduced and overall guest performance should improve. We implemented part of the design and found that a large TDP page table can be allocated successfully as long as it is smaller than 4MB. However, since this is a change to the mainstream KVM code, a lot of mmu-related code is affected, and an executable implementation and benchmark results are not yet available; we hope to continue the concrete implementation and design in the future.

References

  1. Preiss, B.R., Eng, P.: Data Structures and Algorithms with Object-Oriented Design Patterns in Java. Wiley, Chichester (1999)
  2. Hoang, G., Bae, C., Lange, J., Zhang, L., Dinda, P., Joseph, R.: A case for alternative nested paging models for virtualized systems. Comput. Archit. Lett. 9, 17–20, University of Michigan (2010)
  3. Wang, X., Zang, J., Wang, Z., Luo, Y., Li, X.: Selective hardware/software memory virtualization, VEE 2011, Department of Computer Science and Technology, Beijing University, March 2011
  4. Bae, C.S., Lange, J.R., Dinda, P.A.: Enhancing virtualized application performance through dynamic adaptive paging mode selection, Northwestern University and University of Pittsburgh, ICAC 2011, June 2011
  5. Ahn, J., Jin, S., Huh, J.: Revisiting hardware-assisted page walks for virtualized systems. Computer Science Department, KAIST, ISCA 2012, April 2012
  6. Gandhi, J., Basu, A., Hill, M.D., Swift, M.M.: Efficient memory virtualization. University of Wisconsin-Madison and AMD Research, October 2014
  7. Advanced Micro Devices Inc.: AMD-V Nested Paging White Paper. Advanced Micro Devices, July 2008
  8. Bhargava, R., Serebrin, B., Spadini, F., Manne, S.: Accelerating two-dimensional page walks for virtualized systems. Computing Solutions Group and Advanced Architecture & Technology Lab, March 2008
  9. Barr, T.W., Cox, A.L., Rixner, S.: Translation Caching: Skip, Don’t Walk (the Page Table), Rice University, June 2010
  10. Linux kernel documentation about MMU in KVM. https://www.kernel.org/doc/Documentation/virtual/kvm/mmu.txt
  11. Johnson, M.K.: Memory allocation. Linux Journal, issue 16, August 1995. http://www.linuxjournal.com/article/1133
  12. Rubini, A., Corbet, J.: Linux Device Drivers, 2nd edn, June 2014. http://www.xml.com/ldd/chapter/book/ch13.html

Some thoughts about software engineer interviews

I keep thinking about how to interview for a suitable software engineer for my team, because a recent graduate was let go only three months after joining. Here are the main reasons behind that story:

  • Low learning efficiency: could not figure out what he was supposed to know; for example, it took over a month to learn basic git usage for submitting a patch.
  • Poor code quality: the code was not good, and he could not comprehend the existing code even after explanation and testing.
  • Bad manners: he dissed colleagues when asked to explain why he did not do his work as expected.

Note: I’m not sure whether these expectations are specific to my country, but his buddy and colleagues spent a lot of energy solving problems that were his own responsibility.

Fortunately, I keep a sharp eye on colleagues’ work, so he did not disturb me very much. But afterwards I started thinking about whether there is a way to hire a more suitable developer and prevent this from happening again. (Maybe our interview is too easy for fresh graduates.)

Details about current interview

Introduction for current interview

Typically, every interviewee needs to attend two or three rounds of technical interviews:

  1. an interview for basic skills, such as coding skills or the knowledge we require
  2. an interview for project engineering: explain the technical details of a project they have worked on and show they understand its big picture
  3. an interview for code reading/writing and math problem-solving ability (not always used)

But many teams only interview their own candidates, so the skills may not be well assessed or the process not executed as expected, because we do not have any objective test to measure the result of each interview. Only subjective judgment is used, so when we try to scale a team or need more developers, the hiring period becomes much longer than expected.

Requirements

From the last failure, some basic requirements need to be recorded.

  • Learning skills. Being self-driven is essential for a developer.
  • Coding skills. Typically not just reading code but comprehending it.
  • Good manners. Maybe an optional quality, but it is important for teamwork.

Besides those, more basic requirements are included in the current list:

  • System knowledge: operating system usage, architecture, and so on.
  • Coding skills: code usage (including useful libraries and programming language implementation details).
  • Communication skills: hard to judge through an interview, since they may be pretended.
  • Teamwork skills: for those who have work experience.

So in my expectation, more personal skills need to be assessed in the interview. But in my opinion, it is not easy for a developer acting as an interviewer to judge those aspects of an interviewee directly.

Well-formed questions need to be collected for all the aspects we require of a “suitable developer”.

Research before planning

More research needs to be done before we settle on a solution.

When I searched for “how to interview software engineers”, I got this page:

https://arc.dev/employer-blog/software-engineer-interview-questions/

There are some interesting questions:

  1. Discuss one of your previous projects and explain how you completed it successfully.
  2. When you ran into an obstacle with your project, how did you handle the issue?
  3. What are your thoughts on unit testing?
  4. What is your process for finding a bug in an application?

Combined with experience and actual work, debugging skills and project skills can be easily tested this way.

So those kinds of open questions seem a good approach for an interview.

Besides those questions, a knowledge test should also be used for graduates, but that part is quite easy.

Plan

Based on the discussion above, we decided to make a plan to make interviews easier for us:

  • Replace separate per-team interviews with a shared process for the whole backend team.
  • Use prepared questions so that different interviewers can produce comparable, objective results.
  • Add more tests for graduates.
  • Include logic analysis tests in the interview questions.

Build internal maven repositories


Installation of management tool

Choose Sonatype Nexus Repository Manager to build internal maven repositories.

Download link: https://help.sonatype.com/repomanager3/product-information/download

Select a tarball matching your system and execute the following commands:

wget https://download.sonatype.com/nexus/3/nexus-3.41.1-01-unix.tar.gz
tar zxvf nexus-3.41.1-01-unix.tar.gz
cd nexus-3.41.1-01/bin/
./nexus start

Because nexus is started as root, check the log:

tail -f /root/sonatype-work/nexus3/log/nexus.log

The application is started:

-------------------------------------------------

Started Sonatype Nexus OSS 3.41.1-01

-------------------------------------------------

But we want this application to run as a service, so add some systemd configuration: create a file /etc/systemd/system/nexus.service containing:

[Unit]
Description=nexus service
After=network.target

[Service]
Type=forking
LimitNOFILE=65536
ExecStart=/root/nexus-3.41.1-01/bin/nexus start
ExecStop=/root/nexus-3.41.1-01/bin/nexus stop
User=root
Restart=on-abort
TimeoutSec=600

[Install]
WantedBy=multi-user.target

then enable and start nexus:

sudo systemctl daemon-reload
sudo systemctl enable nexus.service
sudo systemctl start nexus.service

After that, the management tool installation is finished; let’s start to build the internal maven repositories.

Create private repository

First, access Sonatype Nexus on port 8081 and finish the setup wizard.

Repository types

According to the nexus docs, there are several repository types:

  • Proxy repository: linked to a remote repository and acts as a cache for local requests
  • Hosted repository: stored locally and follows the maven policy
  • Repository group: combines multiple repositories into one

For our usage, we choose a maven2 hosted repository to publish our own jars.

The following configuration is used for this repository:

format: maven2
type: hosted
layout policy: strict
content disposition: inline
blob store: default
deployment policy: disable redeploy

Now try to upload the first jar to our hosted repository.

Access http://your_nexus_address/#browse/upload:your_repo to upload a jar with its group id and artifact id.

Then, change pom.xml to enable the repository:

<repositories>
<repository>
<id>zstack-premium</id>
<name>zstack-premium</name>
<url>http://your_nexus_address/repository/your_repo/</url>
</repository>
</repositories>

But if you use Maven 3 you may need to update your settings to allow HTTP access:

<mirror>
<id>zstack-premium-mirror</id>
<name>zstack-premium</name>
<url>http://your_nexus_address/repository/your_repo/</url>
<mirrorOf>zstack-premium</mirrorOf>
</mirror>

Run mvn install to verify the lib is downloaded as expected.

Deploy project to the repository

Add distribution management into your pom.xml

<distributionManagement>
<repository>
<id>your_repo_id</id>
<name>readable name</name>
<url>your_repo_address</url>
</repository>
</distributionManagement>

to set which repository address your deployment should go to, then add the server-related settings to your .m2/settings.xml:

<server>
<id>your_repo_id</id>
<username>your_username</username>
<password>your_password</password>
</server>

Then use mvn deploy to finish the upload.

Some errors occurred during the configuration phase:

  • Error status code 401 means authentication failure; use mvn -X to check:

    [DEBUG] Reading global settings from /opt/maven/conf/settings.xml
    [DEBUG] Reading user settings from /root/.m2/settings.xml
    Two configuration files are used; adding the server section to the global one fixes the issue.

  • Error status code 400: the nexus repository’s deployment policy should be changed to allow redeploy.

Shared timeout implementation under a multi-threaded context model

This post shows a timeout implementation for a multi-threaded Java application, discusses its advantages and disadvantages, and covers the difficulties we met in practice.

Why shared timeout

For a complex microservice system, messages are the basic way all modules communicate. So timeouts come into use, for every API or message, to avoid tasks that never finish.

Think about doing something without a timeout: it may never finish with a response. Timeouts help the user and the system get out of such situations.

But when we talk about a shared timeout, what is it and what does it do?

For example, one API may send several messages to be executed on different services before it finally finishes. During the process, if a timeout happens, the API should return as timed out, as expected. So the timeout mechanism should share the same timeout period across the whole API.

Assume the API timeout is T and three messages are used for the API’s execution: message1 takes time T1, message2 takes T2 and message3 takes T3. In the shared timeout situation, message1’s timeout is T, message2’s timeout is T - T1, and message3’s timeout is T - T1 - T2. The remaining timeout should be used for the next message so that all sub-messages time out as expected.

Application level implementation

For message-level timeouts, the easiest place to implement timeout management is the message bus. Every time a message is received, checking its timeout header and calculating its remaining timeout seems to make sense.

But if we track the timeout by subtracting the time each message has used, the mechanism becomes more complex, because every message’s time usage needs to be recorded, while in most production deployments message profiling is disabled to avoid performance overhead.

To solve this problem, we use a message deadline as metadata for the message lifecycle. Every newly arriving message is assigned a deadline according to its configuration, and every sub-message calculates its remaining timeout by subtracting the current time from the deadline. With a common service to get the current time, this timeout mechanism becomes more efficient (see the sketch below).
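Here is a minimal sketch of the deadline idea, assuming a hypothetical Message class with a deadline field; this is not ZStack’s actual API, only an illustration of how the remaining timeouts T, T - T1, T - T1 - T2 fall out of a single absolute deadline:

import java.util.concurrent.TimeUnit;

// Hypothetical message carrying an absolute deadline instead of a relative timeout.
class Message {
    final long deadlineMillis;

    Message(long deadlineMillis) {
        this.deadlineMillis = deadlineMillis;
    }

    // An API message gets a deadline from its configured timeout T.
    static Message newApiMessage(long timeoutMillis) {
        return new Message(System.currentTimeMillis() + timeoutMillis);
    }

    // Every sub-message inherits the same deadline, so its remaining
    // timeout is deadline - now (i.e. T - T1, then T - T1 - T2, ...).
    Message newSubMessage() {
        return new Message(deadlineMillis);
    }

    long remainingTimeout(TimeUnit unit) {
        long remaining = deadlineMillis - System.currentTimeMillis();
        return unit.convert(Math.max(remaining, 0), TimeUnit.MILLISECONDS);
    }
}

With this shape, no per-message time accounting is needed: whichever service handles a sub-message only has to compare the shared deadline with the current time.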

More challenges

In a multi-threaded application, lifecycle maintenance of a message’s timeout really matters.

Especially since ZStack uses an in-process microservice architecture, messages pass through memory or HTTP; for API messages the timeout always works well, but for internal messages more problems came out.

For example, an async invoker utility function may send several messages, but they share the same thread before being handled, so the thread’s context is used as each message’s initial context, where the timeout is stored.

When the thread context changes, we can clear the context through the thread pool’s thread lifecycle hooks.

But in some use cases the timeout does not work as expected:

  • GC task (an in-memory task triggered at a fixed rate or by a system event)
  • Thread-level task (async and sync task queues)

GC task

A GC task is used to handle unexpected async operations, retry deleting some resource, and so on.

If the thread that submits the task also executes it, the submitter’s context is used directly, and that context can mess up the GC task’s execution. So always using a new thread to start a GC job is a good choice.

In a multi-threaded application, message delivery and handling involve different threads, especially when a task driver is used to construct workflows and execute tasks. So the timeout context needs to be passed from one thread to another.

Assume Thread1 runs task1 and finishes by submitting a new task2; some time later Thread2 starts to handle task2, and at that point task2 must already carry the timeout, otherwise the timeout cannot be passed on.

Fixed thread task

The same thread handles all tasks, so a task should store its context when it is submitted to the task queue.

The executor then needs to restore the task context before execution (a sketch follows).
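Below is a minimal sketch of that capture-and-restore step, assuming a ThreadLocal-based context holder; the class and method names are illustrative, not ZStack’s real TaskContext:

import java.util.HashMap;
import java.util.Map;

final class TaskContextHolder {
    // Per-thread context; in this sketch it just holds key/value pairs such as the deadline.
    private static final ThreadLocal<Map<String, Object>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    static Map<String, Object> snapshot() {
        return new HashMap<>(CTX.get());
    }

    static void restore(Map<String, Object> saved) {
        CTX.set(new HashMap<>(saved));
    }

    static void clear() {
        CTX.remove();
    }

    // Capture the submitter's context at submit time and restore it in the worker thread.
    static Runnable wrap(Runnable task) {
        Map<String, Object> saved = snapshot();
        return () -> {
            restore(saved);
            try {
                task.run();
            } finally {
                clear(); // avoid leaking context into the next task on this worker thread
            }
        };
    }
}

Submitting executor.submit(TaskContextHolder.wrap(task)) would then carry the timeout (and any other context fields) across the thread switch; the AOP-based TaskContext described below automates essentially this wrapping.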

Benefits of the concept

For an in-process architecture, using a global timeout throughout an API or inner task’s lifecycle means the whole timeout can be managed in one place.

Easy implementation

In a Java program, using AOP to maintain the timeout get/set seems a good choice.

A typical ZStack task workflow usually starts with an API. For a newly arrived API message, ZStack sets a timeout on it, but in order to know the parent message’s timeout, we need to attach the timeout information to the message.

So a design named TaskContext was created to contain the global variables for the whole task lifecycle. Using AOP, every async task inherits its parent’s TaskContext and clears it before starting. With TaskContext, the timeout can be managed.

But some user scenarios still need to be discussed; they are listed here before the details:

  • An inner message is used as the start of a task; it should support a timeout.
  • An API uses an inner message configured with its own timeout, which should be supported.
  • How does a newly added mechanism become aware of the timeout in the task context?
  • How do we avoid the task context being messed up?

Inner message

Inner-message-level timeout configuration needs to be supported: some tasks are executed by the GC tasks mentioned above and may send inner messages directly, so a timeout may be requested for those tasks.

Duplicate configuration

An API message may use a workflow containing several inner messages; when those messages all have their own timeout configuration, we need to use the original timeout rather than the newly configured ones.

Aware of timeout

This is not always practical, because the task context is an in-memory variable keyed by thread id, so every time the thread switches, the task context needs to be copied to the new thread. If any mechanism does not support copying the task context, the timeout is lost; then, if an inner message is used, a new timeout will be set by the timeout manager. There seems to be no good way to make a new mechanism automatically aware of this; that is the shortcoming of using AOP.

Do not touch task context

Only the timeout manager should use the task context for timeout handling; other usages, including manually clearing it or setting values, should be avoided.

But for some reasons, access to TaskContext is supposed to be available to the core module for timeout or other context usage (for example, the task id), so we only keep it cleared after a thread context switch and only assign values to fields the framework knows about, to avoid other developers messing it up.

Conclusion

Actually the task context is more like a global variable available to every thread. Keeping it from being abused and from causing OOM is the first task, and AOP is involved to solve this problem. How to trace the task context seems to be the next valuable goal for this version of the code.

Check the code

Check the code at https://github.com/zstackio/zstack if you are interested in this feature.

Guest Free Page Hinting notes 01

According to the virtio 1.2 release, the virtio-balloon device has a new feature: free page hints.

Since I didn’t really know what it does, I did some research.

KVM: Guest Free Page Hinting

In February 2019 there was this mailing list thread: https://lwn.net/Articles/778432/

The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.

The main purpose seems to be to optimize free-memory handling between the guest and the host, so that guests with no page cache can rapidly free memory to, and reclaim it from, the host.

It also mentions:

Known code re-work:

  • Plan to re-use Wei’s work, which communicates the poison value to the host.
  • The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei’s Free Page Hint code.
  • Sorting based on zonenum, to avoid repetitive zone locks for the same zone.

Some renaming in virtio-balloon is needed so that this code can easily be distinguished from Wei’s Free Page Hint code.

Put that way, it sounds like the hinting code and virtio-balloon are two separate pieces of work.

Virtio-balloon: support free page reporting

Following the material above, I found another thread that directly covers the virtio-balloon change: https://lwn.net/Articles/759413/

The newly added VIRTIO_BALLOON_F_FREE_PAGE_HINT there matches the virtio 1.2 spec: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.pdf

Quoting the description from the patch:

Live migration needs to transfer the VM's memory from the source machine
to the destination round by round. For the 1st round, all the VM's memory
is transferred. From the 2nd round, only the pieces of memory that were
written by the guest (after the 1st round) are transferred. One method
that is popularly used by the hypervisor to track which part of memory is
written is to write-protect all the guest memory.

This feature enables the optimization by skipping the transfer of guest
free pages during VM live migration. It is not concerned that the memory
pages are used after they are given to the hypervisor as a hint of the
free pages, because they will be tracked by the hypervisor and transferred
in the subsequent round if they are used and written.

For live migration, memory is copied round by round. The first round copies all of the guest’s memory; subsequent rounds only copy the memory the guest has written.

So the hypervisor needs to track which memory the guest has written and copy it again.

This feature optimizes the transfer of guest free pages: pages already hinted as free are skipped, and if they are later used or written, they are copied in a subsequent round.

From this description we can tell the optimization needs a mechanism to provide free page hints, which live migration then builds on.

A side note

Before reading the code, I found an interesting exchange:

- mm/get_from_free_page_list: The new implementation to get free page
hints based on the suggestions from Linus:
https://lkml.org/lkml/2018/6/11/764
This avoids the complex call chain, and looks more prudent.

Regarding the get_from_free_page_list operation, Linus replied with a long piece of advice.

An interesting part: should a new GFP_NONE be added to make any attempted allocation fail?

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?

Linus’s reply was that instead of such a complex call chain that can trigger memory allocation, a simple mechanism that avoids the problem in the first place would be better:

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

There were also long suggestions about simplifying the code, including this remark that the code was too complex and too fragile:

The whole sequence of events really looks "this is too much
complexity, and way too fragile" to me at so many levels.

This reminds me of the implementation logic of some features in ZStack today:

  1. Mechanism B is implemented to solve a problem in mechanism A.
  2. An unfamiliar mechanism A is reused, while the problems inherent in A are ignored.

Take an actual feature as an example: VM kernel panic detection requires two things:

  1. Add a pvpanic device to the VM’s XML configuration.
  2. Enable the pvpanic kernel module inside the guest.

So determining whether the feature is usable requires cooperation between guest and host.

Based on this premise, the guest-side logic needs to report whether pvpanic is supported inside the guest, and the host side needs to read from the configuration whether pvpanic was configured, so implementing this logic means querying both pieces of information separately. Querying the host-side configuration ultimately caused some control-plane bugs.

Reflecting on it later: the design started from reading the host configuration but ignored the premise that the configuration does not change at runtime. There was actually no need to add an extra query; doing so made the feature depend on the existing configuration-query mechanism and ultimately led to more complex symptoms.

People in the kernel’s open-source world run into the same kind of problem, so having a solid methodology for feature design matters; at the very least it guides how to design better and raises the level of committers and coders.

Thinking about it the other way around:

  1. The features a VM supports, as reported by the guest tools at runtime, are in fact tied to the guest tools version; once fetched, they do not need to be fetched again and again.
  2. Only when the guest tools version changes does the information need to be refreshed.
  3. Providing a way to actively refresh the guest tools’ feature list is enough.

Decomposed this way, the VM itself should simply store this feature information instead of fetching it every time, which simplifies the design of the mechanism a lot.

The previous design, by contrast, did not start from an understanding of the whole feature; it was more like the “add GFP_NONE to solve the problem” line of thinking.

Java Class getName() vs getSimpleName()

While working with Java reflection, invoking getSimpleName() hit java.lang.NoClassDefFoundError.

Using getName() instead, the logic seems to work well.

Before comparing the two methods’ differences, let’s go through the code quickly.

Class::getName()

for getName()

public String getName() {
String name = this.name;
if (name == null)
this.name = name = getName0();
return name;
}

and getName0() is a native method:

private native String getName0();

Class::getSimpleName()

for getSimpleName()

public String getSimpleName() {
if (isArray())
return getComponentType().getSimpleName()+"[]";

String simpleName = getSimpleBinaryName();
if (simpleName == null) { // top level class
simpleName = getName();
return simpleName.substring(simpleName.lastIndexOf(".")+1); // strip the package name
}
// According to JLS3 "Binary Compatibility" (13.1) the binary
// name of non-package classes (not top level) is the binary
// name of the immediately enclosing class followed by a '$' followed by:
// (for nested and inner classes): the simple name.
// (for local classes): 1 or more digits followed by the simple name.
// (for anonymous classes): 1 or more digits.

// Since getSimpleBinaryName() will strip the binary name of
// the immediatly enclosing class, we are now looking at a
// string that matches the regular expression "\$[0-9]*"
// followed by a simple name (considering the simple of an
// anonymous class to be the empty string).

// Remove leading "\$[0-9]*" from the name
int length = simpleName.length();
if (length < 1 || simpleName.charAt(0) != '$')
throw new InternalError("Malformed class name");
int index = 1;
while (index < length && isAsciiDigit(simpleName.charAt(index)))
index++;
// Eventually, this is the empty string iff this is an anonymous class
return simpleName.substring(index);
}

The main part is

String simpleName = getSimpleBinaryName();

It then gets the enclosing class:

private String getSimpleBinaryName() {
Class<?> enclosingClass = getEnclosingClass();
if (enclosingClass == null) // top level class
return null;
// Otherwise, strip the enclosing class' name
try {
return getName().substring(enclosingClass.getName().length());
} catch (IndexOutOfBoundsException ex) {
throw new InternalError("Malformed class name", ex);
}
}

What is an enclosing class?

// There are five kinds of classes (or interfaces):
// a) Top level classes
// b) Nested classes (static member classes)
// c) Inner classes (non-static member classes)
// d) Local classes (named classes declared within a method)
// e) Anonymous classes

In my case it tends to be a) a top level class, so the next part of the code goes to getDeclaringClass():

// JVM Spec 4.8.6: A class must have an EnclosingMethod
// attribute if and only if it is a local class or an
// anonymous class.
EnclosingMethodInfo enclosingInfo = getEnclosingMethodInfo();
Class<?> enclosingCandidate;

if (enclosingInfo == null) {
// This is a top level or a nested class or an inner class (a, b, or c)
enclosingCandidate = getDeclaringClass();
} else {
Class<?> enclosingClass = enclosingInfo.getEnclosingClass();
// This is a local class or an anonymous class (d or e)
if (enclosingClass == this || enclosingClass == null)
throw new InternalError("Malformed enclosing method information");
else
enclosingCandidate = enclosingClass;
}

and then getDeclaringClass0() will be used.

@CallerSensitive
public Class<?> getDeclaringClass() throws SecurityException {
final Class<?> candidate = getDeclaringClass0();

if (candidate != null)
candidate.checkPackageAccess(
ClassLoader.getClassLoader(Reflection.getCallerClass()), true);
return candidate;
}

which is a native method:

private native Class<?> getDeclaringClass0();

Comparison

From the code above, getSimpleName() obviously always ends up calling a native method, while getName() may use the Class’s cached field directly.

So let’s look at that field before we come to a conclusion.

// cache the name to reduce the number of calls into the VM
private transient String name;

The name used by getName() is a transient String cached to reduce the number of calls into the VM.

Combined with getName()’s implementation: the first time getName0() is invoked, this name is set.

Now to the Javadoc:

public String getSimpleName()

Returns the simple name of the underlying class as given in the source code. Returns an empty string if the underlying class is anonymous.

The simple name of an array is the simple name of the component type with “[]” appended. In particular the simple name of an array whose component type is anonymous is “[]”.

  • Returns:

    the simple name of the underlying class

  • Since:

    1.5

public String getName()

Returns the name of the entity (class, interface, array class, primitive type, or void) represented by this Class object, as a String.

If this class object represents a reference type that is not an array type then the binary name of the class is returned, as specified by The Java™ Language Specification.

If this class object represents a primitive type or void, then the name returned is a String equal to the Java language keyword corresponding to the primitive type or void.

If this class object represents a class of arrays, then the internal form of the name consists of the name of the element type preceded by one or more ‘[‘ characters representing the depth of the array nesting. The encoding of element type names is as follows:

Element Type Encoding
boolean Z
byte B
char C
class or interface Lclassname;
double D
float F
int I
long J
short S

The class or interface name classname is the binary name of the class specified above.

Examples:

String.class.getName()
returns "java.lang.String"
byte.class.getName()
returns "byte"
(new Object[3]).getClass().getName()
returns "[Ljava.lang.Object;"
(new int[3][4][5][6][7][8][9]).getClass().getName()
returns "[[[[[[[I"

  • Returns:

    the name of the class or interface represented by this object.

and more details for the error:

public class NoClassDefFoundError
extends LinkageError

Thrown if the Java Virtual Machine or a ClassLoader instance tries to load in the definition of a class (as part of a normal method call or as part of creating a new instance using the new expression) and no definition of the class could be found.

The searched-for class definition existed when the currently executing class was compiled, but the definition can no longer be found.

So that means the class was found at compile time but is not available at runtime. Refer to this link.

Check for our configuration:

<dependency>
<groupId>xxx</groupId>
<artifactId>xxx</artifactId>
<version>1.1.1</version>
<scope>system</scope>
<systemPath>${project.basedir}/ext-libs/xxx</systemPath>
</dependency>

It uses a system scope.

According to maven doc:

There are 6 scopes:

  • compile
    This is the default scope, used if none is specified. Compile dependencies are available in all classpaths of a project. Furthermore, those dependencies are propagated to dependent projects.
  • provided
    This is much like compile, but indicates you expect the JDK or a container to provide the dependency at runtime. For example, when building a web application for the Java Enterprise Edition, you would set the dependency on the Servlet API and related Java EE APIs to scope provided because the web container provides those classes. A dependency with this scope is added to the classpath used for compilation and test, but not the runtime classpath. It is not transitive.
  • runtime
    This scope indicates that the dependency is not required for compilation, but is for execution. Maven includes a dependency with this scope in the runtime and test classpaths, but not the compile classpath.
  • test
    This scope indicates that the dependency is not required for normal use of the application, and is only available for the test compilation and execution phases. This scope is not transitive. Typically this scope is used for test libraries such as JUnit and Mockito. It is also used for non-test libraries such as Apache Commons IO if those libraries are used in unit tests (src/test/java) but not in the model code (src/main/java).
  • system
    This scope is similar to provided except that you have to provide the JAR which contains it explicitly. The artifact is always available and is not looked up in a repository.
  • import
    This scope is only supported on a dependency of type pom in the <dependencyManagement> section. It indicates the dependency is to be replaced with the effective list of dependencies in the specified POM’s <dependencyManagement> section. Since they are replaced, dependencies with a scope of import do not actually participate in limiting the transitivity of a dependency.

system is most like provided, but a dependency with this scope is added to the classpath used for compilation and test, not the runtime classpath,

so at runtime getSimpleName() hits the exception.

Class loader

get back to the error call trace again:

Caused by: java.lang.NoClassDefFoundError: xxxxx
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_161]
at java.lang.ClassLoader.defineClass(ClassLoader.java:763) ~[?:1.8.0_161]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_161]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) ~[?:1.8.0_161]
at java.net.URLClassLoader.access$100(URLClassLoader.java:73) ~[?:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:368) ~[?:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:362) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at java.net.URLClassLoader.findClass(URLClassLoader.java:361) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_161]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_161]
at java.lang.Class.getDeclaringClass0(Native Method) ~[?:1.8.0_161]

and another class not found exception:

Caused by: java.lang.ClassNotFoundException: xxxxx
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_161]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_161]
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_161]
at java.lang.ClassLoader.defineClass(ClassLoader.java:763) ~[?:1.8.0_161]
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_161]
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) ~[?:1.8.0_161]
at java.net.URLClassLoader.access$100(URLClassLoader.java:73) ~[?:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:368) ~[?:1.8.0_161]
at java.net.URLClassLoader$1.run(URLClassLoader.java:362) ~[?:1.8.0_161]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_161]
at java.net.URLClassLoader.findClass(URLClassLoader.java:361) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_161]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338) ~[?:1.8.0_161]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_161]
at java.lang.Class.getDeclaringClass0(Native Method) ~[?:1.8.0_161]

When trying to get the simple name of a class whose enclosing class A extends B:

the class loader finds A and tries to define it, but it needs to define B first; B is not available at runtime, so the exception is raised.
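A hedged way to reproduce the difference on JDK 8, where getSimpleName() goes through getDeclaringClass0() as shown above (all class names here are made up for illustration): compile a nested class whose enclosing class extends a library type, then drop that library from the runtime classpath, mimicking the system-scope situation described earlier.

// Compiled against a jar that provides some.lib.Base, then run without it,
// mimicking a <scope>system</scope> dependency that is missing at runtime.
public class Outer extends some.lib.Base {
    public static class Inner {}
}

class Demo {
    public static void main(String[] args) throws Exception {
        // Loading Inner alone does not require Outer or some.lib.Base yet.
        Class<?> inner = Class.forName("Outer$Inner", false, Demo.class.getClassLoader());

        System.out.println(inner.getName());        // prints "Outer$Inner" - no extra loading
        System.out.println(inner.getSimpleName());  // triggers getDeclaringClass0() -> loads Outer
                                                    // -> needs some.lib.Base -> NoClassDefFoundError
    }
}

getName() only reads the cached name string, while getSimpleName() walks through getDeclaringClass0(), which forces Outer (and therefore its missing superclass) to be loaded - matching the stack traces above.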