Dive into edk2-ovmf fisrt debug trip

Prepare environment

install compiler related rpm

1
yum --disablerepo=* --enablerepo=ali* install -y make gcc binutils iasl nasm libuuid-devel gcc-c++

if cross-build firmware on x86 machine, install cross compilers:

1
yum --disablerepo=* --enablerepo=ali* install -y gcc-aarch64-linux-gnu gcc-arm-linux-gnu

get source code

1
2
3
git clone https://github.com/tianocore/edk2.git
cd edk2
git submodule update --init

than setup environment variables:

1
source edksetup.sh

build base tools at first:

1
make -C BaseTools

note: only need once

Build firmware

Then build firmware for x64 qemu:

1
build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc

The firmware volumes built can be found in Build/OvmfX64/DEBUG_GCC5/FV.

Building the aarch64 firmware instead:

1
build -t GCC5 -a AARCH64 -p ArmVirtPkg/ArmVirtQemu.dsc

The build results land in Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV.

Qemu expects the aarch64 firmware images being 64M im size. The firmware images can’t be used as-is because of that, some padding is needed to create an image which can be used for pflash:

1
2
3
4
dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
dd of="QEMU_VARS-pflash.raw" if="/dev/zero" bs=1M count=64
dd of="QEMU_VARS-pflash.raw" if="QEMU_VARS.fd" conv=notrunc

There are a bunch of compile time options, typically enabled using -D NAME or -D NAME=TRUE. Options which are enabled by default can be turned off using -D NAME=FALSE. Available options are defined in the *.dsc files referenced by the build command. So a feature-complete build looks more like this:

1
2
3
4
5
6
build -t GCC5 -a X64 -p OvmfPkg/OvmfPkgX64.dsc \
-D FD_SIZE_4MB \
-D NETWORK_IP6_ENABLE \
-D NETWORK_HTTP_BOOT_ENABLE \
-D NETWORK_TLS_ENABLE \
-D TPM2_ENABLE

From OvmfPkgX64.dsc lots of features is defined.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
  #
# Network definition
#
DEFINE NETWORK_TLS_ENABLE = FALSE
DEFINE NETWORK_IP6_ENABLE = FALSE
DEFINE NETWORK_HTTP_BOOT_ENABLE = FALSE
DEFINE NETWORK_ALLOW_HTTP_CONNECTIONS = TRUE
DEFINE NETWORK_ISCSI_ENABLE = TRUE

!include NetworkPkg/NetworkDefines.dsc.inc

#
# Device drivers
#
DEFINE PVSCSI_ENABLE = TRUE
DEFINE MPT_SCSI_ENABLE = TRUE
DEFINE LSI_SCSI_ENABLE = FALSE

#
# Flash size selection. Setting FD_SIZE_IN_KB on the command line directly to
# one of the supported values, in place of any of the convenience macros, is
# permitted.
#
!ifdef $(FD_SIZE_1MB)
DEFINE FD_SIZE_IN_KB = 1024
!else
!ifdef $(FD_SIZE_2MB)
DEFINE FD_SIZE_IN_KB = 2048
!else
!ifdef $(FD_SIZE_4MB)
DEFINE FD_SIZE_IN_KB = 4096
!else
DEFINE FD_SIZE_IN_KB = 4096

Secure boot support (on x64) requires SMM mode. Well, it builds and works without SMM, but it’s not secure then. Without SMM nothing prevents the guest OS writing directly to flash, bypassing the firmware, so protected UEFI variables are not actually protected.

Also suspend (S3) support works with enabled SMM only in case parts of the firmware (PEI specifically, see below for details) run in 32bit mode. So the secure boot variant must be compiled this way:

1
2
3
4
5
build -t GCC5 -a IA32 -a X64 -p OvmfPkg/OvmfPkgIa32X64.dsc \
-D FD_SIZE_4MB \
-D SECURE_BOOT_ENABLE \
-D SMM_REQUIRE \
[ ... add network + tpm + other options as needed ... ]

The FD_SIZE_4MB option creates a larger firmware image, being 4MB instead of 2MB (default) in size, offering more space for both code and vars. The RHEL/CentOS builds use that. The Fedora builds are 2MB in size, for historical reasons.

If you need 32-bit firmware builds for some reason, here is how to do it:

1
2
build -t GCC5 -a ARM -p ArmVirtPkg/ArmVirtQemu.dsc
build -t GCC5 -a IA32 -p OvmfPkg/OvmfPkgIa32.dsc

The build results will be in in Build/ArmVirtQemu-ARM/DEBUG_GCC5/FV and Build/OvmfIa32/DEBUG_GCC5/FV

Booting fresh firmware builds

The x86 firmware builds create three different images:

  • OVMF_VARS.fd

    This is the firmware volume for persistent UEFI variables, i.e. where the firmware stores all configuration (boot entries and boot order, secure boot keys, …). Typically this is used as template for an empty variable store and each VM gets its own private copy, libvirt for example stores them in /var/lib/libvirt/qemu/nvram.

  • OVMF_CODE.fd

    This is the firmware volume with the code. Separating this from VARS does (a) allow for easy firmware updates, and (b) allows to map the code read-only into the guest.

  • OVMF.fd

    The all-in-one image with both CODE and VARS. This can be loaded as ROM using -bios, with two drawbacks: (a) UEFI variables are not persistent, and (b) it does not work for SMM_REQUIRE=TRUE builds.

qemu handles pflash storage as block devices, so we have to create block devices for the firmware images:

1
2
3
4
5
6
7
CODE=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_CODE.fd
VARS=${WORKSPACE}/Build/OvmfX64/DEBUG_GCC5/FV/OVMF_VARS.fd
qemu-system-x86_64 \
-blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
-blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
-machine q35,pflash0=code,pflash1=vars \
[ ... ]

Here is the arm version of that (using the padded files created using dd, see above):

1
2
3
4
5
6
7
CODE=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_EFI-pflash.raw
VARS=${WORKSPACE}/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/FV/QEMU_VARS-pflash.raw
qemu-system-aarch64 \
-blockdev node-name=code,driver=file,filename=${CODE},read-only=on \
-blockdev node-name=vars,driver=file,filename=${VARS},snapshot=on \
-machine virt,pflash0=code,pflash1=vars \
[ ... ]

Source code structure

The core edk2 repo holds a number of packages, each package has its own toplevel directory. Here are the most interesting ones:

  • OvmfPkg

    This holds both the x64-specific code (i.e. OVMF itself) and virtualization-specific code shared by all architectures (virtio drivers).

  • ArmVirtPkg

    Arm specific virtual machine support code.

  • MdePkg, MdeModulePkg

    Most core code is here (PCI support, USB support, generic services and drivers, …).

  • PcAtChipsetPkg

    Some Intel architecture drivers and libs.

  • ArmPkg, ArmPlatformPkg

    Common Arm architecture support code.

  • CryptoPkg, NetworkPkg, FatPkg, CpuPkg, …

    As the names of the packages already suggest: Crypto support (using openssl), Network support (including network boot), FAT Filesystem driver, …

Firmware boot phases

The firmware modules in the edk2 repo often named after the boot phase they are running in. Most drivers are named SomeThing**Dxe** for example.

  • ResetVector

    This is where code execution starts after a machine reset. The code will do the bare minimum needed to enter SEC. On x64 the most important step is the transition from 16-bit real mode to 32-bit mode or 64bit long mode.

  • SEC (Security)

    This code typically loads and uncompresses the code for PEI and SEC. On physical hardware SEC often lives in ROM memory and can not be updated. The PEI and DXE firmware volumes are loaded from (updateable) flash.

    With OVMF both SEC firmware volume and the compressed volume holding PXE and DXE code are part of the OVMF_CODE image and will simply be mapped into guest memory.

  • PEI (Pre-EFI Initialization)

    Platform Initialization is done here. Initialize the chipset. Not much to do here in virtual machines, other than loading the x64 e820 memory map (via fw_cfg) from qemu, or get the memory map from the device tree (on aarch64). The virtual hardware is ready-to-go without much extra preaparation.

    PEIMs (PEI Modules) can implement functionality which must be executed before entering the DXE phase. This includes security-sensitive things like initializing SMM mode and locking down flash memory.

  • DXE (Driver Execution Environment)

    When PEI is done it hands over control to the full EFI environment contained in the DXE firmware volume. Most code is here. All kinds of drivers. the firmware setup efi app, …

    Strictly speaking this isn’t only one phase. The code for all phases after PEI is part of the DXE firmware volume though.

Add debug to EDK2

The default OVMF build writes debug messages to IO port 0x402. The following qemu command line options save them in the file called debug.log:

1
-debugcon file:debug.log -global isa-debugcon.iobase=0x402

Or with build option

1
-D DEBUG_ON_SERIAL_PORT

serial output can be captured

1
-serial file:serial.log

note:

The RELEASE build target (‘-b RELEASE’ build option, see below) disables all debug messages. The default build target is DEBUG.

more build scripts:

On systems with the bash shell you can use OvmfPkg/build.sh to simplify
building and running OVMF.

So, for example, to build + run OVMF X64:

1
2
$ OvmfPkg/build.sh -a X64
$ OvmfPkg/build.sh -a X64 qemu

And to run a 64-bit UEFI bootable ISO image:

1
$ OvmfPkg/build.sh -a X64 qemu -cdrom /path/to/disk-image.iso

To build a 32-bit OVMF without debug messages using GCC 4.8:

1
$ OvmfPkg/build.sh -a IA32 -b RELEASE -t GCC48

UEFI Windows 7 & Windows 2008 Server

  • One of the ‘-vga std’ and ‘-vga qxl’ QEMU options should be used.
  • Only one video mode, 1024x768x32, is supported at OS runtime.
  • The ‘-vga qxl’ QEMU option is recommended. After booting the installed guest OS, select the video card in Device Manager, and upgrade its driver to the QXL XDDM one. Download location: http://www.spice-space.org/download.html, Guest | Windows binaries. This enables further resolutions at OS runtime, and provides S3 (suspend/resume) capability.

Debug in practice

Test with qemu commandline

add with config in libvirt vm xml:

1
2
3
4
5
6
<qemu:commandline>
<qemu:arg value='-debugcon'/>
<qemu:arg value='file:/var/log/libvirt/qemu/debug.log'/>
<qemu:arg value='-global'/>
<qemu:arg value='isa-debugcon.iobase=0x402'/>
</qemu:commandline>

debug log will be found in /var/log/libvirt/qemu/debug.log

before we met Win2012 internal reboot issue, use debug to check what happen, the log repeat following lines but guest seems hang on vnc:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Register PPI Notify: DCD0BE23-9586-40F4-B643-06522CED4EDE
Install PPI: 8C8CE578-8A3D-4F1C-9935-896185C32DD3
Install PPI: 5473C07A-3DCB-4DCA-BD6F-1E9689E7349A
The 0th FV start address is 0x00000820000, size is 0x000E0000, handle is 0x820000
Register PPI Notify: 49EDB1C1-BF21-4761-BB12-EB0031AABB39
Register PPI Notify: EA7CA24B-DED5-4DAD-A389-BF827E8F9B38
Install PPI: B9E0ABFE-5979-4914-977F-6DEE78C278A6
Install PPI: DBE23AA9-A345-4B97-85B6-B226F1617389
DiscoverPeimsAndOrderWithApriori(): Found 0xB PEI FFS files in the 0th FV
Loading PEIM 9B3ADA4F-AE56-4C24-8DEA-F03B7558AE50
Loading PEIM at 0x0000082BE40 EntryPoint=0x0000082F201 PcdPeim.efi
Install PPI: 06E81C58-4AD7-44BC-8390-F10265F72480
Install PPI: 01F34D25-4DE2-23AD-3FF3-36353FF323F1
Install PPI: 4D8B155B-C059-4C8F-8926-06FD4331DB8A
Install PPI: A60C6B59-E459-425D-9C69-0BCC9CB27D81
Register PPI Notify: 605EA650-C65C-42E1-BA80-91A52AB618C6
Loading PEIM A3610442-E69F-4DF3-82CA-2360C4031A23
Loading PEIM at 0x00000830B40 EntryPoint=0x00000831F5F ReportStatusCodeRouterPei.efi
Install PPI: 0065D394-9951-4144-82A3-0AFC8579C251
Install PPI: 229832D3-7A30-4B36-B827-F40CB7D45436
Loading PEIM 9D225237-FA01-464C-A949-BAABC02D31D0
Loading PEIM at 0x00000832B40 EntryPoint=0x00000833D89 StatusCodeHandlerPei.efi
Loading PEIM 222C386D-5ABC-4FB4-B124-FBB82488ACF4
Loading PEIM at 0x00000834A40 EntryPoint=0x0000083A6DF PlatformPei.efi
Select Item: 0x0
FW CFG Signature: 0x554D4551
Select Item: 0x1
FW CFG Revision: 0x3
SecCoreStartupWithStack(0xFFFCC000, 0x820000)
SEC: Normal boot
DecompressMemFvs: OutputBuffer@A00000+0xCE0090 ScratchBuffer@1700000+0x10000 PcdOvmfDecompressionScratchEnd=0x1710000

track the log to edk2 code, the entry seems start at:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
/**
Locates the PEI Core entry point address

@param[in,out] Fv The firmware volume to search
@param[out] PeiCoreEntryPoint The entry point of the PEI Core image

@retval EFI_SUCCESS The file and section was found
@retval EFI_NOT_FOUND The file and section was not found
@retval EFI_VOLUME_CORRUPTED The firmware volume was corrupted

**/
VOID
FindPeiCoreImageBase (
IN OUT EFI_FIRMWARE_VOLUME_HEADER **BootFv,
OUT EFI_PHYSICAL_ADDRESS *PeiCoreImageBase
)
{
BOOLEAN S3Resume;

*PeiCoreImageBase = 0;

S3Resume = IsS3Resume ();
if (S3Resume && !FeaturePcdGet (PcdSmmSmramRequire)) {
//
// A malicious runtime OS may have injected something into our previously
// decoded PEI FV, but we don't care about that unless SMM/SMRAM is required.
//
DEBUG ((DEBUG_VERBOSE, "SEC: S3 resume\n"));
GetS3ResumePeiFv (BootFv);
} else {
//
// We're either not resuming, or resuming "securely" -- we'll decompress
// both PEI FV and DXE FV from pristine flash.
//
DEBUG ((DEBUG_VERBOSE, "SEC: %a\n",
S3Resume ? "S3 resume (with PEI decompression)" : "Normal boot"));
FindMainFv (BootFv);

DecompressMemFvs (BootFv);
}

FindPeiCoreImageBaseInFv (*BootFv, PeiCoreImageBase);
}

and if not IsS3Resume and not SMM required, VM will use S3 resume but in our situation vm go throught Normal boot.

Than locates the comparessed main firmware volume /usr/share/edk2/ovmf/OVMF_CODE.cc.fd

1
2
3
4
5
6
7
8
9
10
11
12
13
/**
Locates the compressed main firmware volume and decompresses it.

@param[in,out] Fv On input, the firmware volume to search
On output, the decompressed BOOT/PEI FV

@retval EFI_SUCCESS The file and section was found
@retval EFI_NOT_FOUND The file and section was not found
@retval EFI_VOLUME_CORRUPTED The firmware volume was corrupted

**/
EFI_STATUS
DecompressMemFvs (

Next step is still find PEI core entry address, EFI_SECTION_PE32 and EFI_SECTION_TE will be used to find entry address

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
/**
Locates the PEI Core entry point address

@param[in] Fv The firmware volume to search
@param[out] PeiCoreEntryPoint The entry point of the PEI Core image

@retval EFI_SUCCESS The file and section was found
@retval EFI_NOT_FOUND The file and section was not found
@retval EFI_VOLUME_CORRUPTED The firmware volume was corrupted

**/
EFI_STATUS
FindPeiCoreImageBaseInFv (
IN EFI_FIRMWARE_VOLUME_HEADER *Fv,
OUT EFI_PHYSICAL_ADDRESS *PeiCoreImageBase
)
{
EFI_STATUS Status;
EFI_COMMON_SECTION_HEADER *Section;

Status = FindFfsFileAndSection (
Fv,
EFI_FV_FILETYPE_PEI_CORE,
EFI_SECTION_PE32,
&Section
);
if (EFI_ERROR (Status)) {
Status = FindFfsFileAndSection (
Fv,
EFI_FV_FILETYPE_PEI_CORE,
EFI_SECTION_TE,
&Section
);
if (EFI_ERROR (Status)) {
DEBUG ((DEBUG_ERROR, "Unable to find PEI Core image\n"));
return Status;
}
}

*PeiCoreImageBase = (EFI_PHYSICAL_ADDRESS)(UINTN)(Section + 1);
return EFI_SUCCESS;
}

This is the last step of find PEI Core image.

Back to where FindPeiCoreImageBase’s code path, we can find SecCoreStartupWithStack is the start function which is used in OvmfPkg/Sec/X64/SecEntry.nasm

SecCoreStartupWithStack -> SecStartupPhase2 -> FindAndReportEntryPoints -> FindPeiCoreImageBase

1
2
3
4
5
6
7
8
9
;
; Setup parameters and call SecCoreStartupWithStack
; rcx: BootFirmwareVolumePtr
; rdx: TopOfCurrentStack
;
mov rcx, rbp
mov rdx, rsp
sub rsp, 0x20
call ASM_PFX(SecCoreStartupWithStack)

combine with log, before next SecCoreStartupWithStack invoking:

1
2
3
4
5
Select Item: 0x0^M
FW CFG Signature: 0x554D4551^M
Select Item: 0x1^M
FW CFG Revision: 0x3^M
SecCoreStartupWithStack(0xFFFCC000, 0x820000)^M

Select Item is printed.

Because FV is successfully loaded:

1
The 0th FV start address is 0x00000820000, size is 0x000E0000, handle is 0x820000

And compare to first boot:

1
2
3
4
5
6
7
Loading PEIM at 0x00000834A40 EntryPoint=0x0000083A6DF PlatformPei.efi^M
Select Item: 0x0^M
FW CFG Signature: 0x554D4551^M
Select Item: 0x1^M
FW CFG Revision: 0x3^M
QemuFwCfg interface (DMA) is supported.^M
Platform PEIM Loaded^M

It seems Platform PEIM not loaded correctly.

Check QemuFwCfg related code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
RETURN_STATUS
EFIAPI
QemuFwCfgInitialize (
VOID
)
{
UINT32 Signature;
UINT32 Revision;

//
// Enable the access routines while probing to see if it is supported.
// For probing we always use the IO Port (IoReadFifo8()) access method.
//
mQemuFwCfgSupported = TRUE;
mQemuFwCfgDmaSupported = FALSE;

QemuFwCfgSelectItem (QemuFwCfgItemSignature);
Signature = QemuFwCfgRead32 ();
DEBUG ((DEBUG_INFO, "FW CFG Signature: 0x%x\n", Signature));
QemuFwCfgSelectItem (QemuFwCfgItemInterfaceVersion);
Revision = QemuFwCfgRead32 ();
DEBUG ((DEBUG_INFO, "FW CFG Revision: 0x%x\n", Revision));
if ((Signature != SIGNATURE_32 ('Q', 'E', 'M', 'U')) ||
(Revision < 1)
) {
DEBUG ((DEBUG_INFO, "QemuFwCfg interface not supported.\n"));
mQemuFwCfgSupported = FALSE;
return RETURN_SUCCESS;
}

if ((Revision & FW_CFG_F_DMA) == 0) {
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (IO Port) is supported.\n"));
} else {
mQemuFwCfgDmaSupported = TRUE;
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (DMA) is supported.\n"));
}

if (mQemuFwCfgDmaSupported && MemEncryptSevIsEnabled ()) {
EFI_STATUS Status;

//
// IoMmuDxe driver must have installed the IOMMU protocol. If we are not
// able to locate the protocol then something must have gone wrong.
//
Status = gBS->LocateProtocol (&gEdkiiIoMmuProtocolGuid, NULL,
(VOID **)&mIoMmuProtocol);
if (EFI_ERROR (Status)) {
DEBUG ((DEBUG_ERROR,
"QemuFwCfgSevDma %a:%a Failed to locate IOMMU protocol.\n",
gEfiCallerBaseName, __FUNCTION__));
ASSERT (FALSE);
CpuDeadLoop ();
}
}

return RETURN_SUCCESS;
}

because FW CFG Revision: 0x3 is supported, according to the code, Revision is > 1 so QemuFwCfg interface is supported, but from next part 0x03 & FW_CFG_F_DMA which is BIT1 0x01 is not 0 so actually QemuFwCfg interface (DMA) is supported. is expected.

1
2
3
4
5
6
if ((Revision & FW_CFG_F_DMA) == 0) {
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (IO Port) is supported.\n"));
} else {
mQemuFwCfgDmaSupported = TRUE;
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (DMA) is supported.\n"));
}

so that means the boot is not correctly performed. But next boot still triggered so we should check how the boot can be triggered before we dig out the truth.

According to code,we can know edk2-ovmf is try to load PlatformPei.efi

1
2
Loading PEIM 222C386D-5ABC-4FB4-B124-FBB82488ACF4^M
Loading PEIM at 0x00000834A40 EntryPoint=0x0000083A751 PlatformPei.efi^M

use the guid 222C386D-5ABC-4FB4-B124-FBB82488ACF4 we can easily get definitions in PlatformPei.inf

1
2
3
4
5
6
7
[Defines]
INF_VERSION = 0x00010005
BASE_NAME = PlatformPei
FILE_GUID = 222c386d-5abc-4fb4-b124-fbb82488acf4
MODULE_TYPE = PEIM
VERSION_STRING = 1.0
ENTRY_POINT = InitializePlatform

which ENTRY_POINT is InitializePlatform, but before entry LibraryClasses should be initialized first:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
[LibraryClasses]
BaseLib
CacheMaintenanceLib
DebugLib
HobLib
IoLib
PciLib
ResourcePublicationLib
PeiServicesLib
PeiServicesTablePointerLib
PeimEntryPoint
QemuFwCfgLib
QemuFwCfgS3Lib
QemuFwCfgSimpleParserLib
MtrrLib
MemEncryptSevLib
PcdLib

add logs code execution details.

following log shows the trace when edk2 fall into infinite loop:

1
2
3
4
5
6
7
8
9
10
11
12
13
Loading PEIM 222C386D-5ABC-4FB4-B124-FBB82488ACF4^M
Loading PEIM at 0x00000834A40 EntryPoint=0x0000083A76B PlatformPei.efi^M
Select Item: 0x0^M
FW CFG Signature: 0x554D4551^M
Select Item: 0x1^M
FW CFG Revision: 0x3^M
=========== debug entry point ===========^M
check signature match QEMU result: 0^M
check revision < 1 result: 0^M
check supported result: 2^M
=========== enter InternalMemEncryptSevStatus =========== ^M
=========== ReadSevMsr = true =========== ^M
SecCoreStartupWithStack(0xFFFCC000, 0x820000)^M

SecCoreStartupWithStack is first log when guest boot, and we end with ReadSevMsr = true but nothing after that.

Before we could see PlatformPei.efi is loading, so refer to PlatformPei.efi first. But the QemuFwCfgLib seems not successfully initialized, because the log end up during its load procedure not show Platform PEIM Loaded

1
2
3
4
5
6
7
8
EFI_STATUS
EFIAPI
InitializePlatform (
IN EFI_PEI_FILE_HANDLE FileHandle,
IN CONST EFI_PEI_SERVICES **PeiServices
)
{
DEBUG ((DEBUG_INFO, "Platform PEIM Loaded\n"));

And refer to OvmfPkgX64.dsc describes c lib every procedure should use:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[LibraryClasses.common.PEIM]
HobLib|MdePkg/Library/PeiHobLib/PeiHobLib.inf
PeiServicesTablePointerLib|MdePkg/Library/PeiServicesTablePointerLibIdt/PeiServicesTablePointerLibIdt.inf
PeiServicesLib|MdePkg/Library/PeiServicesLib/PeiServicesLib.inf
MemoryAllocationLib|MdePkg/Library/PeiMemoryAllocationLib/PeiMemoryAllocationLib.inf
PeimEntryPoint|MdePkg/Library/PeimEntryPoint/PeimEntryPoint.inf
ReportStatusCodeLib|MdeModulePkg/Library/PeiReportStatusCodeLib/PeiReportStatusCodeLib.inf
OemHookStatusCodeLib|MdeModulePkg/Library/OemHookStatusCodeLibNull/OemHookStatusCodeLibNull.inf
PeCoffGetEntryPointLib|MdePkg/Library/BasePeCoffGetEntryPointLib/BasePeCoffGetEntryPointLib.inf
!ifdef $(DEBUG_ON_SERIAL_PORT)
DebugLib|MdePkg/Library/BaseDebugLibSerialPort/BaseDebugLibSerialPort.inf
!else
DebugLib|OvmfPkg/Library/PlatformDebugLibIoPort/PlatformDebugLibIoPort.inf
!endif
PeCoffLib|MdePkg/Library/BasePeCoffLib/BasePeCoffLib.inf
ResourcePublicationLib|MdePkg/Library/PeiResourcePublicationLib/PeiResourcePublicationLib.inf
ExtractGuidedSectionLib|MdePkg/Library/PeiExtractGuidedSectionLib/PeiExtractGuidedSectionLib.inf
!if $(SOURCE_DEBUG_ENABLE) == TRUE
DebugAgentLib|SourceLevelDebugPkg/Library/DebugAgent/SecPeiDebugAgentLib.inf
!endif
CpuExceptionHandlerLib|UefiCpuPkg/Library/CpuExceptionHandlerLib/PeiCpuExceptionHandlerLib.inf
MpInitLib|UefiCpuPkg/Library/MpInitLib/PeiMpInitLib.inf
QemuFwCfgS3Lib|OvmfPkg/Library/QemuFwCfgS3Lib/PeiQemuFwCfgS3LibFwCfg.inf
PcdLib|MdePkg/Library/PeiPcdLib/PeiPcdLib.inf
QemuFwCfgLib|OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPeiLib.inf

take eyes on QemuFwCfgLib|OvmfPkg/Library/QemuFwCfgLib/QemuFwCfgPeiLib.inf

so check more from QemuFwCfgPei.c , following shows debug log added version

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
BOOLEAN
EFIAPI
QemuFwCfgIsAvailable (
VOID
)
{
return InternalQemuFwCfgIsAvailable ();
}


RETURN_STATUS
EFIAPI
QemuFwCfgInitialize (
VOID
)
{
UINT32 Signature;
UINT32 Revision;

//
// Enable the access routines while probing to see if it is supported.
// For probing we always use the IO Port (IoReadFifo8()) access method.
//
mQemuFwCfgSupported = TRUE;
mQemuFwCfgDmaSupported = FALSE;

QemuFwCfgSelectItem (QemuFwCfgItemSignature);
Signature = QemuFwCfgRead32 ();
DEBUG ((DEBUG_INFO, "FW CFG Signature: 0x%x\n", Signature));
QemuFwCfgSelectItem (QemuFwCfgItemInterfaceVersion);
Revision = QemuFwCfgRead32 ();
DEBUG ((DEBUG_INFO, "FW CFG Revision: 0x%x\n", Revision));
DEBUG ((DEBUG_INFO, "=========== debug entry point ===========\n"));

DEBUG ((DEBUG_INFO, "check signature match QEMU result: %d\n", Signature != SIGNATURE_32 ('Q', 'E', 'M', 'U')));
DEBUG ((DEBUG_INFO, "check revision < 1 result: %d\n", Revision < 1));
if ((Signature != SIGNATURE_32 ('Q', 'E', 'M', 'U')) ||
(Revision < 1)
) {
DEBUG ((DEBUG_INFO, "QemuFwCfg interface not supported.\n"));
mQemuFwCfgSupported = FALSE;
return RETURN_SUCCESS;
}

DEBUG ((DEBUG_INFO, "check supported result: %d\n", (Revision & FW_CFG_F_DMA)));
if ((Revision & FW_CFG_F_DMA) == 0) {
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (IO Port) is supported.\n"));
} else {
//
// If SEV is enabled then we do not support DMA operations in PEI phase.
// This is mainly because DMA in SEV guest requires using bounce buffer
// (which need to allocate dynamic memory and allocating a PAGE size'd
// buffer can be challenge in PEI phase)
//
if (MemEncryptSevIsEnabled ()) {
DEBUG ((DEBUG_INFO, "SEV: QemuFwCfg fallback to IO Port interface.\n"));
} else {
mQemuFwCfgDmaSupported = TRUE;
DEBUG ((DEBUG_INFO, "QemuFwCfg interface (DMA) is supported.\n"));
}
}

the code enter the MemEncryptSevIsEnabled () and hang on next function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
STATIC
VOID
EFIAPI
InternalMemEncryptSevStatus (
VOID
)
{
UINT32 RegEax;
MSR_SEV_STATUS_REGISTER Msr;
CPUID_MEMORY_ENCRYPTION_INFO_EAX Eax;
BOOLEAN ReadSevMsr;
SEC_SEV_ES_WORK_AREA *SevEsWorkArea;

ReadSevMsr = FALSE;

DEBUG ((DEBUG_INFO, "=========== enter InternalMemEncryptSevStatus =========== \n"));
SevEsWorkArea = (SEC_SEV_ES_WORK_AREA *) FixedPcdGet32 (PcdSevEsWorkAreaBase);
if (SevEsWorkArea != NULL && SevEsWorkArea->EncryptionMask != 0) {
//
// The MSR has been read before, so it is safe to read it again and avoid
// having to validate the CPUID information.
//
DEBUG ((DEBUG_INFO, "=========== ReadSevMsr = true =========== \n"));
ReadSevMsr = TRUE;
} else {
//
// Check if memory encryption leaf exist
//
DEBUG ((DEBUG_INFO, "=========== AsmCpuid =========== \n"));
AsmCpuid (CPUID_EXTENDED_FUNCTION, &RegEax, NULL, NULL, NULL);
if (RegEax >= CPUID_MEMORY_ENCRYPTION_INFO) {
//
// CPUID Fn8000_001F[EAX] Bit 1 (Sev supported)
//
AsmCpuid (CPUID_MEMORY_ENCRYPTION_INFO, &Eax.Uint32, NULL, NULL, NULL);

if (Eax.Bits.SevBit) {
ReadSevMsr = TRUE;
}
}
}

if (ReadSevMsr) {
//
// Check MSR_0xC0010131 Bit 0 (Sev Enabled)
//
Msr.Uint32 = AsmReadMsr32 (MSR_SEV_STATUS);
DEBUG ((DEBUG_INFO, "=========== AsmReadMsr32 =========== \n"));
if (Msr.Bits.SevBit) {
mSevStatus = TRUE;
}

//
// Check MSR_0xC0010131 Bit 1 (Sev-Es Enabled)
//
if (Msr.Bits.SevEsBit) {
mSevEsStatus = TRUE;
}
}

DEBUG ((DEBUG_INFO, "=========== out InternalMemEncryptSevStatus =========== \n"));
mSevStatusChecked = TRUE;
}

Finally AsmReadMsr32 is the victim to blame.

note: SEC_SEV_ES_WORK_AREA is a new AREA added by amd used for their SEV feature. Normally guest without SEV feature should not modify those memory area, but it seems windows write randomly bits which caused this problem.

From edk2 groups

Same issue is found https://edk2.groups.io/g/devel/topic/87301748#84086

According to the mail:

1
2
3
4
5
Tested on Intel Platform, It is like 'SEV-ES work area' can be modified by
os(Windows etc), and will not restored on reboot, the
SevEsWorkArea->EncryptionMask may have a random value after reboot. then it
may casue fail on reboot. The msr bits already cached by mSevStatusChecked,
there is no need to try cache again in PEI phase.

it seems Windows will change SEV-ES work area which will lead QemuFwCfgLib to readMsr when guest reboot, but actually for guests, normally SEV-ES is not used, so initializing failure cause the reboot infinite loop.

Patches fix the reboot issue

check file change log

1
git log OvmfPkg/Library/BaseMemEncryptSevLib/PeiMemEncryptSevLibInternal.c

we can find related patches:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
commit d9822304ce0075b1075edf93cc6e2514685b5212
Author: Brijesh Singh <brijesh.singh@amd.com>
Date: Thu Dec 9 11:27:37 2021 +0800

OvmfPkg/MemEncryptSevLib: add MemEncryptSevSnpEnabled()

BZ: https://bugzilla.tianocore.org/show_bug.cgi?id=3275

Create a function that can be used to determine if VM is running as an
SEV-SNP guest.

Cc: Michael Roth <michael.roth@amd.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Min Xu <min.m.xu@intel.com>
Cc: Jiewen Yao <jiewen.yao@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jiewen Yao <Jiewen.yao@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>

commit ac0a286f4d747a4c6c603a7b225917293cbe1e9f
Author: Michael Kubacki <michael.kubacki@microsoft.com>
Date: Sun Dec 5 14:54:09 2021 -0800

OvmfPkg: Apply uncrustify changes

REF: https://bugzilla.tianocore.org/show_bug.cgi?id=3737

Apply uncrustify changes to .c/.h files in the OvmfPkg package

Cc: Andrew Fish <afish@apple.com>
Cc: Leif Lindholm <leif@nuviainc.com>
Cc: Michael D Kinney <michael.d.kinney@intel.com>
Signed-off-by: Michael Kubacki <michael.kubacki@microsoft.com>
Reviewed-by: Andrew Fish <afish@apple.com>

UEFI boot procedure

Before debugging on edk2 code, check UEFI boot procedure to know more about UEFI boot.

According to https://edk2-docs.gitbook.io/edk-ii-build-specification/2_design_discussion/23_boot_sequence

PI compliant system firmware must support the six phases: security (SEC), pre-efi initialization (PEI), driver execution environment (DXE), boot device selection (BDS), run time (RT) services and After Life (transition from the OS back to the firmware) of system. Refer to Figure below

Our issue occours at PEI so check the first two steps:

Security(SEC)

The Security (SEC) phase is the first phase in the PI Architecture and is responsible for the following:

  • Handling all platform restart events
  • Creating a temporary memory store
  • Serving as the root of trust in the system
  • Passing handoff information to the PEI Foundation

The security section may contain modules with code written in assembly. Therefore, some EDK II module development environment (MDE) modules may contain assembly code. Where this occurs, both Windows and GCC versions of assembly code are provided in different files.

Pre-EFI Initialization (PEI)

The Pre-EFI Initialization (PEI) phase described in the PI Architecture specifications is invoked quite early in the boot flow. Specifically, after some preliminary processing in the Security (SEC) phase, any machine restart event will invoke the PEI phase.

The PEI phase initially operates with the platform in a nascent state, leveraging only on-processor resources, such as the processor cache as a call stack, to dispatch Pre-EFI Initialization Modules (PEIMs). These PEIMs are responsible for the following:

  • Initializing some permanent memory complement
  • Describing the memory in Hand-Off Blocks (HOBs)
  • Describing the firmware volume locations in HOBs
  • Passing control into the Driver Execution Environment (DXE) phase

Go through KVM code due to a tdp_page_fault

What happened?

Our CI/CD system ran integration test for every pull request but suddenly it met performance issue. Usually one round of integration test need 1h but this time almost all test do not finished after 1h 20min.

After check the codebase and test on lastest release stable branch, its more likely that the system met performance issue.

Before starting trip of “dig out the root cause”, check big picture of this CI/CD system architecture.

Prepare from perf

Because integration test runs on virtual machine memory, check hypervisor’s performance might gave more details. So use perf to collect run time data for analysis.

1
2
3
4
5
git clone https://github.com/brendangregg/FlameGraph  # or download it from github
cd FlameGraph
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf.svg

Check the flame graph

Start from tdp_page_fault

Abviously cpu spend lots of time to handle tdp_page_fault

find definition from linux/arch/x86/kvm/mmu.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *context = &vcpu->arch.mmu;

context->base_role.word = 0;
context->base_role.guest_mode = is_guest_mode(vcpu);
context->base_role.smm = is_smm(vcpu);
context->base_role.ad_disabled = (shadow_accessed_mask == 0);
context->page_fault = tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = nonpaging_invlpg;
context->update_pte = nonpaging_update_pte;
context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
context->root_hpa = INVALID_PAGE;
context->direct_map = true;
context->set_cr3 = kvm_x86_ops->set_tdp_cr3;
context->get_cr3 = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;

kvm_vcpu page_fault point to tdp_page_fault when mmu field of kvm_vcpu is initializing.

from kvm vcpu setup arch/x86/kvm/mmu.c

1
2
3
4
5
6
7
8
9
static void init_kvm_mmu(struct kvm_vcpu *vcpu)
{
if (mmu_is_nested(vcpu))
init_kvm_nested_mmu(vcpu);
else if (tdp_enabled)
init_kvm_tdp_mmu(vcpu);
else
init_kvm_softmmu(vcpu);
}

from kvm vcpu setup arch/x86/kvm/mmu.c

1
2
3
4
5
6
void kvm_mmu_setup(struct kvm_vcpu *vcpu)
{
MMU_WARN_ON(VALID_PAGE(vcpu->arch.mmu.root_hpa));

init_kvm_mmu(vcpu);
}

from kvm vcpu setup arch/x86/kvm/x86.c

1
2
3
4
5
6
7
8
9
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
kvm_vcpu_mtrr_init(vcpu);
vcpu_load(vcpu);
kvm_vcpu_reset(vcpu, false);
kvm_mmu_setup(vcpu);
vcpu_put(vcpu);
return 0;
}

vcpu is created by

kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)

check mmu in vcpu structure:

1
2
3
4
5
6
7
8
/*
* Paging state of the vcpu
*
* If the vcpu runs in guest mode with two level paging this still saves
* the paging mode of the l1 guest. This context is always used to
* handle faults.
*/
struct kvm_mmu mmu;

Find more by pf_interception

combine to flage graph, pf_interception is before tdp_page_fault,

1
2
3
4
5
6
7
8
9
10
static int pf_interception(struct vcpu_svm *svm)
{
u64 fault_address = __sme_clr(svm->vmcb->control.exit_info_2);
u64 error_code = svm->vmcb->control.exit_info_1;

return kvm_handle_page_fault(&svm->vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
svm->vmcb->control.insn_len);
}

refer to its usage:

1
[SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception

SVM_EXIT_EXCP_BASE is related to AMD CPU virtualization, PF_VECTOR means page_frame vector, used by page fault.

more details about the PF_VECTOR in arch/x86/kvm/svm.c

1
2
3
4
5
6
7
8
9
10
11
if (npt_enabled) {
/* Setup VMCB for Nested Paging */
control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
clr_intercept(svm, INTERCEPT_INVLPG);
clr_exception_intercept(svm, PF_VECTOR);
clr_cr_intercept(svm, INTERCEPT_CR3_READ);
clr_cr_intercept(svm, INTERCEPT_CR3_WRITE);
save->g_pat = svm->vcpu.arch.pat;
save->cr3 = 0;
save->cr4 = 0;
}

if AMD CPU’s npt not enabled, PF_VECTOR will be used to intercept page fault.

So just quickly go through guest virtual address translation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
/*
* Fetch a guest pte for a guest virtual address
*/
static int FNAME(walk_addr_generic)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gva_t addr, u32 access)
...
...
...
error:
errcode |= write_fault | user_fault;
if (fetch_fault && (mmu->nx ||
kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
errcode |= PFERR_FETCH_MASK;

walker->fault.vector = PF_VECTOR;
walker->fault.error_code_valid = true;
walker->fault.error_code = errcode;

error is defined to raise PF_VECTOR when failed to find any PTE(Page table entry)

Than let’s find the next method handle_exit() from svm.c

1
static int handle_exit(struct kvm_vcpu *vcpu)

following shows more details

1
2
3
struct vcpu_svm *svm = to_svm(vcpu);
struct kvm_run *kvm_run = vcpu->run;
u32 exit_code = svm->vmcb->control.exit_code;

the vcpu structure will be changed to vcpu_svm and than get the exit_code from it.

1
2
3
4
5
6
7
8
9
10
11
12
13
trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM);

if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE))
vcpu->arch.cr0 = svm->vmcb->save.cr0;
if (npt_enabled)
vcpu->arch.cr3 = svm->vmcb->save.cr3;

if (unlikely(svm->nested.exit_required)) {
nested_svm_vmexit(svm);
svm->nested.exit_required = false;

return 1;
}

than the exit_code will be traced.

CR0 has various control flags that modify the basic operation of the processor. See more: https://en.wikipedia.org/wiki/Control_register#CR0

if npt_enabled(CPU enable npt) vcpu will use vmcb saved cr3

vmcb: Intel VT-x name it as vmcs(virtual machine control structure), AMD name it as vmcb(virtual machine control block)

vmcb_control_area and vmcb_save_area combined as virtual machine control block.

note: need more research

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
if (is_guest_mode(vcpu)) {
int vmexit;

trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code,
svm->vmcb->control.exit_info_1,
svm->vmcb->control.exit_info_2,
svm->vmcb->control.exit_int_info,
svm->vmcb->control.exit_int_info_err,
KVM_ISA_SVM);

vmexit = nested_svm_exit_special(svm);

if (vmexit == NESTED_EXIT_CONTINUE)
vmexit = nested_svm_exit_handled(svm);

if (vmexit == NESTED_EXIT_DONE)
return 1;
}

if vm is nested exit, handle nested exit next step, interrupts will be queued and if vm exit due to SVM_EXIT_ERR exit this thread.

1
2
3
4
5
6
7
8
9
10
svm_complete_interrupts(svm);

if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
kvm_run->fail_entry.hardware_entry_failure_reason
= svm->vmcb->control.exit_code;
pr_err("KVM: FAILED VMRUN WITH VMCB:\n");
dump_vmcb(vcpu);
return 0;
}

last, check if the error code is external interrupt and not kernel handable error

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH &&
exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI)
printk(KERN_ERR "%s: unexpected exit_int_info 0x%x "
"exit_code 0x%x\n",
__func__, svm->vmcb->control.exit_int_info,
exit_code);

if (exit_code >= ARRAY_SIZE(svm_exit_handlers)
|| !svm_exit_handlers[exit_code]) {
WARN_ONCE(1, "svm: unexpected exit reason 0x%x\n", exit_code);
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}

return svm_exit_handlers[exit_code](svm);

finally invoke svm exit handler

1
return svm_exit_handlers[exit_code](svm);

Python ElementTree notes

Develop with libvrt python API, xml parse and operation is frequently required. ElementTree (stantard python library) is introduced in python-xml-parse come into used for the sake of simplify xml configuration lifecycle handling.

This blog will go throught xml.etree.ElementTree combine with typical situations which is use as learning notes.

First, start with some basic concepts

The Element type is a flexible container object, designed to store hierarchical data structures in memory. The type can be described as a cross between a list and a dictionary.

Each element has a number of properties associated with it:

  • a tag which is a string identifying what kind of data this element represents (the element type, in other words).
  • a number of attributes, stored in a Python dictionary.
  • a text string.
  • an optional tail string.
  • a number of child elements, stored in a Python sequence

use following XML as sample data:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>

load xml from file:

1
2
3
4
5
6
7
8
Python 2.7.5 (default, Aug  4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test_data.xml')
>>> root = tree.getroot()
>>> root
<Element 'data' at 0x7f58bd8232d0>

or load xml from string:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
>>> test_data_str = '''<?xml version="1.0"?>
... <data>
... <country name="Liechtenstein">
... <rank>1</rank>
... <year>2008</year>
... <gdppc>141100</gdppc>
... <neighbor name="Austria" direction="E"/>
... <neighbor name="Switzerland" direction="W"/>
... </country>
... <country name="Singapore">
... <rank>4</rank>
... <year>2011</year>
... <gdppc>59900</gdppc>
... <neighbor name="Malaysia" direction="N"/>
... </country>
... <country name="Panama">
... <rank>68</rank>
... <year>2011</year>
... <gdppc>13600</gdppc>
... <neighbor name="Costa Rica" direction="W"/>
... <neighbor name="Colombia" direction="E"/>
... </country>
... </data>'''
>>> ET.fromstring(test_data_str)
<Element 'data' at 0x7f58bd823a10>
>>> root = ET.fromstring(test_data_str)
>>> root
<Element 'data' at 0x7f58bd823f10>

As an element, use dir to check whats inside element we just loaded:

1
2
>>> dir(root)
['__class__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_children', 'append', 'attrib', 'clear', 'copy', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']

as we see, operations is listed and think about some typical user case.

find attributes

Before finding attributes check what attributes the xml has.

For root node, only has a tag but no attribute is set. So use tag and attrib to check this before find attributes.

1
2
3
4
>>> root.tag
'data'
>>> root.attrib
{}

return value is what we expected. Get deeper, a country tag with name attribute is used.

iterate can be used to get child tag of root.

1
2
3
4
5
6
>>> for child in root:
... child
...
<Element 'country' at 0x7f58bd823f50>
<Element 'country' at 0x7f58bd825110>
<Element 'country' at 0x7f58bd825250>

or use index to find tag element directly:

1
2
>>> root[0]
<Element 'country' at 0x7f58bd823f50>

for more duplicate case, those method became hard to use, so use iter or findall:

1
2
3
4
5
6
7
8
>>> for neighbor in root.iter('neighbor'):
... print neighbor.attrib
...
{'direction': 'E', 'name': 'Austria'}
{'direction': 'W', 'name': 'Switzerland'}
{'direction': 'N', 'name': 'Malaysia'}
{'direction': 'W', 'name': 'Costa Rica'}
{'direction': 'E', 'name': 'Colombia'}

all tags match neighbor is listed.

1
2
3
4
5
6
7
8
>>> for country in root.findall('country'):
... rank = country.find('rank').text
... name = country.get('name')
... print name, rank
...
Liechtenstein 1
Singapore 4
Panama 68

use find all, all tag with name country is found and its rank text and attribute name is listed.

change the parameters for test, change findall target, test if tag not matched what will happend:

1
2
3
>>> for country in root.findall('test'):
... print country
...

when use find instead of findall

1
2
3
4
5
6
7
8
>>> for tag in root.find('country'):
... print tag
...
<Element 'rank' at 0x7f58bd823f90>
<Element 'year' at 0x7f58bd823fd0>
<Element 'gdppc' at 0x7f58bd825050>
<Element 'neighbor' at 0x7f58bd825090>
<Element 'neighbor' at 0x7f58bd8250d0>

only first matched result is returned.

if find for a unexists tag None will be returned.

1
2
>>> print root.find('test')
None

so in most cases, find and findall seems meet all the require for finding a specific tag.

use tag

get attribute of tag:

1
2
>>> root.find('country').get('name')
'Liechtenstein'

get text inside tag:

1
2
>>> root.find('country').text
'\n '
1
2
>>> root.find('country').find('year').text
'2008'

list all children

1
2
>>> root.getchildren()
[<Element 'country' at 0x7f58bd825bd0>, <Element 'country' at 0x7f58bd823f10>, <Element 'country' at 0x7f58bd8239d0>

insert tag

create new element from string:

1
2
3
4
5
6
7
8
9
10
>>> new_element_str='''    <country name="China">
... <rank>2</rank>
... <year>2022</year>
... <neighbor name="Japan" direction="E"/>
... </country>'''
(reverse-i-search)`lo': {'name': 'Colombia', 'direction': 'E'}
KeyboardInterrupt
>>> {'name': 'Colombia', 'direction': 'E'}
{'direction': 'E', 'name': 'Colombia'}
>>> new = ET.fromstring(new_element_str)

check origin element tree:

1
2
>>> ET.tostring(root)
'<data>\n <country name="Liechtenstein">\n <rank>1</rank>\n <year>2008</year>\n <gdppc>141100</gdppc>\n <neighbor direction="E" name="Austria" />\n <neighbor direction="W" name="Switzerland" />\n </country>\n <country name="Singapore">\n <rank>4</rank>\n <year>2011</year>\n <gdppc>59900</gdppc>\n <neighbor direction="N" name="Malaysia" />\n </country>\n <country name="Panama">\n <rank>68</rank>\n <year>2011</year>\n <gdppc>13600</gdppc>\n <neighbor direction="W" name="Costa Rica" />\n <neighbor direction="E" name="Colombia" />\n </country>\n</data>'

insert new element:

1
2
>>> ET.tostring(root)
'<data>\n <country name="China">\n <rank>2</rank>\n <year>2022</year>\n <neighbor direction="E" name="Japan" />\n </country><country name="Liechtenstein">\n <rank>1</rank>\n <year>2008</year>\n <gdppc>141100</gdppc>\n <neighbor direction="E" name="Austria" />\n <neighbor direction="W" name="Switzerland" />\n </country>\n <country name="Singapore">\n <rank>4</rank>\n <year>2011</year>\n <gdppc>59900</gdppc>\n <neighbor direction="N" name="Malaysia" />\n </country>\n <country name="Panama">\n <rank>68</rank>\n <year>2011</year>\n <gdppc>13600</gdppc>\n <neighbor direction="W" name="Costa Rica" />\n <neighbor direction="E" name="Colombia" />\n </country>\n</data>

confirm new element is added:

1
2
3
4
5
6
7
>>> for country in root.findall('country'):
... country.get('name')
...
'China'
'Liechtenstein'
'Singapore'
'Panama'

KVM introduction 00

See notation of virt/kvm/kvm_main.c in linux kernel

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
/*
* Kernel-based Virtual Machine driver for Linux
*
* This module enables machines with Intel VT-x extensions to run virtual
* machines without emulation or binary translation.
*
* Copyright (C) 2006 Qumranet, Inc.
* Copyright 2010 Red Hat, Inc. and/or its affiliates.
*
* Authors:
* Avi Kivity <avi@qumranet.com>
* Yaniv Kamay <yaniv@qumranet.com>
*
* This work is licensed under the terms of the GNU GPL, version 2. See
* the COPYING file in the top-level directory.
*
*

I got some questions

  • what means kernel-based
  • what is VT-x
  • emulation? binary traslation?
  • who is Avi Kivity
  • is there any user-mode hypervisor?

Kernel-based

Kernel-based Virtual Machine (KVM) is a virtualization module in the Linux kernel that allows the kernel to function as a hypervisor. It was merged into the mainline Linux kernel in version 2.6.20, which was released on February 5, 2007. [1]

its available under linux/virt

VT-x

1
2
This module enables machines with Intel VT-x extensions to run virtual
machines without emulation or binary translation.

According to the code notation, Intel VT-x extensions is metioned.

Previously codenamed “Vanderpool”, VT-x represents Intel’s technology for virtualization on the x86 platform. On November 13, 2005, Intel released two models of Pentium 4 (Model 662 and 672) as the first Intel processors to support VT-x. The CPU flag for VT-x capability is “vmx”; in Linux, this can be checked via /proc/cpuinfo, or in macOS via sysctl machdep.cpu.features.[2]

for example, on centos 7.6

1
2
[root@test ~]# cat /proc/cpuinfo | grep vmx | head -1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat

or on Intel CPU MacBook Pro (2020)

1
2
➜  ~ sysctl machdep.cpu.features | grep -i vmx
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

vmx is available

“VMX” stands for Virtual Machine Extensions, which adds 13 new instructions: VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, INVVPID, and VMFUNC.[21] These instructions permit entering and exiting a virtual execution mode where the guest OS perceives itself as running with full privilege (ring 0), but the host OS remains protected.[2]

note: virtual execution mode is a important concept

refer to paper kvm: the Linux Virtual Machine Monitor KVM is designed to add a guest mode, joining the existing kernel mode and user mode

In guest-mode CPU instruction executed natively but when I/O requests or signal(typically, network packets received or timeout), exit guest-mode is required and kvm than redirect those I/O or signal handling to user-mode process to emulation device and execute actual I/O. After I/O handling finished, KVM will enter guest mode to execute its CPU instructions again.

For kernel-mode handling exit and enter is basic task. And user-mode process calls kernel to enter guest-mode until it interrupt.

Emulation & Binary translation

In computing, binary translation is a form of binary recompilation where sequences of instructions are translated from a source instruction set to the target instruction set. In some cases such as instruction set simulation, the target instruction set may be the same as the source instruction set, providing testing and debugging features such as instruction trace, conditional breakpoints and hot spot detection.

The two main types are static and dynamic binary translation. Translation can be done in hardware (for example, by circuits in a CPU) or in software (e.g. run-time engines, static recompiler, emulators).[3]

Emulators mostly used to run softwares or applications on current OS where those softwares or applications are not support. For example, https://github.com/OpenEmu/OpenEmu a multiple video game system. This is advantage of emulators.

Disadvantage is that binary translation sometimes require instructino scan, if its used for CPU instruction translations, it spends more time than native instruction. More details in Translator-Internals will be talked in next blogs.

Avi Kivity

Mad C++ developer, proud grandfather of KVM. Now working on @ScyllaDB, an open source drop-in replacement for Cassandra that’s 10X faster. Hiring (remotes too).

from https://twitter.com/avikivity

Avi Kivity began the development of KVM in mid-2006 at Qumranet, a technology startup company that was acquired by Red Hat in 2008. KVM surfaced in October, 2006 and was merged into the Linux kernel mainline in kernel version 2.6.20, which was released on 5 February 2007.

KVM is maintained by Paolo Bonzini. [1]

Virtualization

For hardware assisted virtualiztion, VMM running below ring0 Guest OS, user application can directly execute user requests, and sensitive OS call trap to VMM without binary translation or Paravirtualization so overhead is decreased.

But for older full virtualization design

Guest OS runs on Ring1 and VMM runs on Ring0, without hardware assist, OS requests trap to VMM and after binary translation the instruction finally executed.

KVM Details

Memory map

From perspective of Linux guest OS. Physical memory is already prepared and virtual memory is allocated depend on the physical memory. When guest OS require a virtual address(GVA), Guest OS need to translate is to guest physical address(GPA), this obey the prinsiple of Linux, and tlb, page cache will be involved. And no difference with a Linux Guest running on a real server.

From perspective of host, start a Guest need to allocate a memory space as GPA space. So every GPA has a mapped host virtual address(HVA) and also a host physical address(HPA)

So typically, if a guest need to access a virtual memory address

GVA -> GPA -> HVA -> GPA

at least three times of translation is needed.

Nowadays, CPU offer EPT(Intel) or NPT(AMD) to accelerate GPA -> HVA translation. We will refer that in after blogs.

vMMU

MMU consists of

  • A radix tree ,the page table, encoding the virtual- to-physical translation. This tree is provided by system software on physical memory, but is rooted in a hardware register (the cr3 register)
  • A mechanism to notify system software of missing translations (page faults)
  • An on-chip cache(the translation lookaside buffer, or tlb) that accelerates lookups of the page table
  • Instructions for switching the translation root inorder to provide independent address spaces
  • Instructions for managing the tlb

As referred in Memory map GPA -> HVA should be offered by KVM.

If no hardware assist, use shadow table to maintain the map between GPA and HVA, the good point of shadow table is that runtime address translation overhead is decrease but the major problem is how to synchronize guest page table with shadow page table, when guest writes page table, the shadow page table need to be changed together, so virtual MMU need offer hooks to implement this.

Another question is context switch. Shadow page tables based on the fact that guest should sync its tlb with shadow page tables so that tlb management instruction will be trapped. But the most common tlb management instruction in context-switch is invalidates the entire tlb. So the shadow page tables need to be synced again. Causes bad performance when vm runs multi processes.

vMMU is implement in order to improve guest performance which caches all page tables during context switch. This means context swtich could find its cache from vMMU directly, invdalidates tlb has no influence on context-switch.

Achieving network wirespeed in an open standard manner: introducing vDPA

之前的文章里,我们讨论了现存的virtio-networking架构,包括基于内核的(vhost-net/virtio-net)以及基于用户态DPDK的(vhost-user/vhost-pmd),现在我们需要转移我们的注意力到一个目标是让virtio-networking架构给VM提供有线连接速度的架构

本文将会涵盖构成这个架构的数据面以及控制面组件。我们将会介绍SR-IOV技术,以及这个技术如何提升网络性能。还会介绍virtio的硬件方案以及vDPA(virtual data path acceleration)带来的巨大好处。最后通过比较这些virtio-networking架构来做一个总结。

本文主要是为了那些有兴趣想了解不同virtio-networking架构的(包括vDPA),但不那么深入细节的人。当然后面也会提供一个技术细节的分享以及一个实践教程。

Data plane and control plane for direct access to NIC

在之前的vhost-net/virtio-net和vhost-user/vhost-pmd架构里,网卡都是接入在OVS kernel或者OVS-DPDK里的,而virtio的后端接口则是从OVS的另外一个port出去的

为了提升网络性能,直接把网络连到guest里,和之前的virtio架构类似,我们拆分了网卡的控制面和数据面:

  1. 控制面。提供网卡和guest之前的配置修改和特性协商功能,用来建立和销毁数据面通道
  2. 数据面。用来在guest和网卡之间传输数据包。当直接把网卡连接到guest的时候,实际上是要求网卡要支持virtio ring layout的

这个架构如下图所示:

笔记:

  • 如果需要知道KVM,libvirt以及Qemu进程的额外信息,可以看前面的文章
  • 数据面直接从网卡到guest,实际上是通过guest提供一个网卡可访问的共享内存实现的,并且并不经过host kernel。这个意味着网卡和guest都需要使用完全一致的ring layout否则就需要做地址翻译,地址翻译意味着性能损耗
  • 控制面的实现则可能设计host kernel或者qemu进程,这个取决于具体的实现

SR-IOV for isolating VM traffic

在vhost-net/virtio-net和vhost-user/virto-pmd架构里,我们是用了软件交换机(OVS)可以让一个网卡对接到物理端口上,然后分发数据包到不同的虚拟机的端口上。

把网卡挂在虚拟机上最简单的方法就是硬件透传,也就是直接把一个网卡提供给guest kernel的驱动。

问题是我们需要在服务器上有一个单独的通过PIC暴露的物理网卡,接下来的问题就是我们如何在物理网卡上创建“虚拟端口”?

SR-IOV(Single root I/O virtualization)是一种PCI设备规范,允许共享一个物理设备给多个虚拟机。换言之,这个功能允许不同的虚拟环境里的虚拟机共享一个网卡。这意味着我们能够拥有一个类似把一个物理网卡拆分为多个以太网接口的功能,帮我们解决了上面提到的“虚拟端口”的创建问题。

SR-IOV有两个主要功能

  1. Physical Functions,即PCI设备的完整功能,包括发现,管理和配置功能。每个网卡都有一个对应的PF能提供整个网卡设备的配置
  2. Virtual Functions,是单个PCI功能,可以控制设备的一部分,并且是PF的子集。同一个网卡上能有多个VF

我们需要在网卡上配置VF,PF,VF相当于是虚拟接口,PF相当于是网络接口,举个例子,我们有一个10GB网卡有一个外部接口和8个VF。那么这个外部端口的速度以及双工是取决于PF的配置而频率限制则是VF的设置

hypervisor提供了映射virtual function到虚拟机的功能,每个VF都可以被映射到一个VM(一个VM可以同时有多个VF)

然后来看看SR-IOV是如何映射到guest kernel,用户态DPDK或者是直接到host kernel的吧

  1. OVS和SR-IOV: 我们使用SR-IOV给OVS提供多个物理面端口,比如配置多个单独的mac地址,虽然我们只有一个物理网卡,但可以通过VF实现。并且给每分配一段内核内存到特定到VF(每个VF都有)
  2. OVS DPDK和SR-IOV:跳过物理机内核,通过SR-IOV直接从物理网卡到OVS-DPDK。映射host用户态内存给网卡的VF
  3. SR-IOV + guests:映射guest内存到网卡,跳过所有物理机环节。注意:使用设备透传,ring layout在物理网卡和guest之间是共享的,因此特定网卡才能被使用,因为这个逻辑一定是网卡厂商提供的。

注意:当然还有不是很常见的第四个方案,就是透传设备给guest里的用户态DPDK应用。

SR-IOV for mapping NIC to guest

重点说一下SR-IOV到guest的情况,这里存在一个问题就是在直接映射内存到网卡的场景下,如何更高效的发包收包。

我们有下面两个方法解决这个问题:

  1. 使用guest kerel驱动:这个方法就是使用网卡厂商提供的kernel驱动,即直接映射IO内存,这样的话硬件设备就能够直接访问guest kernel的内存了
  2. 在guest里使用DPDK-pmd驱动:这个方法,就是使用网卡厂商提供的DPDK pmd驱动,运行在guest的用户态,能够直接映射IO内存,因此硬件设备也能够直接访问用户态的特定进程

这一段我们重点看看DPDP pmd驱动的方案,整合起来就是下面这个图:

笔记:

  • The data plane accesses the VF directly and is vendor specific
  • For SR-IOV, the NIC vendor's drivers need to be installed on both the host and the guest
  • The host kernel driver and the guest PMD do not access each other directly; the PF/VF drivers are configured through other interfaces (libvirt, for example)
  • The vendor's VF PMD is responsible for configuring the NIC's VF, while the PF driver manages the physical NIC in the host kernel

To sum up, SR-IOV plus a DPDK PMD gives the guest very good network performance, but the approach is cumbersome: it is vendor-locked, the same vendor's driver must run in both guest and host, and only specific NICs can be used. A NIC hardware upgrade means upgrading the driver inside the VM as well; replacing the NIC with another vendor's card means installing a new PMD in the guest; and live migration requires an identical configuration on the destination host, i.e. the same NIC model and version, the same physical location, and migration support from the vendor.

So the problem we need to solve is how to get the SR-IOV performance boost through standard interfaces, ideally with standard drivers only, decoupling the driver question from the rest of the architecture.

The next two approaches were designed to solve exactly this.

Virtio full HW offloading

The first approach is virtio full hardware offloading: both the virtio control plane and the virtio data plane are moved into hardware. That is, the NIC (still exposing virtual interfaces through VFs) supports the virtio control-plane specification, including discovery, feature negotiation and establishing/terminating the data plane. The device also supports the virtio ring layout, so once memory has been mapped between the NIC and the guest, they can communicate directly.

In this approach the guest talks to the NIC directly over PCI, so no additional drivers are needed. However, it requires the NIC vendor to implement the virtio spec fully inside the device, including the control plane, which is normally implemented in software by the operating system and now has to be implemented by the NIC itself.

The hardware offloading architecture is shown below:

Notes:

  • In practice the control-plane operations are quite complex, mostly around the IOMMU and vIOMMU; the next article covers this
  • In practice the host kernel, the QEMU process and the guest kernel are all involved; the diagram is simplified
  • It is also possible to put the virtio data plane and control plane in the guest kernel instead of userspace (as in the SR-IOV case), i.e. use the virtio-net kernel driver to talk to the NIC instead of virtio-pmd

vDPA - standard data plane

Virtual data path acceleration (vDPA) is an approach that standardizes the NIC's SR-IOV data plane around the virtio ring layout and places a standard virtio driver in the guest, decoupling the performance improvement from any vendor implementation, while adding a generic control plane and software infrastructure to support it. Sitting as an abstraction layer on top of SR-IOV, it also lays the groundwork for future scalable IOV.

Similar to virtio full hardware offloading, the data plane is established directly between the NIC and the guest, both using the virtio ring layout. However, each NIC vendor may still provide its own driver, and a generic vDPA driver is added to the kernel to translate between the vendor's NIC driver/control plane and the virtio control plane.

vDPA is a much more flexible approach: compared with full hardware offloading, NIC vendors only need to support the virtio ring layout, at a much smaller cost, while still achieving the performance improvement.
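On recent kernels this generic layer is visible as the vdpa bus, which can be driven with the vdpa tool from iproute2; a hedged sketch (module and device names are illustrative and depend on the vendor driver):

modprobe vdpa
modprobe virtio_vdpa                     # or vhost_vdpa, depending on how the device will be consumed
vdpa mgmtdev show                        # management devices registered by vendor drivers
vdpa dev add name vdpa0 mgmtdev pci/0000:65:00.2
vdpa dev show vdpa0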

An example diagram:

Notes:

  • In practice the host kernel, the QEMU process and the guest kernel are all involved; the diagram is simplified
  • As with SR-IOV and virtio full hardware offloading, the data plane and control plane can also live in the guest kernel instead of userspace (with the same trade-offs mentioned before)

vDPA has the potential to become the definitive way to provide Ethernet interfaces to virtual machines:

  1. An open, public standard: anyone can use it and contribute to it without being locked to a specific vendor
  2. Great performance: close to SR-IOV, with no translation overhead in between
  3. Ready to support future hardware platform scalability technologies
  4. A standard, vendor-independent driver: drivers only need to be set up once, largely regardless of NIC model and version
  5. Transport protection: the guest uses a single standard interface, which is easy to discover from the host and makes switching/failover straightforward
  6. Live migration: enables live migration across different NICs and versions
  7. A standard acceleration interface for containers
  8. Bare metal: the standard virtio NIC driver can also serve as a bare-metal driver; with the vDPA software infrastructure in place, the same driver adapts to different NIC hardware

Comparing virtio architectures

To summarize what we have learned in this series so far, we have covered four architectures for providing Ethernet connectivity to a VM: vhost-net/virtio-net, vhost-user/virtio-pmd, virtio full HW offloading and vDPA.

Now let's compare their trade-offs:

Summary

In this article we gave an overview of four virtio-networking architectures for providing an Ethernet interface to a VM, from the slower (virtio-net) through the faster (vhost-user) to the fastest (virtio full HW offloading and vDPA).

We highlighted the advantages of vDPA and SR-IOV over the other approaches and compared all four. Next we will go deeper into virtio full hardware offloading and vDPA.

Linux memory management(1)

CPU access to memory

CPU core -> MMU(TLBs, Table Walk Unit) -> Caches -> Memory(Translation tables)

CPU issues VA -> MMU looks up the PTE (page table entry) -> TLB -> L1 cache -> L2 cache -> L3 cache

note: assume an architecture with the TLB between the CPU and the L1 cache.

The TLB is a cache of VA-to-PA translations, made up of PTE entries.

On a TLB miss, the translation is looked up in the page tables (whose entries may themselves be cached in L1 and the lower cache levels, or sit in memory) until the PA is found; the resulting PTE is then installed into the TLB.

what is TLB?

TLB definition from wiki: A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip’s memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

note:

  1. The TLB stores only recent translations, which means not every address translation entry is in the TLB; be aware of TLB misses.

  2. TLB may reside between the CPU and the CPU cache, between the CPU cache and primary storage memory, or between levels of a multi-level cache.

  3. On a miss in a virtually addressed cache, or with a physically addressed cache, the CPU always uses the TLB to translate the address before the data is found and stored into the cache.

  4. The TLB replacement policy is typically LRU or FIFO.

  5. The CPU has to access main memory for an instruction-cache miss, data-cache miss, or TLB miss, but compared with the other two, a TLB miss is by far the most expensive.

  6. Frequent TLB misses degrade performance, because each newly cached page displaces one that will soon be used again. This happens when the TLB, acting as a cache for the memory management unit (MMU) that translates virtual addresses to physical addresses, is too small for the working set of pages. TLB thrashing can occur even when instruction-cache or data-cache thrashing does not, because they cache at different granularities: instructions and data are cached in small blocks (cache lines), not entire pages, while address lookups are done at page granularity. So even if the code and data working sets fit into the caches, a working set fragmented across many pages may not fit into the TLB, causing TLB thrashing.

TLB-miss handling

Two schemes for handling TLB misses are commonly found in modern architectures:

  • With hardware TLB management, the CPU walks the page tables automatically. On x86, for example, it uses the page-table base in the CR3 register; if a valid entry exists it is brought into the TLB, the access is retried, and it hits. Otherwise a page fault exception is raised, which the operating system has to handle by establishing a correct mapping (paging in/out as needed) so the translation can be loaded into the TLB. Details of the hardware walker can change from CPU to CPU without breaking compatibility for programs.
  • With software-managed TLBs, a TLB miss generates a TLB miss exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The format of the TLB entry is defined as a part of the instruction set architecture (ISA).

note:

  1. With hardware TLB management, the hardware handles the lifecycle of TLB entries.
  2. With hardware TLB management, a page fault is raised that the OS must handle; the OS establishes the missing translation, and then the program resumes.
  3. With hardware TLB management, the maintenance of TLB entries is invisible to software.
  4. The hardware TLB mechanism can change from CPU to CPU without breaking program compatibility; the CPU obeys the same architectural rules, and the only exception the OS ever has to handle is the page fault.
  5. With software TLB management, a TLB miss exception is raised and the OS is responsible for walking the page tables and performing the translation in software; the OS then loads the entry into the TLB and restarts the program from the instruction that caused the miss (note: the instruction is restarted, not merely resumed).
  6. Comparing the two: with hardware management the CPU finds the translation itself and only raises a page fault (see note 2), whereas with software management the ISA must provide instructions to load entries into any TLB slot, and the TLB entry format is part of the ISA.

In most cases hardware TLB management is used, but according to Wikipedia some architectures use software-managed TLBs.

Typical TLB

These are typical performance levels of a TLB:

  • Size: 12 bits – 4,096 entries
  • Hit time: 0.5 – 1 clock cycle
  • Miss penalty: 10 – 100 clock cycles
  • Miss rate: 0.01 – 1% (20–40% for sparse/graph applications)

The average effective memory cycle rate is defined as m + (1-p)h + pm cycles, where m is the number of cycles required for a memory read, p is the miss rate, and h is the hit time in cycles. If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, a memory read takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 30 + 0.99 * 1 + 0.01 * 30 (31.29 clock cycles per memory access)
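To reproduce the arithmetic of the example above (same m, h and p as in the text):

awk 'BEGIN { m = 30; h = 1; p = 0.01; printf "%.2f cycles per access\n", m + (1 - p) * h + p * m }'
# prints 31.29 cycles per access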

note: research TLB performance further

use perf to measure TLB misses

1
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses -p $PID

If your system shows a high TLB miss rate, try huge pages: they reduce the number of TLB entries needed, which cuts the miss rate. But some applications are not suitable for huge pages, and more details need to be considered before adopting this solution.

Address-space switch

After a process context switch, some TLB entries' virtual-to-physical mappings become invalid. To clean up those stale entries, one of several strategies is required:

  1. flush all TLB entries on every context switch
  2. tag each entry with its process, so that context switches do not matter
  3. some architectures use a single-address-space operating system, where all processes share the same virtual-to-physical mapping
  4. some CPUs have a process-ID register, and the hardware uses a TLB entry only if it matches the current process ID (see the quick check after this list)
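On x86, whether strategies 2/4 are available shows up as the pcid and invpcid CPU flags; a minimal check (output depends on the host CPU):

grep -wo -e pcid -e invpcid /proc/cpuinfo | sort | uniq -c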

note:
flushing TLB is an important security mechanism for memory isolation. Memory isolation is especially critical during switches between the privileged operating system kernel process and the user processes – as was highlighted by the Meltdown security vulnerability[2]. Mitigation strategies such as kernel page-table isolation (KPTI) rely heavily on performance-impacting TLB flushes and benefit greatly from hardware-enabled selective TLB entry management such as PCID.

Virtualization and x86 TLB

With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and into ensuring better performance of virtual machines on x86 hardware.

EPT (Extended Page Tables) is required for this: the hardware walks both the guest and the host page tables, which removes the need for shadow page tables.

reference

  1. https://en.wikipedia.org/wiki/Translation_lookaside_buffer
  2. https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)

Hands on vhost-user: A warm welcome to DPDK

In this post we will set up an environment and run a DPDK-based application inside a virtual machine. We will go through all the steps required to configure a virtual switch on the host and connect the application in the VM to it, including how to create, install and run the VM and the application inside it. You will learn how to create and set up a simple virtual switch that sends packets between an application in the guest and the host, and, based on that setup, how to tune it for optimal throughput.

Setting up

For those who would like to use DPDK without configuring and installing everything by hand, we provide Ansible playbooks in a GitHub repo that automate all the steps; that is the setup we start from here.

Requirements:

  • A computer running a Linux distribution. This post uses CentOS 7, but the commands should not differ much between Linux distributions, in particular Red Hat Enterprise Linux 7
  • A user with sudo permissions
  • More than 25 GB of free space in the home directory
  • At least 8 GB of RAM

First we install the packages we need:

1
sudo yum install qemu-kvm libvirt-daemon-qemu libvirt-daemon-kvm libvirt virt-install libguestfs-tools-c kernel-tools dpdk dpdk-tools

Creating a VM

First download the latest CentOS-Cloud-Base image from the site below:

1
sudo wget -O /var/lib/libvirt/images/CentOS-7-x86_64-GenericCloud.qcow2 http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2

This downloads a preinstalled CentOS 7 image intended to run in an OpenStack environment. Since we are not using OpenStack we have to clean the image, but first we make a copy of it so we can reuse the original later:

1
sudo qemu-img create -f qcow2 -b  /var/lib/libvirt/images/CentOS-7-x86_64-GenericCloud.qcow2  /var/lib/libvirt/images/vhuser-test1.qcow2 20G

With the following setting we allow an unprivileged user to run libvirt commands (recommended):

1
export LIBVIRT_DEFAULT_URI="qemu:///system"

Then run the cleanup command:

1
sudo virt-sysprep --root-password password:changeme --uninstall cloud-init --selinux-relabel -a /var/lib/libvirt/images/vhuser-test1.qcow2 --network --install "dpdk,dpdk-tools,pciutils"

This command mounts the filesystem and automatically applies some basic configuration, so the image is ready to boot the VM.

We need a network to connect our VM to. Libvirt handles networks the same way it manages VMs: you define a network with an XML file and start or stop it from the command line.

As an example, we use the convenient network called 'default' that ships with libvirt. Define the 'default' network, start it and check that it is running:

1
2
3
4
5
6
7
8
9
10
[root@10-0-117-158 ~]# virsh net-define /usr/share/libvirt/networks/default.xml
Network default defined from /usr/share/libvirt/networks/default.xml

[root@10-0-117-158 ~]# virsh net-start default
Network default started

[root@10-0-117-158 ~]# virsh net-list
Name State Autostart Persistent
--------------------------------------------
default active no yes

Finally we use virt-install to create the VM. This command-line tool ships with a set of definitions for common operating systems, which we can then customize:

1
2
3
4
5
virt-install --import  --name vhuser-test1 --ram=4096 --vcpus=3 \
--nographics --accelerate \
--network network:default,model=virtio --mac 02:ca:fe:fa:ce:aa \
--debug --wait 0 --console pty \
--disk /var/lib/libvirt/images/vhuser-test1.qcow2,bus=virtio --os-variant centos7.0

These parameters specify the number of vCPUs, the amount of RAM, the disk path and the network the VM connects to.

Besides defining the VM with the parameters we specified, virt-install also starts it, so we should see it running:

1
2
3
4
[root@10-0-117-158 ~]# virsh list
Id Name State
------------------------------
7 vhuser-test1 running

Good, the VM is running. Next we shut it down and make some additional changes:

1
virsh shutdown vhuser-test1

Preparing the host

DPDK optimizes memory buffer allocation and management. On Linux this requires hugepage support, which must be enabled in the kernel. Pages larger than the usual 4K improve performance by using fewer pages and therefore fewer TLB lookups, which are needed to translate virtual addresses to physical addresses. To allocate hugepages at boot we add kernel parameters to the bootloader configuration:

1
sudo grubby --args="default_hugepagesz=1G hugepagesz=1G hugepages=6 iommu=pt intel_iommu=on" --update-kernel /boot/vmlinuz-3.10.0-957.27.2.el7.x86_64

Let's go through what these parameters do:

default_hugepagesz=1G: hugepages created at runtime default to a size of 1 GB

hugepagesz=1G: hugepages created at boot are 1 GB in size

hugepages=6: create 6 hugepages of 1 GB at boot time; after a reboot they show up in /proc/meminfo

Note that besides the hugepage settings we added two IOMMU-related parameters, iommu=pt intel_iommu=on. They enable Intel VT-d and IOMMU passthrough mode, which are needed for handling I/O from Linux userspace. Since we changed kernel parameters, this is a good moment to reboot.

Once the reboot has finished, we can check on the command line that the parameters are in effect:

1
2
[root@10-0-117-158 ~]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-957.27.2.el7.x86_64 root=/dev/mapper/zstack-root ro noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off crashkernel=auto rd.lvm.lv=zstack/root rd.lvm.lv=zstack/swap rhgb quiet LANG=en_US.UTF-8 default_hugepagesz=1G hugepagesz=1G hugepages=6 iommu=pt intel_iommu=on
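We can also confirm that the hugepage pool itself was allocated; the sample output below assumes the 6 x 1 GB pages configured above:

grep -i huge /proc/meminfo
# HugePages_Total:       6
# HugePages_Free:        6
# Hugepagesize:    1048576 kB
mount | grep hugetlbfs    # hugetlbfs must be mounted so DPDK can back its memory with these pages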

Prepare the guest

The virt-install command created and started a VM through libvirt. To connect the DPDK-based vswitch TestPMD to QEMU, we need to add the following definitions to the devices section of the VM's XML:

1
virsh edit vhuser-test1

Add to the <devices> section:

1
2
3
4
5
6
7
8
9
10
11
12
<interface type='vhostuser'>
<mac address='56:48:4f:53:54:01'/>
<source type='unix' path='/tmp/vhost-user1' mode='client'/>
<model type='virtio'/>
<driver name='vhost' rx_queue_size='256' />
</interface>
<interface type='vhostuser'>
<mac address='56:48:4f:53:54:02'/>
<source type='unix' path='/tmp/vhost-user2' mode='client'/>
<model type='virtio'/>
<driver name='vhost' rx_queue_size='256' />
</interface>

The other guest configuration that differs from the vhost-net setup is hugepages, so we add the following definition to the guest:

1
2
3
4
5
6
7
8
9
<memoryBacking>
<hugepages>
<page size='1048576' unit='KiB' nodeset='0'/>
</hugepages>
<locked/>
</memoryBacking>
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>

Now the memory is in place. One more change inside the guest configuration is needed, and it is a very important one: without it no packets can be sent or received:

1
2
3
4
5
6
 <cpu mode='host-passthrough' check='none'>
<topology sockets='1' cores='3' threads='1'/>
<numa>
<cell id='0' cpus='0-2' memory='3145728' unit='KiB' memAccess='shared'/>
</numa>
</cpu>

Then we need to boot our guest. Since we configured it to connect to the vhost-user UNIX sockets, we must make sure the sockets are available when the guest starts. This is achieved by starting testpmd, which creates the sockets we need:

1
2
3
4
5
sudo testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
--portmask=f -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io

Finally, since this experiment connects to the vhost-user unix sockets, QEMU has to run as root, so set user=root in /etc/libvirt/qemu.conf. This is only required for our particular test scenario and is usually not recommended in production; readers should remove the user=root setting after finishing this walkthrough.

Now we can start the virtual machine:

1
virsh start vhuser-test1.

After logging in as root, the first thing we do is bind the virtio devices to the vfio-pci driver. To be able to do that we need to load some kernel modules:

1
2
3
4
5
[root@localhost ~]# modprobe  vfio enable_unsafe_noiommu_mode=1
[ 90.462919] VFIO - User Level meta-driver version: 0.3
[root@localhost ~]# cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
Y
[root@localhost ~]# modprobe vfio-pci

Then find the PCI addresses of the virtio-net devices:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
[root@localhost ~]# dpdk-devbind --status net

Network devices using kernel driver
===================================
0000:00:02.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci,vfio-pci *Active*
0000:00:08.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=virtio_pci,vfio-pci
0000:00:09.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=virtio_pci,vfio-pci

No 'Crypto' devices detected
============================

No 'Eventdev' devices detected
==============================

No 'Mempool' devices detected
=============================

No 'Compress' devices detected
==============================

In the dpdk-devbind output, look for the virtio devices that are not marked Active; those are the ones we can use for the experiment. Note: the addresses may differ on your system. When the devices first come up they are automatically bound to the virtio-pci driver; since we want to use them with a non-kernel driver, we first unbind them from virtio-pci and then bind them to vfio-pci:

1
2
3
4
5
[root@localhost ~]# dpdk-devbind -b vfio-pci 0000:00:08.0 0000:00:09.0
[ 360.862724] iommu: Adding device 0000:00:08.0 to group 0
[ 360.871147] vfio-pci 0000:00:08.0: Adding kernel taint for vfio-noiommu group on device
[ 360.951240] iommu: Adding device 0000:00:09.0 to group 1
[ 360.960126] vfio-pci 0000:00:09.0: Adding kernel taint for vfio-noiommu group on device

Generating traffic

We have installed and configured everything, so it is time to generate some traffic. First, on the host, we start a testpmd instance acting as a virtual switch and set it to forward every packet arriving on net_vhost0 to net_vhost1. testpmd has to be started before the VM, so that the unix sockets the vhost-user devices need already exist when QEMU initializes them:

1
2
3
4
5
testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4 \
--vdev 'net_vhost0,iface=/tmp/vhost-user1' \
--vdev 'net_vhost1,iface=/tmp/vhost-user2' -- \
--portmask=f -i --rxq=1 --txq=1 \
--nb-cores=4 --forward-mode=io

Then we start the VM we prepared earlier:

1
virsh start vhuser-test1

Note that at this point we can see vhost-user messages arriving in testpmd:

1
2
3
4
5
6
7
8
9
10
Port 1: link state change event
VHOST_CONFIG: vring base idx:0 file:0
VHOST_CONFIG: read message VHOST_USER_GET_VRING_BASE
VHOST_CONFIG: vring base idx:1 file:0
VHOST_CONFIG: read message VHOST_USER_GET_VRING_BASE

Port 0: link state change event
VHOST_CONFIG: vring base idx:0 file:0
VHOST_CONFIG: read message VHOST_USER_GET_VRING_BASE
VHOST_CONFIG: vring base idx:1 file:0

Once the guest has booted we can start testpmd inside it. It initializes the ports and DPDK's virtio-net driver; virtio feature negotiation, along with the negotiation of other common functionality, also happens at this step.

Before starting testpmd, make sure the vfio kernel module is loaded and the virtio-net devices are bound to the vfio-pci driver:

1
dpdk-devbind -b vfio-pci 0000:00:08.0 0000:00:09.0

Then testpmd can be started:

1
2
3
4
5
testpmd -l 0,1,2 --socket-mem 1024 -n 4 \
--proc-type auto --file-prefix pg -- \
--portmask=3 --forward-mode=macswap --port-topology=chained \
--disable-rss -i --rxq=1 --txq=1 \
--rxd=256 --txd=256 --nb-cores=2 --auto-start

Now we can check how many packets testpmd is processing by typing show port stats all, which shows the statistics in each direction (RX/TX), for example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
testpmd> show port stats all

######################## NIC statistics for port 0 ########################
RX-packets: 75525952 RX-missed: 0 RX-bytes: 4833660928
RX-errors: 0
RX-nombuf: 0
TX-packets: 75525984 TX-errors: 0 TX-bytes: 4833662976

Throughput (since last show)
Rx-pps: 4684120
Tx-pps: 4684120
#########################################################################

######################## NIC statistics for port 1 ########################
RX-packets: 75525984 RX-missed: 0 RX-bytes: 4833662976
RX-errors: 0
RX-nombuf: 0
TX-packets: 75526016 TX-errors: 0 TX-bytes: 4833665024

Throughput (since last show)
Rx-pps: 4681229
Tx-pps: 4681229

#########################################################################

testpmd has several forwarding modes. In this example we used macswap, which swaps the destination and source MAC addresses. Other forwarding modes, such as 'io', do not touch the packets at all and therefore give higher but unrealistic numbers. Another forwarding mode, 'noisy', can be tuned to simulate packet buffer caching and memory lookups.
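For reference, the forwarding mode can also be changed from the testpmd interactive prompt without restarting it (a sketch of the interactive commands, assuming the prompt shown above):

testpmd> stop
testpmd> set fwd macswap
testpmd> start
testpmd> show port stats all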

A journey to the vhost-users realm

This post is a deep dive following the earlier HOWTO on the vhost-user/virtio-pmd architecture, which uses DPDK to achieve high-performance userspace networking. It is aimed at developers and architects interested in the actual implementation of this architecture, and an easy-to-follow hands-on blog will explore the concepts in practice.

Introduction

In the previous deep-dive post we showed the benefits of moving the network processing out of QEMU and into a kernel driver via the vhost-net protocol. In this post we go one step further and show how to obtain even better network performance by moving both the host and the guest data plane out of the kernel and into userspace with DPDK. To do that we have to look at the implementation of the new vhost-user library.

By the end of this post you should have a deeper understanding of the vhost-user/virtio-pmd architecture and of the reasons behind its significant performance improvement.

DPDK and its benefits

You have probably heard of DPDK already. This fast userspace packet-processing library is at the core of many Network Function Virtualization (NFV) applications, allowing them to be implemented entirely in userspace, bypassing the kernel's network stack.

DPDK is a set of userspace libraries that enable a user to create optimized, high-performance packet-processing applications. It brings many advantages that make it very popular among developers; here are some of them:

  • Processor affinity: DPDK pins each thread to a specific logical core to maximize parallelism
  • Huge pages: DPDK has several layers of memory management (such as the Mempool library or the Mbuf library), but ultimately all memory is allocated from mmap'd hugetlbfs. Using 2 MB or even 1 GB pages, DPDK reduces cache misses and TLB lookups
  • Lockless ring buffers: DPDK packet processing is based on the Ring library, which provides an efficient lockless ring queue that supports bursts of enqueue and dequeue operations
  • Poll Mode Driver: to avoid interrupt overhead, DPDK offers the Poll Mode Driver (PMD) abstraction
  • VFIO support: VFIO (Virtual Function I/O) provides a userspace driver-development framework that lets userspace applications access hardware devices directly by mapping their I/O space into the application's memory

On top of these features, two further technologies supported by DPDK give us the tools for a big boost in networking-application performance:

  • The vhost-user library: a userspace library implementing the vhost protocol
  • Virtio-pmd: built on DPDK's PMD abstraction, the virtio-pmd driver implements the virtio specification and allows virtual hardware to be used in a standard and efficient way

DPDK and OVS: A perfect combination

A good example of the performance gains DPDK can bring is Open vSwitch. It is a feature-rich, multilayer, distributed virtual switch widely used as the main networking layer of virtualized environments and SDN deployments.

Classic OVS is split into a fast kernel-based datapath (fastpath) with a flow table, and a slower userspace datapath (slowpath) that handles packets not matching any existing flow in the fastpath. By integrating OVS with DPDK, the fastpath moves to userspace as well, minimizing kernel/userspace interactions and maximizing performance. The result is roughly a ~10x performance improvement of OVS+DPDK over native OVS.

So how do we combine the features and performance of OVS-DPDK with the virtio architecture? The next sections introduce the relevant components one by one.

Vhost-user library in DPDK

The vhost protocol is a set of messages and mechanisms designed to offload the virtio data-path processing from QEMU (the primary, which wants to offload the packet processing) to an external element (the handler, which configures the virtio rings and does the actual packet processing). The most relevant mechanisms are:

  • A set of messages that allow the primary to send the virtqueue memory layout and configuration to the handler
  • A pair of eventfd-like file descriptors that allow the guest and the handler to exchange notifications without going through the primary: the Available Buffer Notification (sent from the guest to the handler to signal that there are buffers ready to be processed) and the Used Buffer Notification (sent from the handler to the guest to signal that processing of the buffers has finished)

In a previous virtio-networking post we described a specific implementation of the vhost protocol (the vhost-net kernel module) and how it lets QEMU offload the network processing to the host kernel. Here we introduce the vhost-user library. This library, built into DPDK, is a userspace implementation of the vhost protocol that allows QEMU to offload a virtio device's packet processing to any DPDK application (such as Open vSwitch).

The main difference between the vhost-user library and the vhost-net kernel module is the communication channel: the vhost-net kernel driver implements it with ioctls, while the vhost-user library defines message structures that are sent over a unix socket.

The DPDK application can be configured to provide the unix socket (server mode), with QEMU connecting to it (client mode). The reverse is also possible, which makes it possible to restart DPDK without restarting the VM.
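For reference, when QEMU is started by hand rather than through libvirt, the vhost-user backend and the shared hugepage-backed memory it needs look roughly like this (a hedged sketch; paths, sizes and the MAC address are illustrative):

# share=on is what allows the vhost-user process to mmap the guest memory
qemu-system-x86_64 -machine accel=kvm -m 4096 \
  -object memory-backend-file,id=mem0,size=4096M,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=char0,path=/tmp/vhost-user1 \
  -netdev type=vhost-user,id=net0,chardev=char0 \
  -device virtio-net-pci,netdev=net0,mac=56:48:4f:53:54:01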

On this socket all requests are initiated by the primary (QEMU here); some of them require a reply, such as GET_FEATURES or any request with the REPLY_ACK flag set.

As with the vhost-net kernel module, the vhost-user library lets the primary configure the data-plane offloading through the following important steps:

  1. Feature negotiation: virtio features and vhost-user-specific features are negotiated in a similar way: the primary first fetches the handler's feature bitmask and then sets the subset of it that it also supports
  2. Memory region configuration: the master sets the memory region layout so the handler can mmap() those regions
  3. Vring configuration: the primary sets the number of virtqueues and their addresses within the memory regions. Note that vhost-user supports multiqueue, so configuring more queues can improve performance
  4. Kick and Call file descriptor exchange: the usual irqfd and ioeventfd mechanisms apply; see the earlier posts on the virtio data plane for more details on the virtqueue notification mechanisms

Thanks to these mechanisms, the DPDK application can process packets through memory shared with the guest and send and receive notifications directly to and from the guest, without going through QEMU.

The last piece that ties everything together is QEMU's virtio device model, which has these main tasks:

  • It emulates a virtio device inside the guest, exposed on a specific PCI port, which the guest can probe and configure as usual. It also maps the ioeventfd to the emulated device's memory-mapped I/O space and the irqfd to its Global System Interrupt (GSI). The result is that the guest is unaware that notifications and interrupts are forwarded to the vhost-user library without QEMU's involvement
  • Instead of implementing the actual virtio data path, it acts as the vhost-user protocol master, offloading that processing to the vhost-user library in the DPDK process
  • It handles requests from the virtqueues and translates them into vhost-user requests that are forwarded to the slave

The following diagram shows the vhost-user library, as part of a DPDK application, interacting with QEMU and with the guest, which uses the virtio device model and a virtio-pci device:

A few things worth noting about the diagram:

  • The virtio memory region is initially allocated by the guest
  • The corresponding virtio driver interacts with the virtio device as usual through the PCI BARs, using the standard configuration interface defined in the virtio specification
  • The virtio device model (inside QEMU) uses the vhost-user protocol to configure the vhost-user library and to hand over the irqfd and ioeventfd file descriptors
  • The virtio memory region allocated by the guest is mapped by the vhost-user library (i.e. by the DPDK application)
  • As a result the DPDK application can read and write packets directly to and from guest memory and notify the guest directly through the irqfd and ioeventfd mechanisms

Userland Networking in the guest

We have covered the DPDK vhost-user implementation, which lets us offload the data-path processing from the host kernel (vhost-net) to a dedicated DPDK userspace application (such as Open vSwitch), dramatically improving network performance. Now we will see how to do the same inside the guest and run a high-performance network application (an NFV service, for example) in guest userspace, replacing the virtio-net kernel module.

To run a userspace network application directly on the device we need three components:

  1. VFIO: a framework for developing userspace drivers that allows userspace applications to interact with a device directly (bypassing the kernel)
  2. The virtio-pmd driver: a DPDK driver, built on the Poll Mode Driver abstraction, that implements the virtio protocol
  3. The IOMMU driver: manages the virtual IOMMU (I/O Memory Management Unit), an emulated device that performs I/O address remapping for DMA-capable devices

Let's go through them one by one.

VFIO

VFIO stands for Virtual Function I/O. However, Alex Williamson, maintainer of the vfio-pci kernel driver, suggested calling it "Versatile Framework for userspace I/O", which is a more accurate name. VFIO is a basic framework for building userspace drivers, and it provides:

  • Mapping of device configuration and I/O memory regions into user memory
  • DMA and interrupt remapping and isolation based on IOMMU groups. (We will dig into the IOMMU later; for now, think of it as creating virtual I/O memory spaces mapped to physical memory, analogous to a regular MMU mapping non-IO virtual memory, so that when a device DMAs to a virtual I/O address the IOMMU remaps it and can apply isolation and other security policies)
  • Eventfd- and irqfd-based signaling mechanisms that deliver events and interrupts to userspace

To paraphrase the kernel documentation: before VFIO, if you wanted to write a driver you either had to go through the full cycle of becoming a proper upstream kernel driver, maintain it out of tree, or use the UIO framework, which has no notion of IOMMU protection, limited interrupt support, and requires root privileges to access things like PCI configuration space.

VFIO exposes a user-friendly API that creates character devices (under /dev/vfio/) supporting ioctls for describing the device, its I/O regions and their read/write/mmap offsets, as well as mechanisms for describing and registering interrupt notifications.
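From the shell these pieces are easy to inspect once a device has been bound to vfio-pci; a minimal sketch (the PCI address is illustrative):

ls /dev/vfio/
# vfio (the container device) plus one character device per IOMMU group, e.g. 0, 1

# which IOMMU group does a given PCI device belong to?
readlink /sys/bus/pci/devices/0000:00:08.0/iommu_group
# ../../../kernel/iommu_groups/0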

Virtio-pmd

DPDK provides a driver abstraction called the Poll Mode Driver (PMD). It sits between the device driver and the user application, giving user applications a lot of flexibility while remaining extensible, for example by making it possible to implement drivers for new devices.

Some of its more useful features are:

  • A set of APIs that let specific driver implementations provide the standard receive and transmit functions
  • Static and dynamic configuration of per-port and per-queue hardware offloads
  • An extensible API for statistics, allowing drivers to define their own driver-specific counters and applications to query them

The virtio Poll Mode Driver (virtio-pmd) is one of the many drivers using the PMD API. It provides applications written with DPDK fast, lockless access to virtio devices, using virtio's virtqueues for the basic packet receive and transmit functionality.

On top of the general PMD features, the virtio-pmd driver implementation also supports:

  • Flexible mergeable buffers per packet on receive and scattered buffers per packet on transmit
  • Multicast and promiscuous mode
  • MAC/VLAN filtering

The result is a high-performance userspace virtio driver that allows a DPDK application to make full use of the standard virtio interface.

Introducing the IOMMU

The IOMMU is essentially an MMU for the I/O space, i.e. for devices accessing memory directly via DMA. It sits between main memory and the devices, creates a virtual I/O address space for each device, and provides a mechanism to dynamically map that virtual space to physical memory. So when a driver configures a device's DMA (a NIC, for example) with virtual addresses, the IOMMU remaps those addresses when the device tries to access them.

It provides many advantages, for example:

  • Large contiguous regions of virtual memory can be allocated without needing contiguous physical memory
  • Some devices cannot address the whole physical memory range; the IOMMU solves this
  • Memory is protected against DMA attacks from malicious or malfunctioning devices trying to access memory that was not allocated to them: devices only see virtual addresses, and the operating system exclusively controls the IOMMU mappings
  • Some architectures also support interrupt remapping, which allows interrupt isolation and migration

As usual, everything comes at a price, and the drawbacks of the IOMMU are:

  • Performance degradation due to the extra page translations
  • Physical memory consumed by the additional translation tables

vIOMMU - IOMMU for the guest

Naturally, just as there are physical IOMMUs (such as Intel VT-d and AMD-Vi), QEMU can provide a virtual IOMMU. QEMU's vIOMMU has the following characteristics:

  • It translates guest I/O virtual addresses (IOVA) to guest physical addresses (GPA), which QEMU's memory-management system can translate to QEMU's (host) virtual addresses (HVA)
  • It enforces device isolation
  • It implements an I/O TLB API, so the mappings can be queried from outside QEMU

So for a virtual device to work with the virtual IOMMU, we have to (see the sketch after this list):

  1. Create the necessary IOVA mappings in the vIOMMU using one of the available APIs, currently:
    a. the kernel's DMA API for kernel drivers
    b. VFIO for userspace drivers
  2. Configure the device's DMA with those virtual I/O addresses
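With QEMU this means adding the emulated IOMMU device on a Q35 machine; a hedged sketch of the relevant flags (libvirt exposes the same thing through its <iommu model='intel'/> device element):

qemu-system-x86_64 -machine q35,accel=kvm,kernel-irqchip=split \
  -device intel-iommu,intremap=on,device-iotlb=on
# (disks, netdevs and the rest of the VM definition omitted)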

vIOMMU and DPDK integration

When a device emulated by QEMU tries to DMA into the guest's virtio I/O space, it uses the vIOMMU TLB to look up the page mapping and perform a safe DMA access. The question is: what happens when the actual DMA is offloaded to an external process, such as a DPDK application using the vhost-user library?

When the vhost-user library wants to access that shared memory directly, it has to translate the addresses (I/O virtual addresses) into its own address space. It does so by asking QEMU's vIOMMU through the Device TLB API: the vhost-user library (like the vhost-net kernel driver) uses PCIe's Address Translation Services message set to request a page address translation from QEMU, over a secondary communication channel (another unix socket) created when IOMMU support is configured.

Overall there are three address translations to handle:

  1. QEMU's vIOMMU translates the I/O virtual address into a guest physical address
  2. QEMU's memory management translates the guest physical address into a host virtual address (within the QEMU process's address space)
  3. The vhost-user library translates QEMU's host virtual address into its own host virtual address. Usually, since the vhost-user library has mmap'd QEMU's memory, this is just the offset between QEMU's HVA and the address returned by mmap

Obviously, all this address translation has a potential performance impact, especially with dynamic mappings. However, statically allocated huge pages (which is exactly what DPDK uses) minimize the cost.

The following diagram extends the earlier vhost-user architecture with the IOMMU components:

A few points about this fairly complex diagram:

  • The guest physical memory space is what the guest sees as physical memory, but it obviously lives inside the QEMU process's virtual address space. When the virtqueue memory region is allocated, it ends up somewhere in the guest physical memory space
  • When I/O virtual addresses are allocated for the memory range containing a virtqueue, an entry associating them with the corresponding guest physical addresses is added to the vIOMMU's TLB table
  • On the other hand, QEMU's memory management knows where the guest physical memory space lives within its own address space, so it can translate guest physical addresses into QEMU (host) virtual addresses
  • When the vhost-user library tries to access an IOVA for which it has no translation, it sends an IOTLB miss message over the secondary unix socket
  • The IOTLB API receives the request and looks up the address, first translating the IOVA to a GPA and then the GPA to an HVA, and finally sends the translation to the vhost-user library over the master unix socket
  • Finally, the vhost-user library has to do one last translation: since it has mapped QEMU's memory into its own address space, it translates QEMU's HVA into its own HVA to access the shared memory

Putting everything together

This post has covered a lot of components: DPDK, virtio-pmd, VFIO, the IOMMU and more.

The following diagram shows the vhost-user/virtio-pmd architecture with all these components put together:

One thing to note about the diagram:

  • Compared with the previous diagram, it adds the components that connect the OVS-DPDK application to the physical NIC through the hardware IOMMU, VFIO and a vendor-specific PMD driver. This should come as no surprise by now, since the host accesses the hardware in the same way the guest does

An example flow

Control Plane

These are the steps required to set up the control plane:

  1. When the DPDK application (OVS) starts on the host, it creates a socket (in server mode) for the virtio-related negotiation with QEMU
  2. When QEMU starts, it connects to the main socket, and if the vhost-user backend offers the VHOST_USER_PROTOCOL_F_SLAVE_REQ feature, QEMU creates a second socket and passes it to the vhost-user backend so the latter can connect back and send IOTLB synchronization messages
  3. When the QEMU <-> vhost-library negotiation finishes, two sockets are shared between them: one for the virtio configuration and one for the IOTLB message exchange
  4. The guest starts and the vfio driver is bound to the PCI device; it exposes access to the iommu group (which depends on the hardware topology)
  5. When DPDK starts inside the guest, it performs the following steps:
    a. Initialize the PCI-vfio device and map the PCI configuration space into user memory
    b. Allocate the virtqueues
    c. Using vfio, DMA-map the virtqueue memory space, so that the mapping is installed into the vIOMMU through the IOMMU kernel driver
    d. Virtio feature negotiation then takes place. In this scenario the virtqueue addresses used are IOVAs (in the I/O virtual address space). The eventfd and irqfd mappings are also set up, so interrupts and notifications are routed directly between the guest and the vhost-user library without QEMU's involvement
    e. Finally, the DPDK application allocates a big chunk of contiguous memory for the network buffers; this mapping is also added to the vIOMMU through VFIO and the IOMMU driver

At this point the setup is done and the data plane (the virtqueues and the notification mechanisms) is ready to be used.

Data Plane

To transmit packets, the following steps take place:

  1. The DPDK application in the guest asks virtio-pmd to send packets. It writes the buffers and adds the corresponding descriptors to the available descriptor ring
  2. The vhost-user PMD on the host is polling the virtqueue, so it immediately detects the new descriptors and starts processing them
  3. For each descriptor, the vhost-user PMD maps its buffer (i.e. translates its IOVA to an HVA). In the rare case that the buffer's page is not in the vhost-user IOTLB, a request is sent to QEMU; in practice the DPDK application in the guest allocates static huge pages, which keeps IOTLB requests to QEMU to a minimum
  4. The vhost-user PMD copies the buffers into mbufs (the message buffers used by DPDK applications)
  5. The descriptors are added to the used descriptor ring; the DPDK application in the guest, which is continuously polling the virtqueue, notices this immediately
  6. The mbufs are then consumed by the host DPDK application

Summary and conclusions

DPDK is a promising technology because of the massive userspace performance gains it offers. Not only on its own: combined with OVS it meets the requirements of flexible, efficient modern virtualized environments and plays an important role in NFV deployments.

To make full use of this technology for the data-center switching datapath and for NFV applications running in guests, an efficient and secure datapath has to be created between the host and the guest. That is exactly where virtio-networking comes in.

vhost-user provides a reliable and secure mechanism to offload the network processing to a DPDK-based application. It integrates with the vIOMMU, providing isolation and memory protection, while relieving QEMU from the heavy lifting of packet processing.

Inside the guest, the virtio-compliant DPDK driver (virtio-pmd) uses efficient memory management and DPDK's high-performance Poll Mode Driver to create an equally fast datapath in guest userspace.

If you want to learn more about virtio technologies, vhost-user and DPDK, don't miss the follow-up posts.

How vhost-user came into being: Virtio-networking and DPDK

In this post we give a high-level overview of a DPDK (Data Plane Development Kit) based solution between the host and the guest. It will be followed by a detailed post aimed at architects/developers and a hands-on post.

The previous posts in the series, covering the solution overview, the technical deep dive and the hands-on, walked the reader through the virtio-networking ecosystem: the basic building blocks (KVM, QEMU, libvirt), the vhost protocol and the vhost-net/virtio-net architecture, which is composed of vhost-net (the backend) in the host kernel and virtio-net (the frontend) in the guest kernel.

The vhost-net/virtio-net architecture is a production solution that has been widely deployed over the years, partly because it is convenient for users developing applications that run inside VMs: it uses standard Linux sockets to connect to the network (through the host). On the other hand, the solution is not perfect and still carries some performance overhead, which will be explained later on.

To address the performance problems we will introduce the vhost-user/virtio-pmd architecture. To understand the details we first review DPDK, how OVS connects to DPDK, and how virtio fits into the frontend and backend of this architecture.

By the end of the post you will have a solid understanding of the vhost-user/virtio-pmd architecture and how it differs from vhost-net/virtio-net.

DPDK overview

DPDK aims to provide a simple and complete framework for fast packet processing in data-plane applications. It implements a run-to-completion model for packet processing: all resources must be allocated before the data-plane application runs, and these dedicated resources are owned by dedicated logical processing cores.

This design is different from the Linux kernel, which context-switches between processes using a scheduler and interrupts; in the DPDK architecture the devices are accessed by constant polling. This removes the overhead of context switches and interrupt processing and keeps the CPU core doing packet processing 100% of the time.

In practice DPDK offers a series of poll mode drivers (PMDs) that enable packet transfer directly between userspace and the physical interface, bypassing the kernel network stack completely. This provides a significant performance boost over kernel forwarding by eliminating interrupt handling and the kernel network stack.

DPDK is a set of libraries, so to use it you need an application that links against those libraries and calls the relevant APIs.

The following diagram shows the earlier virtio building blocks together with a DPDK application that uses a PMD driver to access the physical NIC (bypassing the kernel):

OVS-DPDK overview

As described in earlier posts, Open vSwitch normally forwards packets in the kernel-space datapath: the OVS kernel module contains a simple flow table for forwarding the packets it receives. A small fraction of packets, which we can call exception packets (for example the first packet of an OpenFlow flow), do not match any existing entry in the kernel flow table and are sent to the userspace OVS daemon (ovs-vswitchd) to be handled. The daemon analyzes the packet, updates the flow table in the OVS kernel module, and from then on packets of that flow are forwarded directly by the OVS kernel datapath.

This approach eliminates userspace/kernel context switches for most of the traffic, yet we are still limited by the Linux network stack, which is not well suited to use cases with very high packet rates.

If we integrate OVS with DPDK, we leverage the PMD drivers mentioned above and move the OVS kernel forwarding table into userspace, as illustrated by the configuration sketch below.
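As an illustration of what that integration looks like on the OVS side (a hedged sketch: bridge and port names and the PCI address are made up, and the exact options depend on the OVS version):

# tell ovs-vswitchd to initialize DPDK
ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true

# a userspace (netdev) bridge with one physical DPDK port and one vhost-user port
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk options:dpdk-devargs=0000:01:00.0
ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser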

The following diagram shows an OVS-DPDK application where all the OVS components run in userspace and communicate with the physical NIC through a PMD driver:

It is worth mentioning that although we only show a DPDK application running in the host's userspace, it is also possible to run a DPDK application with a PMD driver inside the guest. The next section covers this in detail.

The vhost-user/virtio-pmd architecture

In the vhost-user/virtio-pmd architecture, virtio uses DPDK in both the host userspace and the guest userspace:

  1. vhost-user (the backend) runs in host userspace as part of the OVS-DPDK userspace application. As mentioned, DPDK is a library and the vhost-user module is an API inside that library; OVS-DPDK is the actual application linked against the library and calling the API. For every guest VM created on the host, a corresponding vhost-user backend is instantiated to communicate with the guest's virtio frontend
  2. virtio-pmd (the frontend) runs in guest userspace. It is a poll mode driver that consumes dedicated cores and polls without interrupts. An application running in guest userspace that wants to use virtio-pmd also needs to be linked against the DPDK library

This diagram shows how it all fits together:

If you compare this with the kernel-based vhost-net/virtio-net architecture, vhost-net is replaced by vhost-user and virtio-net is replaced by virtio-pmd.

By letting the host userspace bypass the kernel and access the physical NIC directly through shared memory, and by using virtio-pmd in guest userspace to bypass the guest kernel as well, the overall performance can improve by a factor of 2 to 4.

However, this comes at a price in usability. In the vhost-net/virtio-net architecture the data plane is transparent from the guest OS's point of view: simply install the virtio driver in the guest kernel and guest userspace applications automatically get a standard Linux network interface.

In the vhost-user/virtio-pmd architecture, in contrast, guest userspace applications are required to use the virtio-pmd driver (from the DPDK library) in order to get the optimized data plane. This is not a trivial task and requires expert knowledge of configuring and using DPDK.

Summary

In this post we introduced the vhost-user/virtio-pmd architecture, which improves virtio networking performance at the cost of some added complexity, since applications now have to link against and use DPDK.

There are use cases, such as virtual network functions (VNFs), where performance is critical and the virtio/DPDK architecture helps meet the required performance targets; however, developing such applications requires expertise, understanding of the DPDK API and a variety of optimizations.

In the next post we dive into the internals of the vhost-user/virtio-pmd architecture and go over the different control-plane and data-plane components.

Hands on vhost-net: Do. Or do not. There is no try

Vhost-net, using the standard virtio networking interface, has quietly become the default traffic-offloading mechanism in qemu-kvm based virtualization environments. It allows network processing to be performed in a kernel module, freeing the QEMU process and improving network performance.

Earlier posts introduced the building blocks of this architecture (Introduction to virtio-networking and vhost-net) and gave a more detailed explanation (Deep dive into Virtio-networking and vhost-net). In this post we provide hands-on, step-by-step instructions for setting the architecture up; once it is built we can inspect how the main components work.

This post is intended for developers, hackers and anyone interested in learning how real traffic offloading is done.

After reading this post (and hopefully reproducing the environment on your own PC), you will be more familiar with the tools used in virtualization (such as virsh), you will know how to set up a vhost-net environment, and you will know how to inspect a running VM and measure its network performance.

For those who want to get the environment up quickly and go straight to reverse-engineering it, there is a treat: https://github.com/redhat-virtio-net/virtio-hands-on contains Ansible scripts that automate the deployment.

Setting things up

Since I didn't feel like reproducing the exact environment from the original article, I use a typical ZStack environment here instead.

Requirements

A machine with CentOS Linux release 7.6.1810 (Core) installed
The root user (or a user with sudo permissions)
More than 25 GB of free space in the home directory
At least 8 GB of RAM
First install the dependencies. If you install from the ZStack ISO you can choose Host mode; in expert mode install them with the following command:

1
yum --disablerepo=* --enablerepo=zstack-mn,qemu-kvm-ev-mn install libguestfs-tools qemu-kvm libvirt kernel-tools iperf3 -y

Also download a netperf rpm matching your OS version from https://pkgs.org/download/netperf

The package for CentOS 7 is https://centos.pkgs.org/7/lux/netperf-2.7.0-1.el7.lux.x86_64.rpm.html

Install virt-install:

1
yum --disablerepo=* --enablerepo=ali* install virt-install -y

Next, make sure the current user is added to the libvirt group:

1
sudo usermod -a -G libvirt $(whoami)

Then log in again and restart libvirtd:

1
systemctl restart libvirtd

Creating VM

First download an image. You can pick one from the internal mirror at http://192.168.200.100/mirror/diskimages/; here the cloud base image below is used:

1
wget https://archive.fedoraproject.org/pub/archive/fedora/linux/releases/30/Cloud/x86_64/images/Fedora-Cloud-Base-30-1.2.x86_64.qcow2

This is a prepackaged image; we create a copy backed by it so the original can be reused later. Run the following commands to create the overlay and check that the image information matches expectations:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
[root@host ~]# qemu-img create -f qcow2 -b Fedora-Cloud-Base-30-1.2.x86_64.qcow2 virtio-test1.qcow2 20G
Formatting 'virtio-test1.qcow2', fmt=qcow2 size=21474836480 backing_file=centos76.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[root@host ~]# qemu-img info virtio-test1.qcow2
image: virtio-test1.qcow2
file format: qcow2
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
cluster_size: 65536
backing file: centos76.qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

Then clean up the operating system:

1
sudo virt-sysprep --root-password password:password123 --uninstall cloud-init --selinux-relabel -a virtio-test1.qcow2

This command mounts the filesystem and applies some basic settings automatically, so the image is ready to boot.

We need to connect the VM to a network. Libvirt handles networks just like VMs: you define a network from an XML file and control its start and stop from the command line.

As an example, we use the convenient network called 'default' that ships with libvirt. Define the default network, start it and check that it is running:

1
2
3
4
5
6
7
8
[root@host ~]# virsh net-define /usr/share/libvirt/networks/default.xml
Network default defined from /usr/share/libvirt/networks/default.xml
[root@host ~]# virsh net-start default
Network default started
[root@host ~]# virsh net-list
Name State Autostart Persistent
--------------------------------------------
default active no yes

Finally, we can create the VM with virt-install. This command-line tool creates the definitions an operating system needs and gives us a base configuration that we can customize:

1
virt-install --import --name virtio-test1 --ram=4096 --vcpus=2 --nographics --accelerate --network network:default,model=virtio --mac 02:ca:fe:fa:ce:01       --debug --wait 0 --console pty --disk /root/virtio-test1.qcow2,bus=virtio

The options specify the number of vCPUs, the amount of RAM of our VM, the disk path and the network the VM connects to.

Besides defining the VM from these options, virt-install also starts it, so we can list it:

1
2
3
4
[root@host ~]# virsh list
Id Name State
------------------------------
9 virtio-test1 running

Our VM is running.

As a reminder, virsh is the command-line interface to libvirt. You can start a VM with:

1
virsh start virtio-test1

Enter the VM's console:

1
virsh console virtio-test1

Stop a running VM:

1
virsh shutdown virtio-test1

Undefine (delete) the VM (don't do this unless you want to recreate it):

1
virsh undefine virtio-test1

Inspecting the guest

As already mentioned, virt-install creates and starts the VM automatically using libvirt. Every VM is created from an XML file that describes the hardware to emulate and is submitted to libvirt. We can look at the relevant part by dumping the configuration:

1
2
3
4
5
6
7
8
9
10
11
12
<devices>
...
<interface type='network'>
<mac address='02:ca:fe:fa:ce:01'/>
<source network='default' bridge='virbr0'/>
<target dev='vnet0'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</interface>
...
</devices>

We can see that a virtio device was created and connected to the network, in this configuration via virbr0. The device also has a PCI address (domain, bus, slot and function).

Then enter the console with the console command:

1
virsh console virtio-test1

Inside the guest, install a few test dependencies:

1
dnf install pciutils iperf3

Then, inside the VM, check that there really is a network device hanging off the virtual PCI bus:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[root@localhost ~]# lspci -s 0000:00:02.0 -v
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
Subsystem: Red Hat, Inc. Device 0001
Physical Slot: 2
Flags: bus master, fast devsel, latency 0, IRQ 11
I/O ports at c040 [size=32]
Memory at febc0000 (32-bit, non-prefetchable) [size=4K]
Memory at febf4000 (64-bit, prefetchable) [size=16K]
Expansion ROM at feb80000 [disabled] [size=256K]
Capabilities: [98] MSI-X: Enable+ Count=3 Masked-
Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
Capabilities: [70] Vendor Specific Information: VirtIO: Notify
Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
Capabilities: [50] Vendor Specific Information: VirtIO: ISR
Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
Kernel driver in use: virtio-pci

Note: the address given to lspci is assembled from the domain, bus, slot and function in the XML above.

Besides the typical PCI device information (memory regions, capabilities and so on), we can see that the device implements the virtio-over-PCI capabilities, and that a network device driven by virtio_net was created on top of it. We can look at the device more closely:

1
2
[root@localhost ~]# readlink /sys/devices/pci0000\:00/0000\:00\:02.0/virtio0/driver
../../../../bus/virtio/drivers/virtio_net

The readlink command shows that this PCI device is driven by the virtio_net driver.

It is this virtio_net driver that creates the network interface used by the operating system:

1
2
3
4
5
6
[root@localhost ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 02:ca:fe:fa:ce:01 brd ff:ff:ff:ff:ff:ff

Inspecting the host

We have inspected the guest; now let's look at the host. Note that the default behavior for an interface of type 'network' is to use vhost-net.

First check whether vhost-net is loaded:

1
2
3
4
5
[root@host ~]# lsmod | grep vhost
vhost_net 22507 1
tun 31881 4 vhost_net
vhost 48422 1 vhost_net
macvtap 22796 1 vhost_net

We can check that QEMU uses the tun, kvm and vhost-net devices; looking at the /proc filesystem we can see the corresponding file descriptors held by the qemu process:

1
2
3
4
5
6
7
[root@host ~]# ls -lh /proc/40888/fd | grep '/dev'
lrwx------ 1 root root 64 Dec 27 22:55 0 -> /dev/null
lrwx------ 1 root root 64 Dec 27 22:55 10 -> /dev/ptmx
lrwx------ 1 root root 64 Dec 27 22:55 13 -> /dev/kvm
lr-x------ 1 root root 64 Dec 27 22:55 3 -> /dev/urandom
lrwx------ 1 root root 64 Dec 27 22:55 35 -> /dev/net/tun
lrwx------ 1 root root 64 Dec 27 22:55 37 -> /dev/vhost-net

This means the qemu process not only opened the kvm device to perform virtualization, but also created a tun/tap device and opened a vhost-net device. We can also see the vhost kernel thread that was created to serve this qemu instance:

1
2
3
4
5
[root@host ~]# ps -ef | grep '\[vhost'
root 40056 21741 0 09:53 pts/0 00:00:00 grep --color=auto \[vhost
root 40894 2 0 Dec27 ? 00:00:03 [vhost-40888]
[root@host ~]# pgrep qemu
40888

The vhost kernel thread is named vhost-$qemu_pid.

Finally, we can see that the tun interface created by the qemu process (found above via /proc) connects the host and the guest through a bridge. Note that although the tap device is attached to the qemu process, it is the vhost kernel thread that actually reads from and writes to it:

1
2
3
4
5
[root@host ~]# ip -d tuntap
virbr0-nic: tap UNKNOWN_FLAGS:800
Attached to processes:
vnet0: tap vnet_hdr
Attached to processes: qemu-kvm(40888)

OK, so vhost is up and running and qemu is connected to it. Now we can generate some traffic and see how the system behaves.

Generating traffic

If you have completed the previous steps, you can already send traffic from host to guest (or the other way around) using their IP addresses, for example measuring network performance with iperf3. Note that these are not proper benchmarks: different inputs such as software and hardware versions or network stack parameters can significantly change the results; throughput numbers and workload-specific benchmarking are out of scope for this post.

First check the guest's IP address and start an iperf3 server (or whatever tool you plan to use to test connectivity):

1
2
3
4
5
6
7
8
[root@localhost ~]# ip addr
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 02:ca:fe:fa:ce:01 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.41/24 brd 192.168.122.255 scope global dynamic noprefixroute eth0
valid_lft 2808sec preferred_lft 2808sec
inet6 fe80::ca:feff:fefa:ce01/64 scope link
valid_lft forever preferred_lft forever

Then run the iperf3 client on the host:

1
2
3
4
5
[root@host ~]# iperf3 -c 192.168.122.41
Connecting to host 192.168.122.41, port 5201
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 34.3 GBytes 29.5 Gbits/sec 1 sender
[ 4] 0.00-10.00 sec 34.3 GBytes 29.5 Gbits/sec receiver

In the iperf3 output we see a transfer rate of 29.5 Gbits/sec (the bandwidth of this host-internal network depends on many factors, so don't assume your environment will match). We can exercise the data plane further by changing the packet size with the -l option.
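For example, to vary the buffer size or switch to UDP while keeping everything else the same (standard iperf3 options):

iperf3 -c 192.168.122.41 -l 64       # 64-byte buffers instead of the default
iperf3 -c 192.168.122.41 -u -b 10G   # UDP with a 10 Gbit/s target rate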

If we run top during the iperf3 test, we can see the vhost-$pid kernel thread using 100% of a core for packet forwarding while QEMU uses almost two more cores (run it a few times; here we happened to observe QEMU at 200% and the vhost thread at 100%; recall the guest was created with 2 vCPUs):

1
2
3
4
5
6
7
8
9
top - 10:07:24 up 7 days, 17:30,  3 users,  load average: 1.49, 0.55, 0.28
Tasks: 612 total, 2 running, 610 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.9 us, 6.6 sy, 0.0 ni, 90.4 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 13174331+total, 10001235+free, 4672644 used, 27058324 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 12307848+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40888 root 20 0 6158388 1.2g 11436 S 200.0 0.9 3:38.41 qemu-kvm
40894 root 20 0 0 0 0 R 100.0 0.0 0:58.46 vhost-40888

To measure latency we run netperf against a netperf server. (Note: a netserver must be started inside the guest first; see the separate note on installing netperf in a Fedora guest.)

1
2
3
4
5
6
7
8
9
[root@host ~]# netperf -l 30 -H 192.168.122.41 -p 16604 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.41 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

16384 87380 1 1 30.00 36481.26
16384 131072

With vhost-net enabled, the host → guest TCP_RR latency is 1 / 36481.26 = 0.0000274 s.

As said before, we are not doing benchmarks or throughput tests here; we are just getting familiar with the method.

Extra: Disable vhost-net

As we have seen, vhost-net is the default because of the performance gains it brings. But since we are here to learn, let's disable vhost-net and see how the performance differs; that shows how "heavy" the packet-processing work is that QEMU would otherwise have to do, and what the performance impact is.

First stop the VM:

1
virsh shutdown virtio-test1

Edit the VM configuration:

1
virsh edit virtio-test1

Change the interface definition to the following; the addition is the <driver name='qemu'/> element:

1
2
3
4
5
6
7
8
9
10
11
<devices>
...
<interface type='network'>
<mac address='02:ca:fe:fa:ce:01'/>
<source network='default'/>
<model type='virtio'/>
<driver name='qemu'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</interface>
...
</devices>

After exiting the editor, libvirt reports:

1
Domain virtio-test1 XML configuration not changed

Then start the VM:

1
virsh start virtio-test1

We can verify that the file descriptor pointing to /dev/vhost-net is gone:

1
2
3
4
5
6
7
[root@host ~]# ls -lh /proc/37518/fd | grep '/dev'
lrwx------ 1 root root 64 Dec 28 11:22 0 -> /dev/null
lrwx------ 1 root root 64 Dec 28 11:22 10 -> /dev/ptmx
lrwx------ 1 root root 64 Dec 28 11:22 13 -> /dev/kvm
lr-x------ 1 root root 64 Dec 28 11:22 3 -> /dev/urandom
lrwx------ 1 root root 64 Dec 28 11:22 33 -> /dev/net/tun

Analyzing the performance impact

If we repeat the earlier tests without vhost-net, we can see that no vhost kernel thread is running anymore:

1
2
[root@host ~]# ps -ef | grep '\[vhost'
root 9076 23993 0 11:24 pts/0 00:00:00 grep --color=auto \[vhost

The iperf3 results we get now are:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
[root@host ~]# iperf3 -c 192.168.122.41
Connecting to host 192.168.122.41, port 5201
[ 4] local 192.168.122.1 port 58628 connected to 192.168.122.41 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.89 GBytes 16.2 Gbits/sec 0 2.08 MBytes
[ 4] 1.00-2.00 sec 1.78 GBytes 15.3 Gbits/sec 0 2.19 MBytes
[ 4] 2.00-3.00 sec 1.82 GBytes 15.6 Gbits/sec 0 2.37 MBytes
[ 4] 3.00-4.00 sec 1.82 GBytes 15.7 Gbits/sec 0 2.47 MBytes
[ 4] 4.00-5.00 sec 1.73 GBytes 14.8 Gbits/sec 0 2.61 MBytes
[ 4] 5.00-6.00 sec 1.80 GBytes 15.4 Gbits/sec 0 2.64 MBytes
[ 4] 6.00-7.00 sec 1.82 GBytes 15.6 Gbits/sec 0 2.64 MBytes
[ 4] 7.00-8.00 sec 1.81 GBytes 15.6 Gbits/sec 0 2.69 MBytes
[ 4] 8.00-9.00 sec 2.33 GBytes 20.0 Gbits/sec 0 2.81 MBytes
[ 4] 9.00-10.00 sec 2.32 GBytes 19.9 Gbits/sec 0 2.93 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 19.1 GBytes 16.4 Gbits/sec 0 sender
[ 4] 0.00-10.00 sec 19.1 GBytes 16.4 Gbits/sec receiver

iperf Done.

The transfer rate dropped from the 29.5 Gbits/sec above to 16.4 Gbits/sec.

Checking with top again, QEMU's CPU usage peaks much higher (test this a few times, it fluctuates; with vhost-net disabled it is roughly 150% ~ 260%, whereas with vhost-net enabled it peaked at about 200%):

1
2
3
4
5
6
7
8
top - 11:27:10 up 7 days, 18:50,  3 users,  load average: 0.57, 0.36, 0.29
Tasks: 617 total, 3 running, 614 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.8 us, 4.0 sy, 0.0 ni, 91.7 id, 0.5 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 13174331+total, 10047221+free, 3931432 used, 27339668 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 12381579+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
37518 root 20 0 6605056 514392 11372 R 242.9 0.4 1:02.22 qemu-kvm

If we also compare TCP and UDP latency between the two network configurations, we can see that vhost-net consistently improves both.

Below are the results recorded with vhost-net disabled.

host → guest TCP_RR test with vhost-net disabled:

1
2
3
4
5
6
7
8
9
[root@host ~]# netperf -l 30 -H 192.168.122.41 -p 16604 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.41 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

16384 87380 1 1 30.00 27209.58

The computed latency is 1/27209.58 = 0.0000367 s.

host → guest UDP_RR test with vhost-net disabled:

1
2
3
4
5
6
7
8
[root@host ~]# netperf -l 30 -H 192.168.122.41 -p 16604 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.41 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

212992 212992 1 1 30.00 27516.04

The computed latency is 1/27516.04 = 0.0000363 s.

host → guest UDP_RR test with vhost-net enabled:

1
2
3
4
5
6
7
8
[root@host ~]# netperf -l 30 -H 192.168.122.41 -p 16604 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.122.41 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

212992 212992 1 1 30.00 37681.20

The computed latency is 1/37681.20 = 0.0000265 s.

The same tests were run in the guest → host direction; only the results are recorded here:

vhost-net enabled, guest → host TCP_RR: 1/36030.14 = 0.0000278 s

vhost-net enabled, guest → host UDP_RR: 1/37690.97 = 0.0000265 s

vhost-net disabled, guest → host TCP_RR: 1/26697.53 = 0.0000375 s

vhost-net disabled, guest → host UDP_RR: 1/25850.89 = 0.0000387 s

The results are summarized in the table below:

Using strace to count system calls during the iperf3 test with vhost-net disabled:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[root@host ~]# strace -c -p 37518 # 进程的pid,统计结束后用ctrl+c结束
strace: Process 37518 attached
^Cstrace: Process 37518 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
41.63 1.137491 8 141188 2775 read
28.66 0.783060 6 135331 ioctl
27.16 0.742136 6 121757 writev
2.40 0.065491 8 8380 ppoll
0.13 0.003653 6 594 275 futex
0.01 0.000243 5 50 write
0.00 0.000020 20 1 clone
0.00 0.000012 3 4 sendmsg
0.00 0.000008 4 2 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 2.732114 407307 3050 total

strace of qemu during the iperf3 test with vhost-net enabled:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
[root@host ~]# strace -c -p 27346
strace: Process 27346 attached
^Cstrace: Process 27346 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.96 6.819794 131150 52 ppoll
0.02 0.001416 10 136 write
0.01 0.000862 13 66 futex
0.00 0.000341 9 39 read
0.00 0.000083 10 8 sendmsg
0.00 0.000022 22 1 clone
0.00 0.000016 8 2 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 6.822534 304 total

Another good indicator is how many IOCTLs QEMU sends to KVM. Every I/O event QEMU has to handle requires an IOCTL to KVM to switch back into the guest in VMX non-root mode. Using strace we can analyze how much time qemu spends in each syscall: comparing the two traces above, without vhost-net there are huge numbers of read/writev/ioctl calls, while with vhost-net enabled it is essentially just ppoll.
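A complementary way to count those exits, if perf is available on the host, is to watch the KVM tracepoints while the test runs (a hedged sketch; tracepoint names may vary slightly between kernel versions):

# count VM exits of the qemu-kvm process over 10 seconds
perf stat -e 'kvm:kvm_exit' -p $(pgrep qemu-kvm) sleep 10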

Conclusions

In this post we walked through creating a VM that uses QEMU + vhost-net and inspected both the guest and the host to understand the ins and outs of this architecture. We also showed how the performance changes. This concludes the series: from the overview in Introduction to virtio-networking and vhost-net, through the technical deep dive in Deep dive into Virtio-networking and vhost-net explaining the components in detail, to this setup walkthrough, we hope these resources give IT experts, architects and developers enough to understand this technology and start working with it.

The next topic will be Userland networking and DPDK.