Linux memory management (1)

How the CPU accesses memory

CPU core -> MMU (TLBs, Table Walk Unit) -> Caches -> Memory (translation tables)

CPU issues a VA -> MMU looks up the PTE (page table entry) -> TLB -> L1 cache -> L2 cache -> L3 cache

note: assume an architecture with the TLB between the CPU and the L1 cache.

The TLB is a cache for VA-to-PA translations; its entries hold PTEs.

On a TLB miss, the MMU walks the page tables (whose entries may themselves be found in L1 and the lower cache levels) until the PTE is located, and then installs that PTE into the TLB.
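
To make the translation concrete, here is a minimal sketch, assuming 4 KiB pages (a 12-bit offset) and an arbitrary example address, of how a virtual address splits into the virtual page number the TLB is looked up by and the offset that passes through unchanged:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                     /* 4 KiB pages */
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

int main(void)
{
    uint64_t va  = 0x7f3a12345678ULL;     /* example virtual address */
    uint64_t vpn = va >> PAGE_SHIFT;      /* virtual page number: the TLB lookup key */
    uint64_t off = va & PAGE_MASK;        /* page offset: copied into the PA unchanged */

    /* A TLB hit yields the physical frame number pfn for this vpn;
       the physical address is then (pfn << PAGE_SHIFT) | off. */
    printf("vpn = 0x%llx, offset = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)off);
    return 0;
}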

What is a TLB?

TLB definition from wiki: A translation lookaside buffer (TLB) is a memory cache that is used to reduce the time taken to access a user memory location. It is a part of the chip’s memory-management unit (MMU). The TLB stores the recent translations of virtual memory to physical memory and can be called an address-translation cache. A TLB may reside between the CPU and the CPU cache, between CPU cache and the main memory or between the different levels of the multi-level cache. The majority of desktop, laptop, and server processors include one or more TLBs in the memory-management hardware, and it is nearly always present in any processor that utilizes paged or segmented virtual memory.

note:

  1. The TLB stores only recent translations, which means not every address translation is in the TLB; be mindful of TLB misses.

  2. A TLB may reside between the CPU and the CPU cache, between the CPU cache and main memory, or between levels of a multi-level cache.

  3. With a virtually addressed cache, the TLB is consulted only on a cache miss; with a physically addressed cache, the CPU consults the TLB on every access before the cache can be indexed by physical address.

  4. Common TLB replacement policies are LRU and FIFO.

  5. The CPU has to access main memory on an instruction-cache miss, data-cache miss, or TLB miss, but compared with the first two, the third case, a TLB miss, is far more expensive: resolving it may take several memory accesses to walk the page tables.

  6. Frequent TLB misses degrade performance: each newly cached page displaces one that will soon be needed again. This happens when the TLB, which acts as a cache for the memory management unit (MMU)'s virtual-to-physical translations, is too small for the working set of pages. TLB thrashing can occur even when instruction-cache or data-cache thrashing does not, because they cache at different granularities: instructions and data are cached in small blocks (cache lines), not entire pages, while address lookup works at the page level. So even if the code and data working sets fit into cache, a working set fragmented across many pages may not fit into the TLB, causing TLB thrashing. (See the sketch below.)
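
As a rough illustration of thrashing (the page count and loop counts are arbitrary assumptions), the following touches one byte per page across more pages than a typical TLB holds; the cache-line footprint stays modest, but every access wants a different TLB entry:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NPAGES    8192                    /* more pages than a typical TLB holds */

int main(void)
{
    /* 32 MiB spread over 8,192 pages. The touched cache lines total only
       8,192 * 64 B = 512 KiB, but each access below needs its own TLB
       entry, so the translation working set overflows the TLB. */
    char *buf = calloc(NPAGES, PAGE_SIZE);
    if (buf == NULL)
        return 1;

    long sum = 0;
    for (int round = 0; round < 1000; round++)
        for (size_t p = 0; p < NPAGES; p++)
            sum += buf[p * PAGE_SIZE];    /* one byte per page */

    free(buf);
    printf("%ld\n", sum);
    return 0;
}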

TLB-miss handling

Two schemes for handling TLB misses are commonly found in modern architectures:

  • With hardware TLB management, the CPU walks the page tables automatically. On x86, for example, the hardware uses the CR3 register to locate the page tables; if a valid entry exists, it is brought into the TLB, the access is retried, and it hits. If no valid entry exists, the CPU raises a page fault exception, which the operating system must handle (e.g., by swapping the page in or out and updating the page tables). Because the TLB is managed entirely in hardware, its organization can change from CPU to CPU without causing loss of compatibility for the programs.
  • With software-managed TLBs, a TLB miss generates a TLB miss exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The format of the TLB entry is defined as a part of the instruction set architecture (ISA).

note:

  1. With hardware TLB management, the hardware itself handles the lifecycle of TLB entries.
  2. Hardware TLB management raises a page fault that the OS must handle; the OS repairs the page tables (e.g., swapping the page in) so the missing translation can be loaded into the TLB, and then the program resumes.
  3. With hardware TLB management, the maintenance of TLB entries is invisible to software.
  4. Hardware TLB management can change from CPU to CPU without breaking program compatibility; the CPU just has to keep the architectural contract of raising a page fault exception for the OS to handle whenever no valid translation exists.
  5. Software TLB management raises a TLB miss exception, and the OS owns the responsibility of walking the page tables and performing the translation in software. The OS then loads the entry into the TLB and restarts the program from the faulting instruction (attention! restarted from that instruction, not resumed mid-way). See the sketch after these notes.
  6. Comparing the two: under hardware management the CPU walks the tables itself and raises a page fault only when no valid translation exists, while under software management the ISA must provide instructions for loading entries into any TLB slot, and the TLB entry format is defined as part of the ISA.
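
To make note 5 concrete, here is a schematic sketch of a software TLB miss handler. Nothing here is a real ISA: the entry format, the tlb_write primitive, and the toy page table are all invented for illustration (MIPS is the classic software-managed example, with its own formats and instructions).

#include <stdint.h>
#include <stdio.h>

#define PTE_VALID 0x1u
#define NPTES     16

struct pte { uint64_t pfn; uint32_t flags; };

static struct pte page_table[NPTES];      /* toy single-level page table */

static void tlb_write(uint64_t vpn, uint64_t pfn, uint32_t flags)
{
    /* On real hardware this would be a privileged TLB-write instruction. */
    printf("TLB <- vpn=%llu pfn=%llu flags=%x\n",
           (unsigned long long)vpn, (unsigned long long)pfn, (unsigned)flags);
}

/* Invoked by the CPU on a TLB miss exception; after it returns, the
   faulting instruction is restarted, not resumed mid-way. */
static void tlb_miss_handler(uint64_t faulting_va)
{
    uint64_t vpn = faulting_va >> 12;     /* assume 4 KiB pages */
    struct pte e = page_table[vpn % NPTES]; /* walk the table in software */

    if (e.flags & PTE_VALID)
        tlb_write(vpn, e.pfn, e.flags);   /* install translation, then restart */
    else
        printf("page fault at va=0x%llx\n", /* no valid mapping: OS fault path */
               (unsigned long long)faulting_va);
}

int main(void)
{
    page_table[1] = (struct pte){ .pfn = 42, .flags = PTE_VALID };
    tlb_miss_handler(0x1234);             /* vpn 1: found, loaded into the TLB */
    tlb_miss_handler(0x5234);             /* vpn 5: no mapping -> page fault */
    return 0;
}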

In most cases hardware TLB management is used, but according to the wiki some architectures use software-managed TLBs (MIPS is the classic example).

Typical TLB

These are typical performance levels of a TLB:

  • Size: 12 bits – 4,096 entries
  • Hit time: 0.5 – 1 clock cycle
  • Miss penalty: 10 – 100 clock cycles
  • Miss rate: 0.01 – 1% (20–40% for sparse/graph applications)

The average effective memory cycle rate is defined as m + (1-p)h + pm cycles, where m is the number of cycles required for a memory read, p is the miss rate, and h is the hit time in cycles. If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, a memory read takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate averages 30 + 0.99 * 1 + 0.01 * 30 = 31.29 clock cycles per memory access.
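
The same arithmetic, expressed directly in code:

#include <stdio.h>

/* Effective memory cycle rate: m + (1 - p) * h + p * m, where m is the
   cycles for a memory read, p the TLB miss rate, h the TLB hit time. */
static double effective_cycles(double m, double p, double h)
{
    return m + (1.0 - p) * h + p * m;
}

int main(void)
{
    /* m = 30, h = 1, p = 1% -> 30 + 0.99 * 1 + 0.01 * 30 = 31.29 */
    printf("%.2f cycles per memory access\n", effective_cycles(30.0, 0.01, 1.0));
    return 0;
}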

note: research TLB performance further

Use perf to measure TLB misses:

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses -p $PID

If your workload shows a high TLB miss rate, try huge pages: a larger page size means fewer TLB entries are needed to cover the same working set, which cuts the miss rate. But some applications are not suitable for huge pages, and more details need to be considered before adopting this solution.
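
As an example, here is a minimal sketch of explicitly requesting huge pages with mmap(MAP_HUGETLB) on Linux. It assumes the kernel has a reserved hugetlb pool (e.g., via /proc/sys/vm/nr_hugepages); without one, the call fails and the program must fall back to normal pages.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)           /* one 2 MiB huge page on x86-64 */

int main(void)
{
    /* MAP_HUGETLB requests huge pages from the reserved pool: one TLB
       entry then covers 2 MiB instead of 4 KiB, so the working set
       needs far fewer entries. */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");      /* no reserved huge pages, or no support */
        return 1;
    }
    memset(p, 0, LEN);                    /* touch the mapping */
    munmap(p, LEN);
    return 0;
}

Transparent huge pages (madvise with MADV_HUGEPAGE) are an alternative that avoids the reserved pool, at the cost of less predictable behavior.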

Address-space switch

After a process context switch, some TLB entries' virtual-to-physical mappings become invalid, because the new process uses different page tables. Several strategies exist for dealing with these stale entries (strategies 2 and 4 are illustrated in the sketch after this list):

  1. flush all TLB entries on every context switch
  2. tag each entry with the process it belongs to, so entries from other processes are simply ignored after a context switch
  3. some architectures use a single-address-space operating system, in which all processes share the same virtual-to-physical mapping
  4. some CPUs have a process-ID register, and the hardware uses a TLB entry only if the current process ID matches
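
Here is a hedged sketch of strategies 2 and 4 combined (the entry layout and names are invented for illustration): each TLB entry carries an address-space ID, the lookup ignores entries whose ASID does not match the current process, and a context switch only changes the current ASID instead of flushing the TLB.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    uint64_t vpn;                         /* virtual page number */
    uint64_t pfn;                         /* physical frame number */
    uint16_t asid;                        /* address-space ID (cf. x86 PCID) */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_asid;             /* updated on context switch */

/* Hit only when both the page number and the ASID match: stale entries
   belonging to other processes are simply ignored, never flushed. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ .vpn = 7, .pfn = 99, .asid = 1, .valid = true };

    uint64_t pfn;
    current_asid = 1;                     /* running process A */
    printf("A: %s\n", tlb_lookup(7, &pfn) ? "hit" : "miss");
    current_asid = 2;                     /* context switch to B: no flush needed */
    printf("B: %s\n", tlb_lookup(7, &pfn) ? "hit" : "miss");
    return 0;
}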

note:
Flushing the TLB is an important security mechanism for memory isolation. Memory isolation is especially critical during switches between the privileged operating system kernel process and the user processes – as was highlighted by the Meltdown security vulnerability[2]. Mitigation strategies such as kernel page-table isolation (KPTI) rely heavily on performance-impacting TLB flushes and benefit greatly from hardware-enabled selective TLB entry management such as PCID.

Virtualization and x86 TLB

With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and into ensuring better performance of virtual machines on x86 hardware.

One key addition is Intel's Extended Page Tables (EPT), which add hardware support for nested translation: the MMU walks both the guest and host page tables, so software-maintained shadow page tables are no longer required.

References

  1. https://en.wikipedia.org/wiki/Translation_lookaside_buffer
  2. https://en.wikipedia.org/wiki/Meltdown_(security_vulnerability)