LATR: Lazy Translation Coherence
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?id=3173198
Year: ASPLOS 2018
Keyword: TLB Shootdown
This paper proposes LATR, a software TLB coherence scheme that overcomes the overhead of traditional TLB shootdown. Software-based TLB shootdown has been widely adopted by Operating Systems as a way of maintaining coherence among private TLBs. Just like cache coherence, when a processor modifies the page table either by changing the virtual-to-physical mapping or by altering the permissions bits in a page table entry (PTE), these modifications need to propagate to other processors that have cached a private copy of the PTE to avoid accessing the wrong physical page or accessing the page with wrong permission. Past empirical experience suggests that, compared with cache coherence, TLB coherence events happen relatively infrequently, e.g. during fork() system call, it is hence by design that hardware does not maintain the coherence of TLB entries.
The classical algorithm of enforcing TLB coherence is called TLB shootdown. It takes advantage of Inter-Processor Interrupts (IPI) to deliver asynchronous events to remote processors. We describe the shootdown algorithm as follows. First, the initiator acquires a lock on the page table entry to avoid multiple concurrent modifications and shootdown on the same entry. Then the initiator builds a list of remote processors according to the usage of the page table. The current algorithm that Linux uses is that all processors that have used the page table will be included in the list. The list will be stored in a well-known location in the memory, and is accessible to all processors in the system. The initiator processor then sends IPI to remote processors on the list. On receiving an IPI indicating TLB shootdown, the remote processor checks the list to see if it is included. If yes, then the receiving processor invalidates the TLB entry, and acknowledges the completion of the shootdown operation. In the meanwhile, the initiator spins on a set of flags which indicate whether all processors have finished the process. The flags will be set to true if the corresponding processor has completed shootdown. The last step for the initiator processor is to actually conduct the modification, and unlocks the PTE.
Recent evidences show, however, that TLB shootdown actually has an non-negligible impact on performance. This paper suggests that the slowdown caused by excessive shootdowns has two major sources. The first is frequent memory management events, such as mmap() and munmap(). An example is Apache web server, in which a HTTP request is processed by mapping the file into memory, and then um-mapping it after serving the file. The munmap() system call changes the mapping of a virtual address range, and since the change is unsafe, i.e. they cannot be processed lazily, the TLB shootdown must be carried out immediately. The paper also identifies page migration, which is part of the AutoNUMA scheduling package, as another source of frequent shootdown. AutoNUMA migration is triggered by a background daemon process to determine whether a page should be moved to a NUMA node closer to the processor that frequenly refers to it. The daemon process periodically sets the permission of pages under consideration to non-readable and non-writable. If a processor accesses the page, a page fault will be raised. During the handling of the page fault, if the access happens again for several times, then the page will be marked for migration. Since raising the permission of a page is also an unsafe change, TLB shootdown must be performed before the the daemon process modifies the page table.
LATR attempts to reduce the overhead of TLB shootdown from three different aspects. First, to start the shootdown, the initiator will send IPI to remote processors. Both the delivery and the processing of the interrupt is time consuming, and can pollute the cache. Second, during a shootdown, the initiator spins on a set of flags to wait for other processors to complete. The time taken spinning on the flags is wasted because the processor could have been doing useful work. The last problem is that a synchronous shootdown algorithm can add extra variation to the latency of normal operations. In time critical applications, the extra latency can be fatal and decreases the applicability of the shootdown algorithm.
LATR features lazy handling of TLB coherence by delaying the actual delivery of the coherence event. In the example of memory reclamation, LATR puts both the physical pages and the virtual address range into a reclamation list called the lazy virtual address list, instead of releasing the address spaces immediately. In order to determine whether a TLB shootdown is necessary, each processor has 64 LATR states, which records the information about a shootdown if it has been scheduled. The LATR states are enumerated as follows: (1) A flag to indicate whether the entry contains valid information; (2) An address range which indicates the range of virtual addresses the shootdown should invalidate; (3) A bit mask describing a subset of processors that should react to the shootdown request; (4) A reason field that encodes the purpose of the shootdown (e.g. munmap(), NUMA migration test, etc.). On context switch or scheduler tick, the handler scans LATR states located in the kernel. The scan is very efficient because LATR states are statically allocated as a consecutive range of memory, and hence the scan benefits greatly from caching and pre-fetching. If the processor finds out that it is on the list of some other processor’s shootdown request, an invalidation will be performed, and the corresponding bit on the bit mask will be cleared. The freed memory addresses can only be released after all processors on the list completes their shootdown.
On Linux platform, a scheduler tick is usually delivered every 1ms. This places an upper bound on the maximum amount of time a freed memory block can be actually released, because every processor is expected to receive a scheduler tick in every 2ms no matter what their current states are, and inside the handler of the scheduler tick, TLB shootdown will be performed. A background thread is therefore scheduled 2ms after a block is unmapped from virtual address space. The background thread tests whether the LATR states have been all reset, and then releases the block.
AutoNUMA migration test can be done in a similar way. Instead of unmapping the page before adding the shootdown request to LATR states, in AutoNUMA, the daemon thread must add the shootdown request to the LATR state first, come back after 2ms, and then modify the page table. Every memory access on the page will incur a page falut after this point.
Subtle race conditions may arise as a result of lazy handling of TLB coherence events. For example, if the application program frees a range of virtual address by calling munmap(), and later on expect accesses from other threads on the freed page to trigger a page fault, it may fail to observe the falut. This is because unmappings of virtual addresses are now handled lazily. Before the thread on another processor can have a change to scan LATR states on scheduler ticks, its locally cached PTE entries will not be updated and hence contains out-of-date information. The thread is able to access the freed page as usual, although both reads and writes are harmless and will not corrupt states since the physical page will not be released before all processors have acknowledged the shootdown request. If this feature must be implemented, LATR will fall back to using IPI for request delivery.