UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All



Paper Title: UNified Instruction/Translation/Data (UNITD) Coherence: One Protocol to Rule Them All
Link: https://ieeexplore.ieee.org/document/5416643
Venue: HPCA 2010
Keyword: TLB; Coherence; TLB Shootdown




This paper presents UNITD, a unified mechanism that maintains coherence for both caches and Translation Lookaside Buffers (TLBs) using the classical cache coherence infrastructure. Traditionally, TLBs, unlike caches, are not kept coherent by hardware, for three reasons. The first is that although cache coherence is important in most cases, since data sharing between processors is very common, TLB entries only need to be invalidated when the underlying page table entries (PTEs) are modified, which is comparatively rare. Maintaining coherent TLBs across processors may therefore not justify the extra hardware and design effort required to implement it. The second is that TLB modifications can be classified as “safe” and “unsafe”. “Safe” modifications are those that initialize a previously invalid virtual address (VA) mapping, or that increase the permissions of an existing mapping. In these two cases, the modifying processor does not have to notify other processors that the change has been made. The OS handles this lazily: when another processor raises an exception on accessing the page through its out-of-date private TLB entry, the OS checks the PTE, detects that the exception is spurious, and simply updates the TLB entry without taking further action. For “unsafe” modifications, however, the OS must always invalidate potentially cached copies of the PTE in remote TLBs. The last reason is that TLB coherence is already well handled by existing software, i.e., operating systems; even if hardware coherence were implemented, people would be reluctant to migrate to the new method, because the old approach works very well.

The software approach that maintains TLB coherence is called “TLB shootdown”. A TLB shootdown algorithm generally involves one processor setting up a list of sharers for a certain PTE and triggering Inter-Processor Interrupts (IPIs) to all sharers. Each sharer of the PTE invalidates the corresponding TLB entry and removes itself from the list. Once all sharers have removed themselves, the initiating processor can proceed to modify the page table entry. A per-PTE lock is also acquired during the modification process to synchronize processors that wish to modify the same entry.
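The steps above can be sketched as a toy sequential model (all names are illustrative, not from the paper; a real kernel delivers IPIs to interrupt handlers rather than making direct calls):

```python
class ShootdownModel:
    """Toy sequential model of software TLB shootdown."""

    def __init__(self, num_cpus):
        self.page_table = {}                            # va -> PTE
        self.tlbs = [dict() for _ in range(num_cpus)]   # per-CPU cached translations
        self.sharers = {}                               # va -> set of CPUs caching it

    def tlb_fill(self, cpu, va):
        # A TLB miss reads the PTE and records this CPU as a sharer.
        self.tlbs[cpu][va] = self.page_table[va]
        self.sharers.setdefault(va, set()).add(cpu)

    def modify_pte(self, initiator, va, new_pte):
        # 1. "Send an IPI" to every sharer (modeled as a direct call);
        #    each sharer invalidates its TLB entry and leaves the list.
        for cpu in self.sharers.get(va, set()):
            self.tlbs[cpu].pop(va, None)
        self.sharers[va] = set()
        # 2. Only after the sharer list drains may the PTE be modified.
        self.page_table[va] = new_pte
```

After `modify_pte` returns, no TLB holds the stale translation, which is exactly the invariant the shootdown establishes before the PTE write; the per-PTE lock from the text is omitted because the model is sequential.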

The assumption that software TLB coherence is sufficient, however, is likely to break down in the coming years as more and more cores are integrated onto a chip. The study conducted by this paper, using both synthetic workloads and real-world applications, suggests that as the number of cores keeps increasing, the performance impact of software TLB shootdown becomes more significant. The paper also argues that reusing the existing cache coherence hardware to enforce coherent TLBs is a feasible choice. We describe the details as follows.

One major difficulty of maintaining TLB coherence via cache coherence is associating each TLB entry with a physical address on which the notion of coherence is defined. One straightforward choice, which is also the one adopted by this paper, is to use the physical address of the PTE, since the TLB can be treated as a special kind of cache whose contents are derived from PTEs in physical memory. When another processor modifies a PTE in its local cache, and the PTE is cached elsewhere in a TLB, the cache line containing the PTE will be in the Shared state. An invalidation is sent as usual by the cache controller to acquire exclusive ownership of the line. On receiving this request, a processor invalidates not only the entry in its local cache, but also the matching TLB entry, if one exists.

When the hardware page walker fills the TLB, the physical address of the PTE is generated automatically during the walk and is associated with the newly inserted TLB entry. If the TLB is filled by software, the paper recommends adding a new instruction to the ISA that allows the OS to inject the physical address of the PTE into a TLB entry. Also note that since cache coherence is maintained at cache-line granularity (e.g., 64 bytes), several PTEs may reside in the same cache line. In that case, they share the same physical line address in the TLB, and a modification to any of them invalidates all of the corresponding TLB entries. In practice, such aliasing is relatively rare, and the performance impact is minimal.
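To make the aliasing concrete, here is a small sketch (the 64-byte line and 8-byte PTE sizes are assumptions, matching common x86-64 configurations): eight consecutive PTEs map to the same line address and therefore share one coherence tag.

```python
CACHE_LINE = 64   # coherence granularity assumed in the discussion, in bytes
PTE_SIZE = 8      # size of one PTE; 8 bytes on x86-64 (an assumption here)

def line_addr(pte_pa):
    """Physical address of the cache line that holds a PTE."""
    return pte_pa & ~(CACHE_LINE - 1)

# Eight consecutive PTEs fall into one 64-byte line, so they alias:
ptes = [0x4000 + i * PTE_SIZE for i in range(8)]
assert all(line_addr(pa) == 0x4000 for pa in ptes)
# The ninth PTE starts the next line:
assert line_addr(0x4000 + 8 * PTE_SIZE) == 0x4040
```

A store to any of the eight PTEs triggers an invalidation for line `0x4000`, and every TLB entry tagged with that line address is dropped, which is the over-invalidation the text describes.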

The storage of physical addresses in the TLB requires careful design. The hardware mapping structure must have two properties: first, given a physical address (of a PTE), it should determine whether an entry containing the same address exists; second, if an entry is found, the structure should identify which TLB entry is associated with it. Satisfying the second requirement is easy, because we can just add a pointer to the TLB entry. Satisfying the first, however, is non-trivial if the TLB is set-associative. A set-associative TLB uses virtual addresses to select the set, as in a set-associative cache, while the mapping structure must be fully associative, because it is queried with physical addresses when a coherence request is received. The paper suggests implementing the structure as a Content Addressable Memory (CAM), called the PCAM. On receiving an invalidation request, the processor queries the PCAM to see if any of its addresses matches the one in the coherence request. If the search produces a match, the corresponding TLB entry is identified and invalidated. Of course, if the TLB is fully associative, the PCAM has exactly the same organization as the TLB, and there is no need to store the pointer, as it is implicitly given by the index of the PCAM entry.
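A rough functional sketch of the PCAM follows (names and the `(set, way)` pointer encoding are illustrative; a dict stands in for the parallel match that real CAM hardware performs against every entry at once):

```python
class PCAM:
    """Toy PCAM: maps a PTE's physical line address to TLB entry pointers."""

    def __init__(self):
        self.entries = {}   # line address -> set of (set, way) TLB pointers

    def insert(self, line_addr, tlb_ptr):
        # Filled alongside the TLB when a translation is inserted.
        self.entries.setdefault(line_addr, set()).add(tlb_ptr)

    def match(self, line_addr):
        # Query issued when a coherence invalidation arrives.
        return self.entries.get(line_addr, set())

    def invalidate(self, line_addr):
        # Drop all matches and return them so the TLB entries can be cleared.
        return self.entries.pop(line_addr, set())
```

A set of pointers per address captures the aliasing case: several PTEs in one line mean several TLB entries hang off a single PCAM match.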

The PCAM only stores the physical addresses of leaf-level PTEs, even if the page table is organized as a multi-level radix tree. In some cases, the OS may modify only upper-level PTEs, e.g., when the permissions of a large address range are changed during a fork(), in which case there is no coherence traffic on the leaf PTEs. If the leaf PTEs affected by the modification are cached by one or more TLBs, they will not be invalidated, and hence access violations cannot be detected. To solve this problem, the OS is required to propagate its changes of upper-level PTEs down to the leaf level, forcing coherence traffic that invalidates all potentially affected leaf PTEs. As an optimization, the OS can choose to do this only for unsafe changes.
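A minimal sketch of this OS-side requirement, with the optimization included (the function and its arguments are hypothetical; a real kernel would rewrite the in-memory leaf PTEs so that each store generates coherence traffic visible to remote PCAMs):

```python
def apply_upper_level_change(leaf_ptes, change, unsafe):
    """Propagate an upper-level PTE change by rewriting every affected
    leaf PTE; each rewrite is a store that remote PCAMs observe.
    Safe changes may be deferred, per the paper's optimization."""
    if not unsafe:
        return leaf_ptes                    # lazy: stale entries fault spuriously later
    return [change(pte) for pte in leaf_ptes]
```

For an unsafe change such as a permission downgrade, every leaf PTE under the modified upper-level entry is touched; for a safe change, stale TLB entries are allowed to linger and are fixed up on the resulting spurious faults.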

Another potential problem is self-invalidation. This happens when a processor holds a PTE in its local cache in the Modified or Exclusive state and at the same time has a TLB entry for the same PTE. When the processor writes to the PTE, no coherence traffic is generated, yet the TLB still needs to be aware of the write and either update or invalidate the entry. The solution is to let the TLB snoop local stores. If the address of a local store instruction matches any of the entries in the PCAM, the corresponding TLB entry is invalidated. Such self-snooping capability is cheap to add, as it already exists on modern hardware to accommodate self-modifying code and to maintain coherence between the instruction cache and the data cache.
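The self-snoop path can be sketched as follows (a plain dict models the PCAM here so the snippet stands alone; the function name is illustrative):

```python
def self_snoop(store_pa, pcam, tlb, line_size=64):
    """Self-snoop on a local store: if the written cache line matches a
    PCAM entry (pcam: line address -> list of TLB indices), invalidate
    the backing TLB entries even though no coherence message is sent."""
    line = store_pa & ~(line_size - 1)
    for idx in pcam.pop(line, []):
        tlb[idx] = None     # invalidate the TLB entry
```

The key point is that the check keys on the store's physical address, so a write anywhere within the line holding the PTE clears the cached translation, mirroring the cache-line granularity of the coherence protocol.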

In order for this scheme to work with load speculation, the load queue must snoop not only store addresses but also TLB invalidations caused by store instructions. Imagine the following scenario: a load bypasses a preceding store and executes speculatively. The store changes the permission bits in a PTE to disallow reads of a page, which happens to be the page that the load accesses. If care is not taken, the load would commit successfully without raising the exception that in-order execution would have produced. The processor therefore must monitor TLB invalidations while load instructions are buffered in the load queue: if a buffered load's address translation is invalidated by a preceding store, the load must be squashed. Another solution is simply to re-execute the load before it commits in program order; if the load cannot execute successfully, it is squashed.
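The first solution, the conservative load-queue check, can be sketched like this (function name, the dict-based load records, and the 4 KiB page assumption are all illustrative):

```python
def filter_loads_on_tlb_invalidation(load_queue, invalidated_vpns, page_shift=12):
    """Split buffered loads into survivors and loads that must be squashed
    because the translation of their page was invalidated by an older store."""
    survivors, squashed = [], []
    for load in load_queue:
        vpn = load["va"] >> page_shift
        (squashed if vpn in invalidated_vpns else survivors).append(load)
    return survivors, squashed
```

Squashing by virtual page number is conservative: a load may be squashed even if the new PTE would still permit the access, but this errs on the safe side, and the alternative of re-executing the load at commit recovers the precision at the cost of an extra TLB access.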