Scalable Distributed Last-Level TLBs Using Low-Latency Interconnects
Link: https://ieeexplore.ieee.org/document/8574547
Year: MICRO 2018
Keyword: TLB; NOCStar; NOC
Highlight:
- Very detailed description of the interconnection network based on MUX’ed inputs and outputs
Questions:
- I don't quite get the calculation of the arbitration signal lines.
- To my understanding, the TLB shootdown algorithm only works if there are separate instructions for L1 and L2 TLB flushes, because otherwise all processors still have to flush their own L1 TLBs (which cannot be done by a remote core); and if the same instruction that flushes the L1 TLB also flushes the L2 TLB, then this optimization will not work at all. I would imagine that in a future ISA, the TLB flush instruction only flushes the L1, and separate instructions are needed for L2 flushes. The manager cores would use such an instruction to flush all addresses affected by the page table modification, after receiving the forwarded requests from other cores.
This paper proposes Nocstar, a novel design for fast access to a distributed L2 TLB. Traditionally, to perform address translation, when the L1 TLB misses, a page walk is initiated to query the in-memory page table. Due to the high latency of memory accesses and the long sequence of dependent operations during the page walk, an L2 TLB is added to reduce the number of misses that result in costly page walks.
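To make the lookup path concrete, here is a minimal sketch of the L1 TLB -> L2 TLB -> page walk flow described above. The hash-map TLBs and the `page_walk()` helper are illustrative assumptions, not the paper's hardware.

```cpp
// Sketch of the translation path: hit in L1, else L2, else a costly page walk.
#include <cstdint>
#include <unordered_map>
#include <iostream>

using VPN = uint64_t;  // virtual page number
using PPN = uint64_t;  // physical page number

struct TLB {
    std::unordered_map<VPN, PPN> entries;
    bool lookup(VPN vpn, PPN &ppn) const {
        auto it = entries.find(vpn);
        if (it == entries.end()) return false;
        ppn = it->second;
        return true;
    }
};

// Hypothetical page walk: in real hardware this is a multi-level,
// memory-latency-bound traversal of the page table.
PPN page_walk(VPN vpn) { return vpn + 0x1000; }

PPN translate(TLB &l1, TLB &l2, VPN vpn) {
    PPN ppn;
    if (l1.lookup(vpn, ppn)) return ppn;  // fast path: L1 TLB hit
    if (l2.lookup(vpn, ppn)) {            // L2 TLB hit avoids the walk
        l1.entries[vpn] = ppn;            // refill L1
        return ppn;
    }
    ppn = page_walk(vpn);                 // miss: long chain of dependent accesses
    l2.entries[vpn] = ppn;
    l1.entries[vpn] = ppn;
    return ppn;
}

int main() {
    TLB l1, l2;
    std::cout << std::hex << translate(l1, l2, 0x42) << "\n";  // miss path
    std::cout << std::hex << translate(l1, l2, 0x42) << "\n";  // L1 hit
}
```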
The design of the L2 TLB has become a problem, since the trade-off between access latency and capacity that applies to caches also applies to the L2 TLB (which is organized similarly to a cache). This paper discusses three existing L2 TLB designs: private, monolithic shared, and distributed shared. All three have problems that can prevent them from being practical in future large systems. In the private design, since each core can keep its own copy of an entry in its local TLB, storage is wasted, because entries are duplicated across many L2 TLBs. In addition, the capacity of each private TLB is fixed at manufacturing time. At run time, load imbalance can occur, resulting in a scenario where some TLBs are not large enough to hold their working set while other TLBs are underutilized. Such static allocation of resources may harm performance by limiting the maximum working set size at each core. A shared monolithic L2 TLB, while solving the entry duplication and resource allocation problems, on the other hand suffers from high access latency. The paper points out that on the Skylake architecture, the L2 TLB has 1536 entries and takes 9 cycles to access. Given that address translation is performed for every memory instruction, this is a disadvantage especially for applications whose working sets are larger than the L1 TLB, offsetting the higher hit rate of the shared L2 TLB. The last design, the distributed shared L2 TLB, divides the monolithic TLB into small slices. Each core is responsible for one slice, and addresses are mapped to slices using a hash function (not necessarily just the middle bits of the address). If an address maps to a remote TLB slice, the processor relies on the on-chip network to send the request and receive the response. This design has the obvious advantage that accesses to the local TLB slice are as fast as a private TLB, while also being scalable as the number of cores increases. The problem, however, is that the distributed design incurs network communication every time a remote access is required, which increases both network traffic and access latency.
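A small sketch of how a virtual page number might be mapped to a slice, and how a core decides between a local and a remote access. The XOR-folding hash and the slice count are assumptions for illustration; the paper only states that a hash function (not necessarily the middle address bits) is used.

```cpp
// Map a virtual page number to one of the distributed L2 TLB slices.
#include <cstdint>
#include <iostream>

constexpr unsigned kNumSlices = 16;  // assumed: one slice per core

unsigned slice_for(uint64_t vpn) {
    // Fold the VPN so that all bits influence the slice index,
    // spreading consecutive pages across slices to reduce hot spots.
    uint64_t h = vpn ^ (vpn >> 16) ^ (vpn >> 32);
    return static_cast<unsigned>(h % kNumSlices);
}

int main() {
    unsigned local_core = 3;
    uint64_t vpn = 0x7f1234abcULL >> 12;  // example virtual page number
    unsigned slice = slice_for(vpn);
    if (slice == local_core)
        std::cout << "local slice access\n";
    else
        std::cout << "remote access to slice " << slice << " over the NoC\n";
}
```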
Nocstar, which stands for “NOCs for Scalable TLB Architecture”, uses a dedicated, circuit-switched network to access remote TLB slices. Nocstar has a different design philosophy from typical NOCs, in which messages are relayed by intermediate nodes in the form of packets and flits. The design decision is based on an important observation made by the paper: in general, L2 TLB requests are latency sensitive for the reasons mentioned in the previous paragraph, and therefore must obey very strict timing constraints. On the other hand, the degree of concurrency, i.e., the number of requests already waiting at a slice when another request is sent to it, is low. Only a small percentage of requests are processed by a slice while other requests are outstanding. This observation calls for a design with low latency for performance and low bandwidth for cost.
Nocstar features circuit switching, in which a connection is set up before transmission can begin. To achieve this, each core, as a relaying node, has an arbiter connected to it. Each arbiter is connected to the other cores via two lines: one used to sense request signals, and the other used to deliver the grant or deny reply signal. Before a TLB read request can be sent to a remote slice, the local TLB controller must set up the path by arbitrating with all other cores on the path. The TLB controller first computes a route from the requestor to the destination, and then signals all cores on this path for a transmission. Concurrent requests from other cores may attempt to acquire the same node, which is what the arbiters are designed for: each arbiter, as the name suggests, grants exactly one request and signals a grant message back to that requestor. Transmission may begin only after grant signals from all nodes on the path have been received; otherwise, the requestor retries in the next cycle until arbitration succeeds. The winner of arbitration is granted the next cycle to transmit its request to the destination. Internally, the transmission nodes are implemented as a set of multiplexers: inputs are mux’ed from several source directions and routed to another direction, where they are mux’ed again. Neither buffering nor forwarding happens within a node.
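A minimal behavioral sketch of the path setup step described above: the requestor asks every node on the route for a grant and may transmit only if all of them grant in the same cycle, otherwise it backs off and retries. The data structures are assumptions for illustration, not the paper's arbiter circuit.

```cpp
// All-or-nothing arbitration over every node of a precomputed route.
#include <vector>
#include <optional>
#include <iostream>

struct Arbiter {
    std::optional<int> granted_to;  // at most one winner per cycle
    bool request(int core) {
        if (!granted_to) { granted_to = core; return true; }  // grant
        return granted_to == core;                            // deny others
    }
    void release_if(int core) { if (granted_to == core) granted_to.reset(); }
};

// Returns true if 'core' won arbitration on every node of 'route' this cycle.
bool try_setup_path(std::vector<Arbiter> &nodes,
                    const std::vector<int> &route, int core) {
    bool all_granted = true;
    for (int n : route)
        all_granted &= nodes[n].request(core);
    if (!all_granted)            // lost somewhere: give back partial grants
        for (int n : route)      // and retry in the next cycle
            nodes[n].release_if(core);
    return all_granted;
}

int main() {
    std::vector<Arbiter> nodes(9);          // assumed 3x3 mesh of mux nodes
    std::vector<int> route = {0, 1, 2, 5};  // requestor 0 -> destination slice 5
    int retries = 0;
    while (!try_setup_path(nodes, route, /*core=*/0)) ++retries;
    std::cout << "path granted after " << retries << " retries\n";
}
```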
The next step after arbitration is transmission. Transmission over a circuit-switched network is very fast because there is no intermediate buffering or forwarding. As long as the signal can propagate to the destination within one cycle, the transmission completes in a single cycle (if not, some relaying is needed). This is exactly what Nocstar is trying to achieve: low-latency transmission without high bandwidth support (since requests cannot share the same node, unlike packet switching).
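A back-of-the-envelope check of the single-cycle claim: with no buffering, the request only needs to propagate through a chain of muxes and wires within one clock period. All delay numbers below are assumptions for illustration, not figures from the paper.

```cpp
// Does a route of N mux hops fit within one clock period?
#include <iostream>

int main() {
    const double clock_period_ps = 500.0;  // assumed 2 GHz core clock
    const double per_hop_delay_ps = 45.0;  // assumed mux + wire delay per node
    const int hops = 8;                    // example route length

    double path_delay = hops * per_hop_delay_ps;
    if (path_delay <= clock_period_ps)
        std::cout << "fits in one cycle (" << path_delay << " ps)\n";
    else
        std::cout << "needs a relay / extra cycle (" << path_delay << " ps)\n";
}
```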
On receiving the remote request, the TLB controller latches the request in its internal buffer and begins accessing the local TLB slice. If the access is a hit, the destination TLB controller similarly sets up a path back to the requestor and signals success together with the entry data. If, on the other hand, the access results in a miss, there are two options: either the destination replies with a negative signal to the requestor, indicating that it should conduct the page walk on its own, or the destination TLB controller performs the page walk and then replies. The paper points out that the second option may cause load imbalance if many cores happen to request from the same slice at roughly the same time, causing several page walk requests to queue up at a single site.
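The two miss-handling options can be sketched as follows: (a) reply with a negative acknowledgement so the requestor walks the page table itself, or (b) walk at the destination and reply with the entry. The message format and the `page_walk()` placeholder are hypothetical.

```cpp
// Remote-slice request handling: hit replies with the entry; a miss either
// NACKs (requestor walks) or walks at the destination (risking queueing).
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <iostream>

struct Reply {
    bool hit;
    std::optional<uint64_t> ppn;  // empty on a negative reply
};

std::unordered_map<uint64_t, uint64_t> slice;  // this core's L2 TLB slice

uint64_t page_walk(uint64_t vpn) { return vpn + 0x1000; }  // placeholder

Reply handle_remote_request(uint64_t vpn, bool walk_at_destination) {
    auto it = slice.find(vpn);
    if (it != slice.end())
        return {true, it->second};         // hit: send entry back
    if (!walk_at_destination)
        return {false, std::nullopt};      // option (a): requestor walks itself
    uint64_t ppn = page_walk(vpn);         // option (b): walk here, but many
    slice[vpn] = ppn;                      // walks may pile up at one slice
    return {true, ppn};
}

int main() {
    Reply r = handle_remote_request(0x42, /*walk_at_destination=*/false);
    std::cout << (r.hit ? "hit\n" : "negative reply, walk locally\n");
}
```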
A sliced L2 TLB also implies an OS design change by complicating TLB shootdown, which happens when a page table entry is modified and cached TLB entries become stale. With private TLBs, cores that may cache the modified entry are notified via an Inter-Processor Interrupt (IPI), which forces each processor to stop and flush its TLB. With a sliced shared L2 TLB, if multiple processors are notified via IPI and they all issue TLB flush instructions, all of these flushes are sent to the corresponding slice, resulting in temporary congestion on the network. Recall that the network is not designed with bandwidth in mind, so congestion may aggravate the TLB shootdown latency, which is already problematic for some workloads, causing performance degradation. The paper suggests that, instead of every core sending an invalidation to the L2 TLB using TLB invalidation instructions, only a small number of cores be designated responsible for L2 TLB flushes. The other cores forward their requests to these cores, which then perform a collective shootdown on their behalf.
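A sketch of that shootdown optimization: instead of every core issuing its own L2 slice invalidation, cores forward the invalidation to a small set of designated cores that flush the slices on everyone's behalf. The manager selection and the message shape are assumptions for illustration; each core still flushes its private L1 TLB locally on the IPI.

```cpp
// Forward L2-slice invalidations to designated manager cores.
#include <cstdint>
#include <set>
#include <vector>
#include <iostream>

constexpr int kNumCores = 16;
const std::set<int> kManagers = {0, 8};  // assumed designated flush cores

// Called in each core's IPI handler: flush local L1, then forward the L2
// invalidation to a manager instead of sending it over the NoC directly.
void on_shootdown_ipi(int core, uint64_t vpn,
                      std::vector<std::vector<uint64_t>> &forward_queues) {
    (void)core;  // local L1 flush not modeled here
    int manager = *kManagers.begin();        // pick any manager (assumption)
    forward_queues[manager].push_back(vpn);  // forward instead of flooding NoC
}

// The manager performs one collective shootdown on behalf of all requestors.
void manager_flush(int manager, std::vector<std::vector<uint64_t>> &queues) {
    for (uint64_t vpn : queues[manager])
        std::cout << "manager " << manager << " invalidates slice entry for vpn 0x"
                  << std::hex << vpn << std::dec << "\n";
    queues[manager].clear();
}

int main() {
    std::vector<std::vector<uint64_t>> queues(kNumCores);
    for (int core = 1; core < 4; ++core)
        on_shootdown_ipi(core, 0x42, queues);
    manager_flush(0, queues);  // one collective shootdown instead of three
}
```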