Understanding TLB Simulation in Samba (SST)
Samba Overview
Samba (defined in class Samba
) is the TLB simulation implementation in SST. It consists of four (sub)components.
The first is the TLB unit (class TLB), which represents a single level of the TLB, potentially supporting multiple page sizes, where each size class has its own separate lookup structure.
The second is the TLB hierarchy (class TLBhierarchy
) which serves as a container class for multi-level TLBs
and the page walker, and defines the interface for communicating with the CPU and the cache hierarchy.
The third is the page table walker (class PageTableWalker), which is responsible for simulating the memory accesses issued when a translation request misses all levels of the TLB hierarchy; it also tracks the physical addresses of all known page table entries at all levels (when emulate_faults is turned on). The fourth is the page fault handler (class PageFaultHandler), which the page table walker uses to emulate the OS physical page allocator; it is currently only a stub.
All Samba components perform standard discrete event simulation using the SST simulation infrastructure, including clocks that tick periodically, and links that deliver event objects to a registered callback function with an optional delay.
Components are driven by the clock, which is registered with the tick()
function as a call back.
Components may also communicate with each other or even with themselves by sending messages via the link objects.
At the functional level, the Samba TLB sits between the CPU and the memory hierarchy. CPU-side memory access requests are always handled first as translation requests, which, after translation concludes, are forwarded to the memory hierarchy as access requests. Memory responses, on the other hand, simply pass through the Samba TLB and are relayed directly back to the CPU. The Samba TLB itself may also issue memory requests for page table walks using a separate TLB-memory channel.
The Samba TLB simulator implements the basic functionality, such as translation lookup, page table walking, and page walk caches. It is, however, still incomplete: it does not implement page fault emulation (class PageFaultHandler is only a stub) or TLB entry invalidation (i.e., TLB shootdown), and there is no notion of contexts (e.g., PCID/ASID).
Due to the lack of information on the actual page table content and OS internal state, Samba keeps track of the set of mapped virtual pages in a demand-driven manner: initially, no page is assumed to be mapped; as translation requests are serviced, the virtual page address is added to the set of mapped addresses, and the OS's physical page allocation is emulated with a page fault handler (the actual page fault handling is still simulated as part of the OS's control flow; here we only emulate the physical page allocation process).
As a consequence, the memory accesses issued by the page walker are likely to use addresses that differ from those the simulated OS would use if it were executed in a non-simulated environment.
The Single-Level Multiple-Size TLBUnit
class TLB
defines a set-associative, single-level TLB that supports multiple page sizes. Each page size is hosted by
a standard implementation of a set-associative array (in data member tags
), and different sizes use disjoint arrays.
This class models functional features of a TLB array, such as lookup, insert, and replacement, as well as the timing
traits, such as tag lookup latency, fixed service bandwidth per cycle, and deduplication of requests to the same
address.
A multi-level TLB hierarchy (as is the case with most architectures nowadays) can be formed by connecting several class TLB objects using direct pointers (data member next_level; note that this is not a class Link, but a plain object pointer).
The TLB hierarchy is connected with the CPU and the page walk cache via link objects, which is implemented in
class TLBhierarchy
, and not discussed here.
Note that in the entire Samba implementation, level begins at one, and a lower level means closer to the CPU.
The Set-Associative Array
Each class TLB
object implements a few size classes, the number of which is stored in data member sizes
and is
specified in the configuration using the key sizes_L%d
(%d
is the level in the TLB hierarchy - same for other
configuration keys, and we do not repeat this below).
Each size class has its own tag array; the size, associativity, and page size of a class are specified in the configuration using keys size%d_L%d, assoc%d_L%d, and page_size%d_L%d, respectively, where the first %d is the ID of the size class. The size class ID is specified in the configuration file and has no special meaning other than serving as a unique identifier. Page sizes are given in KB (1024-based) in the configuration and stored internally (in the array corresponding to the page_size%d_L%d keys) in bytes.
The class constructor also derives the number of sets for each size class, and a mapping from page size class (in KB)
to the ID used in this class (i.e., the reverse mapping of page_size%d_L%d
array), which are stored in data members
sets
and SIZE_LOOKUP
, respectively.
For each size class, the object maintains three arrays storing the relevant information: tags, valid, and lru, the meanings of which are straightforward. They are all three-dimensional arrays, where the
first dimension is the page size class ID (i.e., output of SIZE_LOOKUP
), the second dimension is the set number,
and the last dimension is the way number.
The standard access functions for set-associative arrays are defined.
insert_way()
inserts a new tag into a given way of a given size class.
invalidate()
clears the valid
bit of all entries in all size classes, and is used in TLB shootdowns (the id
argument is redundant, and note that the vadd
argument is a page number, not a full address).
check_hit()
performs lookup and signals hits or misses on a particular size class.
find_victim_way()
returns the way number of the LRU entry on a particular size class.
update_lru()
promotes a given way to the MRU position on a particular size class.
Note that TLBs implement precise LRU as the replacement algorithm. Each entry in a set is assigned an LRU counter
representing its position in the LRU stack (stored in data member lru
).
The LRU entry is the one whose counter equals (number of ways - 1), and the MRU entry has a counter of zero. An entry is promoted to MRU by incrementing by one all counters whose values are smaller than the entry's current counter, and then setting the entry's counter to zero.
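To make the bookkeeping concrete, here is a minimal, self-contained sketch of this precise-LRU scheme. The names (lru, assoc, find_victim_way, update_lru) mirror the text, but the code is illustrative and not taken from the Samba sources.

```cpp
#include <vector>

// One set of a set-associative array with precise LRU.
// lru[way] is the way's position in the LRU stack: 0 = MRU, assoc-1 = LRU.
struct LruSet {
    int assoc;                 // number of ways in the set
    std::vector<int> lru;      // per-way LRU counters

    explicit LruSet(int ways) : assoc(ways), lru(ways) {
        for (int w = 0; w < ways; ++w) lru[w] = w;   // arbitrary initial order
    }

    // The LRU victim is the way whose counter equals (assoc - 1).
    int find_victim_way() const {
        for (int w = 0; w < assoc; ++w)
            if (lru[w] == assoc - 1) return w;
        return 0;              // unreachable if the counters are consistent
    }

    // Promote a way to MRU: increment by one every counter smaller than the
    // accessed way's counter, then set the accessed way's counter to zero.
    void update_lru(int way) {
        const int old = lru[way];
        for (int w = 0; w < assoc; ++w)
            if (lru[w] < old) ++lru[w];
        lru[way] = 0;
    }
};
```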
TLB Event Handling
class TLB
has only one event handling function, namely, tick()
. This function is not registered to the clock; instead, it is merely called by the tick() function of its containing class, class TLBhierarchy.
The tick() function simulates the parallel operation of all buffers and processing logic in the TLB unit; we discuss these steps below in the order in which they appear in the source code.
Response Handling
The response handling logic reads event objects from the queue pushed_back (implemented with an std::vector of event object pointers); these events are responses to lookup requests previously sent to the lower-level component.
A companion structure, pushed_back_size, which is an std::map, maps each response event's ID (a globally unique ID) to the page size class that the requested address maps to.
Responses are handled by inserting them into the tag arrays of all size classes supported at this level whose page size is no larger than the response's page size (e.g., if pushed_back_size indicates that the response size is 2MB, the entry is inserted into both the 4KB and 2MB arrays).
Then, responses are matched with the pending requests stored in pending_misses (requests that missed the current level), and the matching pending requests are removed, since they have now been serviced.
Next, the completion time is computed by adding both the tag lookup latency and twice the uplink latency (this step models both the request and the response crossing the link) to the current cycle, and the events are added to the ready_by and ready_by_size queues (the latter simply copies over the response's size).
Event objects are delivered to the upper level when the simulated cycle of the TLB unit reaches the ready_by value.
Note that the TLB timing simulation is a little bit off, since both the link latency and tag lookup latency are
treated as if they were only paid when the miss request is serviced, which is not true.
I guess this is an optimization to avoid the complication of adding class Link
objects between TLB levels to
properly simulate link latency, as well as splitting the lookup process into tag lookup and hit/miss handling.
This also introduces weird artifacts. For example, the response is inserted into the array both here and below when the
response is eventually sent to the upper level.
The TLB unit then checks the duplicated miss queue, SAME_MISS, which maps a virtual address to the set of events that requested the same address while the “master miss” for that address was still being serviced. If duplicated requests indeed exist, i.e., the address of the response occurs in SAME_MISS, then all of these events are also immediately serviced by adding them to ready_by and ready_by_size.
The TLB unit also has another structure, namely PENDING_MISS, for tracking addresses with duplicated requests. This structure is logically just a set containing all addresses that received more than one request while their miss was being handled.
Note that the implementation is unnecessarily complex here: the mapped values of SAME_MISS, as well as PENDING_MISS itself, are logically unordered sets. The author of the code, however, chose to implement these sets (which the STL already provides) as std::map objects whose values are always 1.
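The bookkeeping can be summarized with the following minimal sketch, which uses std::set for the logical sets; the member names mirror the text, but the code is a hypothetical stand-in, not the actual Samba implementation.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Addr = uint64_t;
struct Event;   // opaque translation-request event

// Duplicate-miss tracking as described above: SAME_MISS groups events waiting
// on an in-flight miss to the same page; PENDING_MISS is logically a set of
// pages with an in-flight miss (the real code encodes it as std::map with a
// constant value of 1).
struct MissTracker {
    std::map<Addr, std::vector<Event*>> SAME_MISS;
    std::set<Addr> PENDING_MISS;

    // Called on a miss; returns true if the event piggybacks on an existing miss.
    bool record_miss(Addr page, Event* ev) {
        if (PENDING_MISS.count(page)) {
            SAME_MISS[page].push_back(ev);   // duplicate request
            return true;
        }
        PENDING_MISS.insert(page);           // "master miss"
        return false;
    }

    // Called when the miss response for `page` arrives; returns the events that
    // were waiting on it so they can be serviced together with the master miss.
    std::vector<Event*> complete_miss(Addr page) {
        std::vector<Event*> waiters;
        auto it = SAME_MISS.find(page);
        if (it != SAME_MISS.end()) {
            waiters = std::move(it->second);
            SAME_MISS.erase(it);
        }
        PENDING_MISS.erase(page);
        return waiters;
    }
};
```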
The response object is removed from the queue after the above steps, and the next object is read from the response queue. The process drains all response objects in the same simulated cycle, which also seems a little off, since in reality there would likely be a maximum bandwidth for handling responses.
Request Handling
class TLB
models a fixed request processing bandwidth per cycle. If the number of requests arriving in a cycle exceeds the processing bandwidth, only a subset of them is handled, and the rest are buffered.
The request buffer of class TLB
is modeled by data member not_serviced
, which stores requests that have not been
processed.
In each cycle, until the bandwidth is saturated, requests are taken one at a time from not_serviced and processed by calling check_hit() with the requested address.
If the lookup indicates a hit, the request is fulfilled immediately, and the event object is moved from not_serviced
to ready_by
and ready_by_size
with only the tag lookup latency (note: I do not know why the upper link latency is not added to the total latency here).
If the lookup indicates a miss, however, then the request should be serviced by sending a request to the lower level
component. This is achieved by calling push_request()
(which simply calls push_back()
on the request queue,
not_serviced
) on the next_level
pointer, if next_level
is not NULL, meaning that the next level component
is a TLB.
Otherwise, we have reached the last level of the TLB hierarchy and a full TLB miss occurs; in this case, the request is pushed into the request queue of the page table walker (data member PTW).
class TLB
also models a maximum number of outstanding misses. If this threshold has not been reached, the request is moved to pending_misses, provided its address does not collide with that of another pending miss. Otherwise, the request is deemed a duplicate and is moved to SAME_MISS and PENDING_MISS.
In other words, SAME_MISS
and pending_misses
are exclusive, and PENDING_MISS
covers all outstanding misses.
If either the outstanding-miss limit or the request handling bandwidth has been reached, the request remains in the not_serviced queue and is processed again in a later cycle.
Sending Responses
Responses to be delivered by the current level, which can be either miss responses returned from the lower level or lookup hits at the current level, are processed by scanning ready_by.
If an event completes at or before the current cycle, its entry is first inserted into the local tag array (which, for some reason, is also done when miss responses are handled), the event is removed from the pending misses, and finally the response event and its size are added to the response buffer of the upper-level component.
The response buffers of the upper level, which correspond to the pushed_back and pushed_back_size of the current level, are accessible in the object as service_back and service_back_size.
Configuration-Specific Behaviors
class TLB
can behave differently under different configurations. First, multiple misses to the same address are fulfilled with a single response only in the L1 TLB (this is the deduplication process described above). In other words, SAME_MISS
and
PENDING_MISS
are only used when level
equals 1.
Secondly, the configuration file can enable parallel_mode
for any TLB unit, which means that lookup requests that hit the TLB incur no latency and are logically answered to the upper level in the same cycle.
This models the common practice of overlapping the TLB lookup with the cache tag lookup: if the TLB lookup hits, its latency is entirely hidden by the cache lookup, so the cache simply ignores the TLB latency.
TLB Hierarchy
class TLBHierarchy
implements a multi-level TLB structure, with page walk handling, internal state maintaining the known virtual-to-physical translations at all levels of the page table, and sets tracking the virtual pages currently mapped by the page table. The class also defines link objects connecting to the CPU, the cache hierarchy, and the main memory. Each class TLBHierarchy object represents the private TLB hierarchy of one core.
In the big picture of the memory hierarchy, the TLB hierarchy sits between the CPU and the cache hierarchy of each
core. The CPU, when issuing a memory access, enqueues a translation request (as a class MemEvent
object)
into the input link of the TLB hierarchy object. Each cycle, the TLB hierarchy object fetches requests from the input queue and services them using the TLBs, the page walk caches, and/or main memory, by issuing page walk memory accesses.
After a translation request has been serviced, the TLB hierarchy forwards the request to the cache hierarchy via its
cache link. After the cache hierarchy completes the request, the request is sent back to the TLB hierarchy, which
simply forwards it back to the CPU without any additional action.
Organizing TLBUnits to Form a Hierarchy
Data member coreID
is the core ID that the TLB hierarchy is attached to. Data member levels
is the number of
levels in the TLB hierarchy, which must be a non-negative number. levels
being zero means that there is no TLB,
and hence no caching of translation results. Every memory operation in this case will trigger a page walk by directly
enqueuing the request into the queue of the page walker (class PageTableWalker
object).
On initialization, data member TLB_CACHE
, which is an array of class TLBUnit
objects, is initialized by creating
TLB objects and putting them into the array. The first-level TLB is at TLB_CACHE[1] rather than TLB_CACHE[0] (i.e., element zero is unused, which keeps the indexing consistent with level numbers starting at one).
As mentioned earlier, class TLBUnit
objects are chained together using pointers, such that requests missing one
level can be forwarded to the next level. This is achieved by passing the pointer of the previous level TLB to the
constructor of the current level.
In addition, TLB units are also connected backwards by storing the pointers to the pushed_back
and pushed_back_size
arrays of the previous level in the current level, as data members service_back and service_back_size.
When a translation request is completed, it is passed back up the hierarchy by enqueuing it into these service_back and service_back_size arrays, which are then processed by the previous level.
In other words, translation requests are first passed down the hierarchy by calling push_request()
on the next level,
if the request misses the current level. When the request is serviced, it will be passed up the hierarchy by pushing
the completed request into service_back
and service_back_size
of the previous level.
The service_back
and service_back_size
of the L1 TLB (or the page walker, if levels
equals zero) are set to data members mem_reqs and mem_reqs_sizes of class TLBHierarchy, such that a completed request is finally delivered by the L1 TLB into these two arrays.
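The wiring of the levels can be sketched as follows. The classes here are simplified stand-ins (not the real TLB/TLBHierarchy declarations); only the pointer and buffer relationships described above are shown.

```cpp
#include <cstdint>
#include <vector>

struct Event;    // opaque translation request

// Simplified stand-in for one TLB level, showing only the chaining members.
struct TlbLevel {
    int level;
    TlbLevel* next_level = nullptr;               // downward: toward the page walker
    std::vector<Event*>   pushed_back;            // responses arriving from below
    std::vector<uint64_t> pushed_back_size;
    std::vector<Event*>*   service_back      = nullptr;  // previous level's pushed_back
    std::vector<uint64_t>* service_back_size = nullptr;  // previous level's pushed_back_size

    explicit TlbLevel(int lvl) : level(lvl) {}
};

// Build a two-level hierarchy the way TLBHierarchy does: the L1 TLB lives at
// index 1, and its service_back points at the hierarchy's mem_reqs arrays.
inline std::vector<TlbLevel*> build_hierarchy(std::vector<Event*>* mem_reqs,
                                              std::vector<uint64_t>* mem_reqs_sizes) {
    std::vector<TlbLevel*> TLB_CACHE(3, nullptr);        // index 0 intentionally unused
    TLB_CACHE[1] = new TlbLevel(1);
    TLB_CACHE[2] = new TlbLevel(2);

    TLB_CACHE[1]->next_level = TLB_CACHE[2];             // requests flow down
    TLB_CACHE[2]->service_back      = &TLB_CACHE[1]->pushed_back;      // responses flow up
    TLB_CACHE[2]->service_back_size = &TLB_CACHE[1]->pushed_back_size;
    TLB_CACHE[1]->service_back      = mem_reqs;          // L1 delivers to the hierarchy
    TLB_CACHE[1]->service_back_size = mem_reqs_sizes;
    return TLB_CACHE;
}
```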
The class TLBHierarchy
object also contains the page table walker, as a class PageTableWalker
object, the reference
of which is stored in data member PTW
. The page table walker is the last level of the hierarchy that handles a
translation request, and it honors the same event passing mechanism as the TLB units.
To elaborate: The page table walker is registered with the last level of the hierarchy into its PTW
data member
(and in this case the next_level
will be NULL
). This is done during the initialization of the class TLBHierarchy
object. Translation memory events that miss the last TLB level are passed to the page table walker by calling push_request()
on the walker object.
When the request is serviced, it will be passed from the walker back to the last level of the hierarchy, using the
service_back
and service_back_size
array pointers, which point to the pushed_back
and pushed_back_size
arrays
either of the last level TLB, or of the TLB hierarchy object. The latter only happens when there is no TLB such that
the page table walker directly communicates with the TLB hierarchy object.
Note that no link object is involved in the hierarchy, possibly as an optimization to reduce the pressure on the event
queue.
Two links are maintained in the class TLBHierarchy
object, to_cache
and to_cpu
. These two links are not
initialized in the object itself. Instead, the class Samba
object configures the links and passes them to the TLB hierarchy object.
The to_cpu
link is configured in the file as cpu_to_mmu%d
(%d
stands for the core ID; the same notation is used below), and the receiving callback is registered as handleEvent_CPU(). This function is invoked when the CPU requests an address translation; it simply records the time of the event in the time_tracker structure, and forwards the request to the L1 TLB to initiate the translation process.
The to_cache
link is configured in the file as mmu_to_cache%d
, and the receiving call back is handleEvent_CACHE()
.
This function just relays the event to the CPU by calling send()
on the to_cpu
link.
TLB Hierarchy Event Handling
The event handling function in class TLBHierarchy
is tick()
, and the logic is rather straightforward.
If hold
is true
, which is set by the page table walker when a page fault is being serviced, the entire hierarchy
is blocked from accepting new requests, and only the page table walker can make progress. In this case, only
the page table walker ticks, and the function returns.
Otherwise, each level of the TLB ticks, by calling tick()
function on each of them, and the page table walker
also ticks.
Next, the TLB hierarchy handles completed translation requests that are passed up the hierarchy and pushed into its
internal arrays, namely, mem_reqs
and mem_reqs_sizes
.
For each completed translation request, if the emulate_faults option is turned on (meaning that the physical address of the translation is provided by the page fault handler), the physical address of the memory event object is
set by calling setAddr()
and setBaseAddr()
on it. The physical address is obtained from the address mapping
structure, PTE
, which is defined in class Samba
, and maintained by the page table walker’s page fault handler
(in fact, the physical addresses of all levels of the page table are maintained for page walk purposes, as discussed in the page walker section).
Then, the request's service time is computed as the difference between the current cycle and the start cycle recorded in time_tracker
(recall that the start cycle is recorded in handleEvent_CPU()
),
followed by a statistics update.
The event is eventually sent to the cache hierarchy by calling send()
on the to_cache
link, before it is removed
from the mem_reqs
and mem_reqs_sizes
queue.
Note that, in this process, the mem_reqs_sizes
queue is ignored. The sole purpose of the queue is to keep the
interface consistent.
The Page Table Walker
class PageTableWalker
defines the Intel-compatible, four-level page table walker object, which is the last level
of the TLB hierarchy. Translation requests that miss all levels of the hierarchy will be passed to this object, and
a page walk will be performed.
This class also implements the page walk caches, which cache page table entries from all levels (PGD, PUD, PMD, PTE) and reduce the number of memory accesses when a walk hits them.
Linking to Other Components
class PageTableWalker
contains two link objects, namely, to_mem
and s_EventChan
.
As with the case of class TLBHierarchy
, both links are configured within class Samba
.
The to_mem
link is configured with port name ptw_to_mem%d
, and the receiving end is registered to recvResp()
.
This link serves as the channel for issuing memory operations and receiving results during a page table walk, but
a simpler page walk model is also allowed, in which the table walker is not connected to the memory hierarchy, and
memory accesses during page walks are just simulated with a fixed latency.
In the latter case, self_connected
must be set in the configuration file, and page_walk_latency
must also be
set to represent the fixed latency of a single memory operation (in nanoseconds).
The link will then be configured as a self link, i.e., when an event object is sent over to_mem, the page table walker effectively calls its own handler function, recvResp(), after a delay.
As for s_EventChan
, it is merely a self link with 1 nanosecond link latency, and the receiving end is registered with
handleEvent()
. Every send()
operation on s_EventChan
will cause handleEvent()
to be invoked with the event
object being sent, so it is just a delayed function call.
The s_EventChan
is for handling page faults with an external page fault handler.
If emulate_faults
is turned off, this link will not be initialized, and will not be used.
Page Walk Cache
The page walker implements a page walk cache that stores intermediate results and PTE entries of previous page walks.
The page walk cache only has one level, and every supported page size has its own page walk cache.
Currently, the number of page sizes is hardwired to four, namely, 4KB, 2MB, 1GB, and 512GB, which correspond to pages mapped at the PTE, PMD, PUD, and PGD levels, respectively (although the last is not supported by real hardware).
Data member page_size
stores the hardwired page sizes at all page table levels, with element zero being the
smallest page (4KB), and element three being the largest (512GB).
The size and associativity of page walk caches are specified in the configuration file, using keys
size%d_PTWC
and assoc%d_PTWC
, respectively.
Data member size
, assoc
, and sets
store these parameters (sets
is derived from the previous two).
The tag
, valid
, and lru
arrays are implemented in the same manner as in the TLB, which has already been described above.
Access functions for the page walk cache are also defined in exactly the same manner, i.e., insert_way()
to insert
an address tag, invalidate()
to remove an entry, check_hit()
to perform lookups, and
find_victim_way()/update_lru()
for LRU eviction and update.
Page Fault Handling (w/ non-confined PTW)
Page Fault Handler
The page walker emulates page faults, if emulate_faults
is set in the configuration file. The page walker implements page faults as code paths separate from the normal page walk request handling, which greatly simplifies the structure of the logic.
The entry point of the page fault related logic is the link s_EventChan, which is initialized by class Samba, together with its callback handleEvent(). All page fault related events are routed to this function, which maintains a state machine tracking the current status of page fault handling.
Additionally, a class PageFaultHandler
object is initialized and added to the page table walker object.
Although the handler is not fully implemented yet in the current distribution of SST, its duty in page fault handling
is to mimic an OS memory allocator that maintains a pool of available physical pages, and allocates these pages on
page faults for page table pages and data pages.
It does not seem that timing of the page fault is simulated, but this is acceptable, since the actual page fault
handling runs as an OS thread, the timing of which is already modeled by just simulating the instruction stream.
The major role of the page fault handler, therefore, is to simulate the OS’s memory allocator such that the page table
pages and data pages in the simulator can be assigned physical addresses. These addresses are tracked by multiple page
table related structures, namely, data members PGD, PUD, PMD, and PTE. These structures are just std::maps that map the virtual address bits at that level (i.e., the upper bits of the VA) to the physical address of the next level's table page (or of the data page, for PTE). New entries are inserted into these maps when the page fault handler returns an allocated physical page on a page fault event, and the maps are queried during a hardware page walk to generate the physical addresses of the page table entries at the corresponding level.
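The shape of these structures, and the way the fault path fills them while the walk path reads them, can be sketched as follows. The key widths follow the usual x86-64 four-level split, and the level encoding (0 = PTE up to 3 = PGD) is an assumption for illustration.

```cpp
#include <cstdint>
#include <map>

// Per-level shadow of the page table maintained for address generation.
// Keys are the upper VA bits at each level; values are physical page addresses.
struct PageTableShadow {
    std::map<uint64_t, uint64_t> PGD;   // VA bits 39..47 -> PUD page
    std::map<uint64_t, uint64_t> PUD;   // VA bits 30..47 -> PMD page
    std::map<uint64_t, uint64_t> PMD;   // VA bits 21..47 -> PTE page
    std::map<uint64_t, uint64_t> PTE;   // VA bits 12..47 -> data page

    // Fault path: record the physical page the emulated allocator returned
    // for the given level of the walk (3 = PGD ... 0 = PTE).
    void record(int level, uint64_t va, uint64_t phys_page) {
        switch (level) {
            case 3: PGD[va >> 39] = phys_page; break;
            case 2: PUD[va >> 30] = phys_page; break;
            case 1: PMD[va >> 21] = phys_page; break;
            case 0: PTE[va >> 12] = phys_page; break;
        }
    }

    // Walk path: is the final translation (data page) known for this VA?
    bool translated(uint64_t va) const { return PTE.count(va >> 12) != 0; }
};
```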
The page fault handler defines an interface object, class PageFaultHandlerInterface
, which is just a standard
functor object that, when called, invokes a registered member function on a registered object instance.
Handler registration is performed by calling registerPageFaultHandler()
, which is done in class Samba
’s constructor.
Each page table walker object has its member function recvPageFaultResp()
registered, with a
class PageFaultHandlerPacket
object as the argument, specifying the action, virtual address, physical address, and page size class of the handler event.
Page Fault Handling Control Flow
We only discuss the non-confined page table walker case (controlled by data member ptw_confined
, which is set from the configuration key of the same name), which is the default. The confined page table walker seems to lack certain elements needed to work properly (e.g., it does not appear to stall the page table walker anywhere, yet it unstalls it when a page fault is handled).
Page faults are triggered in the main event loop, tick(). For every page walk request in the not_serviced queue,
the requested virtual address is checked against three structures that track the mapped pages, namely,
MAPPED_PAGE_SIZE4KB
, MAPPED_PAGE_SIZE2MB
, and MAPPED_PAGE_SIZE1GB
. These three structures are std::map
objects
implementing the set semantics (I have no idea why the author does not just use std::set
).
Addresses that already have their page fault handled will be entered into these three structures (which happens in
the page fault handler when a fault is fully handled), such
that the simulator just assumes that no further page fault will ever happen for these addresses, and that the physical page they have been mapped to will not change from then on. Recall that the fault simulation is mainly to assign physical
addresses to virtual addresses for page walk address generation, and it does not care about timing, so this assumption
is reasonable. Moreover, when the simulated OS changes the mapping (most likely followed by a shootdown),
the simulator can be instructed to erase certain mappings from these structures, which will cause a new
physical address to be assigned on the next translation request for these entries.
If the requested translation is not tracked by any of the three structures, then a page fault is signaled.
The fault address is tracked by data member stall_addr
, and the fault address is added to the structure tracking pending page faults, namely, PENDING_PAGE_FAULTS, if it is not already present.
If there is no pending fault on the same address, the page fault is delivered via the link s_EventChan
, using a
class SambaEvent
of type EventType::PAGE_FAULT
.
As the last step, data member stall
and hold
are both set, indicating that a page fault is being handled.
During page fault handling, normal page walk requests will not be processed, and upper level TLBs will also stop
processing translation requests. Note that stall
and hold
are always set or cleared together. The difference
is that stall
is a data member of class PageTableWalker
, which is only read by the table walker itself,
while hold
is defined as a data member in class TLBHierarchy
, and the hierarchy will prevent all TLBs
from processing requests, if hold
is set to 1
. The page walker holds a reference to hold
in order to
notify the upper level that a fault is being handled.
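The fault-trigger check can be summarized by the following schematic helper. It is a hypothetical stand-in for the corresponding logic in tick(): the member names mirror the text, but the exact keys stored in the sets (full addresses vs. page numbers) are an assumption here.

```cpp
#include <cstdint>
#include <set>

// Shared and per-walker state involved in triggering an emulated page fault.
struct FaultState {
    std::set<uint64_t>* MAPPED_PAGE_SIZE4KB;   // shared across all walkers
    std::set<uint64_t>* MAPPED_PAGE_SIZE2MB;
    std::set<uint64_t>* MAPPED_PAGE_SIZE1GB;
    std::set<uint64_t>* PENDING_PAGE_FAULTS;
    uint64_t stall_addr = 0;
    bool stall = false;        // read only by this walker
    bool* hold = nullptr;      // owned by TLBHierarchy; freezes the upper TLBs
};

// Returns true if the walk request must wait for an emulated page fault.
// send_fault_event is set when this walker should send the PAGE_FAULT event
// over s_EventChan (i.e., no other walker is already handling the same page).
inline bool maybe_fault(FaultState& f, uint64_t vaddr, bool& send_fault_event) {
    const uint64_t p4k = vaddr >> 12, p2m = vaddr >> 21, p1g = vaddr >> 30;
    if (f.MAPPED_PAGE_SIZE4KB->count(p4k) || f.MAPPED_PAGE_SIZE2MB->count(p2m) ||
        f.MAPPED_PAGE_SIZE1GB->count(p1g)) {
        send_fault_event = false;
        return false;                              // already mapped: no fault
    }
    f.stall_addr = vaddr;
    send_fault_event = f.PENDING_PAGE_FAULTS->insert(p4k).second;  // first faulting walker only
    f.stall = true;                                // stop this walker
    if (f.hold) *f.hold = true;                    // stop the upper-level TLBs
    return true;
}
```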
Also note that the tracking structures, such as PENDING_PAGE_FAULTS
, MAPPED_PAGE_SIZExxx
, and the per-level
page table page’s physical address tracking structures (PGD
, PUD
, PMD
, and PTE
),
are shared across all page walker instances, and hence all walkers always see the same virtual-to-physical mapping.
Synchronization is not a problem, as the simulation is already synchronous and processes only one event at a time within a cycle.
One implication is that, although it may seem that PENDING_PAGE_FAULTS can never hold more than one pending entry (since a page walker handles only one fault at a time), another page walker may in fact already be handling a fault on the same address, and hence checking PENDING_PAGE_FAULTS is necessary to avoid allocating multiple physical pages to the same virtual address.
The page fault object sent via s_EventChan
will invoke handleEvent()
at a later cycle. On receiving the event,
the function will first determine the level of the fault by querying the fault address in the per-level tracking
structures, namely, PGD
, PUD
, PMD
, and PTE
. The fault is handled starting from the highest level at which the page table page is not yet mapped, and physical pages are allocated both for the page table pages themselves and for the data page.
After determining the level, which is stored in data member fault_level
, the page fault handler’s
allocatePage()
is called to grab an unused physical page.
Although in the current distribution of SST, the allocatePage()
function is not implemented, it is supposed to
maintain a pool of free physical pages, and just allocate one of them to the page fault handler.
The response event is supposed to be sent by calling
the handler registered to its data member pageFaultHandlerInterface
with the newly allocated physical address,
using a class PageFaultHandlerPacket
object and setting the action to
PageFaultHandler::PageFaultHandlerAction::RESPONSE
.
The response message is received by invoking the page table walker's function recvPageFaultResp() (the registered callback). That function simply forwards the response, in the form of a class SambaEvent object of type EventType::PAGE_FAULT_RESPONSE, over s_EventChan.
On receiving the response event, handleEvent() updates the per-level tracking structure with the allocated physical page, meaning that the page table page now has a physical page and addresses can be generated for later page walks that traverse it.
The function also decrements fault_level, if it is non-zero, and calls allocatePage() again, in order to allocate a physical page for the next level of the page table.
This process continues until all nodes on the radix-tree branch have been allocated a physical page, at which point fault_level becomes zero and page fault handling concludes.
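The level-by-level allocation can be collapsed into a plain loop for clarity, as in the sketch below. The real code performs this as an event ping-pong between handleEvent() and recvPageFaultResp(); the table layout, shift values, and allocator callback here are illustrative assumptions.

```cpp
#include <cstdint>
#include <functional>
#include <map>

// fault_level counts down from the highest unmapped table level toward the
// PTE: 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE (assumed encoding).
inline void serve_fault(uint64_t va, int fault_level,
                        const std::function<uint64_t()>& allocate_page,
                        std::map<uint64_t, uint64_t>* tables[4],   // {PTE, PMD, PUD, PGD}
                        std::map<uint64_t, bool>& MAPPED_PAGE_SIZE4KB) {
    static const int shift[4] = {12, 21, 30, 39};        // VA bits keying each level
    for (;;) {
        uint64_t phys = allocate_page();                 // emulated OS allocator
        (*tables[fault_level])[va >> shift[fault_level]] = phys;
        if (fault_level == 0) break;                     // data page mapped: done
        --fault_level;                                   // descend to the next level
    }
    // PAGE_FAULT_SERVED: future walks to this page will no longer fault.
    MAPPED_PAGE_SIZE4KB[va >> 12] = true;
}
```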
A class SambaEvent
of type EventType::PAGE_FAULT_SERVED
is then sent to s_EventChan
(i.e., it is delivered back to the same handler that sent it); on receiving it, the page address is entered into MAPPED_PAGE_SIZE4KB
, and
removed from PENDING_PAGE_FAULTS
.
Future page walk requests to the same address will not incur a page fault, as MAPPED_PAGE_SIZE4KB will indicate that the address is already mapped.
In addition, the page table walker will unstall on the next tick, since the address is no longer in PENDING_PAGE_FAULTS
.
This is checked at the beginning of tick()
, and, if the address is truly missing from PENDING_PAGE_FAULTS
,
both stall
and hold
will be cleared, allowing page walk requests and translation requests to be served by
the page table walker and upper level TLBs, respectively.
If emulate_faults
is turned off, none of the physical address tracking structures are used. In this case, page walks have no reliable source from which to generate the physical address of each step of the walk, and thus simply use random numbers as the page numbers accessed during the walk.
Page Table Walk Event Handling
The main page table walk logic is implemented in function tick()
and in the callback function of the link connecting the page table walker with memory, recvResp()
.
The first thing that tick()
does is to check whether the current page table walker is stalled in a page fault, and if
true, then no page table walk is performed, and the function waits for fault handling to complete.
Otherwise, page walk requests are fetched from not_serviced
queue (into which the last-level TLB enqueues requests by calling push_request() on the page walker) and examined for processing.
The page table walker also implements a maximum number of walks per cycle with local variable dispatched
, which is
incremented for every request serviced in the current cycle. If this value exceeds the maximum, which is specified by max_width_PTWC in the configuration and stored in data member max_width, no further requests are processed in that cycle.
Then the page fault logic checks whether the page walk will induce a fault. Note that this deviates from how a real page table walker works, where a page fault is only triggered when the walk encounters an invalid entry at some level.
On a page fault, as discussed earlier, the handling logic blocks the walker and all upper-level TLBs until fault handling completes.
If not, the next step is to check the page walk caches. All caches are assumed to be checked in parallel in the same cycle, and hit priority is given to lower levels (i.e., PTE > PMD > PUD > PGD), since a lower-level hit in the page table leads to a shorter walk path.
If the PTE page walk cache is hit, indicating that the translation entry is already in the cache (k == 0
),
then no page walk is needed, and the request can be fulfilled either at the current cycle (if parallel mode is on,
which is typical for VIPT L1 caches), or latency
cycles after the current cycle.
In either case, the event object representing the request is moved to ready_by
queue, and the corresponding
ready_by_size
queue is populated with the system page size.
Note that only one system page size is supported for now, which is specified using os_page_size
in the configuration
file (unit is KB, so valid values are 4
, 2048
, and 1024 * 1024
), despite the fact that upper level TLBs may
have several size classes.
On the other hand, if all of the page walk caches signal misses, or one of them hits but at a level higher than the one implied by the system page size, then a page walk is still needed to retrieve the missing levels from the in-memory page table. The number of levels to fetch is computed from the system page size and stored in the local variable k.
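A sketch of this computation is shown below. The encoding of levels (0 = PTE up to 3 = PGD) and the way the page-walk-cache hit level is reported are assumptions made for illustration; only the relationship between the OS page size, the hit level, and the remaining number of memory accesses is taken from the description above.

```cpp
#include <cstdint>

// Level at which the walk terminates for the configured OS page size
// (os_page_size is given in KB in the configuration): 0 = PTE, 1 = PMD, 2 = PUD.
inline int terminating_level(uint64_t os_page_size_kb) {
    if (os_page_size_kb == 4)    return 0;   // 4KB pages: walk down to the PTE
    if (os_page_size_kb == 2048) return 1;   // 2MB pages: stop at the PMD
    return 2;                                // 1024 * 1024 KB, i.e., 1GB pages: stop at the PUD
}

// Remaining walk steps k, given the deepest page-walk-cache level that hit
// (pwc_hit_level = 0..3 for PTE..PGD, or 4 if every cache missed).
inline int remaining_steps(uint64_t os_page_size_kb, int pwc_hit_level) {
    const int stop = terminating_level(os_page_size_kb);
    if (pwc_hit_level <= stop) return 0;     // cached at or below the target level
    return pwc_hit_level - stop;             // one memory access per missing level
}
```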
The handler then attempts to add the walk request into the pending request queue, pending_misses
, if the
maximum number of outstanding requests has not been reached (data member max_outstanding
, specified with
max_outstanding_PTWC
in the configuration file). If the limit is reached, then the request remains in the
not_serviced
queue, and will be re-handled in the next cycle.
If the request is successfully added to the pending_misses
queue, the handler will then proceed to model the timing
of the page walk. In the simpler case, if the page walker is not connected to a memory object (self_connected
is set
in the configuration file, causing data member to_mem
to be set as NULL), page walks are handled as a fixed latency
event, and the result will be immediately added to the ready_by
array with a delay of
latency + 2 * upper_link_latency + page_walk_latency
cycles. The page walk latency can be configured using
page_walk_latency
in the configuration file, and defaults to 50 cycles.
In the more involved case, where the page table walker is actually connected to a memory object, the walk is simulated by issuing memory accesses one after another, beginning from the lowest level that hit the page walk cache, until the required number of steps (stored in local variable k) has been performed.
Since page walks require a multi-round message exchange between the page table walker and the memory system, the page table walker uses state machines to track the status of in-progress walk instances across calls to tick() and the memory callback function, recvResp()
.
Each page walker instance is assigned a globally unique ID by incrementing data member mmu_id
, and the corresponding request event object is then added to the queue WID_EV
.
Four extra mapping structures track additional information of the page walk’s status:
WSR_COUNT
tracks the remaining number of steps to walk down the page table;
WID_Add
tracks the virtual address of the walk;
WSR_READY
tracks whether the page table walk has completed using boolean flags (note that this structure is
only set, but not used anywhere).
All three of them use the page walk instance ID as the key.
MEM_REQ
maps the memory access request (not the translation request!) event ID to the globally unique walk instance ID.
This map is used to recover the page walk instance ID from the event object when it is fulfilled by the memory,
and the instance ID is then used to retrieve the status of the current walk.
The initial memory address to be accessed for the current step of the page walk is either generated as a random number,
if page fault emulation (i.e., allocation of physical pages for page table pages) is turned off,
or computed as the address of one of the PGD entries, selected by bits 39 to 47 of the requested virtual address.
Note that this seems inconsistent with how an actual page walk behaves: a real walk would resume from the lowest level whose page walk cache was hit, but here the traversal always begins from the top level (i.e., the PGD).
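A sketch of this address generation is shown below. The 8-byte entry size and the exact use of CR3 are assumptions in line with a standard x86-64 page table layout; the real Samba code may compute the address differently.

```cpp
#include <cstdint>
#include <cstdlib>

// Physical address of the PGD entry to fetch for the first step of a walk.
// With fault emulation off there is no tracked PGD page, so a random page
// number is used instead (mirroring the behavior described above).
inline uint64_t pgd_entry_address(uint64_t va, uint64_t cr3, bool emulate_faults) {
    const uint64_t pgd_index = (va >> 39) & 0x1FF;       // VA bits 39..47
    if (!emulate_faults) {
        const uint64_t random_page = (uint64_t)std::rand();
        return (random_page << 12) | (pgd_index * 8);    // arbitrary page, real index
    }
    return cr3 + pgd_index * 8;        // CR3 holds the physical address of the PGD page
}
```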
As the last step, a memory access request is constructed as a class MemRequest
object with both the requested
virtual address, and the address to be accessed.
The memory access request’s ID is entered into MEM_REQ
, and the request itself is sent to the memory for
fulfillment by calling send()
on data member to_mem
.
Once the memory request is fulfilled, the call back function, recvResp()
, will be invoked.
The page walk instance ID is first fetched from the mapping structure MEM_REQ
, using the request ID stored in the
response event object (by calling getResponseToID()
), and stored in a local variable pw_id
.
The virtual address, which is tracked by WID_Add
, is first inserted, by calling insert_way()
, into the page
walk cache of the current level (tracked by WSR_COUNT
), after evicting an existing entry from the structure by
calling find_victim_way()
.
Then, if the page walk has reached the lowest level, indicated by the WSR_COUNT value for the instance ID being zero, the page walk concludes, and the translation request is added to the ready_by
and ready_by_size
queue with an extra latency of latency + 2 * upper_link_latency
to model the page walk cache lookup and link delay.
Otherwise, more memory accesses need to be made to read lower levels of the page table.
The address generation process is similar to the one in tick()
, with the exception that the address is read from
different physical address tracking structures (PGD
, PUD
, PMD
, and PTE
) according to the value of WSR_COUNT
.
A new class MemEvent
object is allocated to convey the next level access, and sent via the link by calling
send()
on to_mem
link.
The WSR_COUNT
value of the walk instance is also decremented by one, and the MEM_REQ
mapping is also updated to
reflect the most recent memory access (instance ID does not change, but the event object is the newly allocated one).
This process will be repeated several times, until the page walk reaches the terminating level with regard to the
configured OS page size.
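The per-walk bookkeeping used across tick() and recvResp() can be summarized with the sketch below. It is a simplified stand-in for the real code: the map names mirror the text, while the ID generation and the omitted page-walk-cache update are illustrative.

```cpp
#include <cstdint>
#include <map>

// Per-walk state, keyed by the walk instance ID (see WSR_COUNT, WID_Add,
// MEM_REQ in the text; WSR_READY is omitted since it is set but never read).
struct WalkTables {
    std::map<int, int>      WSR_COUNT;   // walk id -> remaining table levels
    std::map<int, uint64_t> WID_Add;     // walk id -> virtual address being walked
    std::map<uint64_t, int> MEM_REQ;     // outstanding memory request id -> walk id
};

// One recvResp() step: returns true when the walk for this response is done;
// otherwise next_req_id_out receives the ID of the follow-up memory request.
inline bool on_memory_response(WalkTables& w, uint64_t resp_req_id,
                               uint64_t& next_req_id_out) {
    const int walk_id = w.MEM_REQ[resp_req_id];
    w.MEM_REQ.erase(resp_req_id);

    // (The real code also inserts WID_Add[walk_id] into the page walk cache of
    //  the current level here, using find_victim_way()/insert_way().)

    if (w.WSR_COUNT[walk_id] == 0)
        return true;                      // lowest level reached: move to ready_by

    --w.WSR_COUNT[walk_id];               // descend one level of the page table
    static uint64_t next_id = 1;          // stand-in for a fresh MemEvent ID
    next_req_id_out = next_id++;
    w.MEM_REQ[next_req_id_out] = walk_id; // same walk, new outstanding request
    return false;
}
```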
We now return to the event handling loop in tick()
. After dealing with the memory access of page walks, the
walk request is removed from the not_serviced
queue. Pending page walks still remain in the pending_misses
queue until they are moved from ready_by
queue to the last level TLB’s response message queue.
The last step of event handling is to check completed requests in the ready_by
queue, and move
them into the last level TLB’s response queue, if the current tick cycle is no less than the completion cycle
of the request.
This is done by scanning the ready_by
queue, comparing the completion cycle of each request (the value stored in the map); for every request whose completion cycle is no greater than the current tick cycle, the requested virtual address is inserted into the PTE page walk cache (index zero), if not already present, after which the LRU state is also updated.
Both the request and the size (in ready_by_size
queue) are inserted into the upper level’s response queue,
service_back
and service_back_size
. The event is then removed from pending_misses
as well.
The code also performs a sanity check on whether the physical address of the PTE corresponding to the requested virtual address is present in the physical address tracking structure, PTE. If it is not, something has gone seriously wrong, because the entry is supposed to be mapped after the first emulated page fault, which must happen before the walk request completes.
Samba
class Samba
is merely a container class that implements no concrete functionality other than initialization and event propagation. It instantiates the full multi-core, multi-size-class TLB hierarchy, with page walkers and communication links for each core.
The topology on each core must be identical, meaning that only homogeneous TLB configuration can be supported.
Samba maintains the global physical address tracking structures that map virtual page numbers and page table pages at all levels to physical addresses.
As mentioned earlier, these structures mimic the OS physical page allocator and are used to generate the physical addresses for cache accesses and for page table walks.
Samba assumes that all simulated TLBs operate under the same address space, and hence only one copy of the
mapping structures is maintained.
Each TLB instance holds pointers to the shared mapping structures
(which are registered during initialization by calling setPageTablePointers()
on both class TLBHierarchy
and class PageTableWalker
), and they access the structures synchronously in the timer call back, tick()
, or
link object call backs.
Operation-level contention is impossible, as accesses will not overlap (tick()
and link call backs are executed
serially). Semantic-level contention may happen when, for example, multiple page table walkers fault on the same virtual address, and each of them could request a physical page without noticing the others' faults.
This is handled by recording the ongoing page fault in the shared structure PENDING_PAGE_FAULTS, and checking that structure every time a fault is about to be started on a page table walker.
The process has already been described in the relevant sections, and we do not cover details here.
Context switches are not modeled, but can be easily added by flushing the content of TLB arrays, page walk caches, and the tracking structures. Fine-grained shootdown is also possible by only removing entries that match the requested virtual address, but it requires collaboration from other parts of the simulator, since TLB shootdown is an OS-level task, which happens on page table updates. The TLB simulator itself hence has no way to know when an entry is no longer valid.
Data member CR3
stores the physical address of the PGD page, i.e., the root of the four-level page table.
Data members PGD, PUD, and PMD map from the upper bits of the virtual address, i.e., bits 39 – 47, bits 30 – 47, and bits 21 – 47, respectively (the PTE map, described below, is keyed by bits 12 – 47, i.e., the full virtual page number).
Caveat: the confined and non-confined page table walkers seem to differ in these index bits, but I could not decipher the purpose of the different bit indexing schemes due to the lack of documentation.
Data member PTE
maps virtual page numbers to physical page numbers, and is used in class TLBHierarchy
to generate
the physical address for memory access when translation completes.
All these structures are updated in class PageTableWalker
, with the aid of class PageFaultHandler
, which mimics
an OS physical page allocator.
Pages that are already mapped (i.e., have an associated physical address) are tracked by data members MAPPED_PAGE_SIZE4KB, MAPPED_PAGE_SIZE2MB, and MAPPED_PAGE_SIZE1GB. These structures are used to determine whether a page fault is needed; the fault handling simply emulates the allocation of physical pages and enters the virtual-to-physical address mapping into one of the three structures.
On initialization, Samba creates the per-core TLB hierarchy after reading the essential configurations.
Link objects that connect the per-core hierarchy with other components are also created in the initialization routine,
and then passed to the corresponding Samba components.
It configures the per-core TLB-to-CPU link, the per-core TLB-to-cache-hierarchy link, the per-core page table walker-to-memory (or to one of the caches in the hierarchy) link, if page walk memory accesses are to be simulated (controlled by self_connected in the configuration file), and the loopback link of the page table walker used for delivering page fault events.
At the end of initialization, Samba registers itself with the global clock of the discrete event simulation.
If the frequency is not specified, it will default to 1GHz
.
The tick()
function is very simple: it just calls the tick() of each class TLBHierarchy object.