Understanding TLB Simulation in Samba (SST)
Samba Overview
Samba (defined in class Samba
) is the TLB simulation implementation in SST. It consists of four (sub)components.
The first is the TLB unit (class TLB), which represents a single level of the TLB, potentially supporting multiple page sizes, where each size class has its own separate lookup structure.
The second is the TLB hierarchy (class TLBhierarchy
) which serves as a container class for multi-level TLBs
and the page walker, and defines the interface for communicating with the CPU and the cache hierarchy.
The third is the page table walker (class PageTableWalker), which is responsible for simulating the memory accesses issued when a translation request misses all levels of the TLB hierarchy; it also tracks the physical addresses of all known page table entries at all levels (when emulate_faults is turned on). The fourth is the page fault handler (class PageFaultHandler), which the page table walker uses to emulate the OS physical page allocator; it is currently only a stub.
All Samba components perform standard discrete event simulation using the SST simulation infrastructure, including clocks that tick periodically, and links that deliver event objects to a registered callback function with an optional delay.
Components are driven by the clock, which is registered with the tick()
function as a call back.
Components may also communicate with each other or even with themselves by sending messages via the link objects.
At the functional level, the Samba TLB sits between the CPU and the memory hierarchy. CPU-side memory access requests are always handled first as translation requests, which, after translation concludes, are forwarded to the memory hierarchy as access requests. Memory responses, on the other hand, simply pass through the Samba TLB and are relayed directly back to the CPU. The Samba TLB itself may also issue memory requests for page table walks using a separate TLB-memory channel.
The Samba TLB simulator implements the basic functionality, such as translation lookup, page table walking, and page walk caches. It is, however, still incomplete: it does not implement page fault emulation (class PageFaultHandler is only a stub) or TLB entry invalidation (i.e., TLB shootdown), and there is no notion of contexts (e.g., PCID/ASID).
Due to the lack of information on the actual page table content and OS internal state, Samba keeps track of the set of mapped virtual pages in a demand-driven manner: initially, no page is assumed to be mapped; as translation requests are serviced, the virtual page address is added to the set of mapped addresses, and the OS's physical page allocation is emulated with a page fault handler (the actual page fault handling is still simulated as part of the OS's control flow; here we only emulate the physical page allocation process).
As a consequence, the memory accesses issued by the page walker are likely to use addresses that differ from those the simulated OS would use if it were executed in a non-simulated environment.
The Single-Level Multiple-Size TLBUnit
class TLB
defines a set-associative, single-level TLB that supports multiple page sizes. Each page size is hosted by
a standard implementation of a set-associative array (in data member tags
), and different sizes use disjoint arrays.
This class models functional features of a TLB array, such as lookup, insert, and replacement, as well as the timing
traits, such as tag lookup latency, fixed service bandwidth per cycle, and deduplication of requests to the same
address.
A multi-level TLB hierarchy (as is the case with most architectures nowadays) can be formed by connecting several class TLB objects using direct pointers (data member next_level; note that this is not a class Link, but a plain object pointer).
The TLB hierarchy is connected with the CPU and the page walk cache via link objects, which is implemented in
class TLBhierarchy
, and not discussed here.
Note that in the entire Samba implementation, level begins at one, and a lower level means closer to the CPU.
The Set-Associative Array
Each class TLB
object implements a few size classes, the number of which is stored in data member sizes
and is
specified in the configuration using the key sizes_L%d
(%d
is the level in the TLB hierarchy - same for other
configuration keys, and we do not repeat this below).
Each size class has its own tag array; the size, associativity, and page size of a class are specified in the configuration using keys size%d_L%d, assoc%d_L%d, and page_size%d_L%d, respectively, where the first %d is the ID of the size class. The size class ID is specified in the configuration file and has no special meaning other than serving as a unique identifier. Page sizes are given in KB (1024-based) in the configuration and stored internally (in the array corresponding to the page_size%d_L%d keys) in bytes.
The class constructor also derives the number of sets for each size class, and a mapping from page size class (in KB)
to the ID used in this class (i.e., the reverse mapping of page_size%d_L%d
array), which are stored in data members
sets
and SIZE_LOOKUP
, respectively.
For each size class, the object maintains three arrays storing the relevant information: tags, valid, and lru, the meanings of which are straightforward. They are all three-dimensional arrays, where the
first dimension is the page size class ID (i.e., output of SIZE_LOOKUP
), the second dimension is the set number,
and the last dimension is the way number.
The standard access functions for set-associative arrays are defined.
insert_way()
inserts a new tag into a given way of a given size class.
invalidate()
clears the valid
bit of all entries in all size classes, and is used in TLB shootdowns (the id
argument is redundant, and note that the vadd
argument is a page number, not a full address).
check_hit()
performs lookup and signals hits or misses on a particular size class.
find_victim_way()
returns the way number of the LRU entry on a particular size class.
update_lru()
promotes a given way to the MRU position on a particular size class.
Note that TLBs implement precise LRU as the replacement algorithm. Each entry in a set is assigned an LRU counter
representing its position in the LRU stack (stored in data member lru
).
The LRU entry is the one whose counter equals (number of ways - 1), and the MRU entry has a counter of zero. An entry is promoted to MRU by incrementing by one all counters whose values are smaller than the entry's current counter, and then setting the entry's counter to zero.
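To make the bookkeeping concrete, here is a minimal, self-contained sketch of this precise-LRU scheme. The names (lru, assoc, find_victim_way, update_lru) mirror the text, but the code is illustrative and not taken from the Samba sources.

```cpp
#include <vector>

// One set of a set-associative array with precise LRU.
// lru[way] is the way's position in the LRU stack: 0 = MRU, assoc-1 = LRU.
struct LruSet {
    int assoc;                 // number of ways in the set
    std::vector<int> lru;      // per-way LRU counters

    explicit LruSet(int ways) : assoc(ways), lru(ways) {
        for (int w = 0; w < ways; ++w) lru[w] = w;   // arbitrary initial order
    }

    // The LRU victim is the way whose counter equals (assoc - 1).
    int find_victim_way() const {
        for (int w = 0; w < assoc; ++w)
            if (lru[w] == assoc - 1) return w;
        return 0;              // unreachable if the counters are consistent
    }

    // Promote a way to MRU: increment by one every counter smaller than the
    // accessed way's counter, then set the accessed way's counter to zero.
    void update_lru(int way) {
        const int old = lru[way];
        for (int w = 0; w < assoc; ++w)
            if (lru[w] < old) ++lru[w];
        lru[way] = 0;
    }
};
```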
TLB Event Handling
class TLB
has only one event handling function, namely, tick()
. This function is not registered to the clock; instead, it is merely called by the tick() function of its containing class, class TLBhierarchy.
The tick() function simulates the parallel operation of all buffers and processing logic in the TLB unit; we discuss these steps below in the order in which they appear in the source code.
Response Handling
The response handling logic reads event objects from the queue pushed_back (implemented with an std::vector of event object pointers); these events are responses to lookup requests previously sent to the lower-level component.
A companion structure, pushed_back_size, which is an std::map, maps each response event's ID (a globally unique ID) to the page size class that the requested address maps to.
Responses are handled by inserting them into the tag arrays of all size classes supported at this level whose page size is no larger than the response's page size (e.g., if pushed_back_size indicates that the response size is 2MB, the entry is inserted into both the 4KB and 2MB arrays).
Then, responses are matched with the pending requests stored in pending_misses (requests that missed the current level), and the matching pending requests are removed, since they have now been serviced.
Next, the completion time is computed by adding both the tag lookup latency and twice the uplink latency (this step models both the request and the response crossing the link) to the current cycle, and the events are added to the ready_by and ready_by_size queues (the latter simply copies over the response's size).
Event objects are delivered to the upper level when the simulated cycle of the TLB unit reaches the ready_by value.
Note that the TLB timing simulation is a little bit off, since both the link latency and tag lookup latency are
treated as if they were only paid when the miss request is serviced, which is not true.
I guess this is an optimization to avoid the complication of adding class Link
objects between TLB levels to
properly simulate link latency, as well as splitting the lookup process into tag lookup and hit/miss handling.
This also introduces weird artifacts. For example, the response is inserted into the array both here and below when the
response is eventually sent to the upper level.
The TLB unit then checks the duplicated miss queue, SAME_MISS, which maps a virtual address to the set of events that requested the same address while the “master miss” for that address was still being serviced. If duplicated requests indeed exist, i.e., the address of the response occurs in SAME_MISS, then all of these events are also immediately serviced by adding them to ready_by and ready_by_size.
The TLB unit also has another structure, namely PENDING_MISS, for tracking addresses with duplicated requests. This structure is logically just a set containing all addresses that received more than one request while their miss was being handled.
Note that the implementation is unnecessarily complex here: the mapped values of SAME_MISS, as well as PENDING_MISS itself, are logically unordered sets. The author of the code, however, chose to implement these sets (which the STL already provides) as std::map objects whose values are always 1.
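The bookkeeping can be summarized with the following minimal sketch, which uses std::set for the logical sets; the member names mirror the text, but the code is a hypothetical stand-in, not the actual Samba implementation.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using Addr = uint64_t;
struct Event;   // opaque translation-request event

// Duplicate-miss tracking as described above: SAME_MISS groups events waiting
// on an in-flight miss to the same page; PENDING_MISS is logically a set of
// pages with an in-flight miss (the real code encodes it as std::map with a
// constant value of 1).
struct MissTracker {
    std::map<Addr, std::vector<Event*>> SAME_MISS;
    std::set<Addr> PENDING_MISS;

    // Called on a miss; returns true if the event piggybacks on an existing miss.
    bool record_miss(Addr page, Event* ev) {
        if (PENDING_MISS.count(page)) {
            SAME_MISS[page].push_back(ev);   // duplicate request
            return true;
        }
        PENDING_MISS.insert(page);           // "master miss"
        return false;
    }

    // Called when the miss response for `page` arrives; returns the events that
    // were waiting on it so they can be serviced together with the master miss.
    std::vector<Event*> complete_miss(Addr page) {
        std::vector<Event*> waiters;
        auto it = SAME_MISS.find(page);
        if (it != SAME_MISS.end()) {
            waiters = std::move(it->second);
            SAME_MISS.erase(it);
        }
        PENDING_MISS.erase(page);
        return waiters;
    }
};
```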
The response object is removed from the queue after the above steps, and the next object is read from the response queue. The process drains all response objects in the same simulated cycle, which also seems a little off, since in reality there would likely be a maximum bandwidth for handling responses.
Request Handling
class TLB
models a fixed request processing bandwidth per cycle. If the number of requests arriving in a cycle exceeds the processing bandwidth, only a subset of them is handled, and the rest are buffered.
The request buffer of class TLB
is modeled by data member not_serviced
, which stores requests that have not been
processed.
In each cycle, until the bandwidth is saturated, requests are taken one at a time from not_serviced and processed by calling check_hit() with the requested address.
If the lookup indicates a hit, the request is fulfilled immediately, and the event object is moved from not_serviced
to ready_by
and ready_by_size
with only the tag lookup latency (note: I do not know why the upper link latency is not added to the total latency here).
If the lookup indicates a miss, however, then the request should be serviced by sending a request to the lower level
component. This is achieved by calling push_request()
(which simply calls push_back()
on the request queue,
not_serviced
) on the next_level
pointer, if next_level
is not NULL, meaning that the next level component
is a TLB.
Otherwise, we have reached the last level of the TLB hierarchy and a full TLB miss occurs; in this case, the request is pushed into the request queue of the page table walker (data member PTW).
class TLB
also models a maximum number of outstanding misses. If this threshold has not been reached, the request is moved to pending_misses, provided its address does not collide with that of another pending miss. Otherwise, the request is deemed a duplicate and is moved to SAME_MISS and PENDING_MISS.
In other words, SAME_MISS
and pending_misses
are exclusive, and PENDING_MISS
covers all outstanding misses.
If either the outstanding-miss limit or the request handling bandwidth has been reached, the request remains in the not_serviced queue and is processed again in a later cycle.
Sending Responses
Responses to be delivered by the current level, which can be either miss responses returned from the lower level or lookup hits at the current level, are processed by scanning ready_by.
If an event completes at or before the current cycle, its entry is first inserted into the local tag array (which, for some reason, is also done when miss responses are handled), the event is removed from the pending misses, and finally the response event and its size are added to the response buffer of the upper-level component.
The response buffers of the upper level, which correspond to the pushed_back and pushed_back_size of the current level, are accessible in the object as service_back and service_back_size.
Configuration-Specific Behaviors
class TLB
can behave differently under different configurations. First, multiple misses to the same address are fulfilled with a single response only in the L1 TLB (this is the deduplication process described above). In other words, SAME_MISS
and
PENDING_MISS
are only used when level
equals 1.
Secondly, the configuration file can enable parallel_mode
for any TLB unit, which means that lookup requests that hit the TLB incur no latency and are logically answered to the upper level in the same cycle.
This models the common practice of overlapping the TLB lookup with the cache tag lookup: if the TLB lookup hits, its latency is entirely hidden by the cache lookup, so the cache simply ignores the TLB latency.
TLB Hierarchy
class TLBHierarchy
implements a multi-level TLB structure, with page walk handling, internal state maintaining the known virtual-to-physical translations at all levels of the page table, and sets tracking the virtual pages currently mapped by the page table. The class also defines link objects connecting to the CPU, the cache hierarchy, and the main memory. Each class TLBHierarchy object represents the private TLB hierarchy of one core.
In the big picture of the memory hierarchy, the TLB hierarchy sits between the CPU and the cache hierarchy of each
core. The CPU, when issuing a memory access, enqueues a translation request (as a class MemEvent
object)
into the input link of the TLB hierarchy object. Each cycle, the TLB hierarchy object fetches requests from the input queue and services them using the TLBs, the page walk caches, and/or main memory, by issuing page walk memory accesses.
After a translation request has been serviced, the TLB hierarchy forwards the request to the cache hierarchy via its
cache link. After the cache hierarchy completes the request, the request is sent back to the TLB hierarchy, which
simply forwards it back to the CPU without any additional action.
Organizing TLBUnits to Form a Hierarchy
Data member coreID
is the core ID that the TLB hierarchy is attached to. Data member levels
is the number of
levels in the TLB hierarchy, which must be a non-negative number. levels
being zero means that there is no TLB,
and hence no caching of translation results. Every memory operation in this case will trigger a page walk by directly
enqueuing the request into the queue of the page walker (class PageTableWalker
object).
On initialization, data member TLB_CACHE
, which is an array of class TLBUnit
objects, is initialized by creating
TLB objects and putting them into the array. The first-level TLB is at TLB_CACHE[1] rather than TLB_CACHE[0] (i.e., element zero is unused, which keeps the indexing consistent with level numbers starting at one).
As mentioned earlier, class TLBUnit
objects are chained together using pointers, such that requests missing one
level can be forwarded to the next level. This is achieved by passing the pointer of the previous level TLB to the
constructor of the current level.
In addition, TLB units are also connected backwards by storing the pointers to the pushed_back
and pushed_back_size
arrays of the previous level in the current level, as data members service_back and service_back_size.
When a translation request is completed, it is passed back up the hierarchy by enqueuing it into these service_back and service_back_size arrays, which are then processed by the previous level.
In other words, translation requests are first passed down the hierarchy by calling push_request()
on the next level,
if the request misses the current level. When the request is serviced, it will be passed up the hierarchy by pushing
the completed request into service_back
and service_back_size
of the previous level.
The service_back
and service_back_size
of the L1 TLB (or the page walker, if levels
equals zero) are set to data members mem_reqs and mem_reqs_sizes of class TLBHierarchy, such that a completed request is finally delivered by the L1 TLB into these two arrays.
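The wiring of the levels can be sketched as follows. The classes here are simplified stand-ins (not the real TLB/TLBHierarchy declarations); only the pointer and buffer relationships described above are shown.

```cpp
#include <cstdint>
#include <vector>

struct Event;    // opaque translation request

// Simplified stand-in for one TLB level, showing only the chaining members.
struct TlbLevel {
    int level;
    TlbLevel* next_level = nullptr;               // downward: toward the page walker
    std::vector<Event*>   pushed_back;            // responses arriving from below
    std::vector<uint64_t> pushed_back_size;
    std::vector<Event*>*   service_back      = nullptr;  // previous level's pushed_back
    std::vector<uint64_t>* service_back_size = nullptr;  // previous level's pushed_back_size

    explicit TlbLevel(int lvl) : level(lvl) {}
};

// Build a two-level hierarchy the way TLBHierarchy does: the L1 TLB lives at
// index 1, and its service_back points at the hierarchy's mem_reqs arrays.
inline std::vector<TlbLevel*> build_hierarchy(std::vector<Event*>* mem_reqs,
                                              std::vector<uint64_t>* mem_reqs_sizes) {
    std::vector<TlbLevel*> TLB_CACHE(3, nullptr);        // index 0 intentionally unused
    TLB_CACHE[1] = new TlbLevel(1);
    TLB_CACHE[2] = new TlbLevel(2);

    TLB_CACHE[1]->next_level = TLB_CACHE[2];             // requests flow down
    TLB_CACHE[2]->service_back      = &TLB_CACHE[1]->pushed_back;      // responses flow up
    TLB_CACHE[2]->service_back_size = &TLB_CACHE[1]->pushed_back_size;
    TLB_CACHE[1]->service_back      = mem_reqs;          // L1 delivers to the hierarchy
    TLB_CACHE[1]->service_back_size = mem_reqs_sizes;
    return TLB_CACHE;
}
```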
The class TLBHierarchy
object also contains the page table walker, as a class PageTableWalker
object, the reference
of which is stored in data member PTW
. The page table walker is the last level of the hierarchy that handles a
translation request, and it honors the same event passing mechanism as the TLB units.
To elaborate: The page table walker is registered with the last level of the hierarchy into its PTW
data member
(and in this case the next_level
will be NULL
). This is done during the initialization of the class TLBHierarchy
object. Translation memory events that miss the last TLB level are passed to the page table walker by calling push_request()
on the walker object.
When the request is serviced, it will be passed from the walker back to the last level of the hierarchy, using the
service_back
and service_back_size
array pointers, which point to the pushed_back
and pushed_back_size
arrays
either of the last level TLB, or of the TLB hierarchy object. The latter only happens when there is no TLB such that
the page table walker directly communicates with the TLB hierarchy object.
Note that no link object is involved in the hierarchy, possibly as an optimization to reduce the pressure on the event
queue.
Two links are maintained in the class TLBHierarchy
object, to_cache
and to_cpu
. These two links are not
initialized in the object itself. Instead, the class Samba
object configures the links and passes them to the TLB hierarchy object.
The to_cpu
link is configured in the file as cpu_to_mmu%d
(%d
stands for the core ID; the same notation is used below), and the receiving callback is registered as handleEvent_CPU(). This function is invoked when the CPU requests an address translation; it simply records the time of the event in the time_tracker structure, and forwards the request to the L1 TLB to initiate the translation process.
The to_cache
link is configured in the file as mmu_to_cache%d
, and the receiving call back is handleEvent_CACHE()
.
This function just relays the event to the CPU by calling send()
on the to_cpu
link.
TLB Hierarchy Event Handling
The event handling function in class TLBHierarchy
is tick()
, and the logic is rather straightforward.
If hold
is true
, which is set by the page table walker when a page fault is being serviced, the entire hierarchy
is blocked from accepting new requests, and only the page table walker can make progress. In this case, only
the page table walker ticks, and the function returns.
Otherwise, each level of the TLB ticks, by calling tick()
function on each of them, and the page table walker
also ticks.
Next, the TLB hierarchy handles completed translation requests that are passed up the hierarchy and pushed into its
internal arrays, namely, mem_reqs
and mem_reqs_sizes
.
For each completed translation request, if the emulate_faults option is turned on (meaning that the physical address of the translation is provided by the page fault handler), the physical address of the memory event object is
set by calling setAddr()
and setBaseAddr()
on it. The physical address is obtained from the address mapping
structure, PTE
, which is defined in class Samba
, and maintained by the page table walker’s page fault handler
(in fact, the physical addresses of all levels of the page table are maintained for page walk purposes, as discussed in the page walker section).
Then, the request's service time is computed as the difference between the current cycle and the start cycle recorded in time_tracker
(recall that the start cycle is recorded in handleEvent_CPU()
),
followed by a statistics update.
The event is eventually sent to the cache hierarchy by calling send()
on the to_cache
link, before it is removed
from the mem_reqs
and mem_reqs_sizes
queue.
Note that, in this process, the mem_reqs_sizes
queue is ignored. The sole purpose of the queue is to keep the
interface consistent.
The Page Table Walker
class PageTableWalker
defines the Intel-compatible, four-level page table walker object, which is the last level
of the TLB hierarchy. Translation requests that miss all levels of the hierarchy will be passed to this object, and
a page walk will be performed.
This class also implements the page walk caches, which cache page table entries from all levels (PGD, PUD, PMD, PTE) and reduce the number of memory accesses when a walk hits them.
Linking to Other Components
class PageTableWalker
contains two link objects, namely, to_mem
and s_EventChan
.
As with the case of class TLBHierarchy
, both links are configured within class Samba
.
The to_mem
link is configured with port name ptw_to_mem%d
, and the receiving end is registered to recvResp()
.
This link serves as the channel for issuing memory operations and receiving results during a page table walk, but
a simpler page walk model is also allowed, in which the table walker is not connected to the memory hierarchy, and
memory accesses during page walks are just simulated with a fixed latency.
In the latter case, self_connected
must be set in the configuration file, and page_walk_latency
must also be
set to represent the fixed latency of a single memory operation (in nanoseconds).
The link will then be configured as a self link, i.e., when an event object is sent over to_mem, the page table walker effectively calls its own handler function, recvResp(), after a delay.
As for s_EventChan
, it is merely a self link with 1 nanosecond link latency, and the receiving end is registered with
handleEvent()
. Every send()
operation on s_EventChan
will cause handleEvent()
to be invoked with the event
object being sent, so it is just a delayed function call.
The s_EventChan
is for handling page faults with an external page fault handler.
If emulate_faults
is turned off, this link will not be initialized, and will not be used.
Page Walk Cache
The page walker implements a page walk cache that stores intermediate results and PTE entries of previous page walks.
The page walk cache only has one level, and every supported page size has its own page walk cache.
Currently, the number of page sizes is hardwired to four, namely, 4KB, 2MB, 1GB, and 512GB, which correspond to pages mapped at the PTE, PMD, PUD, and PGD levels, respectively (although the last is not supported by real hardware).
Data member page_size
stores the hardwired page sizes at all page table levels, with element zero being the
smallest page (4KB), and element three being the largest (512GB).
The size and associativity of page walk caches are specified in the configuration file, using keys
size%d_PTWC
and assoc%d_PTWC
, respectively.
Data member size
, assoc
, and sets
store these parameters (sets
is derived from the previous two).
The tag
, valid
, and lru
arrays are implemented in the same manner as in the TLB, which has already been described above.
Access functions for the page walk cache are also defined in exactly the same manner, i.e., insert_way()
to insert
an address tag, invalidate()
to remove an entry, check_hit()
to perform lookups, and
find_victim_way()/update_lru()
for LRU eviction and update.
Page Fault Handling (w/ non-confined PTW)
Page Fault Handler
The page walker emulates page faults, if emulate_faults
is set in the configuration file. The page walker implements page faults as code paths separate from the normal page walk request handling, which greatly simplifies the structure of the logic.
The entry point of the page fault related logic is the link s_EventChan, which is initialized by class Samba, together with its callback handleEvent(). All page fault related events are routed to this function, which maintains a state machine tracking the current status of page fault handling.
Additionally, a class PageFaultHandler
object is initialized and added to the page table walker object.
Although the handler is not fully implemented yet in the current distribution of SST, its duty in page fault handling
is to mimic an OS memory allocator that maintains a pool of available physical pages, and allocates these pages on
page faults for page table pages and data pages.
It does not seem that timing of the page fault is simulated, but this is acceptable, since the actual page fault
handling runs as an OS thread, the timing of which is already modeled by just simulating the instruction stream.
The major role of the page fault handler, therefore, is to simulate the OS’s memory allocator such that the page table
pages and data pages in the simulator can be assigned physical addresses. These addresses are tracked by multiple page
table related structures, namely, data members PGD, PUD, PMD, and PTE. These structures are just std::maps that map the virtual address bits at that level (i.e., the upper bits of the VA) to the physical address of the next level's table page (or of the data page, for PTE). New entries are inserted into these maps when the page fault handler returns an allocated physical page on a page fault event, and the maps are queried during a hardware page walk to generate the physical addresses of the page table entries at the corresponding level.
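The shape of these structures, and the way the fault path fills them while the walk path reads them, can be sketched as follows. The key widths follow the usual x86-64 four-level split, and the level encoding (0 = PTE up to 3 = PGD) is an assumption for illustration.

```cpp
#include <cstdint>
#include <map>

// Per-level shadow of the page table maintained for address generation.
// Keys are the upper VA bits at each level; values are physical page addresses.
struct PageTableShadow {
    std::map<uint64_t, uint64_t> PGD;   // VA bits 39..47 -> PUD page
    std::map<uint64_t, uint64_t> PUD;   // VA bits 30..47 -> PMD page
    std::map<uint64_t, uint64_t> PMD;   // VA bits 21..47 -> PTE page
    std::map<uint64_t, uint64_t> PTE;   // VA bits 12..47 -> data page

    // Fault path: record the physical page the emulated allocator returned
    // for the given level of the walk (3 = PGD ... 0 = PTE).
    void record(int level, uint64_t va, uint64_t phys_page) {
        switch (level) {
            case 3: PGD[va >> 39] = phys_page; break;
            case 2: PUD[va >> 30] = phys_page; break;
            case 1: PMD[va >> 21] = phys_page; break;
            case 0: PTE[va >> 12] = phys_page; break;
        }
    }

    // Walk path: is the final translation (data page) known for this VA?
    bool translated(uint64_t va) const { return PTE.count(va >> 12) != 0; }
};
```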
The page fault handler defines an interface object, class PageFaultHandlerInterface
, which is just a standard
functor object that, when called, invokes a registered member function on a registered object instance.
Handler registration is performed by calling registerPageFaultHandler()
, which is done in class Samba
’s constructor.
Each page table walker object has its member function recvPageFaultResp()
registered, with a
class PageFaultHandlerPacket
object as the argument, specifying the action, virtual address, physical address, and page size class of the handler event.
Page Fault Handling Control Flow
We only discuss the non-confined page table walker case (controlled by data member ptw_confined
, which is set from the configuration key of the same name), which is the default. The confined page table walker seems to lack certain elements needed to work properly (e.g., it does not appear to stall the page table walker anywhere, yet it unstalls it when a page fault is handled).
Page faults are triggered in the main event loop, tick(). For every page walk request in the not_serviced queue,
the requested virtual address is checked against three structures that track the mapped pages, namely,
MAPPED_PAGE_SIZE4KB
, MAPPED_PAGE_SIZE2MB
, and MAPPED_PAGE_SIZE1GB
. These three structures are std::map
objects
implementing the set semantics (I have no idea why the author does not just use std::set
).
Addresses that already have their page fault handled will be entered into these three structures (which happens in
the page fault handler when a fault is fully handled), such
that the simulator just assumes that no further page fault will ever happen for these addresses, and that the physical page they have been mapped to will not change from then on. Recall that the fault simulation is mainly to assign physical
addresses to virtual addresses for page walk address generation, and it does not care about timing, so this assumption
is reasonable. Moreover, when the simulated OS changes the mapping (most likely followed by a shootdown),
the simulator can be instructed to erase certain mappings from these structures, which will cause a new
physical address to be assigned on the next translation request for these entries.
If the requested translation is not tracked by any of the three structures, then a page fault is signaled.
The fault address is tracked by data member stall_addr
, and the fault address is added to the structure tracking pending page faults, namely, PENDING_PAGE_FAULTS, if it is not already present.
If there is no pending fault on the same address, the page fault is delivered via the link s_EventChan
, using a
class SambaEvent
of type EventType::PAGE_FAULT
.
As the last step, data member stall
and hold
are both set, indicating that a page fault is being handled.
During page fault handling, normal page walk requests will not be processed, and upper level TLBs will also stop
processing translation requests. Note that stall
and hold
are always set or cleared together. The difference
is that stall
is a data member of class PageTableWalker
, which is only read by the table walker itself,
while hold
is defined as a data member in class TLBHierarchy
, and the hierarchy will prevent all TLBs
from processing requests, if hold
is set to 1
. The page walker holds a reference to hold
in order to
notify the upper level that a fault is being handled.
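The fault-trigger check can be summarized by the following schematic helper. It is a hypothetical stand-in for the corresponding logic in tick(): the member names mirror the text, but the exact keys stored in the sets (full addresses vs. page numbers) are an assumption here.

```cpp
#include <cstdint>
#include <set>

// Shared and per-walker state involved in triggering an emulated page fault.
struct FaultState {
    std::set<uint64_t>* MAPPED_PAGE_SIZE4KB;   // shared across all walkers
    std::set<uint64_t>* MAPPED_PAGE_SIZE2MB;
    std::set<uint64_t>* MAPPED_PAGE_SIZE1GB;
    std::set<uint64_t>* PENDING_PAGE_FAULTS;
    uint64_t stall_addr = 0;
    bool stall = false;        // read only by this walker
    bool* hold = nullptr;      // owned by TLBHierarchy; freezes the upper TLBs
};

// Returns true if the walk request must wait for an emulated page fault.
// send_fault_event is set when this walker should send the PAGE_FAULT event
// over s_EventChan (i.e., no other walker is already handling the same page).
inline bool maybe_fault(FaultState& f, uint64_t vaddr, bool& send_fault_event) {
    const uint64_t p4k = vaddr >> 12, p2m = vaddr >> 21, p1g = vaddr >> 30;
    if (f.MAPPED_PAGE_SIZE4KB->count(p4k) || f.MAPPED_PAGE_SIZE2MB->count(p2m) ||
        f.MAPPED_PAGE_SIZE1GB->count(p1g)) {
        send_fault_event = false;
        return false;                              // already mapped: no fault
    }
    f.stall_addr = vaddr;
    send_fault_event = f.PENDING_PAGE_FAULTS->insert(p4k).second;  // first faulting walker only
    f.stall = true;                                // stop this walker
    if (f.hold) *f.hold = true;                    // stop the upper-level TLBs
    return true;
}
```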
Also note that the tracking structures, such as PENDING_PAGE_FAULTS
, MAPPED_PAGE_SIZExxx
, and the per-level
page table page’s physical address tracking structures (PGD
, PUD
, PMD
, and PTE
),
are shared across all page walker instances, and hence all walkers always see the same virtual-to-physical mapping.
Synchronization is not a problem, as the simulation is already synchronous and processes only one event at a time within a cycle.
One implication is that, although it may seem that PENDING_PAGE_FAULTS can never hold more than one pending entry (since a page walker handles only one fault at a time), another page walker may in fact already be handling a fault on the same address, and hence checking PENDING_PAGE_FAULTS is necessary to avoid allocating multiple physical pages to the same virtual address.
The page fault object sent via s_EventChan
will invoke handleEvent()
at a later cycle. On receiving the event,
the function will first determine the level of the fault by querying the fault address in the per-level tracking
structures, namely, PGD
, PUD
, PMD
, and PTE
. The fault is handled starting from the highest level at which the page table page is not yet mapped, and physical pages are allocated both for the page table pages themselves and for the data page.
After determining the level, which is stored in data member fault_level
, the page fault handler’s
allocatePage()
is called to grab an unused physical page.
Although in the current distribution of SST, the allocatePage()
function is not implemented, it is supposed to
maintain a pool of free physical pages, and just allocate one of them to the page fault handler.
The response event is supposed to be sent by calling
the handler registered to its data member pageFaultHandlerInterface
with the newly allocated physical address,
using a class PageFaultHandlerPacket
object and setting the action to
PageFaultHandler::PageFaultHandlerAction::RESPONSE
.
The response message is received by invoking the page table walker's function recvPageFaultResp() (the registered callback). That function simply forwards the response, in the form of a class SambaEvent object of type EventType::PAGE_FAULT_RESPONSE, over s_EventChan.
On receiving the response event, handleEvent() updates the per-level tracking structure with the allocated physical page, meaning that the page table page now has a physical page and addresses can be generated for later page walks that traverse it.
The function also decrements fault_level, if it is non-zero, and calls allocatePage() again, in order to allocate a physical page for the next level of the page table.
This process continues until all nodes on the radix-tree branch have been allocated a physical page, at which point fault_level becomes zero and page fault handling concludes.
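The level-by-level allocation can be collapsed into a plain loop for clarity, as in the sketch below. The real code performs this as an event ping-pong between handleEvent() and recvPageFaultResp(); the table layout, shift values, and allocator callback here are illustrative assumptions.

```cpp
#include <cstdint>
#include <functional>
#include <map>

// fault_level counts down from the highest unmapped table level toward the
// PTE: 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE (assumed encoding).
inline void serve_fault(uint64_t va, int fault_level,
                        const std::function<uint64_t()>& allocate_page,
                        std::map<uint64_t, uint64_t>* tables[4],   // {PTE, PMD, PUD, PGD}
                        std::map<uint64_t, bool>& MAPPED_PAGE_SIZE4KB) {
    static const int shift[4] = {12, 21, 30, 39};        // VA bits keying each level
    for (;;) {
        uint64_t phys = allocate_page();                 // emulated OS allocator
        (*tables[fault_level])[va >> shift[fault_level]] = phys;
        if (fault_level == 0) break;                     // data page mapped: done
        --fault_level;                                   // descend to the next level
    }
    // PAGE_FAULT_SERVED: future walks to this page will no longer fault.
    MAPPED_PAGE_SIZE4KB[va >> 12] = true;
}
```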
A class SambaEvent
of type EventType::PAGE_FAULT_SERVED
is then sent to s_EventChan
(i.e., it is delivered back to the same handler that sent it); on receiving it, the page address is entered into MAPPED_PAGE_SIZE4KB
, and
removed from PENDING_PAGE_FAULTS
.
Future page walk requests to the same address will not incur a page fault, as MAPPED_PAGE_SIZE4KB will indicate that the address is already mapped.
In addition, the page table walker will unstall on the next tick, since the address is no longer in PENDING_PAGE_FAULTS
.
This is checked at the beginning of tick()
, and, if the address is truly missing from PENDING_PAGE_FAULTS
,
both stall
and hold
will be cleared, allowing page walk requests and translation requests to be served by
the page table walker and upper level TLBs, respectively.
If emulate_faults
is turned off, none of the physical address tracking structures are used. In this case, page walks have no reliable source from which to generate the physical address of each step of the walk, and thus simply use random numbers as the page numbers accessed during the walk.
Page Table Walk Event Handling
The main page table walk logic is implemented in function tick()
and in the callback function of the link connecting the page table walker with memory, recvResp()
.
The first thing that tick()
does is to check whether the current page table walker is stalled in a page fault, and if
true, then no page table walk is performed, and the function waits for fault handling to complete.
Otherwise, page walk requests are fetched from not_serviced
queue (into which the last-level TLB enqueues requests by calling push_request() on the page walker) and examined for processing.
The page table walker also implements a maximum number of walks per cycle with local variable dispatched
, which is
incremented for every request serviced in the current cycle. If this value exceeds the maximum, which is specified by max_width_PTWC in the configuration and stored in data member max_width, no further requests are processed in that cycle.
Then the page fault logic checks whether the page walk will induce a fault. Note that this deviates from how a real page table walker works, where a page fault is only triggered when the walk encounters an invalid entry at some level.
On a page fault, as discussed earlier, the handling logic blocks the walker and all upper-level TLBs until fault handling completes.
If not, the next step is to check the page walk caches. All caches are assumed to be checked in parallel in the same cycle, and hit priority is given to lower levels (i.e., PTE > PMD > PUD > PGD), since a lower-level hit in the page table leads to a shorter walk path.
If the PTE page walk cache is hit, indicating that the translation entry is already in the cache (k == 0
),
then no page walk is needed, and the request can be fulfilled either at the current cycle (if parallel mode is on,
which is typical for VIPT L1 caches), or latency
cycles after the current cycle.
In either case, the event object representing the request is moved to ready_by
queue, and the corresponding
ready_by_size
queue is populated with the system page size.
Note that only one system page size is supported for now, which is specified using os_page_size
in the configuration
file (unit is KB, so valid values are 4
, 2048
, and 1024 * 1024
), despite the fact that upper level TLBs may
have several size classes.
On the other hand, if all of the page walk caches signal misses, or one of them hits but at a level higher than the one implied by the system page size, then a page walk is still needed to retrieve the missing levels from the in-memory page table. The number of levels to fetch is computed from the system page size and stored in the local variable k.
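A sketch of this computation is shown below. The encoding of levels (0 = PTE up to 3 = PGD) and the way the page-walk-cache hit level is reported are assumptions made for illustration; only the relationship between the OS page size, the hit level, and the remaining number of memory accesses is taken from the description above.

```cpp
#include <cstdint>

// Level at which the walk terminates for the configured OS page size
// (os_page_size is given in KB in the configuration): 0 = PTE, 1 = PMD, 2 = PUD.
inline int terminating_level(uint64_t os_page_size_kb) {
    if (os_page_size_kb == 4)    return 0;   // 4KB pages: walk down to the PTE
    if (os_page_size_kb == 2048) return 1;   // 2MB pages: stop at the PMD
    return 2;                                // 1024 * 1024 KB, i.e., 1GB pages: stop at the PUD
}

// Remaining walk steps k, given the deepest page-walk-cache level that hit
// (pwc_hit_level = 0..3 for PTE..PGD, or 4 if every cache missed).
inline int remaining_steps(uint64_t os_page_size_kb, int pwc_hit_level) {
    const int stop = terminating_level(os_page_size_kb);
    if (pwc_hit_level <= stop) return 0;     // cached at or below the target level
    return pwc_hit_level - stop;             // one memory access per missing level
}
```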
The handler then attempts to add the walk request into the pending request queue, pending_misses
, if the
maximum number of outstanding requests has not been reached (data member max_outstanding
, specified with
max_outstanding_PTWC
in the configuration file). If the limit is reached, then the request remains in the
not_serviced
queue, and will be re-handled in the next cycle.
If the request is successfully added to the pending_misses
queue, the handler will then proceed to model the timing
of the page walk. In the simpler case, if the page walker is not connected to a memory object (self_connected
is set
in the configuration file, causing data member to_mem
to be set as NULL), page walks are handled as a fixed latency
event, and the result will be immediately added to the ready_by
array with a delay of
latency + 2 * upper_link_latency + page_walk_latency
cycles. The page walk latency can be configured using
page_walk_latency
in the configuration file, and defaults to 50 cycles.
In the more involved case, where the page table walker is actually connected to a memory object, the walk is simulated by issuing memory accesses one after another, beginning from the lowest level that hit the page walk cache, until the required number of steps (stored in local variable k) has been performed.
Since page walks require a multi-round message exchange between the page table walker and the memory system, the page table walker uses state machines to track the status of in-progress walk instances across calls to tick() and the memory callback function, recvResp()
.
Each page walker instance is assigned a globally unique ID by incrementing data member mmu_id
, and the corresponding request event object is then added to the queue WID_EV
.
Four extra mapping structures track additional information of the page walk’s status:
WSR_COUNT
tracks the remaining number of steps to walk down the page table;
WID_Add
tracks the virtual address of the walk;
WSR_READY
tracks whether the page table walk has completed using boolean flags (note that this structure is
only set, but not used anywhere).
All three of them use the page walk instance ID as the key.
MEM_REQ
maps the memory access request (not the translation request!) event ID to the globally unique walk instance ID.
This map is used to recover the page walk instance ID from the event object when it is fulfilled by the memory,
and the instance ID is then used to retrieve the status of the current walk.
The initial memory address to be accessed for the current step of the page walk is either generated as a random number,
if page fault emulation (i.e., allocation of physical pages for page table pages) is turned off,
or computed as the address of one of the PGD entries, selected by bits 39 to 47 of the requested virtual address.
Note that this seems inconsistent with how an actual page walk behaves: a real walk would resume from the lowest level whose page walk cache was hit, but here the traversal always begins from the top level (i.e., the PGD).
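A sketch of this address generation is shown below. The 8-byte entry size and the exact use of CR3 are assumptions in line with a standard x86-64 page table layout; the real Samba code may compute the address differently.

```cpp
#include <cstdint>
#include <cstdlib>

// Physical address of the PGD entry to fetch for the first step of a walk.
// With fault emulation off there is no tracked PGD page, so a random page
// number is used instead (mirroring the behavior described above).
inline uint64_t pgd_entry_address(uint64_t va, uint64_t cr3, bool emulate_faults) {
    const uint64_t pgd_index = (va >> 39) & 0x1FF;       // VA bits 39..47
    if (!emulate_faults) {
        const uint64_t random_page = (uint64_t)std::rand();
        return (random_page << 12) | (pgd_index * 8);    // arbitrary page, real index
    }
    return cr3 + pgd_index * 8;        // CR3 holds the physical address of the PGD page
}
```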
As the last step, a memory access request is constructed as a class MemRequest
object with both the requested
virtual address, and the address to be accessed.
The memory access request’s ID is entered into MEM_REQ
, and the request itself is sent to the memory for
fulfillment by calling send()
on data member to_mem
.
Once the memory request is fulfilled, the call back function, recvResp()
, will be invoked.
The page walk instance ID is first fetched from the mapping structure MEM_REQ
, using the request ID stored in the
response event object (by calling getResponseToID()
), and stored in a local variable pw_id
.
The virtual address, which is tracked by WID_Add
, is first inserted, by calling insert_way()
, into the page
walk cache of the current level (tracked by WSR_COUNT
), after evicting an existing entry from the structure by
calling find_victim_way()
.
Then, if the page walk has reached the lowest level, indicated by the WSR_COUNT value for the instance ID being zero, the page walk concludes, and the translation request is added to the ready_by
and ready_by_size
queue with an extra latency of latency + 2 * upper_link_latency
to model the page walk cache lookup and link delay.
Otherwise, more memory accesses need to be made to read lower levels of the page table.
The address generation process is similar to the one in tick()
, with the exception that the address is read from
different physical address tracking structures (PGD
, PUD
, PMD
, and PTE
) according to the value of WSR_COUNT
.
A new class MemEvent
object is allocated to convey the next level access, and sent via the link by calling
send()
on to_mem
link.
The WSR_COUNT
value of the walk instance is also decremented by one, and the MEM_REQ
mapping is also updated to
reflect the most recent memory access (instance ID does not change, but the event object is the newly allocated one).
This process will be repeated several times, until the page walk reaches the terminating level with regard to the
configured OS page size.
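The per-walk bookkeeping used across tick() and recvResp() can be summarized with the sketch below. It is a simplified stand-in for the real code: the map names mirror the text, while the ID generation and the omitted page-walk-cache update are illustrative.

```cpp
#include <cstdint>
#include <map>

// Per-walk state, keyed by the walk instance ID (see WSR_COUNT, WID_Add,
// MEM_REQ in the text; WSR_READY is omitted since it is set but never read).
struct WalkTables {
    std::map<int, int>      WSR_COUNT;   // walk id -> remaining table levels
    std::map<int, uint64_t> WID_Add;     // walk id -> virtual address being walked
    std::map<uint64_t, int> MEM_REQ;     // outstanding memory request id -> walk id
};

// One recvResp() step: returns true when the walk for this response is done;
// otherwise next_req_id_out receives the ID of the follow-up memory request.
inline bool on_memory_response(WalkTables& w, uint64_t resp_req_id,
                               uint64_t& next_req_id_out) {
    const int walk_id = w.MEM_REQ[resp_req_id];
    w.MEM_REQ.erase(resp_req_id);

    // (The real code also inserts WID_Add[walk_id] into the page walk cache of
    //  the current level here, using find_victim_way()/insert_way().)

    if (w.WSR_COUNT[walk_id] == 0)
        return true;                      // lowest level reached: move to ready_by

    --w.WSR_COUNT[walk_id];               // descend one level of the page table
    static uint64_t next_id = 1;          // stand-in for a fresh MemEvent ID
    next_req_id_out = next_id++;
    w.MEM_REQ[next_req_id_out] = walk_id; // same walk, new outstanding request
    return false;
}
```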
We now return to the event handling loop in tick()
. After dealing with the memory access of page walks, the
walk request is removed from the not_serviced
queue. Pending page walks still remain in the pending_misses
queue until they are moved from ready_by
queue to the last level TLB’s response message queue.
The last step of event handling is to check completed requests in the ready_by
queue, and move
them into the last level TLB’s response queue, if the current tick cycle is no less than the completion cycle
of the request.
This is done by scanning the ready_by
queue, comparing the completion cycle of each request (the value stored in the map); for every request whose completion cycle is no greater than the current tick cycle, the requested virtual address is inserted into the PTE page walk cache (index zero), if not already present, after which the LRU state is also updated.
Both the request and the size (in ready_by_size
queue) are inserted into the upper level’s response queue,
service_back
and service_back_size
. The event is then removed from pending_misses
as well.
The code also performs a sanity check on whether the physical address of the PTE corresponding to the requested virtual address is present in the physical address tracking structure, PTE. If it is not, something has gone seriously wrong, because the entry is supposed to be mapped after the first emulated page fault, which must happen before the walk request completes.
Samba
class Samba
is merely a container class that implements no concrete functionality other than initialization and event propagation. It instantiates the full multi-core, multi-size-class TLB hierarchy, with page walkers and communication links for each core.
The topology on each core must be identical, meaning that only homogeneous TLB configuration can be supported.
Samba maintains the global physical address tracking structures that map virtual page numbers and page table pages at all levels to physical addresses.
As mentioned earlier, these structures mimic the OS physical page allocator and are used to generate the physical addresses for cache accesses and for page table walks.
Samba assumes that all simulated TLBs operate under the same address space, and hence only one copy of the
mapping structures is maintained.
Each TLB instance holds pointers to the shared mapping structures
(which are registered during initialization by calling setPageTablePointers()
on both class TLBHierarchy
and class PageTableWalker
), and they access the structures synchronously in the timer call back, tick()
, or
link object call backs.
Operation-level contention is impossible, as accesses will not overlap (tick()
and link call backs are executed
serially). Semantic-level contention may happen when, for example, multiple page table walkers fault on the same virtual address, and each of them could request a physical page without noticing the others' faults.
This is handled by recording the ongoing page fault in the shared structure PENDING_PAGE_FAULTS, and checking that structure every time a fault is about to be started on a page table walker.
The process has already been described in the relevant sections, and we do not cover details here.
Context switches are not modeled, but can be easily added by flushing the content of TLB arrays, page walk caches, and the tracking structures. Fine-grained shootdown is also possible by only removing entries that match the requested virtual address, but it requires collaboration from other parts of the simulator, since TLB shootdown is an OS-level task, which happens on page table updates. The TLB simulator itself hence has no way to know when an entry is no longer valid.
Data member CR3
stores the physical address of the PGD page, i.e., the root of the four-level page table.
Data members PGD, PUD, and PMD map from the upper bits of the virtual address, i.e., bits 39 – 47, bits 30 – 47, and bits 21 – 47, respectively (the PTE map, described below, is keyed by bits 12 – 47, i.e., the full virtual page number).
Caveat: the confined and non-confined page table walkers seem to differ in these index bits, but I could not decipher the purpose of the different bit indexing schemes due to the lack of documentation.
Data member PTE
maps virtual page numbers to physical page numbers, and is used in class TLBHierarchy
to generate
the physical address for memory access when translation completes.
All these structures are updated in class PageTableWalker
, with the aid of class PageFaultHandler
, which mimics
an OS physical page allocator.
Pages that are already mapped (i.e., have an associated physical address) are tracked by data members MAPPED_PAGE_SIZE4KB, MAPPED_PAGE_SIZE2MB, and MAPPED_PAGE_SIZE1GB. These structures are used to determine whether a page fault is needed; the fault handling simply emulates the allocation of physical pages and enters the virtual-to-physical address mapping into one of the three structures.
On initialization, Samba creates the per-core TLB hierarchy after reading the essential configurations.
Link objects that connect the per-core hierarchy with other components are also created in the initialization routine,
and then passed to the corresponding Samba components.
It configures the per-core TLB-to-CPU link, the per-core TLB-to-cache-hierarchy link, the per-core page table walker-to-memory (or to one of the caches in the hierarchy) link, if page walk memory accesses are to be simulated (controlled by self_connected in the configuration file), and the loopback link of the page table walker used for delivering page fault events.
At the end of initialization, Samba registers itself with the global clock of the discrete event simulation.
If the frequency is not specified, it will default to 1GHz
.
The tick()
function is very simple: it just calls the tick() of each class TLBHierarchy object.