Understanding Cache System Simulation in zSim
Introduction
zSim is a fast processor simulator used for modeling the bahavior of memory subsystems. zSim is based on PIN, a binary instrumention tool that allows programmers to instrument instructions at run time and insert customized function calls. The simulated application is executed on native hardware, with zSim imposing limited control only at certain points during the execution (e.g. special instructions, basic block boundaries, etc.). In this article, we will not provide detailed description on how zSim works on the instrumentation level. Instead, we isolate the implementation of zSim cache subsystems from the other parts of the simulator, and elaborate on how the simulated cache is architected and how timing is computed. For more information about zSim, please refer to the original paper which gives a rather detailed description of the simulator’s big picture. And also, it is always best practice to read the official source code when things are unclear. In the following sections, we will use the official source code github repo given above as the reference when source code is discussed.
Source Files
Below is a table of source code files under the /src/ directory of the project that we will be talking about. For each file
we also list its important classes and declarations for reader’s convenience. One thing that worth noting is that a file is not
always named using the name of the class defined within that file. If you are unsure where a class is defined, simply doing
a grep -r "class ClsName"
or grep -r "struct ClsName"
will suffice for most of the time.
File Name (only showing headers) | Important Modules/Declarations |
---|---|
memory_hierarchy.h | Coherence messages and states declaration, BaseCache, MemObject, MemReq |
cache_arrays.h | Tag array organization and operations |
cache.h | Actual implementation of the cache class, and cache operations |
coherence_ctrls.h | MESI coherence state machine and actions |
repl_policies.h | Cache replacement policy implementations |
hash.h | Hash function defined to map block address to sets and to LLC partitions |
init.h | Cache hierarchy and parameter initialization |
Note that zSim actually provides several implementations of caches, which can be selected by editing the configuration file.
The most basic cache implementation is in cache.h, and it defines the basic timing and operation of a working cache, and
no more. A more detailed implementation, called TimingCache
, is also available, which adds a weave phase timing model
to simulate cache tag contention (zSim simulates shared resource contention in a separate phase after running the simulated
program for a short interval, assuming that path-altering interferences are rare). In this article, we focus on the functionality
and architecture of the cache subsystem, rather than detailed timing model and discrete event simulation. To this end, we
only discuss the basic cache model, and leave the discussion of timing cache to future works.
Cache Interface
In this section we discuss cache subsystem interfaces. In zSim, all memory objects, including cache and memory, must inherit
from the virtual base class, MemObject
, which features only one neat interface, access()
. The access()
call takes
one MemReq
object as argument, which contains all arguments for the memory request. The return value of the base cache
access()
call is the finish time of the operation, assuming no contention (if contention is not simulated, then it is
the actual completion time of the operation, as in our case).
Cache objects also inherit from the base class, BaseCache
, which itself inherits from MemObject
, and defines another
interface call, invalidate()
. This function call does not take MemReq
as argument, but instead, it takes the address
of the line, the invalidation type, and a boolean flag pointer to indicate to the caller whether the invalidated line is
dirty (and hence a write back to lower level cache is required). Note that in zSim, the invalidate()
call only invalidiates
the block in the current cache object, and indicates to the caller via the boolean flag whether a write back is induced
by the invalidation. It is therefore the caller’s responsibility to write back the dirty block to the lower level using
a PUTX transaction, as we will see below. The return value of invalidate()
is also the response time of the operation.
Overall, the cache object interface in zSim is rather simple: access()
implements how the cache handles reads and
writes from upper levels. invalidate()
implements how the cache handles invalidations from the processor or lower levels
(depending on the position of the cache object in the hierarchy). Both methods are blocking: A begin time (absolute cycle
number) is taken as part of the argument, and a finish time is returned as the cycle the operation completes.
MemReq object
The MemReq
object is used in two scenarios. First, an external component (e.g. the simulated processor) may issue a
memory request to the hierarchy by creating a MemReq
object before it calls the access()
method (in zSim, we have an
extra level of FilterCache
, in which case the FilterCache
object issues the request). The caller of access()
function
needs to pass the address to be accessed (lineAddr
field) and the begin cycle (cycle
field) of the cache access. The
type of the access in terms of coherence is also specified by initializing the type
field. In this scenario, no coherence
state in involved, and the access type can only be GETS
or GETX
. In the second scenario, an upper level cache may
issue request to lower level caches to fulfill a request processed on the upper level. For example, when an upper level
cache evicts a dirty block to the lower level, it must initiate a cache write transaction by creating a MemReq
object
and making it a PUTX
request. In addition, when a request misses the upper level cache, a MemReq
object must be
created to fetch the block from lower level caches or degrade coherence states in other caches. The process can be
conducted recursively until the request reaches a cache that holds this block or has full ownership, potentially reaching
beyond the LLC and reading from the DRAM.
An interesting design decision in zSim is that when upper level cache issues a request to the lower level cache, the coherence
state of the block in the upper level cache is determined by the lower level cache controller. This design decision is made
to simplify the creation of “E” state, which requires information held by the lower level cache (i.e. the shared vector).
As a result, when upper level caches issue the request, it must also pass a pointer to lower level caches such that the
latter can assign the coherence state of the block when the request is handled. This pointer is stored in the state
field
of the MemReq
object.
Another invariant is that MemReq
object is only used for upper-to-lower requests, such as cache line write back, line
fetch, or coherence invalidation. For lower-to-upper requests, such as block invalidation, we never use MemReq
objects.
Instead, lower level caches directly call into upper level cache’s invalidate
method, potentially broadcasting the
invalidation request to several upper levels recursively.
We summarize the fields and their descriptions in the following table.
MemReq Field Name |
Description |
---|---|
lineAddr | The cache line address to be requested |
type | Coherence type of the request, can be one of the GETS, GETX, PUTS, PUTX |
state | Pointer to the requestor’s coherence state. Lower level caches should set this state to reflect the result of processing the request from the upper level |
cycle | Time when the request is issued to the component |
flags | Hints to lower level caches; Most of them are unimportant to the core functionality of the cache |
Note that this is not a complete list of all MemReq
fields. Some fields are dedicated to concurrency control and race
detection, which will not be covered in this article. To make things simple, we always assume that only a single thread
will access the cache hierarchy, and hence all states are stable. In practice, multiple threads may access the same
cache object from different directions (upper-to-lower for line fetch, and lower-to-upper for invalidation). zSim has an
ad-hoc locking protocol to ensure that concurrent cache accesses can always be serialized.
Cache System Architecture
In this section we discuss the abstract model of zSim’s cache system. The cache system is organized as a tree structure, with a single root representing the last-level cache (LLC), and intermediate nodes representing intermediate level caches (e.g. L2). At the leaf level, we have the processor and the attached private L1d and L1i cache. Note that zSim does not “visualize” the cache hierarchy in the usual way, in which processors are at the top, and the LLC is placed at the bottom. This may cause some naming problems since we used to call those that are closer to the leaf level “upper level caches”, and those that are close to the root level “lower level caches”. To avoid such confusion, in the discussion that follows, we use the term “child cache” to refer to caches that are closer to the leaf level which usually have smaller capacity and operate faster. We use the term “parent cache” to refer to caches that are closer to the root level which are larger but slower.
A cache object can be partitioned into multiple banks, each having different network latency from its child caches.
Although this seems to break the tree abstraction of the cache hierarchy, the partitioned cache can still be regarded as
a logical cache object without losing generality. The partitioned parent cache models the so called Non-Uniform Cache Access
(NUCA), which is typically applied to the LLC to increase parallelism and to avoid having a non-scalable monolithic storage.
Accesses from child caches are first mapped to one of the banks using a hash function, and then dispatched to the parent
partition (see MESIBottomCC::getParentId
). In zSim, each partition is treated as a separate cache object, which can be
accessed using the regular access()
interface. Latencies from the children to the parent partitions are stored in a network
latency configuration file, which is loaded at zSim initialization time. When a child cache accesses a partition of the
parent, the corresponding child-to-parent latency value is added to the total access latency in order to model NUCA (see
parentRTTs
in class MESIBottomCC
).
Although zSim supports both inclusive and non-inclusive caches (see nonInclusiveHack
flag in MESI controllers), in
the following discussion, we assume that caches are always inclusive. The implication of inclusive caches is that
when a block is invalidated or evicted in the parent level, all children caches that hold a copy of the same block must
also invalidate or write back (in case of a dirty line) the block to maintain inclusiveness. In addition, when a block
is loaded into a child cache by a request, the same block must also be loaded into all parent level caches as the
requested is passed down recursively. The author of zSim also suggested in a code comment that the non-inclusive path is
not fully tested, which may incur unexpected behavior.
Three types of events can occur at a cache object. The first type is access, which is issued by the processor or by a
child cache to fetch a line, write back a dirty line, or perform state degradation. Accesses are always issued to lower
level caches wrapped by the MemReq
object. The second type is invalidation, which in the current implementation
is always sent from a parent cache to inform child caches that certain addresses can no longer be cached due to a
conflicting access or an eviction. Note that invalidations do not use MemReq
objects, and instead they call the child
caches’ invalidate()
method with the type of the invalidation (INV
means full invalidation; INVX
means downgrade to
shared state). The third type is eviction, which naturally happens when a new block is brought into the cache, but the
current set is full. An existing block is evicted to make space for the new block, which also incurs invalidation
message sent to child caches if the evicted block is also cached by at least one child caches.
Each simulated cache object consists a tag array for storing address tags and eviction information, a replacement policy
object that is purely logical, a coherence controller object which maintains the coherence states and shared vectors of
each block, and access latencies for reading the tag array (accLat
) and invalidating a block (invLat
) respectively.
The following table lists all data members of class Cache
and a short description. In the following sections we will
discuss these cache components individually.
Cache Field Name |
Description |
---|---|
cc | Coherence controller; Implements the state machine and shared vector for every cached block |
array | Tag array; Stores the address tags of cached blocks |
rp | Implements the replacement policy via an abstract interface |
numLines | Total number of blocks (capacity) of the cache object |
accLat | Latency for accessing tha tag array, ignoring contention |
invLat | Latency for invalidating a block in the array, ignoring contention |
Tag Array
The tag array object is defined in file cache_array.h. Tag arrays are one-dimensional array of address tags that can be
accessed using a single index. Although some cache organization may divide the array into associative sets, the 1-D array
abstraction is still used for identifying a cache block. All tag arrays must inherit from the base class, CacheArray
,
which provides three methods: lookup()
, preinsert()
, and postinsert()
. The lookup method returns the index of the
block if its address is found in the tag array, or -1 if not found. Optionally, the lookup method will also change the
replacement metadata of the cache line, which is indicated by the updateReplacement
flag in the argument list. The preinsert()
method is called before inserting a new address tag into the tag array. This method will search the tag array for an empty
slot to store inserted tag. If an empty slot cannot be found, as should be the majority of the cases, an existing slot
will be made empty first by writing back the current block data, and then returning its index. The address to be written
back is returned to the caller via the last argument, wbLineAddr
, and the index of the selected slot is the returned
value. postinsert()
will actually store the new address tag into the target slot given both the address and the index
of the slot.
Note that zSim only guarantees that preinsert()
and postinsert()
will not be nested, i.e. the pending insertion must
complete before the next one could be performed. It is, however, possible that lookup()
be called between the two methods
as a result of write backs from child caches. One example is when a middle-level cache evicts a block that has been written
drity in child caches. After preinsert()
returns, the cache controller processes the eviction of the block, which
requires sending invalidations to child caches. On receiving the invalidation, the child cache holding a dirty copy of
the block will initiate a write back which directly sends the dirty block to the current cache, before the latter
initiates a PUTX
transaction to its parent cache. lookup()
will be called to find the slot for writing back the dirty
block during the process of the PUTX
request in the parent cache.
We next take a closer look at these three method functions. For simplicity, we assume set-associative tag storage, which is
implemented by class SetAssocArray
. On initialization, the number of blocks and sets are passed as construction argument.
The set size is computed by dividing the number of lines with the number of sets. The number of sets must be a power of
two to simplify tag address mapping. The set mask is also computed to map any integer from the hash function to the range
from zero to set size minus one.
A hash function is associated with the tag array to compute the index of the set given a block address. The hash function
is relatively unimportant for set-associative caches, since we just perform an identity mapping (i.e. do not change the
address) and let the set mask map the block address into the set number. For other types of tag arrays, such as Z array,
the hash function must be an non-trivial one, and can be assigned by specifyingarray.hash
in the configurtation file.
A replacement policy object is responsible for deciding which block should be evicted, if any, when a new block is brought into the set. As discussed in the previous section, the replacement policy object stores its own metadata, which can be optionally updated on every tag array access.
The tag array is declared as a one-dimensional pointer named array
. In order to access set X, the begin index
is computed as (ways
* X), and the end index is (ways
* (X + 1)). To perform an array lookup given a block address,
we first compute the set number by AND’ing the address with the set mask. Note that all block addresses have been right
shifted to remove lower bit zeros. Then for each address tag in the set, we compare whether it equals the given address.
If true, a hit is indicated and the index of the tag array entry is returned. Otherwise we return -1.
If indicated by the caller, we also inform the replacement policy that the address has been accessed on a hit. The replacement
policy object may then promote the hit block according to its own policy (e.g. moving to the end of the LRU stack).
In preinsert()
, the tag array either finds an empty slot, or more likely, finds a slot to be evicted. This is exactly
where the replacement policy comes into play. The preinsert()
method first initializes a SetAssocCands
object, which
is just a pair of indices indicating the begin and end index of the set in which replacement happens, and then passes
this object to the replacement policy’s rankCands()
method. The rankCands()
method returns the index of the selected
block, which is then returned together with the address tag. Note that preinsert()
has no idea whether a block is
invalid or dirty when it makes decision on eviction. The coherence controller, on the other hand, knows the exact state
of the selected block, and will enforce correct behavior. As a result, the replacement policy will query the state of the
block when it evaluates the block for eviction to avoid evicting invalid blocks.
The logic of postinsert()
is simple. The replacement policy is notified that the selected block has been invalidated,
and that a new block is inserted. The metadata for the block will be updated to reflect the replacement. The new address
is also written into the array.
Replacement Policy
The replacement policy is implemented as a ranking function. The policy can be specified using the repl.type
key in the
configuration file. In this section we only discuss LRU, which is implemented by class LRUReplPolicy
. zSim implements
LRU using timestamps. A single global timestamp is incremented for every access to the tag array. Each block also has a
local timestamp which stores the value of the global timestamp when it is accessed. A larger local timestamp means the
block is closer to the MRU position.
The ranking function, rankCands()
(also rank()
), iterates over the SetAssocCands
object, and for each block in the set,
it computes a score based on the LRU counter. If the LRU policy is sharer-aware, the score will be affected by: (1) the
validity of the block; (2) the number of sharers; and (3) the local timestamp value. The higher the score is, the less
favorable it is to evict the block. The replacement policy selects the block with the smallest score as the candidate
for LRU eviction.
Simulating Cache Coherence
Coherence Overview
zSim simulates MESI coherence protocol using an implementation of directory-based MESI state machine across the memory hierarchy. zSim does not model the full set features of the protocol, since only stable states are simulated. zSim also does not model the on-chip network traffic incurred by coherence activities. Latencies of the network is assigned statically, and they will not change based on network utilization.
Each cache object has a coherence controller, which maintains the coherence states of all blocks currently residing in the cache. Since zSim caches are inclusive, the coherence directory is implemented as in-cache sharer lists, one for each cached block. The number of bits in the sharer list per block equals the number of children caches. A “1” bit in the list indicates that the corresponding child cache may have a cached copy on the same address, dirty or clean. The sharer list is queried when invalidations are sent to the children caches, and is updated when a new block is fetched by a child cache.
The coherence controller is further divided into two logical parts: A “top controller” maintains the directory and sends invalidation to children caches, and a “bottom controller” maintains the state for cached blocks and handles requests from child caches. These two logical parts are largely independent from each other, and are implemented as separate modules. The following table lists the coherence controller module name and the responsibility of the module.
Coherence Module Name | Description |
---|---|
CC | Virtual base class of the controller; Specifies interface of coherence controllers |
MESIBottomCC | Bottom coherence controller; Maintains block states and handles child requests |
MESITopCC | Top coherence controller; Maintains directory states and handles invalidations |
MESICC | Includes both MESIBottomCC and MESITopCC to implement the coherence controller for non-leaf level caches |
MESITerminalCC | Coherence controller for leaf level caches (e.g. L1d, L1i). Only includes MESIBottomCC, since leaf level caches do not have directory states to maintain |
We also list the members of class MESIBottomCC
and class MESITopCC
to aid our discussion of coherence actions in the
following sections.
MESIBottomCC Field Name |
Description |
---|---|
array | The coherence states of cached blocks; States are one of the M , E , S , I as in the standard MESI protocol |
parents | A list of parent cache partitions. A hash function maps the block address into parent ID when the parent cache is to be accessed. All parent cache partitions collectively maintains state for the entire address space |
parentRTTs | A list of network latencies to parent partitions; This models NUCA |
numLines | Number of blocks in the cache. Also number of coherence states |
MESITopCC Field Name |
Description |
---|---|
array | The sharer vector of cached blocks. Each entry in the array is a bit vector in which one bit is reserved for each child cache. A boolean flag also indicates whether the block is cached by children caches in exclusive states (used for silent upgrade). |
children | A list of children cache objects. Children caches are assumed to be not partitioned, and each child cache maintains state of its own |
childrenRTTs | A list of network latencies to children caches; This can model L1i and L1d |
numLines | Number of blocks in the cache. Also number of directory entries |
One unfortunate fact with zSim source code is that methods of class MESICC
and class MESITerminalCC
have the same name
as methods of class MESIBottomCC
and class MESITopCC
, which adds significant difficulties navigating the source files.
The rule of thumb is that class MESICC
and class MESITerminalCC
methods are all defined in coherence_ctrls.h, while
most of the class MESIBottomCC
and class MESITopCC
methods are defined in the cpp file.
We next describe coherence actions one by one.
Invalidation
Invalidation can be initiated at any level of the cache by calling the invalidate()
method. In fact, even the coherence
protocol calls this method to invalidate blocks in child caches. The semantics of invalidate()
method guarantees that
the block to be invalidated will not be cached on the level it is called as well as all children levels. In this section
we show how invalidate()
interacts with cache coherence.
The invalidate()
method first performs a cache lookup using the lookup()
method of the tag array (not updating LRU states).
If the block is found in the tag array, the address and the index of the block is passed to the coherence controller’s
method processInv
. Note that invalidate()
handles both downgrades (INVX
) and true invalidations (INV
). The type
of invalidation is specified using the type
parameter. When downgrade is requested, the current level
on which invalidate()
is called and levels below are assumed to hold a block in M
or E
state.
In a non-terminal coherence controller, processInv()
simply calls processInval()
on tcc
and then calls the method
of the same name on bcc
. The completion cycle, however, is the cycle when tcc
finishes broadcasting. This reflects an
important assumption made by zSim: broadcasting is on the critical path, while transfer of dirty data and local state
changes are not.
In tcc
’s processInval()
, sendInvalidates()
is called to broadcast the invalidation request to child caches that
have a “1” bit in the sharer list. To elaborate: This function walks the sharer list of the block, and for each potential
sharer, it calls the cache object’s invalidate()
method recursively (recall that we are now in the initial invalidate()
’s
call chain). The type of invalidation and the boolean flag indicating dirty write back are passed unmodified. zSim assumes
that all invalidations are signaled to child caches at the same time. The completion cycle of a single invalidation is
computed as the response cycle from the child cache plus the network latency. Since all requests are signaled in parallel,
the final completion cycle is the maximum of all child cache invalidations. After all children caches have responded,
the controller changes the directory entry of the current block based on the invalidation type. For full invalidation,
the directory entry is cleared, since the block no longer exists in the cache. For downgrades, the directory entry’s
exclusive flag is cleared, but we keep sharer list bit vector intact.
After tcc
’s processInval()
method returns, bcc
’s processInval()
method is invoked to handle local state changes.
For full invalidations, we always set the coherence state to I
, and set write back flag if the state is currently M
to signal a dirty write back to the caller. Note that since at most one cache in the entire hierarchy can have a dirty
copy, the dirty write back flag will be set exactly once during the invalidation process. For downgrades, we simply change
the current M
or E
states (other states are illegal) to S
, and set the write back flag if the current state is M
.
No actual write back takes place during invalidation. The caller of cache object’s invalidate()
method should handle dirty
write back by starting a PUTX
transaction on parent caches or to other memory objects (e.g. DRAM, NVM).
Eviction
Cache line eviction is triggered naturally as new blocks are loaded into the cache when the set is full. No external
interface is available for the cache object to initiate an eviction. Instead, the cache controller calls processEviction()
on the coherence controller when the tag array’s preinsert()
returns a valid block number, indicating that a block
should be evicted. Coherence controller’s processEviction()
calls the method of the same name (unfortunate coincidence)
on tcc
and bcc
respectively, in this order, and returns the bcc
completion time as the eviction completion time.
Note that by returning bcc
’s completion time, zSim assumes that tcc
’s broadcasting and bcc
’s write back are serialized,
such that the latter can only proceed after the former returns. This design decision is reasonable, as dirty write back
needs to see the dirty line first, which is transferred in a side channel from one of the child caches to the coherence
controller.
The processEviction()
method of tcc
simply wraps sendInvalidates()
method, which signals children caches of the
cache line invalidation operation. The invalidation type is set to INV
, indicating that blocks must be fully invalidated.
The dirty write back flag is passed to a local variable of coherence controller’s processEviction()
, which is then passed
to bcc
’s processEviction()
to actually perform the write back.
After sending the invalidation, bcc
’s processEviction()
method changes the local state and conducts the write back.
It first checks the dirty write back flag which is set by tcc
’s invalidation process. If the flag is set, meaning a
child cache has a dirty block on the invalidated address, the bcc
first changes the local state from E
or M
to M
(other from states are illegal). Note that local E
state implies that the child cache first reads the line using GETS,
acquiring the line in E
state, and then does a silent transition from E
to M
. Local M
state implies that the
child cache originally acquired the line using GETX
, which sets all caches holding the block to M
state along the way
the block is passed to it. Also note that zSim assumes that children caches will pass the dirty block to the current controller
using a side channel, which is not on the critical path of broadcasting, instead of using regular MemReq
. This assumption
is justified by the fact that writing data back level-by-level is unnecessary, since these written back copies between
the bottommost dirty block owner and the current cache object will be invalidated anyway.
The type of the write back depends on the state of the block after handling dirty write backs from children caches.
If the block state is E
or S
, meaning no dirty data needs to be written, the coherence controller creates a
PUTS
MemReq
object, and calls parent cache access()
method to handle the write back synchronously. If, on the
other hand, the block state is M
, a PUTX
MemReq
object is created and fed to the parent cache’s access()
method.
I
state blocks will be simply ignored, since they neither have any sharer nor require any form of write back.
In all cases, the completion cycle of the parent access()
method will be returned to the caller as the completion
cycle of the eviction operation. As discussed in early sections, the current block state will be set by the parent
cache when the request is handled by access()
method. After parent cache access()
returns, the block state
should be I
, indicating the block is no longer cached in any part of the hierarchy down below the current level.
One design decision is whether to send PUTS
requests to parent caches when the block is clean. In general, sending clean
write backs help parent cache to manage their sharer lists by removing the sharer eagerly and making the list precise.
Imprecise sharer lists do not affect correctness, but will incur extra coherence messages to caches that do not actually
hold the block. zSim decouples the creation and processing of PUTS
requests. PUTS
requests are always sent whenever
possible, but parent caches may just ignore them. Since zSim does not model network utilization, and assumes that all
invalidations are sent in parallel, not cleaning the sharer list eagerly will not affect simulation result (despite that,
zSim still clears the sharer list on clean write backs).
GETS/GETX Access
We discuss access()
method in two sections. In this section, we present how GETS
and GETX
are handled. In the next
section, we present PUTS
and PUTX
.
The cache object’s access()
method begins by performing a lookup in the tag array. If the address is not cached,
and the request is a GETS
or GETX
, then we need to first evict an existing block (preinsert()
and processEviction
),
and then load the intended block from parent level caches by calling coherence controller’s processAccess()
. If the parent
level cache does not contain the block, this process may be recursively repeated until reaching a parent level cache that
holds the block with sufficient permission, or finally hit the DRAM (or other types of main memory). If the address is
cached, we still need to call processAccess()
to update its coherence state, since a cache hit may also change the
coherence state of the block (e.g. if a GETX
request hits a E
state line, the state will transfer to M
silently).
The invariant is that no matter a block is evicted or hit, processAccess()
always sees a valid cache line number,
which is the slot for loading the new block or changing existing coherence states (recall that after a block is evicted,
its coherence state in the former holder will be I
).
In the terminal level coherence controller, processAccess()
only calls bcc
’s method of the same name. For GETS
requests,
we check the current coherence state. If it is E
, S
or M
, then the controller already has sufficient permission for
accessing the block, and no coherence action will take place. If the block is in I
state, the block needs to be brought
into the cache from a parent level cache. To this end, the controller creates a GETS
MemReq
object, and calls parent
cache access()
method recursively. The coherence state of the current block will be set by the parent level access()
method after it returns, which should be either S
(already peer caches holding the block) or E
(when it is the only
holder of the block in parent’s sharer list, and the parent itself also has E
or M
permission).
For a GETX
request, if the current block state is E
, then the block silently transfers to M
state without notifying
the parent cache. The fact that a dirty block is held by the current cache will be available to the parent when an invalidation
forces the M
state block to be written back. If, however, the current state is I
or S
, then just like the case for
GETS
, the controller creates a GETX
MemReq
object, and feeds it to parent cache’s access()
method in order to
invalidate all shared copies held by its peer caches (and peers of its parent, etc.). The final state of the block will
be set by the parent instance of access()
, which should be M
.
The handling of GETS
and GETX
are more complicated on non-terminal caches. There are two invariants in non-terminal
coherence controllers. The first invariant is that a controller can grant permission to its children caches only if
it holds the same or higher permission. For example, a controller holding an S
state line should not grant M
permission
to the child without acquiring M
permission with its parent first. The second invariant is that non-terminal cache
controllers always send data it has fetched to children caches (except for prefetching, which we do not discuss).
On receiving a request from a child cache, a non-terminal coherence controller first attempts to acquire equal or higher
permission requested by the child cache if it does not hold the block in that permission. This translates to calling
bcc
’s processAccess()
, which has exactly the same effect as in a terminal controller. After processAccess()
returns,
the current cache should have sufficient permission to grant the child cache’s request, which is performed by calling
tcc
’s processAccess()
. This function takes a boolean flag indicating whether the current state of the block is exclusive
(i.e. M
or E
) as one of the arguments (recall that tcc
has no access to the current state of the block, but only the
state of the child block). The function also takes argument on whether a dirty write back is incurred as a result
of permission downgrade in one of its children caches.
At a high level, processAccess()
will set the child block state and its own directory state based on the type of the
request and the state of the block in the current cache. We elaborate the concrete handling case-by-case. If the request
is GETX
, then we check the sharer list to see whether the block is currently shared by multiple children caches.
If true, sendInvalidates()
is called to revoke all shared cache lines in children caches. Note that when tcc
begins
processing the request, the current cache block state must already be exclusive. This does not contradict the fact
that one or more of its children caches may have a shared or exclusive copy of the block. If the invalidation results
in a dirty write back, the flag will be set, and the write back will be processed by the coherence controller. After
invalidation, we set the requestor as the exclusive owner of the cache line by clearing all bits in the sharer list except
the request’s.
If the request is GETS
, the current state of the block after bcc
’s processAccess()
can be any of the valid state.
In MESI protocol, if the current cache holds the block in an exclusive state, and there is no other sharer of the block,
tcc
could grant E
permission to the requestor, since it is certain that the current cache holds a globally unique copy
of the line, and could grant write permission to one of its children. If, on the other hand, that the current state is exclusive,
but a different child cache owns the block in exclusive state, then invalidation is still needed to downgrade the current
owner of the line to shared state (using INVX
), before S
permission could be granted to the requestor. If the current
state is shared, meaning that the block is not exclusively owned by the current cache, but also shared by other caches,
the requestor simply gets S
state without invalidation. In all three cases, the requestor is marked in the sharer list.
If the downgrade or full invalidation incurred by GETS
or GETX
results in a dirty write back, it will be passed to the
coherence controller, and handled in processWritebackOnAccess()
. Recall that zSim assumes the dirty block is transferred
via a side channel, which is not on the critical path of the invalidation process, this does not add to the latency of
the operation, but only simply changes the state of the current block from M
or E
to M
(other from states are illegal).
PUTS/PUTX Access
PUTS
and PUTX
are exclusively used for write backs, which can only occur at non-terminal levels in the hierarchy.
These two requests are also first handled by bcc
and then tcc
. bcc
ignores PUTS
, since they are clean data
and will not change the block state. For PUTX
, the block state will transit from M
or E
to M
(other from states
are illgeal). tcc
, on the other hand, does not ignore PUTS
, but clears the bit of the child cache from the sharer list.
PUTX
is handled in exact the same way as PUTS
, if PUTX_KEEPEXCL
flag is not set in the request object (this flag
demands that dirty data be written back, but keep the child block in E
state). In both cases, the child cache state
is set to I
, indicating that the block has been written back, and is no longer cached by the child level.
Beyond Last Level Cache
When a GETX
/GETS
request reaches the last level cache (LCC), but still cannot find the cached block, or when the
LLC evicts a dirty block, where does the request go? In zSim, all memory objects are instances of the base class
MemObject
, including cache and the main memory. During initialization, the LLC is connected to the main memory
by setting its parents
list pointing to a main memory object, which also features the elegant access()
interface
(note that main memory objects do not have invalidate()
, since this is added in class BaseCache
). As a result,
MemReq
going beyond the LLC will be sent to main memory objects, which is then simulated just like caches, probably
with a significantly different timing.