Understanding Ariel Core Simulation in SST
Ariel Core
Ariel Tunnel
class ArielTunnel (in file ariel_shmem.h) implements the shared memory layout and parent-child
synchronization between processes.
class ArielTunnel inherits from class TunnelDef, with struct ArielSharedData as the shared metadata between
processes, and struct ArielCommand as the message type.
struct ArielCommand is the unit of message exchange between the instruction source and the Ariel core
simulator. It is a multi-purpose struct consisting of several unions. It can encode either instructions that
are being executed in the simulated binary, or special memory operations, such as malloc()/free() or DMA
transfers, which convey high-level memory layout information.
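As a rough illustration, such a tagged-union message might be laid out as in the sketch below. The field
and enum names here are hypothetical stand-ins, not the actual definitions in ariel_shmem.h.

```cpp
#include <cstdint>

// Illustrative sketch of a tagged-union message in the spirit of
// struct ArielCommand; names are hypothetical.
enum ArielCmdType : uint32_t {
    CMD_PERFORM_READ,
    CMD_PERFORM_WRITE,
    CMD_MALLOC,
    CMD_FREE,
    CMD_NOOP,
};

struct ArielCommandSketch {
    ArielCmdType command;   // discriminates which union member is valid
    uint64_t instPtr;       // instruction pointer of the originating instruction
    union {
        struct { uint64_t addr; uint32_t size; } inst;   // ordinary load/store
        struct { uint64_t vaddr; uint64_t len;  } alloc; // malloc()/mmap() info
        struct { uint64_t vaddr; } dealloc;              // free()/munmap() info
    };
};
```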
class ArielTunnel behaves similarly to the base class, except that it defines a few more data members in
struct ArielSharedData. Simulation time and cycles are tracked by simTime and cycles.
numCores is simply the number of buffers in the shared region. child_attached tracks the number of child
processes (running non-master tunnel objects) that have registered themselves with the master tunnel object.
If the Ariel tunnel is initialized in a child process, the process registers itself with the master by
incrementing the child_attached counter in struct ArielSharedData.
The current implementation does not use an atomic instruction for the increment, which can potentially
break under data contention if multiple child processes increment the counter at the same time. It is,
however, safe in practice, since the Ariel core only expects one child process.
Function waitForChild() is called on the SST side after spawning child processes. This function contains a
single loop that repeatedly checks the child_attached variable in the shared region. It exits the loop when
this variable is no longer zero, meaning that the (supposedly single) child has registered.
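The handshake can be pictured with the following minimal sketch. The struct layout and helper names are
illustrative; the real objects live in a mmap'ed region shared between the SST process and the child, not
in ordinary process memory.

```cpp
#include <cstdint>
#include <thread>
#include <chrono>

// Hypothetical stand-in for struct ArielSharedData.
struct SharedDataSketch {
    volatile uint32_t child_attached; // incremented once per attaching child
    uint64_t simTime;
    uint64_t cycles;
    uint32_t numCores;
};

// Child side: register with the master by bumping the counter. As noted
// above, the real code performs a plain (non-atomic) increment.
void attachChild(SharedDataSketch* shared) {
    shared->child_attached = shared->child_attached + 1;
}

// Master (SST) side: spin until at least one child has registered,
// mirroring waitForChild().
void waitForChildSketch(SharedDataSketch* shared) {
    while (shared->child_attached == 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```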
Ariel Frontend
The Ariel frontend defined in class ArielFrontend is a generic interface through which the Ariel core
obtains a tunnel object as its instruction source. The class itself is abstract and does not implement any
concrete function. Any instruction source can be implemented as a derived class of class ArielFrontend, and
be incorporated seamlessly into the Ariel core.
Specifically, the SST package provides an implementation, class Pin3Frontend, that works with Pin 3, a
binary instrumentation tool that dynamically rewrites the application image in main memory such that an
instruction trace can be obtained by invoking callbacks after every instruction.
The instrumentation part (called a “pintool”) is implemented as a separate module using the Pin library, and
compiled into a dynamic library.
The program to be instrumented must be started as a separate process by invoking the pin binary with the
pintool and the program’s binary. The pin binary sets up the execution environment for the pintool, loads
the application binary, and then executes it with the instrumentation defined in the pintool.
class Pin3Frontend implements this process by calling fork() to create a child process, and then execvp()
to load the pin binary with command line arguments, including the path to the pintool, the path to the
application binary, and arguments for running the application.
class Pin3Frontend also sets up the shared memory region, and passes the region name to the fork’ed child
process running Pin.
At the other side of the IPC channel, the pintool also contains a non-master mode Ariel tunnel object
(class ArielTunnel in non-master mode) and a shared memory child object (class MMAPChild_Pin3).
Instructions intercepted by the instrumentation are sent over the tunnel object as message objects
(class ArielCommand), which can be extracted on the simulator side and then fed into the Ariel core
simulator.
class Pin3Frontend’s constructor takes the parameter object that stores various invocation parameters for
the pin binary, the number of cores, and the size of the queue objects in the tunnel.
The pintool path and the application binary path are specified with arieltool and executable, respectively.
If the pintool path is not specified, it defaults to fesimple.so, which is compiled from fesimple.cc.
The path of the Pin binary is specified with launcher, which defaults to the macro value PINTOOL_EXECUTABLE.
We do not cover the rest of the parameters here; readers can find the list of parameter options and their
definitions in file pin3frontend.h.
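The fallback behavior can be pictured with the sketch below. The lookup helper is a hypothetical stand-in
for SST's parameter accessor; only the option names and defaults follow the description above (the real
launcher default comes from the PINTOOL_EXECUTABLE macro, shown here as a plain string).

```cpp
#include <map>
#include <string>
#include <iostream>

// Hypothetical stand-in for the parameter lookup done in the
// Pin3Frontend constructor.
std::string findParam(const std::map<std::string, std::string>& params,
                      const std::string& key, const std::string& dflt) {
    auto it = params.find(key);
    return it == params.end() ? dflt : it->second;
}

int main() {
    std::map<std::string, std::string> params = {
        {"executable", "/path/to/app"},
        // "arieltool" omitted: falls back to the default pintool
    };
    std::string app      = findParam(params, "executable", "");
    std::string pintool  = findParam(params, "arieltool", "fesimple.so");
    std::string launcher = findParam(params, "launcher", "PINTOOL_EXECUTABLE");
    std::cout << launcher << " -t " << pintool << " -- " << app << "\n";
}
```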
The constructor also creates the shared memory region and initializes its layout by constructing a
class MMAPParent object as data member tunnelmgr, and obtains the tunnel object (of type class ArielTunnel)
as data member tunnel.
The key to map the shared region in another process is stored in local variable shmem_region_name, which is
passed to the pintool on the command line.
When init() is called during the timed initialization stage, forkPINChild() is called to bootstrap Pin.
This function first calls fork() to create a new child process, and then calls execvp()/execve() (depending
on whether environment variables are needed) to load the pin binary with the path to the pintool and the
application binary. Other arguments, such as the shared region’s name, are also passed via the second
argument of execvp()/execve().
The newly spawned process then starts, finishes the initialization of the non-master tunnel object in the
child process, and registers itself on the shared memory region.
The SST process waits for the child process to register by calling waitForChild() on the tunnel object, and
then the initialization concludes.
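A minimal sketch of this fork()/execvp() bootstrap follows. The paths and the argument layout (in
particular the flag carrying the shared-region name) are placeholders; the real argument vector is
assembled from the frontend's parameters.

```cpp
#include <unistd.h>
#include <sys/types.h>
#include <cstdio>
#include <cstdlib>

// Launch pin with a pintool and an application, as forkPINChild() does.
pid_t launchPin(const char* pinBinary, const char* pintool,
                const char* app, const char* shmemName) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        // Child: replace this process image with pin, which loads the
        // pintool and then runs the instrumented application binary.
        char* const argv[] = {
            const_cast<char*>(pinBinary),
            const_cast<char*>("-t"), const_cast<char*>(pintool),
            const_cast<char*>("-p"), const_cast<char*>(shmemName),
            const_cast<char*>("--"), const_cast<char*>(app),
            nullptr
        };
        execvp(pinBinary, argv);
        perror("execvp"); // only reached if exec failed
        _exit(1);
    }
    return pid; // parent (SST) continues and calls waitForChild()
}
```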
Ariel Pintool
The pintool that works with the Ariel core is implemented in file fesimple.cc, and compiles into a dynamic
library, fesimple.so.
The pintool implements the instrumentation that is essential to obtain instruction traces from the
potentially multi-threaded application, as well as to capture certain operations, such as malloc()/free()
and the stack trace.
The pintool runs a class ArielTunnel object in non-master mode, which is referred to by global variable
tunnel, and a shared memory IPC endpoint of type class MMAPChild_Pin3, which is referred to by global
variable tunnelmgr.
The name of the shared region is passed as a command line argument from the SST process via
execvp()/execve(), and is received by Pin’s KNOB object SSTNamedPipe.
Other parameters to the pintool are also passed via the command line and received by KNOB objects, which we
do not cover here. Readers can find a fairly detailed description of these parameters in file fesimple.cc.
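For reference, receiving a command line option through a KNOB looks roughly like the following skeleton,
assuming the Pin 3 kit's pin.H; the flag letter, default, and description string here are illustrative
rather than copied from fesimple.cc.

```cpp
#include "pin.H"   // from the Pin 3 kit; build with the kit's makefiles
#include <string>

// -p <name> on the pintool's command line lands in this KNOB.
KNOB<std::string> SSTNamedPipe(KNOB_MODE_WRITEONCE, "pintool",
                               "p", "", "Name of the shared memory region");

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;   // parses the pintool's options
    std::string shmemName = SSTNamedPipe.Value();
    // ... attach the non-master ArielTunnel using shmemName ...
    PIN_StartProgram();                   // never returns
    return 0;
}
```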
The main instrumentation function for instructions is InstrumentInstruction, which checks the instruction
type and dispatches the instruction to different functions for processing.
Instructions are decoded into Ariel primitives (note that this is not the same as x86 uop decoding), which
are then encoded into struct ArielCommand objects. These objects are sent to the SST process by calling
writeMessage() on the tunnel object. Multi-threaded applications write messages into the queue indexed by
Pin’s internal thread ID. If the queue is full, writeMessage() blocks, which also blocks the execution of
the simulated application (since it is called in Pin’s “analysis routine”).
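Put together with the earlier ArielCommandSketch, the shape of such an analysis routine is roughly as
follows. TunnelSketch and the callback signature are hypothetical stand-ins for the real ArielTunnel and
Pin callback types; only the blocking-on-full behavior follows the description above.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the non-master ArielTunnel endpoint.
struct TunnelSketch {
    std::deque<ArielCommandSketch> queue; // real code: per-thread ring buffer
    void writeMessage(uint32_t /*thread*/, const ArielCommandSketch& ac) {
        // The real writeMessage() blocks when the queue is full, which
        // stalls the instrumented application inside the analysis routine.
        queue.push_back(ac);
    }
};

TunnelSketch* tunnel; // mirrors the pintool's global tunnel pointer

// Pin would invoke a routine like this before every memory read.
void onMemoryRead(uint32_t thread, uint64_t ip, uint64_t addr, uint32_t size) {
    ArielCommandSketch ac{};
    ac.command = CMD_PERFORM_READ;
    ac.instPtr = ip;
    ac.inst = {addr, size};
    tunnel->writeMessage(thread, ac);
}
```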
Ariel Core (w/o CUDA GPU)
Ariel core (class ArielCore) implements a simple core timing model. It models a four-issue superscalar
execution unit, with a single cycle delay for non-memory instructions.
Memory operations are forwarded to the cache and memory hierarchy for detailed timing simulation, and are
asynchronous to the execution of non-memory operations. The core also models a limited capacity for the
pending transactions queue tracking outstanding memory operations. If the pending transactions queue is
full, then no more memory operations can be issued, and all four lanes stall until one of the pending
memory transactions completes. Fence instructions are modelled by stalling instruction processing until all
memory operations in the pending transaction queue have been drained.
Instruction Supply
The Ariel core does not have any instruction decoding and supply frontend; it relies on the tunnel object
(data member tunnel) to feed instructions into the timing model. The core pulls instructions supplied by
the frontend by calling readMessageNB() on the tunnel object, and then adds these instructions, in the
order they are received, as event objects of type class ArielEvent into the core’s private event queue,
coreQ.
The tunnel object is initialized in the containing class, class ArielCPU, and passed to the Ariel core as
one of the constructor arguments. One of the options is to use Pin as the instruction provider, but the
Ariel core also works perfectly with other forms of frontend.
Core States
An Ariel core can be in one of three states after initialization: executing, stalled, or halted.
The executing state is the state where instructions are allowed to be dispatched (though it is still
possible that no instruction is dispatched due to a resource hazard on the pending transaction queue); for
each cycle being simulated, the core will attempt to dispatch instructions until the limit maxIssuePerCycle
is reached.
The stalled state occurs when an explicit fence instruction appears in the instruction stream, which blocks
further issuing of all instructions until all pending memory operations have been drained.
This mimics the draining effect of actual implementations of certain memory fences on commercial
processors.
In both the executing and stalled states, the core clock currentCycles keeps ticking, since the core is
still logically active.
The halted state is entered when the core has reached the maximum number of instructions to be simulated
(data member max_insts), and it indicates the end of the entire simulation.
In the implementation, this flag is only set for the first core in a multi-core simulation, for some reason.
The containing class, class ArielCPU, checks this flag for all cores after every tick; if any one of them
is set, the simulation terminates.
Main Loop
The main simulation loop is implemented in tick(), which is registered as the clock callback function in
its containing class, class ArielCPU. The frequency of the CPU clock is specified in the configuration as
clock.
In the main event loop, if the simulator is neither halted nor stalled, it will execute up to
maxIssuePerCycle instructions. For each iteration, it calls processNextEvent() to extract an event object
of type class ArielEvent, and simulates the event as one instruction.
The function returns a boolean flag didProcess indicating whether the instruction was successfully
processed, or not, due to a resource hazard (e.g., the pending transaction queue is full).
In the latter case, the pipeline is immediately stalled, and no instruction can be processed beyond the
blocking instruction. Note that in this case, the stalled flag is not set, since this is not a
software-induced stall.
The core nevertheless stops processing instructions, and will retry the same instruction that blocked the
pipeline in future cycles, until it unblocks (didProcess is true).
No matter whether the core processes any instructions, or the core is stalled, as long as the core is not
halted, the currentCycles counter is incremented after the iteration, indicating that a logical cycle has
passed.
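The control flow of the main loop can be condensed into the following sketch; member names follow the
description above, while the stub processNextEvent() stands in for the real event processing.

```cpp
#include <cstdint>

struct CoreLoopSketch {
    bool isHalted = false;
    bool isStalled = false;
    uint32_t maxIssuePerCycle = 4;
    uint64_t currentCycles = 0;

    // Stub: the real function pops coreQ and returns false on a
    // resource hazard (didProcess == false).
    bool processNextEvent() { return true; }

    bool tick() {
        if (isHalted) return true; // tells ArielCPU to end the simulation
        if (!isStalled) {
            for (uint32_t i = 0; i < maxIssuePerCycle; ++i) {
                if (!processNextEvent())
                    break; // hazard: retry the same event next cycle
                if (isHalted || isStalled)
                    break; // CORE_EXIT or a fence was processed
            }
        }
        ++currentCycles; // a logical cycle passes even when stalled
        return isHalted;
    }
};
```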
I noted a possible bug in this mechanism, which would occur in processNextEvent() if the call to
refillQueue() returns false. This happens if readMessageNB(), which is called in refillQueue(), returns
false, indicating that either the circular queue that provides instructions is empty, or the queue is
currently locked for writing.
This is merely an artifact on the host created by IPC and process scheduling, and should not incur any
observable state change in the simulation. Unfortunately, in the current implementation of the Ariel core,
queue contention and temporary starvation will cause the processor to stall for no architectural reason,
which introduces non-predictable simulation errors.
A better implementation would use blocking reads on the tunnel object, which guarantee that a
frontend-generated message is read before returning, and hence never stall the processor because of the
queue.
In function processNextEvent(), the core’s instruction queue, coreQ (which is not timing related), is first
checked. If the queue is empty, refillQueue() is called to pull more frontend-generated messages from the
frontend, and translate these messages into class ArielEvent objects, before inserting them into coreQ.
The refillQueue() function simply calls readMessageNB() on the tunnel object with the core’s ID (in case it
is a multithreaded simulation), and reads the message into a local class ArielCommand object.
Messages of command type ARIEL_START_INSTRUCTION and ARIEL_END_INSTRUCTION delimit the memory operations of
a multi-operand instruction, and they are essentially just for statistics purposes (e.g., to keep track of
the actual number of instructions, not memory operations, that have been simulated).
The actual memory operation is encoded in the ac.command field for messages between instruction start and
end messages. If the instruction is a read (ARIEL_PERFORM_READ) or a write (ARIEL_PERFORM_WRITE), then a
corresponding event is inserted into coreQ by calling createReadEvent() or createWriteEvent(),
respectively.
These two functions simply create and enqueue class ArielEvent objects with the address to be read/written
and the length of the memory operation.
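Reusing the hypothetical types from the earlier sketches, the translation step inside refillQueue() has
roughly the following shape; the real code obtains the message via readMessageNB() and handles many more
command types.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-in for class ArielEvent.
enum EvType { EV_READ, EV_WRITE, EV_NOOP };
struct EventSketch { EvType type; uint64_t addr; uint32_t size; };

// One frontend message becomes one queued event, as in refillQueue();
// ArielCommandSketch and the CMD_* constants come from the earlier sketch.
void enqueueCommand(const ArielCommandSketch& ac,
                    std::queue<EventSketch>& coreQ) {
    switch (ac.command) {
        case CMD_PERFORM_READ:
            coreQ.push({EV_READ, ac.inst.addr, ac.inst.size});
            break;
        case CMD_PERFORM_WRITE:
            coreQ.push({EV_WRITE, ac.inst.addr, ac.inst.size});
            break;
        case CMD_NOOP:
            coreQ.push({EV_NOOP, 0, 0});
            break;
        default:
            break; // start/end markers, fences, flushes, mallocs, ...
    }
}
```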
Other types of instructions, such as no-ops (ARIEL_NOOP), flushes (ARIEL_FLUSHLINE_INSTRUCTION), and fences
(ARIEL_FENCE_INSTRUCTION), are also processed accordingly, resulting in an event object being inserted into
coreQ and eventually being executed.
High-level semantic operations that do not correspond to any instruction, such as mmap(), malloc()/free(),
pool switching, and simulation exit, are also inserted into coreQ as ArielEvent objects.
The Ariel core can use these operations to fine-tune the simulation, such as generating the correct
physical address for address translation.
Note that, despite lacking an instruction class for representing ALU instructions and branches, these
instructions are still captured and simulated as no-ops. The Pin frontend treats any non-read, non-write
instruction as a no-op, and inserts it into the circular buffer by calling WriteNoOp() in file fesimple.cc.
These instructions still consume dispatch bandwidth in the core simulation; it is just that the detailed
architectural state transitions are ignored.
Function processNextEvent() then pops the first event object from coreQ, and simulates the operation.
No-op events are not further processed, as their sole purpose is to consume dispatch bandwidth for
simulating ALU and branching instructions. Reads and writes are handled by calling handleReadRequest() and
handleWriteRequest() respectively, if the pending transaction queue’s size pending_transaction_count is
still below the maximum capacity maxPendingTransactions. Otherwise, the memory operation is not processed
due to a resource hazard on the pending transaction queue, and the event object is not taken off coreQ,
meaning that it will be reattempted in the next cycle. On a resource hazard, the local variable removeEvent
is also set to false, meaning that the core’s main loop (described above) will be blocked until the event
is able to proceed.
Cache address flushes and fences are handled by calling handleFlushEvent() and handleFenceEvent(),
respectively.
Flushes are also entered into the pending transaction queue, and hence are prone to resource hazards just
like reads and writes. Fences are always instantly executed, as we discuss below.
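The capacity check that gates reads, writes, and flushes can be sketched as follows, with the counter names
following the description above.

```cpp
#include <cstdint>

// Sketch of the hazard check in processNextEvent(); on failure the event
// stays at the head of coreQ and removeEvent/didProcess become false.
bool tryIssueMemoryOp(uint32_t& pending_transaction_count,
                      uint32_t maxPendingTransactions) {
    if (pending_transaction_count >= maxPendingTransactions)
        return false; // resource hazard: retry in a later cycle
    // handleReadRequest()/handleWriteRequest()/handleFlushEvent() would
    // construct the memory request and register it here.
    ++pending_transaction_count;
    return true;
}
```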
High-level events are processed according to their semantics. A CORE_EXIT event immediately terminates the
simulation by setting the isHalted flag, which is checked at the end of every loop. Memory operations such
as mmap(), malloc(), and free() are handled by handler functions that track the virtual-to-physical mapping
on a best-effort basis.
Reads and Writes
Memory reads and writes are simulated by calling handleReadRequest() and handleWriteRequest(),
respectively. These two functions are largely symmetric, so we only cover read handling; write handling is
almost identical except that the command issued to the memory hierarchy is a write rather than a read.
In handleReadRequest(), the function first translates the virtual address to a physical address using its
own embedded TLB (not the timing TLB, which is simulated separately) by calling translateAddress() on data
member memmgr.
It then checks whether the operation spans two blocks (multi-block reads that require more than two blocks
are not supported), or accesses just a single block. In the former case, the two blocks are requested
separately by first computing each cache block address and read size, and then calling commitReadEvent()
twice to simulate two cache accesses.
In the latter case, only one base and length is computed, and commitReadEvent() is called once.
In commitReadEvent(), an event object of type class SimpleMem::Request with the info of the access is
constructed, and sent to the memory hierarchy for timing simulation via the link cacheLink. The link object
connects the Ariel core to the memory hierarchy, and is initialized in the containing class, class
ArielCPU, on a per-core basis.
Note that the memory event object is handled in parallel with core execution, and will be sent back via the
same link from the memory hierarchy at a future cycle. To track outstanding memory operations, the newly
constructed event object is also registered with the pending transaction queue, pendingTransactions, which
is a mapping structure that maps the ID of the outstanding request to the request object.
The number of pending transactions is also updated by incrementing pending_transaction_count.
Note that the Ariel core only implements the minimum logic for handling memory operations.
More complicated tasks that are essential on a real core, such as deduplication of requests to the same
address and read-write forwarding, are not implemented.
Memory responses are sent back by the memory hierarchy via the same link object when they complete. The
receiving end of the link is registered with callback function handleEvent(), which is invoked with the
response event object.
The function first extracts the request event ID stored in the response object, and then looks up the
request object in the mapping structure pendingTransactions.
The memory operation then retires by: (1) removing the request object from the mapping structure;
(2) freeing a slot in the pending transaction queue by decrementing pending_transaction_count; and (3) if
the number of pending transactions drops to zero and the core is stalled, unstalling the core, because the
pending transactions have been drained.
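A sketch of this retirement path follows; the map mirrors pendingTransactions, and all other names follow
the description above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the outstanding request object.
struct RequestSketch { uint64_t addr; uint64_t len; };

void retireResponse(uint64_t responseID,
                    std::map<uint64_t, RequestSketch>& pendingTransactions,
                    uint32_t& pending_transaction_count,
                    bool& isStalled, bool& isFenced) {
    auto it = pendingTransactions.find(responseID);
    if (it == pendingTransactions.end()) return; // unexpected response
    pendingTransactions.erase(it);               // (1) drop the request
    --pending_transaction_count;                 // (2) free a queue slot
    if (pending_transaction_count == 0 && isStalled) {
        isStalled = false;                       // (3) drain complete:
        isFenced  = false;                       //     unfence the core
    }
}
```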
Fences and Cache Line Flushes
Ariel core models fences as a draining instruction: it stalls the core from executing until all pending
transactions have been drained, hence enforcing a strong ordering between memory operations before and
after the fence instruction.
Note that Ariel core fences are not the simulator’s implementation of the regular x86 memory fence family.
Instead, an Ariel fence must be invoked explicitly as the function ariel_fence() in the source code. The
Pin instrumentation routines will recognize the function invocation, and replace it with a fence message
that is inserted into the circular queue when executed.
Similarly, cache line flushes are also not an implementation of x86 clflush/clflushopt/clwb. Users must
explicitly call ariel_flushline() in the source code with the pointer value to be flushed (the value is
computed at runtime and passed to the function).
Fences are handled by calling fence(), which sets isFenced and isStalled to true if the number of pending
transactions is greater than zero.
As discussed earlier, the fence will eventually commit when the last outstanding memory operation
completes, at which point unfence() is called to set both isFenced and isStalled to false.
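In sketch form, with the same illustrative member names as before:

```cpp
// fence(): stall only if there is something to drain; otherwise the
// fence completes immediately.
void fenceSketch(uint32_t pending_transaction_count,
                 bool& isFenced, bool& isStalled) {
    if (pending_transaction_count > 0) {
        isFenced  = true;
        isStalled = true;
    }
}

// unfence(): called when the last outstanding operation retires.
void unfenceSketch(bool& isFenced, bool& isStalled) {
    isFenced  = false;
    isStalled = false;
}
```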
Flush events are handled by handleFlushEvent() in a way similar to memory operations, except that a flush
always affects a single cache block. After obtaining the physical address of the flush, commitFlushEvent()
is called to create a memory event object of type SimpleMem::Request, with the command being
SimpleMem::Request::FlushLineInv.
The event is sent to the memory hierarchy after being added to the pending transaction queue and
incrementing the pending transaction counter.
Memory Manager
Ariel cores also maintain per-core memory managers that mimic the OS’s physical page allocator. Due to the
lack of information on how physical pages are actually allocated and mapped to virtual addresses, the
memory manager only works in a best-effort manner, i.e., relying on intercepted memory allocation calls,
such as mmap() or malloc(), to infer virtual addresses that have been mapped by the OS, and then emulating
a physical page allocator on its own.
The physical addresses generated by the memory managers may not be consistent with what the simulated
system actually uses, but this does not pose a problem, since the simulator only needs some physical
addresses for memory hierarchy simulation, and does not care what those addresses represent.
The memory manager interface is defined in class ArielMemoryManager, which itself cannot be instantiated,
but it defines fallback functions that print error messages if not implemented in the derived class.
If Ariel core users desire a working memory manager, one of the derived classes should be selected using
option memmgr in the configuration file.
The core interface function of a memory manager is translateAddress(), which, given a virtual address,
returns the physical address for memory accesses.
As mentioned earlier, the memory manager maintains its own free page pool, and emulates free page
allocation as well as virtual-to-physical mapping assignment, which would normally be performed during OS
page faults.
An extra two pairs of interfaces allow memory allocation information on virtual addresses to be passed to
the memory manager: allocateMalloc()/freeMalloc() for malloc-based virtual address tracking, and
allocateMMAP()/freeMMAP() for mmap-based tracking. These functions are not implemented in the base class
and, if called without being implemented by the derived class, will report errors.
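To make the best-effort translation concrete, here is a toy allocator in the spirit of the derived classes;
the class name, page size, and eager first-touch mapping are all illustrative, not the actual policy of any
particular ArielMemoryManager subclass.

```cpp
#include <cstdint>
#include <unordered_map>
#include <deque>
#include <new>

class MemoryManagerSketch {
    static constexpr uint64_t kPageSize = 4096;
    std::deque<uint64_t> freePages;                    // physical page pool
    std::unordered_map<uint64_t, uint64_t> pageTable;  // vpage -> ppage
public:
    explicit MemoryManagerSketch(uint64_t numPages) {
        for (uint64_t p = 0; p < numPages; ++p)
            freePages.push_back(p * kPageSize);
    }
    // Mirrors translateAddress(): map the page on first touch (the role
    // the OS page-fault handler would play), then translate.
    uint64_t translateAddress(uint64_t vaddr) {
        uint64_t vpage = vaddr & ~(kPageSize - 1);
        auto it = pageTable.find(vpage);
        if (it == pageTable.end()) {
            if (freePages.empty())
                throw std::bad_alloc(); // out of simulated physical memory
            uint64_t ppage = freePages.front();
            freePages.pop_front();
            it = pageTable.emplace(vpage, ppage).first;
        }
        return it->second + (vaddr & (kPageSize - 1));
    }
};
```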