Understanding Ariel Core Simulation in SST
Ariel Core
Ariel Tunnel
class ArielTunnel (in file ariel_shmem.h) implements the shared memory layout and parent-child
synchronization between processes.
class ArielTunnel inherits from class TunnelDef, with struct ArielSharedData as the shared metadata between
processes, and struct ArielCommand as the message type.
struct ArielCommand is the unit of message exchange between the instruction source and the Ariel core
simulator. It is a multi-purpose struct consisting of several unions. It can encode either instructions that
are being executed in the simulated binary, or special memory operations, such as malloc()/free() or DMA
transfers, which convey high-level memory layout information.
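As a rough illustration, such a tagged-union message might be laid out as in the sketch below. The field
and enum names here are hypothetical stand-ins, not the actual definitions in ariel_shmem.h.

```cpp
#include <cstdint>

// Illustrative sketch of a tagged-union message in the spirit of
// struct ArielCommand; names are hypothetical.
enum ArielCmdType : uint32_t {
    CMD_PERFORM_READ,
    CMD_PERFORM_WRITE,
    CMD_MALLOC,
    CMD_FREE,
    CMD_NOOP,
};

struct ArielCommandSketch {
    ArielCmdType command;   // discriminates which union member is valid
    uint64_t instPtr;       // instruction pointer of the originating instruction
    union {
        struct { uint64_t addr; uint32_t size; } inst;   // ordinary load/store
        struct { uint64_t vaddr; uint64_t len;  } alloc; // malloc()/mmap() info
        struct { uint64_t vaddr; } dealloc;              // free()/munmap() info
    };
};
```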
class ArielTunnel behaves similarly to the base class, except that it defines a few more data members in
struct ArielSharedData. Simulation time and cycles are tracked by simTime and cycles.
numCores is simply the number of buffers in the shared region. child_attached tracks the number of child
processes (running non-master tunnel objects) that have registered themselves with the master tunnel object.
If the Ariel tunnel is initialized in a child process, the process registers itself with the master by
incrementing the child_attached counter in struct ArielSharedData.
The current implementation does not use an atomic instruction for the increment, which can potentially
break under data contention if multiple child processes increment the counter at the same time. It is,
however, safe in practice, since the Ariel core only expects one child process.
Function waitForChild() is called on the SST side after spawning child processes. This function contains a
single loop that repeatedly checks the child_attached variable in the shared region. It exits the loop when
this variable is no longer zero, meaning that the (supposedly single) child has registered.
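The handshake can be pictured with the following minimal sketch. The struct layout and helper names are
illustrative; the real objects live in a mmap'ed region shared between the SST process and the child, not
in ordinary process memory.

```cpp
#include <cstdint>
#include <thread>
#include <chrono>

// Hypothetical stand-in for struct ArielSharedData.
struct SharedDataSketch {
    volatile uint32_t child_attached; // incremented once per attaching child
    uint64_t simTime;
    uint64_t cycles;
    uint32_t numCores;
};

// Child side: register with the master by bumping the counter. As noted
// above, the real code performs a plain (non-atomic) increment.
void attachChild(SharedDataSketch* shared) {
    shared->child_attached = shared->child_attached + 1;
}

// Master (SST) side: spin until at least one child has registered,
// mirroring waitForChild().
void waitForChildSketch(SharedDataSketch* shared) {
    while (shared->child_attached == 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```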
Ariel Frontend
The Ariel frontend defined in class ArielFrontend is a generic interface through which the Ariel core
obtains a tunnel object as its instruction source. The class itself is abstract and does not implement any
concrete function. Any instruction source can be implemented as a derived class of class ArielFrontend, and
be incorporated seamlessly into the Ariel core.
Specifically, the SST package provides an implementation, class Pin3Frontend, that works with Pin 3, a
binary instrumentation tool that dynamically rewrites the application image in main memory such that an
instruction trace can be obtained by invoking callbacks after every instruction.
The instrumentation part (called a “pintool”) is implemented as a separate module using the Pin library, and
compiled into a dynamic library.
The program to be instrumented must be started as a separate process by invoking the pin binary with the
pintool and the program’s binary. The pin binary sets up the execution environment for the pintool, loads
the application binary, and then executes it with the instrumentation defined in the pintool.
class Pin3Frontend implements this process by calling fork() to create a child process, and then execvp()
to load the pin binary with command line arguments, including the path to the pintool, the path to the
application binary, and arguments for running the application.
class Pin3Frontend also sets up the shared memory region, and passes the region name to the fork’ed child
process running Pin.
At the other side of the IPC channel, the pintool also contains a non-master mode Ariel tunnel object
(class ArielTunnel in non-master mode) and a shared memory child object (class MMAPChild_Pin3).
Instructions intercepted by the instrumentation are sent over the tunnel object as message objects
(class ArielCommand), which can be extracted on the simulator side and then fed into the Ariel core
simulator.
class Pin3Frontend’s constructor takes the parameter object that stores various invocation parameters for
the pin binary, the number of cores, and the size of the queue objects in the tunnel.
The pintool path and the application binary path are specified with arieltool and executable, respectively.
If the pintool path is not specified, it defaults to fesimple.so, which is compiled from fesimple.cc.
The path of the Pin binary is specified with launcher, which defaults to the macro value PINTOOL_EXECUTABLE.
We do not cover the rest of the parameters here; readers can find the list of parameter options and their
definitions in file pin3frontend.h.
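The fallback behavior can be pictured with the sketch below. The lookup helper is a hypothetical stand-in
for SST's parameter accessor; only the option names and defaults follow the description above (the real
launcher default comes from the PINTOOL_EXECUTABLE macro, shown here as a plain string).

```cpp
#include <map>
#include <string>
#include <iostream>

// Hypothetical stand-in for the parameter lookup done in the
// Pin3Frontend constructor.
std::string findParam(const std::map<std::string, std::string>& params,
                      const std::string& key, const std::string& dflt) {
    auto it = params.find(key);
    return it == params.end() ? dflt : it->second;
}

int main() {
    std::map<std::string, std::string> params = {
        {"executable", "/path/to/app"},
        // "arieltool" omitted: falls back to the default pintool
    };
    std::string app      = findParam(params, "executable", "");
    std::string pintool  = findParam(params, "arieltool", "fesimple.so");
    std::string launcher = findParam(params, "launcher", "PINTOOL_EXECUTABLE");
    std::cout << launcher << " -t " << pintool << " -- " << app << "\n";
}
```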
The constructor also creates the shared memory region and initializes its layout by constructing a
class MMAPParent object as data member tunnelmgr, and obtains the tunnel object (of type class ArielTunnel)
as data member tunnel.
The key to map the shared region in another process is stored in local variable shmem_region_name, which is
passed to the pintool on the command line.
When init() is called during the timed initialization stage, forkPINChild() is called to bootstrap Pin.
This function first calls fork() to create a new child process, and then calls execvp()/execve() (depending
on whether environment variables are needed) to load the pin binary with the path to the pintool and the
application binary. Other arguments, such as the shared region’s name, are also passed via the second
argument of execvp()/execve().
The newly spawned process then starts, finishes the initialization of the non-master tunnel object in the
child process, and registers itself on the shared memory region.
The SST process waits for the child process to register by calling waitForChild() on the tunnel object, and
then the initialization concludes.
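A minimal sketch of this fork()/execvp() bootstrap follows. The paths and the argument layout (in
particular the flag carrying the shared-region name) are placeholders; the real argument vector is
assembled from the frontend's parameters.

```cpp
#include <unistd.h>
#include <sys/types.h>
#include <cstdio>
#include <cstdlib>

// Launch pin with a pintool and an application, as forkPINChild() does.
pid_t launchPin(const char* pinBinary, const char* pintool,
                const char* app, const char* shmemName) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        // Child: replace this process image with pin, which loads the
        // pintool and then runs the instrumented application binary.
        char* const argv[] = {
            const_cast<char*>(pinBinary),
            const_cast<char*>("-t"), const_cast<char*>(pintool),
            const_cast<char*>("-p"), const_cast<char*>(shmemName),
            const_cast<char*>("--"), const_cast<char*>(app),
            nullptr
        };
        execvp(pinBinary, argv);
        perror("execvp"); // only reached if exec failed
        _exit(1);
    }
    return pid; // parent (SST) continues and calls waitForChild()
}
```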
Ariel Pintool
The pintool that works with the Ariel core is implemented in file fesimple.cc, and compiles into a dynamic
library, fesimple.so.
The pintool implements the instrumentation that is essential to obtain instruction traces from the
potentially multi-threaded application, as well as to capture certain operations, such as malloc()/free()
and the stack trace.
The pintool runs a class ArielTunnel object in non-master mode, which is referred to by global variable
tunnel, and a shared memory IPC endpoint of type class MMAPChild_Pin3, which is referred to by global
variable tunnelmgr.
The name of the shared region is passed as a command line argument from the SST process via
execvp()/execve(), and is received by Pin’s KNOB object SSTNamedPipe.
Other parameters to the pintool are also passed via the command line and received by KNOB objects, which we
do not cover here. Readers can find a fairly detailed description of these parameters in file fesimple.cc.
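For reference, receiving a command line option through a KNOB looks roughly like the following skeleton,
assuming the Pin 3 kit's pin.H; the flag letter, default, and description string here are illustrative
rather than copied from fesimple.cc.

```cpp
#include "pin.H"   // from the Pin 3 kit; build with the kit's makefiles
#include <string>

// -p <name> on the pintool's command line lands in this KNOB.
KNOB<std::string> SSTNamedPipe(KNOB_MODE_WRITEONCE, "pintool",
                               "p", "", "Name of the shared memory region");

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;   // parses the pintool's options
    std::string shmemName = SSTNamedPipe.Value();
    // ... attach the non-master ArielTunnel using shmemName ...
    PIN_StartProgram();                   // never returns
    return 0;
}
```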
The main instrumentation function for instructions is InstrumentInstruction, which checks the instruction
type and dispatches the instruction to different functions for processing.
Instructions are decoded into Ariel primitives (note that this is not the same as x86 uop decoding), which
are then encoded into struct ArielCommand objects. These objects are sent to the SST process by calling
writeMessage() on the tunnel object. Multi-threaded applications write messages into the queue indexed by
Pin’s internal thread ID. If the queue is full, writeMessage() blocks, which also blocks the execution of
the simulated application (since it is called in Pin’s “analysis routine”).
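Put together with the earlier ArielCommandSketch, the shape of such an analysis routine is roughly as
follows. TunnelSketch and the callback signature are hypothetical stand-ins for the real ArielTunnel and
Pin callback types; only the blocking-on-full behavior follows the description above.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the non-master ArielTunnel endpoint.
struct TunnelSketch {
    std::deque<ArielCommandSketch> queue; // real code: per-thread ring buffer
    void writeMessage(uint32_t /*thread*/, const ArielCommandSketch& ac) {
        // The real writeMessage() blocks when the queue is full, which
        // stalls the instrumented application inside the analysis routine.
        queue.push_back(ac);
    }
};

TunnelSketch* tunnel; // mirrors the pintool's global tunnel pointer

// Pin would invoke a routine like this before every memory read.
void onMemoryRead(uint32_t thread, uint64_t ip, uint64_t addr, uint32_t size) {
    ArielCommandSketch ac{};
    ac.command = CMD_PERFORM_READ;
    ac.instPtr = ip;
    ac.inst = {addr, size};
    tunnel->writeMessage(thread, ac);
}
```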
Ariel Core (w/o CUDA GPU)
Ariel core (class ArielCore) implements a simple core timing model. It models a four-issue superscalar
execution unit, with a single cycle delay for non-memory instructions.
Memory operations are forwarded to the cache and memory hierarchy for detailed timing simulation, and are
asynchronous to the execution of non-memory operations. The core also models a limited capacity for the
pending transactions queue tracking outstanding memory operations. If the pending transactions queue is
full, then no more memory operations can be issued, and all four lanes stall until one of the pending
memory transactions completes. Fence instructions are modelled by stalling instruction processing until all
memory operations in the pending transaction queue have been drained.
Instruction Supply
The Ariel core does not have any instruction decoding and supply frontend; it relies on the tunnel object
(data member tunnel) to feed instructions into the timing model. The core pulls instructions supplied by
the frontend by calling readMessageNB() on the tunnel object, and then adds these instructions, in the
order they are received, as event objects of type class ArielEvent into the core’s private event queue,
coreQ.
The tunnel object is initialized in the containing class, class ArielCPU, and passed to the Ariel core as
one of the constructor arguments. One of the options is to use Pin as the instruction provider, but the
Ariel core also works perfectly with other forms of frontend.
Core States
An Ariel core can be in one of three states after initialization: executing, stalled, or halted.
The executing state is the state where instructions are allowed to be dispatched (though it is still
possible that no instruction is dispatched due to a resource hazard on the pending transaction queue); for
each cycle being simulated, the core will attempt to dispatch instructions until the limit maxIssuePerCycle
is reached.
The stalled state occurs when an explicit fence instruction appears in the instruction stream, which blocks
further issuing of all instructions until all pending memory operations have been drained.
This mimics the draining effect of actual implementations of certain memory fences on commercial
processors.
In both the executing and stalled states, the core clock currentCycles keeps ticking, since the core is
still logically active.
The halted state is entered when the core has reached the maximum number of instructions to be simulated
(data member max_insts), and it indicates the end of the entire simulation.
In the implementation, this flag is only set for the first core in a multi-core simulation, for some reason.
The containing class, class ArielCPU, checks this flag for all cores after every tick; if any one of them
is set, the simulation terminates.
Main Loop
The main simulation loop is implemented in tick(), which is registered as the clock callback function in
its containing class, class ArielCPU. The frequency of the CPU clock is specified in the configuration as
clock.
In the main event loop, if the simulator is neither halted nor stalled, it will execute up to
maxIssuePerCycle instructions. For each iteration, it calls processNextEvent() to extract an event object
of type class ArielEvent, and simulates the event as one instruction.
The function returns a boolean flag didProcess indicating whether the instruction was successfully
processed, or not, due to a resource hazard (e.g., the pending transaction queue is full).
In the latter case, the pipeline is immediately stalled, and no instruction can be processed beyond the
blocking instruction. Note that in this case, the stalled flag is not set, since this is not a
software-induced stall.
The core nevertheless stops processing instructions, and will retry the same instruction that blocked the
pipeline in future cycles, until it unblocks (didProcess is true).
No matter whether the core processes any instructions, or the core is stalled, as long as the core is not
halted, the currentCycles counter is incremented after the iteration, indicating that a logical cycle has
passed.
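The control flow of the main loop can be condensed into the following sketch; member names follow the
description above, while the stub processNextEvent() stands in for the real event processing.

```cpp
#include <cstdint>

struct CoreLoopSketch {
    bool isHalted = false;
    bool isStalled = false;
    uint32_t maxIssuePerCycle = 4;
    uint64_t currentCycles = 0;

    // Stub: the real function pops coreQ and returns false on a
    // resource hazard (didProcess == false).
    bool processNextEvent() { return true; }

    bool tick() {
        if (isHalted) return true; // tells ArielCPU to end the simulation
        if (!isStalled) {
            for (uint32_t i = 0; i < maxIssuePerCycle; ++i) {
                if (!processNextEvent())
                    break; // hazard: retry the same event next cycle
                if (isHalted || isStalled)
                    break; // CORE_EXIT or a fence was processed
            }
        }
        ++currentCycles; // a logical cycle passes even when stalled
        return isHalted;
    }
};
```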
I noted a possible bug in this mechanism, which would occur in processNextEvent() if the call to
refillQueue() returns false. This happens if readMessageNB(), which is called in refillQueue(), returns
false, indicating that either the circular queue that provides instructions is empty, or the queue is
currently locked for writing.
This is merely an artifact on the host created by IPC and process scheduling, and should not incur any
observable state change in the simulation. Unfortunately, in the current implementation of the Ariel core,
queue contention and temporary starvation will cause the processor to stall for no architectural reason,
which introduces non-predictable simulation errors.
A better implementation would use blocking reads on the tunnel object, which guarantee that a
frontend-generated message is read before returning, and hence never stall the processor because of the
queue.
In function processNextEvent(), the core’s instruction queue, coreQ (which is not timing related), is first
checked. If the queue is empty, refillQueue() is called to pull more frontend-generated messages from the
frontend, and translate these messages into class ArielEvent objects, before inserting them into coreQ.
The refillQueue() function simply calls readMessageNB() on the tunnel object with the core’s ID (in case it
is a multithreaded simulation), and reads the message into a local class ArielCommand object.
Messages of command type ARIEL_START_INSTRUCTION and ARIEL_END_INSTRUCTION delimit the memory operations of
a multi-operand instruction, and they are essentially just for statistics purposes (e.g., to keep track of
the actual number of instructions, not memory operations, that have been simulated).
The actual memory operation is encoded in the ac.command field for messages between instruction start and
end messages. If the instruction is a read (ARIEL_PERFORM_READ) or a write (ARIEL_PERFORM_WRITE), then a
corresponding event is inserted into coreQ by calling createReadEvent() or createWriteEvent(),
respectively.
These two functions simply create and enqueue class ArielEvent objects with the address to be read/written
and the length of the memory operation.
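Reusing the hypothetical types from the earlier sketches, the translation step inside refillQueue() has
roughly the following shape; the real code obtains the message via readMessageNB() and handles many more
command types.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-in for class ArielEvent.
enum EvType { EV_READ, EV_WRITE, EV_NOOP };
struct EventSketch { EvType type; uint64_t addr; uint32_t size; };

// One frontend message becomes one queued event, as in refillQueue();
// ArielCommandSketch and the CMD_* constants come from the earlier sketch.
void enqueueCommand(const ArielCommandSketch& ac,
                    std::queue<EventSketch>& coreQ) {
    switch (ac.command) {
        case CMD_PERFORM_READ:
            coreQ.push({EV_READ, ac.inst.addr, ac.inst.size});
            break;
        case CMD_PERFORM_WRITE:
            coreQ.push({EV_WRITE, ac.inst.addr, ac.inst.size});
            break;
        case CMD_NOOP:
            coreQ.push({EV_NOOP, 0, 0});
            break;
        default:
            break; // start/end markers, fences, flushes, mallocs, ...
    }
}
```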
Other types of instructions, such as no-ops (ARIEL_NOOP), flushes (ARIEL_FLUSHLINE_INSTRUCTION), and fences
(ARIEL_FENCE_INSTRUCTION), are also processed accordingly, resulting in an event object being inserted into
coreQ and eventually being executed.
High-level semantic operations that do not correspond to any instruction, such as mmap(), malloc()/free(),
pool switching, and simulation exit, are also inserted into coreQ as ArielEvent objects.
The Ariel core can use these operations to fine-tune the simulation, such as generating the correct
physical address for address translation.
Note that, despite lacking an instruction class for representing ALU instructions and branches, these
instructions are still captured and simulated as no-ops. The Pin frontend treats any non-read, non-write
instruction as a no-op, and inserts it into the circular buffer by calling WriteNoOp() in file fesimple.cc.
These instructions still consume dispatch bandwidth in the core simulation; it is just that the detailed
architectural state transitions are ignored.
Function processNextEvent() then pops the first event object from coreQ, and simulates the operation.
No-op events are not further processed, as their sole purpose is to consume dispatch bandwidth for
simulating ALU and branching instructions. Reads and writes are handled by calling handleReadRequest() and
handleWriteRequest() respectively, if the pending transaction queue’s size pending_transaction_count is
still below the maximum capacity maxPendingTransactions. Otherwise, the memory operation is not processed
due to a resource hazard on the pending transaction queue, and the event object is not taken off coreQ,
meaning that it will be reattempted in the next cycle. On a resource hazard, the local variable removeEvent
is also set to false, meaning that the core’s main loop (described above) will be blocked until the event
is able to proceed.
Cache address flushes and fences are handled by calling handleFlushEvent() and handleFenceEvent(),
respectively.
Flushes are also entered into the pending transaction queue, and hence are prone to resource hazards just
like reads and writes. Fences are always instantly executed, as we discuss below.
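The capacity check that gates reads, writes, and flushes can be sketched as follows, with the counter names
following the description above.

```cpp
#include <cstdint>

// Sketch of the hazard check in processNextEvent(); on failure the event
// stays at the head of coreQ and removeEvent/didProcess become false.
bool tryIssueMemoryOp(uint32_t& pending_transaction_count,
                      uint32_t maxPendingTransactions) {
    if (pending_transaction_count >= maxPendingTransactions)
        return false; // resource hazard: retry in a later cycle
    // handleReadRequest()/handleWriteRequest()/handleFlushEvent() would
    // construct the memory request and register it here.
    ++pending_transaction_count;
    return true;
}
```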
High-level events are processed according to their semantics. A CORE_EXIT event immediately terminates the
simulation by setting the isHalted flag, which is checked at the end of every loop. Memory operations such
as mmap(), malloc(), and free() are handled by handler functions that track the virtual-to-physical mapping
on a best-effort basis.
Reads and Writes
Memory reads and writes are simulated by calling handleReadRequest() and handleWriteRequest(),
respectively. These two functions are largely symmetric, so we only cover read handling; write handling is
almost identical except that the command issued to the memory hierarchy is a write rather than a read.
In handleReadRequest(), the function first translates the virtual address to a physical address using its
own embedded TLB (not the timing TLB, which is simulated separately) by calling translateAddress() on data
member memmgr.
It then checks whether the operation spans two blocks (multi-block reads that require more than two blocks
are not supported), or accesses just a single block. In the former case, the two blocks are requested
separately by first computing each cache block address and read size, and then calling commitReadEvent()
twice to simulate two cache accesses.
In the latter case, only one base and length is computed, and commitReadEvent() is called once.
In commitReadEvent(), an event object of type class SimpleMem::Request with the info of the access is
constructed, and sent to the memory hierarchy for timing simulation via the link cacheLink. The link object
connects the Ariel core to the memory hierarchy, and is initialized in the containing class, class
ArielCPU, on a per-core basis.
Note that the memory event object is handled in parallel with core execution, and will be sent back via the
same link from the memory hierarchy at a future cycle. To track outstanding memory operations, the newly
constructed event object is also registered with the pending transaction queue, pendingTransactions, which
is a mapping structure that maps the ID of the outstanding request to the request object.
The number of pending transactions is also updated by incrementing pending_transaction_count.
Note that the Ariel core only implements the minimum logic for handling memory operations.
More complicated tasks that are essential on a real core, such as deduplication of requests to the same
address and read-write forwarding, are not implemented.
Memory responses are sent back by the memory hierarchy via the same link object when they complete. The
receiving end of the link is registered with callback function handleEvent(), which is invoked with the
response event object.
The function first extracts the request event ID stored in the response object, and then looks up the
request object in the mapping structure pendingTransactions.
The memory operation then retires by: (1) removing the request object from the mapping structure;
(2) freeing a slot in the pending transaction queue by decrementing pending_transaction_count; and (3) if
the number of pending transactions drops to zero and the core is stalled, unstalling the core, because the
pending transactions have been drained.
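A sketch of this retirement path follows; the map mirrors pendingTransactions, and all other names follow
the description above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the outstanding request object.
struct RequestSketch { uint64_t addr; uint64_t len; };

void retireResponse(uint64_t responseID,
                    std::map<uint64_t, RequestSketch>& pendingTransactions,
                    uint32_t& pending_transaction_count,
                    bool& isStalled, bool& isFenced) {
    auto it = pendingTransactions.find(responseID);
    if (it == pendingTransactions.end()) return; // unexpected response
    pendingTransactions.erase(it);               // (1) drop the request
    --pending_transaction_count;                 // (2) free a queue slot
    if (pending_transaction_count == 0 && isStalled) {
        isStalled = false;                       // (3) drain complete:
        isFenced  = false;                       //     unfence the core
    }
}
```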
Fences and Cache Line Flushes
Ariel core models fences as a draining instruction: it stalls the core from executing until all pending
transactions have been drained, hence enforcing a strong ordering between memory operations before and
after the fence instruction.
Note that Ariel core fences are not the simulator’s implementation of the regular x86 memory fence family.
Instead, an Ariel fence must be invoked explicitly as the function ariel_fence() in the source code. The
Pin instrumentation routines will recognize the function invocation, and replace it with a fence message
that is inserted into the circular queue when executed.
Similarly, cache line flushes are also not an implementation of x86 clflush/clflushopt/clwb. Users must
explicitly call ariel_flushline() in the source code with the pointer value to be flushed (the value is
computed at runtime and passed to the function).
Fences are handled by calling fence(), which sets isFenced and isStalled to true if the number of pending
transactions is greater than zero.
As discussed earlier, the fence will eventually commit when the last outstanding memory operation
completes, at which point unfence() is called to set both isFenced and isStalled to false.
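In sketch form, with the same illustrative member names as before:

```cpp
// fence(): stall only if there is something to drain; otherwise the
// fence completes immediately.
void fenceSketch(uint32_t pending_transaction_count,
                 bool& isFenced, bool& isStalled) {
    if (pending_transaction_count > 0) {
        isFenced  = true;
        isStalled = true;
    }
}

// unfence(): called when the last outstanding operation retires.
void unfenceSketch(bool& isFenced, bool& isStalled) {
    isFenced  = false;
    isStalled = false;
}
```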
Flush events are handled by handleFlushEvent() in a way similar to memory operations, except that a flush
always affects a single cache block. After obtaining the physical address of the flush, commitFlushEvent()
is called to create a memory event object of type SimpleMem::Request, with the command being
SimpleMem::Request::FlushLineInv.
The event is sent to the memory hierarchy after being added to the pending transaction queue and
incrementing the pending transaction counter.
Memory Manager
Ariel cores also maintain per-core memory managers that mimic the OS’s physical page allocator. Due to the
lack of information on how physical pages are actually allocated and mapped to virtual addresses, the
memory manager only works in a best-effort manner, i.e., relying on intercepted memory allocation calls,
such as mmap() or malloc(), to infer virtual addresses that have been mapped by the OS, and then emulating
a physical page allocator on its own.
The physical addresses generated by the memory managers may not be consistent with what the simulated
system actually uses, but this does not pose a problem, since the simulator only needs some physical
addresses for memory hierarchy simulation, and does not care what those addresses represent.
The memory manager interface is defined in class ArielMemoryManager, which itself cannot be instantiated,
but it defines fallback functions that print error messages if not implemented in the derived class.
If Ariel core users desire a working memory manager, one of the derived classes should be selected using
option memmgr in the configuration file.
The core interface function of a memory manager is translateAddress(), which, given a virtual address,
returns the physical address for memory accesses.
As mentioned earlier, the memory manager maintains its own free page pool, and emulates free page
allocation as well as virtual-to-physical mapping assignment, which would normally be performed during OS
page faults.
An extra two pairs of interfaces allow memory allocation information on virtual addresses to be passed to
the memory manager: allocateMalloc()/freeMalloc() for malloc-based virtual address tracking, and
allocateMMAP()/freeMMAP() for mmap-based tracking. These functions are not implemented in the base class
and, if called without being implemented by the derived class, will report errors.
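To make the best-effort translation concrete, here is a toy allocator in the spirit of the derived classes;
the class name, page size, and eager first-touch mapping are all illustrative, not the actual policy of any
particular ArielMemoryManager subclass.

```cpp
#include <cstdint>
#include <unordered_map>
#include <deque>
#include <new>

class MemoryManagerSketch {
    static constexpr uint64_t kPageSize = 4096;
    std::deque<uint64_t> freePages;                    // physical page pool
    std::unordered_map<uint64_t, uint64_t> pageTable;  // vpage -> ppage
public:
    explicit MemoryManagerSketch(uint64_t numPages) {
        for (uint64_t p = 0; p < numPages; ++p)
            freePages.push_back(p * kPageSize);
    }
    // Mirrors translateAddress(): map the page on first touch (the role
    // the OS page-fault handler would play), then translate.
    uint64_t translateAddress(uint64_t vaddr) {
        uint64_t vpage = vaddr & ~(kPageSize - 1);
        auto it = pageTable.find(vpage);
        if (it == pageTable.end()) {
            if (freePages.empty())
                throw std::bad_alloc(); // out of simulated physical memory
            uint64_t ppage = freePages.front();
            freePages.pop_front();
            it = pageTable.emplace(vpage, ppage).first;
        }
        return it->second + (vaddr & (kPageSize - 1));
    }
};
```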