Understanding Simulation Infrastructure in SST
The Simulation Infrastructure
Simulation Object
The simulator’s global data structure is defined in simulation.h as the base class class Simulation.
The derived class that is actually instantiated is defined in simulation_impl.h as class Simulation_impl,
and the implementation is in simulation.cpp.
The simulator maintains a per-thread singleton simulation object, which can be accessed by calling
Simulation_impl::getSimulation().
The simulation body is defined in Simulation::run(), which consists of a simple loop that fetches the next
activity from the event queue and then executes it by calling execute() on the activity object.
This is just the standard implementation of Discrete Event Simulation (DES).
The simulation is driven forward by two mechanisms. First, events handled at time T can generate new events
at time T + t. Second, components can register periodic clocks that fire at a regular interval, calling the
component’s handler function. The handler function may in turn generate events that need to be handled in the
future.
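The following self-contained sketch illustrates the fetch-and-execute loop described above. The types and the
time-only ordering are simplified stand-ins for SST’s Activity and TimeVortex, not the actual implementation.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>

// Simplified stand-ins for SST's Activity / TimeVortex pair, only to
// illustrate the fetch-and-execute loop of Simulation::run().
using SimTime_t = uint64_t;

struct Activity {
    SimTime_t deliveryTime = 0;
    virtual void execute() = 0;
    virtual ~Activity() = default;
};

struct Earliest {   // invert the comparison so the earliest activity pops first
    bool operator()(const Activity* a, const Activity* b) const {
        return a->deliveryTime > b->deliveryTime;
    }
};

struct PrintActivity : Activity {
    void execute() override {
        std::printf("fired at cycle %llu\n", (unsigned long long)deliveryTime);
    }
};

int main() {
    std::priority_queue<Activity*, std::vector<Activity*>, Earliest> timeVortex;
    PrintActivity a, b;
    a.deliveryTime = 10;
    b.deliveryTime = 5;
    timeVortex.push(&a);
    timeVortex.push(&b);

    // The DES loop: pop the earliest activity and execute it; in SST,
    // executing an activity may push new activities with later delivery times.
    while (!timeVortex.empty()) {
        Activity* act = timeVortex.top();
        timeVortex.pop();
        act->execute();   // prints cycle 5, then cycle 10
    }
    return 0;
}
```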
Activity and Event
class Activity is defined in activity.h, and it is the fundamental building block of Discrete Event Simulation.
class Activity is an abstract class that cannot be directly instantiated; it stores event timing information,
such as delivery time, queue order, and priority, and relies on derived classes to implement the actual action
in the virtual interface function execute().
The class also overrides operator< to compute the ordering between objects in the global event queue.
Objects derived from this class can be inserted into the global event queue at a specified time, at which point
they will be executed.
class Event is a non-abstract class derived from class Activity. It is used for passing events across links
that connect two ports (which are defined in the configuration file).
The event object keeps track of the receiving end of the link when it is sent, and will invoke the call back
function on the receiving side when executed.
Sending an event over a non-polling link results in the event being inserted into the global event queue.
Event objects and their derivatives are identified by unique IDs of type id_type, defined as
std::pair<uint64_t, int> in the class body, where the first component is the per-process unique ID and the
second component is the rank number.
The per-process ID is dispensed by incrementing an atomic, static data member of the class, id_counter, which
is then combined with the rank (i.e., the partition ID for multi-process distributed simulation) of the
simulation process to form the event object’s ID.
Note that although the ID generation function generateUniqueId() is defined in class Event, the event object
itself has no ID field. Further derived classes, such as class MemEventBase, store an ID for the object and
initialize it using this function every time a new object is created.
Events can use this ID as a unique identifier (i.e., a key for mapping objects) during memory hierarchy
simulation by calling the getID() method of class MemEventBase.
class Event also defines the handler type, class Event::Handler, which is just a standard functor (function
object) that stores the object instance and the call back function on the receiving end of the event.
The handler object is registered with the receiving link at initialization time using configureLink() of
class BaseComponent, and is called when the execute() method of the event is invoked.
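To make the functor idea concrete, here is a minimal, self-contained sketch of a handler in the style of
Event::Handler: it captures an object instance and a member function and forwards calls to it. All names are
illustrative; SST’s real class is templated similarly but carries more machinery.

```cpp
#include <cstdio>

// Illustrative handler functor in the spirit of Event::Handler.
template <typename ClassT, typename EventT>
class HandlerSketch {
    ClassT* obj;
    void (ClassT::*fn)(EventT*);
public:
    HandlerSketch(ClassT* o, void (ClassT::*f)(EventT*)) : obj(o), fn(f) {}
    void operator()(EventT* ev) { (obj->*fn)(ev); }   // forward to the member
};

struct MyEvent { int payload; };

struct MyComponent {
    void handleEvent(MyEvent* ev) { std::printf("got %d\n", ev->payload); }
};

int main() {
    MyComponent comp;
    HandlerSketch<MyComponent, MyEvent> handler(&comp, &MyComponent::handleEvent);
    MyEvent ev{42};
    handler(&ev);   // what event delivery does, conceptually, on execute()
    return 0;
}
```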
Global Event Queue
The global event queue is a data member named timeVortex of type class TimeVortex contained in
class Simulation_impl. The class itself is abstract and cannot be instantiated; its derived class
class TimeVortexPQ under impl/timevortex is what the simulation actually instantiates.
The implementation uses std::priority_queue for ordering. In Simulation::run(), the closest activities are
popped using the pop() method of the time vortex and then executed.
If the activity is an event object, calling execute() invokes the handler function registered with the
receiving end of the link object.
All events sent over all links are sent to and serialized by the global event queue.
Link objects (of non-polling type) store references to the global event queue, and for every event object sent,
the insert() method of the global queue is called to put the event into the queue at the specified cycle.
Simulation and Component Time
The global simulation time, also known as the “core time”, is the absolute unit of time in the simulation
object. It is of type SimTime_t (defined as uint64_t), stored in the data member currentSimCycle, and accessed
via getCurrentSimCycle().
This is the highest frequency cycle counter in the system, hence the name “core time”, since the core typically
runs at the highest frequency in a real system. Components may have their own slower cycle counters, but these
counters must count at an integer division of the core time, e.g., 1/2, 1/3, or 1/7 of the core frequency.
Components’ local clocks are of type Cycle_t, which is also defined as uint64_t, and components are implemented
against their local clocks rather than the global clock.
To convert between core time and component time, components can optionally create a time converter object,
defined as class TimeConverter. The object is very simple: it only contains a clock divisor, the data member
factor, which indicates that the component time runs at 1/factor of the speed of the core time.
Converting from core to component time (or from component to core time) only involves dividing the core time by
factor (or multiplying the component time by factor).
The conversion object defines a component’s local time, and is hence used in all cases where a component
interacts with the simulation object (e.g., when registering a clock).
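A minimal sketch of this factor-based conversion, with names simplified relative to the SST class:

```cpp
#include <cstdint>

// Sketch of the factor-based conversion performed by a time converter.
// "factor" is the number of core cycles per component cycle.
using SimTime_t = uint64_t;
using Cycle_t   = uint64_t;

struct TimeConverterSketch {
    SimTime_t factor;   // component runs at 1/factor of the core frequency

    Cycle_t   toLocal(SimTime_t core) const { return core / factor; }
    SimTime_t toCore(Cycle_t local)   const { return local * factor; }
};

int main() {
    TimeConverterSketch tc{3};          // component at 1/3 of the core frequency
    SimTime_t core = tc.toCore(7);      // 7 local cycles -> 21 core cycles
    Cycle_t local  = tc.toLocal(core);  // back to 7 local cycles
    return (core == 21 && local == 7) ? 0 : 1;
}
```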
New time converter objects can be created with getTimeConverter() in class TimeLord. This function implements
several conversions between different time units, and is merely a helper for convenience.
The time lord object is a static member of class Simulation, and can be accessed by calling
Simulation_impl::getTimeLord() anywhere in the source code.
The time base, which is also used as the core time, is initialized by calling init() on the time lord object.
This is performed early in the initialization phase in function main(), using the time base read from the
configuration file. Component frequencies must be an integral division of the base frequency derived from the
core time, which is consistent with the fact that time converter objects can only use integer factors.
Clocks
class Clock implements the analogy of a clock that fires periodically and invokes a clock handler function.
This corresponds to an actual clock in a real system, which fires regularly and drives state transitions of
storage components.
The clock object is a direct derivation of class Action, which is itself a derived class of class Activity
acting as a thin abstraction layer with no critical functionality.
Clock objects can also be inserted into the global priority queue and invoked by calling execute(), just like
an event object.
The clock object contains the current component time, currentCycle (of type Cycle_t), the next global time of
invocation, next (of type SimTime_t), and a time converter object that defines the component’s local time.
The clock class also defines a handler class, Clock::Handler, which is just a standard functor class that
stores the instance of the component and the call back member function.
The functor object overrides the function call operator which, when invoked, calls the call back member
function with the instance pointer, the component’s local time, and an optional argument.
A clock object can be registered with multiple handlers, potentially driving different components, by calling
registerHandler(). Registered handlers are stored in a vector, staticHandlerMap, and they are called exactly
once on every clock event.
Handler functions return a boolean value indicating whether they wish to be removed from the clock. If the
return value is true, the handler will not be called on any future clock event.
Clocks are triggered by calling the execute() method, just like an event object.
Each clock invocation advances currentCycle by one, invokes all handler functions (removing those whose return
value requests it from the registered handler list), computes the next global cycle at which the clock should
be invoked, and inserts the same clock object back into the global priority queue by calling insertActivity().
Note that period->getFactor() is just the number of global cycles between two consecutive local cycles.
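The following self-contained sketch captures this per-tick logic (advance, call handlers, drop finished
handlers, compute the next core cycle). Names and the reschedule step are simplified stand-ins for the SST
clock.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of one clock tick in the spirit of Clock::execute().
using SimTime_t = uint64_t;
using Cycle_t   = uint64_t;

struct ClockSketch {
    Cycle_t   currentCycle = 0;
    SimTime_t next         = 0;
    SimTime_t factor       = 1;   // core cycles between two local cycles
    std::vector<std::function<bool(Cycle_t)>> handlers;   // stand-in for staticHandlerMap

    void execute() {
        ++currentCycle;
        // Call each handler once; a handler returning true asks to be removed.
        for (auto it = handlers.begin(); it != handlers.end();) {
            if ((*it)(currentCycle)) it = handlers.erase(it);
            else ++it;
        }
        next += factor;   // next core cycle for this clock; SST would now
                          // re-insert itself via insertActivity()
    }
};

int main() {
    ClockSketch clk;
    clk.factor = 4;
    int calls = 0;
    clk.handlers.push_back([&](Cycle_t c) { ++calls; return c >= 2; });  // done after cycle 2
    clk.execute();   // cycle 1: handler called and kept
    clk.execute();   // cycle 2: handler called and removed
    clk.execute();   // cycle 3: no handlers left
    return (calls == 2 && clk.next == 12) ? 0 : 1;
}
```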
An optimization allows the clock to be de-scheduled from the global queue when there is currently no registered
handler, in order to avoid empty clock invocations, and rescheduled when handlers are inserted.
This is controlled by the boolean data member scheduled, which, if set to false, indicates that the clock is
not scheduled in the global queue. schedule() implements rescheduling by first synchronizing the local clock
with the global clock and then inserting the clock object into the global queue.
Components that require a clock typically implement a function tick() and register it with the clock by calling
registerClock() with the desired component clock frequency, which is typically read from the configuration.
The function is implemented in class BaseComponent, and forwards the call to the function with the same name in
class Simulation_impl.
The simulator object maintains a global pool of clocks in clockMap, each with a different frequency. A new
clock is created if the requested frequency does not exist. The handler is then registered with one of the
clocks, and the clock is scheduled.
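For reference, a typical component registers its tick() roughly as in the hedged sketch below. It assumes the
classic SST element API (Clock::Handler, Params::find), omits the SST_ELI_* registration macros, and uses
illustrative names; header paths and handler types can differ between SST versions.

```cpp
#include <sst/core/component.h>
#include <sst/core/params.h>
#include <string>

// Hedged sketch, not SST source: a component implementing tick() and
// registering it with a clock in its constructor.
class MyComponent : public SST::Component {
public:
    MyComponent(SST::ComponentId_t id, SST::Params& params) : SST::Component(id) {
        // Frequency is typically read from the Python configuration.
        std::string freq = params.find<std::string>("clock", "1GHz");
        // registerClock creates (or reuses) a clock of this frequency,
        // attaches the handler, and returns the clock's time converter.
        registerClock(freq,
            new SST::Clock::Handler<MyComponent>(this, &MyComponent::tick));
    }

    bool tick(SST::Cycle_t cycle) {
        // Per-cycle work goes here; returning true would deregister the handler.
        return false;
    }
};
```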
Links and Ports
class Link defines the communication end points between components or within the same component.
It enables components to insert events into the global event queue which, when executed, invoke a call back
function registered with the link object. This is roughly the equivalent of links in real systems, over which
messages can be sent, buffered, and processed by a functional unit.
In the source code, link objects are also referred to as “ports” because they simulate message endpoints.
Links are always attached to components and subcomponents, and they must be specified in the configuration
file. To connect two components, a Python Link object must be created, and connect() is called on the Python
link object with the Python component object, the name of the link (i.e., the port name), and the link
latencies. Components connected this way can send event objects to each other over the corresponding C++ link
objects by calling the send() method of class Link.
Note that the naming here is a little confusing: in the simulator implementation, class Link is closer to the
end point of a link (i.e., the notion of a port in SST), while in the Python configuration, the sst.Link object
represents a connection, and users explicitly call connect() on the link object with component objects and port
names as arguments.
During the initialization stage, the simulator maps Python sst.Link objects onto class LinkPair objects and,
for each port name that is explicitly connected in the configuration, creates a class Link object.
The simulator does not implement any connection object; instead, it lets the two link objects keep a reference
to each other and uses that reference directly for event delivery.
Link Initialization and Configuration
Link configuration is processed during initialization, in function prepareLinks() of class Simulation_impl.
This function creates link objects for each component as instructed by the configuration file (which is parsed
into a configuration graph), inserts them into the per-component map linkMap (contained in
class ComponentInfo), and connects link objects using a class LinkPair object. The link pair object sets
pair_link, a data member of class Link, to the other link object constituting the connection, essentially
connecting the two link objects.
After link initialization, link objects can be retrieved by calling getLink() on the per-component
class LinkMap object with the port name appearing in the Python configuration file.
Note that only ports connected in the Python configuration file are initialized as link objects.
Components and subcomponents can retrieve the initialized link objects by calling the member function
configureLink() (which has several different flavors), passing the port name that occurs in the configuration
file as the first argument. Link objects can thus be further configured (e.g., bound to a call back function)
by the component’s member functions after they have been initialized.
After prepareLinks(), connected link objects have been created and hold a reference to each other, but
configuration has not finished yet, because the call back functions have not been registered.
This is indicated by the boolean data member configured of class Link still being set to false.
Registering the call backs cannot be done in the Python configuration file; it must be performed by the
component itself.
During component initialization, which is performed in performWireUp() of class Simulation_impl, the
constructor needs to call configureLink(), defined in class BaseComponent, with the name of the port (which
should match the name specified in the configuration file), a time converter object (or anything that can be
converted to a time converter), and a functor object of type class Event::Handler wrapping the call back
function and the object instance.
The call back function is registered with the specified link object by calling setFunctor(), and configuration
concludes by calling setAsConfigured() to set configured to true.
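In element code, this step typically looks like the hedged fragment below (classic SST API; the port name
"in_port", the member link, and the handler name are illustrative and must match the Python configuration).

```cpp
// Hedged fragment (classic SST element API): binding a call back to the link
// created for port "in_port".
MyComponent::MyComponent(SST::ComponentId_t id, SST::Params& params)
    : SST::Component(id) {
    link = configureLink("in_port",
        new SST::Event::Handler<MyComponent>(this, &MyComponent::handleEvent));
}

void MyComponent::handleEvent(SST::Event* ev) {
    // Invoked when the event's execute() delivers it over the link.
    delete ev;   // the receiver owns delivered events
}
```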
To summarize: In SST, link objects are initialized before their components are. Link initialization creates
link objects, assigns them a port name, stores them in a per-component map structure, and connects those links
as indicated by the Python configuration file. Later, at component initialization time, link objects are
retrieved given the component ID and the port name, and are eventually bound to a call back function. This
process is somewhat counter-intuitive, as one might expect the components to be initialized first and to
create the links themselves.
Event Delivery
During simulation, components can send an event object via a link object to another component (possibly
itself, if the link is of type class SelfLink) by calling send() on the link object on their own side.
The send() function simply computes the time of delivery using the component latency value (which is in units
of local cycles) and the time converter, stores the delivery time in the event object by calling
setDeliveryTime(), sets the delivery_link of the event object to a reference to the other link (i.e., the one
stored in pair_link of the sending link object) by calling setDeliveryLink(), and inserts the event object into
the global event queue.
The event is delivered at the exact cycle for which it is scheduled by calling execute() on the event object,
which in turn calls deliverEvent() of the target link object through delivery_link, and eventually invokes the
call back function registered at the other end of the connection.
Note that the link object actually holds a reference to the global event queue in configuredQueue and
recvQueue; it calls getTimeVortex() to obtain the reference and assigns it to these data members during
initialization. In the send() function, the event object is inserted into the global event queue simply by
calling insert() on the recvQueue of the pair_link.
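The self-contained sketch below condenses this send path (stamp the delivery time, record the receiving link,
push into the shared queue). All types are simplified stand-ins for the SST classes described above.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

using SimTime_t = uint64_t;

struct LinkSketch;

struct EventSketch {
    SimTime_t   deliveryTime = 0;
    LinkSketch* deliveryLink = nullptr;
};

struct Later {
    bool operator()(const EventSketch* a, const EventSketch* b) const {
        return a->deliveryTime > b->deliveryTime;
    }
};
using QueueSketch = std::priority_queue<EventSketch*, std::vector<EventSketch*>, Later>;

struct LinkSketch {
    LinkSketch*  pair_link;        // the other end of the connection
    SimTime_t    latency;          // already converted to core cycles
    QueueSketch* recvQueue;        // the global event queue for non-polling links

    void send(SimTime_t now, EventSketch* ev) {
        ev->deliveryTime = now + latency;        // setDeliveryTime()
        ev->deliveryLink = pair_link;            // setDeliveryLink()
        pair_link->recvQueue->push(ev);          // insert() on the time vortex
    }
};

int main() {
    QueueSketch vortex;
    LinkSketch a{nullptr, 10, &vortex}, b{nullptr, 10, &vortex};
    a.pair_link = &b;
    b.pair_link = &a;

    EventSketch ev;
    a.send(/*now=*/100, &ev);   // delivered at core cycle 110 on b's side
    return (ev.deliveryTime == 110 && ev.deliveryLink == &b) ? 0 : 1;
}
```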
Polling Links
What we have discussed above covers one of the two types of link, the non-polling link, which has a fixed
latency and delivers the event using the Discrete Event Simulation mechanism at the exact specified cycle,
eliminating the need for explicit receive calls.
There is a second type of link, the polling link, which does not involve a call back function and requires the
receiver to explicitly call recv() to acquire the event object.
This type of link is easier to configure, since it does not need a call back function.
If setPolling() is called during initialization, the link is a polling link.
The recvQueue and configuredQueue of a polling link are not the global event queue, but a
class PollingLinkQueue object local to the link object (the object is no more than a thin wrapper around
std::multiset).
On send() calls, events are inserted into the polling queue of the other end of the connection.
On explicit recv() calls, which must be made on the receiving end, event objects in the polling queue are
checked against the current global cycle and returned to the caller if the delivery cycle of the event is not
in the future. In other words, the event appears to be received only at the cycle in which recv() is called,
which can differ from the canonical delivery cycle.
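A minimal sketch of such a polling-style recv(), assuming a multiset-backed local queue as described above
(names simplified):

```cpp
#include <cstdint>
#include <set>

using SimTime_t = uint64_t;

struct Ev {
    SimTime_t deliveryTime;
    bool operator<(const Ev& other) const { return deliveryTime < other.deliveryTime; }
};

struct PollingQueueSketch {
    std::multiset<Ev> events;

    // Pops the earliest event if it is due at or before 'now'.
    bool recv(SimTime_t now, Ev* out) {
        if (events.empty()) return false;
        auto it = events.begin();
        if (it->deliveryTime > now) return false;   // still in the future
        *out = *it;
        events.erase(it);
        return true;
    }
};

int main() {
    PollingQueueSketch q;
    q.events.insert({15});
    Ev ev;
    bool early = q.recv(10, &ev);   // false: not due yet
    bool later = q.recv(20, &ev);   // true: received at cycle 20, after its delivery cycle
    return (!early && later) ? 0 : 1;
}
```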
Both types of links can also be configured as self links, defined in class SelfLink as a derived class of
class Link. A self link is no more than a simple modification of a normal link in which the pair_link data
member points to the object itself, so that the component receives the event objects it sends via the self
link. Self links do not need to be declared in the configuration file; they can simply be added by calling
configureSelfLink() in class BaseComponent, and the returned link object is used as usual.
Initialization and Teardown
Before simulation starts, components and links built from the configuration file need to go through an
initialization phase. Similarly, before the simulation shuts down, they need to terminate properly and perform
certain tasks such as printing statistics.
Four interface functions are defined in class BaseComponent as dummy functions, namely init(), setup(),
complete(), and finish(). Derived components can override these interface functions to change their behavior
and implement actual initialization and teardown.
The high-level initialization sequence is defined in main.c, function start_simulation(), which calls into the
simulation object (of type class Simulation_impl). The sequence is as follows: initialize(), setup(), run(),
complete(), and finish(), which also loosely correspond to the four interface functions of
class BaseComponent, except run().
During the first stage, initialize(), components can send essential information to other connected components
using the link objects, and can perform multi-step initialization in a tick-by-tick manner (e.g., warming up a
state machine). This stage is similar to the simulation body in that it also has a loop in which a clock
(tracked by the variable untimed_phase) is maintained and incremented on every loop iteration.
In each iteration, the init() method is called on all components with the value of the clock as an argument,
and components can send event objects across links as in the simulation.
Links behave differently during the init() stage. First, link latency is ignored, and event objects are always
delivered in the next tick (i.e., untimed_phase + 1), or in the same tick if the event is sent synchronously.
Second, link objects must be polled like polling links in the normal simulation; there is no automatic delivery
of event objects via the registered call back handler.
Correspondingly, a component’s init() method must call the untimed versions of send and receive, namely
sendUntimedData() (or sendUntimedData_sync() for same-tick delivery) and recvUntimedData(), to communicate with
other components during initialization.
These two functions are also aliased as sendInitData() and recvInitData(), which just forward the call to the
untimed versions.
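A hedged fragment of what a component’s init() typically looks like (classic SST API; MyInitEvent and the
member link are illustrative; older SST releases expose the same calls as sendInitData()/recvInitData()):

```cpp
// Hedged fragment of a component's untimed init phase.
void MyComponent::init(unsigned int phase) {
    if (phase == 0) {
        // Delivered to the neighbor in the next untimed phase.
        link->sendUntimedData(new MyInitEvent());
    }
    // Links must be polled here; recvUntimedData() returns nullptr when empty.
    while (SST::Event* ev = link->recvUntimedData()) {
        // ... read configuration data sent by the neighbor ...
        delete ev;
    }
}
```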
To implement this special initialization behavior, link objects use an initialization-specific queue, the
untimedQueue, to send and receive event objects for untimed operations. This data member is of type
class InitQueue, which is just a thin wrapper around std::deque. Events are enqueued into and dequeued from the
queue (plus a time check on receive) on send and receive, respectively.
The recvQueue, which is used during normal simulation, is disabled at this stage by setting it to an object of
type class UninitializedQueue. An error will be reported if component initialization routines attempt to use
the regular link functions.
During the first initialization stage, the simulation object also tracks the number of messages sent via all
links between all components using the variable untimed_msg_count, which is incremented by sendUntimedData()
of the link objects. The first stage concludes when the number of events sent on a tick is zero, indicating
that all components have finished exchanging messages.
Links revert to their normal behavior after finalizeLinkConfiguration() of class ComponentInfo is called.
This function iteratively calls finalizeConfiguration() on all links in the component and recursively calls the
same function in child subcomponents, enabling the recvQueue of all links in the simulation.
The untimed queue is disabled by assigning afterInitQueue to it, which will report an error if used.
The second-stage initialization is simple: just iterate over all components in the system and call setup() on
each of them. This is the last notification before simulation begins.
complete() and finish() mirror initialize() and setup(), except that they call complete() and finish() on the
components, respectively.
complete() also runs tick-by-tick, with components communicating post-simulation state over the untimed
channel of the links.
During complete(), link behavior is altered as in initialize() by calling prepareForComplete() on each
ComponentInfo, which forwards the call to the function with the same name in class Link.
This function disables the recvQueue and re-enables the untimedQueue for untimed communication.
The complete() phase concludes when no message is sent during a tick.
finish() is the last notification before objects are destroyed. According to the source code, statistics should
be printed in this stage.
Shared Memory IPC
Inter-process communication in SST can be performed using Linux shared memory support. IPC provides a message-based communication channel between two independent processes, most likely supplying external information to support the simulation, such as the flow of instructions (potentially from multiple threads) fed to the core pipeline simulator. In the abstract model, the IPC channel consists of one or more buffers allocated from the shared memory region, with each buffer being a circular queue that serves messages in FIFO order. The writing end inserts messages into the queue, and may block on this operation if the queue is full to avoid overflow. The reading end retrieves the oldest message from the queue, and can be either blocking or non-blocking. A blocking read waits for new messages to be inserted if the queue is empty. A non-blocking read returns a boolean value indicating whether the operation has succeeded; an empty or locked queue will cause a non-blocking read to fail.
Circular Queue
The circular queue (class CircularBuffer) implements a single message buffer as a FIFO queue. To avoid
reader-writer contention, the queue is protected by a spin lock, bufferMutex (class SSTMutex), which grants
access to only one exclusive reader or writer at a time.
The queue is parameterized on the message type T, given as the template argument.
The actual storage of the queue is at the end of the object, defined as T buffer[0], and must be allocated as
extra storage after the object itself.
The queue maintains two pointers: the read pointer readIndex, which points to the next message to read, and
the write pointer writeIndex, which points to the next free slot for insertion.
The queue is considered empty when the two pointers have the same value, and full when the write pointer is
one slot before the read pointer (i.e., the usable capacity is one less than the number of physically allocated
slots).
Calling read() on the queue performs a blocking read, which does not return until the queue can be locked and
is non-empty. readNB() performs a non-blocking read, which simply returns false if the read fails because the
queue is locked or empty. Write operations, performed by calling write(), always block on a full queue or on
the spin lock.
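The index arithmetic is the interesting part; the self-contained sketch below shows it with a fixed-size array
and non-blocking operations only (the real class uses a trailing buffer[0], a spin lock, and blocking
read()/write()).

```cpp
#include <cstddef>

// Index arithmetic of a CircularBuffer-style FIFO: empty when the indices are
// equal, full when the write index is one slot behind the read index, so the
// usable capacity is (Slots - 1).
template <typename T, size_t Slots>
struct RingSketch {
    T      buffer[Slots];
    size_t readIndex  = 0;
    size_t writeIndex = 0;

    bool empty() const { return readIndex == writeIndex; }
    bool full()  const { return (writeIndex + 1) % Slots == readIndex; }

    bool writeNB(const T& msg) {
        if (full()) return false;
        buffer[writeIndex] = msg;
        writeIndex = (writeIndex + 1) % Slots;
        return true;
    }
    bool readNB(T* out) {            // analogous to readNB()
        if (empty()) return false;
        *out = buffer[readIndex];
        readIndex = (readIndex + 1) % Slots;
        return true;
    }
};

int main() {
    RingSketch<int, 4> q;            // 4 slots, capacity 3
    q.writeNB(1); q.writeNB(2); q.writeNB(3);
    bool rejected = !q.writeNB(4);   // full: write index one slot behind read
    int v = 0;
    q.readNB(&v);                    // v == 1 (FIFO order)
    return (rejected && v == 1) ? 0 : 1;
}
```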
Tunnel
The tunnel object defines the layout of the shared memory, given a pointer to the shared region.
The region consists of three parts. The first part is the metadata for the shared region, including the region
size and the layout of the rest of the region. The layout of this part is described by
struct InternalSharedData. Note that this struct is variable-sized: there is an array of offset values at the
end of the struct (defined as size_t offsets[0]), which must be allocated as extra space after the struct.
The first element of the offsets array is the offset to the second part, while the rest of the array points to
individual queue objects.
The second part is a user-defined shared data structure that is opaque to the tunnel object; its type is given
as a template parameter. Storage is allocated for this data structure right after the tunnel metadata. The
layout of this part is unknown to the tunnel, and users should initialize it properly.
The last part is an array of buffers, whose message type is also given as a template parameter.
There can be multiple buffers of the same size in the shared memory region, and these buffers are allocated in
the remaining storage of the region after the first two parts.
Buffers are referred to by the offsets array in the first part, as mentioned earlier.
Note that although the tunnel object also contains pointers to data structures in the shared region, for
example the data member isd, which points to the first part, and circBuffs, which contains pointers to all the
buffer objects, the tunnel object itself is not shared across processes.
The data member shmPtr tracks the beginning of the shared memory region, which is allocated outside of this
class and passed as a constructor argument. The data member shmSize tracks the size of the region in bytes.
The data member nextAllocPtr is the allocation pointer that points to the next unallocated byte within the
region, and is advanced by the memory allocation function reserveSpace() to reserve memory for the previously
mentioned shared data structures.
The data member numBuffs stores the number of queues in the third part, and buffSize stores the number of
elements in each queue.
The boolean data member master tracks whether the tunnel object is on the SST side (set to true) or on some
third-party process side (set to false). The master tunnel object is responsible for initializing the shared
data structures, while the non-master object just reads them from the shared region and uses them for local
configuration.
The member function reserveSpace() allocates a given number of bytes from the shared region by advancing the
nextAllocPtr pointer. It takes a type as a template argument and a function argument extraSpace.
The number of bytes allocated is the size of the template type plus extraSpace. The extraSpace argument adjusts
the size of the actual allocation for variable-sized data structures, such as the payload part of the circular
queue (defined as T buffer[0]) and the offsets array of struct InternalSharedData.
The function returns a pair consisting of the allocated size and a pointer to the allocated memory address.
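A self-contained sketch of such a bump-pointer allocator is shown below. The (offset, pointer) return
convention and the Header type are illustrative assumptions, not the exact SST interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Sketch of a bump-pointer reserveSpace(): advance a cursor in a pre-mapped
// region by sizeof(T) plus extra bytes for variable-sized tails.
// Alignment handling is omitted.
struct RegionSketch {
    uint8_t* base;          // start of the (shared) region
    size_t   size;          // total bytes in the region
    size_t   nextAllocPtr;  // offset of the next free byte

    template <typename T>
    std::pair<size_t, T*> reserveSpace(size_t extraSpace) {
        size_t bytes = sizeof(T) + extraSpace;
        if (nextAllocPtr + bytes > size) return {0, nullptr};   // out of room
        size_t offset = nextAllocPtr;
        nextAllocPtr += bytes;
        return {offset, reinterpret_cast<T*>(base + offset)};
    }
};

struct Header {             // stand-in for struct InternalSharedData; the real
    size_t shmSize;         // struct ends with a zero-length offsets[] tail,
    size_t numBuffs;        // which is exactly why extraSpace is needed
};

int main() {
    uint8_t storage[4096];
    RegionSketch region{storage, sizeof(storage), 0};

    // Reserve the header plus room for four trailing offset entries.
    auto hdr = region.reserveSpace<Header>(4 * sizeof(size_t));
    return (hdr.second != nullptr && region.nextAllocPtr > sizeof(Header)) ? 0 : 1;
}
```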
Messages in the queues can be accessed by calling writeMessage(), readMessage(), and readMessageNB() with the
index of the buffer as the argument; the semantics of these functions are straightforward.
Data structures in the shared region are initialized by calling initialize() with a pointer to the beginning of
the region. Initialization only takes place when master equals true.
This function first calls reserveSpace() to allocate storage for struct InternalSharedData as well as the
trailing offsets array (which needs (1 + numBuffs) * sizeof(size_t) bytes) at the beginning of the region.
Important system parameters, such as the number of queue objects and the size of the shared region, are stored
in the shared region so that the other end of the IPC channel (where master equals false) can access this
information as well.
It then allocates storage for the template argument type ShareDataType right after the previous part, and
stores the location of this part in offsets[0].
Finally, the queues are initialized in a loop, which allocates storage for each queue object plus the message
storage at the end of the queue, taking an extra sizeof(MsgType) * buffSize bytes.
The queue objects are recorded both in the shared part (isd->offsets) and in the data member circBuffs.
On the other hand, if master equals false, initialization merely pulls information from the shared region
(which was written by the master tunnel object) and stores it in the local, non-master tunnel object.
The data member shmSize stores the total number of bytes (aligned to page boundaries) required for the given
shared region layout. External classes can obtain this number by calling getTunnelSize().
SHMParent
class SHMParent encapsulates the OS interface for creating a shared memory region, and contains a master
tunnel object for managing the layout of the shared region. The object is supposed to be instantiated on the
master side of the IPC; it allocates a shared memory region based on the message class, the number of buffers,
and the buffer size. The region name (a string object) is then passed to the non-master side of the IPC (via a
command line argument of a child process started by the master, for example), so that the other side can map
the same shared region, instantiate the non-master tunnel object, and pull the configuration information to
complete the construction of the IPC channel.
On construction, the class SHMParent constructor generates a key of the format "/sst_shmem_%u-%u-%d", with the
three integer components being the PID, the component ID of the owner (given as a construction argument), and
a random number.
This key is then used as the path argument for shm_open(), which creates a shared region with the given path as
the key. Other processes can map the same region into their own address spaces using the same key.
The tunnel object tunnel is then constructed with arguments defining the shared region’s layout.
shm_open() returns a file descriptor that represents the shared region, which is then passed to ftruncate()
with the desired size of the shared region, obtained from the tunnel object by calling getTunnelSize().
Finally, the shared region is mapped into the virtual address space of the master process by calling mmap()
with the file descriptor and region size.
The layout is initialized by calling initialize() on the tunnel object with the mapped address of the shared
region as the argument, which concludes the shared memory IPC initialization process on the master side.
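The master-side sequence boils down to the standard POSIX calls sketched below (self-contained; the name
components and the size are placeholders, error handling is minimal, and older glibc versions need -lrt to
link shm_open/shm_unlink).

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Sketch of the master side: shm_open() a named region, ftruncate() it to the
// tunnel size, and mmap() it into this process.
int main() {
    char name[64];
    std::snprintf(name, sizeof(name), "/sst_shmem_%u-%u-%d",
                  (unsigned)getpid(), 0u, 42);   // pid, owner id, "random" number

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    size_t tunnelSize = 4096;                    // getTunnelSize() in SST
    if (ftruncate(fd, tunnelSize) != 0) { perror("ftruncate"); return 1; }

    void* shmPtr = mmap(nullptr, tunnelSize, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shmPtr == MAP_FAILED) { perror("mmap"); return 1; }

    // ... tunnel->initialize(shmPtr) would lay out the region here ...

    munmap(shmPtr, tunnelSize);
    close(fd);
    shm_unlink(name);                            // cleanup for this sketch only
    return 0;
}
```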
SHMChild
class SHMChild implements the shared memory IPC endpoint on the non-master side. The class is constructed with
the name of the shared region. It calls shm_open() to obtain the file descriptor corresponding to the name, and
then maps a smaller region of size sizeof(InternalSharedData). This step is necessary because the child does
not yet know the size of the shared region. It then initializes the non-master tunnel object, which pulls
essential information from the mapped region, including the shared region size, shmSize (note that the
constructor being called is not the same as the one used for the master tunnel object).
The small region is then unmapped, and the full region is mapped into the address space of the process by
calling mmap() with the file descriptor and shmSize.
The tunnel object is initialized again with the full shared region by calling initialize() in non-master mode,
which pulls the layout information from the struct InternalSharedData part.
Note that the object offsets recorded in struct InternalSharedData are offsets relative to the beginning of the
shared region. The non-master tunnel needs to rebuild the virtual address pointers by adding these offsets to
the base address of the shared region, since the OS does not guarantee that the region is mapped at the same
virtual address in different processes.
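In code, this pointer rebuild amounts to something like the hedged fragment below (isd, shmPtr, circBuffs, and
the CircularBuffer type mirror the names used in the text; the exact expression in SST may differ).

```cpp
// Hedged fragment: rebuilding a local pointer from a stored offset.
// offsets[0] is the shared data area, so queue i lives at offsets[i + 1].
uint8_t* base = static_cast<uint8_t*>(shmPtr);
auto* buf = reinterpret_cast<CircularBuffer<MsgType>*>(base + isd->offsets[i + 1]);
circBuffs.push_back(buf);
```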
Each initialize() call from a non-master tunnel decrements the isd->expectedChildren counter stored in the
shared region. The counter’s initial value is provided as a construction argument to the master tunnel and
upper-bounds the number of non-master tunnels that are allowed to connect to this shared region.
Once the counter reaches zero after a non-master tunnel completes initialization, the shared memory file
descriptor is removed by calling shm_unlink() on the non-master side, which removes the region name from the
shared memory name space and prevents it from being mapped by other non-master IPC endpoints.
Variants
There are also variants of the shared memory endpoint implementation to fit different non-master environments.
The most notable is the shared memory child endpoint designed specifically for Pin3: common operating system
calls do not work properly in Pin3 and must be replaced with their PinCRT versions.
class MMAPChild_Pin3 implements this non-master endpoint for use in the Pin environment. The high-level logic
is identical to that of the standard implementation in class SHMChild.
Using mmap()
Instead of relying on shm_open() to return a file descriptor representing the shared region, another way of
sharing memory between processes is to directly mmap() a disk file backed by external storage.
The high-level logic is similar to the shared memory approach, except that the file descriptor is obtained by
creating a file under the /tmp directory, which is then passed to mmap().
This is implemented in class MMAPParent.
IPC Tunnel
The combination of tunnel objects and shared memory objects is unnecessarily complicated, since shared memory
allocation and layout management are divided across two classes.
class IPCTunnel addresses this problem by implementing both allocation and management of the shared region in a
single class.
Like tunnel objects, the class operates in master or non-master mode, when running on SST or on an external
process, respectively.
The constructor that takes the component ID and buffer parameters initializes the object in master mode; just
like a class SHMParent object, it first generates a key for the shared region, allocates shared memory using
the shm_open() system call, and then initializes the layout into three parts like a tunnel object.
The constructor that takes the key initializes the object in non-master mode; it simply opens the shared
region, maps it into its own address space, and then pulls information from the shared information area into
local data members.
Processes communicate with each other using the same message-based model by calling writeMessage(),
readMessage(), and readMessageNB(), just as with a tunnel object.