A Scalable Architecture for Ordered Parallelism



Paper Title: A Scalable Architecture for Ordered Parallelism
Link: A Scalable Architecture for Ordered Parallelism
Year: MICRO 2015
Keyword: Swarm; TLS; HTM; Speculation




This paper proposes Swarm, a highly parallel architecture supporting Thread-Level Speculation (TLS). The paper points out that some algorithms intrinsically have abundant parallelism, but most existing speculation mechanisms cannot exploit it well, for several reasons. First, existing HTM designs target unordered execution of transactional regions. Optimizations such as dirty reads are therefore disallowed to maintain isolation, which hinders concurrency: if the order of transactions were known and tracked by hardware in advance, forwarding dirty data from lower numbered tasks to higher numbered ones would be perfectly legal. Second, speculative tasks are created dynamically by earlier tasks, usually driven by the structure of the problem itself (e.g. in a parallel version of Dijkstra’s algorithm, task creation is driven by the adjacency structure of the graph). This typically forces the program to be written in a sequential manner (because there are control dependencies between tasks), which is hard to parallelize without knowing the specific problem. Third, unnecessary data dependencies are introduced if tasks are maintained in software structures. This happens not only when an explicit task scheduler is implemented, but also in the data structures used by the algorithm itself. For example, Dijkstra’s algorithm uses a priority queue to track the distances of candidate nodes from the source. Any insertion of a new node into the queue is likely to conflict with some speculative task that dequeued a node from the queue, but this dependency can be eliminated if hardware tracks the structure of task creation. The last reason is that existing software solutions provide only limited performance improvement while introducing non-negligible overhead, which can offset any gain from parallelization.

Swarm solves the above challenges using a task-based speculation model. Fine-grained pieces of code that can potentially run in parallel when there is no conflict are abstracted as “tasks”. Task execution is expected to have only a few data dependencies; when a dependency is violated, the violating task, all its descendants, and related tasks abort. Tasks are assigned programmer-defined sequence numbers called timestamps, which specify the logical order in which tasks execute. Swarm guarantees that execution appears sequential, following the partial order defined by timestamps. If two or more tasks have the same timestamp, their relative ordering is undefined, and they may take any logical total order (in practice, this total order is the order in which they are selected for execution).
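
To make the timestamp model concrete, here is a small, purely sequential C++ sketch of Dijkstra’s algorithm written in the timestamped-task style, with the tentative distance serving as the timestamp. The swarm_enqueue helper and the dispatch loop are software stand-ins of my own for the hardware mechanisms described below, not Swarm’s actual API.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Edge { uint32_t dst; uint32_t weight; };

struct Task {
  uint64_t ts;                  // timestamp = tentative distance of the node
  std::function<void()> fn;     // task body
};
struct ByTimestamp {
  bool operator()(const Task& a, const Task& b) const { return a.ts > b.ts; }
};

static std::vector<std::vector<Edge>> graph;  // adjacency lists
static std::vector<uint64_t> dist;            // tentative distances
static std::priority_queue<Task, std::vector<Task>, ByTimestamp> task_queue;

// Stand-in for the hardware enqueue_task instruction.
void swarm_enqueue(uint64_t ts, std::function<void()> fn) {
  task_queue.push({ts, std::move(fn)});
}

// One task per (node, distance); timestamp order equals distance order, so
// tasks for closer nodes are logically ordered before tasks for farther ones.
void visit(uint64_t ts, uint32_t node) {
  if (ts > dist[node]) return;              // stale task: node settled earlier
  for (const Edge& e : graph[node]) {
    uint64_t nd = ts + e.weight;
    if (nd < dist[e.dst]) {
      dist[e.dst] = nd;
      swarm_enqueue(nd, [=] { visit(nd, e.dst); });   // spawn a child task
    }
  }
}

int main() {
  graph = {{{1, 2}, {2, 5}}, {{2, 1}}, {}};   // tiny 3-node example graph
  dist.assign(graph.size(), UINT64_MAX);
  dist[0] = 0;
  swarm_enqueue(0, [] { visit(0, 0); });
  while (!task_queue.empty()) {               // sequential dispatch, lowest ts first
    Task t = task_queue.top();
    task_queue.pop();
    t.fn();
  }
  return 0;
}
```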

Swarm assumes a chip multiprocessor organized as tiles. Each tile has a few cores with private L1 caches and a shared L2. The last-level cache (LLC) is shared among all tiles and partitioned such that each tile holds a slice of the LLC array. Swarm assumes the L1 caches are write-through. This may increase the latency of store instructions when the store buffer is heavily loaded, but it makes an L1 flush instantaneous since no write back needs to be performed (we will see later how this benefits conflict detection). Each tile also has a task unit, which consists of a task queue and a commit queue. The task queue holds pending tasks that will execute once resources become available, while the commit queue holds tasks that have completed and are waiting for lower numbered tasks to commit. These two structures are not necessarily of the same size, given that Swarm can commit many tasks in the same cycle. The paper suggests that the commit queue can be implemented as a TCAM, while the task queue can be implemented with two TCAMs that provide indexing capability into the structure. At runtime, the task queue always selects the task with the minimum timestamp to execute.

The task queue consists of task structures, each of which contains the context information necessary to start and maintain a task. The context information includes the starting address of the task (i.e. a function pointer), the task’s arguments, and the task timestamp. In addition, each task structure has a finite number of entries that point to the child tasks the task spawns when it creates them. These child pointers are stored in the current task’s structure and are used when the current task aborts (a commit does not follow the child pointers). A child task pointer is a composite field of the tile ID and the task queue index on that tile. Since a task never changes its location in the queue after it is inserted, this uniquely identifies a task until it finishes and releases the entry.
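
A plausible layout of such a task structure, with illustrative field names and sizes (they are not specified in this note), might look like this:

```cpp
#include <cstdint>

constexpr int kMaxArgs     = 3;   // arguments passed to the task function
constexpr int kMaxChildren = 8;   // finite number of child task pointers

struct ChildPtr {
  uint16_t tile;          // tile holding the child task
  uint16_t queue_index;   // slot in that tile's task queue; stable until the
                          // child finishes and releases its entry
};

struct TaskDescriptor {
  uint64_t timestamp;                   // programmer-defined logical order
  void   (*task_fn)(const uint64_t*);   // starting address (function pointer)
  uint64_t args[kMaxArgs];              // task arguments
  ChildPtr children[kMaxChildren];      // children to notify on abort
  uint8_t  num_children;
};
```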

Swarm adds two new instructions to the ISA: enqueue_task and dequeue_task. The enqueue_task instruction takes a function pointer, a timestamp, and function arguments as operands. A new task is created by dispatching this context information, together with the parent task’s identity, to a randomly selected tile, where a new task structure is allocated and filled in (the child pointer of the current task is also updated). When dequeue_task is executed, the processor performs a priority-based lookup and selects the task with the minimum timestamp in the local task queue. If the task queue is empty, the instruction stalls until a new task is added (the paper does not mention how a stalled processor resumes at the end of the computation; queuing a task that signals the end of the computation might do the job).
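
A software model of what these two instructions are supposed to do, with the tile count, queue implementation, and stall handling chosen purely for illustration:

```cpp
#include <cstdint>
#include <optional>
#include <random>
#include <set>
#include <vector>

struct Task {
  uint64_t timestamp;
  void   (*fn)(uint64_t);
  uint64_t arg;
  bool operator<(const Task& o) const { return timestamp < o.timestamp; }
};

struct Tile {
  std::multiset<Task> task_queue;   // pending tasks, ordered by timestamp
};

static std::vector<Tile> tiles(16);
static std::mt19937 rng;

// enqueue_task: dispatch the new task's context to a randomly selected tile
// (the real hardware would also record a child pointer in the parent's entry).
void enqueue_task(void (*fn)(uint64_t), uint64_t timestamp, uint64_t arg) {
  tiles[rng() % tiles.size()].task_queue.insert(Task{timestamp, fn, arg});
}

// dequeue_task: priority lookup for the minimum-timestamp pending task on
// this tile; a real core would stall on an empty queue instead of returning.
std::optional<Task> dequeue_task(Tile& tile) {
  if (tile.task_queue.empty()) return std::nullopt;
  Task t = *tile.task_queue.begin();
  tile.task_queue.erase(tile.task_queue.begin());
  return t;
}
```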

Since tasks run in the specified logical order, Swarm must track memory accesses to ensure that reads and writes observe this order. Because in some cases reading uncommitted data does not violate the ordering, Swarm allows data forwarding from lower numbered to higher numbered tasks. To achieve this, Swarm adopts eager version management from LogTM: updates are always performed in place, and the undo image (the value before the update) is stored in a per-task log. Each task is allocated a task-local logging area, which is part of the task context. With eager version management, uncommitted data can be forwarded to higher numbered tasks without the latter doing anything special.
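
A minimal sketch of eager version management as described here: the before-image is logged, then the store is performed in place (structure names are mine):

```cpp
#include <cstdint>
#include <vector>

struct UndoEntry {
  uint64_t* addr;     // location that was overwritten
  uint64_t  before;   // value before this task's write
};

struct TaskLog {
  std::vector<UndoEntry> entries;   // per-task undo log (task-local area)
};

// Speculative store: save the before-image, then update memory in place,
// so later (higher-timestamped) tasks that read the location simply see
// the forwarded value.
void spec_store(TaskLog& log, uint64_t* addr, uint64_t value) {
  log.entries.push_back({addr, *addr});
  *addr = value;
}
```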

Conflicts occur when data is forwarded in the wrong direction, i.e. from higher numbered tasks to lower numbered tasks. This happens not only on RAW and WAW dependencies, but also on anti-dependencies (WAR). Luckily, all three cases are detectable via coherence, because they each incur a coherence action when the lower numbered task attempts the load or store. Conceptually, conflict detection works by attaching the current task’s timestamp to the coherence request. When a processor executes a load or store, it inserts the address into a read or write set (implemented as Bloom filters stored in the task queue entry). These two sets are checked every time a coherence request is received from another task, using the timestamp attached to the request. If the coherence request carries a lower timestamp than the current task, and the request’s address is in the read and/or write set (depending on the type of conflict), the current task aborts, since an ordering violation has just occurred. The coherence request can only be fulfilled after the abort is handled.
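
Conceptually, the check performed on an incoming coherence request might look like the following sketch, where the read and write sets are modeled as small Bloom filters with made-up sizes and hashing:

```cpp
#include <bitset>
#include <cstdint>
#include <functional>

struct BloomFilter {
  std::bitset<256> bits;
  void insert(uint64_t addr) {
    bits.set(std::hash<uint64_t>{}(addr) % 256);
    bits.set(std::hash<uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL) % 256);
  }
  bool may_contain(uint64_t addr) const {
    return bits.test(std::hash<uint64_t>{}(addr) % 256) &&
           bits.test(std::hash<uint64_t>{}(addr * 0x9e3779b97f4a7c15ULL) % 256);
  }
};

struct SpecTask {
  uint64_t timestamp;
  BloomFilter read_set, write_set;
  bool aborted = false;
};

// Called when a coherence request carrying the requester's timestamp reaches
// a running task or a commit-queue entry; is_write distinguishes a write or
// upgrade request from a read request.
void on_coherence_request(SpecTask& task, uint64_t addr,
                          uint64_t req_timestamp, bool is_write) {
  if (req_timestamp >= task.timestamp) return;   // legal forwarding direction
  bool conflict = is_write
      ? (task.read_set.may_contain(addr) || task.write_set.may_contain(addr))
      : task.write_set.may_contain(addr);        // a read only conflicts with a write
  if (conflict) task.aborted = true;             // ordering violation: abort
}
```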

Swarm chooses to decouple data from read/write-set maintenance in order to simplify the handling of overflowed state. If a speculatively read or written cache line is evicted and written back to memory, instead of aborting the task or blocking the eviction until the task becomes non-speculative, Swarm simply follows the write-back protocol (if needed) and indicates to the directory that the sharer bit should be preserved (“sticky” bits, borrowed from LogTM). The next time a coherence request for the same address reaches the directory, it is forwarded to that processor. The processor checks the timestamp and the address, and acts as if the block were still in the local cache.
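
A rough sketch of the sticky-bit idea, assuming a simple bit-vector directory (the structures are illustrative, not the paper’s):

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

constexpr int kTiles = 16;

struct DirEntry {
  std::bitset<kTiles> sharers;   // includes "sticky" tiles whose caches no
                                 // longer hold the line
};

static std::unordered_map<uint64_t, DirEntry> directory;

// A tile evicts a speculatively accessed line: the data is written back if
// needed, but the directory keeps the tile's sharer bit set (the sticky bit).
void evict_speculative_line(uint64_t line_addr, int tile) {
  directory[line_addr].sharers.set(tile);
}

// A later request to the same line is forwarded to every (possibly sticky)
// sharer, which checks the attached timestamp against its read/write sets
// as if the line were still cached.
std::bitset<kTiles> tiles_to_check(uint64_t line_addr) {
  auto it = directory.find(line_addr);
  return it == directory.end() ? std::bitset<kTiles>{} : it->second.sharers;
}
```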

Note that the paper seems to assume there are no coherence bits within a tile’s L2, perhaps because the L1 is write-through (so only one valid copy exists in the L2 for each tile even if several cores have written to a cache line). In this case, detecting a violation purely through coherence is impossible. One straightforward solution is to broadcast the current memory access to all cores in the same tile (sharing the same L2) whenever a core executes a load or store. The paper offers an optimization to reduce this broadcast traffic: L2 lines are tagged with the timestamp of the task that inserted the line into the cache. If an access hits in the L2, but the timestamp of the task initiating the access is lower than the L2 line’s timestamp, a broadcast is sent to all cores in the tile, and the address is checked against their R/W sets. In addition, the untagged L1 cache introduces another problem: if a task is switched out of a core (e.g. it completes), no conflict detection is performed for the task switched in, since the L1 cache still holds valid data. If the newly switched-in task has a smaller timestamp, this constitutes a violation, but no check is performed. To counter this, the paper suggests that the L1 be flushed whenever a task completes or is switched out. Since the L1 is write-through, this requires no costly write back and can be accomplished with a flash clear of valid bits.
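
Both fixes are simple to state in code; the sketch below uses illustrative structures for the L2 timestamp filter and the L1 flash clear:

```cpp
#include <cstdint>
#include <vector>

// Each L2 line remembers the timestamp of the task that installed it; an
// access by a task with a lower timestamp must be broadcast to all cores in
// the tile so they can check their read/write sets.
struct L2Line {
  uint64_t tag;
  uint64_t insert_ts;   // timestamp of the task that brought the line in
  bool     valid;
};

bool needs_intra_tile_broadcast(const L2Line& line, uint64_t access_ts) {
  return line.valid && access_ts < line.insert_ts;
}

// The write-through L1 can be invalidated with a flash clear of valid bits
// whenever a task completes or is switched out: no dirty data to write back.
struct L1Line { bool valid; };

void flush_l1(std::vector<L1Line>& l1) {
  for (L1Line& line : l1) line.valid = false;   // flash clear
}
```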

After a task completes, its descriptor is inserted into the commit queue. Note that tasks in the commit queue are still validated against incoming coherence requests. Tasks cannot commit locally without communicating with remote tiles, because the commit order defined by timestamps must be obeyed: a task can only commit if there is no lower numbered task still unfinished. To this end, every tile sends the timestamp of its minimum unfinished task (either running or idle) to a centralized global arbiter. The arbiter maintains a global virtual time (GVT), which is the timestamp of the globally minimum unfinished task. After receiving messages from all tiles, the arbiter computes their minimum and overwrites the current GVT, then broadcasts the updated GVT to all tiles. On receiving the broadcast, each tile commits all tasks in its commit queue whose timestamps are less than the GVT. Due to eager version management, committing is instant, and a tile can commit many tasks in the same cycle.
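
The GVT protocol can be summarized with the following sketch, where the commit queue is assumed to be kept in timestamp order:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <limits>
#include <vector>

struct CommitWaiter { uint64_t timestamp; };   // finished, waiting to commit

struct Tile {
  std::deque<CommitWaiter> commit_queue;   // assumed ordered by timestamp
  uint64_t min_unfinished_ts;              // lowest ts still running/idle here
};

// Arbiter: GVT = minimum unfinished timestamp across all tiles.
uint64_t compute_gvt(const std::vector<Tile>& tiles) {
  uint64_t gvt = std::numeric_limits<uint64_t>::max();
  for (const Tile& t : tiles) gvt = std::min(gvt, t.min_unfinished_ts);
  return gvt;
}

// Each tile, on receiving the GVT broadcast, commits every finished task
// older than the GVT; with eager versioning this is just discarding the
// undo log and read/write sets, so many tasks can commit at once.
void commit_up_to_gvt(Tile& tile, uint64_t gvt) {
  while (!tile.commit_queue.empty() &&
         tile.commit_queue.front().timestamp < gvt)
    tile.commit_queue.pop_front();
}
```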

On abort, the changes made by the task are undone by reading the undo log and copying the before-images back to their original addresses. Child tasks are also notified to abort by following the child pointers in the task descriptor. Note that although tasks with data dependencies and anti-dependencies on the aborting task must also abort, this happens naturally as the undo log entries are written back to their original addresses. Since these runtime dependencies only point from lower numbered tasks to higher numbered ones, when the lower numbered task aborts and writes the same addresses again, it effectively introduces dependencies from those higher numbered tasks back to the aborted task, which aborts the higher numbered tasks and resolves the data dependencies.
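
Putting the undo log and the child pointers together, the abort path might look roughly like this (the recursive cascade and structure layout are my own simplification):

```cpp
#include <cstdint>
#include <vector>

struct UndoEntry {
  uint64_t* addr;
  uint64_t  before;   // value before this task's write
};

struct SpecTask {
  std::vector<UndoEntry> undo_log;
  std::vector<SpecTask*> children;   // tasks this task enqueued
  bool aborted = false;
};

void abort_task(SpecTask& t) {
  if (t.aborted) return;
  t.aborted = true;
  // Undo in reverse order; each restoring store looks like a new write to
  // the same address, so dependent higher-numbered tasks are aborted by the
  // normal conflict-detection path.
  for (auto it = t.undo_log.rbegin(); it != t.undo_log.rend(); ++it)
    *it->addr = it->before;
  t.undo_log.clear();
  // Cascade the abort to children spawned by this task.
  for (SpecTask* c : t.children) abort_task(*c);
}
```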