SpZip: Architectural Support for Effective Data Compression in Irregular Applications



Paper Title: SpZip: Architectural Support for Effective Data Compression in Irregular Applications
Link: https://conferences.computer.org/iscapub/pdfs/ISCA2021-4ghucdBnCWYB7ES2Pe4YdT/333300b069/333300b069.pdf
Year: ISCA 2021
Keyword: SpZip; Compression; Data Flow Execution




Highlight:

  1. Decouples memory accesses from data dependencies. Modern core pipelines can only support a limited number of outstanding memory requests, and the pipeline often stalls when a data dependency cannot be fulfilled due to cache misses. This paper proposes a simple data flow model that runs ahead of the core pipeline’s instruction stream and is programmed such that memory requests can be issued in parallel from different operators and/or for ranged scan (bulk load) operations. This is essentially a highly flexible (programmable) gather-scatter engine that does not consume processor cycles.

  2. The paper uses two examples (PageRank, BFS) to demonstrate that the underlying memory access pattern of many algorithms can be described with two logical operators: ranged scans and indirections. These two types of operators can further be connected via queue buffers to form a practical data flow model. Both logical operators are simple to implement, because they are essentially address generation units plus MSHRs.

  3. Data compression can still be performed with irregular access patterns and with data flow machines. The key insight is that, with proper transformation of the graph algorithm (e.g., buffering, clustering), scattered writes can still be turned into sequential writes. These sequential writes can be easily detected by the data flow machine, which will then apply the appropriate compression. Decompression is performed similarly when previously compressed data is read by a data flow operator.

  4. Logical data flow operators can be physically implemented as multiplexed operators with only a single physical instance per operator type. Similarly, queues between operators can be implemented with scratchpad memory and head/tail pointers. To achieve multiplexing, the physical operators must be able to load and unload context objects, which contain the full configuration and internal state (e.g., MSHRs) needed for the physical operator to behave as the logical operator the context describes. A central scheduler then controls the loading and unloading of context objects to multiplex the physical operator between different logical operators. In one word: programmability via multiplexing.

Comments:

  1. I could be wrong, but it is not obvious how the hardware-accelerated data flow architecture can support other commonly used data structures besides sparse matrices encoded in Compressed Sparse Row (CSR) format. While I agree that CSR is general enough to encode sparse graphs, and is hence very useful for implementing graph algorithms, when it comes to something like a linked list or a hash table, could SpZip handle them as well? My best guess is no, because there is no way to express conditions in the data flow architecture. Imagine the following scenario: I have an array of keys, and I want to fetch the values stored in a hash table for each key. On the CPU side, I can compute the hash values of several keys using SIMD instructions. It would be great if these hash values could be fed into the queues of the data flow machine, letting it issue the memory requests to traverse a linear-probing hash table and grab the values.

  2. Figure 9 does not show the L1 cache, which can be misleading because it may seem that SpZip gets rid of L1 entirely (which would not be a good design choice, as it would cripple applications that do not use SpZip). It turns out that L1 is still used in the simulation; it is just not drawn in Figure 9.

  3. It seems that compression only works with certain algorithmic transformations (e.g., UB)? What if I do not use them?

  4. Who provides the stream for compression to the compression operator? It cannot be the output of some earlier-stage operator, because operators cannot compute (e.g., the scores in PageRank or dist in BFS). The paper also does not mention whether it is the application code that should feed the input to the compression operator. If so, how does it specify the location of the input? Using a memory address? If compression is done after data has already been written to memory, then what is the purpose? You have already written the data and paid the bandwidth cost.

This paper proposes SpZip, a data-flow memory access architecture with compression for better cache and memory efficiency on irregular data structures and algorithms. SpZip is motivated by two important observations. First, many real-world applications use sparse data structures in which most of the elements are zero, and the data structure itself is stored in a compact format. As a result, these applications exhibit irregular access patterns over the address space. Pure software solutions that use manually crafted, ad-hoc routines to traverse these structures usually suffer from high cache miss rates due to software’s inability to communicate the access pattern to hardware. The core pipeline is hence frequently stalled by data dependencies, which prevents it from issuing memory accesses on the critical path and causes under-utilization of memory bandwidth. Second, conventional compressed cache architectures are agnostic to the layout of the underlying data structures, and most of them perform compression at a granularity consistent with the cache hierarchy’s block interface. This does not work well for irregular accesses, since these accesses are typically scattered across the address space due to indirections, making it difficult to group accesses that are nearby in the address space for compression or decompression.

The design of SpZip seems to primarily focus on a general encoding for sparse matrices, the Compressed Sparse Row (CSR) format, which can be used to represent the adjacency matrix of a graph. CSR encodes a sparse matrix by only storing the non-zero entries of each row as (column id, value) tuples. The matrix is still stored in row-major order, meaning that non-zero values on the same row are stored adjacent to each other, and the column ids are sorted (at least in all the examples given in the paper; in practice, as the paper explicitly points out, sorting the columns of a row by some key other than the column id does not change the graph being encoded, but it may bring locality benefits and result in performance improvement). Random access to individual rows of a CSR matrix is supported by adding an “offset” array that serves as an index into the body of the matrix. The offset array contains the starting offset of each row in the body, with exactly one element per row. Random access by column is not supported: in order to read the value of a column given a row, the entire row must be traversed, and if an entry with that column id exists, its value is read. Otherwise, the column holds the value zero and is simply not stored.
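
To make the layout concrete, here is a minimal CSR sketch in C++ (the struct and field names are my own, not the paper’s):

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR encoding of a sparse adjacency matrix (names are illustrative).
struct CSRMatrix {
    std::vector<uint64_t> offsets;  // one entry per row: start of that row in the body
    std::vector<uint32_t> col_ids;  // body: column id of each non-zero, row-major
    std::vector<float>    values;   // body: value of each non-zero (may be omitted for unweighted graphs)

    // Return the value at (row, col), or 0 if the entry is not stored.
    float at(uint32_t row, uint32_t col) const {
        uint64_t begin = offsets[row];
        uint64_t end   = (row + 1 < offsets.size()) ? offsets[row + 1] : col_ids.size();
        for (uint64_t i = begin; i < end; i++)        // must scan the whole row:
            if (col_ids[i] == col) return values[i];  // no random access by column
        return 0.0f;                                  // zero entries are simply not stored
    }
};
```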

The paper then presents two of the most common access patterns that graph algorithms perform. The first pattern, most notably seen in PageRank (push model), enumerates all nodes in the graph as the source node, and for each source node, enumerates its adjacent nodes as destinations. Each destination’s score is then updated using a “contrib” value computed from the source node. Both the destination scores and the sources’ contrib values are stored in separate arrays for fast access.
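
A one-iteration sketch of this push-style pattern, assuming the CSRMatrix layout above (illustrative code, not the paper’s kernel):

```cpp
// One iteration of push-style PageRank over a CSR graph (illustrative sketch).
void pagerank_push_iteration(const CSRMatrix& g,
                             const std::vector<float>& contrib,  // indexed by source node
                             std::vector<float>& scores) {       // indexed by destination node
    for (uint32_t src = 0; src < g.offsets.size(); src++) {      // ranged scan over offsets
        uint64_t begin = g.offsets[src];
        uint64_t end   = (src + 1 < g.offsets.size()) ? g.offsets[src + 1] : g.col_ids.size();
        for (uint64_t e = begin; e < end; e++) {                 // ranged scan over the row body
            uint32_t dst = g.col_ids[e];
            scores[dst] += contrib[src];                         // scattered, data-dependent write
        }
    }
}
```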

The data flow model of this pattern is described as follows. First, a ranged scan operator reads out values in the offset array. This can be done in parallel since there is no data dependency. Then, for each value read from the offset array, an indirection operator uses the value as an index into the body of the CSR matrix. The row of the CSR matrix is read out by another ranged scan, whose starting address is the address computed by the indirection operator and whose size is given by the difference between the current offset value and the next one (or by the end of the matrix body, if the current value is the last one in the offset array). The output of this second scan is then sent to the CPU core for value updates. In the meantime, values in the contrib array are also read out by a scan operator in a similar manner, since they are indexed by the source node ids, which are enumerated in order. Values in the score array, however, cannot be read by scans, since it is indexed by destination node ids, which are themselves produced by the second ranged scan and are not guaranteed to form a range. The paper shows that it can be accessed using a single indirection operator whose inputs are the column ids output by the second ranged scan over the matrix body. Note that the data flow model itself does not know how the score array is updated, since the function required to compute the new value is not programmed into the model. SpZip proposes that the model only prefetches these “terminal values” (i.e., values that are not used as sources to other data flow operators) into the cache, and lets the processor compute the functions and write the updated values.

The second pattern differs from the first in that not all nodes are traversed as sources. Instead, the algorithm maintains a list of “frontier nodes” which represents the current working set. On each iteration, the frontier list is updated by enumerating the adjacent nodes of each node in the current frontier and adding these nodes to the new frontier list. In the meantime, some property of the adjacent nodes (e.g., distance, or “dist”) is also updated by computing a function on the node.
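
A minimal sketch of one such iteration, using BFS as the example (assuming the CSRMatrix sketch above; unvisited nodes are marked with the maximum distance value):

```cpp
#include <limits>

// One BFS iteration over a CSR graph: expand the current frontier, update dist,
// and build the next frontier (illustrative sketch, not the paper's exact code).
std::vector<uint32_t> bfs_iteration(const CSRMatrix& g,
                                    const std::vector<uint32_t>& frontier,
                                    std::vector<uint32_t>& dist,
                                    uint32_t level) {
    std::vector<uint32_t> next_frontier;
    for (uint32_t v : frontier) {                                // scan over the frontier list
        uint64_t begin = g.offsets[v];
        uint64_t end   = (v + 1 < g.offsets.size()) ? g.offsets[v + 1] : g.col_ids.size();
        for (uint64_t e = begin; e < end; e++) {                 // scan over v's neighbors
            uint32_t u = g.col_ids[e];
            if (dist[u] == std::numeric_limits<uint32_t>::max()) {  // data-dependent check
                dist[u] = level + 1;                             // scattered write into dist
                next_frontier.push_back(u);
            }
        }
    }
    return next_frontier;
}
```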

The data flow of the second pattern is described as follows. First, a ranged scan operator over the current frontier list enumerates all frontier nodes. Then, the offset array is accessed by an indirection operator, using the node ids from the frontier list as indices. The next step is similar to the first pattern: another ranged scan operator reads the body of the matrix, with its base and size given by the results from the offset array. Note that since the offset array is not scanned here, we actually need to read two values (the current index and the next index) from the offset array to determine the base and size of the scan over the matrix body; this was not required in the first pattern, because the offset array is scanned there and the next index is guaranteed to be available. Meanwhile, the dist array is also accessed by an indirection operator whose inputs are the adjacent node ids produced by the second ranged scan (the one that produces adjacent nodes given a frontier node). As in the first pattern, the operator on the dist array does not compute the update function. Instead, it only issues memory requests to prefetch these blocks into the cache hierarchy and lets the processor decide how updates are performed.
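
The key difference, namely that the neighbor range of a frontier node must be derived from two indirect reads of the offset array, can be illustrated as follows (sketch code, same CSRMatrix assumption as above):

```cpp
#include <utility>

// For a frontier node v, the (base, size) of its neighbor range comes from two
// indirect reads into the offset array, since the offset array is not scanned here.
std::pair<uint64_t, uint64_t> neighbor_range(const CSRMatrix& g, uint32_t v) {
    uint64_t begin = g.offsets[v];
    uint64_t end   = (v + 1 < g.offsets.size()) ? g.offsets[v + 1] : g.col_ids.size();
    return {begin, end - begin};   // fed as (begin offset, size) to the ranged scan over the body
}
```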

In both access patterns, scan operators insert a special marker into the output stream to indicate the end of a scan operation. The marker is preserved by next-stage operators without any modification. The CPU may need the marker in the final output in order to determine the progress of the computation (e.g., to use it as a loop condition).

SpZip implements the above data flow model with two operators: ranged scan and indirection. Operators take input values from an input queue and produce output values to an output queue; operators that are connected together share the same queue for data communication. Queues have a configurable word size to handle different data types. Operators themselves do not transform or modify data. Instead, they are programmed to follow certain access patterns and fetch data from the memory hierarchy by generating addresses and issuing memory requests.

The ranged scan operator requires a base array address and an element size, which are pre-configured into the operator’s context. In every iteration, it takes a tuple (begin offset, size) from the input queue and issues memory requests to fetch all elements in that range. The fetched elements are then put into the output queue for consumption. The indirection operator similarly requires a base address and element size, but it takes a single index from its input queue and fetches the value at the computed address into the output queue.
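
A software analogue of the two operators, with queues modeled as std::deque, might look like this (purely illustrative; the hardware operators only generate addresses and issue memory requests rather than copying data on the core):

```cpp
#include <cstdint>
#include <deque>
#include <utility>

// Ranged scan: for each (begin, size) tuple, emit base[begin .. begin + size).
template <typename T>
void ranged_scan(const T* base,
                 std::deque<std::pair<uint64_t, uint64_t>>& in,
                 std::deque<T>& out) {
    while (!in.empty()) {
        auto [begin, size] = in.front(); in.pop_front();
        for (uint64_t i = 0; i < size; i++) out.push_back(base[begin + i]);
    }
}

// Indirection: for each index taken from the input queue, emit base[index].
template <typename T, typename Idx>
void indirection(const T* base, std::deque<Idx>& in, std::deque<T>& out) {
    while (!in.empty()) {
        Idx idx = in.front(); in.pop_front();
        out.push_back(base[idx]);
    }
}
```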

Since cache misses are common, especially for irregular accesses, each operator has a few MSHRs that are also part of its context. When a memory request is fulfilled by the lower levels of the hierarchy, these MSHRs are checked to direct the response to the operator that issued the request.

Some operators produce values that are not consumed by other operators but rather by the processor; these operators’ output queues can be accessed by the processor pipeline using dequeue instructions. Similarly, some operators do not consume inputs generated by other operators; their inputs are pushed into the queue by the processor using enqueue instructions. As discussed earlier, there are also special operators that do not produce any value into a queue: they simply issue memory requests to prefetch memory blocks into the cache hierarchy for faster access.

At the physical level, only one instance of each operator type is provisioned, and this single physical operator is multiplexed to emulate multiple logical operators. To achieve this, SpZip designs physical operators such that they can load and unload context objects. Each context object contains all the operational arguments (e.g., base array address) and internal state (e.g., internal counter values, MSHRs) that define a logical operator. The physical operator can thus emulate a logical operator by loading its context object, and temporarily swap it out by unloading the context object. A central coordinator controls the loading and unloading of contexts, similar to how an OS scheduler controls thread scheduling. A logical operator is loaded if its input queue is not empty, and is swapped out if all its MSHRs are occupied (meaning no more memory requests can be issued) or if its output queue is full. The coordinator maintains the context objects of logical operators, which can be initialized by the application program via memory-mapped I/O.
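
A simplified view of what a context object and the coordinator’s scheduling decision might look like (the fields and policy below are my reading of the description, not the actual hardware layout):

```cpp
#include <cstdint>
#include <vector>

// Illustrative context object for one logical operator (fields are guesses).
struct OperatorContext {
    enum class Kind { RangedScan, Indirection } kind;
    uint64_t base_addr;           // base address of the array the operator walks
    uint32_t elem_size;           // element size in bytes
    uint32_t in_queue_id;         // scratchpad queue this operator consumes from
    uint32_t out_queue_id;        // scratchpad queue this operator produces into
    uint64_t cursor;              // internal progress counter (e.g., next address to issue)
    std::vector<uint64_t> mshrs;  // outstanding misses, saved/restored with the context
};

// Coordinator policy sketch: pick a logical operator that can make progress.
// Load = input queue has data; unload = MSHRs exhausted or output queue full.
int pick_next(const std::vector<OperatorContext>& ctxs,
              const std::vector<bool>& input_ready,
              const std::vector<bool>& output_full,
              const std::vector<bool>& mshrs_full) {
    for (size_t i = 0; i < ctxs.size(); i++)
        if (input_ready[i] && !output_full[i] && !mshrs_full[i])
            return static_cast<int>(i);   // load this context onto the physical operator
    return -1;                            // nothing runnable; the physical operator idles
}
```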

Similarly, queues in SpZip are not dedicated hardware structures. Instead, they are allocated from a monolithic scratchpad and emulated as circular buffers by a pair of pointers representing the head and tail. Queue sizes and element sizes can also be configured by the application via memory-mapped I/O.
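
In other words, each queue is little more than a scratchpad region plus head/tail pointers; a rough sketch:

```cpp
#include <cstdint>
#include <cstring>

// Circular-buffer queue carved out of a flat scratchpad (illustrative sketch).
// One slot is kept empty to distinguish a full queue from an empty one.
struct SpzQueue {
    uint8_t* scratchpad;           // backing scratchpad memory
    uint32_t base;                 // byte offset of this queue's region
    uint32_t capacity;             // number of element slots
    uint32_t elem_size;            // bytes per element (configurable word size)
    uint32_t head = 0, tail = 0;   // consumer / producer positions

    bool empty() const { return head == tail; }
    bool full()  const { return (tail + 1) % capacity == head; }

    bool enqueue(const void* elem) {
        if (full()) return false;
        std::memcpy(scratchpad + base + tail * elem_size, elem, elem_size);
        tail = (tail + 1) % capacity;
        return true;
    }
    bool dequeue(void* elem) {
        if (empty()) return false;
        std::memcpy(elem, scratchpad + base + head * elem_size, elem_size);
        head = (head + 1) % capacity;
        return true;
    }
};
```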

To further optimize memory efficiency, SpZip supports compression at stream level. A stream is informally defined as a consecutive range of memory written by the application. In graph algorithms, however, it is not always possible to generate writes to consecutive addresses, even though the data structure being written is allocated as a contiguous array. Two examples are the scores array of PageRank and the dist array of BFS. The paper proposes that a transformation can be applied to the algorithm such that scattered writes to these arrays are first clustered and buffered into smaller groups, such that writes close to each other in the destination array end up in the same group, and then, for each group, the updates are applied to the destination array. Both the clustering phase and the final update phase become mostly sequential under this transformation, which greatly improves locality and makes compression possible.
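
A sketch of this transformation for the scores array: scattered (destination, delta) updates are first binned by destination range, then each bin is applied in one pass over a small region of the array (the bin width and exact batching scheme here are my own simplification):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Cluster scattered (dst, delta) updates into bins by destination range, then apply
// each bin over a small region of the array. The bin buffers are appended to
// sequentially, which is what makes stream compression applicable.
void batched_update(std::vector<float>& scores,
                    const std::vector<std::pair<uint32_t, float>>& updates,
                    uint32_t bin_width) {
    uint32_t num_bins = static_cast<uint32_t>((scores.size() + bin_width - 1) / bin_width);
    std::vector<std::vector<std::pair<uint32_t, float>>> bins(num_bins);
    for (const auto& [dst, delta] : updates)      // phase 1: sequential append into bins
        bins[dst / bin_width].emplace_back(dst, delta);
    for (const auto& bin : bins)                  // phase 2: each bin touches one small
        for (const auto& [dst, delta] : bin)      // region of scores, so writes stay local
            scores[dst] += delta;
}
```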

SpZip implements compression using a compression operator. The application code directs data that is stored sequentially in memory to the compression operator, which runs a compression algorithm over the given data and stores the result in another pre-configured location. In PageRank and BFS, this can be done as the last step of updating the scores and dist arrays, after the clustering step. SpZip supports two compression algorithms: delta encoding and Bit-Plane Compression. More complicated algorithms are possible, but they may not always work well.
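
Delta encoding, the simpler of the two, roughly works as follows (a sketch of the idea; SpZip’s actual stream format and word widths are hardware-defined):

```cpp
#include <cstdint>
#include <vector>

// Delta-encode a sorted or slowly-varying stream: store the first value, then
// differences. Neighbor lists with sorted column ids compress well this way.
std::vector<int64_t> delta_encode(const std::vector<uint64_t>& stream) {
    std::vector<int64_t> out;
    if (stream.empty()) return out;
    out.push_back(static_cast<int64_t>(stream[0]));
    for (size_t i = 1; i < stream.size(); i++)
        out.push_back(static_cast<int64_t>(stream[i]) - static_cast<int64_t>(stream[i - 1]));
    return out;  // small deltas can then be packed into fewer bits
}

std::vector<uint64_t> delta_decode(const std::vector<int64_t>& deltas) {
    std::vector<uint64_t> out;
    int64_t acc = 0;
    for (int64_t d : deltas) { acc += d; out.push_back(static_cast<uint64_t>(acc)); }
    return out;
}
```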

Compressed data cannot be handled directly by other operators. To solve this, when compressed data is being read from memory (e.g., as the input of the next iteration after the previous iteration has completed), a decompression operator needs to be added before the first operator that handles uncompressed data.

In the overall architecture, SpZip resides between the core pipeline and the L2/LLC as a data flow accelerator. Application programs initialize the logical operators and buffers, and kickstart the data flow machine by providing the initial inputs using enqueue instructions on the input queue of the first operator (e.g., an initial set of nodes for PageRank, or the root of the traversal in BFS). The data flow machine then runs independently of the core pipeline, issuing memory requests following the data flow pattern programmed into it, and eventually produces data in its output queue. The core pipeline uses dequeue instructions to extract data from the queue and uses it for the final computation step of the current iteration (e.g., computing scores in PageRank, or dist in BFS). Memory requests from data flow operators are sent to the L2 cache to avoid polluting L1, and compression operators write compressed data to the LLC, since that data is unlikely to be used in the near future.
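
Putting it together, a host-side driver for one iteration might look like the sketch below. The spz_* functions are placeholder stubs standing in for the memory-mapped configuration and the enqueue/dequeue instructions; they are not the paper’s actual interface, just names for illustration.

```cpp
#include <cstdint>
#include <vector>

// Placeholder stubs for the hardware interface (hypothetical names, not SpZip's ISA).
void spz_configure_pipeline() { /* set up operator contexts and queues via MMIO */ }
void spz_enqueue(uint32_t queue_id, uint64_t value) { (void)queue_id; (void)value; }
bool spz_dequeue(uint32_t queue_id, uint64_t* value) { (void)queue_id; (void)value; return false; }

// One iteration: configure, kickstart with initial inputs, then drain terminal values
// and perform the algorithm-specific update on the core (BFS-style update shown).
void run_iteration(const std::vector<uint64_t>& frontier, uint32_t in_q, uint32_t out_q,
                   std::vector<uint32_t>& dist, uint32_t level) {
    spz_configure_pipeline();
    for (uint64_t v : frontier) spz_enqueue(in_q, v);   // kickstart the first operator
    uint64_t u;
    while (spz_dequeue(out_q, &u))                      // dequeue prefetched terminal values
        dist[u] = level + 1;                            // the CPU applies the actual update
}
```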