Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System



Paper Title: Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System
Link: https://www.microarch.org/micro53/papers/738300a525.pdf
Venue: MICRO 2020
Keyword: NVM; HTM; UHTM




Author’s Feedback About This Review (email copy-paste):

From: Jungi Jeong

Date: 2020-12-29 14:16

To: ziqiw@cs.cmu.edu

Subject: Thanks for reading UHTM.

Hi, Ziqi

I am Jungi, the author of UHTM, presented at MICRO 2020. I ran into your blog post by accident and found out you did not enjoy reading it. Sorry about that. It is always difficult to write a good paper :).

I really appreciate you reading my paper, and I suspect you spent quite some time on it, since your concerns are actually better than those of some reviewers of prior submissions. I hope you find satisfying answers below.

If you have other concerns, please let me know without hesitation.

Best wishes,

Jungi

===============

The paper lacks many details that are necessary to judge its feasibility. For example: (1) Where is the per-core bloom filter placed? LLC or L1? The LLC would be a better place, since a request only knows that it misses the hierarchy at the LLC. Also, inserting into the filter is easier on LLC evictions. (2) When a block is evicted from the LLC, how are bloom filters updated? Do we insert into the read set for those in the sharers list, and into the write set for the one in the owner list? (3) In the above scenario, what if one of the sharers or the owner is currently switched out (so its bloom filter is not loaded)? Do we invoke the OS and let the OS find its bloom filter?

==> (1) Bloom filters are located in the memory controller, but they could be placed in the LLC as well. It is a matter of implementation.

(2) Correct. If the evicted block has multiple sharers, all corresponding read-set BFs (there could be multiple) must be updated. Similarly, the write-set BF of the owner (there is exactly one) must also be updated.

(3) Context switches do not have anything to do with BF updates. No OS intervention is required.

This paper assumes an inclusive LLC, such that cache lines not in the LLC are definitely not in upper-level caches. This may not be true for large multicores. And even if this assumption is valid, the author should state it explicitly in the text.

==> Yes, my implementation assumes an inclusive LLC. But I do not see any fundamental difficulty in extending it to an exclusive LLC as well. If you know of limitations, please let me know. There could be something that I missed :)

What if an address is first evicted into the LLC / DRAM, and then read back again? Do we delete it from the overflow list, or just let it stay in the list?

==> The addresses are deleted from the overflow list if re-read.

Are transactions required to be pinned on the same core until they finish? It seems to me that this is not required, since on page 9, in the “Context Switch” section, it is suggested that a transaction abort should broadcast the aborted transaction ID to all cores. But in this case, how do you make speculative lines non-speculative when a transaction commits on a different core?

==> If a transaction is pinned on the same core, things become much simpler, as you know. But to make truly “unbounded” transactions, they must survive context switches as well. I personally agree with your opinion (regarding excluding context switches from the design).

To answer your question, UHTM flushes and invalidates the write-set in the private cache before the transaction is scheduled out, so that the directory can track speculative lines correctly. When the transaction is scheduled back in and commits on a different core, it only contacts the directory.

Speculative cache lines are not tagged with a transaction ID to indicate ownership. On commit or abort, how do you know which line belongs to which transaction? One way is to query the directory and the overflow list, but this makes commit very slow, since the L1 controller still has to search the LLC.

==> Cache lines in the private caches are tagged with a write bit, indicating whether they are in the write-set or not. The directory maintains the ownership of cache lines in the shared cache. The commit can potentially be very slow, as you pointed out. It must contact the directory, and also the overflow list if the transaction has overflowed. But this must be done to guarantee correctness for “unbounded” transactions.

If the LLC directory maintains the list of speculative sharers and the owner, the first access to a non-speculative cache line must always be sent to the LLC so that the access is registered. Otherwise, if the line already exists in the private cache of the accessing core, the LLC will not be notified. This makes the design very similar to the coherence protocol of OverlayTM, which I wrote, but the authors did not cite it.

==> My apologies. I was not aware of your work when I started working on this project (late 2018). Let me read your work once I get a chance.

How do you guarantee the atomicity of transaction commit and abort? What if another transaction wants to access a cache line that is currently being rolled back during an abort? Do you just lock the bus and serialize all commits and aborts?

==> When another transaction accesses a line that was modified by, and is currently being rolled back for, an aborting transaction, that transaction also aborts. Since transaction abort must be atomic, such false aborts may happen. This applies to the commit scenario as well.

Context switch support is a total mess, which introduces even more corner cases and race conditions. For example, if a transaction is aborted while switched out, but the conflicting transaction needs to access one of the cache lines it has written back to the DRAM, how should that transaction obtain the pre-image? I personally would not recommend adding this support to a TM system. It is better to just let the OS know the HTM semantics, and not make scheduling decisions AT ALL.

==> Suppose transaction A is scheduled out, and transaction B conflicts on cache line C, which was modified by A and written back to DRAM. In this scenario, UHTM selects B as the victim and aborts B instead of A (see Table II in the paper). Since A is an overflowed transaction (cache line C has been written back to DRAM), UHTM aborts B. If B is not overflowed, B has higher priority to be aborted; if B is also an overflowed Tx, UHTM aborts the requester, which is again B in this case. So if a transaction is scheduled out and picked for abort, it must be a non-overflowed transaction.

As noted in question 4, I really did not want to include context-switch support in UHTM. But it is “research”: I had to include it for completeness (not for practicality). If someone attempts to realize UHTM, I will suggest that they reconsider this feature :)

My Response, and Updates to the Questions Section:

On Tue, Dec 29, 2020 at 2:49 PM ziqiw@andrew.cmu.edu wrote:

Hello Jungi,

I am really surprised that the blog post was actually read by the paper’s author! Please do not misinterpret my confusion about some aspects of the design as not enjoying the paper. All papers I choose to write a review on are really nice papers and represent at least some significant contribution (as I indicate in the “highlight” part). In fact, I discussed your paper with my advisor in a team meeting, because I found similarities between this one and one of my past HTM papers, mainly the LLC ownership directory and the associated coherence protocol, and we were wondering whether we could do something similar to extend our design.

The reason I stated at the beginning of the review that HTM designs are generally not good is usability concerns, which are common to all HTMs, not particularly yours or mine or other people’s. From my experience of running simulations as well as building the software runtime interface for these HTM designs, one of my major concerns is that no existing HTM can handle complicated system-level tasks well, such as I/O, system calls, process migration, context switches, etc. The lack of support for system-level tasks makes HTMs less useful, since programmers still have to implement the software fallback handler, which is often non-trivial.

On the other hand, I do believe HTM on NVM or other non-volatile technologies is more promising than pure DRAM HTM, because existing ISAs (most notably x86’s clwb/sfence) are not optimized for efficiency when NVM is deployed, while HTM can serve as a low-cost alternative, and most NVM critical sections (e.g., a metadata update in a B+Tree) are small and uncomplicated anyway.

Your answers to these detailed design-level concerns are also well written and very informative. I will merge them into the post in case someone else reads it in the future, and I hope they help readers better appreciate the paper.

It is also my pleasure to be answered directly by the paper’s author. I really appreciate the time and effort you spent on writing the paper. Please do not apologize, as every research paper represents years of hard work, and as a PhD student myself I understand how difficult it is to design hardware. I wish you more high-quality publications like this one in your academic career.

Thanks,

Ziqi

Highlight:

  1. Divide speculative state into an on-chip part and an off-chip part. The on-chip part is maintained with the LLC directory, and the off-chip part with bloom filters and overflow lists. This makes it easy to maintain on-chip state with the regular coherence protocol, and off-chip state with more scalable methods such as bloom filters and logs. UHTM further allows both sets to be approximate, i.e., the off-chip state need not be accurate, which lowers the maintenance cost further.

  2. Use different logging methods for DRAM and NVM. For DRAM, commit latency is the major concern, so UHTM performs undo logging such that commit is simply a log truncation (although there are other log operations). For NVM, redo logging with a DRAM buffer solves both the read-indirection and the log-replay problem. Commit latency is also low, since the log need not be replayed immediately at the commit point.

  3. Although not proposed by this paper, the combination of a DRAM cache and redo logging solves all the problems of redo logging: (1) read indirection is no longer needed as long as the access hits the DRAM cache; (2) commit does not require replaying the log; (3) the working set is not restricted (canonical redo logging, if implemented in the cache, must lock dirty lines in the hierarchy to avoid polluting the consistent image).

Questions

In general, the design is really complicated, and there are lots of corner cases and race conditions left unexplained. Although I do appreciate some aspects of the design, one should expect such a design to be only of theoretical interest, with very little practical value.

  1. The paper lacks many details that are necessary to judge its feasibility. For example: (1) Where is the per-core bloom filter placed? LLC or L1? The LLC would be a better place, since a request only knows that it misses the hierarchy at the LLC. Also, inserting into the filter is easier on LLC evictions. (2) When a block is evicted from the LLC, how are bloom filters updated? Do we insert into the read set for those in the sharers list, and into the write set for the one in the owner list? (3) In the above scenario, what if one of the sharers or the owner is currently switched out (so its bloom filter is not loaded)? Do we invoke the OS and let the OS find its bloom filter?

  2. This paper assumes an inclusive LLC, such that cache lines not in the LLC are definitely not in upper-level caches. This may not be true for large multicores. And even if this assumption is valid, the author should state it explicitly in the text.

  3. What if an address is first evicted into the LLC / DRAM, and then read back again? Do we delete it from the overflow list, or just let it stay in the list?

  4. Are transactions required to be pinned on the same core until they finish? It seems to me that this is not required, since on page 9, in the “Context Switch” section, it is suggested that a transaction abort should broadcast the aborted transaction ID to all cores. But in this case, how do you make speculative lines non-speculative when a transaction commits on a different core?

  5. Speculative cache lines are not tagged with a transaction ID to indicate ownership. On commit or abort, how do you know which line belongs to which transaction? One way is to query the directory and the overflow list, but this makes commit very slow, since the L1 controller still has to search the LLC.

  6. If the LLC directory maintains the list of speculative sharers and the owner, the first access to a non-speculative cache line must always be sent to the LLC so that the access is registered. Otherwise, if the line already exists in the private cache of the accessing core, the LLC will not be notified. This makes the design very similar to the coherence protocol of OverlayTM, which I wrote, but the authors did not cite it.

  7. How do you guarantee the atomicity of transaction commit and abort? What if another transaction wants to access a cache line that is currently being rolled back during an abort? Do you just lock the bus and serialize all commits and aborts?

  8. Context switch support is a total mess, which introduces even more corner cases and race conditions. For example, if a transaction is aborted while switched out, but the conflicting transaction needs to access one of the cache lines it has written back to the DRAM, how should that transaction obtain the pre-image? I personally would not recommend adding this support to a TM system. It is better to just let the OS know the HTM semantics, and not make scheduling decisions AT ALL.

This paper proposes Unbounded HTM (UHTM), an HTM design for hybrid DRAM/NVM architectures. The paper points out that prior HTM proposals are unable to handle modern workloads on NVM-based architectures for several reasons. First, because NVM has higher storage density than DRAM, NVM devices can store much more data in the same area, and applications running NVM workloads therefore tend to have significantly larger memory footprints than DRAM transactions. This puts a heavy burden on version management, which distinguishes speculative copies of data from the committed image. Prior work either assumes a bounded HTM model, in which transactions are aborted when speculative state overflows the last-level cache (LLC), or assumes an unbounded model and employs logging or other heavyweight techniques to maintain speculative versions of data.

Second, conflict detection between parallel transactions can severely hamper the usability of an HTM design. Most proposals rely on cache coherence to provide read and write information. Some of them assume a fully mapped directory, in which one entry is reserved for each cache-line-sized block in physical memory. This is impractical, since NVM is typically much larger than DRAM, and the storage overhead of the directory entries (which, for minimum latency, must be maintained in DRAM) would be huge compared with DRAM size. Other designs use bloom filters or other address signatures to track the approximate read and write sets, and rely on simple bit operations between these signatures to detect conflicts. These approaches suffer from very high false positive rates, which can also render the design unusable when the working set is large.

UHTM solves the above issues with a combination of techniques. To deal with version management for both NVM and DRAM, the paper proposes that logging be used when speculative lines overflow the LLC. For addresses mapped to the DRAM, the DRAM controller performs undo logging before applying the update in place, such that when an abort occurs, the speculative content of the DRAM can be rolled back automatically. In addition, commit operations on the DRAM are fast, since the controller simply truncates the log. For addresses mapped to the NVM, the paper proposes that speculative states be written in the form of redo logs, which are persisted to the NVM when dirty lines are evicted from the LLC. Using redo logs, however, incurs long log-search latency when the most up-to-date values are accessed later. To address this issue, the paper further proposes adding an L4 DRAM cache between the LLC and the NVM. Speculative state evicted from the LLC is not only written to the NVM as redo log entries, but also installed in the DRAM cache, such that future accesses are likely to hit the cache, saving both log-traversal latency and NVM bandwidth.

To deal with conflict detection overheads and false positives, the paper proposes a two-stage conflict detection protocol, where both the directory and address signatures are used. When the address to be accessed is in the cache hierarchy, the access request is checked against the LLC’s directory for conflicting accesses. This ensures that conflicts on cache lines in the hierarchy are detected with normal access latency, which covers the majority of cases. The processor also maintains two per-core address signatures, one for the read set and another for the write set. Cache lines evicted from the LLC are inserted into the signatures of all accessing transactions. If the address to be accessed is not in the hierarchy, it is then checked against all signatures, and a conflict is signaled if one of the signatures indicates that the address is present. Since most conflicts are already resolved in the hierarchy with the directory, and only overflowed addresses are inserted into the signatures, UHTM achieves a lower false positive rate on the signatures than prior proposals.
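
To make the two-stage flow concrete, below is a minimal Python sketch of the lookup order as I understand it. Everything here (the `detect_conflict` function, the directory and signature structures, with plain sets standing in for the bloom filters) is my own illustration, not the paper’s interface.

```python
# Minimal sketch of UHTM's two-stage conflict detection, as I understand it.
# Plain Python sets stand in for the bloom filters; all names are mine.

def detect_conflict(addr, is_write, requester, directory, signatures):
    """Stage 1: if the line is on chip, the LLC directory entry gives an
    exact answer. Stage 2: otherwise, test every other core's read/write
    signatures, which may yield false positives but never false negatives."""
    entry = directory.get(addr)
    if entry is not None:
        # Exact on-chip check: RAW, WAR, and WAW all count as conflicts.
        other_readers = entry["readers"] - {requester}
        other_writer = entry["writer"] if entry["writer"] != requester else None
        if is_write:
            return bool(other_readers) or other_writer is not None
        return other_writer is not None
    # Off-chip check against the per-core signatures.
    for core, (read_sig, write_sig) in signatures.items():
        if core == requester:
            continue
        if addr in write_sig or (is_write and addr in read_sig):
            return True
    return False

# Example: core 0 wrote line 0x40 speculatively and it is still on chip.
directory = {0x40: {"readers": set(), "writer": 0}}
signatures = {0: (set(), set()), 1: (set(), set())}
print(detect_conflict(0x40, False, 1, directory, signatures))  # True (RAW)
```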

UHTM extends the cache hierarchy as follows. For each cache line, one speculative bit tracks whether the line has been read or written by a transaction. This bit indicates to the LLC controller that the line belongs to a transaction, has not yet been logged, and that special actions must be taken when it is evicted. Second, for each core, two bloom filters are added to represent the part of the working set currently not in the cache hierarchy, one for the read set and another for the write set. The LLC controller inserts evicted lines into the corresponding bloom filters of the cores that have read or written the line during a transaction. Note that the bloom filters are not updated when a line is fetched from the DRAM or NVM, at which time the line is treated as non-speculative. This does not affect correctness, since lines are logged for off-core commit or abort the moment they are evicted. The bloom filters are swapped in and out as part of the thread’s context.
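
As an illustration of the signature side, here is a toy Bloom filter plus the eviction hook that folds the directory’s sharer/owner information into the per-core filters, matching the author’s reply to question 1 above. The filter size, hash count, and function names are arbitrary choices of mine.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: false positives are possible,
    false negatives are not. The size and hash count are arbitrary."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.vector = 0  # bit vector packed into a Python int

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8)
            yield int.from_bytes(digest.digest(), "little") % self.bits

    def insert(self, addr):
        for p in self._positions(addr):
            self.vector |= 1 << p

    def __contains__(self, addr):
        return all(self.vector >> p & 1 for p in self._positions(addr))

def on_llc_evict(addr, entry, read_bfs, write_bfs):
    """On an LLC eviction, fold the directory's sharer/owner information
    into the signatures: possibly many read-set filters are updated, but
    at most one write-set filter (per the author's reply above)."""
    for tx in entry["readers"]:
        read_bfs[tx].insert(addr)
    if entry["writer"] is not None:
        write_bfs[entry["writer"]].insert(addr)
```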

The third hardware change is that each LLC directory entry is extended with extra fields for tracking speculative readers and writers. Each transaction is allocated a unique transaction ID at the beginning (stored per-transaction as part of the context), and the LLC’s directory stores the IDs of transactions that have speculatively accessed the line. For each cache line, at most one speculative writer or multiple speculative readers are allowed, but not both. The directory is updated when a speculative GETS or GETX is received by the LLC from an upper-level cache, in which case the issuing transaction’s ID is added to the list, and when a transaction commits or aborts, in which case its ID is removed from the corresponding list. In addition, the LLC also uses this directory information to update the bloom filters when a line is evicted, since a line may be in the read sets of multiple transactions.
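
A sketch of what such a directory entry might look like, under my own naming; the single-writer-or-multiple-readers invariant is the part taken from the paper:

```python
class SpecDirEntry:
    """Illustrative LLC directory extension: a line may have several
    speculative readers or a single speculative writer, never both.
    Field names are my own stand-ins; the paper stores transaction IDs."""
    def __init__(self):
        self.readers = set()  # IDs of transactions that speculatively read
        self.writer = None    # ID of the (single) speculative writer

    def add_reader(self, tx_id):
        # Triggered by a speculative GETS from an upper-level cache.
        assert self.writer in (None, tx_id), "readers and writer are exclusive"
        self.readers.add(tx_id)

    def add_writer(self, tx_id):
        # Triggered by a speculative GETX from an upper-level cache.
        assert not (self.readers - {tx_id}) and self.writer in (None, tx_id)
        self.readers.discard(tx_id)
        self.writer = tx_id

    def remove_tx(self, tx_id):
        # Called when tx_id commits or aborts.
        self.readers.discard(tx_id)
        if self.writer == tx_id:
            self.writer = None
```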

The last hardware change is an overflow list, which is merely a pair of pointers (base, current) into the L4 DRAM cache. Part of the L4 cache is allocated to store a per-transaction list of addresses that have been evicted from the L1, such that they can be located quickly without scanning the lower-level caches. The overflow list is also part of the thread context. When a speculative block is evicted from the L1, its address is inserted into the overflow list. Since an evicted address can be re-fetched, the list can contain false positives, i.e., an address may be present in the list while the line is back in the L1 (though per the author’s reply above, addresses are deleted from the overflow list when the line is re-read).
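
The overflow list itself could be modeled roughly as follows; the 8-byte entry size and the Python-list backing store are stand-ins I made up for illustration:

```python
class OverflowList:
    """Illustrative overflow list: an append-only region of the L4 DRAM
    cache described by a (base, current) pointer pair."""
    ENTRY_SIZE = 8  # assumed entry size, not specified here

    def __init__(self, base):
        self.base = base      # start of the reserved L4 region
        self.current = base   # next free slot
        self.slots = []       # stand-in for the L4 backing storage

    def append(self, addr):
        # Called by the L1 controller when a speculative line is evicted.
        self.slots.append(addr)
        self.current += self.ENTRY_SIZE

    def __iter__(self):
        # Walked by the commit/abort sequence to find overflowed lines.
        return iter(self.slots)
```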

We next describe the operation of UHTM. In the cache hierarchy, all accessed lines are marked as speculative by setting the special bit. Before marking a line that is not yet marked, the L1 controller notifies the LLC controller that the access is about to occur. On receiving the notification, the LLC adds the requesting transaction’s ID to the corresponding directory list, in addition to performing the necessary coherence actions.

When a speculative cache line is evicted from the L1 cache, the speculative bit is carried down to the lower-level caches. In addition, the L1 controller adds the address to the per-transaction overflow list by issuing a write request to the L4 DRAM cache and incrementing the overflow list pointer.

In-cache conflict detection is performed by the LLC controller. Since all speculative accesses must inform the LLC, conflict detection is performed eagerly when an upper-level request is received. The LLC checks whether a conflicting access is already present in the entry’s lists. The paper suggests that RAW, WAR, and WAW accesses all incur conflicts, in which case the requester transaction wins, and the current holder of the line aborts. Detection between on-chip and off-chip lines is performed by checking the request against all bloom filters. If any filter indicates a positive, the requester aborts. The difference between the conflict resolution policies for on-chip and off-chip lines is motivated by two facts: (1) if a transaction has already overflowed, i.e., its bloom filter is non-empty, then aborting it wastes more cycles; and (2) aborting an overflowed transaction requires more work.
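
Putting this resolution policy together with the author’s reply to question 8 (Table II of the paper), the victim selection might look like the sketch below; the `Tx` record and `overflowed` flag are my own stand-ins:

```python
from dataclasses import dataclass

@dataclass
class Tx:
    tx_id: int
    overflowed: bool  # True once any of its speculative lines left the LLC

def pick_victim(requester: Tx, holder: Tx) -> Tx:
    """Victim selection following the priorities the author describes
    (Table II of the paper): prefer aborting a non-overflowed transaction,
    and if both have overflowed, abort the requester."""
    if not holder.overflowed:
        return holder       # on-chip conflict: the requester wins
    if not requester.overflowed:
        return requester    # holder has overflowed, requester has not
    return requester        # both overflowed: abort the requester

print(pick_victim(Tx(1, False), Tx(2, True)).tx_id)  # 1: requester aborts
```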

When a line is evicted from the LLC, if the address is mapped to DRAM, the DRAM controller performs undo logging by first copying the before-image to the logging area and then applying the evicted data in place. When the line is evicted to the NVM, it is first inserted into the L4 DRAM cache, and meanwhile a redo log entry is generated and written to the NVM. As a result, fetch requests from the LLC that hit the DRAM cache can be fulfilled by the cache itself without accessing the NVM, but those that miss the DRAM cache still need to first read the log in case it holds a more up-to-date version, after which the home address on the NVM is accessed. In addition, data evicted from the DRAM cache is always discarded, regardless of whether it is dirty, since the NVM already has a copy.
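
A rough sketch of this writeback path, with a made-up address split standing in for the real DRAM/NVM address map and simple tuples as log records:

```python
DRAM_TOP = 1 << 34  # assumption: addresses below this map to DRAM

def is_dram_mapped(addr):
    return addr < DRAM_TOP  # stand-in for the real address map

def on_llc_writeback(tx_id, addr, new_data, dram, l4_cache, undo_log, redo_log):
    """Illustrative handling of a speculative LLC writeback. DRAM-mapped
    lines: log the pre-image, then update in place. NVM-mapped lines:
    append a redo entry and install the value in the L4 DRAM cache; the
    NVM home location is left untouched until commit-time replay."""
    if is_dram_mapped(addr):
        undo_log.append(("WRITE", tx_id, addr, dram.get(addr)))  # before-image first
        dram[addr] = new_data                                    # then in-place update
    else:
        redo_log.append(("WRITE", tx_id, addr, new_data))  # persisted to NVM as a log
        l4_cache[addr] = new_data                          # future reads hit the cache
```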

On transaction commit, the L1 controller first flash-clears all lines accessed by the committing transaction that have the speculative bit set. This process consists of two stages. In the first stage, the L1 controller performs a tag walk of the local L1 cache, and for each speculative line, it queries the LLC directory and clears the speculative bit if the line was accessed by the committing transaction. In the second stage, it iterates over the overflow list, and for each line in the list, it clears the speculative bit in a similar way in the L2 and LLC. The reason UHTM does not perform full tag walks on the L2 and LLC is that doing so could introduce tag-array access contention, resulting in lower performance. The DRAM controller truncates the log by writing a commit mark with the ID of the transaction that just committed. The NVM controller commits the transaction by atomically writing a commit mark. The redo log is replayed to the home addresses in the background, to free log buffer space and avoid excessive read indirection.
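
The two-stage commit walk, sketched with dict-based stand-ins of my own design (the paper of course does this in the cache controllers, not in software):

```python
def commit(tx_id, l1_lines, lower_caches, directory, overflow_list,
           dram_log, nvm_log):
    """Illustrative two-stage commit. Caches map address -> {"spec": bool};
    directory maps address -> {"readers": set, "writer": id}."""
    # Stage 1: walk only the small local L1 tag array; the directory
    # confirms whether the committing transaction touched each line.
    for addr, line in l1_lines.items():
        entry = directory.get(addr, {"readers": set(), "writer": None})
        if line["spec"] and (tx_id in entry["readers"] or entry["writer"] == tx_id):
            line["spec"] = False
    # Stage 2: visit only the addresses recorded in the overflow list,
    # instead of tag-walking the much larger L2 and LLC.
    for addr in overflow_list:
        for cache in lower_caches:
            line = cache.get(addr)
            if line is not None:
                line["spec"] = False
    # Retire the transaction from every directory entry it appears in.
    for entry in directory.values():
        entry["readers"].discard(tx_id)
        if entry["writer"] == tx_id:
            entry["writer"] = None
    dram_log.append(("COMMIT", tx_id))  # the undo log can now be truncated
    nvm_log.append(("COMMIT", tx_id))   # redo entries are replayed lazily
```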

On transaction abort, speculative lines are invalidated using the same two-stage process, except that the valid bit of the line is cleared. The DRAM controller replays the undo log entries by copying the pre-images back to their home addresses. The NVM controller simply truncates the redo log by atomically writing an abort mark; the logical point of abort is the persistence of that mark. The bloom filters and overflow list pointers are also cleared.
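
The abort-side log handling might look as follows, reusing the record formats from the writeback sketch above; the in-cache invalidation walk is the same shape as in the commit sketch and is omitted:

```python
def abort(tx_id, dram, undo_log, nvm_log):
    """Illustrative abort-side log handling for one transaction."""
    # DRAM: copy each pre-image this transaction logged back home,
    # replayed newest-first so the oldest pre-image ends up in place.
    for record in reversed(undo_log):
        if record[0] == "WRITE" and record[1] == tx_id:
            _, _, addr, before = record
            dram[addr] = before
    # NVM: persisting a single atomic abort mark is the logical point of
    # abort; the transaction's redo entries are now dead.
    nvm_log.append(("ABORT", tx_id))
```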

On recovery from a power loss, the NVM controller performs redo log replay based on the commit and abort marks. It first scans the log for commit and abort records, building a table of committed and aborted transactions. It then scans all log entries and replays an entry only if its transaction committed. DRAM data need not be restored, since a power loss resets the DRAM content anyway.
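
The recovery pass is straightforward to sketch with the same made-up record formats:

```python
def recover(nvm, nvm_log):
    """Illustrative recovery pass over the NVM redo log. Pass 1 collects
    transaction outcomes; pass 2 replays only committed writes."""
    status = {}
    for record in nvm_log:                      # pass 1: outcomes
        if record[0] in ("COMMIT", "ABORT"):
            status[record[1]] = record[0]
    for record in nvm_log:                      # pass 2: replay
        if record[0] == "WRITE":
            _, tx_id, addr, data = record
            if status.get(tx_id) == "COMMIT":
                nvm[addr] = data                # install the committed value
    # DRAM needs no restoration: its contents are lost on power failure.
```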

The paper also discusses UHTM’s support for context switches and virtualized transactions. The architecture maintains a transaction table, which tracks each currently active transaction’s ID, status, and other metadata. The challenges of allowing context switches are: (1) one transaction may execute on more than one core, leaving its speculative state scattered across the cache hierarchy; this requires broadcasting commits or invalidations to other cores during commit and abort operations; (2) when a conflict happens, the transaction to be aborted may be switched out, which makes it impossible to abort it immediately. In this case, the transaction is marked as aborted in the table. When the transaction is switched back in, the hardware first checks its status in the table; if it is in the aborted state, it immediately begins the abort process.
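
A minimal sketch of the transaction-table interaction on conflict and switch-in, with status values and function names of my own invention:

```python
ACTIVE, ABORTED = "active", "aborted"  # illustrative status values

def broadcast_abort(tx_id):
    pass  # stand-in for the hardware broadcast of the aborted Tx ID

def mark_victim(tx_table, running_tx_ids, victim_id):
    """If the victim is currently scheduled, broadcast the abort so every
    core drops its lines; if it is switched out, just flag it in the table."""
    tx_table[victim_id] = ABORTED
    if victim_id in running_tx_ids:
        broadcast_abort(victim_id)

def on_switch_in(tx_table, tx_id):
    """A transaction aborted while descheduled discovers its fate from the
    table and must run the abort sequence before resuming execution."""
    return "abort_first" if tx_table.get(tx_id) == ABORTED else "resume"
```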