Friday, December 23, 2016


How will interstellar database systems look like? This problem has fascinated me ever since I read Krugman's treatise on interstellar trade theory [1]. In this article, I will attempt to share my rudimentary thoughts on how the design of interstellar database systems might differ from our terrestrial database systems. The crux of this problem is that the ordering of two spatially separated events that occur at the same time is no longer absolute, as we often take for granted, and is instead relative to the reference frame of the observer [2, 3, 4].

Consider an interstellar database system that is managing a database spread across two machines A and B -- with the former machine located on earth, and the latter placed on an interstellar vehicle that travels nearly as fast as the speed of light (c). In this scenario, as illustrated in Figure 1, the laws of motion vary depending on the acceleration of B's non-inertial reference frame. More importantly, given the cost of communication across an interstellar network, the ordering of two events might not only defined by the messages that are actually sent between the machines A and B, but also by those that could have been sent between them.

Figure 1: Interstellar Database System - The locations of the two machines that contain a distributed database managed by an interstellar database system.

We must, therefore, first reexamine the notion of ordering of events in the context of an interstellar database system. I will later make the case for a multi-world protocol that is inspired by the many-worlds interpretation of quantum mechanics [5].


In a relativistic non-Euclidean setting, events are the fundamental elements of spacetime. An event corresponds to a unique position at a unique time in any given spacetime. Every event is, thus, represented by a spacetime point.

The separation between two points in a Euclidean space is purely spatial. However, in spacetime, the distance between two events is a function of both the spatial displacement vector ΔS and the temporal displacement vector ΔT. This separation between two events is referred to as the invariant spacetime interval, and is given by the displacement four-vector ΔD:

ΔD2 = ΔS2 - c2 ΔT2

Spacetime intervals can be classified into three types depending on whether the interval is positive, negative, or zero. In order to establish a causal relationship between two events, enough time must pass between them so that there is no reference frame in which the two events can occur at the same time. This is feasible only when the events are separated by time-like and light-like intervals, wherein ΔD2 < 0 and ΔD2 = 0 respectively. When ΔD2 > 0, not enough time passes between these events for there to exist a causal relationship that can span the spatial distance between them at a speed less than or equal to the speed of light. Such events are considered as being separated by space-like intervals.

By computing the spacetime interval between two events, we can determine whether they can be ordered or not. With this mechanism for ordering events in a relativistic setting, the transaction manager of our interstellar DBMS can determine the ordering of transactions whose operations correspond to a set of events. We can generalize the timestamp-ordering concurrency control protocol to a spacetime-ordering protocol that ensures the isolation of concurrent interstellar transactions. For every tuple in the interstellar database, the DBMS must keep track of the spacetime points at which it was read and updated respectively with respect to a well-defined reference frame.

The spacetime-ordering protocol must handle the relativity of simultaneity [2]. Consider a scenario wherein two transactions attempt to concurrently mutate the same tuple, and these events are separated by a space-like interval. In this scenario, the transaction manager should abort one of these unorderable transactions.


In our vanilla spacetime-based protocol, there can exist atmost one valid version of a tuple at a given spacetime point. To ensure this invariant, the DBMS needs to employ a distributed atomic commitment protocol that might incur a high communication cost. A more pragmatic approach would be to relax this invariant, thereby allowing multiple versions of a tuple to co-exist at a given spacetime point.

Let's say the agent aboard the interstellar vehicle wants to create a new version T' of an existing tuple T. The DBMS should allow the agent to do so even when it is unable to communicate with the machine on earth. Let us assume that there exists a time-like interval between this update operation and the spacetime point at which T was created. Now, depending on the reference frame with respect to which an agent chooses to observe, the DBMS can either return T or T'.

Figure 2: Multi-World Spacetime-Ordering Protocol -  The metadata that the interstellar DBMS stores to track tuple versions. We depict four-coordinate spacetime points using integers for ease of reading.

In this sense, there are can exist many versions of the same tuple in different worlds. We refer to this generalization of the multi-versioned spacetime-ordering protocol as a multi-world protocol. The additional metadata that the DBMS stores to track the tuple versions is shown in Figure 2. When the interstellar vehicle eventually lands on earth, the DBMS must allow the agent to create a new world view wherein it converges its perspectives.


This article highlighted the utility of spacetime intervals in ordering events within an interstellar database system. I think that a multi-world spacetime-ordering concurrency control protocol might be a good fit for an interstellar database system.


[1] The Theory of Interstellar Trade, Paul Krugman, 1978
[2] Relativity - The Special and General Theory, Albert Einstein, 1916
[3] The Measure of Time, Henri Poincaré, 1898
[4] Time, Clocks, and the Ordering of Events in a Distributed System, Leslie Lamport, 1978
[5] The Road to Reality, Roger Penrose, 2004

Tuesday, December 20, 2016


The design of the logging and recovery components of database systems has always been influenced by the difference in the performance characteristics of volatile (DRAM) and non-volatile storage devices (SSD). The key assumption has been that non-volatile storage is much slower than DRAM and only supports block-oriented read/writes. But the arrival of new non-volatile memory (NVM) storage that is almost as fast as DRAM with fine-grained read/writes invalidates these previous design choices.

We make the case for a new logging and recovery protocol, called write-behind logging (WBL), that enables the DBMS to recover nearly instantaneously from system failures. The key idea is that the DBMS logs what parts of the database have changed rather than how it was changed. In contrast to the ubiquitous write-ahead logging (WAL) protocol, the DBMS directly flushes the changes to the database before recording them in the log when it employs the WBL protocol.


The performance of DBMSs that use SSD for durable storage is constrained by the speed with which they persist changes to the log stored on these devices. This is because there is a large gap in the read/write latencies of DRAM and SSD, as well as a mismatch in their data access granularities (i.e., coarse-grained block writes vs. fine-grained byte-level writes).

NVM technologies, such as phase change memory, STT-MRAM, and memristors, provide low-latency, byte-addressable loads and stores. In contrast to the other durable storage devices that use the PCIe or SATA interfaces, NVM can be plugged into DIMM slots to deliver higher bandwidths and lower latencies to CPUs. Consequently, it can help reduce the performance overhead associated with persisting the changes on durable storage. Although the performance advantages of NVM are obvious, it is still not clear how to make full use of it in a DBMS running on a hybrid storage hierarchy with both DRAM and NVM.

Previous work has focused on using NVM only for storing the log and managing the database still on disk. This is a more cost-effective solution, as the cost of NVM devices are expected to be higher than that of disk. But this approach only leverages the low-latency sequential writes of NVM, and does not exploit its ability to efficiently support random writes and fine-grained data access. Given this, we contend that it is better to employ logging and recovery algorithms that are designed for NVM. To appreciate why WBL is better than WAL when using NVM, we now discuss how WAL is implemented in DBMSs.


The most well-known recovery method based on WAL is the ARIES protocol developed by IBM in the 1990s. ARIES is a physiological logging protocol where the DBMS combines a physical redo process with a logical undo process. During normal operations, the DBMS records transactions' modifications in a durable log that it uses to restore the database after a crash.

Figure 1: WAL Recovery Protocol - The phases of the recovery protocol

The traditional WAL recovery algorithm (see Figure 1) comprises of three phases: (1) analysis, (2) redo, and (3) undo. In the analysis phase, the DBMS processes the log starting from the latest checkpoint to identify the transactions that were active at the time of failure and the modifications associated with those transactions. In the subsequent redo phase, the DBMS processes the log forward from the earliest log record that needs to be redone. Some of these log records could be from transactions that were active at the time of failure as identified by the analysis phase. During the final undo phase, the DBMS rolls back uncommitted transactions (i.e., transactions that were active at the time of failure) using the information recorded in the log.

Although WAL supports efficient transaction processing when memory is volatile and durable storage cannot support fast random writes, it is inefficient for NVM storage. Consider a transaction that inserts a tuple into a table. The DBMS first records the tuple's contents in the log, and it later propagates the change to the database. With NVM, the logging algorithm can avoid this unnecessary data duplication and thereby better support data intensive applications. We now describe the design of such an algorithm geared towards a DBMS running on a hybrid storage hierarchy comprising of DRAM and NVM.


Write-behind logging leverages fast, byte-addressable NVM to reduce the amount of data that the DBMS records in the log when a transaction modifies the database. The reason why NVM enables a better logging protocol than WAL is three-fold. Foremost, the write throughput of NVM is more than an order of magnitude higher than that of an SSD or HDD. Second, the gap between sequential and random write throughput of NVM is smaller than that of older storage technologies. Finally, individual bytes in NVM can be accessed by the processor, and hence there is no need to organize tuples into pages or go through the I/O subsystem.

WBL reduces data duplication by flushing changes to the database in NVM during regular transaction processing. For example, when a transaction inserts a tuple into a table, the DBMS records the tuple's contents in the database before it writes any associated meta-data in the log. Thus, the log is always (slightly) behind the contents of the database, but the DBMS can still restore it to the correct and consistent state after a restart.

WBL differs from WAL in many ways. Foremost is that the DBMS does not construct log records that contain tuple modifications at runtime. This is because the changes made by transactions are guaranteed to be already present on durable storage before they commit. Relaxing the ordering of writes to durable storage complicates WBL's commit and recovery protocols. When the DBMS restarts after a failure, it needs to locate the modifications made by transactions that were active at the time of failure so that it can undo them. But these changes can reach durable storage even before the DBMS records the associated meta-data in the log. This is because the DBMS is unable to prevent the CPU from evicting data from its volatile caches to NVM. Consequently, the recovery algorithm must scan the entire database to identify the dirty modifications, which is prohibitively expensive and increases the recovery time.

The DBMS avoids this problem by recording meta-data about the clean and dirty modifications that have been made to the database by tracking two commit timestamps in the log. First, it records the timestamp of the latest committed transaction all of whose changes and updates of prior transactions are safely persisted on durable storage (cp). Second, it records the commit timestamp (cd, where cp < cd) that the DBMS promises to not assign to any transaction before the subsequent group commit finishes. This ensures that any dirty modifications that were flushed to durable storage will have only been made by transactions whose commit timestamp is earlier than cd. When the DBMS restarts after a failure, it considers all the transactions with commit timestamps earlier than cp as committed, and ignores the changes of the transactions whose commit timestamp is later than cp and earlier than cd. In other words, if a tuple's begin timestamp falls within the (cp , cd) pair, then the DBMS's transaction manager ensures that it is not visible to any transaction that is executed after recovery.

Before describing WBL's recovery algorithm, we first introduce the notion of a commit timestamp gap. A commit timestamp gap refers to the range of timestamps defined by the pair (cp, cd). The DBMS must ignore the effects of transactions that fall within such a gap while determining the tuple visibility. This is equivalent to undoing the effects of any transaction that was active at the time of failure. The set of commit timestamp gaps that the DBMS needs to track increases on every system failure. To limit the amount of work performed while determining the visibility of tuples, the DBMS's garbage collector thread periodically scans the database to undo the dirty modifications associated with the currently present gaps. Once all the modifications in a gap have been removed by the garbage collector, the DBMS stops checking for the gap in tuple visibility checks and no longer records it in the log.

Figure 2: WBL Commit Timestamp Gaps - An illustration of successive system failures resulting in multiple commit timestamp gaps. The effects of transactions in those gaps are eventually undone by the garbage collector.

The example in Figure 2 depicts a scenario where successive failures result in multiple commit timestamp gaps. At the end of the first group commit operation, there are no such gaps and the current commit timestamp is 101. The DBMS promises to not issue a commit timestamp higher than 199 in the time interval before the second commit. When the DBMS restarts after a system failure, it adds (101, 199) to its set of gaps. The garbage collector then starts cleaning up the effects of transactions that fall within this gap. Before it completes the scan, there is another system failure. The system then also adds (301, 399) to its gap set. Finally, when the garbage collector finishes cleaning up the effects of transactions that fall within these two gaps, it empties the set of gaps that the DBMS must check while determining the visibility of tuples.

With WBL, the DBMS does not need to periodically construct WAL-style physical checkpoints to speed up recovery. This is because each WBL log record contains all the information needed for recovery: the list of commit timestamp gaps and the commit timestamps of long running transactions that span across a group commit operation. The DBMS only needs to retrieve this information during the analysis phase of the recovery process. It can safely remove all the log records located before the most recent log record. This ensures that the log's size is always bounded.

Figure 3: WBL Recovery Protocol - The phases of the recovery protocol.

As shown in Figure 3, the WBL recovery protocol only contains an analysis phase. During this phase, the DBMS scans the log backward until the most recent log record to determine the currently present commit timestamp gaps and timestamps of long running transactions. There is no need for a redo phase because all the modifications of committed transactions are already present in the database. WBL also does not require an WAL-style undo phase. Instead, the DBMS uses the information in the log to ignore the effects of uncommitted transactions. After the brief analysis phase, the DBMS can immediately start handling transactions again.


Existing NVM devices cannot store large databases due to their limited capacities and prohibitive costs. We instead use Intel Labs' persistent memory evaluation platform (PMEP). PMEP models the latency and bandwidth characteristics of Intel's upcoming NVM technologies. It also emulates the newly proposed persistence primitives. PMEP emulates NVM's higher read/write latencies for the NVM partition by using custom CPU microcode. This microcode estimates the number of cycles that the CPU would have to wait if DRAM is replaced by slower NVM and then stalls the CPU for that amount of time.

PMEP contains a NVM-aware memory allocator that exports the POSIX malloc interface. This allocator provides a durability mechanism that the DBMS uses to ensure that database modifications are persisted on NVM. This is required because stores to NVM share the same volatile micro-architectural buffers in the processor and can therefore be lost on a power failure. The CPU must provide instructions that the allocator uses to expose a special NVM sync primitive.

Internally, the allocator implements the sync primitive by writing back the modifications to NVM using the cache-line write back (CLWB) instruction. This instruction writes back the modified data in the cache-lines to NVM. Unlike the cache-line flush (CLFLUSH) instruction that is generally used for flushing operations, CLWB does not invalidate the line from the cache and instead only transitions it to a non-modified state. This reduces the possibility of a compulsory cache miss when the same data is accessed momentarily after the line has been flushed. In our experiments, we have found that an efficient cache flushing primitive is critical for a high-performance DBMS.


We now present our analysis of the logging protocols. We implemented both WAL and WBL in Peloton, an in-memory HTAP DBMS that supports NVM. We compare the DBMS's runtime performance, recovery times, and storage footprint for the YCSB benchmark. This is a widely-used key-value store workload from Yahoo!. It is representative of the transactions handled by web-based companies. The workload consists of two transaction types: (1) a read transaction that retrieves a single tuple using its primary key, and (2) an update transaction that modifies a single tuple based on its primary key. The distribution of the transactions' access patterns is based on a Zipfian skew. We present the results for the write-heavy workload mixture, that consists of 10% reads and 90% updates.

We performed these experiments using Intel Lab's PMEP hardware emulator. It contains two Intel Xeon E5-4620 CPUs (2.6 GHz), each with eight cores and a 20 MB L3 cache. The PMEP contains 256 GB of DRAM. It dedicates 128 GB of DRAM for the emulated NVM. We configured the NVM latency to be 4x that of DRAM and validated these settings using Intel's memory latency checker. The PMEP also includes two additional storage devices: (1) HDD: Seagate Barracuda (3 TB, 7200 RPM, SATA 3.0), and (2) SSD: Intel DC S3700 (400 GB, SATA 2.6).

We modified Peloton to use the PMEP's allocator and filesystem interfaces to store its logs, checkpoints, and table heap on NVM. When employing WAL, the DBMS maintains the log and the checkpoints on the filesystem, and uses fsync to ensure durability. When it adopts WBL, the DBMS uses the allocator for managing the durable table heap and indexes. Internally, it stores indexes in persistent B+trees. It relies on the allocator's sync primitive to ensure database durability. All the transactions execute with the same snapshot isolation level and durability guarantees.


We begin with an analysis of the recovery protocols' impact on the DBMS's runtime performance. To obtain insights that are applicable for different storage technologies, we run the YCSB benchmark in Peloton while using either the WAL or WBL. For each configuration, we scale up the number of worker threads that the DBMS uses to process transactions. The clients issue requests in a closed loop. We execute the workload three times under each setting and report the average throughput.

Figure 4: YCSB Throughput - The throughput of the DBMS for the YCSB benchmark with different logging protocols and durable storage devices.

Figure 4 shows the throughput of the DBMS while executing YCSB with varying number of worker threads. The most notable observation from this experiment is that while the DBMS's throughput with the SSD-WAL configuration is higher than that with the SSD-WBL configuration, its performance with the NVM-WBL configuration is comparable to that obtained with the NVM-WAL configuration. This is because NVM supports fast random writes unlike SSD.

We observe that the NVM-WBL configuration delivers 1.3x higher throughput than the NVM-WAL configuration because of its lower logging overhead. That is, under WBL the DBMS does not construct as many log records as it does with WAL and therefore it writes less data to durable storage. The performance gap between the NVM-based and SSD-based configurations is prominent on this write-intensive workload. The NVM-WBL configuration delivers 12.1x higher throughput than the SSD-WBL configuration.


We next evaluate the recovery time of the DBMS using the different logging protocols and storage devices. We first execute a fixed number of transactions and then force a hard shutdown of the DBMS (SIGKILL). We then measure the amount of time for the system to restore the database to a consistent state. That is, a state where the effects of all committed transactions are durable and the effects of uncommitted transactions are removed. We note that the number of transactions that the DBMS processes after restart in WAL depends on the frequency of checkpointing. With WBL, the DBMS performs garbage collection to clean up the dirty effects of uncommitted transactions at the time of failure. This garbage collection step is done asynchronously and does not have a significant impact on the throughput of the DBMS.

Figure 5: Recovery Time - The time taken by the DBMS to restore the database to a consistent state after a restart with different logging protocols.

The results in Figure 5 present the recovery measurements for the YCSB benchmark. The recovery times of the WAL-based configurations grow linearly in proportion to the number of transactions that the DBMS recovers. This is because the DBMS needs to replay the log to restore the effects of committed transactions. In contrast, with WBL, we observe that the recovery time is independent of the number of transactions executed. The system only reverses the effects of transactions that were active at the time of failure as the changes made by all the transactions committed after the last checkpoint are already persisted. The WBL-based configurations, therefore, have a short recovery.


Lastly, we compare the storage utilization of the DBMS using either the WAL and WBL protocols while running on NVM. This metric is important because we expect that the first NVM products will initially be more expensive than current technologies, and thus using less storage means a lower procurement cost.

We measure Peloton's storage footprint as the amount of space that it uses in either DRAM or NVM to store tables, logs, indexes, and checkpoints. We periodically collect statistics from the DBMS's storage manager and the filesystem meta-data during the workload execution. We perform these measurements after loading the initial database and report the peak storage footprint of the DBMS for each trial. For all of the configurations, we allow the DBMS's background processes (e.g., group commit, checkpointing, garbage collection) to execute while we collect these measurements. 

Figure 6: Storage Footprint - The storage space occupied by the internal
components of the DBMS while using different recovery protocols.
The initial database size is 2 GB. The results in Figure 6 show that the WAL-based configuration has a larger storage footprint than WBL. This is because WAL constructs log records that contain the physical changes associated with the modified tuples. In contrast, WBL's log records do not contain this information. Another important difference is that while the WAL-based DBMS periodically constructs transactionally-consistent checkpoints of the database, WBL only requires the DBMS to write log records that contain the list of currently present commit identifier gaps. As such, its logical checkpoints have a smaller storage footprint than WAL's physical checkpoints. Unlike WAL, WBL persists the indexes on durable storage to avoid rebuilding it during recovery. The WBL-based DBMS consumes 26% less storage space on NVM than its WAL counterpart.


We presented the write-behind logging protocol for emerging non-volatile storage technologies. We examined the impact of this redesign on the transactional throughput, availability, and storage footprint of the DBMS. Our evaluation of this recovery algorithm in Peloton showed that across different OLTP workloads it reduces the system's recovery time by 100x and shrinks the storage footprint by 1.5x in comparison to the write-ahead logging protocol.

[An abridged version of this article was posted on the ISTC Big Data Blog.]