36th International Conference
on Massive Storage Systems
and Technology (MSST 2020)
October 29th and 30th, 2020

Sponsored by Santa Clara University,
School of Engineering

Since the conference was founded, in 1974, by the leading national laboratories, MSST has been a venue for massive-scale storage system designers and implementers, storage architects, researchers, and vendors to share best practices and discuss building and securing the world's largest storage systems for high-performance computing, web-scale systems, and enterprises.

Hosted at
Santa Clara University
Santa Clara, CA

2020 Conference

Thanks to our authors, attendees, volunteers, and Santa Clara University for an outstanding, informative virtual Research Track. Special thanks to our authors who were available for live questions from very different time zones.

We are hoping to get back to normal, with an in-person MSST 2021 at Santa Clara University. Watch this space for details.

The beautiful Santa Clara University campus that we didn't see this year!

Subscribe to our email list for (infrequent) information along the way.

Virtual Research Track

Thursday, October 29th
8:00 AM — 10:00 AM            Opening Remarks / Performance
Geomancy: Automated Performance Enhancement (Paper, Slides)
Oceane Bel, Kenneth Chang, Nathan R. Tallent, Dirk Duellmann, Ethan L. Miller, Faisal Nawab, and Darrell D. E. Long.
Large distributed storage systems such as high-performance computing (HPC) systems used by national or international laboratories require sufficient performance and scale for demanding scientific workloads and must handle shifting workloads with ease. Ideally, data is placed in locations to optimize performance, but the size and complexity of large storage systems inhibit rapid effective restructuring of data layouts to maintain performance as workloads shift.

To address these issues, we have developed Geomancy, a tool that models the placement of data within a distributed storage system and reacts to drops in performance. Using a combination of machine learning techniques suitable for temporal modeling, Geomancy determines when and where a bottleneck may happen due to changing workloads and suggests changes in the layout that mitigate or prevent them. Our approach to optimizing throughput offers benefits for storage systems such as avoiding potential bottlenecks and increasing overall I/O throughput by 11% to 30%.
Analytical models for performance and energy consumption evaluation of storage devices (Paper, Slides)
Eric Borba, Eduardo Tavares, Paulo Maciel and Carlos Araujo.
Improvements in data storage may be constrained by the lower performance of hard disk drives (HDD) and the higher cost per gigabyte of solid-state drives (SSD). To mitigate these issues, hybrid storage architectures have been conceived. Some works evaluate the performance of storage architectures, but energy consumption is usually neglected and not simultaneously evaluated with performance. This paper presents an approach based on generalized stochastic Petri nets (GSPN) for performance and energy consumption evaluation of individual and hybrid storage systems. The proposed models can represent distinct workloads and also estimate throughput, response time and energy consumption of storage systems. Some case studies based on industry-standard benchmarks are adopted to demonstrate the feasibility of the proposed approach.
Census: Counting Interleaved Workloads on Shared Storage (Paper, Slides)
Si Chen, Jianqiao Liu and Avani Wildani.
Understanding the different workload-dependent factors that impact the latency or reliability of a storage system is essential for SLA satisfaction and fair resource provisioning. However, due to the volatility of system behavior under multiple workloads, determining even the number of concurrent types of workload functions, a necessary precursor to workload separation, is an unsolved problem in the general case. We introduce Census, a novel classification framework that combines time-series analysis with gradient boosting to identify the number of functional workloads in a shared storage system by projecting workload traces into a high-dimensional feature representation space. We show that Census can distinguish the number of interleaved workloads in a real-world trace segment with up to 95% accuracy, leading to a decrement of the mean square error to as little as 5% compared to the fairest guess according to the daily average.
PreMatch: An Adaptive Cost-Effective Energy Scheduling System for Data Centers (Paper, Slides)
Daping Li, Jiguang Wan, Duo Wen, Chao Zhang, Nannan Zhao, Fei Wu and Changsheng Xie.
As data centers expand, growing consumption of traditional grid energy and rising carbon dioxide emissions pose considerable challenges. Therefore, many data centers have focused on renewable energy. However, such data centers fail to maintain high performance while trying to fully utilize renewable energy, as they cannot strike a balance between the uncontrollable storage-based workload and variable renewable energy. This paper proposes PreMatch, a tiered caching storage system that considers both high-performance demands and renewable energy utilization. PreMatch deploys a Solid State Drive (SSD) cache and a Hard Disk Drive (HDD) cache for the disk-based massive storage system, which can provide a data transfer station while maintaining reliability. We also design an adaptive energy scheduling scheme to make the active devices proportional to the dominant one of the green energy and workload. To make decisions in advance, we introduce a Long Short-Term Memory (LSTM) neural network to forecast workload and green energy. Experimental results show that a storage system using PreMatch can achieve the same performance as the Workload-Driven Scheme (WDS), but consumes only half the grid energy of WDS and has higher green energy utilization.
10:00 AM — 10:30 AM          Break
10:30 AM — 12:30 PM          Flash / Flash Translation Layer
BitFlip: A Bit-Flipping Scheme for Reducing Read Latency and Improving Reliability of Flash Memory (Paper, Slides)
Suzhen Wu, Sijie Lan, Jindong Zhou, Hong Jiang and Zhirong Shen.
LDPC codes provide stronger error correction capability for flash memory, but at the expense of high decoding latency that leads to poor read performance. In this paper, we demonstrate via preliminary analysis that the four states of an MLC cell in flash memory differ substantially in error proneness and proportion, which opens up new opportunities for reducing the read latency. We therefore design BitFlip, a lightweight yet effective bit-flipping scheme for flash memory. BitFlip carefully examines the bits at the proper granularity and looks for opportunities to flip the error-prone data to make them more stable against retention errors, thereby reducing the decoding time. In-depth analysis and extensive experiments show that BitFlip can reduce the read latency by 25.9%-34.2% and prolong the lifespan of flash memory by 2.9%-33.3%, with negligible impact on the write latency.
Space-Oblivious Compression and Wear Leveling for Non-Volatile Memories (Paper, Slides)
Haikun Liu, Yuanyuan Ye, Xiaofei Liao, Hai Jin, Yu Zhang, Wenbin Jiang and Bingsheng He.
Emerging Non-Volatile Main Memory (NVMM) technologies generally feature higher density and lower energy consumption and cost per bit than DRAM. However, the limited write endurance of NVMM poses a significant challenge to using NVMM as a substitute for DRAM. This paper proposes a space-oblivious compression and wear-leveling based memory architecture to improve the write endurance of NVMM. As memory blocks of many applications usually contain a large amount of zero bytes and frequent values, we propose a non-uniform compression encoding scheme that integrates Zero Deduplication with Frequent Value Compression (called ZD-FVC) to reduce bit-writes on NVMM. Moreover, we leverage the memory space available through compression to achieve an intra-block wear leveling policy. It rotates the writing positions of a compressed data block within the data's initial memory space, and thus enhances write endurance by balancing the bit-writes per NVMM cell. ZD-FVC can be integrated into the NVMM module and implemented entirely in hardware, without any intervention by the operating system. We implement ZD-FVC in the Gem5 and NVMain simulators, and evaluate it with several programs from SPEC CPU2006. Experimental results show that ZD-FVC is much better than several state-of-the-art approaches. In particular, ZD-FVC can improve the data compression ratio by 55% compared to Frequent Value Compression. Compared with Data Comparison Write, ZD-FVC improves the lifetime of NVMM by 3.3X on average, and also reduces NVMM write latency by 31% and energy consumption by 19% on average.
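The intra-block wear leveling idea in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the ZD-FVC hardware design: the function name simulate_wear, the 64-byte block, the 16-byte compressed payload, and the "start the next write where the last one ended" rotation policy are all hypothetical choices made for illustration.

```python
def simulate_wear(block_size, payload_sizes, rotate):
    """Return per-byte write counts for one memory block."""
    wear = [0] * block_size
    start = 0
    for size in payload_sizes:
        # Write the compressed payload, wrapping around inside the block.
        for i in range(size):
            wear[(start + i) % block_size] += 1
        if rotate:
            # Hypothetical policy: the next write begins where this one ended.
            start = (start + size) % block_size
    return wear


# Eight writes of a 16-byte compressed payload into a 64-byte block.
fixed = simulate_wear(64, [16] * 8, rotate=False)
rotated = simulate_wear(64, [16] * 8, rotate=True)
print(max(fixed), max(rotated))  # 8 2
```

Without rotation the first 16 bytes absorb every write; with rotation the same total number of byte-writes is spread across the whole block, which is the effect the abstract attributes to using the space freed by compression.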
Maximizing Bandwidth Management FTL Based on Read and Write Asymmetry of Flash Memory (Paper, Slides)
Bihan Li, Wei Tong, Jingning Liu, Dan Feng, Yazhi Feng, Junqing Qin, Peihao Li and Bo Liu.
With the high-density advantage, fewer 3D NAND chips are needed to build higher capacity embedded storage devices. However, this decrease in the number of chips means fewer parallel units, resulting in reduced channel bandwidth utilization and poor performance. By analyzing request execution timing, we find that write and read operations need to focus on different problems to improve system performance because of read and write asymmetry. Promoting plane-level parallelism is more important for writes. Reducing response time and providing a more balanced distribution of data is more crucial for reads. Motivated by this observation, we propose a Maximize Bandwidth Management FTL called MBM. MBM includes a parallelism-enhanced Write Strategy (WS) and a parallelism-relaxed Read Strategy (RS). WS extends an active block for GC in each plane to enhance intra-chip parallelism. Additionally, it limits the channel's maximum executable number for superior request distribution. To guarantee response time, RS executes parallel reads conditionally and improves read efficiency by rearranging data to a suitable location. Moreover, to reduce long tail latency, we also propose a Minimizing Chip consumption Strategy (MCS) and exploit program/erase suspension. MCS helps provide enough idle chips for subsequent requests. Experimental results show the proposed MBM reduces the average response time by up to 66.8% and promotes I/O bandwidth to 3x compared to the baseline scheme. Specifically, MBM significantly reduces the tail latency between the 99th and 99.999th percentiles.
PAPA: Partial Page-aware Page Allocation in TLC Flash SSD for Performance Enhancement (Paper, Slides)
Imran Fareed, Mincheol Kang, Wonyoung Lee and Soontae Kim.
The three bit types, namely, least significant bit (LSB), central significant bit (CSB), and most significant bit (MSB), in the TLC flash memory exhibit variable read/write latencies. Reading/writing an MSB takes more time than reading/writing a CSB, and an LSB incurs the minimum latency. In addition, the increased size of flash pages results in the formation of partial page writes. The partial page writes are significantly costly if they update the existing data, as partial updates perform read-modify-write (RMW) operations for ensuring data integrity. The performance further worsens if the to-be-updated data by partial writes are stored in high-latency MSB pages. Conventional TLC programming designs do not consider the size of the write requests and follow a type-blind page allocation, thereby missing a key opportunity to boost the performance of the TLC flash memory. In this study, we propose a partial page-aware page allocation (PAPA) scheme for TLC flash memory. PAPA simultaneously considers both the write request size and flash page types for performing page allocation. Our study reveals that most of the to-be-updated data updated via partial updates are partial pages. Therefore, the central mechanism of the PAPA scheme is to prioritize low-latency LSB pages for partial page writes, as partial page writes incur extra latency to read the existing data during update operations; high-latency CSB/MSB pages, in contrast, are assigned to full page writes. Our analysis using various write-intensive workloads reports that the PAPA scheme improves the write response time, RMW latency, and IOPS by 55%, 34%, and 14% on average, respectively.
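The allocation rule described in this abstract (partial page writes go to LSB pages, full page writes go to CSB/MSB pages) might be sketched as follows. The 16 KiB page size, the function name allocate, the alternation between CSB and MSB, and the treatment of a large request's tail as a partial page are assumptions made for illustration; the abstract does not specify them.

```python
from itertools import cycle

PAGE_BYTES = 16 * 1024  # assumed flash page size, not taken from the paper


def allocate(request_bytes, page_bytes=PAGE_BYTES):
    """Split a write into page-sized pieces and pick a page type for each."""
    full_pages, remainder = divmod(request_bytes, page_bytes)
    # Full pages tolerate the slower page types; alternate between them.
    slow_types = cycle(["CSB", "MSB"])
    plan = [(page_bytes, next(slow_types)) for _ in range(full_pages)]
    if remainder:
        # A partial page may later need read-modify-write, so keep it on LSB.
        plan.append((remainder, "LSB"))
    return plan


print(allocate(4096))
# [(4096, 'LSB')]
print(allocate(2 * PAGE_BYTES + 4096))
# [(16384, 'CSB'), (16384, 'MSB'), (4096, 'LSB')]
```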
12:30 PM — 1:00 PM            Break
  1:00 PM — 3:00 PM            Non-Volatile Memory
A Performance Study of Optane Persistent Memory: From Indexing Data Structures' Perspective (Paper, Slides)
Abdullah Al Raqibul Islam, Dong Dai, Anirudh Narayanan and Christopher York.
In this paper, we study the performance of the new Intel Optane DC Persistent Memory (Optane DC) from the perspective of indexing data structures. Different from existing Optane DC benchmark studies, which focus on either low-level memory accesses or high-level holistic system evaluations, we work at the data structure level, benchmarking commonly seen indexing data structures such as Linkedlist, Hashtable, Skiplist, and Trees under various running modes and settings. We believe that indexing data structures are basic and necessary building blocks of various real-world applications. Hence, accurate knowledge about their performance characteristics on Optane DC will directly help developers design and implement their persistent applications. To conduct these performance evaluations, we implemented pmemids_bench, a benchmark suite that includes seven commonly used indexing data structures implemented in four persistent modes and four parallel modes. Through extensive evaluations on a real Optane DC-based platform under different workloads, we identify seven observations that cover various aspects of Optane DC programming. These observations contain some unique results on how different data structures will be affected by Optane DC, providing a useful reference for developers to design their persistent applications.
HMEH: write-optimal extendible hashing for hybrid DRAM-NVM memory (Paper, Slides)
Xiaomin Zou, Fang Wang, Dan Feng, Chaojie Liu, Fan Li and Jianxi Chen.
Emerging non-volatile memory (NVM) is expected to coexist with DRAM as a hybrid memory to fully exploit the complementary strengths of DRAM's low read-write latency and NVM's high density, persistence, and low standby power. However, existing hashing schemes cannot efficiently reap the benefits of such a hybrid memory. In this paper, we present a hybrid DRAM-NVM write-optimal and high-performance dynamic hashing scheme, named HMEH (Hybrid Memory Extendible Hashing). In our design, key-value items are persisted in NVM while the directory is placed in DRAM for faster access. To rebuild the directory upon recovery, HMEH also keeps a radix-tree-structured directory in NVM with negligible overhead. Furthermore, HMEH proposes a cross-KV strategy to write back items through natural eviction, which can ensure data consistency with no performance degradation from persist barriers. Experimental results show that HMEH outperforms the state-of-the-art NVM-based hashing structures by up to 2.47×. And concurrent HMEH also delivers superior performance and high scalability under YCSB workloads with different search/insertion ratios.
NUMA-Aware Thread Migration for High Performance NVMM File Systems (Paper, Slides)
Ying Wang, Dejun Jiang and Jin Xiong.
Emerging Non-Volatile Main Memories (NVMMs) provide persistent storage and can be directly attached to the memory bus, which allows building file systems on non-volatile main memory (NVMM file systems). Since these file systems are built on memory, NUMA architecture has a large impact on their performance due to the presence of remote memory access and imbalanced resource usage. Existing works migrate threads and thread data on DRAM to solve these problems. Unlike DRAM, NVMM introduces extra latency and lifetime limitations. This results in expensive data migration for NVMM file systems on NUMA architecture. In this paper, we argue that NUMA-aware thread migration without migrating data is desirable for NVMM file systems. We propose NThread, a NUMA-aware thread migration module for NVMM file systems. NThread applies what-if analysis to find the node on which each thread would access data locally and to evaluate the resource contention that would result if all threads accessed data locally. Then NThread adopts migration based on priority to reduce NVMM and CPU contention. In addition, NThread also considers CPU cache sharing between threads for NVMM file systems when migrating threads. We implement NThread in a state-of-the-art NVMM file system and compare it against the existing NUMA-unaware NVMM file systems ext4-dax, PMFS and NOVA. NThread improves throughput by 166.5%, 872.0% and 78.2% on average respectively for filebench. When running RocksDB, NThread improves performance by 111.3%, 57.9% and 32.8% on average.
ExtraCC: Improving Performance of Secure NVM with Extra Counters and ECC (Paper, Slides)
Zhengguo Chen, Youtao Zhang, and Nong Xiao.
Emerging non-volatile memories (NVMs), while exhibiting great potential to be DRAM alternatives, are vulnerable to security attacks. Adopting counter mode AES based encryption and authentication schemes helps to protect memory security but tends to incur non-negligible performance overhead in order to keep data consistency between counters and user data. In particular, counters are associated with logical addresses such that counters of hot data may overflow frequently, incurring lifetime and performance overhead in secure NVM systems. The recently proposed ACME scheme mitigates the issue by associating counters with physical addresses and leveraging underlying wear-leveling schemes. While it stores and updates data and counters together, it destroys counter locality and introduces large read overhead during integrity check.

In this paper, we propose ExtraCC to address the performance and lifetime losses in secure NVMs. We keep an extra counter and enhance the ExtraCC with logical-addressed-physical-associated (LAPA) counter scheme and two-tiered ECC to not only keep the counter locality but also effectively reduce write overhead. Our experimental results show that it achieves 15.2% performance improvement and 20.5% write traffic reduction over the state-of-the-art, with about 8.4% storage overhead.
3:00 PM — 3:30 PM              Break
3:30 PM — 5:30 PM              Performance 2
Comparing Tape and Cloud Storage for Long-Term Data Preservation (Paper, Slides)
Matt Starr, Matt Ninesling, Eric Polet and Mariana Menge.
With the retention life of data increasing, many organizations are trying to determine the best long-term storage strategy for the future. To do this, organizations must analyze the reliability, security and speed of data storage solutions, including both cloud and tape storage. This paper considers how a hybrid archive solution, consisting of both cloud and tape, may be deployed in the event of a cloud mandate, and how data growth and future costs can impact a storage solution selection.
Measuring the Cost of Reliability in Archival Systems (Paper, Slides)
James Byron, Ethan L. Miller and Darrell D. E. Long.
Archival systems provide reliable and cost-effective data storage over a long period of time. Existing technologies offer familiar and well-defined features, but uncertainties about future developments complicate decisions about selecting the best storage technology that will continue to scale in the future. Furthermore, inaccurate assumptions about the long-term reliability of each storage technology can result in the use of suboptimal storage technologies for an archival system or the unwanted loss of data. Prospective storage technologies like archival glass and synthetic DNA may deliver much greater capacity and reliability than do existing technologies, yet their availability and exact features remain uncertain. As each storage technology develops and changes over time, its reliability may also change and give rise to further uncertainties about the long-term cost of highly reliable archival systems. We present results of simulations that explore the effects of various technology developments upon the cost of constructing archival systems that meet various levels of reliability against data loss. We show that storage density more than device reliability dominates the cost of constructing and maintaining reliable archival storage systems, and innovations to increase storage density—even at the expense of individual device reliability—can reduce total archival system cost. We also explore the advantages of prospective over existing archival storage technologies, and we present estimates of the extent to which their availability will affect the cost of long-term data storage.
Speeding up Analysis of Archived Sensor Data with Columnar Layout for Tape (Paper, Slides)
Ken Iizawa, Seiji Motoe, Yuji Yazawa and Masahisa Tamura.
Analyzing archived stream data such as sensor data, packet data, and log data provides valuable insights into past events. Tape technology has been improving in both capacity and performance and thus is suitable for archiving such a large amount of stream data at low cost. However, due to tape's performance characteristics, read performance is poor when data is stored on tape in the same format as on SSD or HDD. In this paper, we propose a method to improve read performance by placing data on tape in a columnar layout aware of physical structure of tape called wrap. Our preliminary evaluation using a realistic workload shows our method to be 53% faster than traditional wrap-unaware columnar layout.
Dsync: a Lightweight Delta Synchronization Approach for Cloud Storage Services (Paper, Slides)
Yuan He, Lingfeng Xiang, Wen Xia, Hong Jiang, Zhenhua Li, Xuan Wang and Xiangyu Zou.
Delta synchronization (sync) is a key bandwidth-saving technique for cloud storage services. The representative delta sync utility, rsync, matches data chunks by sliding a search window byte by byte, to maximize the redundancy detection for bandwidth efficiency. This process, however, is difficult to cater to the demand of forthcoming high-bandwidth cloud storage services, which require lightweight delta sync that can well support large files. Inspired by the Content-Defined Chunking (CDC) technique used in data deduplication, we propose Dsync, a CDC-based lightweight delta sync approach that has essentially less computation and protocol (metadata) overheads than the state-of-the-art delta sync approaches. The key idea of Dsync is to simplify the process of chunk matching by (1) proposing a novel and fast weak hash called FastFp that is piggybacked on the rolling hashes from CDC; and (2) redesigning the delta sync protocol by exploiting deduplication locality and weak/strong hash properties. Our evaluation results driven by both benchmark and real-world datasets suggest Dsync performs 2×-8× faster and supports 30%-50% more clients than the state-of-the-art rsync-based WebR2sync+ and deduplication-based approach.
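Content-Defined Chunking, the technique Dsync builds on, can be sketched with a generic Gear-style rolling hash. This is not Dsync's FastFp hash or its sync protocol; the table GEAR, the mask MASK (chosen for roughly 64-byte chunks), and the function name chunk are illustrative assumptions. The point of the sketch is that cut points depend only on nearby content, so inserting bytes at the front of a file leaves most chunks unchanged.

```python
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
MASK = 0xFC000000  # top 6 bits: a cut about every 64 bytes on average


def chunk(data):
    """Split data at positions chosen by a Gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # After 32 shifts a byte no longer affects h, so h covers a
        # sliding window of the last 32 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


old = bytes(_rng.getrandbits(8) for _ in range(20000))
new = b"inserted header" + old  # shift every byte of the old file

old_chunks, new_chunks = chunk(old), chunk(new)
shared = set(old_chunks) & set(new_chunks)
print(len(old_chunks), len(new_chunks), len(shared))
```

With fixed-size chunking the 15-byte insertion would shift every chunk boundary; here only the chunks near the insertion differ, so a sync client would need to send little more than the new header.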

Friday, October 30th
8:00 AM — 10:00 AM            File Systems / Key-Value Stores
Revisiting Virtual File System for Metadata Optimized Non-Volatile Main Memory File System (Paper, Slides)
Ying Wang, Dejun Jiang and Jin Xiong.
Emerging non-volatile main memories (NVMMs) provide persistency and lower access latency than disk and SSD. This motivates a number of works to build file systems based on NVMM by reducing I/O stack overhead. Metadata plays an important role in a file system. In this paper, we revisit the virtual file system (VFS) to find two main sources that limit metadata performance and scalability. We thus explore building a metadata-optimized file system for NVMM, DirectFS. In DirectFS, a VFS cachelet is first co-designed with VFS and the NVMM file system to reduce conventional VFS cache management overhead while retaining file lookup performance. DirectFS then adopts a global hash based metadata index to manage both the VFS cachelet and metadata in the NVMM file system. This helps to avoid duplicated index management in conventional VFS and the physical file system. In order to increase metadata operation scalability, DirectFS adopts both fine-grained flags and atomic writes to reduce the limitations imposed by concurrency control and crash-consistency guarantees. We implement DirectFS in Linux kernel 4.18.8 and evaluate it against state-of-the-art NVMM file systems. The evaluation results show that DirectFS improves performance by up to 59.2% for system calls. For the real-world application varmail, DirectFS improves performance by up to 66.0%. Besides, DirectFS scales well for common metadata operations.
Artifice: Data in Disguise (Paper, Slides)
Austen Barker, Yash Gupta, Sabrina Au, Eugene Chou, Ethan L. Miller and Darrell D. E. Long.
With the widespread adoption of disk encryption technologies, it has become common for adversaries to employ coercive tactics to force users to surrender encryption keys and similar credentials. For some users this creates a need for hidden volumes that provide plausible deniability or the ability to deny the existence of sensitive information. Plausible deniability directly impacts groups such as democracy advocates relaying information in repressive regimes, journalists covering human rights stories in a war zone, or NGO workers hiding food shipment schedules from violent militias. All of these users would benefit from a plausibly deniable data storage system. Previous deniable storage solutions only offer pieces of an implementable solution. We introduce Artifice, the first tunable, operationally secure, self-repairing, and fully deniable storage system.

With Artifice, hidden data blocks are split with Shamir Secret Sharing to produce a set of obfuscated carrier blocks that are indistinguishable from other pseudo-random blocks on the disk. The blocks are then stored in unallocated space and possess a self-repairing capability and rely on combinatorial security. Unlike preceding systems, Artifice addresses problems regarding flash storage devices and multiple snapshot attacks through comparatively simple block allocation schemes and operational security. To hide the user's ability to run a deniable system and prevent information leakage, Artifice stores its driver software separately from the hidden data.
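Shamir Secret Sharing, which the abstract names as the way hidden blocks are split, can be sketched in a few lines. This is a textbook version over a prime field, not Artifice's implementation: the field modulus PRIME, the function names split and reconstruct, and the choice of integer secrets are assumptions made for illustration. Any `threshold` shares recover the secret, while fewer reveal nothing about it, and each share on its own looks like a random value.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an assumption


def split(secret, threshold, num_shares):
    """Return num_shares points on a random polynomial whose constant term is secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, highest degree first
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * xm) % PRIME
                den = (den * (xm - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret


hidden = int.from_bytes(b"hidden block", "big")
shares = split(hidden, threshold=3, num_shares=5)
assert reconstruct(shares[:3]) == hidden
assert reconstruct([shares[0], shares[2], shares[4]]) == hidden
```

Because any three of the five shares suffice here, losing two carrier blocks to overwrites would still leave the data recoverable, which is the kind of redundancy that a self-repairing design can build on.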
LightKV: A Cross Media Key Value Store with Persistent Memory to Cut Long Tail Latency (Paper, Slides)
Shukai Han, Dejun Jiang and Jin Xiong.
Conventional persistent key-value stores widely adopt LSM-Tree to manage data across memory and disk. However, expensive write-ahead logging, inefficient cross-media indexing and write amplification are three limitations faced by LSM-Tree based key-value stores. Thanks to the development of non-volatile memory (NVM), persistent key-value stores can exploit NVM-based persistent memory (PM) to directly persist data, avoiding costly logging. With this design choice, we propose LightKV, a cross-media key-value store with persistent memory. To support efficient cross-media indexing, we design a global index, the Radix-Hash Tree (RH-Tree), consisting of an upper-layer radix tree and hash table based leaf nodes. We explore the specific features of a real PM product to balance the persistency and performance of RH-Tree. Meanwhile, relying on the range partition of RH-Tree, LightKV organizes key-value pairs with the same key prefix into SSTables within the same partition. LightKV then conducts partition-based data compaction with carefully controlled compacted data volumes. By doing so, LightKV greatly reduces write amplification. We evaluate LightKV against the state-of-the-art PM-based key-value stores NoveLSM and SLM-DB as well as LevelDB and RocksDB. The experimental results show that LightKV reduces write amplification by up to 8.1x and improves read performance by up to 9.2x. Due to the reduced write amplification, LightKV also reduces read tail latency by up to 18.8x under read-write mixed workloads.
NovKV: Efficient Garbage Collection for Key-Value Separated LSM-Stores (Paper, Slides)
Chen Shen, Youyou Lu, Fei Li, Weidong Liu and Jiwu Shu.
LSM-based key-value stores (LSM-stores) play an important role in many storage systems. However, LSM-stores suffer from high write amplification of their compaction operations. Recently proposed key-value separated LSM-stores reduce the impact, but the garbage collection overheads of the value parts remain high. In this paper, we find that existing key-value separation approaches have to check the validity of key-value items by querying the LSM-tree, and update value handles by inserting them back into the LSM-tree during garbage collection. Validity checking and value handle updating introduce heavy overheads to the LSM-tree. To this end, we propose an efficient approach to reduce the expensive overheads of garbage collection by eliminating queries and insertions of the LSM-tree. The approach consists of three key techniques: collaborative compaction, efficient garbage collection, and selective handle updating. We implement this approach atop LevelDB and name it NovKV. Evaluations show that NovKV outperforms WiscKey by up to 1.98x on random write and 1.85x on random read.
10:00 AM — 10:30 AM          Break
10:30 AM — 12:30 PM          General
OSwrite: Improving the lifetime of MLC STT-RAM with One-Step write (Paper, Slides)
Wei Zhao, Wei Tong, Dan Feng, Jingning Liu, Jie Xu, Xueliang Wei, Bing Wu, Chengning Wang, Weilin Zhu and Bo Liu.
Spin-Transfer Torque Random Access Memory (STT-RAM) is a promising cache memory candidate due to high density, low leakage power, and non-volatility. Multi-Level Cell (MLC) STT-RAM can further increase density by storing two bits in the hard and soft domain of a cell respectively. However, MLC STT-RAM suffers from severe lifetime issues because of its two-step write operation: a two-step write can incur extra writes to a cell's soft domain, which drastically degrades the overall lifetime of MLC STT-RAM. Thus, it is necessary to reduce the wear on the soft domain in order to extend the lifetime of MLC STT-RAM.

We observe that most of the wear on the soft domain is produced by hard domain bit flips (i.e., Two-step Transition and Hard Transition). Based on this observation, we propose One-Step write (OSwrite) to avoid Two-step Transition (TT) and Hard Transition (HT). Half-Sized Compression (HSC) removes HTs and TTs by writing data only to the soft domain through compression techniques. The compressed data is encoded to further reduce the writes to the soft domain. Besides, the Hard Transition Removal Encoding (HTRE) scheme is used when data cannot be compressed to less than half size. The HTRE scheme uses a hard flag to record the state of hard domain flipping to avoid changing its value. Then, HTRE compresses the hard flag and encodes the soft data to further reduce the writes to the soft domain, with encoding tags stored in the space saved by the hard flag. Evaluation results show that OSwrite can improve the lifetime of MLC STT-RAM to 2.6×. Our scheme can largely decrease HTs and TTs and thus achieve one-step write. The results show OSwrite reduces write energy and improves system performance of MLC STT-RAM by 56.2% and 6.4% respectively. Besides, OSwrite reduces hard bit flips and soft bit flips by 82.8% and 5.3% respectively.
ChronoLog: A Distributed Shared Tiered Log Store with Time-based Data Ordering (Paper, Slides)
Anthony Kougkas, Hariharan Devarajan, Keith Bateman, Jaime Cernuda, Neeraj Rajesh and Xian-He Sun.
Modern applications produce and process massive amounts of activity (or log) data. Traditional storage systems were not designed with an append-only data model and a new storage abstraction aims to fill this gap: the distributed shared log store. However, existing solutions struggle to provide a scalable, parallel, and high-performance solution that can support a diverse set of conflicting log workload requirements. Finding the tail of a distributed log is a centralized point of contention. In this paper, we show how using physical time can help alleviate the need for centralized synchronization points. We present ChronoLog, a new, distributed, shared, and multi-tiered log store that can handle more than a million tail operations per second. Evaluation results show ChronoLog's potential, outperforming existing solutions by an order of magnitude.
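The idea of ordering log entries by physical time rather than by a centrally assigned sequence number can be illustrated with a small merge. This is a sketch only, not ChronoLog's design: the function name merge_by_time and the (timestamp, payload) event format are hypothetical, and the sketch assumes node clocks are synchronized closely enough for timestamps to be comparable.

```python
import heapq


def merge_by_time(*node_logs):
    """Merge per-node logs, each already sorted by timestamp, into one ordered stream."""
    return list(heapq.merge(*node_logs, key=lambda event: event[0]))


# Each node appends locally without asking anyone where the global tail is.
node_a = [(1.0, "a: open"), (4.0, "a: write")]
node_b = [(2.5, "b: read"), (3.0, "b: close")]

ordered = merge_by_time(node_a, node_b)
print(ordered)
# [(1.0, 'a: open'), (2.5, 'b: read'), (3.0, 'b: close'), (4.0, 'a: write')]
```

The appends involve no shared counter; a total order is produced only when a reader merges the streams, which is why timestamps can remove the contention point at the log tail.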
Towards Application-level I/O Proportionality with a Weight-aware Page Cache Management (Paper, Slides)
Jonggyu Park, Kwonje Oh and Young Ik Eom.
Cloud systems often use blkio subsystem of Cgroups for controlling I/O resources to guarantee the service-level objective (SLO) of the systems. However, the blkio subsystem of Cgroups is originally designed to achieve block-level I/O proportionality without consideration on the upper layers of the system software stack, such as the page cache layer. Therefore, when an application utilizes buffered I/O, the performance of the application can exhibit unexpected results in its I/O proportionality even though the block-level I/O proportionality is still being guaranteed. To address this problem, we suggest a weight-aware page cache management scheme, called Justitia, which realizes application-level I/O proportionality in the systems that use OS virtualization technologies. Justitia prioritizes higher-weighted applications in the lock acquisition process of the page allocation by re-ordering the lock waiting queue based on their I/O weights. Additionally, it keeps the number of allocated pages for each application proportional to its I/O weight, with a weight-aware page reclamation scheme. Our experiments show that Justitia effectively improves I/O proportionality with negligible overhead in various cases.
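The two mechanisms named in this abstract, re-ordering the lock waiting queue by I/O weight and keeping allocated pages proportional to weight, can be sketched in user-space terms. This is not kernel code and not Justitia's implementation: the function names reorder_waiters and page_quota, the application names, and the weights are hypothetical.

```python
def reorder_waiters(waiters, weights):
    """Order lock waiters by descending I/O weight; ties keep arrival order."""
    # sorted() is stable, so equal-weight applications stay first-come first-served.
    return sorted(waiters, key=lambda app: -weights[app])


def page_quota(weights, total_pages):
    """Split total_pages among applications in proportion to their I/O weights."""
    total_weight = sum(weights.values())
    return {app: total_pages * w // total_weight for app, w in weights.items()}


weights = {"db": 600, "web": 300, "batch": 100}

print(reorder_waiters(["batch", "db", "web"], weights))
# ['db', 'web', 'batch']
print(page_quota(weights, total_pages=1000))
# {'db': 600, 'web': 300, 'batch': 100}
```

A reclamation policy in this spirit would evict pages from whichever application currently holds more than its quota.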
WATSON: A Workflow-based Data Storage Optimizer for Analytics (Paper, Slides)
Jia Zou, Ming Zhao, Juwei Shi and Chen Wang.
This paper studies the automatic optimization of data placement parameters for the inter-job write once read many (WORM) scenario where data is first materialized to storage by a producer job, and then accessed for many times by one or more consumer jobs. Such scenario is ubiquitous in Big Data analytics applications but existing Big Data auto-tuning techniques are often focused on single job performance.

To address the shortcomings in existing works, this paper investigates data placement parameters regarding blocking, partitioning and replication and models the trade-offs caused by different configurations of these parameters through a producer-consumer model. We then present a novel cross-layer solution, WATSON, which can automatically predict future workloads' data access patterns and tune data placement parameters accordingly to optimize the performance for an inter-job WORM scenario. WATSON can achieve up to eight times performance speedup on various analytics workloads.
12:30 PM — 1:00 PM            A brief history of MSST / Closing Remarks

2020 Organizers
Conference Co-Chairs     John Bent (Seagate), Gary Grider (LANL)
Tutorial Chair     Sean Roberts (Open Mobility Foundation)
Invited Track Program Co-Chairs     Garrett Ransom (LANL), Jason Feist (Seagate), John Bent (Seagate)
Research Track Chair     James Hughes (UCSC)
Research Track Program Co-Chairs     Prof. Darrell Long (UCSC), Michal Simon (CERN)
Santa Clara University Representative     Dr. Ahmed Amer (SCU)
Local Arrangements Chair     Prof. Yuhong Liu (SCU)
Registration Chair     Prof. Behnam Dezfouli (SCU)

Page Updated November 16, 2020