Project Details
Fast crash recovery strategies for many small data objects in distributed memory storage (Acronym: FastRecovery)
Applicant
Professor Dr. Michael Schöttner
Subject Area
Security and Dependability, Operating-, Communication- and Distributed Systems
Term
from 2015 to 2018
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 269648469
More and more applications, social networks for example, need to manage billions of small data objects. The access times of disks and solid-state drives are too slow for such interactive applications, forcing providers to keep much of the data in caches. For large applications the data does not fit into the memory of a single node, so the memory of potentially many nodes needs to be aggregated. A prominent example is Facebook, which runs more than 1,000 memcached servers to keep around 75% of all data in memory at all times because the backing databases are too slow. Obviously, data is lost in case of node failures and power outages, and it can take hours to reload large data volumes from secondary storage such as databases or file systems. The proposed project addresses these challenges by developing and evaluating fast recovery strategies for distributed memory systems. The project focuses on a key-value data model with up to one trillion small data objects (around 16-64 bytes each, stored on 1,000 nodes). Recovery will use an asynchronous logging strategy optimized for SSDs, building on research on log-structured file systems. The state of each node must be distributed over many backup nodes to allow fast, parallel recovery, and all log parts belonging to one node's state must also be replicated in order to mask permanent node failures. It is important to point out that random replica placement has a high probability of data loss in large clusters if several nodes fail simultaneously. We plan to address this challenge based on the recently proposed Copyset replica placement scheme and to develop efficient, adaptive strategies which minimize the probability of data loss while at the same time allowing fast recovery. Backup management will be implemented using a super-peer overlay network that takes into account different metrics, including load, ongoing recoveries, and re-replication.
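To illustrate the replica-placement idea mentioned above, the following is a minimal sketch of Copyset-style replication (after Cidon et al.): instead of scattering each object's replicas over random node triples, replicas are confined to a small number of precomputed copysets, which sharply reduces the probability that a simultaneous failure of several nodes destroys all copies of some object. This is a simplified illustration, not the project's implementation; the function names, the permutation-based construction, and the placement policy (`place_replicas` picking a random copyset containing the primary) are assumptions for the sketch.

```python
import math
import random

def build_copysets(nodes, r, scatter_width):
    """Generate copysets via random permutations.

    Each of p = ceil(scatter_width / (r - 1)) permutations of the node
    list is chopped into consecutive groups of r nodes; every group
    becomes one copyset. Replicas are later placed only inside copysets,
    so data is lost only if ALL r nodes of some copyset fail together.
    """
    p = math.ceil(scatter_width / (r - 1))
    copysets = []
    for _ in range(p):
        perm = list(nodes)
        random.shuffle(perm)
        # Split the permutation into disjoint groups of r nodes.
        copysets.extend(
            tuple(sorted(perm[i:i + r]))
            for i in range(0, len(perm) - r + 1, r)
        )
    return copysets

def place_replicas(primary_node, copysets):
    """Choose the replica nodes for an object whose primary copy lives
    on primary_node: pick one copyset that contains it (hypothetical
    placement policy for this sketch)."""
    candidates = [cs for cs in copysets if primary_node in cs]
    return random.choice(candidates)
```

With 9 nodes, replication factor r = 3, and scatter width 4, this builds 2 permutations and hence 6 copysets of 3 nodes each; every node then backs up data to only the few peers it shares a copyset with, rather than to random triples across the whole cluster.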
DFG Programme
Research Grants