In Search of an Understandable Consensus Algorithm (Raft)

Paxos Made Simple

ZooKeeper: Wait-free coordination for Internet-scale systems

Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore

Impossibility of Distributed Consensus With One Faulty Process

Consensus in the presence of partial synchrony

Viewstamped Replication Revisited

Replication

Don’t be lazy, be consistent: Postgres-R, a new way to implement Database Replication

PacificA: Replication in Log-Based Distributed Storage Systems

Chain Replication for Supporting High Throughput and Availability

Byzantine Chain Replication

A Comprehensive Study of Convergent and Commutative Replicated Data Types

Optimistic Replication

Causality/Transactions

Stronger Semantics for Low-Latency Geo-Replicated Storage (Eiger)

Calvin: Fast Distributed Transactions for Partitioned Database Systems

Sinfonia: a new paradigm for building scalable distributed systems

Understanding the Limitations of Causally and Totally Ordered Communication

A Response to Cheriton and Skeen’s Criticism of Causal and Totally Ordered Communication

MDCC: Multi-Datacenter Consistency

Spanner: Google’s globally distributed database

Concurrency

Transactional Memory: Architectural Support for Lock-Free Data Structures

Software Transactional Memory

Sharing Memory Robustly in Message-Passing Systems

Wait-free Synchronization

ZooKeeper’s atomic broadcast protocol: Theory and practice

Kafka (LinkedIn)

Omega: flexible, scalable schedulers for large compute clusters

Thialfi: A Client Notification Service for Internet-Scale Applications

Large-scale Incremental Processing Using Distributed Transactions and Notifications

Note: We haven’t included anything already covered in 6.824, but you should read those papers too.

Paxos Made Live: An Engineering Perspective

Viewstamped Replication: A new primary copy method to support highly-available distributed systems

Time, Clocks, and the Ordering of Events in a Distributed System

The Part-Time Parliament

Paxos Made Practical

The papers from SOSP 2013

Distributed Systems and Parallel Computing

No matter how powerful individual computers become, there are still reasons to harness the power of multiple computational units, often spread across large geographic areas. Sometimes this is motivated by the need to collect data from widely dispersed locations (e.g., web pages from servers, or sensors for weather or traffic). Other times it is motivated by the need to perform enormous computations that simply cannot be done by a single CPU.

From our company’s beginning, Google has had to deal with both issues in our pursuit of organizing the world’s information and making it universally accessible and useful. We continue to face many exciting distributed systems and parallel computing challenges in areas such as concurrency control, fault tolerance, algorithmic efficiency, and communication. Some of our research involves answering fundamental theoretical questions, while other researchers and engineers are engaged in the construction of systems to operate at the largest possible scale, thanks to our hybrid research model.


Some of our teams.

Algorithms & optimization

Graph mining

Network infrastructure

System performance


distributed database systems Recently Published Documents


ER-Store: A Hybrid Storage Mechanism with Erasure Coding and Replication in Distributed Database Systems

In distributed database systems, as cluster scales grow, efficiency and availability become critical considerations. In a cluster, a common approach to high availability is using replication, but this is inefficient due to its low storage utilization. Erasure coding can provide data reliability while ensuring high storage utilization. However, due to the large number of coding and decoding operations required by the CPU, it is not suitable for some frequently updated data. In order to optimize the storage efficiency of the data in the distributed system without affecting the availability of the data, this paper proposes a data temperature recognition algorithm that can distinguish data tablets and divides data tablets into three types, cold, warm, and hot, according to the frequency of access. Combining three replicas and erasure coding technology, ER-store is proposed, a hybrid storage mechanism for different data types. At the same time, we combined the read-write separation architecture of the distributed database system to design the data temperature conversion cycle, which reduces the computational overhead caused by frequent updates of erasure coding technology. We have implemented this design on the CBase database system based on the read-write separation architecture, and the experimental results show that it can save 14.6%–18.3% of the storage space while meeting the efficient access performance of the system.
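The temperature split described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Tablet` type, the threshold values, and the choice to keep warm tablets replicated. The abstract only states that tablets are classed as cold, warm, or hot by access frequency, and that erasure coding is a poor fit for frequently updated data.

```python
from dataclasses import dataclass

# Hypothetical thresholds (accesses per observation window); not from the paper.
HOT_THRESHOLD = 1000
WARM_THRESHOLD = 100


@dataclass
class Tablet:
    tablet_id: str
    access_count: int  # accesses seen in the current observation window


def classify(tablet: Tablet) -> str:
    """Label a tablet hot, warm, or cold by how often it was accessed."""
    if tablet.access_count >= HOT_THRESHOLD:
        return "hot"
    if tablet.access_count >= WARM_THRESHOLD:
        return "warm"
    return "cold"


def storage_policy(temperature: str) -> str:
    """Pick a storage scheme for a temperature class.

    Assumption: only cold tablets are erasure coded, since re-encoding on
    every update is expensive; hot and warm tablets keep three replicas.
    """
    return "erasure_coding" if temperature == "cold" else "three_replicas"


tablets = [Tablet("t1", 5000), Tablet("t2", 250), Tablet("t3", 3)]
plan = {t.tablet_id: storage_policy(classify(t)) for t in tablets}
```

In this sketch only `t3` would be erasure coded. A real system would presumably re-run the classification on each conversion cycle so that tablets can move between classes.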

Efficiently Supporting Adaptive Multi-Level Serializability Models in Distributed Database Systems

A review on fault tolerance in distributed database.

In this paper, we study the different types of fault tolerance techniques used in various distributed database systems. The main focus of this research is how data are stored on the servers, and the fault detection and recovery techniques used. A fault can occur for many reasons, for example system failure, resource failure, or failure of the network between servers. These faults must be addressed to make sure the system can work smoothly. A proper failure detector and a reliable fault tolerance technique can avoid data loss and save the system from failure.

A Survey On Fragmentation In Distributed Database Systems

One of the most critical aspects of distributed database design and management is fragmentation. If the fragmentation is done properly, we can expect to achieve better throughput from such systems. The primary concern of DBMS design is the fragmentation and allocation of the underlying database. The distribution of data across various sites of computer networks involves making proper fragmentation and placement decisions. The first phase in the process of distributing a database is fragmentation, which clusters information into fragments. This process is followed by the allocation phase, which distributes, and if necessary replicates, the generated fragments among the nodes of a computer network. The use of data fragmentation to improve performance is not new and commonly appears in file design and optimization literature. The efficient functioning of any distributed database system is highly dependent on its proper design in terms of adopted fragmentation and allocation methods. Fragmentation of large, global databases is performed by dividing the database horizontally, vertically, or a combination of both. In order to enable the distributed database systems to work efficiently, the fragments have to be allocated across the available sites in a way that reduces the communication cost of data. In this article, we have tried to describe the existing methods of database fragmentation and have an overview of the existing methods. Finally, we conclude with suggestions for using machine learning to solve the overlap problem in fragments.
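The difference between horizontal and vertical fragmentation mentioned in this abstract can be shown with a small sketch. The `customers` relation, the column names, and the helper functions are all hypothetical and chosen only for illustration. Horizontal fragments hold whole rows selected by a predicate, while vertical fragments hold a subset of columns plus the key needed to rejoin them.

```python
# Hypothetical relation: each row is a dict keyed by column name.
customers = [
    {"id": 1, "name": "Ada", "region": "EU", "balance": 120},
    {"id": 2, "name": "Bo", "region": "US", "balance": 75},
    {"id": 3, "name": "Cy", "region": "EU", "balance": 10},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate (a subset of tuples)."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, key, columns):
    """Keep only some columns, always carrying the key so rows can be rejoined."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]


def rejoin(left, right, key):
    """Rebuild full rows from two vertical fragments by matching on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal: one fragment per region, each could live at a different site.
eu = horizontal_fragment(customers, lambda r: r["region"] == "EU")
us = horizontal_fragment(customers, lambda r: r["region"] == "US")

# Vertical: profile columns and account columns, both keyed by id.
profile = vertical_fragment(customers, "id", ["name", "region"])
account = vertical_fragment(customers, "id", ["balance"])
```

The union of `eu` and `us` recovers every row, and `rejoin(profile, account, "id")` recovers every column, which is the reconstruction property a fragmentation scheme is normally expected to preserve.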

Survey on Deadlocks in Distributed Database Systems

A New Multi-Resource Deadlock Detection Algorithm Using Directed Graph Requests in Distributed Database Systems

A Communication-Induced Checkpointing Algorithm for Consistent-Transaction in Distributed Database Systems

RDMA Based Performance Optimization on Distributed Database Systems: A Case Study with GoldenX

Formal Development of Fault Tolerance by Replication of Distributed Database Systems

Distributed Database Systems: The Case for NewSQL

A Distributed Systems Reading List

Introduction.

I often argue that the toughest thing about distributed systems is changing the way you think. Below is a collection of material I've found useful for motivating these changes.

Thought Provokers

Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions.

  • Harvest, Yield and Scalable Tolerant Systems - Real world applications of CAP from Brewer et al
  • On Designing and Deploying Internet Scale Services - James Hamilton
  • The Perils of Good Abstractions - Building the perfect API/interface is difficult
  • Chaotic Perspectives - Large scale systems are everything developers dislike - unpredictable, unordered and parallel
  • Data on the Outside versus Data on the Inside - Pat Helland
  • Memories, Guesses and Apologies - Pat Helland
  • SOA and Newton's Universe - Pat Helland
  • Building on Quicksand - Pat Helland
  • Why Distributed Computing? - Jim Waldo
  • A Note on Distributed Computing - Waldo, Wollrath et al
  • Stevey's Google Platforms Rant - Yegge's SOA platform experience
  • Latency Exists, Cope! - Commentary on coping with latency and its architectural impacts
  • Latency - the new web performance bottleneck - not at all new (see Patterson), but noteworthy
  • The Tail At Scale - the challenges inherent in dealing with latency in large-scale systems

Somewhat about the technology but more interesting is the culture and organization they've created to work with it.

  • A Conversation with Werner Vogels - Coverage of Amazon's transition to a service-based architecture
  • Discipline and Focus - Additional coverage of Amazon's transition to a service-based architecture
  • Vogels on Scalability
  • SOA creates order out of chaos @ Amazon

Current "rocket science" in distributed systems.

  • Chubby Lock Manager
  • Google File System
  • Data Management for Internet-Scale Single-Sign-On
  • Dremel: Interactive Analysis of Web-Scale Datasets
  • Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Smart design for low latency Paxos implementation across datacentres.
  • Spanner - Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.
  • Photon - Fault-tolerant and Scalable Joining of Continuous Data Streams. Joins are tough especially with time-skew, high availability and distribution.
  • Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Data warehousing system that stores critical measurement data related to Google's Internet advertising business.

Consistency Models

Key to building systems that suit their environments is finding the right tradeoff between consistency and availability.

  • CAP Conjecture - Consistency, Availability, Partition Tolerance cannot all be satisfied at once
  • Consistency, Availability, and Convergence - Proves the upper bound for consistency possible in a typical system
  • CAP Twelve Years Later: How the "Rules" Have Changed - Eric Brewer expands on the original tradeoff description
  • Consistency and Availability - Vogels
  • Eventual Consistency - Vogels
  • Avoiding Two-Phase Commit - Two phase commit avoidance approaches
  • 2PC or not 2PC, Wherefore Art Thou XA? - Two phase commit isn't a silver bullet
  • Life Beyond Distributed Transactions - Helland
  • If you have too much data, then 'good enough' is good enough - NoSQL, Future of data theory - Pat Helland
  • Starbucks doesn't do two phase commit - Asynchronous mechanisms at work
  • You Can't Sacrifice Partition Tolerance - Additional CAP commentary
  • Optimistic Replication - Relaxed consistency approaches for data replication

Papers that describe various important elements of distributed systems design.

  • Distributed Computing Economics - Jim Gray
  • Rules of Thumb in Data Engineering - Jim Gray and Prashant Shenoy
  • Fallacies of Distributed Computing - Peter Deutsch
  • Impossibility of distributed consensus with one faulty process - also known as FLP [access requires account and/or payment, a free version can be found here]
  • Unreliable Failure Detectors for Reliable Distributed Systems. A method for handling the challenges of FLP
  • Lamport Clocks - How do you establish a global view of time when each computer's clock is independent?
  • The Byzantine Generals Problem
  • Lazy Replication: Exploiting the Semantics of Distributed Services
  • Scalable Agreement - Towards Ordering as a Service
  • Scalable Eventually Consistent Counters over Unreliable Networks - Scalable counting is tough in an unreliable world
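
The Lamport Clocks entry in the list above concerns ordering events without a shared physical clock. As a minimal sketch of the logical clock rules (the class and method names are my own, not from the paper): a process increments its counter on every local event and on every send, and on receipt it jumps to one past the larger of its own counter and the timestamp carried by the message.

```python
class LamportClock:
    """Logical clock for a single process; names here are illustrative."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending counts as an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Move past both the local clock and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                        # local event on process a
stamp = a.send()                # a sends a message carrying its timestamp
received_at = b.receive(stamp)  # b receives it and advances past the stamp
```

The receive is always stamped later than the send, so timestamps respect causality. The converse does not hold: a smaller timestamp alone does not prove that one event happened before another.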

Languages and Tools

Issues of distributed systems construction with specific technologies.

  • Programming Distributed Erlang Applications: Pitfalls and Recipes - Building reliable distributed applications isn't as simple as merely choosing Erlang and OTP.

Infrastructure

  • Principles of Robust Timing over the Internet - Managing clocks is essential for even basics such as debugging
  • Consistent Hashing and Random Trees
  • Amazon's Dynamo Storage Service

Paxos Consensus

Understanding this algorithm is the challenge. I would suggest reading "Paxos Made Simple" before the other papers and again afterward.

  • The Part-Time Parliament - Leslie Lamport
  • Paxos Made Simple - Leslie Lamport
  • Paxos Made Live - An Engineering Perspective - Chandra et al
  • Revisiting the Paxos Algorithm - Lynch et al
  • How to build a highly available system with consensus - Butler Lampson
  • Reconfiguring a State Machine - Lamport et al - changing cluster membership
  • Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial - Fred Schneider

Other Consensus Papers

  • Mencius: Building Efficient Replicated State Machines for WANs - consensus algorithm for wide-area network
  • In Search of an Understandable Consensus Algorithm - The extended version of the Raft paper, an alternative to Paxos.

Gossip Protocols (Epidemic Behaviours)

  • How robust are gossip-based communication protocols?
  • Astrolabe: A Robust and Scalable Technology For Distributed Systems Monitoring, Management, and Data Mining
  • Epidemic Computing at Cornell
  • Fighting Fire With Fire: Using Randomized Gossip To Combat Stochastic Scalability Limits
  • Bi-Modal Multicast
  • ACM SIGOPS Operating Systems Review - Gossip-based computer networking
  • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
  • Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
  • Kademlia: A Peer-to-peer Information System Based on the XOR Metric
  • Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
  • PAST: A large-scale, persistent peer-to-peer storage utility - storage system atop Pastry
  • SCRIBE: A large-scale and decentralised application-level multicast infrastructure - wide area messaging atop Pastry

Advanced Distributed Systems

Research Seminar at Columbia University
