Cassandra: Structured Storage System over a P2P Network
Avinash Lakshman, Prashant Malik, Karthik Ranganathan
Why Cassandra?
  • Lots of data: copies of messages, reverse indices of messages, per-user data.
  • Many incoming requests, resulting in a lot of random reads and random writes.
  • No existing production-ready solution in the market meets these requirements.
Design Goals
  • High availability
  • Eventual consistency: trade off strong consistency in favor of high availability
  • Incremental scalability
  • Optimistic replication
  • "Knobs" to tune tradeoffs between consistency, durability and latency
  • Low total cost of ownership
  • Minimal administration
Cassandra Architecture
  [Figure: layered architecture diagram with the following components]
  • Messaging Layer
  • Cluster Membership
  • Failure Detector
  • Storage Layer
  • Partitioner
  • Replicator
  • Cassandra API
  • Tools
Data Model
  [Figure: a single row, identified by KEY, holding three column families]
  • ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name)
      Name: tid1, Value: <Binary>, TimeStamp: t1
      Name: tid2, Value: <Binary>, TimeStamp: t2
      Name: tid3, Value: <Binary>, TimeStamp: t3
      Name: tid4, Value: <Binary>, TimeStamp: t4
  • ColumnFamily2 (Name: WordList, Type: Super, Sort: Time)
      SuperColumn aloha: C1 V1 T1, C2 V2 T2, C3 V3 T3, C4 V4 T4
      SuperColumn dude: C2 V2 T2, C6 V6 T6
  • ColumnFamily3 (Name: System, Type: Super, Sort: Name)
      SuperColumn hint1: <Column List>
      SuperColumn hint2: <Column List>
      SuperColumn hint3: <Column List>
      SuperColumn hint4: <Column List>
  • Column families are declared upfront.
  • SuperColumns are added and modified dynamically.
  • Columns are added and modified dynamically.
Write Operations
  • A client issues a write request to a random node in the Cassandra cluster.
  • The "Partitioner" determines the nodes responsible for the data.
  • Locally, write operations are logged and then applied to an in-memory version.
  • The commit log is stored on a dedicated disk local to the machine.
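The local part of the write path above (log first, then update the in-memory version) can be sketched as follows. This is a minimal illustration, not Cassandra's code: the `Node` class, the JSON record layout, and the "latest timestamp wins" rule are assumptions made for the example.

```python
import json
import os
import tempfile


class Node:
    """Toy single-node write path: append to a commit log, then update a memtable."""

    def __init__(self, log_path):
        # Hypothetical log file; a real deployment would place it on a dedicated disk.
        self.log = open(log_path, "a", encoding="utf-8")
        # column family -> row key -> column name -> (value, timestamp)
        self.memtables = {}

    def write(self, key, column_family, column, value, timestamp):
        record = {"key": key, "cf": column_family, "col": column,
                  "val": value, "ts": timestamp}
        # Step 1: log the write sequentially and force it to disk.
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # Step 2: apply it to the in-memory version (assumed rule: newest timestamp wins).
        row = self.memtables.setdefault(column_family, {}).setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)

    def close(self):
        self.log.close()


def demo_write_path():
    """Write the same column twice; return (log records, current in-memory value)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "commit.log")
        node = Node(path)
        node.write("user42", "MailList", "tid1", "hello", 1)
        node.write("user42", "MailList", "tid1", "hello again", 2)
        node.close()
        with open(path, encoding="utf-8") as fh:
            logged = len(fh.readlines())
        return logged, node.memtables["MailList"]["user42"]["tid1"]
```

Both writes end up in the log, while the memtable keeps only the newest version of the column. The log is append-only, so the disk is only ever written sequentially.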
Write cont'd
  [Figure: write path for a key with column families CF1, CF2, CF3]
  • Commit log: the key and its column families (CF1, CF2, CF3) are binary-serialized and appended to the commit log on a dedicated disk.
  • Memtables: the update is applied to one memtable per column family: Memtable (CF1), Memtable (CF2), Memtable (CF3).
  • FLUSH: a memtable is flushed to disk based on data size, number of objects, and lifetime.
  • Data file on disk: a sequence of entries of the form <Key name><Size of key data><Index of columns/supercolumns><Serialized column family>.
  • Block index: <Key name> Offset pairs (K128 Offset, K256 Offset, K384 Offset, ...).
  • Bloom filter: index kept in memory.
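A rough sketch of how a lookup could use the two on-disk helpers named on this slide: the Bloom filter rules out files that cannot contain the key, and the sparse block index narrows the scan to one block. The class names, the SHA-256-based hash positions, and the reading of "K128, K256, K384" as "every 128th key is indexed" are assumptions for illustration.

```python
import bisect
import hashlib


class BloomFilter:
    """Simple Bloom filter; one byte per bit keeps the sketch short."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash number.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class DataFile:
    """Sorted (key, value) rows plus a sparse index of every `interval`-th key."""

    def __init__(self, rows, interval=128):
        self.rows = sorted(rows)
        self.interval = interval
        # The list index stands in for a byte offset into the file.
        self.index_keys = [self.rows[i][0] for i in range(0, len(self.rows), interval)]
        self.bloom = BloomFilter()
        for key, _ in self.rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # skip the file without touching the data
        block = bisect.bisect_right(self.index_keys, key) - 1
        if block < 0:
            return None  # key sorts before the first indexed key
        start = block * self.interval
        for k, v in self.rows[start:start + self.interval]:
            if k == key:
                return v
        return None
```

For example, `DataFile([(f"k{i:04d}", i) for i in range(500)]).get("k0300")` scans only the block that starts at the indexed key `k0256`.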
Compactions
  [Figure: three sorted data files are merge-sorted into a single sorted data file]
  • Input file 1 (sorted): K1 <Serialized data>, K2 <Serialized data>, K3 <Serialized data>, ...
  • Input file 2 (sorted): K2 <Serialized data>, K10 <Serialized data>, K30 <Serialized data>, ...
  • Input file 3 (sorted): K4 <Serialized data>, K5 <Serialized data>, K10 <Serialized data>, ...
  • MERGE SORT output (sorted data file): K1, K2, K3, K4, K5, K10, K30, each with <Serialized data>.
  • Index file: K1 Offset, K5 Offset, K30 Offset, plus the Bloom filter; loaded in memory.
  • The merged input files are marked DELETED.
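The merge step in the figure can be sketched with a k-way merge over already-sorted inputs. The row layout `(key, timestamp, value)` and the rule that the newest timestamp wins when a key appears in several files (K2 and K10 in the figure) are assumptions for this sketch; the slide itself only shows the merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def compact(*sorted_files):
    """Merge sorted runs of (key, timestamp, value) rows into one sorted run.

    Each input must already be sorted by key. When a key occurs in more than
    one input, only the version with the highest timestamp is kept.
    """
    merged = heapq.merge(*sorted_files, key=itemgetter(0))  # streaming k-way merge
    result = []
    for _, versions in groupby(merged, key=itemgetter(0)):
        result.append(max(versions, key=itemgetter(1)))
    return result


# The three input files from the figure, with integer keys and made-up timestamps.
FILE_1 = [(1, 1, "a"), (2, 1, "b"), (3, 1, "c")]
FILE_2 = [(2, 2, "b-new"), (10, 2, "j"), (30, 2, "x")]
FILE_3 = [(4, 3, "d"), (5, 3, "e"), (10, 3, "j-new")]
```

`compact(FILE_1, FILE_2, FILE_3)` yields keys in the order 1, 2, 3, 4, 5, 10, 30, matching the output file in the figure. Because the inputs are sorted, the merge reads each of them sequentially.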
Write Properties
  • No locks in the critical path
  • Sequential disk access
  • Behaves like a write-through cache
  • Append support without read-ahead
  • Atomicity guarantee for a key
  • "Always writable": accepts writes during failure scenarios
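Read
  [Figure: read path between a client and a Cassandra cluster with Replica A, Replica B, and Replica C]
  • The client sends a query to the cluster.
  • The query is served by the closest replica (Replica A), which returns the result.
  • Digest queries are sent to Replica B and Replica C, which return digest responses.
  • The result is returned to the client.
  • Read repair is performed if the digests differ.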
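The digest comparison and read repair in the figure might look like the sketch below. Replicas are modelled as plain dictionaries, the digest is a SHA-256 hash of the timestamp and value, and repair pushes the newest version to every replica; all three are simplifying assumptions, not a description of Cassandra's implementation.

```python
import hashlib


def digest(value, timestamp):
    """Compact fingerprint of one stored version."""
    return hashlib.sha256(f"{timestamp}:{value}".encode()).hexdigest()


def read(key, replicas):
    """Read `key` from replicas, a list of dicts mapping key -> (value, timestamp).

    replicas[0] plays the role of the closest replica and returns the full
    data; the others only return digests. On a mismatch, the newest version
    is written back to all replicas (read repair).
    """
    closest, others = replicas[0], replicas[1:]
    result = closest.get(key)
    expected = digest(*result) if result else None
    digests = [digest(*r[key]) if key in r else None for r in others]
    if any(d != expected for d in digests):
        versions = [r[key] for r in replicas if key in r]
        newest = max(versions, key=lambda version: version[1])
        for r in replicas:
            r[key] = newest
        result = newest
    return result[0] if result else None
```

With replicas `a = {"k": ("v1", 1)}`, `b = {"k": ("v2", 2)}`, and `c = {"k": ("v1", 1)}`, calling `read("k", [a, b, c])` detects the digest mismatch, returns the newer value from `b`, and leaves all three replicas holding that version.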
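Partitioning and Replication
  [Figure: hash ring over the interval from 0 to 1 (with 1/2 marked), nodes A, B, C, D, E, F placed on the ring, keys placed at h(key1) and h(key2), replication factor N=3]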
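One way to read the ring figure: each node owns a position on the ring, a key is hashed onto the same ring, and its N=3 replicas are the next three nodes found walking clockwise from h(key). The sketch below assumes that reading; the hash function, the `Ring` class, and placing nodes by hashing their names are illustrative choices, not details from the slides.

```python
import bisect
import hashlib


def ring_position(name):
    """Hash a string to a position on the unit ring."""
    h = int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")
    return h / 2**64


class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Node positions in ring order (assumption: derived from the node name).
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def replicas_for(self, key):
        """Return the nodes responsible for `key`, walking clockwise from h(key)."""
        positions = [pos for pos, _ in self.tokens]
        # First node at or after h(key); wrap around past the end of the ring.
        start = bisect.bisect_left(positions, ring_position(key)) % len(self.tokens)
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]
```

`Ring(list("ABCDEF")).replicas_for("key1")` returns three distinct, ring-adjacent nodes. Adding or removing a node only changes the replica sets of keys that hash near that node's position.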
Cluster Membership and Failure Detection
  • A gossip protocol is used for cluster membership.
  • Super lightweight, with mathematically provable properties.
  • State is disseminated in O(log N) rounds, where N is the number of nodes in the cluster.
  • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to.
  • A member merges the received list with its own list.
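The gossip round described above can be simulated in a few lines. This is a sketch under assumptions: the "list" is taken to be a map from member name to the highest heartbeat counter seen, merging keeps the larger counter per member, and the `Member` class and `run_rounds` helper are hypothetical names.

```python
import random


class Member:
    def __init__(self, name):
        self.name = name
        # member name -> highest heartbeat counter seen so far
        self.heartbeats = {name: 0}

    def tick(self, peers, rng):
        """One gossip round: bump own heartbeat, send the list to one other member."""
        self.heartbeats[self.name] += 1
        if peers:
            rng.choice(peers).merge(self.heartbeats)

    def merge(self, remote):
        """Merge a received list into the local one, keeping the newer entry."""
        for name, counter in remote.items():
            if counter > self.heartbeats.get(name, -1):
                self.heartbeats[name] = counter


def run_rounds(n_members, rounds, seed=0):
    """Simulate `rounds` gossip rounds in a cluster of `n_members`."""
    rng = random.Random(seed)
    members = [Member(f"node{i}") for i in range(n_members)]
    for _ in range(rounds):
        for member in members:
            member.tick([p for p in members if p is not member], rng)
    return members
```

Running `run_rounds(8, 200)` leaves every member with an entry for all eight nodes, even though each member contacts only one peer per round.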
Accrual Failure Detector
  • Valuable for system management, replication, load balancing, etc.
  • Defined as a failure detector that outputs a value, PHI, associated with each process.
  • Also known as adaptive failure detectors: designed to adapt to changing network conditions.
  • The output value, PHI, represents a suspicion level.
  • Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions.
  • In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5.
Properties of the Failure Detector
  • If a process p is faulty, the suspicion level $\Phi(t) \to \infty$ as $t \to \infty$.
  • If a process p is faulty, there is a time after which $\Phi(t)$ is monotonically increasing.
  • A process p is correct $\Leftrightarrow$ $\Phi(t)$ has an upper bound over an infinite execution.
  • If process p is correct, then for any time T, $\Phi(t) = 0$ for some $t \ge T$.
Implementation
  • PHI estimation is done in three phases:
      1. Inter-arrival times for each member are stored in a sampling window.
      2. The distribution of these inter-arrival times is estimated. Gossip inter-arrival times follow an exponential distribution.
      3. The value of PHI is computed as $\Phi(t) = -\log_{10} P(t_{now} - t_{last})$.
  • $P(t)$ denotes the probability that a heartbeat will arrive more than $t$ units after the previous one: $P(t) = 1 - F(t) = e^{-t\lambda}$, where $F(t) = 1 - e^{-t\lambda}$ is the CDF of an exponential distribution.
  • The overall mechanism is described in the figure below.
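The three phases can be put together in a short sketch. It assumes that the rate $\lambda$ is estimated as the reciprocal of the mean inter-arrival time in the window, so that $-\log_{10} e^{-t\lambda} = t / (\text{mean} \cdot \ln 10)$. The class name and the window size of 1000 samples are placeholders chosen for the example.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy PHI estimator for one monitored member."""

    def __init__(self, window_size=1000):
        # Phase 1: sliding sampling window of heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`."""
        if not self.intervals:
            return 0.0  # not enough samples to estimate a distribution
        # Phase 2: fit an exponential distribution; lambda = 1 / mean (assumption).
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        # Phase 3: phi = -log10(e^(-elapsed / mean)) = elapsed / (mean * ln 10).
        elapsed = now - self.last_arrival
        return elapsed / (mean * math.log(10))
```

With heartbeats arriving once per second, PHI reaches a threshold of 5 after about 5 · ln 10 ≈ 11.5 seconds of silence, which falls within the 10-15 second detection time quoted on the earlier slide.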
Information Flow in the Implementation
Performance Benchmark
  • Random and sequential writes: limited by network bandwidth.
  • Read performance for Inbox Search in production:

               Search Interactions   Term Search
    Min        7.69 ms               7.78 ms
    Median     15.69 ms              18.27 ms
    Average    26.13 ms              44.41 ms
Lessons Learnt
  • Add fancy features only when absolutely required.
  • Many types of failures are possible.
  • Big systems need proper systems-level monitoring.
  • Value simple designs.
Future work
  • Atomicity guarantees across multiple keys
  • Distributed transactions
  • Compression support
  • Granular security via ACLs
Questions?

Data Presentations Cassandra Sigmod
