Cassandra

Agenda

This blog explains what Cassandra is and offers a basic tutorial. To follow along, you should have a basic knowledge of Java; some exposure to database concepts will also help.

What is NoSQL

A NoSQL (Not Only SQL) database provides a mechanism to store and retrieve data that is modeled in ways other than the tabular relations used in relational databases. These databases typically have flexible schemas and can handle huge amounts of data.

The primary features of a NoSQL database are:
  • simplicity of design,
  • horizontal scaling, and
  • finer control over availability.

CAP Theorem

The CAP theorem states that a distributed system can guarantee at most 2 of the following 3 properties at the same time:
  • Consistency: each read returns the most recent write.
  • Availability: every node that has not failed always responds to queries.
  • Partition tolerance: the system keeps working even when the connections between nodes are down.
It is illustrated as an equilateral triangle in the diagram below:

What is Cassandra

Cassandra is a distributed storage system for managing very large amounts of structured data spread across many servers, potentially in data centers around the world. It provides a highly available service with no single point of failure. It is an open-source project maintained by the Apache Software Foundation.

Below are some key points about Cassandra:
  • It is scalable, fault-tolerant, and tunably consistent.
  • It provides a column-oriented (column-family) storage system.
  • Its distribution design is based on Amazon’s Dynamo and its data model on Google’s Bigtable.
  • Cassandra is used by some of the biggest companies, such as Facebook, Twitter, Cisco, eBay, and Netflix.
Below are some of the features of Cassandra:
  • Always-on architecture - Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
  • Elastic scalability - Cassandra is highly scalable; it lets you add more hardware to accommodate more customers and more data as required.
  • Fast linear-scale performance - Cassandra is linearly scalable, i.e., throughput increases as you increase the number of nodes in the cluster. This helps it maintain a quick response time.
  • Flexible data storage - Cassandra accommodates structured, semi-structured, and unstructured data. It can dynamically accommodate changes to your data structures as your needs change.
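  • Transaction support - Cassandra does not provide full ACID (Atomicity, Consistency, Isolation, Durability) transactions in the relational sense. Writes are atomic, isolated, and durable at the level of a single row (partition), consistency is tunable per operation, and lightweight transactions cover compare-and-set cases.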
  • Easy data distribution - Cassandra provides the flexibility to distribute data where you need by replicating data across multiple data centers.
  • Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs very fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

Architecture of Cassandra

Cassandra is designed to handle big data workloads across multiple nodes without a single point of failure. It is a peer-to-peer distributed system, and data is distributed among all the nodes in a cluster.
  • Every node in a cluster plays the same role. Nodes are independent and at the same time interconnected with each other.
  • Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
  • When a node goes down, read/write requests can be served from other nodes in the network.
Data Replication in Cassandra: One or more nodes in a cluster act as replicas for a given piece of data. If some of the nodes respond with an out-of-date value, Cassandra returns the most recent value to the client. After returning the most recent value, Cassandra performs a read repair in the background to update the stale values. The following diagram shows how Cassandra uses data replication among the nodes in a cluster to avoid a single point of failure.


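The read-repair idea described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: each replica is a plain dictionary, every value carries a hypothetical integer timestamp, and the repair runs inline rather than in the background as Cassandra does. The names Versioned and read_with_repair are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    timestamp: int  # write time; a larger number means a newer write


def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replica copies."""
    # Collect the copy (if any) that each replica holds.
    copies = [replica[key] for replica in replicas if key in replica]
    if not copies:
        return None
    newest = max(copies, key=lambda copy: copy.timestamp)
    # Read repair: bring every replica up to date with the newest copy.
    for replica in replicas:
        if key not in replica or replica[key].timestamp < newest.timestamp:
            replica[key] = newest
    return newest.value


# Three hypothetical replicas; the second one missed the latest write.
replicas = [
    {"user:1": Versioned("alice@new.example", 2)},
    {"user:1": Versioned("alice@old.example", 1)},
    {"user:1": Versioned("alice@new.example", 2)},
]
result = read_with_repair(replicas, "user:1")
```

After the call, result holds the newest value and the stale second replica has been overwritten with it.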

The following are the key components of Cassandra:
  • Node: Where data is stored.
  • Data center: Collection of related nodes.
  • Cluster: Contains one or more data centers.
  • Commit Log: This is a crash-recovery mechanism. Every write operation is written to the commit log.
  • Mem-table: This is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
  • SSTable: This is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
  • Bloom Filter: This is a quick, probabilistic data structure for testing whether an element is a member of a set; it can report false positives but never false negatives. It acts as a special kind of cache. Bloom filters are checked on every read query.
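To make the Bloom filter entry more concrete, here is a minimal sketch of the data structure in Python. The bit-array size, the number of hashes, and the use of salted SHA-256 are assumptions chosen for readability; they are not Cassandra's actual parameters or hash function.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))


bloom = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bloom.add(key)
```

A key that was added always answers True. A key that was never added usually answers False, but it can occasionally answer True, which is why a positive answer still has to be confirmed against the real data.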
Cassandra Query Language (CQL): Users access Cassandra through its nodes using CQL. CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through drivers for their application language.

Write Operations: Every write to a node is first recorded in that node's commit log. The data is then stored in the mem-table. Whenever the mem-table is full, its data is written to an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, a process called compaction, discarding unnecessary data.
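The write path above can be sketched as a toy single-node model. ToyNode and its tiny memtable_limit are hypothetical, and the sketch ignores partitioning, replication, timestamps, and commit-log cleanup; it only shows the order commit log, mem-table, SSTable flush, and consolidation.

```python
class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory key -> value
        self.sstables = []    # immutable "disk files", oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log for crash recovery
        self.memtable[key] = value            # 2. update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once it is full

    def flush(self):
        # Persist the mem-table as a new SSTable sorted by key, then reset it.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        # Merge all SSTables; values from newer tables replace older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
```

The four writes produce two SSTables, and the consolidation step merges them into one in which the overwritten value of "a" has been discarded.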



Read Operations: Cassandra first looks for values in the mem-table, then checks the Bloom filters to find the SSTables that may hold the required data.
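A simplified version of this lookup order is sketched below. It assumes each SSTable is a (filter, data) pair and uses a plain Python set as a stand-in for the Bloom filter; the read function and the sample data are hypothetical. Real Cassandra also reconciles results by timestamp, which this sketch leaves out.

```python
def read(key, memtable, sstables):
    """Look up key in the mem-table first, then in SSTables, newest first."""
    if key in memtable:
        return memtable[key]
    for keys_filter, data in reversed(sstables):
        if key not in keys_filter:
            continue  # filter says "definitely absent": skip this SSTable
        if key in data:  # a positive answer must still be confirmed
            return data[key]
    return None


memtable = {"c": 30}
sstables = [
    ({"a", "b"}, {"a": 1, "b": 2}),  # older SSTable
    ({"a", "x"}, {"a": 10}),         # newer; "x" simulates a false positive
]

value_a = read("a", memtable, sstables)
value_b = read("b", memtable, sstables)
value_c = read("c", memtable, sstables)
value_x = read("x", memtable, sstables)
```

Here "a" comes from the newer SSTable, "b" from the older one, "c" from the mem-table, and "x" is not found even though one filter reported it as possibly present.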

Data Model

Cassandra's data model differs from the one found in relational databases.
Cluster: A Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. A cluster contains different nodes. Each node holds replicas of data, so if one node fails, a replica on another node takes over. Cassandra arranges the nodes of a cluster in a ring and assigns data to them.
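The ring arrangement can be illustrated with a small consistent-hashing sketch. The token function below uses SHA-256 purely for illustration and is not Cassandra's partitioner; the Ring class, the node names, and the one-token-per-node layout are likewise assumptions.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (toy hash for illustration)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node owns one token; sorting the tokens forms the ring.
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [node_token for node_token, _ in self.entries]

    def replicas_for(self, key, replication_factor):
        """Return the nodes storing key, walking clockwise from its token."""
        size = len(self.entries)
        start = bisect.bisect_right(self.tokens, token(key)) % size
        count = min(replication_factor, size)
        return [self.entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas_for("user:1", replication_factor=3)
```

The same key always maps to the same nodes, and asking for three copies yields three distinct nodes that sit next to each other on the ring.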
Keyspace: A keyspace is the outermost container for data in Cassandra. Below are the basic attributes of a keyspace:
  • Replication factor: This is the number of machines in the cluster that will receive copies of the same data.
  • Replica placement strategy: This is the strategy used to place replicas in the ring. The available strategies are the simple strategy (rack-unaware), the old network topology strategy (rack-aware), and the network topology strategy (datacenter-aware).
  • Column families: Column families are placed under a keyspace. A keyspace is a container for a list of one or more column families, while a column family is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
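As an illustration of how the replication factor and placement strategy are expressed, the sketch below builds a CQL CREATE KEYSPACE statement as a string. The helper create_keyspace_cql and the keyspace name demo are hypothetical, and the helper only covers the single-datacenter SimpleStrategy case; the resulting statement would be run through cqlsh or a driver.

```python
def create_keyspace_cql(name, replication_factor):
    """Build a CQL CREATE KEYSPACE statement that uses SimpleStrategy."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Plain string formatting is fine for a demo; never do this with
    # untrusted input.
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


statement = create_keyspace_cql("demo", 3)
print(statement)
```

With a replication factor of 3, each row in the demo keyspace would be stored on three nodes.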
Cassandra Data Modeling Rules: Cassandra does not support JOINs or the OR clause, and its support for GROUP BY and aggregation is limited (older versions have none). So you have to store data in the shape in which you intend to retrieve it.
Cassandra is optimized for high write performance, so you can afford to perform extra writes to get better read performance and data availability. There is a tradeoff between data writes and data reads: optimize your read performance by increasing the number of writes.
Maximize data duplication: because Cassandra is a distributed database, duplicating data provides instant availability without a single point of failure.
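The "write more to read faster" rule can be sketched with two in-memory tables that store the same data under different keys. The table names, the add_video helper, and the sample videos are all hypothetical; dictionaries stand in for Cassandra tables keyed by partition key.

```python
from collections import defaultdict

# Two hypothetical tables holding the same data, each keyed for one query.
videos_by_user = defaultdict(list)  # partition key: user
videos_by_tag = defaultdict(list)   # partition key: tag


def add_video(user, title, tags):
    """Write the video once per table so each read is a single-key lookup."""
    videos_by_user[user].append(title)
    for tag in tags:
        videos_by_tag[tag].append(title)


add_video("alice", "Intro to Cassandra", ["nosql", "cassandra"])
add_video("bob", "CAP theorem basics", ["nosql"])

nosql_videos = videos_by_tag["nosql"]
alice_videos = videos_by_user["alice"]
```

Each video costs several writes, but both queries ("videos by user" and "videos by tag") are answered by a single lookup with no join.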

