Cassandra is a NoSQL database. A NoSQL database provides a mechanism to store and retrieve data that is modeled in ways other than the tabular relations used in relational databases. Such databases are typically schema-free, support easy replication, have simple APIs, and can handle huge amounts of data. Their major objectives are simplicity of design, horizontal scaling, and finer control over availability. They use different data structures from relational databases, which can make some operations faster.
Some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix, use Cassandra, thanks to features like elastic scalability, linear scale performance, and a flexible data storage mechanism. It has an always-on architecture with no single point of failure.
What many people value most about Cassandra is that it is open source: it was released by Facebook in 2008 and adopted by the Apache Software Foundation in 2009. It is a distributed, decentralized storage system (database) for managing large amounts of structured data. Its distribution design is based on Amazon’s Dynamo, and its data model on Google’s Bigtable.
Cassandra's data storage mechanism is flexible and accommodates structured, semi-structured, and unstructured data. It can dynamically accommodate changes to your data structures as your needs change. By replicating data across multiple data centers, it gives you flexibility in where data is distributed. Its transaction support is narrower than that of a relational database: writes are atomic, isolated, and durable at the level of a single partition, and consistency is tunable per operation, rather than full ACID (Atomicity, Consistency, Isolation, Durability) transactions spanning multiple rows or tables. It runs on commodity hardware, is designed for fast writes, and can store hundreds of terabytes of data without sacrificing read efficiency.
Cassandra can handle big data workloads across multiple nodes without any single point of failure. It uses a peer-to-peer distributed design, and the data is distributed among all the nodes in a cluster. One or more of the nodes in a cluster act as replicas for a given piece of data. When data is read, Cassandra may detect that some of those replicas responded with an out-of-date value.
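How a read might reconcile replicas that disagree can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Cassandra's actual implementation: the `Replica` class, the `read_with_repair` function, and the integer timestamps are hypothetical names invented for the example, and the repair runs inline here rather than in the background.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for one node's copy of a single value."""
    name: str
    value: str
    timestamp: int  # write time; a larger number means a newer write


def read_with_repair(replicas):
    """Return the newest value and bring stale replicas up to date."""
    newest = max(replicas, key=lambda r: r.timestamp)
    repaired = []
    for replica in replicas:
        if replica.timestamp < newest.timestamp:
            # Stale copy: overwrite it with the newest value (the "read repair").
            replica.value = newest.value
            replica.timestamp = newest.timestamp
            repaired.append(replica.name)
    return newest.value, repaired


replicas = [
    Replica("node-a", "v2", timestamp=200),
    Replica("node-b", "v1", timestamp=100),  # out of date
    Replica("node-c", "v2", timestamp=200),
]
value, repaired = read_with_repair(replicas)
print(value, repaired)  # prints: v2 ['node-b']
```

After the call, all three replicas hold `v2`, so a later read sees consistent copies.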
When replicas disagree, Cassandra returns the most recent value to the client and then performs a read repair in the background to update the stale values.
The key components of Cassandra are the node, data center, cluster, commit log, memtable, SSTable, and Bloom filter. A node is where data is stored; a data center is a collection of such nodes; and a cluster contains one or more data centers. The commit log is a crash-recovery mechanism: every write operation is recorded in it. After a write is recorded in the commit log, the data is written to the memtable, a memory-resident data structure. An SSTable is a disk file to which data is flushed from the memtable when its contents reach a threshold size. Last but not least, a Bloom filter is a quick, probabilistic data structure for testing whether an element is a member of a set. It can report false positives but never false negatives, so it serves as a cheap check that lets a read skip SSTables that cannot contain the requested data.
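The way these storage components fit together can be sketched as a toy key-value store. This is a minimal illustration, not how Cassandra is implemented: `TinyStore`, `BloomFilter`, and `flush_threshold` are hypothetical names, the threshold is counted in keys rather than bytes, and the commit log and SSTables are kept in memory here although they live on disk in a real node.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class TinyStore:
    """Toy write path: commit log -> memtable -> immutable SSTables."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical, counted in keys
        self.commit_log = []  # append-only record used for crash recovery
        self.memtable = {}    # in-memory writes not yet flushed
        self.sstables = []    # (bloom_filter, data) pairs, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush at the threshold

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check newest SSTables first; skip any whose filter rules the key out.
        for bloom, data in reversed(self.sstables):
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = TinyStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)  # reaches the threshold, so the memtable is flushed
store.write("c", 3)  # stays in the memtable
print(len(store.sstables), store.memtable, store.read("a"), store.read("z"))
```

The Bloom filter never causes a wrong answer in this sketch: a false positive only costs one unnecessary dictionary lookup, which mirrors its role as a cheap pre-check before touching an SSTable.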
Cassandra is accessed through the Cassandra Query Language (CQL), which treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through language drivers for their applications. Clients can approach any node for their read and write operations. That node, called the coordinator, acts as a proxy between the client and the nodes holding the data.
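To make the keyspace-as-container idea concrete, here is a small sketch that holds a few CQL statements in Python and sends them through a session object. The keyspace `demo`, the table `users`, and the helper names are hypothetical and chosen only for illustration; `RecordingSession` is a stand-in so the example runs without a cluster.

```python
# Hypothetical keyspace and table names, chosen only for illustration.
CQL_STATEMENTS = [
    # A keyspace is the outer container; replication is configured on it.
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}",
    # Tables live inside the keyspace.
    "CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text)",
    "INSERT INTO demo.users (user_id, name) VALUES (1, 'alice')",
    "SELECT name FROM demo.users WHERE user_id = 1",
]


def run_statements(session):
    """Execute each statement on any object exposing execute(statement)."""
    return [session.execute(statement) for statement in CQL_STATEMENTS]


class RecordingSession:
    """Stand-in session that only records what it is asked to execute."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)
        return None


session = RecordingSession()
run_statements(session)
print(len(session.executed))  # prints: 4
```

Assuming the DataStax Python driver (`cassandra-driver`) is installed and a node is reachable locally, the stand-in could be replaced with a session obtained from `Cluster(["127.0.0.1"]).connect()` in `cassandra.cluster`. The same statements can also be typed into cqlsh, each terminated with a semicolon.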
A Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. To handle failures, every node holds replicas of data. Cassandra arranges the nodes of a cluster in a ring and assigns data to them. This arrangement simplifies data storage and management.
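The ring arrangement can be sketched with a simple hash ring. This is an illustration under simplifying assumptions rather than Cassandra's own scheme: `token_for`, `Ring`, and `replicas_for` are hypothetical names, each node gets a single token derived from its name, SHA-256 stands in for the real partitioner, and replicas are simply the next nodes clockwise.

```python
import bisect
import hashlib


def token_for(key, ring_size=2**32):
    """Map a key to a position on the ring (SHA-256 is only for illustration)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % ring_size


class Ring:
    """Toy ring: each node owns one token; keys go to the next node clockwise."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        # Sort nodes by token so the list can be walked as a ring.
        self.entries = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.entries]

    def replicas_for(self, key):
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", ring.replicas_for(key))
```

Because each key is stored on more than one node, losing a single node still leaves a copy of every key available on a neighbour in the ring.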
The hosting world, which is embracing the current NoSQL boom, will likely continue to benefit from Cassandra's versatility in the years to come.