cSiTra
A Simple Transformer to migrate from RDBMS to Cassandra
A not so great while ago, when the world was not so full of wonders and needed more, an idea for a piece of software with an aim made up of three simple words, "SQL to NoSQL", emerged in the beautiful minds of intelligent beings in a faraway land. The only way to achieve it was to enter the tunnel of odds and take the treacherous path that led to the light at the end of the tunnel. With great determination and "SQL to NoSQL" in their hearts, the beings took the treacherous path and began an eight-week journey to reach that light and produce the software. Tested by time and fate, having battled the odds, with scars and bruises, hoping beyond hope, and with great effort, courage and collaboration, the beings reached the end of the tunnel and cSiTra was produced. In the end, as in every great tale, everyone lived happily ever after.
cSiTra is a one-of-a-kind automated piece of software that migrates data and schema from a traditional relational database to Cassandra (a prominent NoSQL database). It uses SiTra, a simple yet powerful Java transformation library.
History
Ever since computers began to evolve, the means of storing data has been one of the most vital components in computer science, and it has taken many forms. A new scheme has come up whenever the amount of data to be stored increased or the existing scheme was no longer the right logical medium. The computer file is one such means of storing data; its analogy is the paper document traditionally kept in offices and libraries. Files, though still sometimes used as a primary data store, were haunted by many problems such as redundancy, inconsistency, isolation, and difficulty in accessing data when a lot of it had to be stored and processed. These drawbacks gave way to the creation of databases.
Databases became the new way of storing data and overcame the issues that files suffered from. They offered persistence, concurrency management, system integration (shared repositories), reporting features, atomicity, and security for the data they host, and they came with excellent support. This was still not fully satisfactory, and the urge to store data in a relational manner gave way to relational databases. The main theme of an RDBMS is normalization, where a whole chunk of data is split into meaningful tables, stored as rows and columns, to reduce redundancy. Relational databases offered better security and integrity for the data, and they could handle millions and millions of records and still do pretty well. They reigned over the industry for two to three decades. Still, storing data relationally was not everybody’s cup of tea.
The RDBMS was not friends with everyone. Some people had problems with the whole process of normalization, as it led to something called impedance mismatch. The object-relational impedance mismatch is a set of conceptual and technical difficulties that are often encountered when an RDBMS is used by a program written in an object-oriented programming language or style, particularly when objects or class definitions are mapped in a straightforward way to database tables or relational schemas. These problems grew sky high when the Internet explosion occurred and big names like Google and Amazon started having trouble handling such big data in their relational databases. They tried clustering and building bigger machines (very large databases that were costly in complexity, money and maintenance, plus lots of other problems), only to fail. When nothing worked out, they gathered to discuss the big data problem and found that running RDBMSs on their clusters was an “unnatural selection” whose result would be an abomination, and so the urge to store such big data efficiently emerged again.
Hence the big NoSQL movement started in the industry. Google came up with Bigtable and Amazon with Dynamo, research was done, and papers were published. Many like-minded people wanted to take this to a whole new level and wanted a conference to discuss ideas, which took place in San Francisco. In the late 2000s the most important thing to have when advertising an event was a Twitter hashtag, short and unique, and someone came up with #NoSQL. That is all it was meant to be: a single hashtag for one meeting at one place and time. The conference was crowded with folks from big names such as Cassandra, CouchDB, MongoDB, Bigtable and HBase, and as a result of all of this NoSQL databases came into the picture. There is no de facto definition of a NoSQL database, only notable characteristics: schema-less, web and cluster friendly, and non-relational. NoSQL databases are usually grouped by the kind of data they handle:
- Cassandra and HBase are good at columns and column families (hash maps, 3D tables, RDBMS-like structures).
- MongoDB, CouchDB and RavenDB are good at handling documents (like web DOMs, XML or JSON).
- Project Voldemort and Redis are good at storing key-value pairs (hash map or dictionary kind of stuff).
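To make the three families above a little more concrete, here is a minimal sketch in Python that uses plain dictionaries as stand-ins for each data model. It is illustrative only: the "user" records, the variable names and the helper `columns_of` are made up for this example and are not part of cSiTra or of any of the databases mentioned.

```python
# Illustrative only: plain Python structures standing in for each data model.
# The "user" records below are hypothetical sample data.

# Column-family model (Cassandra, HBase): row key -> {column name -> value}.
# Rows in the same column family may carry different sets of columns.
users_column_family = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Linus", "city": "Helsinki", "phone": "555-0100"},
}

# Document model (MongoDB, CouchDB, RavenDB): one nested, JSON-like
# document per record.
user_document = {
    "_id": 1,
    "name": "Ada",
    "contacts": {"email": "ada@example.com"},
    "orders": [{"item": "book", "qty": 2}],
}

# Key-value model (Project Voldemort, Redis): an opaque value looked up by key.
key_value_store = {"user:1": b'{"name": "Ada"}'}


def columns_of(family, row_key):
    """Return the sorted column names stored for one row of a column family."""
    return sorted(family[row_key])
```

Calling `columns_of(users_column_family, "user:2")` returns a different set of column names than for `"user:1"`, which is the property that makes column families feel like a looser version of a relational table.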
Who can use cSiTra? And why choose Cassandra?
cSiTra is the tool for anyone who wants to migrate their database from any traditional RDBMS to Cassandra.
As to why Cassandra, that is something you have to ask yourself, since RDBMSs and Cassandra are meant for different things. Below is a general checklist, which also covers the main reasons we chose Cassandra. If most of the items apply to you, the migration is recommended.
- A need for very high scalability. Cassandra can handle lots and lots of data and supports dynamic column families, where each row can have a different set of columns. Overall, Cassandra still gives you a semi-SQL structure.
- A need for very fast lookups. Cassandra uses indexes, data structures that allow fast, efficient lookup of data matching a given condition.
- A need for very high availability. Cassandra provides clustering (a collection of nodes) and replication (storing copies of data on multiple nodes to ensure reliability and fault tolerance). Cassandra stores copies, called replicas, of each row based on the row key. You set the number of replicas when you create a keyspace, using the replica placement strategy.
- A need to manage data well without ACID transactions.
- A need to move away from a master/slave architecture. In a Cassandra cluster all nodes are peers, meaning there is no master node or centralized management process. A node joins a Cassandra cluster based on its configuration.
- A need for a high-performance distributed database designed to handle large amounts of data across many commodity servers.
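The replication point in the checklist above (replicas of each row placed according to the row key, with no master node) can be sketched in Python as a toy token ring. This is a simplified model under stated assumptions, not Cassandra's actual implementation: the function names `token` and `replicas_for` are hypothetical, MD5 is used only as a convenient hash, and each node's position on the ring is derived from its name, whereas a real cluster assigns tokens through configuration.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    """Hash a string onto the ring (MD5 is an assumption made for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(row_key, nodes, replication_factor):
    """Pick the nodes that would hold copies of row_key in this toy model.

    Starting from the first node whose token is at or after the row's token,
    walk clockwise around the ring and take replication_factor distinct nodes.
    Every node is a peer; there is no coordinator or master in the lookup.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds cluster size")
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # Wrap around to the start of the ring when the key hashes past the last node.
    start = bisect_left(tokens, token(row_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


# Hypothetical four-node cluster keeping three copies of every row.
cluster = ["node-a", "node-b", "node-c", "node-d"]
owners = replicas_for("user:1", cluster, 3)
```

Because placement depends only on the row key and the ring, any node can work out where a row lives, and losing one of the three owners still leaves two copies available.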