Saturday, 4 January 2014

Making Sense of NoSQL

Making sense about the post

You have heard of NoSQL database and you just can't make sense of what it is, how to use it and the advantage of using it. Then you are in the right place to make sense of it all, have fun reading.

What is NoSQL?

Defining NoSQL is somewhat of a challenge, the term NoSQL doesn't really describe the ideals behind the NoSQL movement, descriptive or not it seems to be everywhere.  The word "nosql" originated from a hashtag(#nosql) about a meeting were people can talk about ideas and the new types of emerging databases. That's all NoSQL was ever meant to be a twitter hashtag(#nosql) to advertise a single meeting, it been the name of a single movement was an accident nobody ever thought. NoSQL means "Not Only SQL" by that NoSQL is not meant to knockout SQL or supplant it.

Defining NoSQL:

  • NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
  • NoSQL is a set of concept that allows the rapid and efficient processing of data set with a focus on performance, reliability, and agility.

Making sense of the definition

  • It's more than rolls in a table - NoSQL systems store and retrieve data from many format: Key-value stores, graph databases, column-family (Google BigTable) and document stores.
  • It's free of joins - NoSQL systems allow you to extract your data using simple interfaces without joins.
  • It works on many processors - NoSQL systems allow you to store your database on multiple processors and maintain high speed performance.
  • It supports linear scalability - When more processors are added you get a consistent increase in performance.
NoSQL is focused on providing scalability(scaling out on multiple box), performance and high availability(by making efficient use of the RAM and Solid State Disk).

Types of NoSQL data store 


Type
Usage
Examples
Key value store – A simple data storage that uses a key to access a value
   §   Image Stores
   §   Key  - Based file systems
   §   Object cache
   §   System designed to scale
   §   Redis
   §   Memcache
   §   DynamoDB
   §   Riak
   §   Berkeley DB
Column family store – A sparse matrix system that uses row and a column as keys
   §   Web crawlers results
   §   Big data problems that can relax consistency rules
   §   Apache HBase
   §   Apache Cassandra
   §   Hypertable
   §   BigTable
Graph Store – For relationship intensive problems
   §   Social networks
   §   Relationship-heavy data
   §   Neo4j
   §   Bigdata
   §   InifiniteGraph
Document store – Storing hierarchical data structures directly in the database
   §   High – variability data
   §   Document search
   §   Integration hubs
   §   Web content management
   §   Publishing
   §   MongoDB
   §   CouchDB
   §   Couchbase
   §   MarkLogic
for list of NoSQL databases click here

Drivers/reasons behind NoSQL movement

Web and Big data associated with it are the driver of NoSQL's rise, but not the only reason to use NoSQL. Many NoSQL databases are design to run well on larger clusters, which make them more attractive for large data volumes. NoSQL databases are designed to scale horizontally on commodity hardware; typically with relational databases like SQL server or Oracle, you scale by purchasing bigger and faster box(i.e scaling up) this approach is more expensive, by design RDBMS can't scale out(i.e buying of small boxes to scale) because they are meant to run on a single node.

In time, the ability to increase processing speed was no longer an option. As chip density increased, heat could no longer dissipate fast enough without chip overheating. This phenomenon, known as the power wall, forced systems designers to shift their focus from increasing speed on a single chip to using more processor working together. The need to scale out(also know as horizontal scaling), rather than scale up(faster processors), moved organizations from serial to parallel processing where data processing are split into separate paths and sent to separate processors to divide and conquer the work. Horizontal scaling is used in many NoSQL databases, data is partitioned and balanced across multiple nodes in a cluster and aggregate queries are distributed by default. This makes it easy to scale out cheaply and quickly. Other reason for NoSQL movement include Schemaless data presentation, development time(you don't have to deal with complex SQL queries), speed(NoSQL database are much faster than RDBMS) and more.

NoSQL case studies

We look at how companies have tried to innovate and stay competitive when faced with a challenge.

  1. LiveJournal's Memcache: Engineers at LiveJournal started to look at how there system was using there most precious resource: RAM in each web server. Due to the popularity of the site, visitors using the site continued to increase on the daily basis, they kept up with the demand by adding more web servers each with it separate RAM. There need to increase performance of database queries lead them to found ways to keep the result of the most frequently used database queries in RAM, avoiding the expensive cost of running the same queries on there database. But there was a challenge there was no way for any web server to know that another web server already had a copy of the query in RAM. So they found a way to create distinct signature of each query, this signature or hash was a short string for representing a SQL SELECT statement. Web server could ask other web servers if they have a copy of the SQL result by sending a small message to them, if one did it will return the result of the query and avoid an expensive trip to the already overwhelmed SQL database. By using hashing and caching, data in RAM can be shared. This cut down the number of read request sent to the database, increasing performance. They called this new system Memcache because it manage RAM memory cache.
  2. Google's MapReduce: One of the most influential case study in the NoSQL movement is the Google MapReduce system. There need to index billions of web pages for search lead them to develop a process for transforming large volumes of web data content into index using low-cost hardware. MapReduce is a programming model for processing large dataset with a parallel, distributed algorithm on a cluster(a large number of computers working together). The initial stage of the transformation are called the Map() function which performs filtering and sorting(such as sorting student by first name into queues, one queue for each name). The results of the map operation are then sent to the second layer: the Reduce() function, where the results are sorted, combined, and summarized to produce the final result. The Core concept behind map and reduce is nothing new, it date back to the Influential LISP functional programming language. LISP programming was different from other language because it emphasized function first and this has been the basis of modern functional programming language.Google's use of MapReduce fostered a growing awareness of the limitation of procedural programming language and encouraged others to take another look at the power of functional programming and the ability of functional programming systems to scale over thousands of low-cost CPUs.
  3. Google's BigTable: A table with a billion rows and a million columns. The motivation behind Bigtable was the need to store results from the web crawlers, html pages and media content from the internet. The resulting dataset was so large that it couldn't fit into a single relational database, so Google developed their own storage systems. The solution was neither a full relational database nor a filesystem, but what they called a "distributed storage system" that worked with structured data.
  4. Amazon Dynamo: In 2007 Amazon published a significant NoSQL paper Amazon 2007 Dynamo: A high Available Key-Value Store.  The business motivation behind Dynamo was Amazon's need to create a high reliable web storefront that supported transactions from around the world 24/7 without interruption. What they developed was key-value store with a simple interface that can be replicated even when there are large volumes of data to be processed.

RDBMS Pros and Cons

RDBMs Pros:
  • ACID transactions at database level makes development easier
  • Fine-grained security on columns and rows using views prevents and changes by unauthorized users
  • Most SQL codes is portable to other SQL databases..
  • RDBMs is matured
  • Readily available skilled personal.
RDBMs Cons:
  • Object Relation-mapping layer can be complex.
  • RDBMs don't scale out when joins are required.
  • It can be difficult to store high-variablity data in tables.

NoSQL Pros and Cons

NoSQL Pros:
  • Mostly Open Source.
  • Horizontal scaling takes place by adding new processors to the cluster.
  • Lower operational cost are obtained by autosharding.
  • There is no need for object-relational mapping layer.
  • Ability to store complex datatype.
  • It's easy to store high-variability data.
NoSQL Cons:
  • ACID transaction can't be done(some solution have atomicity support on a single object level).
  • Immature.
  • Staffs are new to NoSQL databases.
  • Indexing is not supported(Some systems like mongoDb has indexing but it not has powerful has SQL systems).
  • Bad Reporting performance.

Comparing ACID and BASE two method of reliable database transaction

Transaction control is important in distributed computing environment with respect to performance and consistency. Traditional RDBMs systems have focused on ACID transactions while BASE is found in most NoSQL systems.

ACID Transactions:

  • Atomicity: Everything in a transaction either succeed or is rolled back i.e each transaction is all or nothing.
  • Consistency: All transaction must be bring the database from one valid state to another. A transaction cannot leave the database in an inconsistent state.
  • Isolation: Transaction occur without the knowledge of another transaction.
  • Durability: Once transaction has occurred it is permanent. A completed transaction persist, even after application restarts.

BASE Transactions:

  • Basic Availability: System guarantees response - Successful or failed execution.
  • Soft State:  Indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual Consistency: The database will be consistent over time.

Conclusion

NoSQL databases are becoming an increasingly important part of the database landscape, and when used appropriately, can offer real benefits, but NoSQL database is not a solution to each and every application. Tradition RDBMs will always be around and in use for a very long time, you just have to choose the one that best fit your needs.

What Next?

Want to start developing apps with NoSQL check out this blog post.

4 comments:

  1. Great post.
    Thanks for sharing.

    ReplyDelete
  2. :facepalm: https://s3.amazonaws.com/theoatmeal-img/comics/misspelling/their.png

    ReplyDelete