Big Data, NoSQL and MapReduce

Consider a hypothetical scenario. Your company has got the project to design a new website for the channel airing IPL (if you haven’t heard of IPL, just pick up any sport you love). The channel wants to create this new website when users can register and create their own discussion rooms for discussing a specific aspect of match or a specific player or anything else. You have been assigned as a lead architect on this project. Among other challenges, you are having nightmares thinking about non-functional requirements (NFRs) that are to be met for this project (your competition was fired, as their traditional 3 tier architecture wasn’t holding up). You know you got to do something different, but not sure exactly what and how. If this resonates with you, keep reading.

Big Data – As name suggests Big Data is about huge and fast growing data, though how huge and how fast is left to one’s discretion. Big Data initially attributed to search engines and social networks is now making its way into enterprises. Primary challenges while working with Big Data are – how to store it and how to process it. There are other challenges too like visualization and data capture itself, but for this post I will omit them. Let’s start with storage first, by understanding NoSQL.

NoSQL is an umbrella term for non-relational databases which don’t use SQL (Structured Query Language). NoSQL databases unlike relational databases are designed to scale horizontally and can be hosted on a cluster. Most of these databases are key value stores (Riak) where each row is a key value pair. The important thing to note here is the value doesn’t have a fix schema; it can be anything – a user or a user profile or an entire discussion. There are two major variants of key value databases – document database (MongoDB) and column-family database (Cassandra). Both of them extend the basic premise of key-value store to allow easy search on data contained inside value object. Document store database imposes a structure on the value stored allowing query on internal fields. On the other hand column-family database stores value across multiple column value pairs (you can also think of it as second level key value pair) and then group them into a coherent unit called column-family.

Before you think you have found the storage panacea and ready to go, you need to take care of few important aspects related to distributed databases – scalability, availability, and consistency.

First is scaling via sharding to meet your data volume. Good part is most of NoSQL database support auto sharding which means shards are automatically balanced across the nodes on a cluster. You can also add additional nodes as necessary to your cluster, to align with data volume. But what if a node goes down? How can we still make the shards available? We need to mitigate these failures by making our system highly available.

Availability can be achieved via replication. You can setup a master slave replication or peer-to-peer replication. With master slave replication you should typically setup three nodes including master and all the writes go to the master node. Data reads though can happen from any node, either a master or a slave. If a master node goes down, the slave gets promoted to master, and continues to replicate to the third node. When failed master node resurrects it joins the cluster as a slave. In contrast, peer-to-peer replication is slightly complex. Here, unlike Master / Slave all the nodes receive read / write requests. The shards are now replicated bidirectional. While this looks good just remember when we use replication we will run into consistency issues due to latency.

There are two major types of inconsistencies – read and write. Read inconsistencies will arise in master / slave replication when you try to read of a slave before changes propagate from master. While in peer-to-peer replication you will run into both read and write inconsistencies, as write (update) is allowed on multiple nodes (think of two people trying to book movie tickets at the same time). As you would have observed availability and consistency are in contrast to each other (check out CAP theorem for more details). What’s the right balance is purely contextual. For instance you can prohibit reads and writes inconsistencies – just have slaves as hot standby; don’t read of them.

Let’s now see how you can process Bigdata – the compute aspect. Processing massive amount of data needs a shift from the client server model of data processing wherein client pulls the data from server. Instead the emphasis is to run processing on the cluster nodes where data is present by pushing the code. In addition, this processing can be carried out independently in parallel as the underlying data is already partitioned across nodes. This way of processing is referred to as MapReduce pattern and it also interestingly uses key-value pairs.

Extending our IPL example, consider you want to list the top players being discussed across all the forums. This would mean you need to iterate through each discussion in our NoSQL store and then identify each player occurrences. Applying MapReduce here, we start with the map function. A single discussion (key-value pair) would be an input to the map function, which would result into a key value pairs output, with key being player name and value indicating number of occurrences. All the occurrences (values) for a given player (key) across nodes are then passed to a reduce function for aggregation.

Most MapReduce frameworks allow you to control the number of mappers and reducers instances, using configurations. While reduce functions normally operate on a single key, there is also a concept of partition function which allows you to send multiple keys to a single reducer, helping you evenly distribute the load across reducers. Finally, as you would have guessed mappers and reducers could be running on different nodes, and this would need map output being moved across to reducers. To minimize these data movements, you can introduce combiners, which perform a local reducing job – in our case all the player occurrences can be aggregated at the node level before passing it on to the reducer. Most of NoSQL databases have their own way of abstracting / implementing MapReduce via queries and others. You can also use Hadoop and related technologies like HDFS for your MapReduce workload without using NoSQL databases.


Hope this overview has helped you understand the big picture of how these technologies fit together.