Big Data, NoSQL and MapReduce

Consider a hypothetical scenario. Your company has won the project to design a new website for the channel airing the IPL (if you haven’t heard of the IPL, just substitute any sport you love). The channel wants this new website to let users register and create their own discussion rooms for discussing a specific aspect of a match, a specific player, or anything else. You have been assigned as the lead architect on this project. Among other challenges, you are having nightmares thinking about the non-functional requirements (NFRs) that must be met for this project (your competition was fired, as their traditional 3-tier architecture wasn’t holding up). You know you have to do something different, but you aren’t sure exactly what and how. If this resonates with you, keep reading.

Big Data – As the name suggests, Big Data is about huge and fast-growing data, though how huge and how fast is left to one’s discretion. Big Data, initially attributed to search engines and social networks, is now making its way into enterprises. The primary challenges while working with Big Data are how to store it and how to process it. There are other challenges too, like visualization and data capture itself, but I will omit them for this post. Let’s start with storage, by understanding NoSQL.

NoSQL is an umbrella term for non-relational databases which don’t use SQL (Structured Query Language). Unlike relational databases, NoSQL databases are designed to scale horizontally and can be hosted on a cluster. Most of these databases are key-value stores (e.g. Riak), where each row is a key-value pair. The important thing to note here is that the value doesn’t have a fixed schema; it can be anything – a user, a user profile, or an entire discussion. There are two major variants of key-value databases – document databases (MongoDB) and column-family databases (Cassandra). Both extend the basic premise of the key-value store to allow easy search on the data contained inside the value object. A document database imposes a structure on the stored value, allowing queries on internal fields. A column-family database, on the other hand, stores the value across multiple column-value pairs (you can also think of it as a second-level key-value pair) and then groups them into a coherent unit called a column family.
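
As a rough illustration, here is how the same discussion record might be shaped in each of the three store types, sketched in C# (the keys, field names and values below are made up for this example, and the statements would sit inside a method):

using System.Collections.Generic;
using System.Text;

// Key-value store: the value is an opaque blob, keyed by the discussion id.
var keyValueRow = new Dictionary<string, byte[]>
{
    { "discussion:42", Encoding.UTF8.GetBytes("{ ...serialized discussion... }") }
};

// Document store: the value has internal structure the database can query on.
var documentRow = new Dictionary<string, object>
{
    { "_id", "discussion:42" },
    { "title", "Best finisher in the death overs" },
    { "players", new[] { "Dhoni", "Kohli" } }
};

// Column-family store: the row key maps to column/value pairs grouped into a family.
var columnFamilyRow = new Dictionary<string, Dictionary<string, string>>
{
    {
        "DiscussionInfo", new Dictionary<string, string>
        {
            { "title", "Best finisher in the death overs" },
            { "createdBy", "user:1007" }
        }
    }
};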

Before you conclude you have found the storage panacea and are ready to go, you need to take care of a few important aspects related to distributed databases – scalability, availability, and consistency.

First is scaling via sharding to meet your data volume. The good part is that most NoSQL databases support auto sharding, which means shards are automatically balanced across the nodes of a cluster. You can also add nodes to your cluster as necessary, to align with data volume. But what if a node goes down? How can we still keep the shards available? We need to mitigate these failures by making our system highly available.
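
Before moving on to availability, here is a bare-bones illustration of the shard-routing idea (most NoSQL databases do this for you as part of auto sharding; the modulo scheme below is only illustrative – real systems typically use consistent hashing so that adding a node doesn’t reshuffle every key):

// Picks a shard for a given key by hashing it and taking a modulo over the node count.
// Illustrative only; would live in a helper class in a real code base.
public static int PickShard(string key, int nodeCount)
{
    return (key.GetHashCode() & 0x7FFFFFFF) % nodeCount;
}

// e.g. PickShard("discussion:42", 4) always routes that discussion to the same node.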

Availability can be achieved via replication. You can set up master-slave replication or peer-to-peer replication. With master-slave replication you typically set up three nodes, including the master, and all writes go to the master node. Reads, though, can happen from any node, either the master or a slave. If the master node goes down, a slave gets promoted to master and continues to replicate to the third node. When the failed master resurrects, it rejoins the cluster as a slave. In contrast, peer-to-peer replication is slightly more complex. Here, unlike master/slave, all the nodes receive read/write requests, and the shards are replicated bidirectionally. While this looks good, just remember that when we use replication we will run into consistency issues due to latency.

There are two major types of inconsistencies – read and write. Read inconsistencies arise in master/slave replication when you try to read off a slave before changes propagate from the master. In peer-to-peer replication you will run into both read and write inconsistencies, as writes (updates) are allowed on multiple nodes (think of two people trying to book movie tickets at the same time). As you will have observed, availability and consistency pull against each other (check out the CAP theorem for more details). What the right balance is, is purely contextual. For instance, you can prohibit read and write inconsistencies altogether by keeping the slaves as hot standbys and not reading off them.

Let’s now see how you can process Big Data – the compute aspect. Processing massive amounts of data needs a shift from the client-server model of data processing, wherein the client pulls data from the server. Instead, the emphasis is on running the processing on the cluster nodes where the data is present, by pushing the code to them. In addition, this processing can be carried out independently in parallel, as the underlying data is already partitioned across nodes. This way of processing is referred to as the MapReduce pattern, and interestingly it also uses key-value pairs.

Extending our IPL example, consider that you want to list the top players being discussed across all the forums. This means you need to iterate through each discussion in our NoSQL store and count each player’s occurrences. Applying MapReduce here, we start with the map function. A single discussion (a key-value pair) is the input to the map function, which outputs key-value pairs with the key being the player name and the value indicating the number of occurrences. All the occurrences (values) for a given player (key), across nodes, are then passed to a reduce function for aggregation.
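
A minimal, single-process sketch of these two functions in C# might look like the following (the PlayerMentionJob class and the ExtractPlayerNames helper are hypothetical placeholders; a real framework would run the map and reduce calls on different nodes):

using System.Collections.Generic;
using System.Linq;

public static class PlayerMentionJob
{
    // Map: one discussion in, a stream of (player, 1) pairs out.
    public static IEnumerable<KeyValuePair<string, int>> Map(string discussionId, string discussionText)
    {
        foreach (var player in ExtractPlayerNames(discussionText))
            yield return new KeyValuePair<string, int>(player, 1);
    }

    // Reduce: all counts emitted for a single player, possibly from many nodes, are summed here.
    public static KeyValuePair<string, int> Reduce(string playerName, IEnumerable<int> occurrences)
    {
        return new KeyValuePair<string, int>(playerName, occurrences.Sum());
    }

    // Placeholder: a real job would tokenize the discussion text and match known player names.
    private static IEnumerable<string> ExtractPlayerNames(string text)
    {
        return text.Split(' ').Where(word => word.StartsWith("@"));
    }
}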

Most MapReduce frameworks allow you to control the number of mapper and reducer instances using configuration. While a reduce function normally operates on a single key, there is also the concept of a partition function, which allows you to send multiple keys to a single reducer, helping you evenly distribute the load across reducers. Finally, as you would have guessed, mappers and reducers could be running on different nodes, and this requires the map output to be moved across to the reducers. To minimize this data movement, you can introduce combiners, which perform a local reduce job – in our case, all the player occurrences can be aggregated at the node level before being passed on to the reducer. Most NoSQL databases have their own way of abstracting / implementing MapReduce via queries and similar constructs. You can also use Hadoop and related technologies like HDFS for your MapReduce workload without using NoSQL databases.
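
Continuing the sketch above, a combiner would simply pre-aggregate the map output on the local node before it is shuffled to the reducers (again, a hypothetical method that could be added to the PlayerMentionJob class; it needs System.Collections.Generic and System.Linq):

// Combiner: groups the local map output by player and sums the counts, so fewer
// key-value pairs travel over the network to the reducers.
public static IEnumerable<KeyValuePair<string, int>> Combine(IEnumerable<KeyValuePair<string, int>> localMapOutput)
{
    return localMapOutput
        .GroupBy(pair => pair.Key)
        .Select(group => new KeyValuePair<string, int>(group.Key, group.Sum(pair => pair.Value)));
}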


Hope this overview has helped you understand the big picture of how these technologies fit together.


Dependency Inversion, Dependency Injection, DI Containers and Service Locator

Here is a post I have been planning to do for years 🙂 . Yes, I am talking about Dependency Injection, which by this time has most likely made its way into your code base. I got to recollect a few thoughts around it in a recent discussion, and hence I am writing them down.

Dependencies are the common norm of object-oriented programming, helping us adhere to software principles like Single Responsibility, Encapsulation, etc. Instead of establishing dependencies using direct references between classes or components, they are better managed through an abstraction. Most of the GoF patterns are based on this principle, which is commonly referred to as ‘Dependency Inversion’. Using the Dependency Inversion principle we can create flexible object-oriented designs, making our code base reusable and maintainable.

To further enhance the value proposition of dependency inversion, you can pass the dependencies via a constructor or a property from the root of your application, instead of instantiating them within your class. This allows you to mock / stub your dependencies and makes your code easily testable (read: unit testing).

E.g.

IA ob = new A();           // direct instantiation – creates a hard dependency on the concrete class A
public D(IA ob) { … }      // constructor injection – the dependency is passed in from outside
public IA Ob { get; set; } // property injection – create the object and set the property
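
To make the testability point concrete, here is a minimal sketch (the IA, StubA and D types are just placeholders extending the snippet above) showing how a unit test can hand the class a stub instead of the real dependency:

public interface IA { string GetData(); }

// A stub used only by tests – no database, no network.
public class StubA : IA
{
    public string GetData() { return "test-data"; }
}

public class D
{
    private readonly IA _ob;
    public D(IA ob) { _ob = ob; }                                // constructor injection
    public string Process() { return _ob.GetData().ToUpper(); }
}

// In a unit test: var sut = new D(new StubA());
// sut.Process() returns "TEST-DATA" without touching the real implementation of IA.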

The entire software industry sees value in the above approach, and most people refer to this entire structure (pattern) as Dependency Injection. But beyond this, things get a little murkier.

Confusion starts for programmers who see the above as a standard way of programming. To them, Dependency Injection (DI) is the automated way of injecting dependencies using a DI container (a DI container is at times also referred to as an IoC (Inversion of Control) container. As Martin Fowler points out in his classic article, IoC is a generic term used in quite a few cases – e.g. the callback approach of the Win32 programming model. In order to avoid confusion and keep this approach discernible, the term DI was coined. In this case, IoC refers to inverting the creation of dependencies, which are created outside of the dependent class and then injected.) If you are surprised by this statement, let me tell you that 90% of the people who have walked up to me to ask if I was using DI wanted to know about the DI container and my experience with it.

So what the heck is this ‘DI container’? A DI container is a framework that can identify constructor arguments or properties on the objects being created and automatically inject them as part of object creation. Coming from the .NET world, the containers I use include StructureMap, Autofac and Unity. DI containers can be wired up using a few lines of code at the start of your program, or you can even specify the configuration in an XML file. Beyond that, containers are transparent to the rest of your code base. Most containers also provide AOP (Aspect Oriented Programming) functionality and its variants. This allows you to bundle cross-cutting concerns like database transactions, logging, caching, etc. as aspects and avoid boilerplate code throughout the system (I have written a CodeProject article on those lines). Before you feel I am oversimplifying things, let me state that if you haven’t worked with DI containers in the past, you are likely to face a learning curve. As is the case with most other frameworks, a pilot is strongly recommended. As a side note, the preferred injection rule is: unless your constructor requires too many parameters (ensure you haven’t violated SRP), you should resort to constructor injection and avoid property injection (see Fowler’s article for a detailed comparison between the two injection types).

Finally, let’s talk about Service Locator, an alternative to DI. A Service Locator holds all the services (dependencies) required by your system (in code or using a configuration file) and returns a specific service instance on request. A Service Locator can come in handy for scenarios where a DI container is not compatible with a given framework (e.g. WCF, ASP.NET Web APIs) or where you want more control over object creation (e.g. creating the object late in the cycle). While you can mock a service locator, mocking it is a little more cumbersome when compared to DI. Service Locator is generally seen as an anti-pattern in the DI world. Interestingly, most DI containers offer APIs which allow us to use them as a Service Locator (look for Resolve / GetInstance methods on the container).
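
As a rough illustration, a hand-rolled Service Locator could be as simple as the sketch below (the ServiceLocator class is hypothetical; the IContainer / GetInstance calls assume a StructureMap-style container):

// A minimal Service Locator: the container is hidden behind a static lookup,
// and callers ask for a service by its interface type.
public static class ServiceLocator
{
    private static IContainer _container;   // e.g. a StructureMap container

    public static void SetContainer(IContainer container)
    {
        _container = container;
    }

    public static T GetService<T>()
    {
        return _container.GetInstance<T>();
    }
}

// Usage: var dependency = ServiceLocator.GetService<IDependency>();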

The sample below shows StructureMap – a DI container – in action (use NuGet to add the StructureMap dependency).

class Program
{
    static void Main(string[] args)
    {
        var container = new Container(registry =>
        {
            registry.Scan(x =>
            {
                x.TheCallingAssembly();
                x.WithDefaultConventions();
            });
        });

        var ob = container.GetInstance<DependentClass>();
        ob.CallDummy();
    }
}

public interface IDependency
{
    void Dummy();
}

public class Dependency : IDependency
{
    public void Dummy()
    {
        Console.WriteLine("Hi There!");
    }
}

public class DependentClass
{
    private readonly IDependency _dep;

    public DependentClass(IDependency ob)
    {
        _dep = ob;
    }

    public void CallDummy()
    {
        _dep.Dummy();
    }
}

I will try to post about some subtle issues around DI containers in the future. I hope the above helps anyone looking for a quick start on DI and its associated terms 🙂 .

What is a DMZ?

A very brief introduction. The DMZ is an element which most architects miss out in their deployment architectures (except the few who run their designs past IT pros). The term stands for “Demilitarized Zone”, an area often found on the perimeter (outside) of a country’s border. These areas are typically not guarded (under treaties between two or more countries).

In the IT domain we refer to a DMZ as a separate network. So why create separate networks, or DMZs? The simplest answer is to enhance security. For example, consider hosting the public-facing websites of your business. You might want to host these sites inside a DMZ, separate from your corporate network. In case the security of your site is compromised, your corporate network is still safe. Many architects also prefer hosting web and database servers in different DMZs to ensure there is no compromise of data in case a hacker breaks into their web servers. Like elsewhere, you can use routers to transfer data to DMZ networks. While a DMZ is a separate network, you must have enough defense packed into it. An enterprise firewall (with redundancy, of course) is the minimum recommendation. Enterprises are also known to have multiple firewalls for each of their DMZs / networks. Below is a simplistic diagram of a DMZ deployment.

Hope this helps 🙂 !

It’s Tech Ed Time!!!

Tech Ed 2010 is here. I will be presenting in the Architect Track on “Integration and Identity Challenges for Enterprise-Grade Cloud Applications”. I also have a session lined up for the Community Track with an interesting title – “Imagine a world free of login screens …”. I am really geared up for these sessions and look forward to providing maximum ROI to the audience for their time and money. If you are attending Tech Ed, do drop in to say hello.

I am a bit late on this post and missed uploading the presentation decks from my last VTD. I hope to bundle all of them together and upload them post TechEd.

SQL Server Reporting Services (SSRS) Architecture Overview

This is level 100 for people trying to figure out how SQL Server Reporting Services (SSRS) works and what makes it work. Honestly, I haven’t found SSRS architecture explained that clearly anywhere. I have given up in the past, as search engines weren’t leading to any easy-to-understand sources. A few days back I had to give a management presentation on migrating to SSRS, so with my back to the wall I had little option but to dive in. During the journey I came across a few distilled facts that I am sharing in this blog post. Hope you find them useful 🙂 .

SSRS is an optional package which you can select to install while installing SQL Server. SSRS in turn is made up of a number of components. The simplest diagram I could find that describes these components and their deployment was from TechNet.

As you can see in the above diagram, when you install SSRS it creates the Report Server databases in your SQL Server instance. These databases are ReportServerDB and ReportServerTempDB, which are used to store report configurations and other things, including caching and session data, that improve overall performance. You have the option of installing the other components, like Report Manager and Report Server, on the same machine where the SQL Server instance is running, or you can install them on a different server (the typical enterprise setup). An important thing to note here is that if you opt for the latter, you will end up paying for two or more SQL Server licenses.

As it turns out, there are three distinct components of SQL Server Reporting Services:

1) Report Server: It’s an overloaded term, largely used to indicate a set of components that allow interaction with the Report Server database. SSRS provides Web Services (.asmx) which allow LOB applications to interact with the Report Server database directly (http://computername/ReportServer/reportservicexxxx.asmx, where xxxx is the version of SSRS). SSRS 2005 created virtual directories for Report Server and Report Manager (discussed next), but SSRS 2008 leverages the OS-level HTTP listener, making SSRS independent of IIS. This allows Report Server and Report Manager to be bundled within a Windows service, ReportingServicesService.exe. The name of this Windows service is ReportServer. The functionality of ReportingServicesService.exe also includes report processing, scheduling (auto-generated reports), subscriptions (mailers), etc.

2) Report Manager: An ASP.NET web-based application (http://computername/Reports) that in turn interacts with the Report Server Web Services. As the name indicates, Report Manager allows you to manage reports in terms of configuring security access, organizing them into folders (none of these folders map to physical directories; they are stored as details in the Report Server database), subscribing to them, etc. One can also create reports (see the next point as to how) and deploy them to the Report Server database using Report Manager. This is handy for some restricted user / production scenarios, though most developers prefer to deploy reports from BI studio. As discussed earlier, with SSRS 2008 this component is bundled with the ReportServer Windows service.

3) Report Designer: There will be a few people on your team whom you may want to designate as report designers. Report designers can design reports using VS.NET Business Intelligence projects (Report Server Project). They create data sources (normally a shared data source (.rds) that’s used across a set of reports), create datasets (using queries / stored procedures on top of the data source), define the relevant report parameters (mapped to datasets for value retrieval via the Report Data window; these can also be passed in from .NET applications), set field formats (using the properties window with pervasive VB expressions – e.g. formatting a textbox to display currency decimals) and create layouts (e.g. grouping). Once they are done designing their reports (.rdl files – described later), they can test (preview) them and publish them via the Report Server (this is done by providing the Report Server URL in the project properties; SSRS then creates a specific folder for your project). Once published, these reports are available for end-user consumption. Advanced scenarios like interacting with Excel may require a third-party product like OfficeWriter.

There are a few other important aspects of SSRS which one should be familiar with.

Report Builder is another tool, targeted at business users who want to generate custom reports on the fly. Report Builder is a ClickOnce application, intuitive and easy to use, but it doesn’t support all the options available with VS.NET. It’s also possible to install Report Builder as a standalone application.

Report Model is the base for report creation with Report Builder. It’s a simplified view of a relational database, targeted at business users for ad hoc report creation. Report models are created using BI Development Studio (Report Model Project – .smdl files). A report model is built on top of a Data Source View (.dsv) that defines a logical model based on one or more data sources. The generated models mainly consist of entities (relational tables), fields (attributes of a relational table) and roles (entity relationships – 1-1, 1-*, *-1). Models also contain other attributes, like aggregate values, that help ease reporting for end users. Post creation, a report model has to be deployed in a similar way as reports. You can also use a report model as a data source for generating reports via Report Designer. While it’s easy to deploy a report model from BI Development Studio, deploying one manually (e.g. in production) requires you to merge the .smdl and .dsv files.

RDL – this is another term you will run into while talking about SSRS. RDL stands for Report Definition Language. This is an XML file which stores the query information, data source information, etc. required to generate a report. There is another type of report definition – RDLC (Report Definition Language Client-side) – which doesn’t store any of the above configurations. RDLC is a client-side component (VS.NET application projects) to which you can pass data (e.g. via a DataSet) coming from any data source. RDLC can be useful for scenarios like implementing custom pagination (SSRS 2005 pagination, by default, is client-side pagination).

SSRS security is primarily Windows based. When a user accesses the Report Manager application or the ASMX Web Services, he has to authenticate with a valid domain username / password. On successful login, SSRS determines the role of the user (custom, or built-in ones like Browser / Content Manager, etc.) and displays only those reports / folders to which the user has access. There are a few variations of the security implementation I have come across that don’t rely on Windows Authentication. Some projects restrict a role like Content Manager to pushing reports (.rdl files) to production with the help of rs.exe; all users then have an implicit role of Browser, and application-layer security determines which reports the user should have access to. In case you want to go ahead and roll out custom authentication that flows security all the way down, SSRS allows you that too. If you are generating reports by connecting to remote data sources for accessing images, etc., you might have to configure an Unattended Execution Account.

Deploying reports to production – This is mainly done in three ways. In restricted production environments, rs.exe can be used. Otherwise, we can deploy the reports directly from Visual Studio or use Report Manager (discussed earlier). This normally requires you to change your data sources and the Report Server URL in the project properties. There is an overwrite property for data sources which is normally set to false. This property helps ensure that you don’t accidentally overwrite production data sources during your deployment. We can also deploy individual reports in case we have a specific modification.

Integrating SSRS into your applications – I will focus mainly on ASP.NET here. We need the ReportViewer control found in the toolbox. Drag and drop it onto your ASPX page, and in the background it will add a reference to the Microsoft.ReportViewer.WebForms DLL. You may need to bundle this DLL with your application package, as in production the web server and the Report Server are usually on different machines. Below is the typical markup found in the .ASPX page (I have hardcoded the report server URL for simplicity).

<rsweb:ReportViewer ID="ReportViewer1" …>
    <ServerReport ReportPath="/TargetReportFolder/OpenTicketReport" ReportServerUrl="http://ReportServerURL/ReportServer" />
</rsweb:ReportViewer>

One can also pass any necessary parameters in the codebehind file, as shown below:

this.ReportViewer1.ServerReport.SetParameters(Param);
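
Expanded slightly, the surrounding codebehind could look like the sketch below (the parameter name "TicketStatus" and its value are assumptions for the example; the actual names come from the parameters defined on the report):

// Requires: using System.Collections.Generic; using Microsoft.Reporting.WebForms;
protected void Page_Load(object sender, EventArgs e)
{
    if (!IsPostBack)
    {
        var parameters = new List<ReportParameter>
        {
            new ReportParameter("TicketStatus", "Open")   // hypothetical report parameter
        };
        this.ReportViewer1.ServerReport.SetParameters(parameters);
    }
}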

In case your web server and Report Server are located on different machines, you need to ensure that the worker process running the application on the web server has access to the reports (you can configure this using Report Manager security).

Finally, here’s a good link on tuning SSRS reports.

Inputs/Thoughts/Suggestions/Corrections?

Implementing Audit Trail for your application

I recently came across Davy’s post where he talks about implementing auditing. Davy lists three requirements which I guess are quite common for an audit trail:

1) We should be able to easily retrieve what user X changed in the database on day Y or during a certain time frame.
2) We need to know what record X looked like at point Y in time, and be able to show a historic list of changes for a specific record.
3) We need to configure the duration for which auditing data must be kept in the database on a per-table (or per-schema) level

In this post I would like to talk about the approach we took on one of our recent projects. We prepared a generic table which holds the audit data. This table mainly consists of AuditID, EntityID, EntityTypeID, UserID, AuditTime, AuditLevel and AuditData columns. AuditData is a TEXT column (we support SQL 2000) holding an XML structure capturing the changes made to an entity by a given user, in the form of old and new values, at the specified level. The level here is either just an entry (capturing the type of modification – I/D/U) or a detail of old / new values. This caters to the first requirement, as it’s quite easy to search for the activities of a given user.

The second requirement is a little tougher, as with this model we don’t allow direct queries against record attributes. But it can be achieved in two ways. The first, simple option is to search by entity type and a specific entity for a date range, and browse through the records to see the changes made to them. The second option is to take this AuditData out and restructure its layout with the help of OLAP cubes to enable a meaningful, quick search.

The third requirement is very much there for us in order to control the size of this massive table, but for simplicity we purge the entire database at a given time interval and do not make the retention adjustable per entity. Though I feel providing that flexibility shouldn’t be difficult, considering the design we have laid out.

A final thing I wanted to highlight is generating the XML data itself. It’s always better to do such processing in .NET code, and if you have been using DataSets you get old-value extraction out of the box. If not a DataSet, many O/R mappers also allow access to old and new values. In case you are rolling your own repositories, with a domain model that doesn’t capture old values, you would need to design a custom approach, like storing the original object in session and comparing it with the postback. We are yet to see significant performance degradation due to this design, but as a contingency plan I was thinking of making this XML conversion and insertion into the AuditDB an asynchronous activity with the help of a queue-like data structure.
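
As a minimal sketch of that contingency plan (the class names, field names and in-memory queue below are assumptions for illustration; a background worker would drain the queue and insert rows into the audit table):

// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Xml.Linq;
public class AuditEntry
{
    public int EntityId;
    public int EntityTypeId;
    public int UserId;
    public DateTime AuditTime;
    public string AuditData;   // XML holding the old / new values
}

public static class AuditTrail
{
    private static readonly Queue<AuditEntry> Pending = new Queue<AuditEntry>();
    private static readonly object Sync = new object();

    public static void Record(int entityId, int entityTypeId, int userId,
        IDictionary<string, string> oldValues, IDictionary<string, string> newValues)
    {
        // Serialize the changes into the XML structure stored in the AuditData column.
        var xml = new XElement("Changes",
            newValues.Keys.Select(field => new XElement("Field",
                new XAttribute("name", field),
                new XElement("Old", oldValues.ContainsKey(field) ? oldValues[field] : string.Empty),
                new XElement("New", newValues[field]))));

        var entry = new AuditEntry
        {
            EntityId = entityId,
            EntityTypeId = entityTypeId,
            UserId = userId,
            AuditTime = DateTime.UtcNow,
            AuditData = xml.ToString()
        };

        // Queue the entry so the database insert happens off the request path.
        lock (Sync)
        {
            Pending.Enqueue(entry);
        }
    }
}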

I look forward to reading your thoughts on the above.

Whoever saves one life, saves the world entire

The title of this post happens to be from one of my all-time favourite movies – Schindler’s List, but the post is about my newly created cloud application using the Azure platform. My cloud application – EmergencyBloodBank – is finally up. Here’s the direct link to the application. This application was created for the Cloud App Contest.

There were two major driving factors for this application – one was to create an application covering all the offerings of the Azure platform that could help community members bootstrap their Azure efforts, and the second was coming up with a scenario that effectively leverages cloud services without any elements of over-engineering (which can make a solution look artificial). My initial thought was to create a shared calendar service or to stream live video, but there were a few compelling solutions already out there. So I decided to focus on the healthcare segment and could immediately see the impact in the blood bank field. An initial search on the internet landed me here, and that was it – I was all set to create a communication platform, leveraging cloud infrastructure and services, for sharing blood-related requirements.

The first step was to create a site with 99.999% uptime, which is a necessary requirement when talking about a blood bank sort of service. The site is built on ASP.NET 3.5 with C# as the programming language, and this is where one can post blood-related requests. Next was using Live Services for authentication. How many times have you turned away from an excellent site due to its lengthy login form? I have, and I never wanted the end users of this site to do that. So using Live Services was a natural fit, with its large user base of approximately 500 million users. The next challenge was how to make these requests reach potential donors. A pull model is not effective in this case (imagine the load it would put on the site or a web service), and the .NET Service Bus provides the push model I was looking for. So I went ahead with it and created a WPF client which gets the relayed messages from an Azure worker role via the .NET Service Bus. A few things that I opted out of due to time constraints were providing an SMS service, porting the WPF client to Windows Mobile, and integration with Web 2.0 applications like Facebook, Twitter, etc. The final task was to store user details so they don’t have to key them in every time they visit the site, and also to audit all the blood requests they place. SQL Azure with its SaaS offering was a natural fit, because the data here is nothing more than a KB per request, so one can easily forgo buying the product and use the available SaaS model (add to that almost zero installation and maintenance effort). This helps keep the site cost within $100 per month while providing a unique, highly available solution. Below is the architecture diagram for the application.

EBBArchitecture

You can access the actual solution document here.

Let me know your thoughts on whether I could have done things differently. I plan to cover the entire architecture, design and development process in one of my upcoming talks.

Till then, Happy Clouding 🙂 .

(P.S. My application has been selected as the winner of the Azure Cloud App Contest. Thanks to everybody who voted and provided their valuable feedback.)