Niraj Bhatt – Architect's Blog

Ruminations on .NET, Architecture & Design

Category Archives: Architecture Design

WS-Fed vs. SAML vs. OAuth vs. OpenID Connect

Identity protocols are more pervasive than ever. Almost every enterprise you come across will have an identity product in place, tied to a specific identity protocol. While the initial idea behind these protocols was to help enterprise employees use a single set of credentials across applications, new use cases have shown up since then. In this post, I am going to provide a quick overview of the major protocols and the use cases they are trying to solve. Hope you find it useful.

WS-Fed and SAML are the old boys in the market. Having appeared in the early 2000s, they are widespread today; almost every major SSO COTS product supports one of these protocols. WS-Fed (WS-Federation) is a protocol from the WS-* family primarily supported by IBM and Microsoft, while SAML (Security Assertion Markup Language) has been adopted by Computer Associates, Ping Identity and others for their SSO products. The premise behind both WS-Fed and SAML is similar: decouple the applications (relying party / service provider) from the identity provider. This decoupling allows multiple applications to use a single identity provider through a predefined protocol, without caring about the implementation details of the identity provider per se.

For web applications, this works via a set of browser redirects and message exchanges. The user tries to access the web application, and the application redirects the user to the identity provider. The user authenticates, the identity provider issues a claims token and redirects the user back to the application. The application then validates the token (trust needs to be established out of band between the application and the IdP), authorizes user access by asserting claims, and allows the user to access protected resources. The token is then stored in a session cookie in the user's browser, ensuring the process doesn't have to be repeated for every request.
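
To make the token-validation step concrete, here is a minimal, hypothetical sketch of what a relying party might do once the token comes back. The Claim and ClaimsToken types (and the signature-check delegate) are illustrative stand-ins, not part of any specific WS-Fed or SAML library.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative types only; not from any WS-Fed / SAML SDK.
public record Claim(string Type, string Value);
public record ClaimsToken(IReadOnlyList<Claim> Claims, DateTime Expiry, byte[] Signature);

public class RelyingParty
{
    // Trust with the IdP is established out of band; the check itself is injected here.
    private readonly Func<ClaimsToken, bool> _isSignatureValid;

    public RelyingParty(Func<ClaimsToken, bool> isSignatureValid) => _isSignatureValid = isSignatureValid;

    public bool TryAuthorize(ClaimsToken token, string requiredRole) =>
        _isSignatureValid(token)                     // 1. token really came from the trusted IdP
        && token.Expiry > DateTime.UtcNow            // 2. token has not expired
        && token.Claims.Any(c => c.Type == "role"    // 3. assert claims before granting access
                              && c.Value == requiredRole);
}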

At a high level there isn't much separating the flow of these two protocols, but they are different specifications, each with its own lingo. WS-Fed is perceived to be less complex and lightweight (certainly an exception for the WS-* family), while SAML, being more complex, is also perceived to be more secure. In the end you have to look at your ecosystem, including existing investments, partners, in-house expertise, etc., and determine which one will provide higher value. The diagram below, taken from Wikipedia, depicts the SAML flow.

[Diagram: SAML 2.0 browser SSO flow (redirect / POST bindings), from Wikipedia]

OAuth (Open Standard for Authorization) has a different intent (the current version is OAuth 2.0). Its driving force isn't SSO but access delegation (a type of authorization). In simplest terms, it means giving your access to someone you trust, so that they can perform a job on your behalf. For example, updating your status across Facebook, Twitter, Instagram, etc. with a single click. The options you have are either to go to these sites manually, or to delegate your access to an app which can connect to these platforms and update the status on your behalf. The flow is pretty simple: you ask the application to update your status on Facebook, the app redirects you to Facebook, you authenticate yourself to Facebook, Facebook throws up a consent page stating you are about to give this app rights to update your status on your behalf, you agree, the app gets an opaque access token from Facebook, caches that access token, and sends the status update along with the access token to Facebook; Facebook validates the access token (easy in this case, as the token was issued by Facebook itself) and updates your status.
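
To show what the token-granting step looks like on the wire, here is a small sketch of the standard OAuth 2.0 authorization code exchange: after the user consents, the client swaps the short-lived authorization code for an access token. The endpoint URL, client id, secret and redirect URI below are placeholders, not real Facebook values, and JSON parsing of the response is left out to keep it short.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class OAuthClient
{
    public static async Task<string> ExchangeCodeForTokenAsync(string authorizationCode)
    {
        using var http = new HttpClient();

        // Standard authorization-code grant parameters.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["grant_type"]    = "authorization_code",
            ["code"]          = authorizationCode,
            ["redirect_uri"]  = "https://myapp.example.com/callback",
            ["client_id"]     = "my-client-id",
            ["client_secret"] = "my-client-secret"
        });

        var response = await http.PostAsync("https://provider.example.com/oauth/token", form);
        response.EnsureSuccessStatusCode();

        // The response body is JSON containing the opaque access_token.
        return await response.Content.ReadAsStringAsync();
    }
}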

OAuth refers to the parties involved as Client, Resource Owner (end user), Resource Server, and Authorization Server. Mapping these to our Facebook example: the Client is the application trying to do work on your behalf, the Resource Owner is you (you own the Facebook account), the Resource Server is Facebook (holding your account), and the Authorization Server is also Facebook (in our case Facebook issues the access token with which the client can update the status on your Facebook account). It is perfectly OK for the Resource Server and Authorization Server to be managed by separate entities; it just means more work to establish common ground for protocols and token formats. The diagram below depicts the OAuth2 protocol flow.

[Diagram: OAuth 2.0 protocol flow]

The web community liked the lightweight approach of OAuth. And hence the question came up: can OAuth do authentication as well, providing an alternative to heavyweight protocols like WS-Fed and SAML? Enter OpenID Connect, which is about adding authentication to OAuth. It aims at making the Authorization Server do more: not only issue an access token, but also an ID token. The ID token is a JWT (JSON Web Token) containing information about the authentication event, like when it occurred, and also about the subject / user (the specification defines a UserInfo Endpoint to obtain user details). Going back to the Facebook example, the client not only relies on Facebook to provide an opaque access token for status updates, but also an ID token which the client can consume to verify that the user actually authenticated with Facebook. It can also fetch additional user details it needs via Facebook's UserInfo Endpoint. The diagram below, from the OpenID Connect spec, shows the protocol flow.

[Diagram: OpenID Connect protocol flow, from the OpenID Connect specification]
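
Since the ID token is just a JWT, a relying party can peek at the authentication claims by base64url-decoding the payload segment. Below is a minimal sketch of that decoding; note it only reads the standard iss / aud / iat / sub claims and deliberately skips signature validation, which a real relying party must do with a proper JWT library before trusting any of these values.

using System;
using System.Text;
using System.Text.Json;

public static class IdTokenReader
{
    public static void PrintAuthenticationClaims(string idToken)
    {
        string payload = idToken.Split('.')[1];                 // header.payload.signature
        payload = payload.Replace('-', '+').Replace('_', '/');  // base64url -> base64
        payload = payload.PadRight(payload.Length + (4 - payload.Length % 4) % 4, '=');

        using var doc = JsonDocument.Parse(Encoding.UTF8.GetString(Convert.FromBase64String(payload)));
        var claims = doc.RootElement;

        Console.WriteLine($"issuer   : {claims.GetProperty("iss")}");   // who authenticated the user
        Console.WriteLine($"audience : {claims.GetProperty("aud")}");   // which client the token is for
        Console.WriteLine($"issued at: {claims.GetProperty("iat")}");   // when it happened (epoch seconds)
        Console.WriteLine($"subject  : {claims.GetProperty("sub")}");   // the user identifier
    }
}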

OP in the diagram above stands for OpenID Provider. Every OpenID Provider publishes its discovery details via a JSON document found by concatenating the provider URL with /.well-known/openid-configuration. This document has all the provider details, including the Authorization, Token and UserInfo Endpoints. Let's see a quick example with a Microsoft offering called Azure Active Directory (Azure AD). Azure AD, being an OpenID Provider, has the OpenID configuration for its tenant demoad2.onmicrosoft.com available at https://login.microsoftonline.com/demoad2.onmicrosoft.com/.well-known/openid-configuration.
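
As a rough sketch of how a client would consume that document, the snippet below fetches the configuration and prints the endpoints a client needs, assuming the provider exposes the standard authorization_endpoint, token_endpoint and userinfo_endpoint fields from the discovery specification.

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class OpenIdDiscovery
{
    public static async Task PrintEndpointsAsync(string providerUrl)
    {
        using var http = new HttpClient();

        // The well-known path is appended to the provider / tenant URL.
        string json = await http.GetStringAsync(providerUrl.TrimEnd('/') + "/.well-known/openid-configuration");

        using var doc = JsonDocument.Parse(json);
        var config = doc.RootElement;
        Console.WriteLine(config.GetProperty("authorization_endpoint"));
        Console.WriteLine(config.GetProperty("token_endpoint"));
        Console.WriteLine(config.GetProperty("userinfo_endpoint"));
    }
}

// Usage, following the Azure AD example above:
// await OpenIdDiscovery.PrintEndpointsAsync("https://login.microsoftonline.com/demoad2.onmicrosoft.com");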

Fairly digestible, isn’t it 🙂 ?


Evolution of software architecture

Software architecture has been an evolutionary discipline, from monolithic mainframes to recent microservices. It's easier to understand these software architectures from an evolution standpoint than to grasp them independently. This post is about that evolution. Let's start with mainframes.

The mainframe era was one of expensive hardware: a powerful server capable of processing large numbers of instructions, with clients connecting to it via dumb terminals. This evolved as hardware became cheaper and dumb terminals gave way to smart terminals. These smart terminals had reasonable processing power, leading to client-server models. There are many variations of the client-server model around how much processing a client should do versus the server. For instance, the client could do all the processing, with the server just acting as a centralized data repository. The primary challenge with that approach was maintenance and pushing client-side updates to all the users. This led to browser clients, where the UI is essentially rendered from the server in response to an HTTP request from the browser.

[Diagram: mainframe and client-server architectures]

As the server took on multiple responsibilities in this new world (serving UI, processing transactions, storing data and others), architects broke down the complexity by grouping these responsibilities into logical layers: UI Layer, Business Layer, Data Layer, etc. Specific products emerged to support these layers, like web servers, database servers, etc. Depending on the complexity, these layers were physically separated into tiers. The word tier indicates a physical separation, where the web server, database server and business processing components run on their own machines.

[Diagram: 3-tier architecture]

With layers and tiers around, the next big question was how to structure them and what the ideal dependencies across these layers should be, so that change can be managed better. Many architecture styles showed up as recommended practices, most notably Hexagonal (ports and adapters) architecture and Onion architecture. These styles aimed to support development approaches like Domain Driven Design (DDD), Test Driven Development (TDD), and Behavior Driven Development (BDD). The theme behind these styles and approaches is to isolate the business logic, the core of your system, from everything else. Not having your business logic dependent on UI, database, web services, etc. allows for more participation from business teams, simplifies change management, minimizes dependencies, and makes the software easily testable; a small sketch follows the diagram below.

[Diagram: Hexagonal (ports and adapters) and Onion architectures]
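
Here is a minimal sketch of the ports-and-adapters idea using hypothetical Appointment / Scheduler types: the core business logic depends only on a port (an interface it owns), while database or web concerns live in adapters at the edge and can be swapped without touching the core.

using System;
using System.Collections.Generic;

public record Appointment(Guid Id, DateTime When);

// Port: owned by the domain, implemented by the outside world.
public interface IAppointmentStore
{
    void Save(Appointment appointment);
    IEnumerable<Appointment> ForDay(DateTime day);
}

// Core business logic: no reference to UI, database or web services.
public class Scheduler
{
    private readonly IAppointmentStore _store;
    public Scheduler(IAppointmentStore store) => _store = store;

    public Appointment Book(DateTime when)
    {
        // Stays testable because the store can be a fake in tests.
        var appointment = new Appointment(Guid.NewGuid(), when);
        _store.Save(appointment);
        return appointment;
    }
}

// Adapter at the boundary; a SQL or document-store version could replace it.
public class InMemoryAppointmentStore : IAppointmentStore
{
    private readonly List<Appointment> _items = new();
    public void Save(Appointment appointment) => _items.Add(appointment);
    public IEnumerable<Appointment> ForDay(DateTime day) =>
        _items.FindAll(a => a.When.Date == day.Date);
}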

The next challenge was scale. As compute became cheaper, technology became a way of life, causing disruption and challenging the status quo of established players across industries. The problems are different now: we are no longer talking of apps that are internal to an organization, or of mainframes where users are OK with longer wait times. We are talking of a global user base with sub-second response times. The simpler approaches to scale were better hardware (scale up) or more hardware (scale out). Better hardware is simple but expensive; more hardware is affordable but complex. More hardware means your app runs on multiple machines and user data is distributed across those machines. This leads us to the famous CAP (Consistency, Availability and Partition tolerance) theorem. While there are many articles on CAP, it essentially boils down to this: network partitions are unavoidable and we have to accept them, which requires us to choose between availability and consistency. You can choose to be available and return stale data, being eventually consistent, or you can choose to be strongly consistent and give up on availability (i.e. return an error for the missing data, e.g. when you are trying to read from a different node than the one where you wrote your data). Traditional database servers are consistent and available (CA) with no tolerance for partitions (an active DB server catering to all requests). Then there are NoSQL databases with master-slave replication, configurable to support strong consistency or eventual consistency.

[Diagram: CAP theorem]

Apart from scale, today's technology systems often have to deal with contention, e.g. selecting an airline seat, or a heavily discounted product that everyone wants to buy on Black Friday. As a multitude of users try to get access to the same piece of data, it leads to contention. Scaling can't solve this contention; it can only make it worse (imagine having multiple records of the same product inventory within your system). This led to specific architecture styles like CQRS (Command Query Responsibility Segregation) and Event Sourcing. CQRS, in simple terms, is about separating writes (commands) from reads (queries). With writes and reads having separate stores and models, both can be designed optimally. Write stores in such scenarios typically use Event Sourcing to capture entity state for each transaction. Those transactions are then played back to the read store, making writes and reads eventually consistent. This model of being eventually consistent has implications and needs to be worked through with the business to keep the customer experience intact. E.g. Banana Republic recently allowed me to order an item. They took my money, and later during fulfillment they realized they were out of stock (that is when things became eventually consistent). They refunded my money, sent me a sorry email and offered me a 10% discount on my next purchase to keep me as a valued customer. As you can see, CQRS and Event Sourcing come with their own set of tradeoffs. They should be used wisely for specific scenarios rather than as an overarching style.

[Diagram: CQRS with Event Sourcing]
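
To make the shape of CQRS with Event Sourcing concrete, here is a minimal in-memory sketch: commands append events to a write store, and a projection replays those events into a read model that queries hit. The types are illustrative only; a real system would use durable stores and asynchronous replay, which is exactly where the eventual consistency discussed above shows up.

using System;
using System.Collections.Generic;

public record ItemOrdered(string OrderId, string Sku, int Quantity);

public class OrderWriteStore
{
    private readonly List<ItemOrdered> _events = new();              // event-sourced write side

    public void Handle(ItemOrdered command) => _events.Add(command); // command path: append only

    public IReadOnlyList<ItemOrdered> Events => _events;
}

public class OrderReadModel
{
    private readonly Dictionary<string, int> _unitsBySku = new();    // shaped for queries

    // Replaying events brings the read side in line with the write side (eventually consistent).
    public void Project(IEnumerable<ItemOrdered> events)
    {
        foreach (var e in events)
            _unitsBySku[e.Sku] = _unitsBySku.GetValueOrDefault(e.Sku) + e.Quantity;
    }

    public int UnitsOrdered(string sku) => _unitsBySku.GetValueOrDefault(sku);   // query path
}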

Armed with the above knowledge, you are now probably thinking: can we use these different architecture styles within a single system? For instance, have parts of your system use 2-tier, others use CQRS, and others use Hexagonal architecture. While this might sound counterproductive, it actually isn't. I remember building a system for healthcare providers where every use case was so different (appointments, patient registration, health monitoring, etc.) that using a single architecture style across the system was definitely not helping us. Enter Microservices. The microservices architecture style recommends breaking your system into a set of services. Each service can then be architected independently, scaled independently, and deployed independently. In a way, you are now dealing with vertical slices of your layers. Having these slices evolve independently allows you to adopt a style that's more befitting to the slice in context. You might ask: while this makes sense for the architecture, won't you just have more infrastructure to provision, more units to deploy and more stuff to manage? You are right, and what really makes microservices feasible is the agility ecosystem comprising cloud, DevOps, and continuous delivery, which bring automation and sophistication to your development processes.

[Diagram: microservices architecture]

So does this evolution make sense? Are there any gaps in here which I could fill? As always, I will look forward to your comments.

Image Credits: Hexagonal Architecture, Microservices Architecture, 3-Tier architecture, client server architecture, CQRS architecture, Onion architecture, CAP Theorem

Windows Azure Portals and Access Levels

When you sign up for Windows Azure you get a subscription and you are made the Service administrator of that subscription.


While this creates a simple access model, things do get a little complicated in an enterprise where users need various levels of access. This blog post will help you understand these access levels.

Enterprise Administrator
Enterprise Administrator has the ability to add or associate Accounts to the Enrollment and can view usage data across all Accounts. There is no limit to the number of Enterprise Administrators on an Enrollment.
Typical Audience: CIO, CTO, IT Director
URL to GO: https://ea.windowsazure.com

Account Owner
The Account Owner can add Subscriptions for their Account, update the Service Administrator and Co-Administrator for an individual Subscription, and view usage data for their Account. By default all subscriptions are named 'Enterprise' on creation; you can edit the name post creation in the account portal. Under an EA, only Account Administrators can sign up for Preview features. The recommendation is to create accounts along functional, business or geographic divisions, though creating a hierarchy of accounts would help larger organizations.
Typical Audience: Business Heads, IT Divisional Heads
URL to GO: https://account.windowsazure.com

Service Administrator
The Service Administrator and up to nine Co-Administrators per Subscription have the ability to access and manage Subscriptions and development projects within the Azure Management Portal. The Service Administrator does not have access to the Enterprise Portal unless they also hold one of the other two roles. It's recommended to create separate subscriptions for Development and Production, with Production having strictly restricted access.
Typical Audience: Project Manager, IT Operations
URL to GO: https://manage.windowsazure.com

Co-Administrators
Subscription co-administrators can perform all tasks that the service administrator for the subscription can perform. A co-administrator cannot remove the service administrator from a subscription. The service administrator and co-administrators for a subscription can add or remove co-administrators from the subscription.
Typical Audience: Test Manager, Technical Architect, Build Manager
URL to GO: https://manage.windowsazure.com

That's it! With the above know-how you can create an EA setup like the one below.

[Diagram: example EA setup]

Hope this helps 🙂

Windows Azure vs. Force.com vs. Cloud Foundry

Below is a brief write up of some personal views. Let me know your thoughts.

Windows Azure is the premier cloud offering from Microsoft. It has a comprehensive set of platform services ranging from IaaS to PaaS to SaaS. This is a great value proposition for many enterprises looking to migrate to the cloud in a phased manner: first move as-is with IaaS and then evolve to PaaS. In addition, Azure has deep integration across Microsoft products, including SharePoint, SQL Server, Dynamics CRM, TFS, etc. This translates to an aligned cloud roadmap, committed product support and license portability. Though .NET is the primary development environment for the Azure platform, most of the Azure services are exposed as REST APIs, and there are Java, Ruby and other SDKs available which allow a variety of developers to easily leverage the Azure platform. Azure also allows customers to spawn Linux VMs, though that's limited to the IaaS offerings.

Force.com allows enterprises to extend Salesforce.com, the CRM from Salesforce. Instead of just providing SDKs and APIs, Salesforce has created Force.com as a PaaS platform, so that you focus only on building extensions; the rest is managed by Salesforce. Salesforce also provides a marketplace, 'AppExchange', where companies can sell these extensions to potential customers. Though Force.com offers an accelerated development platform (abstracting many programming aspects), programmers still need to learn the Apex programming language and related constructs. Some enterprises are considering Force.com as their de-facto programming platform, taking it beyond the world of CRM. It's important to understand that the applicability of Force.com for such scenarios would typically be limited to transactional business applications. So where should enterprises go when they need to develop custom applications with different programming stacks and custom frameworks? Salesforce's answer is Heroku. Heroku supports all the major programming platforms, including Ruby, Node.js, Java, etc., with the exception of .NET. Heroku uses Debian and Ubuntu as the base operating system.

Many enterprises today are hesitant about moving to a PaaS cloud, citing vendor lock-in. For instance, if they move to the Azure PaaS platform their applications would run only on Azure, and they would have to remediate them to port to AWS. It would definitely be great to have a PaaS platform agnostic of a vendor. This is the idea behind the open source PaaS platform Cloud Foundry, an effort backed by VMware and EMC. VMware offers a hosted Cloud Foundry solution, with the underlying infrastructure being vCloud. Cloud Foundry supports various programming languages like Java, Ruby, Node.js, etc. and services like MySQL, MongoDB and RabbitMQ, among others. VMware also offers vFabric, a PaaS platform focused on the Java Spring framework. vFabric is an integrated product with VMware infrastructure, providing a suite of offerings around runtime, data management and operations. I feel the future of vFabric is likely to depend on the industry adoption of Cloud Foundry (there is also another open source PaaS effort being carried out by Red Hat, called OpenShift).

Overview of VMware Cloud Platform

Continuing my discussion on major cloud platforms, in this post I will talk about VMware (a subsidiary of EMC), one of the companies that pioneered the era of virtualization. VMware's flagship product is ESX (vSphere being the product that bundles ESX with vCenter), a hypervisor that runs directly on the hardware (bare metal). As you would expect, VMware is a major player in the private cloud and data center space. It also has a public IaaS (Infrastructure as a Service) cloud offering and supports an open source PaaS platform (understandably, no SaaS offerings). Below is a quick overview of VMware's offerings.

Private Cloud – vCloud Suite is an end-to-end solution from VMware for creating and managing your own private cloud. The solution has two major components: Cloud Infrastructure and Cloud Management. Cloud Infrastructure components include VMware products like vSphere (the cloud OS controlling the underlying infrastructure) and vCloud Director (a multitenant self-service portal for provisioning VM instances based on vApp templates), while Cloud Management consists of operational products like vCenter (a centralized, extensible platform for managing infrastructure), among others. There are also vCloud SDKs available which you can use to customize the platform to specific business requirements. Also, with last year's acquisition of DynamicOps (now called vCloud Automation Center), VMware is extending its product support to other hypervisors in the market. Other vendors like Microsoft are evolving similar offerings with Hyper-V, System Center, SPF and Windows Azure Services. It's important to note, though, that quite a few enterprises operate a private-cloud-like setup using vSphere alone and build custom periphery around it as necessary.

Public Cloud – In case you don't have the budget to set up your own datacenter, or are looking at a hybrid approach which lets you do a cloud burst for specific use cases, you can leverage VMware's vCloud Hybrid Service (AKA vCHS). The benefit here is that migration and operations remain seamless, as you would use the same tools (and seamlessly extend your processes) that were being used for your in-house private cloud.

PaaS Cloud – VMware has a PaaS offering for private clouds called vFabric. The vFabric application platform contains various products focused on the Java Spring Framework stack. Architects can create a deployment topology for their multi-tier applications using drag and drop. Not only can they automate provisioning, but they can also scale their applications in accordance with business demand. In addition, VMware is also funding an open source PaaS platform called Cloud Foundry (CF). The value proposition here is that you can move this platform to any IaaS vendor (vCloud, OpenStack, etc.), so when you switch between cloud vendors you don't have to modify your applications. This is contrary to other PaaS offerings which are tied to the underlying infrastructure; e.g. an application ready for Azure PaaS would have to undergo remediation to be hosted on Google's PaaS. Also, being open source, you can customize the CF platform to suit your needs (there is a similar effort being carried out by Red Hat called OpenShift).

Finally, you might hear the term vBlock (or vBlock Systems) in the context of VMware. VCE (Virtual Computing Environment), the company which manufactures vBlock Systems, was formed through a collaboration of Cisco, EMC and VMware. These vBlock racks contain Cisco's servers and switches, EMC's storage and VMware's virtualization. There are quite a few service providers using vBlock to create their own set of cloud offerings and services.

Hope this helps!

Overview of Google Cloud Platform

In the next few posts, I will try to give a brief overview of the major cloud computing platforms. As I started writing this post, it reminded me of an incident. A few years back I was chatting with a Microsoft architect. He proudly told me that if Google were to shut down tomorrow, none of the enterprises would care. Well, since then things have changed. From a provider of a search engine, email and a mobile platform (Android), Google has made its way into enterprises. To add another experience, recently I was visiting a Fortune customer and saw one of the account managers using Gmail. While my first reaction was that he shouldn't be checking his personal email at work (we were discussing something important), he was in fact replying to an official email. I learned from him that they were among the early adopters of Google Apps. With those anecdotes out of the way, below is a quick overview of the Google cloud platform.

Google Apps – You can think of Google Apps as a SaaS offering along the lines of Microsoft Office 365. It includes Gmail, Google Calendar, Docs, Sites, Videos, etc. The value proposition is that you can customize these services under a domain name (i.e. white label). Google charges a per-user monthly fee for these services (this fee is applicable to Google Apps for Business; Google also offers a free version for educational institutions under the brand Google Apps for Education). In addition, Google has created a marketplace (Google Apps Marketplace) where organizations can buy third party software (partner ecosystem) which further extends Google Apps. As you would expect, Google also provides infrastructure and APIs for third party software developers.

Google Compute Engine – GCE is the IaaS offering of Google. Interestingly, it offers sub-hour billing calculated at the minute level, with a minimum of 10 minutes. For now only Linux images / VMs are supported. Here's a Hello World to get started with GCE. Note that you need to set up your billing profile to get started with GCE.

Google App Engine – GAE is an ideal platform to create applications for the Google Apps Marketplace. A PaaS offering from Google, it is easy to scale as your traffic and data grow. Like Microsoft's Windows Azure Web Sites, you can serve your app from a custom domain or use a free name on the appspot.com domain. You can write your applications using Java, Python, PHP or Go. You can download the respective SDKs from here, along with a plugin for Eclipse (the SDKs come with an emulator to simplify the development experience). With App Engine you are allowed to register up to 10 applications per account, and all applications can use up to 1 GB of storage and enough CPU and bandwidth to support an application serving around 5 million page views a month at no cost. Developers can also use NoSQL (App Engine Datastore) and relational (Google Cloud SQL) stores for their application data. Google Cloud Storage, a similar offering to Windows Azure Blob Storage, allows you to store files and objects up to terabytes in size. App Engine also provides additional services such as URL Fetch, Mail, Memcache, Image Manipulation, etc. to help perform common application tasks.

Google BigQuery – BigQuery is an analytics tool for querying massive datasets. All you need to do is move your dataset to Google's infrastructure; after that, you can query the data using SQL-like queries. These queries can be executed from a browser, from the command line, or even from your application by making calls to the BigQuery REST API (client libraries are available for Java, PHP and Python).

So, in a nutshell these are the major offerings of Google Cloud platform encompassing SaaS, PaaS and IaaS. Google Apps appears to be the most widely used of all offerings, with Google claiming more than 5 million businesses running on it.

Hope you found this overview useful.

RTO vs. RPO

When talking of IT service continuity planning, an IT aspect of business continuity planning, the terms RTO and RPO have become commonplace. While both terms can have different meanings depending on the context, for IT they largely represent acceptable downtime, or the time to recover IT operations to normal. Below is a brief overview.

RTO – Recovery Time Objective is the permissible system downtime after a breakdown event. If downtime exceeds this limit, it's bound to impact the business (most likely financially). RPO – Recovery Point Objective is the permissible amount of data loss, measured in time, during a failover. Though RPO is an involved term, for a simple example consider an RPO limit of 2 hours set by company X: during a disaster event, when the secondary site is activated, the data loss (sync window) between primary and secondary shouldn't be more than 2 hours.

Normally there isn't one RTO and RPO for a given organization; rather, they differ and are attributed to the service / system in context. Systems with aggressive RTO / RPO are costlier to run compared to the ones with relaxed guidelines. Most enterprises mandate SLAs around RTO / RPO from their service providers. Also, if your primary focus is just around databases, you can pick one of these approaches. Please leave your comments below with additional thoughts on this topic.

Big Data, NoSQL and MapReduce

Consider a hypothetical scenario. Your company has got the project to design a new website for the channel airing the IPL (if you haven't heard of the IPL, just pick any sport you love). The channel wants to create this new website where users can register and create their own discussion rooms for discussing a specific aspect of a match, a specific player, or anything else. You have been assigned as the lead architect on this project. Among other challenges, you are having nightmares thinking about the non-functional requirements (NFRs) to be met for this project (your competition was fired, as their traditional 3-tier architecture wasn't holding up). You know you have to do something different, but you are not sure exactly what and how. If this resonates with you, keep reading.

Big Data – As the name suggests, Big Data is about huge and fast-growing data, though how huge and how fast is left to one's discretion. Big Data, initially attributed to search engines and social networks, is now making its way into enterprises. The primary challenges while working with Big Data are how to store it and how to process it. There are other challenges too, like visualization and data capture itself, but I will omit them in this post. Let's start with storage first, by understanding NoSQL.

NoSQL is an umbrella term for non-relational databases which don't use SQL (Structured Query Language). NoSQL databases, unlike relational databases, are designed to scale horizontally and can be hosted on a cluster. Most of these databases are key-value stores (e.g. Riak), where each row is a key-value pair. The important thing to note here is that the value doesn't have a fixed schema; it can be anything: a user, a user profile, or an entire discussion. There are two major variants of key-value databases: document databases (e.g. MongoDB) and column-family databases (e.g. Cassandra). Both of them extend the basic premise of a key-value store to allow easy search on data contained inside the value object. A document database imposes a structure on the stored value, allowing queries on internal fields. A column-family database, on the other hand, stores the value across multiple column-value pairs (you can also think of it as a second-level key-value pair) and then groups them into a coherent unit called a column family.
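
The tiny sketch below contrasts the two views of the same data: a pure key-value store treats the value as an opaque blob, while a document store understands the value's structure and can query internal fields. The data shapes are illustrative only, not tied to any particular product.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class NoSqlSketch
{
    public static void Main()
    {
        // Key-value view: the value is just a blob - no schema, no field-level queries.
        var keyValueStore = new Dictionary<string, string>
        {
            ["discussion:42"] = "{\"topic\":\"Death overs\",\"player\":\"Dhoni\",\"posts\":128}"
        };

        // Document view: the same value parsed as a document, so we can query inside it.
        var documents = keyValueStore.Values.Select(v => JsonDocument.Parse(v).RootElement);
        var dhoniThreads = documents.Where(d => d.GetProperty("player").GetString() == "Dhoni");

        Console.WriteLine(dhoniThreads.Count());   // 1
    }
}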

Before you think you have found the storage panacea and are ready to go, you need to take care of a few important aspects related to distributed databases: scalability, availability, and consistency.

First is scaling via sharding to meet your data volume. The good part is that most NoSQL databases support auto-sharding, which means shards are automatically balanced across the nodes of a cluster. You can also add nodes to your cluster as necessary, to align with data volume. But what if a node goes down? How can we still make the shards available? We need to mitigate these failures by making our system highly available.

Availability can be achieved via replication. You can set up master-slave replication or peer-to-peer replication. With master-slave replication you typically set up three nodes, including the master, and all writes go to the master node. Reads, though, can happen from any node, either the master or a slave. If the master node goes down, a slave gets promoted to master and continues to replicate to the third node. When the failed master comes back up, it joins the cluster as a slave. In contrast, peer-to-peer replication is slightly more complex. Here, unlike master / slave, all the nodes receive read / write requests, and the shards are replicated bidirectionally. While this looks good, just remember that when we use replication we will run into consistency issues due to latency.

There are two major types of inconsistencies: read and write. Read inconsistencies arise in master / slave replication when you try to read from a slave before changes propagate from the master. In peer-to-peer replication you will run into both read and write inconsistencies, as writes (updates) are allowed on multiple nodes (think of two people trying to book movie tickets at the same time). As you would have observed, availability and consistency are in tension with each other (check out the CAP theorem for more details). What the right balance is, is purely contextual. For instance, you can prohibit read and write inconsistencies by keeping slaves only as hot standby and not reading from them.

Let's now see how you can process Big Data: the compute aspect. Processing massive amounts of data needs a shift from the client-server model of data processing, wherein the client pulls the data from the server. Instead, the emphasis is on running the processing on the cluster nodes where the data is present, by pushing the code to them. In addition, this processing can be carried out independently in parallel, as the underlying data is already partitioned across nodes. This way of processing is referred to as the MapReduce pattern, and interestingly it also uses key-value pairs.

Extending our IPL example, consider that you want to list the top players being discussed across all the forums. This means you need to iterate through each discussion in our NoSQL store and then identify the occurrences of each player. Applying MapReduce here, we start with the map function. A single discussion (key-value pair) is the input to the map function, which produces key-value pairs as output, with the key being the player name and the value indicating the number of occurrences. All the occurrences (values) for a given player (key) across nodes are then passed to a reduce function for aggregation.
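
Here is an in-memory sketch of that map / reduce shape. A real cluster (e.g. Hadoop) would run the map phase on the nodes holding the discussions and shuffle the intermediate pairs to reducers; below, LINQ's GroupBy stands in for the shuffle, and the player names and discussions are made-up sample data.

using System;
using System.Collections.Generic;
using System.Linq;

public static class PlayerMentionCount
{
    // Map: one discussion in, zero or more (player, 1) pairs out.
    public static IEnumerable<KeyValuePair<string, int>> Map(string discussion, string[] knownPlayers) =>
        discussion.Split(' ')
                  .Where(word => knownPlayers.Contains(word))
                  .Select(player => new KeyValuePair<string, int>(player, 1));

    // Reduce: all counts for one player aggregated into a single total.
    public static KeyValuePair<string, int> Reduce(string player, IEnumerable<int> counts) =>
        new KeyValuePair<string, int>(player, counts.Sum());

    public static void Main()
    {
        string[] players = { "Dhoni", "Kohli" };
        string[] discussions =
        {
            "Dhoni finished it in style Dhoni",
            "Kohli chased it down with Dhoni at the other end"
        };

        var totals = discussions
            .SelectMany(d => Map(d, players))                       // map phase
            .GroupBy(pair => pair.Key)                              // shuffle: group by player
            .Select(g => Reduce(g.Key, g.Select(p => p.Value)));    // reduce phase

        foreach (var t in totals)
            Console.WriteLine($"{t.Key}: {t.Value}");               // Dhoni: 3, Kohli: 1
    }
}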

Most MapReduce frameworks allow you to control the number of mapper and reducer instances through configuration. While reduce functions normally operate on a single key, there is also the concept of a partition function, which allows you to send multiple keys to a single reducer, helping you evenly distribute the load across reducers. Finally, as you would have guessed, mappers and reducers could be running on different nodes, and this requires map output to be moved across to the reducers. To minimize this data movement, you can introduce combiners, which perform a local reduce job; in our case all the player occurrences can be aggregated at the node level before being passed on to the reducer. Most NoSQL databases have their own way of abstracting / implementing MapReduce via queries and other mechanisms. You can also use Hadoop and related technologies like HDFS for your MapReduce workload without using NoSQL databases.


Hope this overview has helped you understand the big picture of how these technologies fit together.

Dependency Inversion, Dependency Injection, DI Containers and Service Locator

Here is a post I have been planning to do for years 🙂 . Yes, I am talking about Dependency Injection, which by this time has most likely made its way into your code base. I got to recollect a few thoughts around it in a recent discussion and hence am writing them down.

Dependencies are the common norm of object-oriented programming, helping us adhere to software principles like Single Responsibility, Encapsulation, etc. Instead of establishing dependencies through direct references between classes or components, they are better managed through an abstraction. Most of the GoF patterns are based around this principle, which is commonly referred to as 'Dependency Inversion'. Using the Dependency Inversion principle we can create flexible object-oriented designs, making our code base reusable and maintainable.

To further enhance the value proposition of dependency inversion, you can pass the dependencies in via a constructor or a property from the root of your application, instead of instantiating them within your class. This allows you to mock / stub your dependencies and makes your code easily testable (read: unit testing).

E.g.

IA ob = new A(); //Direct instantiation
public D (IA ob) { … } //constructor injection
public IA ob { get; set; } //property injection – create object and set the property
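
To see why this makes unit testing easy, here is a small sketch with a hypothetical IEmailSender / OrderService pair (not from any specific framework): constructor injection lets a test pass in a hand-rolled fake instead of a real mail gateway.

using System;
using System.Collections.Generic;

public interface IEmailSender { void Send(string to, string body); }

public class OrderService
{
    private readonly IEmailSender _email;
    public OrderService(IEmailSender email) => _email = email;   // constructor injection

    public void Confirm(string customer) => _email.Send(customer, "Your order is confirmed");
}

// A hand-rolled stub; a mocking framework could do the same job.
public class FakeEmailSender : IEmailSender
{
    public List<string> Recipients { get; } = new();
    public void Send(string to, string body) => Recipients.Add(to);
}

// In a unit test: new OrderService(new FakeEmailSender()).Confirm("a@b.com");
// then assert on FakeEmailSender.Recipients - no SMTP server needed.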

The entire software industry sees value in the above approach, and most people refer to this entire structure (pattern) as Dependency Injection. But beyond this, things get a little murkier.

Confusion starts for programmers who see the above as a standard way of programming. To them, Dependency Injection (DI) is the automated way of injecting dependencies using a DI container. (A DI container is at times also referred to as an IoC (Inversion of Control) container. As Martin Fowler points out in his classic article, IoC is a generic term used in quite a few cases, e.g. the callback approach of the Win32 programming model. In order to avoid confusion and keep this approach discernible, the term DI was coined. In this case, IoC refers to inverting the creation of dependencies: they are created outside of the dependent class and then injected.) If you are surprised by this statement, let me tell you that 90% of the people who have walked up to me to ask if I was using DI wanted to know about the DI container and my experience with it.

So what the heck is this 'DI container'? A DI container is a framework which can identify constructor arguments or properties on the objects being created and automatically inject them as part of object creation. Coming from the .NET world, the containers I use include StructureMap, Autofac and Unity. DI containers can be wired up with a few lines of code at the start of your program, or you can even specify the configuration in an XML file. Beyond that, containers are transparent to the rest of your code base. Most containers also provide AOP (Aspect Oriented Programming) functionality and its variants. This allows you to bundle cross-cutting concerns like database transactions, logging, caching, etc. as aspects and avoid boilerplate code throughout the system (I have written a CodeProject article on those lines). Before you feel I am oversimplifying things, let me state that if you haven't worked with DI containers in the past, you are likely to be faced with a learning curve. As is the case with most other frameworks, a pilot is strongly recommended. As a side note, the preferred injection rule is: unless your constructor requires too many parameters (ensure you haven't violated SRP), you should resort to constructor injection and avoid property injection (see Fowler's article for a detailed comparison between the two injection types).

Finally, let's talk about Service Locator, an alternative to DI. A Service Locator holds all the services (dependencies) required by your system (in code or via a configuration file) and returns a specific service instance on request. A Service Locator can come in handy in scenarios where a DI container is not compatible with a given framework (e.g. WCF, ASP.NET Web APIs) or where you want more control over object creation (e.g. creating the object late in the cycle). While you can mock a service locator, mocking it is a little more cumbersome compared to DI. Service Locator is generally seen as an anti-pattern in the DI world. Interestingly, most DI containers offer APIs which allow us to use them as a Service Locator (look for Resolve / GetInstance methods on the container).

The sample below shows StructureMap, a DI container, in action (use NuGet to add the StructureMap dependency).

using System;
using StructureMap;

class Program
{
    static void Main(string[] args)
    {
        // Wire up the container: scan the calling assembly and map
        // IDependency -> Dependency using default conventions.
        var container = new Container(registry =>
        {
            registry.Scan(x =>
            {
                x.TheCallingAssembly();
                x.WithDefaultConventions();
            });
        });

        // The container builds DependentClass and injects IDependency for us.
        var ob = container.GetInstance<DependentClass>();
        ob.CallDummy();
    }
}

public interface IDependency
{
    void Dummy();
}

public class Dependency : IDependency
{
    public void Dummy()
    {
        Console.WriteLine("Hi There!");
    }
}

public class DependentClass
{
    private readonly IDependency _dep;

    public DependentClass(IDependency ob)
    {
        _dep = ob;
    }

    public void CallDummy()
    {
        _dep.Dummy();
    }
}

I will try to post some subtle issues around DI containers in the future. I hope the above helps anyone looking for a quick start on DI and associated terms 🙂 .

What is DMZ?

A very brief introduction. The DMZ is an element which most architects miss out on in their deployment architectures (except the few who run their designs through IT pros). The term stands for "Demilitarized Zone", an area often found on the perimeter of a country's border. These areas are typically not guarded (under treaties between two or more countries).

In the IT domain we refer to a DMZ as a separate network. So why create separate networks or DMZs? The simplest answer is to enhance security. For example, consider hosting the public-facing websites of your business. You might want to host these sites inside a DMZ, separate from your corporate network. In case the security of your site is compromised, your corporate network is still safe. Many architects also prefer hosting web and database servers in different DMZs to ensure there is no compromise of data in case a hacker breaks into their web servers. As elsewhere, you can use routers to transfer data to DMZ networks. While a DMZ is a separate network, you must have enough defense packed into it. An enterprise firewall (with redundancy, of course) is a minimum recommendation. Enterprises are also known to have multiple firewalls for each of their DMZs / networks. Below is a simplistic diagram of a DMZ deployment.

Hope this helps 🙂 !