Simplified Docker Networking for Developers

When I joined Docker Inc., my hiring manager mentioned that everything they do is fairly simple except networking. While most will agree with that statement, it's even more true from a developer standpoint. If you search for any Docker networking article, it invariably gets into the details of network namespaces, netns commands, ip links, iptables, IPVS, veths, MAC addresses, and so on. While the underpinnings are important, what most of those articles fail to do is paint a simple picture for developers who just want to understand the overall traffic flow and how communication between containers works. That's a common pain point I have heard from most of my customers, so I thought I would try to simplify this convoluted topic. Hope you find my attempt useful.

Let's start by talking about networking in general, which is about making computers (VMs and devices) talk to each other. Networking primarily relies on two components: a bridge (or switch) and a router. Interestingly, most of us already use these components at home to connect our devices to the internet. If you need a refresher, check out this link on home networking. Docker containers are no different from those devices, and to connect containers we need a similar setup. There are two parts to establishing connectivity – connecting containers running on a single VM, and connecting containers spread across VMs (nodes).

Connecting containers on a single VM is simple and, I guess, well understood. Docker uses a virtual bridge to connect all the containers running on a single VM. This bridge is called 'docker0', and all the containers running on a standalone Docker instance are connected to it. One end of the bridge is connected to the host ethernet interface, which allows inbound traffic to containers and outbound traffic from containers for accessing external services. docker0 also has a default IP CIDR range of 172.17.0.0/16 assigned, out of which individual containers get their own IP addresses. To access a container externally you can map a host port to a container port, something like "docker container run -d -p 8585:80 nginx". With that you should be able to browse to the NGINX container using <host-ipaddress>:8585.
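As a quick sanity check, you can inspect the default bridge network to see that CIDR range and the containers attached to it. A minimal sketch (the default bridge network is simply named 'bridge' and is backed by the docker0 interface):

# list the networks Docker creates by default (bridge, host, none)
docker network ls

# shows the 172.17.0.0/16 subnet and the containers attached to docker0
docker network inspect bridge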

Docker containers communicating on a single VM

Now what about containers running across nodes, which is typical with Swarm – running Docker containers at scale? How do we connect all these containers? While the VMs themselves are typically connected over a network, that network has no clue about the containers running on them or the IPs assigned to those containers. It would be very tedious to make physical network changes to enable this communication. That's where Docker leverages software-defined networking, using VXLAN-based overlay networks (note that the VXLAN overlay is just one of the network drivers supported by Docker; for a more comprehensive description of Docker networking refer to this article). When you initialize Swarm on a Docker node ("docker swarm init") it creates two networks by default – docker_gwbridge and ingress (we will get to ingress at the end) – to enable cross-node container communication.

Let's take an example – consider a UI container and a DB container, say C1 and C2, where you want C1 to communicate with C2 and the two containers are running on different nodes. To make this communication possible you create a user-defined overlay network and attach the C1 and C2 containers to it.

docker network create --driver overlay appnet
docker service create --network appnet nginx # Attach UI or DB to Overlay

Yes, it's that simple. When you create the overlay network 'appnet', Docker assigns it a CIDR range, and attached containers are assigned IPs from that range. You can also customize this CIDR pool for all overlay networks at the time of swarm initialization with "docker swarm init --default-addr-pool <ip-range>", or per network with the --subnet flag, as shown below.
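Here's a hedged sketch of both options (the address ranges are just examples):

# customize the pool used by all overlay networks (set at swarm init time)
docker swarm init --default-addr-pool 10.20.0.0/16

# or set an explicit subnet for an individual overlay network
docker network create --driver overlay --subnet 10.0.9.0/24 appnet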

Here's a diagrammatic view of the above setup – an overlay network with a CIDR range of 10.0.1.0/24 connecting individual containers with IPs 10.0.1.3 and 10.0.1.4.

Docker containers communicating across VMs using Overlay network

So is that all? Mostly yes, but we still need to figure out inbound and outbound access for these containers. Let's start with external access – how do containers running on these overlay networks reach out to the external ecosystem? What if the database we just mentioned is not running in the Swarm cluster but outside of it? For standalone Docker nodes this was handled by connecting all containers to the docker0 bridge. Following a similar pattern in Swarm mode, Docker uses a gateway bridge (docker_gwbridge), which is created on every node at the time the node joins the Swarm cluster. Each container in an overlay network has a gateway endpoint attached to docker_gwbridge for external access. To verify this you can inspect the docker_gwbridge network and find the container ID of the attached container, as shown below.
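A minimal check, assuming a container from the overlay network is running on the node you inspect:

# the container's ID should appear under "Containers", confirming its gateway endpoint
docker network inspect docker_gwbridge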

Docker containers communicating across nodes using overlay and docker_gwbridge networks

You might wonder why the gateway bridge is connected to individual containers and not to the overlay (VXLAN) network as a whole. That's because containers can be running on any node, and funnelling the entire overlay's external traffic through a single gateway bridge would be unreliable and slow.

That brings us to the last component – inbound access to containers. You might be tempted to think it should be straightforward: after all, every container has outbound access through docker_gwbridge, and just like the standalone Docker approach we could use the '-p' flag to allow access via a host port. But things aren't that simple. While port-based host access would work, you are pinning your access to a specific node, and if the container were to move to a different node you would have to change your request URL to point to that node. Take a pause and think it over.

To solve this puzzle Docker uses the 'ingress' network. Ingress (also called the routing mesh) is just another overlay network, to which every node in the Docker cluster is connected via docker_gwbridge. The idea is that you can still publish your app on a specific port just like standalone Docker, but now the traffic is routed to the underlying app regardless of which node the request lands on. This is the routing magic – your request lands on node #3 and still gets routed to the container running on node #1. The magic is done by attaching all app containers that expose a published port to the ingress overlay in addition to their existing overlay networks (see the diagram below where container 'C1' is attached to both the app overlay and the ingress overlay). This is also the reason why you will see containers like C1 having dual IPs. This attachment of containers to both the ingress and app overlay networks is necessary; without it there is no way for ingress to route the request. Remember, overlay networks are isolated from any external traffic, and hence the container attaches to both networks to allow for seamless routing. You can see in the commands below that when I create a new Swarm service with published ports and attach it to the app overlay network, the underlying container is attached to both the appnet and ingress overlay networks and has dual IPs.
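Here's a hedged sketch of what those commands might look like (the service name 'web' is just an example):

# publish port 8080 and attach the service to the app overlay network
docker service create --name web --network appnet -p 8080:80 nginx

# inspect both networks on the node running the task; the same container appears in each, with its own IP per network
docker network inspect appnet
docker network inspect ingress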

External access to Docker containers via Ingress routing mesh

So there you go. I have taken some liberty to abstract things out, but I would be very happy if all my customers had this level of understanding when I start working with them. You can easily extend this setup by adding an L7 load balancer / reverse proxy like Traefik, Interlock, etc. That would allow you to do all app routing through the reverse proxy, and the reverse proxy would be the only component exposed through the ingress overlay. The Docker engine has DNS/LB capability too: name resolution works for services on the same overlay network (try pinging the DB container from the UI container without an IP address, using the service name of the DB Swarm service), and there is built-in support for load balancing across service replicas via virtual IPs (you can look up the VIP using docker service inspect <service-name>). Maybe I will cover these aspects in a future post.
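For example, assuming the DB service is named 'db', a quick sketch of the built-in DNS and VIP lookup looks like this:

# service names resolve via Docker's embedded DNS (assuming ping is available in the image)
docker exec -it <ui-container-id> ping db

# view the virtual IP (VIP) that load balances across the db service replicas
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' db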

That's all for this post. Did I demystify the Docker networking magic for you? Is there anything else I could simplify further? Please let me know in the comments.

Overview of Calico CNI for Kubernetes

One of the key reasons Kubernetes (K8s) is so popular is its single-responsibility design. It largely confines itself to one specific job: scheduling and running container workloads. For the rest, it relies on container ecosystem vendors and open specifications to fill the gaps. Take for instance CRI (Container Runtime Interface) – a plugin interface which enables K8s to work with a wide variety of container runtimes, including Docker. Another such area is networking. After you schedule workloads to run on specific nodes, those workloads invariably need IP addresses, traffic routing between them, and at times even traffic blocking for security reasons. K8s could have baked a networking solution into the scheduling engine, but it didn't. Enter CNI (Container Network Interface) – an open specification for container networking plugins that has been adopted by many projects, including K8s. There are over a dozen vendors out there who have implemented the CNI specification, and Calico, an open source project from Tigera, is one of them.

Once you install K8s, the first thing you typically need to do is install a CNI plugin. For instance, if you are using kubeadm, the official documentation states –

The network must be deployed before any applications. Also, CoreDNS will not start up before a network is installed. kubeadm only supports Container Network Interface (CNI) based networks…

To install Calico you can apply a YAML manifest which creates the calico-node DaemonSet (a pod that runs on every node of the K8s cluster). Though if you really want to know the Calico installation underpinnings, it might be worth doing it the hard way.

kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml

On the other hand, if you are using a managed K8s platform like Docker EE, Calico is set up for you by default. Regardless of how you install Calico, you should also install calicoctl – the command line tool for interacting with Calico resources. Three key components make Calico work: Felix, BIRD, and etcd. Felix is the primary Calico agent that runs on each machine hosting endpoints; it programs routes and ensures only valid traffic is sent across endpoints. BIRD is a BGP client that distributes the routes created by Felix across the datacenter. Calico uses etcd as a distributed data store to ensure network consistency. Once configured, Calico provides a seamless networking experience. Customizations are typically around IPAM (IP Address Management) and network security requirements. Let's take a quick look at them, starting with IPAM.
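To verify those pieces are healthy, a quick sketch assuming calicoctl is installed on the node:

# BGP peering status reported by BIRD on this node
sudo calicoctl node status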

By default Calico uses the 192.168.0.0/16 IP pool to assign IP addresses to pods (sudo ./calicoctl get ippools -o wide). Calico allows you to create multiple IP pools and request an IP from a given pool as part of pod deployment. To request an IP from a given pool you add a K8s annotation to your pod YAML file as shown below. You can also refer to this link for detailed instructions.

annotations: "cni.projectcalico.org/ipv4pools": "[\"custompool-ipv4-ippool\"]"
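Here's a hedged sketch of that annotation in the context of a full pod spec (the pool 'custompool-ipv4-ippool' is assumed to exist already; the nginx image is just a placeholder):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: custom-pool-pod
  annotations:
    # ask Calico IPAM to allocate this pod's IP from the named pool
    cni.projectcalico.org/ipv4pools: '["custompool-ipv4-ippool"]'
spec:
  containers:
    - name: app
      image: nginx
EOF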

A network policy is a specification of how groups of pods are allowed to communicate with each other and with other network endpoints. NetworkPolicy is a standard K8s resource that uses labels to select pods and defines rules specifying what traffic is allowed to the selected pods. Calico extends this further, allowing for policy ordering/priority, deny rules, and more flexible match rules. Here's a simple Calico tutorial on how to apply NetworkPolicy.
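As an illustration (the labels role=db and role=ui are hypothetical), here's a minimal standard K8s NetworkPolicy that allows ingress to DB pods only from UI pods:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ui-to-db
spec:
  # select the pods this policy protects
  podSelector:
    matchLabels:
      role: db
  ingress:
    # only pods labeled role=ui may reach the selected pods
    - from:
        - podSelector:
            matchLabels:
              role: ui
EOF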

A couple of things before we wrap up. Most public cloud providers have a native CNI plugin, but using Calico gives you the desired portability (the same reason for using the Docker EE platform). Also, most cloud providers don't allow BGP (excluding VPN gateways and VNET peering), so it's quite possible that you would use the cloud-native plugin for routing and IPAM and leverage the Calico plugin for the extended network policy features (also note that Calico does support IPinIP and VXLAN to overcome BGP shortcomings). You can examine the CNI plugin configuration of your cluster by looking at /etc/cni/net.d/10-calico.conflist.

Hope that gave you a fair idea of where CNI and Calico fit into the K8s ecosystem. Please let me know your thoughts in the comments section below.

Deploying GPU Workloads with Docker EE & Kubernetes

In my previous post I demonstrated how easily we can set up Docker EE (Enterprise Edition) and all of its components, including UCP (Universal Control Plane) and DTR (Docker Trusted Registry), on a single node. I also outlined the steps to deploy a sample application using the Swarm orchestrator. Taking it further, in this post I will walk you through deploying GPU (Graphics Processing Unit) workloads on Docker EE using Kubernetes (K8s) as the orchestrator. GPU support in K8s is experimental, while GPU support for Swarm is still being worked on. But before we get into details, let me start with a quick perspective on GPUs.

If you are a geek, most likely you have heard of GPUs. They have been around for a while. The GPU, as you might know, was designed to perform the computations needed for 3D graphics (for instance, interactive video games), an area where CPUs fell short. That's how a computer's motherboard came to have two separate chips – one for the CPU and the other for the GPU. Technically, a GPU is a large array of small processors performing highly parallelized computation.

Cool! But why are modern cloud computing platforms rushing to augment their compute services with GPUs (AWS, Azure, GCE)? Cloud is typically used for backend processing and has nothing to do with traditional displays, so what's the rush about? The rush is to run the computational workloads that have found a sweet spot with GPUs – AI (Artificial Intelligence), machine learning, deep learning, and HPC (High Performance Computing), among others. This application of GPUs to non-display use cases is popularly referred to as GPGPU – General-Purpose computing on GPUs. Do note that while it's still possible to run these workloads on traditional CPUs, with GPUs the processing time is reduced from hours to minutes. Nvidia is the leading manufacturer of GPUs, followed by AMD Radeon. At the same time, cloud providers are coming up with their own native chips to cater to AI workloads.

So how can developers write programs that leverage GPUs? Well, a lot of that depends on the manufacturer and the tools they have built around their hardware. For this post let's focus on Nvidia and CUDA (Compute Unified Device Architecture). CUDA is a parallel computing platform developed by Nvidia that lets developers use a CUDA-enabled GPU (the cores are commonly referred to as CUDA cores). The CUDA platform was designed to work with programming languages such as C and C++. As demand grew for deep learning workloads, Nvidia extended the platform with the CUDA Deep Neural Network library (cuDNN), which provides GPU-accelerated primitives for deep neural networks. cuDNN is used by popular deep learning frameworks, including TensorFlow, MXNet, and others, to achieve GPU acceleration.

By now you get the picture – to run GPU workloads you need the manufacturer's GPU driver/runtime, libraries, and frameworks along with your project binaries. You can deploy all of these components directly on your machines or, better, leverage containerization and schedule/scale them using Docker and K8s. In K8s, support for GPUs is experimental. K8s implements Device Plugins to let Pods (the scheduling unit of K8s) access specialized hardware features such as GPUs. Device Plugins are available for AMD and Nvidia, but for this post I will stick to Nvidia. Further, for K8s you have two Nvidia device plugin implementations to choose from – the official one from Nvidia and another customized for Google Cloud. I will go with the former and use an AWS Ubuntu 16.04 GPU-based EC2 instance to deploy the Nvidia driver, the Nvidia Docker runtime, and K8s, and then deploy a CUDA sample.

Let's start by spinning up a p2.xlarge instance on AWS using an Ubuntu-based Deep Learning AMI. This AMI comes with Nvidia CUDA and cuDNN preinstalled.

After logging in, you can execute the 'nvidia-smi' command to get details of the driver, CUDA version, GPU type, running processes, and more. The output should be similar to below.

The first thing you need to do with this AMI is install K8s. There are various options to install K8s, but for this post we will use Docker EE. You can use the same commands to install Docker EE on this node that I outlined in my earlier post. Docker EE is bundled with upstream K8s (read here to get more details about Docker Kubernetes Service), so that one-node setup should suffice for our exercise. After installing Docker EE, install the Nvidia Docker container runtime by following these quick-start instructions. As part of those instructions, don't forget to enable the K8s device plugin DaemonSet for Nvidia by executing the below command (if you don't have kubectl installed you can follow the steps outlined here).

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

That's it! You are all set to deploy your first GPU workload to K8s. To create the Pod you can use the UCP UI or the kubectl CLI. Here's a sample YAML file requesting 1 GPU to perform vector addition of 50,000 elements:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

The pod should be scheduled on our single-node cluster and run to completion. You can retrieve the logs via the UCP UI or the CLI as shown below. The logs should show the job completing with a success status.
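A quick sketch with kubectl (the exact output depends on the sample image, but the CUDA vector-add sample typically prints a 'Test PASSED' line):

# pod status should eventually show Completed
kubectl get pod cuda-vector-add

# inspect the sample's output
kubectl logs cuda-vector-add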

I will extend our basic setup in future blog posts to demonstrate workloads with popular deep learning libraries. Meanwhile, do let me know if you have any questions or comments on this deployment workflow.

Until then, happy learning 🙂

Working with Docker APIs

Docker's popularity is due to its simplicity. Most developers can cover quite a bit of ground with just the build, push, and run commands. But there may be times when you need to look beyond the CLI. For instance, I was working with a customer who was trying to hook up their monitoring tool to the Docker Engine to enhance the productivity of their ops team. Such a hook typically pulls status and stats of Docker components and reports/alerts on them. The way a monitoring tool achieves this is by tapping into the APIs provided by the Docker Engine. Yes, the Docker Engine provides a REST HTTP API which can be invoked by any language/runtime with HTTP support (that's how the Docker client talks to the Docker Engine). Docker even has SDKs for Go and Python. In this post I will focus on how you can access these APIs over HTTP for both Linux and Windows hosts. Finally, we will look at how to access the APIs for Docker Enterprise deployments.

Accessing APIs with Linux Hosts

For Linux, the Docker daemon listens by default on a socket – /var/run/docker.sock. So, using our good old friend curl, here's the command you can execute locally to get the list of all containers (the equivalent of docker ps):

curl --unix-socket /var/run/docker.sock -H "Content-Type: application/json" -X GET http://localhost/containers/json

But how about accessing the API remotely? Simple: you need to expose a Docker endpoint beyond the socket using the -H flag on the Docker daemon service. This link shows you how to configure it. With that configuration in place you can fire curl again, but this time against the host IP rather than the socket.

curl -k  -H "Content-Type: application/json" -X GET http://172.31.6.154:2376/containers/json

So can just anyone access these APIs remotely? Is there an option to restrict or secure access to the APIs? The answer is yes. Docker supports certificate-based authentication, wherein only authorized clients (possessing a cert obtained from an authorized CA) can communicate with the Docker engine. This article covers the required steps to configure the certs in detail. Once the configuration is done you can use the below curl command to access the Docker APIs.

curl -k --cert cert.pem --cacert ca.pem --key key.pem https://3.14.248.232/containers/json

Accessing APIs with Windows Hosts

Unlike Linux, which uses sockets, Windows uses named pipes. Once you install Docker and run the pipelist command you should see 'docker_engine' in the pipe list. You can access this pipe using custom C# code or the Docker.DotNet NuGet package. Below is sample code using the Docker.DotNet library to retrieve the list of containers via the 'docker_engine' pipe.

using System;
using Docker.DotNet;
using Docker.DotNet.Models;

// Connect to the local Docker engine over the 'docker_engine' named pipe
DockerClient client = new DockerClientConfiguration(new Uri("npipe://./pipe/docker_engine")).CreateClient();

// Equivalent of 'docker ps' (prefer await over .Result in real async code)
var containerList = client.Containers.ListContainersAsync(new ContainersListParameters()).Result;

foreach (var container in containerList)
{ /* work with container.ID, container.Image, ... */ }

In addition, similar to Linux, you can use the -H option to allow TCP connections and configure certs to secure the connection. You can do this via the config file, environment variables, or just the command line as shown below. You can refer to this article for securing the Docker Engine with TLS on Windows. Once configured, you can use Docker.DotNet to invoke the Docker APIs.

dockerd -H npipe:// -H 0.0.0.0:2376 --tlsverify --tlscacert=C:\ProgramData\docker\daemoncerts\ca.pem --tlscert=C:\ProgramData\docker\daemoncerts\cert.pem --tlskey=C:\ProgramData\docker\daemoncerts\key.pem --register-service

Accessing APIs with Docker Enterprise Edition

So far we have invoked APIs against a standalone Community Edition Docker engine, but Docker has other enterprise offerings, including Docker Enterprise Edition (EE). If you are not familiar with it, check out my post on creating a single-node test environment. Universal Control Plane (UCP), part of Docker EE, provides a set of APIs you can interact with. Also, unlike standalone engines, Docker UCP is secure by default. So to connect to UCP you need to either pass username/password credentials or use certs from the client bundle, both of which are shown below. With credentials, you retrieve an AUTHTOKEN and pass that token with subsequent requests to gain access.

#Accessing UCP APIs via Creds
AUTHTOKEN=$(curl --insecure -s -X POST -d "{ \"username\":\"admin\",\"password\":\"Password123\" }" "https://3.14.248.232/auth/login" | awk -F ':' '{print    $2}' | tr -d '"{}')

curl -k -H  "content-type: application/json" -H "Authorization: Bearer $AUTHTOKEN" -X GET https://3.14.248.232/containers/json 

#Accessing UCP APIs via Client Bundle
curl -k --cert cert.pem --cacert ca.pem --key key.pem https://3.14.248.232/images/json 

Hope this helps in extending your automation and monitoring tools to Docker!

Deploying Docker Enterprise on a single node

Most developers find installing Docker fairly straightforward. You download the EXE or DMG for your environment and go through an intuitive installation process. But what you get there is typically Docker Desktop Community Edition (CE). Desktop editions are great for individual developers, but Docker has a lot more to offer enterprises who want to run containers at scale. Docker Enterprise provides many capabilities to schedule and manage your container workloads, and also offers a private Docker image registry. The challenge, though, is that while it's fairly easy to install Docker Desktop, it isn't quite the same to install Docker Enterprise. Docker Enterprise invariably requires installing the Docker Enterprise Engine, installing Universal Control Plane (UCP) with multiple manager and worker nodes, Docker Trusted Registry (DTR) – a private image repo (similar to the public Docker Hub) – and other components.

In this post, I am going to simplify the setup for you by creating a single-node sandbox environment for Docker Enterprise. Hope you will find it useful for your POCs and test use cases. All you will need is a single VM, and you can use any of the public cloud platforms to spin up that node. For this post, I have set up a Red Hat Enterprise Linux (RHEL) m4.xlarge (4 vCPU & 16 GB memory) instance on AWS. Please note that the key components of Docker EE, including UCP and DTR, only run on Linux platforms. You can add Windows nodes later as workers to run Windows workloads.

STEP 1: Install Docker Enterprise Engine
To begin, sign up for the free one-month trial of Docker EE. Once you create a Docker Hub account and fill out the contact details required for the trial, you should see the below screen.

Docker EE trial page with the storebits URL (screenshot)

Copy the storebits URL and navigate to it in your browser. Here you will find packages for different OS versions. Now SSH into your EC2 instance and add the relevant Docker repo to your yum config manager. I will use RHEL since that's the OS of my EC2 instance.

$ sudo -E yum-config-manager --add-repo "storebitsurl/rhel/docker-ee.repo"

Then install the Docker EE engine, CLI, and containerd packages.

$ sudo yum -y install docker-ee docker-ee-cli containerd.io

In case you aren't using RHEL as your OS, here are the links to install Docker Enterprise Engine on Ubuntu, CentOS, SUSE Linux, and Oracle Linux.

STEP 2: Install UCP (Universal Control Plane)
UCP is the heart of Docker Enterprise. It allows for running containers at scale across multiple nodes, lets you choose an orchestrator, provides an RBAC-enabled web UI for operations, and much more. The UCP installer runs via the 'docker/ucp' container image, installing the UCP agent, which in turn bootstraps the services and containers required by UCP.

docker container run --rm -it \
-v /var/run/docker.sock:/var/run/docker.sock `#mount docker sock` \
docker/ucp install `#ucp image is stored on docker hub` \
--host-address 172.31.6.154 `#private IP of your node` \
--san 3.14.248.232 `#public IP of your node` \
--admin-password YourPassword `#set the default password` \
--force-minimums `#for cases where node configuration is less than ideal`

Upon completion you should be able to browse to your public IP (https://3.14.248.232) and access the UCP UI shown below. To log in, use the 'admin' username with the password 'YourPassword'.

UCP web UI after login (screenshot)

STEP 3: Install DTR (Docker Trusted Registry)

DTR installation takes a similar route to UCP. It uses the 'docker/dtr' container image with the install command.

docker run -it --rm docker/dtr install \
--ucp-node yourNodeName `#try docker node ls to retrieve the name` \
--ucp-username admin \
--ucp-url https://3.14.248.232 \
--ucp-insecure-tls `#ignore cert errors for self signed UCP certs` \
--replica-http-port 8080 `#choose a port other than 80 to avoid conflict with UCP` \
--replica-https-port 8843 `#choose a port other than 443 to avoid conflict with UCP` \
--dtr-external-url https://3.14.248.232:8843 `#this will add the external IP to DTR self-signed cert`

Once the command completes successfully you should be able to browse the DTR UI on the HTTPS port (https://3.14.248.232:8843).

DTR web UI (screenshot)

STEP 4: Deploy an App

Let's deploy an app to Docker EE. For this I will pull an ASP.NET sample image, create an org and repo in DTR, tag the image for the DTR repo, and upload it. Post upload, we can use docker stack to deploy the application. To begin, let's create an organization and repository in our DTR.

Creating an organization and repository in DTR (screenshot)

Now let's pull the image from Docker Hub, tag it, and push it to DTR.

docker pull mcr.microsoft.com/dotnet/core/samples:aspnetapp
docker tag mcr.microsoft.com/dotnet/core/samples:aspnetapp 3.14.248.232:8843/brkthroo/dotnet:aspnetapp
docker image push 3.14.248.232:8843/brkthroo/dotnet:aspnetapp
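Note that DTR is secured by default, so if the push is rejected you may first need to log in (and, since this setup uses a self-signed DTR certificate, you may also need to trust its CA or add the registry as an insecure registry on the Docker daemon):

# authenticate against the DTR instance before pushing
docker login 3.14.248.232:8843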

Next, create a docker-compose.yml file with the below content.

version: "3"
services:
  web:
    # replace username/repo:tag with your image details
    image: 3.14.248.232:8843/brkthroo/dotnet:aspnetapp
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      resources:
        limits:
          cpus: "0.1"
          memory: 50M
    ports:
      - "3776:80"

Once the file is created, run the docker stack deploy command to deploy the listed services.

docker stack deploy -c docker-compose.yml aspnetapp
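To confirm the service came up, a quick sketch of the status commands:

# list the services in the stack and their replica counts
docker stack services aspnetapp

# see which node each task is running on and its current state
docker stack ps aspnetapp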

You should be able to browse the app using the node's public IP on port 3776; in my case http://3.14.248.232:3776/.

ASP.NET sample app running on port 3776 (screenshot)

In future posts we will dig into more Docker EE features, but I hope this single-node setup gives you a jumpstart in exploring Docker Enterprise.

Happy Learning 🙂 !

What is DevOps?

Though the notion of DevOps has been around for years, I still see folks struggling to articulate what DevOps really is. Recently in our user group meeting, participants hijacked the session for 30 minutes debating what DevOps stood for, without really arriving at a conclusion. Similarly, I have seen architects, sales teams, and developers struggle to explain in simple terms to their management what DevOps is, why it is important, and what business value it provides. In this post, I will share my thought process, and I would like to hear from you if you have a simpler way of communicating the DevOps intent.

Traditionally, software development has been a complex process involving multiple teams. You have developers, testers, project managers, operations, architects, UX designers, business analysts, and many others collaborating to create value for the business. The collaboration among these teams requires handshakes (think of it as a multi-party supply chain) which often cause friction, leading to non-value-adds. For instance, the handshake where the UX designer develops the UI and then developers add code, or the handshake where the analyst captures business requirements and the development team writes code to meet those requirements, or the traditional handshake between developers and testers for code validation, and so on. One such critical handshake is between developers and operations, where developers typically toss software over to operations to deploy it in upstream environments (outside the developer's workstation). Unfortunately, members of these two teams have been ignoring each other's concerns for decades. Developers assume the code will just run (after all, it ran on their machine), but in reality it rarely does. And that's where DevOps comes to the rescue.

Considering the above, it is safe to summarize that DevOps is any tool, technology, or process that reduces friction between developers and operations (thereby creating more value for the business, e.g. faster time to market, higher uptime, lower IT costs, and so on). That could be app containerization bundling all dependencies into a single image, or setting up Continuous Integration and Continuous Delivery pipelines to allow for robust, consistent deployments, or adopting a microservices architecture to pave the way for loosely coupled deployments, or infrastructure as code to allow for reliable, version-controlled infrastructure setup, or AI-enabled app monitoring tools to proactively mitigate app issues, or even reorganizing IT teams and driving cultural change within the IT organization. But the objective is the same – reduce friction between developers and operations. Once you get these basics right, it's easy to devise a strategy and approach.

So does this resonate with you? Or do you have a simpler way to explain DevOps? Looking forward to your comments.

Hitting home run with your cloud migration

Cloud is the new norm. Enterprises today are rapidly adopting cloud platforms, migrating their workloads to cloud-based IaaS, PaaS, or SaaS offerings. There is a lot of guidance out there from cloud vendors around assessing your application landscape, determining cloud ROI, adopting specific cloud services, and so on. But carrying out such a large-scale transformation isn't trivial, especially when the stakes are substantially high. In addition, guidance on executing such transformations is minimal – something that can be learned only in the trenches. I have partnered with many Fortune 500 companies on their transformation journeys, not just from a technology know-how standpoint but by owning, executing, and delivering on the desired business value. I truly believe cloud presents an unprecedented transformation opportunity, and it would be a colossal loss to take a single run with it when you can actually hit a home run. In this article, I am going to talk about those execution details which are often overlooked and cost enterprises dearly. I have used them repeatedly to ensure our customers' success on large-scale cloud transformations.

Seek more – There will be few IT transformations during our lifetimes as pervasive as cloud. When was the last time you had your engineering, operations, security, governance, architecture, and business teams all involved in a single transformation? With so many leaders at the table, it would be a waste to just shut down a few VMs on-premises and bring them up in the cloud, or make small remediations to your applications and deploy them to the cloud. My recommendation is to look beyond a mere 'landing zone change' for your applications. This is your opportunity to clean house, pay off long-outstanding technical debt, increase the velocity of your IT delivery, unify your architecture stack, raise the bar on IT compliance, minimize security risks, optimize your operations, and provide a solid foundation for the success of your business. These are also the parameters you should use to measure the success of your cloud transformation; not cost savings, and not the number of apps migrated.

Be prepared to innovate – Cloud vendors today are very aggressive about winning in the marketplace. Major players like AWS, Microsoft, and Google will go to great lengths to get your business, providing dedicated SMEs, prioritizing their product features to meet your needs, and so on. We have all seen Andy Jassy and Satya Nadella at conferences, putting customers on stage and talking about how far they have gone to meet customer needs. But honestly, the cloud ecosystem is still maturing – yes, even after a decade of work. It's very likely that you will run into unforeseen technology, compliance, or process hurdles. We have met many customers who started and then stalled their cloud journey because the cloud platform didn't provide features they were accustomed to on-premises. This could be a vendor product, a reporting tool, a CMDB integration, a load balancer capability, or a security control which the cloud platform doesn't support. That's where you've got to roll up your sleeves, get your creative folks together, and devise a custom solution to meet your enterprise needs. You have to stand up and take risks to keep the ball rolling.

Manage chaos – Cloud transformation will inevitably be accompanied by chaos. Massive chaos. Needless to say, it's very important to manage that chaos up front. Many enterprise leaders fall into the trap of announcing an early victory. The typical sequence of events: a cloud COE is formed, they take a couple of workloads of varying complexity, carry out pilots with help from vendors, deliver a great presentation to the CXOs, design a cookie-cutter approach, and publish it to the rest of the enterprise. And then all hell breaks loose. App managers who start with the cookie-cutter approach get stuck halfway because the laid-out approach doesn't cover their use cases. Eventually the cloud COE becomes overwhelmed by the volume of outstanding issues, many of which aren't even in their control, and that leads to escalations. Even worse, business teams get vocal about it, and before you know it the initiative has got enough bad press to put the entire program on hold. This is where it's important to establish an inner circle. For instance, you can set up a cloud migration factory internal to the cloud COE. The factory can operate as a black box, taking a line-of-business app as input and delivering a cloud-hosted app as output. Only once you have a substantial volume of use cases covered, detailed step-by-step guidance documentation, rolled-out trainings, established support channels with pre-defined SLAs, automation, and an operating model should you plan to open the gates to the larger enterprise. You will still have issues, but the credibility established by then will see you through.

Build for the future – Building for the future is especially relevant for large enterprises, which need to ensure all their cloud transformation investments serve them well over the next 5-7 years. Accordingly, one should be careful while adopting cloud services and offerings. Vendors might lure you in with easy integration, pay-as-you-go models, better performance, and so on. But successful enterprises don't adopt cloud services on those parameters alone; rather, they create an overarching layer of enterprise services across cloud platforms, choosing best-of-breed industry solutions. These could be your authentication/authorization, cert management, anti-virus, log analytics, alert monitoring, application performance monitoring, encryption, DevOps toolset, etc. An overarching technology layer ensures compatibility and portability of your workloads across cloud platforms. Going further, a few of our customers only use compute services to avoid vendor lock-in, adopting a common denominator like containers/Kubernetes to host their workloads. This planning from the ground up will ensure you get the desired long-term value from your cloud transformation.

Align with your long-term enterprise goals; for instance, if cloud portability is important to your organization, make sure you invest and implement in line with that strategy.

Be wary of your culture – Invariably, at the start of every cloud transformation engagement, our customers have a common question: "What one thing would you suggest we focus on as we pursue our cloud transformation journey?" It may sound cliché, but my answer is usually "be wary of your culture". Cloud transformation is a complex undertaking that will take every ounce of energy, not just from senior leadership but from everyone on the ground. Hence, it's vital to assess the culture of your enterprise and its past history of success with other high-stakes undertakings. Culture is a big topic for pundits, so without going too deep: you must build transparency and openness, and create a forum for dialogue. Communication is the key here, and an overdose of it won't hurt. We have witnessed a lot of fear and skepticism within enterprises around cloud adoption, so tackling those fears through various initiatives is critical. As a leader you need to replace that fear with hope and optimism, not just for the enterprise but for every individual involved in the journey.

A transformation program will only go as far as your organizational culture permits; create your strategy, goals, budget, and timelines with that culture in mind.

Hope you found the above points thought-provoking. These are some of the lessons I have learned the hard way. Do let me know your thoughts or post any comments/experiences you might have.

Go, Score Big!

TOGAF – Quick Reference Guide

TOGAF is The Open Group Architecture Framework – a framework for enterprise architecture. In this post I am going to summarize the framework to create a quick, easy reference for fellow architects. The Open Group reports that TOGAF is employed by 80% of Global 50 companies and 60% of Fortune 500 companies, if that motivates you to adopt or learn it.

So what is enterprise architecture? Who can be an enterprise architect (EA)? Should every IT architect aspire to be one? The word 'architecture' in enterprise architecture still conveys 'organization of a system'; it's just that you are now elevating from system-level organization (typically tiers/layers) to enterprise-level organization of business capabilities and the supporting IT ecosystem. If you are a system architect (SA) you can certainly move up to enterprise architect, though you need to weigh it carefully. EAs are different beasts. They certainly have more visibility, are closer to the business, and sit higher up the corporate ladder (due to alignment with the overall enterprise), but at the same time they aren't close to code. So if you relish technology, enjoy being close to code, trying out cool stuff, and being hands-on with technology innovations, EA may not be the right thing for you. The EA's role is a lot more involved with business, people, and processes. Now, I am not saying EAs don't mess around with technology, or that SAs don't deal with processes, but the ratios are mostly skewed.

Now let's understand what TOGAF has to offer. TOGAF essentially provides guidance for going about enterprise architecture. To start, it recommends a methodology for doing enterprise architecture called the ADM (Architecture Development Method). It then goes on to describe the typical deliverables produced throughout the ADM (Content Framework), how to logically organize them (Enterprise Continuum), and how to store them (Architecture Repository). For enterprises starting fresh, there is guidance on creating an architecture capability and on adopting/adapting the ADM for a given organization. TOGAF aims at being comprehensive, and often gets cited as being bloated (or impractical, as critics call it). This is true to a large extent. I haven't come across an organization that follows TOGAF verbatim; at the same time, it's hard to find an organization's EA practice that hasn't been influenced by TOGAF. Hence it helps to look at TOGAF as a reference guide rather than a prescription.

Let me answer one last frequently asked question before we get into the details – is TOGAF dated or dead? After all, version 9.1 was released in December 2011. Isn't that quite long in today's rapidly changing tech world? Is TOGAF still relevant? What's the use of TOGAF in a digital world? As mentioned earlier, TOGAF is not bound to technology advancements such as cloud, AI, or digital. The TOGAF framework holds and will work with any underlying technology. For instance, when SOA (Service-Oriented Architecture) emerged, there was no modification to the TOGAF ADM; rather, TOGAF provided additional perspectives on how SOA initiatives map to ADM phases. The same applies to other technology initiatives. As an EA you should have a handle on technology innovations and how your business could leverage them, but how you go about aligning the two will stay the same.

With this background, let me give you a quick overview of the TOGAF components. At the heart of TOGAF is the ADM method, consisting of phases A-H, along with a preliminary phase and centralized requirements management.

TOGAF ADM phases (diagram)

  • The Preliminary Phase is used to identify the business drivers and requirements for architecture work, capturing the outcome in the Request for Architecture Work document. Depending on the organization's EA maturity, this phase is also used for defining architecture principles, establishing the enterprise architecture team and tools, tailoring the architecture process, and determining integration with other management frameworks (ITIL, COBIT, etc.).
  • The Architecture Vision phase (Phase A) focuses on elaborating how the new proposed capability will meet business needs and address stakeholder concerns. TOGAF recommends various techniques, such as Business Scenarios, to assist in this process. Along with the capability, one needs to identify business readiness (organizational), risks involved (program level), and mitigation activities. The outcome of this phase is a Statement of Architecture Work and a high-level view of the baseline and target architectures (covering business, information systems, and technology).
  • The next three phases, B, C, and D, develop these high-level baseline and target architectures for each domain – business, information systems (apps + data), and technology (infrastructure) – identify the gaps between them, and define the roadmap components (building blocks) which will bridge those gaps. These architectures and gaps are captured in the Architecture Definition Document, along with measurable criteria (must do to comply) captured in the Architecture Requirements Specification. At the end of each phase a stakeholder review is conducted to validate outcomes against the Statement of Architecture Work.
  • Phase E is Opportunities and Solutions. The goal here is to consolidate the gaps across phases B, C, and D, identify possible solutions, determine dependencies across them, and ensure interoperability. Accordingly, you create work packages, group these packages into portfolios and projects, and identify transition architectures wherever an incremental approach is needed. The outcome of this phase is a draft architecture roadmap and migration plan.
  • Phase F is Migration Planning. While Phase E is largely driven by EAs, Phase F requires them to collaborate with portfolio and project managers. Here the draft migration plan is refined by assigning business value (priority) to each work package, estimating costs, validating risks, and finalizing the migration plan. At the end of this phase both the Architecture Definition Document and the Architecture Requirements Specification are completed.
  • In Phase G (Implementation Governance), project implementation kicks in and EAs need to ensure that implementations are in accordance with the target architecture. This is done by drawing up architecture contracts and getting them signed by the developing and sponsoring organizations. EAs conduct reviews throughout this phase and close it out once the solutions are fully deployed.
  • What's guaranteed after Phase G is change. Organizational drivers do change, either top-down or bottom-up, leading to changes in the enterprise architecture. Managing this change is what Phase H is all about. Here EAs analyze each change request and determine whether the change warrants an architecture cycle of its own. This often requires approval from an architecture board (a cross-organization architecture body). The typical outcome of this phase is a new Request for Architecture Work.
  • Central to all these phases is Requirements Management – a dynamic process where requirements are interchanged across phases and, at times, between ADM cycles.

In addition to the ADM, TOGAF offers guidelines and techniques to help you adopt and adapt the ADM. For instance, there might be cases where you skip phases or change the order of phases. Consider an enterprise committed to adopting a packaged solution. Here you would do business architecture after information systems and technology architecture, since the latter two are already done by the vendor. Another change could be developing the target architecture before the baseline architecture to ensure effective business alignment (not getting boxed into the existing capability). In both cases you are adapting the ADM to your specific enterprise needs.

Next, let's discuss the architecture deliverables for each phase. We spoke about the Architecture Definition and Architecture Requirements documents. Wouldn't it be nice if there were a meta-model which dictated the structure of these documents, ensuring consistency across ADM cycles? This is where the Architecture Content Framework (ACF) comes in. The metamodel for an individual artifact can be thought of as a viewpoint from which a view (a perspective on a related set of stakeholder concerns) is created. TOGAF categorizes all viewpoints into catalogs, matrices, or diagrams. Furthermore, these viewpoints are used to describe Architecture Building Blocks (ABBs), which are then used to build systems (ABBs are mapped to Solution Building Blocks (SBBs) in Phase E).

TOGAF Architecture Content Framework (diagram)

So now we have an enterprise architecture development methodology and a way to define its deliverables to ensure consistency. What else? How about classifying and storing these artifacts? In a large enterprise there could be hundreds of ADM cycles operating at any given point in time, each generating tons of deliverables. Storing all of them in a single bucket would lead to chaos and minimal reuse. This is where the Enterprise Continuum (EC), along with the Architecture and Solutions Continuums, comes in. The continuum establishes an evolving classification model – from Foundation to Common Systems to Industry to Organization-specific – for all the architecture artifacts. These artifacts, along with the content metamodel, governance log, standards, etc., are stored in the Architecture Repository. Two reference architecture models are included in the TOGAF documentation – the TRM, a foundation-level architecture model, and the III-RM, a systems-level architecture model.

Finally, the framework gets into the details of establishing an architecture capability within an organization. It talks about the need for an Architecture Board and its roles and responsibilities, including architecture governance (controls and objectives) and compliance (audits). For the latter two, TOGAF includes a governance framework and a compliance review process. The guidance also touches upon maturity models (based on CMM techniques) and the necessary role-to-skill mapping.

That was the TOGAF summary for you. I strongly encourage you to read the Open Group publication. Many find it a dry and lengthy read, but it's the best way to learn TOGAF and a must-read if you want to clear the certification. It's highly unlikely that getting certified in TOGAF will establish you as an enterprise architect overnight, but it's a good first step in that direction.

All the best for your exam, if you are planning for one. Hope you found this post useful!

NuGet Package Restore, Content Folder and Version Control

I was recently explaining this nuance to a dev on my team, and he suggested I capture it in a blog post. So here we go. First, some NuGet background.

NuGet is the de facto standard for managing dependencies in the .NET world. Imagine you have some reusable code – rather than sharing that functionality with your team as a DLL, you can create a NuGet package for them. What's the benefit, you may ask?

1) Firstly, NuGet can do a lot more than add a DLL to your project references. Your NuGet package can add configuration entries as part of the installation, execute scripts, or create new folders and files within the Visual Studio project structure, which greatly simplifies the adoption of your reusable code.

2) Secondly, as a package owner you can declare dependencies on other packages or DLLs. So when a team member installs your package, she gets all the required dependencies in one go.

3) Finally, the NuGet package is local to your project; the assemblies are not installed in the system GAC. This not only keeps development clean but also helps at build time. Packages don't have to be checked into version control; instead, you can restore them on your build server at build time – no more shared lib folders.

Creating a NuGet package is quite a simple process. Download the NuGet command line utility, organize the artifacts (DLLs, scripts, source code templates, etc.) you want to include into their respective folders, create the package metadata (nuget spec), and pack them (nuget pack) to get your .nupkg file. You can then install or restore the package through Visual Studio or through the command line (nuget install / restore), as sketched below.
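Here's a hedged sketch of that workflow from the command line (the package ID 'MyCompany.Utilities' is just a placeholder):

nuget spec MyCompany.Utilities          # generates MyCompany.Utilities.nuspec metadata to edit
nuget pack MyCompany.Utilities.nuspec   # produces MyCompany.Utilities.<version>.nupkg
nuget install MyCompany.Utilities       # installs the package; 'nuget restore MySolution.sln' restores at build time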

Typically NuGet recommends four folders to organize your artifacts – 'lib' contains your binaries, 'content' contains the folder structure and files which will be added to your project root, 'tools' contains scripts (e.g. init.ps1, install.ps1), and 'build' contains custom build targets/props.

Now let's get to the crux of this post – the restore aspect and what you should check into version control. When you add a NuGet package to your project, NuGet does two things – it creates a packages.config file and a packages folder. The config file keeps a list of all the added packages, and the packages folder contains the actual packages (it's basically an unzip of your .nupkg file). The recommended approach is to check in your packages.config file but not the packages folder. As part of NuGet restore, NuGet brings back all the packages in the packages folder (see the workflow image below).

NuGet package restore workflow (diagram)

The subtle catch is that NuGet restore doesn't restore content files, nor does it perform the transformations that are part of them. These changes are applied the first time you install the NuGet package, and they should be checked into version control. This also means you shouldn't put any DLLs inside the content folder (they should go into the lib folder anyway). If you must, you will have to check even those DLLs into your version control.

NuGet package contents (diagram)

In summary, NuGet restore just restores the package files; it doesn't perform any tokenization, transformation, or script execution. Those activities are performed at package installation, and the corresponding changes must be checked into version control.

WS-Fed vs. SAML vs. OAuth vs. OpenID Connect

Identity protocols are more pervasive than ever. Almost every enterprise you come across will have an identity product incubated, tied to a specific identity protocol. While the initial idea behind these protocols was to unify an enterprise-wide identity store with a single set of credentials across applications, new use cases have popped up since then. In this post, I will provide a quick overview of the major protocols and the use cases they are trying to solve. I hope you will find it useful.

WS-Fed and SAML are the old boys in the market. Appearing in the early 2000s, they are widespread today; almost every major SSO COTS product supports one of these protocols. WS-Fed (WS-Federation) is a protocol from the WS-* family primarily supported by IBM and Microsoft, while SAML (Security Assertion Markup Language) was adopted by Computer Associates, Ping Identity, and others for their SSO products. The premise behind both WS-Fed and SAML is similar – decouple the application (relying party / service provider) from the identity provider. This decoupling allows applications to use an identity provider through a predefined protocol and not care about the implementation details of the identity provider per se.

For web applications, this works via a set of browser redirects and message exchanges. The user tries to access the web application, and the application redirects the user to the identity provider. The user authenticates, the identity provider issues a claims token and redirects the user back to the application. The application then validates the token (trust needs to be established out of band between the application and the IdP), authorizes user access by asserting the claims, and allows the user to access protected resources. The token is then stored in a session cookie in the user's browser, ensuring the process doesn't have to be repeated for every request.

At a high level, there isn't much separating the flow of these two protocols, but they are different specifications, each with its own lingo. WS-Fed is perceived to be less complex and lightweight (certainly an exception for the WS-* family), while SAML, being more complex, is perceived to be more secure. In the end, you have to look at your ecosystem, including existing investments, partners, in-house expertise, etc., and determine which one provides higher value. The diagram below, taken from Wikipedia, depicts the SAML flow.

SAML 2.0 browser SSO (redirect/POST) flow (diagram)

OAuth (Open Standard for Authorization) has a different intent (the current version is OAuth 2.0). Its driving force isn't SSO but access delegation (a type of authorization). In simplest terms, it means giving your access to someone you trust so they can perform a job on your behalf. For example, say you want to post a social media update across Facebook, Twitter, Instagram, etc. Your options are to go to these sites manually, or to delegate your access to an app that can connect to these platforms and update your status on your behalf. The flow is pretty simple: you ask the application to update your status on Facebook, and the app redirects you to Facebook. You authenticate yourself to Facebook, Facebook throws up a consent page stating you are about to give this app the right to update your status on your behalf, and you agree. The app gets an opaque access token from Facebook, caches that token, and sends the status update along with the access token to Facebook. Facebook validates the access token (easy in this case, as the token was issued by Facebook itself) and updates your status.

OAuth refers to the parties involved as the Client, Resource Owner (end user), Resource Server, and Authorization Server. Mapping these to our Facebook example: the Client is the application trying to do work on your behalf, the Resource Owner is you (you own the Facebook account), the Resource Server is Facebook (holding your account), and the Authorization Server is also Facebook (it issues the access token with which the client can update your status). It's perfectly OK for the Resource Server and Authorization Server to be managed by separate entities; it just means more work to establish common ground on protocols and token formats. The screenshot below depicts the OAuth2 protocol flow.

OAuth2 protocol flow (diagram)

The web community liked the lightweight approach of OAuth, and hence the question came up – can OAuth do authentication as well, providing an alternative to heavyweight protocols like WS-Fed and SAML? Enter OpenID Connect. OpenID Connect is about adding authentication to OAuth. It aims at making the Authorization Server do more – i.e. issue not only an access token but also an ID token. The ID token is a JWT (JSON Web Token) containing information about the authentication event, such as when it occurred, and about the subject/user (the specification defines a UserInfo Endpoint for obtaining user details). Going back to the Facebook example, here the client not only relies on Facebook to provide an opaque access token for status updates, but also an ID token which the client can consume to validate that the user actually authenticated with Facebook. It can also fetch any additional user details it needs via Facebook's UserInfo Endpoint. The diagram below from the OpenID Connect spec shows the protocol flow.

OpenID Connect protocol flow (diagram)

OP in the above diagram is the OpenID Provider. All OpenID Providers publish their discovery details via a JSON document found by concatenating the provider URL with /.well-known/openid-configuration. This document has all the provider details, including the Authorization, Token, and UserInfo endpoints. Let's see a quick example with a Microsoft offering called Azure Active Directory (Azure AD). Azure AD, being an OpenID Provider, makes the OpenID configuration for its tenant demoad2.onmicrosoft.com available at https://login.microsoftonline.com/demoad2.onmicrosoft.com/.well-known/openid-configuration.
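You can fetch that discovery document yourself; a quick sketch with curl:

# the response lists the authorization_endpoint, token_endpoint, userinfo_endpoint, and more
curl -s https://login.microsoftonline.com/demoad2.onmicrosoft.com/.well-known/openid-configuration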

Fairly digestible, isn’t it 🙂 ?