Simplified Docker Networking for Developers

When I joined Docker Inc., my hiring manager mentioned to me that everything they do is fairly simple except networking. While most will agree with that statement, it’s even more true from a developer’s standpoint. If you search for any Docker networking article, it invariably gets into the details of network namespaces, netns commands, ip links, iptables, IPVS, veths, MAC addresses, and so on. While the underpinnings are important, what most of them fail to do is paint a simple picture for devs who want to understand the overall traffic flow and how communication between containers works. That’s a common pain point I have heard from most of my customers. Hence, I thought I would try to simplify this convoluted piece. Hope you find my attempt useful.

Let’s start by talking about networking in general, whose job is to make computers (VMs and devices) talk to each other. Networking primarily relies on two components: bridges (or switches) and routers. Interestingly, most of us already use these components at home to connect our devices to the internet. If you need a refresher, check out this link on home networking. Docker containers are no different from those devices, and to connect containers we need a similar setup. There are two parts to establishing connectivity – connecting containers running on a single VM, and connecting containers spread across VMs (nodes).

Connecting containers on a single VM is simple and, I guess, well understood. Docker uses a virtual bridge to connect all the containers running on a single VM. This bridge is called ‘docker0’, and all the containers running on a standalone Docker instance are connected to it. One end of the bridge is connected to the host’s Ethernet interface, which allows inbound traffic to containers and outbound traffic from containers to reach external services. docker0 also has a default IP CIDR range of 172.17.0.0/16 assigned, out of which individual containers get their own IP addresses. To access a container externally you can map a host port to a container port, something like “docker container run -d -p 8585:80 nginx”. With that you should be able to browse to the NGINX container using <host-ipaddress>:8585.
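For instance, here is a quick way to see this for yourself (a minimal sketch; the container name ‘web’ is my own addition, the port and image are from the example above):

docker network inspect bridge --format '{{json .IPAM.Config}}'            # shows the default 172.17.0.0/16 range of docker0
docker container run -d --name web -p 8585:80 nginx                       # publish host port 8585 to container port 80
docker container inspect web --format '{{.NetworkSettings.IPAddress}}'    # the container IP assigned from the bridge range
curl http://localhost:8585                                                # reach NGINX via the host port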

Docker containers communicating on a single VM

Now what about containers running across nodes, which is typical with Swarm – running Docker containers at scale? How do we connect all these containers? While the VMs themselves are typically connected over a network, that network has no clue about the containers running on them or the IPs assigned to those containers, and it would be very tedious to make physical network changes to enable this communication. That’s where Docker leverages software defined networking, using VXLAN based overlay networks (note that VXLAN overlay is just one of the network drivers supported by Docker; for a more comprehensive description of Docker networking refer to this article). When you initialize Swarm on a Docker node (“docker swarm init”) it creates two networks by default – docker_gwbridge and ingress (we will get to ingress at the end) – to enable cross-node container communication.
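If you want to verify this yourself, here is a quick check (a sketch; run it right after initializing Swarm):

docker swarm init                                    # initialize Swarm on this node
docker network ls --filter driver=overlay            # lists the default 'ingress' overlay network
docker network ls --filter name=docker_gwbridge      # lists the local gateway bridge (bridge driver)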

Let’s take an example – say you have a UI container and a DB container, C1 and C2, and you want C1 to communicate with C2, with the two containers running on different nodes. To make this communication possible you create a user defined overlay network and attach the C1 and C2 containers to it.

docker network create --driver overlay appnet
docker service create --network appnet nginx # Attach UI or DB to Overlay

Yes, it’s that simple. When you create the overlay network ‘appnet’, Docker assigns it a CIDR range and attached containers are assigned IPs from that range. You can customize this CIDR pool for all overlay networks at the time of swarm initialization with “docker swarm init --default-addr-pool <ip-range>”, or use the --subnet flag for individual networks, as shown below.
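Here is a hedged sketch of both options (the address ranges are just examples):

docker swarm init --default-addr-pool 10.20.0.0/16                    # custom pool for all overlay networks in this Swarm
docker network create --driver overlay --subnet 10.0.9.0/24 appnet    # or a custom subnet for one specific overlay network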

Here’s a diagrammatic view of the above setup – an overlay network with a CIDR range of 10.0.1.0/24 connecting individual containers with IPs 10.0.1.3 and 10.0.1.4.

Docker containers communicating across VMs using Overlay network

So is that all? Mostly, yes, but we still need to figure out inbound and outbound access for these containers. Let’s start with external access – how do the containers running on these overlay networks reach out to the external ecosystem? What if the database we just mentioned is not running in the Swarm cluster but outside of it? For standalone Docker nodes this was handled by connecting all containers to the docker0 bridge. Following a similar pattern in Swarm mode, we use a gateway bridge – docker_gwbridge – which is created on every node at the time the node joins the Swarm cluster. Each container in an overlay network has a gateway endpoint attached to docker_gwbridge for external access. To verify this you can inspect the docker_gwbridge network and find the IDs of the attached containers, as shown below.
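Something along these lines (a sketch; container IDs will differ in your environment):

docker network inspect docker_gwbridge --format '{{json .Containers}}'    # containers on this node attached to the gateway bridge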

Docker containers communicating across nodes using overlay and docker_gwbridge networks

You might wonder why the gateway bridge is connected to individual containers and not to the overlay (VXLAN) network as a whole. That’s because containers can be running on any node, and funneling an entire overlay network through a single gateway bridge would be unreliable and slow.

That brings us to the last component – inbound access to containers. You might be tempted to think it should be straightforward. After all, every container has outbound access through docker_gwbridge, and just like the standalone Docker approach we could use the ‘-p’ flag to allow access via a host port. But things aren’t that simple. While port based host access would work, you are pinning your access to a specific node, and if the container were to move to a different node you would have to change your request URL to point to that node. Take a pause and think it over.

To solve this puzzle Docker uses the ‘ingress’ network. Ingress (also called the routing mesh) is just another overlay network, to which every node in the Docker cluster is connected via docker_gwbridge. The idea is that you can still publish your app on a specific port just like standalone Docker, but now the traffic is routed to the underlying app regardless of which node the request lands on. This is the routing magic – your request lands on node #3 and it still gets routed to the container running on node #1. The magic is done by attaching all the app containers that expose a published port to the ingress overlay, in addition to their own overlay networks (see the diagram below, where container ‘C1’ is attached to both the app overlay and the ingress overlay). This is also the reason why you will see containers like C1 having dual IPs. This attachment of containers to both the ingress and app overlay networks is necessary; without it there is no way for ingress to route the request. Remember, overlay networks are isolated from any external traffic, and hence the container attaches to both networks to allow for seamless routing. You can see in the commands below that when I create a new Swarm service with published ports and attach it to the app overlay network, the underlying container is attached to both the appnet and ingress overlay networks and has dual IPs.

External access to Docker containers via Ingress routing mesh
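Here is a sketch of those commands (the service name ‘web’ and port 8080 are illustrative):

docker service create --name web --network appnet -p 8080:80 nginx    # published port + app overlay network
docker ps --filter name=web                                           # find the container ID on this node
docker container inspect <container-id> --format '{{json .NetworkSettings.Networks}}'    # shows both 'appnet' and 'ingress', each with its own IP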

So there you go. I have taken some liberties to abstract things out, but I would be very happy if all my customers had this level of understanding when I start working with them. You can easily extend this setup by adding an L7 load balancer / reverse proxy like Traefik, Interlock, etc. That would allow you to do all app routing through the reverse proxy, and the reverse proxy would be the only component exposed through the ingress overlay. The Docker engine has DNS / LB capability too: name resolution works for services in the same overlay network (try to ping the DB container from the UI container without an IP address, using the service name of the DB Swarm service), and there is built-in support for load balancing across service replicas via virtual IPs (you can look up the VIP using docker service inspect <service-name>). Maybe I will cover these aspects in more depth in a future post, but a quick sketch is shown below.
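A minimal way to try both, assuming ‘ui’ and ‘db’ services attached to the same overlay network (the names are illustrative, and ping must be available inside the image):

docker exec -it <ui-container-id> ping db                            # resolve the 'db' service by name via Docker's embedded DNS
docker service inspect db --format '{{json .Endpoint.VirtualIPs}}'   # the VIP that Docker load balances across the service replicas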

That’s all for this post. Did I demystify the Docker networking magic for you? Is there anything else I could simplify further? Please let me know in the comments.

Overview of Calico CNI for Kubernetes

One of the key reasons Kubernetes (K8s) is so popular is its single-responsibility design. It largely confines itself to the specific job of scheduling and running container workloads. For the rest, it relies on container ecosystem vendors and open specifications to fill the gaps. Take for instance CRI (Container Runtime Interface) – a plugin interface which enables K8s to work with a wide variety of container runtimes, including Docker. Another such area is networking. After you schedule workloads to run on specific nodes, those workloads invariably need IP addresses, traffic routing between them, and at times even traffic blocking for security reasons. K8s could have baked a networking solution into the scheduling engine, but it didn’t. Enter CNI (Container Network Interface) – an open specification for container networking plugins that has been adopted by many projects including K8s. There are over a dozen vendors out there who have implemented the CNI specification, and Calico, an open source project from Tigera, is one of them.

Once you install K8s, the first thing you would typically be required to do is install a CNI plugin. For instance, if you are using kubeadm, the official documentation states –

The network must be deployed before any applications. Also, CoreDNS will not start up before a network is installed. kubeadm only supports Container Network Interface (CNI) based networks…

To install Calico you can use a YAML manifest which creates the calico-node DaemonSet (a pod that runs on every node of the K8s cluster). Though if you really want to understand the Calico installation underpinnings, it might be worth doing it the hard way.

kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml

On the other hand, if you are using a managed K8s platform like Docker EE, Calico is set up for you by default. Regardless of how you install Calico, you should also install calicoctl – the command line tool for interacting with Calico resources. Three key components make Calico work: Felix, BIRD, and etcd. Felix is the primary Calico agent that runs on each machine hosting endpoints, programming routes and ensuring only valid traffic is sent across endpoints. BIRD is a BGP client that distributes the routes created by Felix across the datacenter. Calico uses etcd as a distributed data store to ensure network consistency. Once configured, Calico provides you a seamless networking experience. Customizations are typically around IPAM (IP Address Management) and network security requirements. Let’s take a quick look at them, starting with IPAM.
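For example, once calicoctl is in place you can check that BIRD has established BGP sessions with the other nodes (a sketch; the download URL and version are illustrative, so check the Calico docs for the current release):

curl -L https://github.com/projectcalico/calicoctl/releases/download/v3.8.0/calicoctl -o calicoctl
chmod +x calicoctl
sudo ./calicoctl node status    # shows whether the Calico process is running and the BGP peers seen by BIRD on this node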

By default Calico uses the 192.168.0.0/16 IP pool to assign IP addresses to pods (sudo ./calicoctl get ippools -o wide). Calico allows you to create multiple IP pools and to request an IP from a given pool as part of pod deployment. To request an IP from a given pool you add a K8s annotation to your pod YAML file as shown below. You can also refer to this link for detailed instructions.

annotations: "cni.projectcalico.org/ipv4pools": "[\"custompool-ipv4-ippool\"]"

A network policy is a specification of how groups of pods are allowed to communicate with each other and with other network endpoints. NetworkPolicy is a standard K8s resource that uses labels to select pods and defines rules specifying what traffic is allowed to the selected pods. Calico extends this further, allowing for policy ordering/priority, deny rules, and more flexible match rules. Here’s a simple Calico tutorial on how to apply NetworkPolicy.
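As an illustration, here is a minimal standard K8s NetworkPolicy that only lets pods labeled app=ui reach pods labeled app=db on port 5432 (the labels and port are assumptions for the example); Calico’s Felix is what actually enforces the policy on each node:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ui-to-db
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ui
    ports:
    - protocol: TCP
      port: 5432
EOF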

A couple of things before we wrap up. Most public cloud providers have a native CNI plugin, but using Calico gives you the desired portability (the same reason for using the Docker EE platform). Also, most cloud providers don’t allow BGP (excluding VPN gateways and VNET peering), so it’s quite possible that you may use the cloud-native plugin for routing and IPAM and leverage the Calico plugin for its extended network policy features (also note Calico does support IP-in-IP and VXLAN to overcome BGP shortcomings). You can examine the CNI plugin configuration of your cluster by looking at /etc/cni/net.d/10-calico.conflist.

Hope that gave you a fair idea of where CNI & Calico fit into the K8s ecosystem. Please let me know your thoughts in comments section below.

Azure ExpressRoute Primer

What is Azure ExpressRoute?
ExpressRoute is a Microsoft Azure service that lets you create private connections between Microsoft datacenters and infrastructure that’s on your premises or in a colocation facility. ExpressRoute connections do not go over the public Internet, and offer higher security, reliability, and speeds with lower latencies than typical connections over the Internet.

How to set up an ExpressRoute circuit?
ExpressRoute circuits are resources within Azure subscriptions. But before you set up an ExpressRoute connection (or circuit, as it’s normally referred to), you need to make a few decisions about setup parameters.
1) Connectivity option – You can establish a connection with the Azure cloud either by extending your MPLS VPN (WAN), by leveraging your colocation provider and its cloud exchange, or by rolling out a point-to-point Ethernet connection yourself. Most large enterprises would use the first option, a medium-sized enterprise running in a colo would go with the second, and the last is a more specialized scenario warranting a higher level of security
2) Port Speed – Bandwidth for your circuit
3) Service tier / SKU – standard or premium (more on this later)
4) Location – You might get multiple options for this depending on your choice of connectivity (#1). E.g. MPLS providers have multiple peering locations from which you can pick the one closest to you
5) Data plan – A limited plan with pay-as-you-go egress charges, or an unlimited plan with a higher cost irrespective of egress volume

After you make these five choices, you can fire up PowerShell and execute ‘New-AzureDedicatedCircuit’. Remember to select the right Azure subscription (Add-AzureAccount / Select-AzureSubscription) where you want the circuit to be created. Please note you will need to import the ExpressRoute module if you haven’t already (Import-Module 'C:\Program Files (x86)\Microsoft SDKs\Azure\PowerShell\ServiceManagement\Azure\ExpressRoute\ExpressRoute.psd1').

New-AzureDedicatedCircuit -CircuitName $CircuitName -ServiceProviderName $ServiceProvider -Bandwidth $Bandwidth -Location $Location -sku Standard

As soon as this completes you will get a service key, which is a unique identifier for your circuit. Note that your billing starts at this step, even though we have only completed the Azure side of things. Now you need to work with your network service provider: give them your service key and ask them to complete their side of the configuration to Azure. This will also involve setting up a BGP session at your end. Once this is done you are all set to leverage ExpressRoute and connect the circuit to Azure virtual networks – with the traffic flowing over the private connection.

Connecting ExpressRoute Circuit to Azure Virtual Network
Once the circuit is configured it’s relatively straightforward to connect it to a virtual network. Once again PowerShell is your friend. But before firing the below command, ensure your VNET and its virtual network gateway are created.

New-AzureDedicatedCircuitLink -ServiceKey "***" -VNetName "MyVNet"

The ServiceKey parameter uniquely identifies your circuit. As circuits are part of an Azure subscription (wish there was a way to view them in the portal), your VNET should be part of the same subscription. This leads to the question – can we connect ExpressRoute circuits to VNETs across subscriptions? The answer is yes.

Connecting ExpressRoute Circuit to Azure Virtual Network across subscriptions
As we know, the circuit is part of a subscription, so as the subscription admin you will have to grant rights to other subscription admins so that they can link their VNETs to your circuit. Here’s the PowerShell cmdlet to do that.

New-AzureDedicatedCircuitLinkAuthorization -ServiceKey "***" -Description "AnotherProdSub" -Limit 2 -MicrosoftIds 'devtest@contoso.com'

This command allows 2 VNETs from AnotherProdSub to connect to the ExpressRoute circuit. You might see the last parameter, MicrosoftIds, replaced by an Azure AD ID (I’m not sure what IDs are supported right now).

Once you have the authorization, you can query the servicekey from your subscription and link your VNET as appropriate.

Get-AzureAuthorizedDedicatedCircuit #This will get details of the circuit including ServiceKey

New-AzureDedicatedCircuitLink -ServiceKey "***" -VNetName 'APSVNET' # Link a VNET in another subscription

Remember you can only connect 10 VNETs per circuit by default. Though this is a soft limit, it can only be raised a few fold. If you need to connect something like 100 VNETs, you need to look at ExpressRoute Premium.

What is ExpressRoute Premium?
Premium is a tier for enterprises that need more VNETs per circuit, need their circuit to span geo-political regions, or have more than 4,000 route prefixes. You will pay around 3,000 USD more for the premium features compared to the standard tier with the same bandwidth.

How much does it cost?
ExpressRoute costs boil down to what you pay to Microsoft and what you pay to your service provider.
To Microsoft, you pay a monthly fee depending on the port speed, charges for the bandwidth consumed (unless you are on the unlimited data plan, where you pay a flat 300 USD irrespective of egress volume), and for the virtual network gateway you provision in your VNET (a gateway is mandatory for both ExpressRoute and S2S VPN).

To the network service provider, you pay a one-time setup fee for the circuit, plus bandwidth charges (for the data that flows through their network to Microsoft).

How long does it take to set up a connection?
Well, it depends. If you already have a network service provider or an exchange provider that supports Azure, it shouldn’t take more than a day (excluding paperwork). Otherwise this can turn into a project in itself.

Can we use ExpressRoute to connect to Office 365?
The answer is yes, but it depends on your provider. Apart from connecting to Azure VNETs, ExpressRoute allows you to establish public peering and Microsoft peering to route your Azure PaaS (public services) and Office 365 traffic over the private network. For more details refer to this link. Public peering allows you to route traffic to public Azure services, like Azure SQL Database, over a private tunnel.