Beware Docker and sysctl defaults on GCE

On Google Compute Engine (GCE) the latest VM boot images (at the time of writing) for Ubuntu 14.04 and 16.04 (eg ubuntu-1604-xenial-v20170811) ship with a file at /etc/sysctl.d/99-gce.conf which contains:

net.ipv4.ip_forward = 0

This kernel parameter determines whether packets can be forwarded between network interfaces. On its own, the presence of this line isn’t a big deal.

Separately, when you start the Docker daemon (at least in version 17.06.0-ce), it sets this kernel parameter to 1 (assuming you haven’t specified --ip-forward=false in the Docker configuration). Docker needs packet forwarding enabled so that Docker containers using the default bridge network can communicate outside the host.

If you later execute sysctl --system or similar after Docker has started, for example to apply a new value for the nf_conntrack_max kernel parameter that you’ve specified in another file under /etc/sysctl.d/, then the ip_forward parameter will revert to 0 courtesy of GCE’s default conf file.

At this point you’ll find your containers cannot reach the outside world, for example this will fail to resolve:

docker run ubuntu:16.04 getent hosts google.com

This will remain broken for all existing or new containers until you set the ip_forward parameter back to 1, either manually or by restarting the Docker daemon.
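
For example, either of these should restore connectivity (the second assumes Docker is managed as a system service named docker):

sudo sysctl -w net.ipv4.ip_forward=1
sudo service docker restart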

If you’re using any Docker version since v1.8 (released about 2 years ago) you should see the following message when running a container with bridge networking if IP forwarding is disabled:

WARNING: IPv4 forwarding is disabled. Networking will not work.

Of course, that only helps if you’re using docker run interactively and does not help if the parameter gets changed after the containers are already running.

If you’re in this situation, add your own file to /etc/sysctl.d/ that follows 99-gce.conf alphabetically (eg 99-luftballon.conf) and ensure it contains:

net.ipv4.ip_forward = 1

You may also want to ensure the file has a trailing LF character to avoid any issues with processing it.
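
As a minimal sketch, reusing the example file name above:

echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-luftballon.conf
sudo sysctl --system

Running sysctl --system afterwards reapplies every file under /etc/sysctl.d/ in order, so the later file wins and ip_forward stays at 1.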

You can check the current value of the ip_forward kernel parameter with one of these two commands:

sysctl net.ipv4.ip_forward
cat /proc/sys/net/ipv4/ip_forward

The case of the addled ARP

In recent weeks we started receiving alerts whenever a new AWS EC2 Instance running Ubuntu 14.04 LTS was launched for a specific Auto Scaling Group. On average, one new instance would be provisioned per day but the fault would only occur for about one or two of the new instances per week.

The alert was an indicator that the new instance was unable to communicate with the message broker located on another instance. However, after approximately 20 minutes the issue would self-resolve. Also, if we manually provisioned a new replacement instance, it would successfully communicate with the broker.

With the short window of failure and no consistent period between occurrences, this problem continued through several operations shifts and staff members before a plan was established to capture more details.

On the next alert we were able to investigate and establish several facts:

  1. The affected instance was unable to communicate due to a connection timeout. It was sending TCP SYN packets and receiving no reply.
  2. The message broker was receiving the TCP SYN packet from the affected instance and replying with a SYN+ACK packet but the MAC address on the reply packet did not match the MAC address on the incoming SYN packet.
  3. Running ip neigh show on the message broker instance reported that the IP address of the affected instance was associated with an unrelated MAC address and was in the STALE state but occasionally also in the REACHABLE state.
  4. The unrelated MAC address was not associated with any other instances running in the VPC nor any that had been recently terminated.

At this point we set up two monitors on the message broker instance while we waited for the problem to self-resolve. The first was a tcpdump to capture all ARP traffic and the second was a shell script to continuously poll and record the ARP table. The ARP traffic capture contained very little, and nothing helpful, but the ARP table records were very interesting.
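
In rough terms, the two monitors looked something like this (the interface name and output paths here are illustrative, not our exact setup):

sudo tcpdump -i eth0 -n arp -w /tmp/arp.pcap &
while true; do
    date -u +%FT%TZ >> /tmp/arp-table.log
    ip neigh show >> /tmp/arp-table.log
    sleep 1
done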

While the affected instance was unable to connect to the message broker, the ARP table cycled through the states REACHABLE, then STALE, then DELAY, and back to REACHABLE again, retaining the same incorrect MAC address association the whole time. The DELAY state never lasted as long as five seconds.

At the moment when the problem self-resolved, the DELAY state did last for five seconds and then transitioned to the PROBE state, then to the FAILED state and finally back to REACHABLE but this time with the correct MAC address.

This insight led one of our team members to find this Red Hat bug describing a Linux kernel issue that aligned exactly with the behaviour we were experiencing. Unfortunately the fix for this bug wasn’t merged until Linux kernel 4.11, which was only released in May, and reportedly won’t be officially available in Ubuntu until Artful Aardvark 17.10.

Our assessment of all the stale ARP entries on the message broker combined with the known scaling behaviours of the messaging clients suggested that some entries had been there for at least 8 weeks. So this wasn’t a by-product of replacing instances rapidly and recycling IP addresses in the subnet too quickly.

As an interim solution we have implemented a cron job to remove any stale entries from the message broker’s ARP table, and this has prevented the alerts from re-appearing for several weeks now.
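
A minimal sketch of such a job (the script path and schedule here are illustrative, not our exact configuration):

#!/bin/sh
# flush-stale-arp.sh (hypothetical): delete any IPv4 neighbour entries in the STALE state
ip -4 neigh show nud stale | while read addr _dev dev _rest; do
    ip neigh del "$addr" dev "$dev"
done

with a matching /etc/cron.d entry along the lines of:

*/5 * * * * root /usr/local/sbin/flush-stale-arp.sh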

 

Always upgrade ixgbevf on AWS EC2

I recently had a frustrating experience with network connectivity for a set of AWS EC2 Instances running Ubuntu Trusty 14.04.

Three instances running Graphite and Carbon Cache 0.9.15 would intermittently become unreachable on the network, for seconds or minutes at a time, several times a day. There was no obvious pattern to when these events would occur, and when they did there was no interesting change in their CPU utilisation, memory usage, or disk IO aside from the inevitable reduction in activity associated with a lack of data or queries coming from the network.

AWS reported the Graphite instances were failing their Instance Status Check. The external instances attempting to communicate with these Graphite machines just experienced TCP timeouts. When the Graphite instances themselves became network-reachable again, their system logs showed that processes had continued running as normal during the outage.

The first hint of a reason for this behaviour came from the Graphite instances’ syslog reporting “No route to host” during the outage while a cron job was attempting to connect to another instance on the same subnet in the same Availability Zone. This suggested something was wrong with either ARP or the network interface, but there were no logs or kernel messages suggesting the network interface had gone down, and EC2 resolves ARP at the hypervisor.

I configured collectd to harvest all the network-related metrics possible on the Graphite instances themselves and I configured VPC Flow Logs to record details of all the network traffic in the subnet. After the next period of failed connectivity, the Flow Logs showed that all packets were reaching the EC2 Network Interfaces of the Graphite instances, while the instances’ collectd data showed no packets received, and no network errors either.

These Graphite instances were now running the AWS M4 Instance Type but they were not originally provisioned as such, which led me to investigate the Enhanced Networking features available to these instance types. I eventually found this suspicious paragraph in the AWS documentation:

In the above Ubuntu instance, the module is installed, but the version is 2.11.3-k, which does not have all of the latest bug fixes that the recommended version 2.14.2 does. In this case, the ixgbevf module would work, but a newer version can still be installed and loaded on the instance for the best experience.

Enabling Enhanced Networking with the Intel 82599 VF Interface on Linux Instances in a VPC

Our instances were running the 2.11.3-k version of the ixgbevf driver mentioned in the documentation, which is the older “would work” version but also the most recent version included with Ubuntu Trusty. Some further research into this network driver on AWS revealed other discussions of similarly flaky network connectivity, so I decided to upgrade the driver on one of the Graphite instances.
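
You can check which version of the driver an instance is currently running with either of these commands (assuming the interface is eth0):

ethtool -i eth0 | grep ^version
modinfo ixgbevf | grep ^version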

As per the same AWS documentation, the recommended version 2.14.2 does not build properly on some versions of Ubuntu, so I installed version 2.16.4, which required an OS restart. I monitored the upgraded instance for 24 hours and it remained healthy with no connectivity interruption for the whole period, whilst the other two instances continued to fail intermittently, so I upgraded the network driver on a second instance. After 72 hours of stable behaviour on the two upgraded instances, I upgraded the third and the problem is now completely resolved for those instances.

Expecting that these issues could easily recur on our other systems, I wanted to ensure they were all using the newest driver; however, because an OS restart is needed for the new network driver to load, adding the driver install steps to the provisioning script was undesirable. Experimentation revealed that rmmod and modprobe seem to allow the upgraded network driver to become active without an OS restart, but I decided that baking a new AMI with the driver pre-installed was preferable.
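
For reference, the reload approach is roughly the following, though expect a brief network interruption while the module is swapped:

sudo rmmod ixgbevf
sudo modprobe ixgbevf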

I have also discovered that the version of the ixgbevf driver included with Ubuntu Xenial 16.04 is more recent than Trusty’s but still older than the version recommended by AWS, so a custom AMI is still required.

I’ve shared my experience and findings with AWS Support and asked them to modify their documentation to more strongly recommend installing the newer driver.

Busy May

In January I presented a talk about HTTPS at the Sydney ALT.NET user group, focusing on all the new advancements in this space and some long-held misconceptions too. It was well received, so I re-presented it at the Port80 Sydney meetup in March.

I met Steve Cassidy from Macquarie University who was also presenting at the same Port80 meetup and I was invited to present the talk a third time as a guest lecture to second year Macquarie University Computer Science students on May 4th. The lecture was filmed but is only available to those with a student login. My slide deck from Port80 is available on SlideShare though.

On May 19th I delivered a breakfast talk about my experience deploying some of section.io’s infrastructure into Azure. The video of this talk is publicly available and so are the slides.

This year my friend Aaron led the organising of the return of the DDD conference in Sydney. I submitted a talk proposal and was fortunate to receive enough votes to earn a speaking slot. So, on Saturday May 28th I presented “Web Performance Lessons”, which covered a variety of scenarios I had encountered while improving the performance of other people’s websites as part of my job at section.io. The talk was recorded by the conference sponsor SSW and is available to watch here. My slides can also be viewed on SlideShare.

At the Port80 meetup in March I also met Mo Badran, who organises the Operational Intelligence Sydney meetup. Mo asked if I could do a presentation on how section.io handles operations, so on Tuesday May 31st I presented “Monitoring at section.io”, where I shared a bunch of detail about our tools and processes for operational visibility at section.io, both for the platform itself and for users of our CDN. Those slides are published on SlideShare too.

I’ll take a break from speaking in June and instead absorb what other people have to share at the Velocity conference in Santa Clara and take the opportunity to also check out the new section.io office in Colorado.

I know this blog has been quiet for a while. I have been posting most of my written content over at the section.io blog lately and will probably continue to blog there more often than here in the near future. Some of my recent posts include:

Adding HPKP to my blog

In my last post I described how I added HTTPS to my blog and mentioned that implementing HTTP Public Key Pinning (HPKP) was still pending.

The purpose of HPKP is to protect your site in the event that a trusted Certificate Authority issues a certificate for your site to the wrong person. This can happen, and has happened, due to a process error or due to the CA’s systems being breached. Either way it can enable a third party to man-in-the-middle your site, often with no indication that something is wrong. HPKP allows you to inform the browser that only certain public keys that you’ve pre-approved should be accepted, even if all other aspects of the certificate appear valid.

The reason I didn’t get HPKP done up front is that the process is somewhat arduous, even though the end result is simply serving an extra HTTP response header of the format:

Public-Key-Pins: pin-sha256="..fingerprint.."; pin-sha256="..another.."; max-age=1234

A single response header may appear trivial at first but there is some complexity waiting to trip you up.

Firstly, the fingerprint is different to any of the other fields you may normally see in a typical certificate information dialog. The fingerprint is a SHA-256 (or SHA-1) digest of the public key (and some public key metadata), which is then base64 encoded. Generating this fingerprint typically involves piping between two or more consecutive openssl commands, and OpenSSL isn’t renowned for its clarity.

Starting with an existing certificate, a certificate signing request (CSR), or a private key will each change which collection of OpenSSL commands you need to execute to generate the fingerprint. There is at least one online tool to help with this (thanks Dāvis), but be wary of using any online tools which require the private key.
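
As an illustration, starting from an existing PEM-encoded certificate (the file name here is hypothetical), the pipeline looks something like this:

openssl x509 -in example.com.crt -noout -pubkey |
  openssl pkey -pubin -outform DER |
  openssl dgst -sha256 -binary |
  openssl enc -base64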

To make life a little easier for section.io users, I added the calculated fingerprint to the HTTPS configuration page:

[image: section-hpkp-fingerprint]

The second gotcha is that the header is not valid with only a single fingerprint of the public key from the certificate currently in use on your site. The specification (RFC 7469) requires that you also include at least one extra fingerprint of a backup public key that you can switch to in the event of a lost or stolen private key. And it is a good idea to include fingerprints for two backup keys.

Before you assume that this means you need to buy more certificates, you should note that you only need the fingerprint of the public key component. This means you can generate a key pair, or a CSR, with which you will later purchase a new certificate only in the event that you need to replace your current certificate. Key pairs and CSRs do not expire – although, technically, your chosen key length or algorithm may become less secure as time passes and technology progresses.

The third issue to be mindful of is the max-age directive in the header. This is the number of seconds that a user-agent should cache these fingerprints. Do not conflate this with the validity period of your signed certificate, as certificates expire on a fixed date but the HPKP header is valid for a fixed period starting from the moment the browser parses the header.

With a max-age value equivalent to 365 days, a user could visit your site one month before your certificate expires and then persist your Public-Key-Pins header data for the next 12 months, well past your certificate’s expiry. But this is OK. You will likely renew your certificate with the same public key, or renew it with one of the backup public keys already mentioned in your HPKP header.

It is just important to realise that the HPKP max-age is different from the certificate validity, and browsers may cap the maximum age they will honour. Ensure that you balance the age and the number of backup keys you think you may need in that period. And when you consume a backup key from your HPKP header, you should update your header with a new backup key, which will be slowly acknowledged by browsers as their cached copy of your HPKP header expires.

With all that, I added the HPKP response header to my site with the following Varnish configuration:

[image: hpkp-vcl]
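
A minimal sketch of VCL along those lines, added to the default.vcl described later, assuming Varnish 4 syntax and using placeholder fingerprints rather than my real pins:

sub vcl_deliver {
    # add the HPKP header to every response; the pins below are placeholders
    set resp.http.Public-Key-Pins = {"pin-sha256="<current-key-pin>"; pin-sha256="<backup-key-pin>"; max-age=5184000"};
}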

Adding HTTPS to my blog, economically

I’ve been hosting my blog with WordPress.com for about the last five years for one simple reason: I want to spend my time writing content, not messing about with server maintenance or blog engine updates. If I was making the same decision today I might choose Jekyll or Ghost instead but WordPress.com was just easy and I have no reason to change. Well, maybe one reason…

Security has always been a passion, and these days it is a significant part of my job. I am a fervent supporter of HTTPS everywhere (the concept, not the browser extension) and I recently realised that my blog was not only served without HTTPS by default but it failed with certificate warnings when accessed via HTTPS. My first thought was to bump up my WordPress.com plan to something with TLS support but when I went looking for this option I found that not only do WordPress.com not offer this, they have published some dangerous misinformation about their HTTPS support.

[image: wordpress.com-https]

I wanted to avoid going through the effort of migrating my blog to new hosting. All I really needed was to put an intelligent HTTPS proxy in front of my existing blog. Conveniently that is a core component of what my team and I have been building this year: section.io. In short, section.io is an HTTP-reverse-proxy-as-a-service solution focused on providing a great DevOps story. At the moment it is predominantly used for Varnish Cache cloud-hosting but its capabilities are growing rapidly.

With section.io I was able to register a new, free account and within about 3 minutes the infrastructure had been provisioned to proxy my WordPress.com-hosted blog through a default configuration of Varnish 4. For now, because WordPress.com do their own caching, and I want to focus on writing blog content, I’m leveraging Varnish only for response header manipulation, not caching.

Also, because Varnish Cache (and inevitably other proxies that section.io will support one day) doesn’t have native HTTPS support, section.io provides a thin TLS-offload layer in front of Varnish; all I need to do is upload a certificate. In recent years my DNS host and registrar of choice has been DNSimple, and they now sell TLS certificates too. Through DNSimple, I bought a Domain Control Validated certificate for only US$20 for the year, which is then issued by Comodo.

I uploaded my new certificate and private key into the section.io management portal and moments later my blog could be accessed via HTTPS and I was greeted with a friendly green padlock. I should point out that the free HTTPS support on section.io does not support non-SNI capable user agents at this time but I’m comfortable ignoring that quickly shrinking pool of browsers for my blog.

[image: green-padlock]

Merely being able to access my blog via HTTPS is not enough, however; I want it to be accessed only via HTTPS. That requires a little more work, but it’s all achievable with a little bit of Varnish Configuration Language.

section.io strives to provide the same unconstrained Varnish experience one would get from hosting Varnish themselves. In this instance, I get access to the default.vcl file in my own section.io account’s git repository, and a convenient web-based editor to make quick changes.

The first change is to add some VCL to detect whether the request was made without HTTPS, by inspecting the conventional X-Forwarded-Proto header, and respond with a synthetic 301 Moved Permanently response to the HTTPS URL as appropriate:

[image: vcl-https-redirect]
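
A sketch of that redirect logic, assuming Varnish 4 syntax (the exact VCL in my repository may differ slightly):

sub vcl_recv {
    # requests that arrived over plain HTTP get a synthetic redirect to HTTPS
    if (req.http.X-Forwarded-Proto != "https") {
        return (synth(301, "Moved Permanently"));
    }
}

sub vcl_synth {
    if (resp.status == 301) {
        set resp.http.Location = "https://" + req.http.Host + req.url;
        return (deliver);
    }
}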

The second change is to add HSTS response headers so that return visitors will automatically use HTTPS for all requests without needing the server-side redirect:

[image: vcl-hsts]
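
And a sketch of the HSTS addition (the max-age value here is illustrative):

sub vcl_deliver {
    # tell returning browsers to use HTTPS directly for the next year
    set resp.http.Strict-Transport-Security = "max-age=31536000";
}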

 

At this point section.io is configured to serve my blog as HTTPS-only but public traffic is still hitting WordPress.com directly. When I registered my blog site with section.io I was provided with a new CNAME value to configure my blog’s DNS to resolve to. I didn’t change over immediately, though; I used Fiddler (or my local HOSTS file) to simulate the change and verify I had everything working right. I’ve since changed my public DNS records and you should now be reading this post over HTTPS.

Troy Hunt has recently blogged about the generally “premium” nature of TLS being a blocker of wider HTTPS adoption, and he is right, but there are a number of more affordable solutions growing in response to the increasing demand. What I have found though is that the cost of certificates and hosting is quickly surpassed by the knowledge required to implement HTTPS right because it is so much more than just getting a key pair and talking HTTP through an encrypted tunnel.

A good HTTPS deployment needs to consider TLS protocol versions and cipher suites, needs to avoid mixed-mode content, and utilise HPKP, which I’ll be configuring on my blog soon. Some of this will hopefully be handled by your hosting provider but a lot also crosses over into the application domain.

One year in to the new world

My first anniversary of working with Squixa passed recently and I began to reflect on just how different working with an entirely unfamiliar technology stack has been from working with the Microsoft platform, and how it has compared with my initial expectations.

In the early weeks of the new job I began writing down the names of tools and technologies that I was learning each day, but the list quickly grew to more than 50 items and I stopped updating it. Looking back at that list now, it has become a list of things I use every day; most have formed muscle memories, many I have read the source code for, and a number I have submitted patches to for bug-fixes or enhancements.

On the Microsoft platform I was a regular user of and contributor to open-source projects on CodePlex and GitHub, and more often than not I trusted .NET Reflector over documentation to better understand how some component should work. Over in *nix land though, source code is unavoidable, sadly sometimes as an alternative to documentation, but more often simply as the preferred distribution method. I’ve certainly read a lot more source code each week than I have previously, and in a wider variety of languages, and although it is sometimes tedious it has also taught me a lot. It’s not just developers who need a compiler installed, but any user looking beyond what their favourite *nix flavour packages for them.

I was never particularly bothered by the lack of a system package manager on Windows even though I’d heard its absence was oft-maligned by *nix folk. Having used a package manager in anger now, I can appreciate just how much effort it saves when trying to automate machine provisioning, but I have also found that there are many challenges with version pinning and with a distribution that does not stay current with new software releases. I’m sure the new Windows 10 PackageManagement (formerly OneGet) will be an awesome step forward for Microsoft in this space.

I wrote a lot of PowerShell before I changed jobs and I revelled in the language’s ability to work with objects and APIs. The typical shell in *nix lacks this, but I’ve rarely had to deal with objects or APIs in my new job. Here everything is a file, and normally a plain-text file at that, so languages focused on text manipulation are ample. Configuration management systems end up spending most of their time overwriting files generated from templates instead of trying to interact with an API in some idempotent manner. Personally though, dealing with pattern matching and character- or field-offsets still feels too brittle and harder to re-comprehend later.

There are some popular applications in Linux doing some really awesome tricks. One favourite example is the nginx web server, which can upgrade its binary, launch a new version of itself, hand over existing connections and listening sockets, and never drop a packet. It’s not that things like this are not achievable on the Windows platform, it’s just that for some unknown reason, nobody is doing it. While Microsoft is still fighting hard against a “just restart it” culture to avoid unnecessary down-time, Torvalds recently merged live kernel patching into Linux.

Ultimately though all the problems are the same across both platforms. You need to make sure you understand exactly what each application needs access to so you can constrain it to the least possible privileges – but not everyone does. You hit resource limits on process counts, file handles, network connections, etc but at different thresholds. You’re susceptible to the same failure conditions but they often have different failure modes, and rarely the one you would have preferred.

For every difference, there are double the similarities. The platforms have different driving principles guiding which solution to prefer for a given problem, but neither is necessarily better, simply idiomatic. At this point I’m expecting that I’ll continue to use whichever platform my current project requires without any favouritism and hopefully be switching back and forth enough to stay abreast of the latest developments on each.