Beware Docker and sysctl defaults on GCE

On Google Compute Engine (GCE) the latest VM boot images (at the time of writing) for Ubuntu 14.04 and 16.04 (eg ubuntu-1604-xenial-v20170811) ship with a file at /etc/sysctl.d/99-gce.conf which contains:

net.ipv4.ip_forward = 0

This kernel parameter determines whether packets can be forwarded between network interfaces. On its own, the presence of this line isn’t a big deal.

Separately, when you start the Docker daemon (at least in version 17.06.0-ce), it sets this kernel parameter to 1 (assuming you haven’t specified --ip-forward=false in the Docker configuration). Docker needs packet forwarding enabled so that Docker containers using the default bridge network can communicate outside the host.

If you later execute sysctl --system or similar after Docker has started, for example to apply a new value for the nf_conntrack_max kernel parameter that you’ve specified in another file under /etc/sysctl.d/, then the ip_forward parameter will revert to 0 courtesy of GCE’s default conf file.
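
You can reproduce this with a few commands (a minimal sketch of the behaviour described above; the comments are mine):

sysctl net.ipv4.ip_forward   # reports 1, set by the Docker daemon at startup
sudo sysctl --system         # re-applies /etc/sysctl.d/99-gce.conf
sysctl net.ipv4.ip_forward   # reports 0 again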

At this point you’ll find your containers cannot reach the outside world, for example this will fail to resolve:

docker run ubuntu:16.04 getent hosts google.com

This will remain broken for all existing or new containers until you set the ip_forward parameter back to 1 manually or by restarting the Docker daemon.
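
To recover without waiting, either set the parameter back directly or restart the daemon so it re-enables forwarding itself (service works on both the Upstart and systemd Ubuntu releases):

sudo sysctl -w net.ipv4.ip_forward=1
sudo service docker restart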

If you’re using any Docker version since v1.8 (released about 2 years ago) you should see the following message when running a container with bridge networking if IP forwarding is disabled:

WARNING: IPv4 forwarding is disabled. Networking will not work.

Of course, that only helps if you’re using docker run interactively and does not help if the parameter gets changed after the containers are already running.

If you’re in this situation, add your own file to /etc/sysctl.d/ that follows 99-gce.conf alphabetically (eg 99-luftballon.conf) and ensure it contains:

net.ipv4.ip_forward = 1

You may also want to ensure the file has a trailing LF character to avoid any issues with processing it.
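
For example, this creates the file (echo appends the trailing newline) and re-applies everything to confirm the value sticks:

echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-luftballon.conf
sudo sysctl --system
sysctl net.ipv4.ip_forward   # should remain 1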

You can check the current value of the ip_forward kernel parameter with one of these two commands:

sysctl net.ipv4.ip_forward
cat /proc/sys/net/ipv4/ip_forward

The case of the addled ARP

In recent weeks we started receiving alerts whenever a new AWS EC2 Instance running Ubuntu 14.04 LTS was launched for a specific Auto Scaling Group. On average, one new instance would be provisioned per day but the fault would only occur for about one or two of the new instances per week.

The alert was an indicator that the new instance was unable to communicate with the message broker located on another instance. However, after approximately 20 minutes the issue would self-resolve. Also, if we manually provisioned a new replacement instance, it would successfully communicate with the broker.

With the short window of failure and no consistent period between occurrences, this problem continued through several operations shifts and staff members before a plan was established to capture more details.

On the next alert we were able to investigate and establish several facts:

  1. The affected instance was unable to communicate due to a connection timeout. It was sending TCP SYN packets and receiving no reply.
  2. The message broker was receiving the TCP SYN packet from the affected instance and replying with a SYN+ACK packet, but the MAC address on the reply packet did not match the MAC address on the incoming SYN packet (see the capture sketch after this list).
  3. Running ip neigh show on the message broker instance reported that the IP address of the affected instance was associated with an unrelated MAC address and was in the STALE state but occasionally also in the REACHABLE state.
  4. The unrelated MAC address was not associated with any other instances running in the VPC nor any that had been recently terminated.
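
Fact 2 was established with a packet capture that prints the link-layer headers. A sketch of the kind of capture used (eth0 and the broker port 5672 are assumptions):

sudo tcpdump -i eth0 -e -nn 'tcp port 5672 and tcp[tcpflags] & tcp-syn != 0'

The -e flag makes tcpdump print the source and destination MAC addresses for each packet, which is what exposed the mismatch between the SYN and the SYN+ACK.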

At this point we set up two monitors on the message broker instance while we waited for the problem to self-resolve: a tcpdump to capture all ARP traffic, and a shell script to continuously poll and record the ARP table. The ARP traffic capture contained very little, and nothing at all helpful, but the ARP table records were very interesting.
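
The two monitors were along these lines (a sketch; the interface name and file names are assumptions):

sudo tcpdump -i eth0 -w arp-traffic.pcap arp &
while true; do date; ip neigh show; sleep 1; done >> arp-table.log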

While the affected instance was unable to connect to the message broker, the ARP table cycled through the states REACHABLE, then STALE, then DELAY, and back to REACHABLE again, retaining the same incorrect MAC address association the whole time. The DELAY state never lasted as long as five seconds.

At the moment when the problem self-resolved, the DELAY state did last for five seconds and then transitioned to the PROBE state, then to the FAILED state and finally back to REACHABLE but this time with the correct MAC address.

This insight led one of our team members to find this Red Hat bug describing a Linux kernel issue that aligned exactly with the behaviour we were experiencing. Unfortunately the fix for this bug wasn’t merged until Linux kernel 4.11, which was only released in May and reportedly won’t be officially available in Ubuntu until Artful Aardvark 17.10.

Our assessment of all the stale ARP entries on the message broker combined with the known scaling behaviours of the messaging clients suggested that some entries had been there for at least 8 weeks. So this wasn’t a by-product of replacing instances rapidly and recycling IP addresses in the subnet too quickly.

As an interim solution we have implemented a cron job to remove any stale entries from the message broker’s ARP table and this has prevented the alerts from re-appearing for several weeks now.
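
The job is essentially the following (a sketch, not our exact implementation; the five-minute schedule and flushing every STALE entry are assumptions):

*/5 * * * * root /sbin/ip neigh flush nud stale

ip neigh flush accepts the same selectors as ip neigh show, so this removes any neighbour entry currently in the STALE state and forces a fresh ARP resolution on its next use.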

Always upgrade ixgbevf on AWS EC2

I recently had a frustrating experience with network connectivity for a set of AWS EC2 Instances running Ubuntu Trusty 14.04.

Three instances, running Graphite and Carbon Cache 0.9.15, would intermittently become unreachable on the network for seconds or minutes at a time, several times a day. There was no obvious pattern to when these events would occur, and when they did there was no interesting change in their CPU utilisation, memory usage, or disk IO aside from the inevitable reduction in activity associated with a lack of data or queries coming from the network.

AWS reported the Graphite instances were failing their Instance Status Check. The external instances attempting to communicate with these Graphite machines just experienced TCP timeouts. When the Graphite instances themselves became network-reachable again, their system logs showed that processes had continued running as normal during the outage.

The first hint of a reason for this behaviour came from the Graphite instances’ syslog reporting No route to host during the outage while a cron job was attempting to connect to another instance on the same subnet in the same Availability Zone. This suggested something was wrong with either ARP or the network interface, but there were no logs or kernel messages suggesting the interface had gone down, and EC2 resolves ARP at the hypervisor.

I configured collectd to harvest all the network-related metrics possible on the Graphite instances themselves, and I configured VPC Flow Logs to record details of all the network traffic in the subnet. After the next period of failed connectivity, the Flow Logs showed that all packets were reaching the EC2 Network Interfaces of the Graphite instances, yet the instances’ collectd data showed no packets received, and no network errors either.

These Graphite instances were now running the AWS M4 Instance Type but they were not originally provisioned as such, which led me to investigate the Enhanced Networking features available to these instance types. I eventually found this suspicious paragraph in the AWS documentation:

In the above Ubuntu instance, the module is installed, but the version is 2.11.3-k, which does not have all of the latest bug fixes that the recommended version 2.14.2 does. In this case, the ixgbevf module would work, but a newer version can still be installed and loaded on the instance for the best experience.

Enabling Enhanced Networking with the Intel 82599 VF Interface on Linux Instances in a VPC

Our instances were running the 2.11.3-k version of the ixgbevf driver mentioned in the documentation, which is the older “would work” version but also the most recent version included with Ubuntu Trusty. Some further research into this network driver on AWS revealed other discussions of similarly flaky network connectivity, so I decided to upgrade the driver on one of the Graphite instances.
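
You can check which driver version an instance is currently running with either of these commands (eth0 assumed):

modinfo ixgbevf | grep ^version
sudo ethtool -i eth0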

As per the same AWS documentation, the recommended version 2.14.2 does not build properly on some versions of Ubuntu, so I installed version 2.16.4, which required an OS restart. I monitored the upgraded instance for 24 hours and it remained healthy with no connectivity interruption for the whole period whilst the other two instances continued to fail intermittently so I upgraded the network driver on a second instance. After 72 hours of stable behaviour on the two upgraded instances, I upgraded the third and the problem is now completely resolved for those instances.
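
The install broadly follows the dkms procedure in the same AWS documentation; a condensed sketch (the download URL and version are per that page, so verify them before use):

sudo apt-get install -y dkms
wget "https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/2.16.4/ixgbevf-2.16.4.tar.gz"
tar -xzf ixgbevf-2.16.4.tar.gz
sudo mv ixgbevf-2.16.4 /usr/src/
# create /usr/src/ixgbevf-2.16.4/dkms.conf as shown in the AWS docs, then:
sudo dkms add -m ixgbevf -v 2.16.4
sudo dkms build -m ixgbevf -v 2.16.4
sudo dkms install -m ixgbevf -v 2.16.4
sudo update-initramfs -c -k all
sudo reboot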

Expecting that these issues could easily recur on our other systems, I wanted to ensure they were all using the newest driver. However, because the OS needs to restart for the new network driver to load, adding the driver install steps to the provisioning script was undesirable. Experimentation revealed that rmmod and modprobe seem to allow the upgraded network driver to become active without an OS restart, but I decided that baking a new AMI with the driver pre-installed was preferable.
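
The reload experiment amounted to this, run as a single command line because the instance briefly drops off the network while the module is unloaded (a sketch; don’t try it over a connection you can’t afford to lose):

sudo rmmod ixgbevf && sudo modprobe ixgbevf
sudo ethtool -i eth0   # confirm the new version is active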

I have also discovered that the version of the ixgbevf driver included with Ubuntu Xenial 16.04 is more recent than Trusty’s, but still older than the version recommended by AWS, so a custom AMI is still required.

I’ve shared my experience and findings with AWS Support and asked them to modify their documentation to more strongly recommend installing the newer driver.

One year in to the new world

My first anniversary of working with Squixa passed recently and I began to reflect on just how different working with an entirely unfamiliar technology stack has been compared to working on the Microsoft platform, and how it has differed from my initial expectations.

In the early weeks of the new job I began writing down the names of tools and technologies that I was learning each day, but the list quickly grew to more than 50 entries and I stopped updating it. Looking back at that list now, it has become a list of things I use every day: most have formed muscle memories, many I have read the source code for, and a number I have submitted patches to for bug-fixes or enhancements.

On the Microsoft platform I was a regular user of, and contributor to, open-source projects on CodePlex and GitHub, and more often than not I trusted .NET Reflector over documentation to better understand how some component should work. Over in *nix land though, source code is unavoidable, sadly sometimes as an alternative to documentation, but more often simply as the preferred distribution method. I’ve certainly read a lot more source code each week than I had previously, and in a wider variety of languages, and although it is sometimes tedious it has also taught me a lot. It’s not just developers who need a compiler installed, but any user looking beyond what their favourite *nix flavour packages for them.

I was never particularly bothered by the lack of a system package manager on Windows, even though I’d heard its absence was oft-maligned by *nix folk. Having now used a package manager in anger, I can appreciate just how much effort it saves when automating machine provisioning, but I have also found that there are many challenges with version pinning, and with a chosen distribution that does not stay current with new software releases. I’m sure the new Windows 10 PackageManagement (formerly OneGet) will be an awesome step forward for Microsoft in this space.

I wrote a lot of PowerShell before I changed jobs and I revelled in the language’s ability to work with objects and APIs. The typical shell in *nix lacks this, but I’ve rarely had to deal with objects or APIs in my new job. Here, everything is a file, normally a plain-text file at that, and so languages focused on text manipulation instead are ample. Configuration management systems end up spending most of their time overwriting files generated from templates instead of trying to interact with an API in some idempotent manner. Personally though, dealing with pattern matching and character- or field-offsets still feels too brittle and harder to re-comprehend later.

There are some popular applications in Linux doing some really awesome tricks. One favourite example is the nginx web server, which can upgrade its own binary in place: it launches a new version of itself, hands over existing connections and listening sockets, and never drops a packet. It’s not that things like this are not achievable on the Windows platform; it’s just that, for some unknown reason, nobody is doing it. While Microsoft is still fighting hard against a “just restart it” culture to avoid unnecessary down-time, Torvalds recently merged live kernel patching into Linux.

Ultimately though all the problems are the same across both platforms. You need to make sure you understand exactly what each application needs access to so you can constrain it to the least possible privileges – but not everyone does. You hit resource limits on process counts, file handles, network connections, etc but at different thresholds. You’re susceptible to the same failure conditions but they often have different failure modes, and rarely the one you would have preferred.

For every difference, there are double the similarities. The platforms have different driving principles guiding which solution to prefer for a given problem, but neither is necessarily better, simply idiomatic. At this point I’m expecting that I’ll continue to use whichever platform my current project requires without any favouritism and hopefully be switching back and forth enough to stay abreast of the latest developments on each.

Announcing VclFiddle for Varnish Cache

As part of my new job with Squixa I have been working with Varnish Cache every day. Varnish, together with its very capable Varnish Configuration Language (VCL), is a great piece of software for getting the best experience for websites that weren’t necessarily built with cache-ability or high-volume traffic in mind.

At the same time though, getting the VCL just right to achieve the desired caching outcome for particular resources can be an exercise in reliably reproducing the expected requests and carefully analysing the varnish logs. It isn’t always possible to find an environment where this can be done with minimal distraction and impact on others.

At a company retreat in October my colleagues and I were discussing this scenario and one of us pointed out how JSFiddle provides a great experience for dealing with similar concerns, albeit in the space of client-side JavaScript. I subsequently came to the conclusion that it should be possible to build a similar tool for Varnish, so I did: you can use it now at www.vclfiddle.net, and it is open-sourced on GitHub too.

VclFiddle enables you to specify a set of Varnish Configuration Language statements (including the backend origin server definition) and a set of HTTP requests, and have them executed in a new, isolated Varnish Cache instance. In return you get the raw varnishlog output (including tracing) and all the response headers for each request, plus a quick summary of which requests resulted in a cache hit or miss.

Each time a Fiddle is executed, a new Fiddle-specific URL is produced and displayed in the browser address bar and this URL can then be shared with anyone. So, much like JSFiddle, you can use VclFiddle to reproduce a difficult problem you might be having with Varnish and then post the Fiddle URL to your colleagues, or to Twitter, or to an online forum to seek assistance. Or you could share a Fiddle URL to demonstrate some cool behaviour you’ve achieved with Varnish.

VclFiddle is built with Sails.js (a Node.js MVC framework) and Docker. It is the power of Docker that makes it fast for the tool to spawn as many instances and versions of Varnish as needed for each Fiddle to execute and easy for people to add support for different Varnish versions. For example, it takes an average of 709 milliseconds to execute a Fiddle and it took my colleague Glenn less than an hour to add a new Docker image to provide Varnish 2.1 support.

The README in the VclFiddle repository has much more detail on how it works and how to use it. There is also a video demo, and a few example walk-throughs on the left-hand pane of the VclFiddle site. I hope that, if you’re a Varnish user you’ll find VclFiddle useful and it will become a regular tool in your belt. If you’re not familiar with Varnish Cache, perhaps VclFiddle will provide a good introduction to its capabilities so you can adopt it to optimize your web application. In any case, your feedback is welcome by contacting me, the @vclfiddle Twitter account, or via GitHub issues.

New Job, New Platform

After about five and a half years I have resigned from my job with Readify. I have had a great time working for Readify as a software developer, a consultant, an ALM specialist, and an infrastructure coder. Had a new opportunity not presented itself I could have easily continued working for Readify for years to come. The decision to leave was definitely not easy.

Over the last 16 years working as an IT professional I’ve had the opportunity to gain experience with almost all aspects of software development, system administration, networking, and security, but all of it on the Microsoft platform. I did do some work with Perl and PHP on Apache and MySQL back in the late 90s (like everyone did, I’m sure) but I haven’t spent any quality time with Linux or Mac OS X since.

Starting on June 10th this year (2014) I will begin a new job with Squixa. Squixa provide a set of services for improving the end-user performance of existing web sites and exposing analytics to the web sites’ owners. Squixa’s implementation currently involves very few Microsoft technologies, if any. Consequently, my future includes the exciting experience of learning a new set of operating systems, development languages, web servers, database systems, build tools, and so on.

I still have a passion for PowerShell and I feel that the direction Microsoft is heading with Azure, Visual Studio Online, and Project K is exciting and promises to become a much better platform than it is today so I will continue to stay informed of new developments. However, aside from small hobby projects, most of my time, effort, and daily challenges will come from the *nix world and future blog posts will likely reflect this.

Threat Management Gateway, Host header forwarding and redirection

When using Forefront Threat Management Gateway 2010 (TMG) to expose an internal web server to the public Internet, beware of the unexpected side-effects of the “Forward the original host header instead of the actual one (specified in the Internal site name field)” check box on the “To” tab in the properties dialog of a Web Publishing Rule.

If the behaviour of the internal web server is to detect an incoming request over HTTP and respond with a 3xx redirect to a new HTTPS location (for all or only specific URLs), and the “Forward the original host header” option in TMG is checked, then TMG interferes: the Location header returned to the original client does not include the HTTPS scheme, and the client’s browser/user-agent gets caught in an infinite redirection loop.

Unchecking the “Forward the original host header” option for the rule and applying the changes fixes the issue and the redirection works correctly.

I have been unable to find any official documentation on this behaviour, and only two references to anything related. One is a post on the isaserver.org forums (with no responses) suggesting that this behaviour was introduced in ISA 2004 SP2 (ISA being the original name for TMG). The other is a Microsoft Support KB article describing how the request Host header passed from TMG to the internal web server can include the port number with the host name – if the port number is embedded, I can imagine this impacting the scheme, but it’s a stretch.

Hopefully this post will help the next person hitting the same issue.