Inspecting Docker container processes from the host

While I favour a containerize-all-the-things approach to new projects, I still need to maintain systems that were designed several years ago around a combination of containers and host-based applications working together.

In these situations it is common enough to execute ps or iotop on the host and see all the host and container processes together with no obvious indication of which processes belong to which containers.

Here I will share some simple commands to help map the host-view of a containerised process to its container.

First, given a host PID, how do I know which container it belongs to?

$ cat "/proc/${host_pid}/cgroup"
...
4:cpu:/docker/769739f359ec192edf6c565f7756bb5ecabcfac3e691c2444794ab6a7d398e39
...

The procfs cgroup file will show the full Docker container ID which you can then use with docker inspect to get more container details.
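
For example, something like the following should map the container ID from the cgroup output above to the container's name and image (docker inspect accepts a unique prefix of the full ID):

$ docker inspect --format '{{.Name}} {{.Config.Image}}' 769739f359ec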

Vice versa, if you have the container ID and want to locate the host process(es), you can use:

$ sudo ps -e -o pid,comm,cgroup | grep "/docker/${cid}"
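
If you only have the container's name rather than its ID, you should be able to capture the full ID first along these lines (my_container is a placeholder name):

$ cid=$(docker inspect --format '{{.Id}}' my_container)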

Lastly, if you're trying to debug a container process from the host and need the host-path to the process' binary, I have found a method that has been working reliably.

Unfortunately, because the procfs exe file is a symbolic link, not a hard link, it won't resolve to the file within the container's layered file system, so a few extra steps are required.

First, read the symlink to get the fully-qualified container-path to the binary:

$ exe=$(readlink "/proc/${host_pid}/exe")

Next, parse the process’ memory-mapped files to locate the first memory region referencing this file path:

$ map=$(grep -m1 -F "${exe}" "/proc/${host_pid}/maps" | cut -d' ' -f1)

Lastly read the symlink for this memory map from procfs’ map_files directory:

$ readlink "/proc/${host_pid}/map_files/${map}"

This final output should look something like this:

/var/lib/docker/aufs/diff/3cc533dae9a6cc96d6092844be3ce78c737db793cf1493b9f47e652e96bfd71e/bin/sleep

Note that the long identifier in that path is not the container ID, nor is it available via docker inspect, although I’m sure someone else has posted online how to locate this path via other means.
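
Putting the three steps together, a minimal helper might look like the sketch below. It assumes the PID belongs to a containerised process and that you have permission to read its procfs entries (reading map_files typically requires root, so run it via sudo).

host_path_of_exe() {
  local host_pid="$1"
  local exe map
  # Container-path of the binary, as reported by the exe symlink
  exe=$(readlink "/proc/${host_pid}/exe") || return 1
  # Address range of the first memory region mapping that path
  map=$(grep -m1 -F "${exe}" "/proc/${host_pid}/maps" | cut -d' ' -f1)
  # Host-path of the mapped file within the container's layered file system
  readlink "/proc/${host_pid}/map_files/${map}"
}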

Lessons from DigitalOcean Networking

Update: On 2017-DEC-13, DigitalOcean announced that private networking will be isolated to each account beginning February 2018.


If you've come from running virtual machines on AWS, Azure, or Google Cloud, you will be familiar with the idea that a VM can have a public Internet-facing IP address, a private IP address, or some combination of the two.

DigitalOcean offers something similar, but it is just different enough to trip you up when you're accustomed to the networking models of the other cloud providers. When you create a DigitalOcean Droplet via their Control Panel or their API, you have the option to enable “Private networking”, but the official documentation actually calls this feature “Shared private networking”, and that is a very important distinction.

Where private networking in AWS, Azure, or Google Cloud gives your VM a private interface to a network shared only with your VMs, the shared private networking in DigitalOcean is, according to this DigitalOcean tutorial, “accessible to other VPSs in the same datacenter–which includes the VPSs of other customers in the same datacenter”. And I have verified that statement is true.

To clarify, if you enable private networking on a DigitalOcean VM in their SFO2 region, every other VM in the SFO2 region from every other DigitalOcean customer can route packets to your VM’s private network interface. While I advocate the use of strict firewall configurations in any cloud hosting environment, the importance of doing so correctly is much higher on DigitalOcean, even for non-Production environments where firewalls have a history of being more relaxed.

The bright side of all this is that DigitalOcean's tag-based Cloud Firewall applies to both the public and private network interfaces and implements deny-by-default behaviour. By using tags to restrict which other droplets are permitted to communicate on specific ports and protocols, you can achieve a level of isolation very similar to that offered by other cloud providers.
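
As a rough illustration, creating such a rule with doctl might look something like the following, where the "broker" and "app" tags and port 5672 are hypothetical and the exact flag and rule syntax should be checked against the current doctl documentation:

doctl compute firewall create \
  --name "broker-access" \
  --tag-names "broker" \
  --inbound-rules "protocol:tcp,ports:5672,tag:app"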

There is another caveat though: to improve the security of this shared private networking environment, DigitalOcean do not allow VMs to send packets with a source IP address that does not match their assigned private IP address. This prevents you, for example, from operating one DigitalOcean VM as a Virtual Private Network gateway for your other DigitalOcean VMs to connect through to another non-DigitalOcean private network.

In summary, DigitalOcean provides a great service and adds new features seemingly every quarter, but its conceptual model is slightly out of sync with the big-name cloud providers and you need to be mindful of this. The same would be true, I suppose, for people experienced with DigitalOcean moving to AWS or Azure.

Finding deleted code in git

Recently, Matt Hilton blogged about Source Control Antipatterns, which included the practice of commenting out code instead of deleting it.

As wholeheartedly as I agree with deleting code, I know that a popular objection is that deleted code is harder to find. While it might be harder than using your favourite editor's Find In Files feature, it is important to know how to use the tools central to your development workflow.

For my work, and seemingly the majority of projects today, git is the version control tool of choice. So I’m sharing some git commands here that I have found useful for locating deleted code. I’m using the Varnish Cache repository for my examples if you want to try them yourself.

If you know some text from the code that was deleted, you can find the commit where it was deleted. In this example I’m looking for when the C structure named smu was deleted.

$ git log -G "struct +smu" --oneline

766dee0 Drop long broken umem code

If you know the name of a file that was deleted, but aren’t sure which directory the file was in, you can find the commit when the file was deleted with:

$ git log --oneline -- **/storage_umem.c

766dee0 Drop long broken umem code
b07c34f include cleanup - found by FlexeLint
75615a6 When I grow up, I want to learn to program in C

 

If you’re not even sure what the deleted file was named but just want to see recent commits with deleted files, you can use:

$ git log --diff-filter=D --summary

commit f4faa6e3c431d6ccf581f5683af56008e4d4be10
Author: Federico G. Schwindt <fgsch@lodoss.net>
Date: Fri Mar 10 18:59:14 2017 +0000

Fold r00936.vtc into vcc_action.c tests

delete mode 100644 bin/varnishtest/tests/r00936.vtc
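
Once you know the commit that deleted a file, you can view the file's contents as they were just before the deletion by addressing the commit's parent, for example:

$ git show "f4faa6e^:bin/varnishtest/tests/r00936.vtc"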

There is a lot more you can do with git log than just find deleted code but hopefully these examples are a useful start.

Beware Docker and sysctl defaults on GCE

On Google Compute Engine (GCE) the latest VM boot images (at the time of writing) for Ubuntu 14.04 and 16.04 (eg ubuntu-1604-xenial-v20170811) ship with a file at /etc/sysctl.d/99-gce.conf which contains:

net.ipv4.ip_forward = 0

This kernel parameter determines whether packets can be forwarded between network interfaces. On its own, the presence of this line isn’t a big deal.

Separately, when you start the Docker daemon (at least in version 17.06.0-ce), it sets this kernel parameter to 1 (assuming you haven’t specified --ip-forward=false in the Docker configuration). Docker needs packet forwarding enabled so that Docker containers using the default bridge network can communicate outside the host.

If you later execute sysctl --system or similar after Docker has started, for example to apply a new value for the nf_conntrack_max kernel parameter that you've specified in another file under /etc/sysctl.d/, then the ip_forward parameter will revert to 0 courtesy of GCE's default conf file.

At this point you’ll find your containers cannot reach the outside world, for example this will fail to resolve:

docker run ubuntu:16.04 getent hosts google.com

This will remain broken for all existing or new containers until you set the ip_forward parameter back to 1 manually or by restarting the Docker daemon.
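
One way to re-enable it immediately, without restarting the Docker daemon, is:

sudo sysctl -w net.ipv4.ip_forward=1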

If you’re using any Docker version since v1.8 (released about 2 years ago) you should see the following message when running a container with bridge networking if IP forwarding is disabled:

WARNING: IPv4 forwarding is disabled. Networking will not work.

Of course, that only helps if you’re using docker run interactively and does not help if the parameter gets changed after the containers are already running.

If you’re in this situation, add your own file to /etc/sysctl.d/ that follows 99-gce.conf alphabetically (eg 99-luftballon.conf) and ensure it contains:

net.ipv4.ip_forward = 1

You may also want to ensure the file has a trailing LF character to avoid any issues with processing it.
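
For example, something like this should create the override (with a trailing newline) and re-apply all sysctl settings; the 99-luftballon.conf name is just the example from above:

echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-luftballon.conf
sudo sysctl --system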

You can check the current value of the ip_forward kernel parameter with one of these two commands:

sysctl net.ipv4.ip_forward
cat /proc/sys/net/ipv4/ip_forward

The case of the addled ARP

In recent weeks we started receiving alerts whenever a new AWS EC2 Instance running Ubuntu 14.04 LTS was launched for a specific Auto Scaling Group. On average, one new instance would be provisioned per day but the fault would only occur for about one or two of the new instances per week.

The alert was an indicator that the new instance was unable to communicate with the message broker located on another instance. However, after approximately 20 minutes the issue would self-resolve. Also, if we manually provisioned a new replacement instance, it would successfully communicate with the broker.

With the short window of failure and no consistent interval between occurrences, this problem persisted through several operations shifts and staff members before a plan was established to capture more details of the problem.

On the next alert we were able to investigate and establish several facts:

  1. The affected instance was unable to communicate due to a connection timeout. It was sending TCP SYN packets and receiving no reply.
  2. The message broker was receiving the TCP SYN packet from the affected instance and replying with a SYN+ACK packet but the MAC address on the reply packet did not match the MAC address on the incoming SYN packet.
  3. Running ip neigh show on the message broker instance reported that the IP address of the affected instance was associated with an unrelated MAC address and was in the STALE state but occasionally also in the REACHABLE state.
  4. The unrelated MAC address was not associated with any other instances running in the VPC nor any that had been recently terminated.

At this point we set up two monitors on the message broker instance while we waited for the problem to self-resolve. The first was a tcpdump to capture all ARP traffic and the second was a shell script to continuously poll and record the ARP table. The ARP traffic capture contained very little, and nothing helpful, but the ARP table records were very interesting.

While the affected instance was unable to connect to the message broker, the ARP table cycled through the states REACHABLE, then STALE, then DELAY, and back to REACHABLE again, retaining the same incorrect MAC address association the whole time. The DELAY state never lasted as long as five seconds.

At the moment when the problem self-resolved, the DELAY state did last for five seconds and then transitioned to the PROBE state, then to the FAILED state and finally back to REACHABLE but this time with the correct MAC address.

This insight led one of our team members to find this Red Hat bug describing a Linux kernel issue that aligned exactly with the behaviour we were experiencing. Unfortunately the fix for this bug wasn't merged until Linux kernel 4.11, which was only released in May and reportedly won't be officially available in Ubuntu until Artful Aardvark 17.10.

Our assessment of all the stale ARP entries on the message broker combined with the known scaling behaviours of the messaging clients suggested that some entries had been there for at least 8 weeks. So this wasn’t a by-product of replacing instances rapidly and recycling IP addresses in the subnet too quickly.

As an interim solution we have implemented a cron job to remove any stale entries from the message broker’s ARP table and this has prevented the alerts from re-appearing for several weeks now.
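
The cleanup amounts to something like the following sketch, where eth0 is an assumption for the broker's private interface name:

ip -4 neigh show nud stale dev eth0 | awk '{print $1}' | \
  while read -r addr; do sudo ip neigh del "${addr}" dev eth0; done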

Always upgrade ixgbevf on AWS EC2

I recently had a frustrating experience with network connectivity for a set of AWS EC2 Instances running Ubuntu Trusty 14.04.

Three instances running Graphite and Carbon Cache 0.9.15 would intermittently become unreachable on the network for seconds or minutes at a time, several times a day. There was no obvious pattern to when these events would occur, and when they did there was no interesting change in their CPU utilisation, memory usage, or disk IO aside from the inevitable reduction in activity associated with a lack of data or queries coming from the network.

AWS reported the Graphite instances were failing their Instance Status Check. The external instances attempting to communicate with these Graphite machines just experienced TCP timeouts. When the Graphite instances themselves became network-reachable again, their system logs showed that processes had continued running as normal during the outage.

The first hint of a reason for this behaviour came from the Graphite instances' syslog reporting No route to host during the outage while a cron job was attempting to connect to another instance on the same subnet in the same Availability Zone. This suggested something was wrong with either ARP or the network interface, but there were no logs or kernel messages suggesting the network interface had gone down, and EC2 resolves ARP at the hypervisor.

I configured collectd to harvest all the network-related metrics possible on the Graphite instances themselves and I configured VPC Flow Logs to record details of all the network traffic in the subnet. After the next period of failed connectivity, the Flow Logs showed that all packets were reaching the EC2 Network Interfaces of the Graphite instances, while the instances' collectd data showed no packets received and no network errors either.

These Graphite instances were now running the AWS M4 Instance Type but they were not originally provisioned as such, which led me to investigate the Enhanced Networking features available to these instance types. I eventually found this suspicious paragraph in the AWS documentation:

In the above Ubuntu instance, the module is installed, but the version is 2.11.3-k, which does not have all of the latest bug fixes that the recommended version 2.14.2 does. In this case, the ixgbevf module would work, but a newer version can still be installed and loaded on the instance for the best experience.

Enabling Enhanced Networking with the Intel 82599 VF Interface on Linux Instances in a VPC

Our instances were running the 2.11.3-k version of the ixgbevf driver mentioned in the documentation, which is the older "would work" version but also the most recent version included with Ubuntu Trusty. Some further research into this network driver on AWS revealed other discussions of similarly flaky network connectivity, so I decided to upgrade the driver on one of the Graphite instances.
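
If you want to check which driver version an instance is currently running, either of these should show it (assuming eth0 is the interface using the ixgbevf driver):

modinfo ixgbevf | grep '^version'
ethtool -i eth0 | grep '^version'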

As per the same AWS documentation, the recommended version 2.14.2 does not build properly on some versions of Ubuntu, so I installed version 2.16.4, which required an OS restart. I monitored the upgraded instance for 24 hours and it remained healthy with no connectivity interruption for the whole period, whilst the other two instances continued to fail intermittently, so I upgraded the network driver on a second instance. After 72 hours of stable behaviour on the two upgraded instances, I upgraded the third and the problem is now completely resolved for those instances.

Expecting that these issues could easily recur on our other systems, I wanted to ensure they were all using the newest driver; however, because the new network driver requires an OS restart to load, adding the driver install steps to the provisioning script was undesirable. Experimentation revealed that rmmod and modprobe seem to allow the upgraded network driver to become active without an OS restart, but I decided that baking a new AMI with the driver pre-installed was preferred.
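
The module reload amounts to something like this; note it briefly interrupts the instance's network connectivity, so treat it as an experiment rather than a recommendation:

sudo rmmod ixgbevf && sudo modprobe ixgbevf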

I have also discovered that the version of the ixgbevf driver included with Ubuntu Xenial 16.04 is more recent than Trusty's but still older than the version recommended by AWS, so a custom AMI is still required.

I’ve shared my experience and findings with AWS Support and asked them to modify their documentation to more strongly recommend installing the newer driver.

Busy May

In January I presented at the Sydney ALT.NET user group about HTTPS, focusing on all the new advancements in this space and some long-held misconceptions too. It was well received so I re-presented it at the Port80 Sydney meetup in March.

I met Steve Cassidy from Macquarie University who was also presenting at the same Port80 meetup and I was invited to present the talk a third time as a guest lecture to second year Macquarie University Computer Science students on May 4th. The lecture was filmed but is only available to those with a student login. My slide deck from Port80 is available on SlideShare though.

On May 19th I delivered a breakfast talk about my experience deploying some of section.io’s infrastructure into Azure. The video of this talk is publicly available and so are the slides.

This year my friend Aaron led the organising of the return of the DDD conference in Sydney. I submitted a talk proposal and was fortunate to receive enough votes to earn a speaking slot. So, on Saturday May 28th I presented “Web Performance Lessons”, which covered a variety of scenarios I had encountered while improving the performance of other people’s websites as part of my job at section.io. The talk was recorded by the conference sponsor SSW and is available to watch here. Also my slides can be viewed at SlideShare.

At the Port80 meetup in March I also met Mo Badran who organises the Operational Intelligence Sydney meetup. Mo asked if I could do a presentation of how section.io handles operations so on Tuesday May 31st I presented “Monitoring at section.io” where I shared a bunch of detail about our tools and processes for operational visibility at section.io, both for the platform itself, and for users of our CDN. Those slides are published on SlideShare too.

I’ll take a break from speaking in June and instead absorb what other people have to share at the Velocity conference in Santa Clara and take the opportunity to also check out the new section.io office in Colorado.

I know this blog has been quiet for a while. I have been posting most of my written content over at the section.io blog lately and will probably continue to blog there more often than here in the near future. Some of my recent posts include: