I recently had a frustrating experience with network connectivity for a set of AWS EC2 Instances running Ubuntu Trusty 14.04.
Three instances running Graphite and Carbon Cache 0.9.15 would intermittently become unreachable on the network for seconds or minutes at a time, several times a day. There was no obvious pattern to when these events occurred, and when they did there was no interesting change in CPU utilisation, memory usage, or disk IO aside from the inevitable reduction in activity associated with a lack of data or queries coming from the network.
AWS reported the Graphite instances were failing their Instance Status Check. The external instances attempting to communicate with these Graphite machines just experienced TCP timeouts. When the Graphite instances themselves became network-reachable again, their system logs showed that processes had continued running as normal during the outage.
The first hint of a reason for this behaviour came from the Graphite instances’ syslog, which reported No route to host during the outage while a cron job was attempting to connect to another instance on the same subnet in the same Availability Zone. This suggested something was wrong with either ARP or the network interface, but there were no logs or kernel messages suggesting the network interface had gone down, and EC2 resolves ARP at the hypervisor.
I configured collectd to harvest all the network-related metrics possible on the Graphite instances themselves, and I configured VPC Flow Logs to record details of all the network traffic in the subnet. After the next period of failed connectivity, Flow Logs showed all packets reaching the EC2 Network Interfaces of the Graphite instances, while the instances’ collectd data showed no packets received and no network errors either.
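The Flow Logs side of that comparison can be sketched roughly as follows. This is a minimal illustration, assuming the default (version 2) Flow Logs record format. The function name, the interface IDs and the sample records are hypothetical and are not taken from the original investigation.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def accepted_packets_by_interface(lines, window_start, window_end):
    """Sum packets from ACCEPT records overlapping [window_start, window_end].

    Timestamps are Unix seconds, as in the flow log records themselves.
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or records in a custom format
        rec = dict(zip(FIELDS, parts))
        if rec["log_status"] != "OK" or rec["action"] != "ACCEPT":
            continue  # NODATA/SKIPDATA records carry "-" placeholders
        if int(rec["end"]) < window_start or int(rec["start"]) > window_end:
            continue  # record lies entirely outside the outage window
        totals[rec["interface_id"]] += int(rec["packets"])
    return totals


# Hypothetical records for illustration only.
SAMPLE_RECORDS = [
    "2 123456789012 eni-0aaa 10.0.0.5 10.0.0.9 40000 2003 6 20 1600 1000 1060 ACCEPT OK",
    "2 123456789012 eni-0aaa 10.0.0.6 10.0.0.9 40001 2003 6 5 300 1000 1060 REJECT OK",
    "2 123456789012 eni-0bbb - - - - - - - 1000 1060 - NODATA",
    "2 123456789012 eni-0aaa 10.0.0.5 10.0.0.9 40000 2003 6 7 560 5000 5060 ACCEPT OK",
]

if __name__ == "__main__":
    print(dict(accepted_packets_by_interface(SAMPLE_RECORDS, 1000, 1100)))
```

A non-zero total for an interface during a window in which the instance’s own counters stayed flat would match the symptom described above: traffic accepted at the network interface but never seen by the guest OS.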
These Graphite instances were now running on the AWS M4 instance type but had not originally been provisioned as such, which led me to investigate the Enhanced Networking features available to this instance type. I eventually found this suspicious paragraph in the AWS documentation:
In the above Ubuntu instance, the module is installed, but the version is 2.11.3-k, which does not have all of the latest bug fixes that the recommended version 2.14.2 does. In this case, the ixgbevf module would work, but a newer version can still be installed and loaded on the instance for the best experience.
Our instances were running the 2.11.3-k version of the ixgbevf driver mentioned in the documentation, which is the older “would work” version but also the most recent version included with Ubuntu Trusty. Some further research into this network driver on AWS revealed other discussions of similarly flaky network connectivity, so I decided to upgrade the driver on one of the Graphite instances.
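A check for the loaded driver version could look something like the sketch below. It assumes the text comes from `ethtool -i <interface>`, which prints `driver:` and `version:` lines. The helper names and the sample output are illustrative, not captured from the instances in question.

```python
import re


def driver_version(ethtool_output):
    """Return (driver, version) from the text printed by `ethtool -i <iface>`."""
    info = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info.get("driver"), info.get("version")


def version_tuple(version):
    """Turn '2.11.3-k' into (2, 11, 3), dropping any suffix after a hyphen."""
    return tuple(int(n) for n in re.findall(r"\d+", version.split("-")[0]))


def is_older(version, minimum):
    """True if `version` sorts before `minimum` when compared numerically."""
    return version_tuple(version) < version_tuple(minimum)


# Illustrative output only; real output contains further fields.
SAMPLE_ETHTOOL_OUTPUT = """\
driver: ixgbevf
version: 2.11.3-k
bus-info: 0000:00:03.0
"""

if __name__ == "__main__":
    driver, version = driver_version(SAMPLE_ETHTOOL_OUTPUT)
    print(driver, version, "older than 2.14.2:", is_older(version, "2.14.2"))
```

On a real instance the input text could be collected with `subprocess.run(["ethtool", "-i", "eth0"], capture_output=True, text=True).stdout`. The interface name `eth0` is an assumption here and may differ.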
As per the same AWS documentation, the recommended version 2.14.2 does not build properly on some versions of Ubuntu, so I installed version 2.16.4, which required an OS restart. I monitored the upgraded instance for 24 hours and it remained healthy, with no connectivity interruption for the whole period, whilst the other two instances continued to fail intermittently, so I upgraded the network driver on a second instance. After 72 hours of stable behaviour on the two upgraded instances, I upgraded the third, and the problem is now completely resolved for those instances.
Expecting that these issues could easily recur on our other systems, I wanted to ensure they were all using the newest driver. However, because an OS restart is needed for the new network driver to load, adding the driver install steps to the provisioning script was undesirable. Experimentation revealed that rmmod and modprobe seem to allow the upgraded network driver to become active without an OS restart, but I decided that baking a new AMI with the driver pre-installed was preferable.
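For reference, the rmmod and modprobe approach might be scripted along these lines. This is only a sketch under assumptions: it must run as root, both commands need to run from a single process because the network drops as soon as the module is unloaded, and it does not cover whether the interface comes back fully configured afterwards, which would need verifying. The function names are hypothetical.

```python
import subprocess


def reload_commands(module="ixgbevf"):
    """Commands to unload and then reload a kernel module, in order."""
    return [["rmmod", module], ["modprobe", module]]


def reload_driver(module="ixgbevf", run=subprocess.run):
    """Run the reload sequence; `run` is injectable so it can be dry-run."""
    for cmd in reload_commands(module):
        # check=True stops the sequence if a step fails, e.g. without root.
        run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    reload_driver(run=lambda cmd, check: print(" ".join(cmd)))
```

Passing a stand-in for `run` keeps the sequence inspectable without touching a live interface, which matters here because a failed reload over SSH could leave the instance unreachable.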
I have also discovered that the version of the ixgbevf driver included with Ubuntu Xenial 16.04 is more recent than Trusty’s but still older than the version recommended by AWS, so a custom AMI is still required.
I’ve shared my experience and findings with AWS Support and asked them to modify their documentation to more strongly recommend installing the newer driver.