How 3Scale handled Shellshock and the cloud re:boot

by Rhommel Lamas

Shellshock

Vulnerabilities: CVE-2014-6271, CVE-2014-7169, CVE-2014-7186, CVE-2014-7187

On Wednesday, September 24th 2014, the critical vulnerabilities listed above were disclosed in the GNU Bash shell by Stephane Chazelas. As you may know, Bash is widely used and distributed on virtually all Unix-like systems. These vulnerabilities allowed an attacker to execute arbitrary commands remotely on affected systems.
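If you want to check whether a given host is affected, the widely circulated test for CVE-2014-6271 is a one-liner; a patched Bash prints only the echoed string, while a vulnerable one also prints "vulnerable". This is a generic check, not something specific to our setup:

    # Check for CVE-2014-6271: a vulnerable Bash also prints "vulnerable"
    env x='() { :;}; echo vulnerable' bash -c "echo this is a test"

    # Then compare the installed version against your distribution's
    # security advisory for the patched package
    bash --version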

Once informed about the vulnerabilities, we started rolling out patches using Puppet and MCollective to avoid any impact on our infrastructure and service.
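We drove the actual rollout through Puppet and MCollective, but the per-host work boils down to upgrading the bash package and re-running the check above. A minimal hand-rolled sketch of that idea (hosts.txt and the package-manager calls are placeholders, not our real tooling):

    # Illustrative only: iterate over a host list, upgrade bash, re-test
    while read -r host; do
      ssh -n "$host" '
        sudo yum -y update bash 2>/dev/null || sudo apt-get -y install --only-upgrade bash
        env x="() { :;}; echo vulnerable" bash -c "echo bash rechecked on $(hostname)"
      '
    done < hosts.txt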

Cloud reboot

A lot of our infrastructure runs on top of several cloud providers, such as AWS. On September 24th, Amazon notified us that they were scheduling a system-reboot of around 90 of our instances because they had to urgently patch a critical vulnerability in the Xen hypervisor, CVE-2014-7188, whose details were embargoed until October 1st.
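You can see which of your instances have a pending maintenance event yourself. With the AWS CLI, something like the following lists the instances scheduled for a system-reboot (the JMESPath query is just one way to format the output):

    # List instances with a pending "system-reboot" scheduled event
    aws ec2 describe-instance-status \
      --include-all-instances \
      --filters Name=event.code,Values=system-reboot \
      --query 'InstanceStatuses[].[InstanceId,Events[0].Code,Events[0].NotBefore]' \
      --output text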

For those of you who don’t know, a system-reboot means that AWS needs to restart the underlying Xen host so the patches are correctly applied. As a user, you have three ways to mitigate the effects of this action:

  1. Wait until AWS triggers the restart of your instance, which basically means that they will stop your instance, move it to another updated host, and then start it again.
  2. Stop your instances in a controlled way, whenever you feel comfortable, and then start them again, hoping that they get moved to an updated host.
  3. Kill your old instances and rebuild them from scratch, again hoping that the new instances will be launched on an updated host.

All of these are valid options when it comes to mitigating the effects of a scheduled reboot, but not all of them work in every situation.

About 90% of our instances running on AWS are instance-store backed, to ensure the best performance of our systems. This makes things harder when AWS schedules a system-reboot: you can’t stop these instances, only terminate them, which left us with only two of the three options.
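Whether an instance can be stopped at all depends on its root device. Something like this, again via the AWS CLI, separates the EBS-backed instances from the instance-store ones:

    # RootDeviceType is either "ebs" or "instance-store";
    # instance-store backed instances cannot be stopped, only terminated
    aws ec2 describe-instances \
      --query 'Reservations[].Instances[].[InstanceId,RootDeviceType]' \
      --output text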

Unfortunately, we couldn’t let AWS reboot our instances unattended. We rely heavily on Redis as our main data store, which we sharded two months ago using tools such as Twemproxy and Redis Sentinel, so the node reboots had to be supervised.
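Supervising the reboots mostly meant knowing, for every shard, which node was currently the master and which slaves Sentinel could promote. The standard Sentinel commands give you that picture; 26379 is Sentinel’s default port and "shard1" is a placeholder for the configured master name:

    # Ask Sentinel for its view of a shard before letting any node reboot
    redis-cli -p 26379 SENTINEL get-master-addr-by-name shard1
    redis-cli -p 26379 SENTINEL slaves shard1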

During the reboot, we encountered a misconfiguration in the Redis Sentinel that manages the cluster behind our task-queueing system. This caused last Friday’s 10-minute outage.

Once we had finished with the queueing system, we performed a complete failover on both shards of our main data store on Saturday at 13:00 PDT. This went through without any issues.
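A controlled failover like this can be triggered per shard through Sentinel itself, which promotes a slave and reconfigures the remaining nodes. The master names below are placeholders for our actual shard names:

    # Force a supervised failover on each shard before its master host reboots
    redis-cli -p 26379 SENTINEL failover shard1
    redis-cli -p 26379 SENTINEL failover shard2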

After we took these measures, it was time to wait for Amazon to reboot the service instances. As they were rebooted, HAProxy removed them from, and then added them back to, each pool without users even noticing. Several times over the weekend we even handled almost 4x our normal traffic without blinking.
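HAProxy’s health checks take a rebooting backend out of rotation on their own, but a server can also be drained and re-enabled by hand over the admin socket (this requires a stats socket configured with admin level). A sketch, with the backend/server names and socket path as placeholders:

    # Drain a server before its instance reboots...
    echo "disable server api_backend/app01" | socat stdio /var/run/haproxy.sock
    # ...and put it back once its health checks pass again
    echo "enable server api_backend/app01" | socat stdio /var/run/haproxy.sock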

What did we learn?

What happened this weekend taught us the importance of regularly chaos testing your infrastructure so you can be as resilient as possible against unplanned events or failures. Testing your disaster recovery plans will help you discover mistakes you made when building your infrastructure.

Instance-store volumes are great: they are fast and reliable. But make sure your applications are designed for them, otherwise you could end up with pain rather than success stories.

Automate everything, especially when you are working in the cloud. If our infrastructure weren’t properly puppetized, rebuilding our stack would have taken days instead of hours. At our scale, spending time on “automate all the things” is a must.

Published: October 04 2014
