At 3scale we have been using Amazon Web Services (AWS) since we started, about 4 years ago, and we could not be happier. Amazon does a very good job and has helped us grow and manage our infrastructure with less than one person dedicated to it.
We also use other providers like Rackspace and Hetzner, and they also do a terrific job, but our heavy lifting is done in Amazon, so we have gathered quite a bit of insight into the inner workings of AWS. We thought about sharing our experience, but it turns out others have done it already: for instance, the post AWS: the good, the bad and the ugly by @seldo is fantastic. We subscribe to it 100% and really encourage anyone starting with AWS to check it out.
Measuring Performance in AWS
There is one last bit that the post does not cover which we think is worth sharing: it turns out that the same instance type performs quite differently on different availability zones (and regions). How different? Enough to actually care about it.
In the figure below you can see the aggregated average response times of all the listener processes running on a subset of servers. Basically, a listener process running on an instance in zone us-east-d takes about 12ms per request, whereas a listener on an instance in us-east-b takes 24ms. Our listeners are single-threaded processes that respond to HTTP requests, do some processing and, if needed, create a job that gets stored into Redis for asynchronous processing.
The figure only includes results for the listeners running on instances in the us-east region (the most popular one). The instances are all of the same type, c1.medium, and the number of listener processes running per instance is the same across all instances.
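To make the setup concrete, here is a minimal sketch of what one of these listeners does per request. The function and field names are hypothetical, and the job queue is passed in as a plain object standing in for a Redis client (in production the enqueue would be something like `redis.rpush`):

```python
import json
import time

def handle_request(payload, job_queue):
    """Handle one request: do some synchronous processing and, if
    needed, enqueue a job for asynchronous processing.

    `job_queue` stands in for a Redis list; in production the append
    would be a network round-trip to the Redis master, which is where
    the per-zone latency difference shows up.
    """
    start = time.monotonic()

    # Placeholder for the synchronous part of the request.
    result = {"status": "ok"}

    # If the request needs background work, store a job for the workers.
    if payload.get("needs_job"):
        job_queue.append(json.dumps({"task": "report", "payload": payload}))

    elapsed_ms = (time.monotonic() - start) * 1000.0
    return result, elapsed_ms
```

The key point is that the enqueue step crosses the network to Redis, so where your instance sits relative to the Redis master dominates the response time.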
Why the differences on performance?
Network, plain and simple.
Listeners access a Redis instance in a master-slave setup, and the master, unless there is a failover, is in zone us-east-d. Therefore it's expected that listeners in us-east-d are faster at processing requests than listeners in any other zone, almost twice as fast. What is a bit more striking is that there was quite a bit of variance among the remaining zones; for instance, us-east-c is ~40% faster than us-east-b.
This variance is due to multiple reasons: physical distance between zones, network topology, exogenous behavior (more on that later). Also, some zones are “older”; AWS is advising customers to move out of zone us-east-a.
For our particular case the zone “classification” looks like this (from slowest to fastest):
us-east-b > us-east-a > us-east-e > us-east-c > us-east-d
This ordering has not changed (except during failovers). It does not mean that zone us-east-c is always better than us-east-e for everybody! It is true only for our case, where all zones end up writing to us-east-d. For you it might be different. We do not cross traffic between zones unless it's really required, and only for the data layer (see Amazon architecture tips).
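Deriving such a classification is straightforward once you log response times per zone. A sketch, assuming you have collected per-zone samples in milliseconds (the numbers below are illustrative, not our real measurements):

```python
from statistics import median

def rank_zones(samples_ms):
    """Order availability zones from slowest to fastest by their
    median response time. `samples_ms` maps zone name -> list of
    response times in milliseconds.
    """
    return sorted(samples_ms, key=lambda zone: median(samples_ms[zone]), reverse=True)

# Illustrative samples, roughly matching the figures discussed above.
samples = {
    "us-east-b": [24, 25, 23],
    "us-east-a": [21, 22, 20],
    "us-east-e": [18, 19, 17],
    "us-east-c": [15, 16, 14],
    "us-east-d": [12, 13, 11],
}
print(" > ".join(rank_zones(samples)))
```

Using the median rather than the mean keeps the ranking robust against the occasional outlier request.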
You can find your own classification if you have good tooling, which is always a good idea. It was not until recently that we systematically instrumented everything to collect performance data. We use good old syslog-ng and Graphite. We have tried fancier approaches, but due to our high traffic we could not make them work as advertised, or we were not able to set them up properly in a finite amount of time :-)
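Shipping a timing metric to Graphite is simple because Carbon accepts a plaintext protocol: one line of "path value timestamp", typically on TCP port 2003. A minimal sketch, with a placeholder host and metric path:

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol:
    '<metric path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="graphite.internal", port=2003):
    """Ship one metric to Carbon. Host and port are placeholders for
    wherever your Carbon daemon listens."""
    line = graphite_line(path, value)
    with socket.create_connection((host, port), timeout=1) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. send_metric("listener.us-east-d.response_ms", 12)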
What do we do with this? Well, if you want, you can optimize which zone you create new instances in. Since the difference in performance is noticeable, it's worth considering.
But please be careful not to overdo it: uptime comes first. Skewing your instances heavily towards a subset of zones can make you more failure-prone. There is always a sweet spot, and in case of doubt, be conservative: sacrifice performance for reliability. AWS works great, but there are failures; everything fails from time to time. Priority should be given to uptime, at least in our case. We have survived all major AWS issues, even the cloudpocalypse, but we have had downtimes of our own, unrelated to our providers, as you can see in our Twitter status.
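One way to encode that sweet spot is to guarantee a minimum presence in every zone for availability, and only bias the remainder towards the faster zones. This is a hypothetical helper, not how we actually place instances:

```python
def place_instances(total, zones_fast_to_slow, min_per_zone=1):
    """Distribute `total` instances across zones: every zone gets at
    least `min_per_zone` for availability, and the remainder is
    round-robined over the faster half of the zones.
    """
    if total < min_per_zone * len(zones_fast_to_slow):
        raise ValueError("not enough instances to cover every zone")
    placement = {zone: min_per_zone for zone in zones_fast_to_slow}
    remaining = total - min_per_zone * len(zones_fast_to_slow)
    # Bias only towards the faster half so the skew stays moderate.
    fast_half = zones_fast_to_slow[: max(1, len(zones_fast_to_slow) // 2)]
    for i in range(remaining):
        placement[fast_half[i % len(fast_half)]] += 1
    return placement
```

A single-zone outage then costs you at most the instances you deliberately over-allocated there, never your entire fleet.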
AWS is a shared environment
That should not surprise anyone, but it’s good to keep in mind.
Below you can see a plot of the median response time (50th percentile) and the traffic received.
Notice that the green line (zone us-east-d) is always “flat” except for some odd “steps”. Request time is uncorrelated with the traffic received (the same applies to the other zones).
Those steps are basically exogenous events that resulted in a performance degradation, many times doubling the time it takes to process a request. The steps appear clustered in batches of two or three zones; something is going on with the connectivity between these zones, and it's not us :-) Perhaps there is congestion on the network, perhaps AWS or another customer is doing large bulk transfers. Other ideas?
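If you log response times, you can flag these steps automatically by comparing rolling medians before and after each point. A sketch with illustrative window and threshold values (not the ones behind our plots):

```python
from statistics import median

def find_steps(series_ms, window=5, jump=1.8):
    """Return indices where the median response time over the next
    `window` samples is at least `jump` times the median over the
    previous `window` samples, i.e. a sudden sustained 'step'.
    """
    steps = []
    for i in range(window, len(series_ms) - window + 1):
        before = median(series_ms[i - window:i])
        after = median(series_ms[i:i + window])
        if before > 0 and after / before >= jump:
            steps.append(i)
    return steps
```

Comparing medians over windows, rather than consecutive samples, keeps a single slow request from tripping the detector.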
Conclusion: take networking into account in your systems, try to minimize it as much as possible without heading into premature optimization, and plan ahead which zones you choose for when you cannot get away without it. Being aware of AWS zones can help you improve both performance and reliability.