We have been running Redis in production for about 3 years now. Overall, it has been a joy. Redis is fast, it is reliable and, more importantly for a high-throughput backend like ours, it delivers consistent performance. Unlike the other data stores in our tech stack (MySQL and Cassandra), query latency is very predictable: 10ms is always 10ms, not a random number between 1ms and 100ms depending on who knows what.
Needless to say, we have had some problems with Redis, but they were always our fault. The latest issue is no exception; however, we believe it’s worth sharing…
The WTF moment
About one month ago, one of our Redis slaves in Amazon EC2 started to intermittently report unusual memory usage. The slave could end up using twice the memory that it should, that is, twice the memory that its master was using. It was odd since it was a stand-by slave for failover: it had no writes or reads from clients.
Most of the time, memory consumption went back to normal in a matter of minutes; sometimes it took a couple of hours. A couple of times, however, it reached the dreaded out-of-memory condition and naturally crashed hard. After restarting the slave, memory usage was always as expected. Everything seemed to be totally fine. Furthermore, the issue was intermittent: we had to wait one or two days for it to reappear. No correlation with spikes of traffic, no issues on the VM, no nothing.
So we had to keep digging…
At first we suspected background saving; it’s always the first candidate to blame. Disabling it had no effect. Memory usage still went AWOL from time to time.
We switched the offending slave to another EC2 instance of the same type. In the past we have seen that Amazon’s aggressive over-subscription policy ends up giving us quite heterogeneous instances in terms of performance (of the very same type). This did the trick for a while: events were more spread out in time, but they still happened.
We bought some time by going to an instance type with more memory; the problem was still there, only less frequent. The WTF level was rising at an alarming rate. Although the system would keep running, the whole failover of 3scale’s backend was at risk. If we had to fail over to the misbehaving slave, it would have been a blast.
The 3scale backend is the core of our API Management Solution. The backend does the authentication, authorization, rate-limiting, stats and alerts for 100+ APIs that rely on 3scale to get rid of all the boring yet necessary parts of managing and monitoring an API. Fortunately for our customers and users, we keep our uptime in check, as you can see in our status feed.
Given the criticality, we had to keep looking, and eventually we found what the problem was: replication from one Redis slave in Amazon EC2 (us-west) to yet another slave that we had in Rackspace could not keep up.
3scale’s backend has three layers of failover:
1. main master in Amazon EC2 (us-east),
2. first slave in Amazon EC2 (but in a different availability zone; the misbehaving one), and
3. a second slave in Rackspace.
Having a failover system outside Amazon EC2 allowed us to survive the not so uncommon Amazon meltdowns. However, having our backend in multiple data-centers of different providers was the very cause of the issue we were having. Bandwidth between Amazon EC2 and Rackspace was just not enough for the replication of the Redis instances to keep up.
Since the first slave (which is also the master of another slave) could not send the write commands to the second slave fast enough, it kept them in memory until they could be sent. New write commands were arriving faster than they could be sent, and consequently they were piling up. Mystery solved. In fact, Redis was warning us about this issue, but we did not know where to look:
$ redis-cli info
...
client_longest_output_list: Damn long number :/
...
Our client_longest_output_list was out of whack. It turns out that the replication connection is also a client. Promoting a slave to master when the client_longest_output_list is not zero is dangerous. There is a risk of data loss if pending write commands are never replayed on the slave. Think of the case of the master crashing hard: the output list is lost, the slave cannot sync with the master since it’s down, and you might have to promote a slave that is not in sync with the master. Bad, really bad. Nasty business unless there is some in-house consistency recovery mechanism.
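A minimal sketch of the kind of pre-promotion check this suggests, assuming you capture the plain-text output of `redis-cli info` from the instance that feeds the slave. The function names and the sample text are illustrative only, not part of Redis or of our tooling:

```python
def parse_info(text):
    """Parse the 'key:value' lines of `redis-cli info` output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, section headers such as "# Clients", and anything odd.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def safe_to_promote(upstream_info_text):
    """Return True only if the upstream reports an empty longest output list.

    A missing field is treated as "unknown", hence not safe.
    """
    fields = parse_info(upstream_info_text)
    pending = fields.get("client_longest_output_list")
    return pending is not None and int(pending) == 0


# Hypothetical INFO excerpt, for illustration only.
SAMPLE_INFO = """# Clients
connected_clients:12
client_longest_output_list:48213
"""

print(safe_to_promote(SAMPLE_INFO))
```

If the master is already down there is nothing left to query, so a check like this only helps when it runs continuously and raises an alert before a failover is needed.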
Redis replication works as follows (roughly): the master sends all write commands to its registered slaves using the Redis replication protocol, which is basically plain text. When a slave receives the write commands, it replays them. This way the Redis instances stay in sync. If the connection is lost, or if a new slave is created, the slave will start a SYNC operation. The master will then send the full RDB dataset in binary to the slave. After the RDB is loaded on the slave, all write commands executed on the master since the SYNC will be sent and replayed.
Our main master Redis instance processes between 25000 and 30000 req/s (on average). This is production data, not a benchmark. Single server, single core, without breaking a sweat and, more importantly, consistently. Did we already mention that Redis rocks?
Out of the 25k~30k requests, about 62% are writes that have to be replicated to all the slaves. For our particular case this means that we need a sustained bandwidth of approximately 23 Mbps to keep replication in sync when processing 30k req/s.
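The numbers above also imply an average size per replicated write. A back-of-the-envelope sketch, assuming decimal megabits and that all of the 23 Mbps is replication payload (both are simplifications):

```python
# Figures quoted in the post.
REQUESTS_PER_SEC = 30_000
WRITE_FRACTION = 0.62
REPLICATION_MBPS = 23

# Assumption: 1 Mbps = 1,000,000 bits/s, protocol overhead ignored.
writes_per_sec = REQUESTS_PER_SEC * WRITE_FRACTION
bytes_per_sec = REPLICATION_MBPS * 1_000_000 / 8
bytes_per_write = bytes_per_sec / writes_per_sec

print(f"{writes_per_sec:.0f} writes/s, about {bytes_per_write:.0f} bytes per write")
```

That works out to roughly 18,600 writes per second at about 155 bytes each, which is plausible for plain-text commands dominated by long, regular key names.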
And that’s just too much… 23 Mbps is way too close to the observed bandwidth between Amazon EC2 and Rackspace for a single flow.
Within the same data-center you can get standard gigabit bandwidth. Within the same availability zone in Amazon we have seen 750Mbps (close to the maximum capacity, but there are no guarantees). Personally, I would not worry about bandwidth requirements below 100-150Mbps between Amazon instances, even if they are across regions (data-centers).
However, when you go across different providers you are out on the Internet, at the mercy of ISPs like Level3. 23 Mbps sustained 24/7 is not that low. We would not say that the connectivity between Amazon and Rackspace is shitty; it is not great, but not terrible either. The issue is that Redis can easily max out your network I/O.
So what happens when we do not get the desired 23 Mbps? For every hour that we are below the 23 Mbps threshold, let’s say at 75% capacity, Redis has to keep ~2GB of additional data in memory. It’s easy to see how pending writes can pile up due to network congestion and eventually crash the Redis instance with an out-of-memory error that seems to come out of the blue.
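The backlog growth can be estimated with simple arithmetic. This is a rough sketch that counts only raw command bytes in decimal units; the function name is ours, and real Redis memory usage will differ because of per-buffer overhead:

```python
def backlog_gb(required_mbps, available_fraction, hours):
    """Estimate pending replication data (in GB) when the link delivers only
    `available_fraction` of the bandwidth that replication needs."""
    deficit_mbps = required_mbps * (1 - available_fraction)
    bytes_per_sec = deficit_mbps * 1_000_000 / 8
    return bytes_per_sec * 3600 * hours / 1e9


one_hour = backlog_gb(required_mbps=23, available_fraction=0.75, hours=1)
print(f"{one_hour:.2f} GB of pending writes after one hour")
```

Under these assumptions the estimate comes out at about 2.6 GB per hour, the same ballpark as the ~2GB figure quoted above, and it grows linearly for as long as the link stays congested.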
Finally, the Solution
Once we found the problem, the solution was pretty straightforward. There were several options that we discarded:
Using multiple flows. Complex to set up and not very scalable; a good setup could double your available bandwidth. This would be a lot of effort only to have to revisit the problem in 2 months. Since we launched the free self-service API management (beginning of May 2012) we have increased traffic by 110%.
Using the upcoming Redis Lua scripts to create functions encapsulating correlated commands. Most of our Redis commands can be grouped in batches (we are heavy pipeline users). With Lua functions, only the function call needs to be sent to the slave. Creating Lua functions would save a lot of bandwidth, but it requires extensive work on the code base and tests. That’s probably the best long-term solution and what we will do in the near future (Redis 2.6 is still at the release-candidate stage).
What we finally did to decrease network bandwidth was compress the data needed for replication between Redis instances. The Redis commands are plain text and they are mostly keys that follow regular patterns. A one-minute check showed us that we can compress 10GB of write commands to a bit less than 1GB. Compression was the way to go.
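A similar quick check can be sketched with zlib, the compression library that ssh relies on. The commands below are synthetic: the key layout, the id ranges and the helper name are hypothetical stand-ins, since the real keys are not shown here, so the exact ratio will not match the production figure:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def fake_write_command():
    """Build one synthetic write command with a regular key pattern.

    The key layout is hypothetical and only mimics "regular patterns".
    """
    service = rng.randrange(100)
    metric = rng.randrange(20)
    line = f"INCRBY stats/service:{service}/metric:{metric}/day:20120601 1\r\n"
    return line.encode()


raw = b"".join(fake_write_command() for _ in range(50_000))
compressed = zlib.compress(raw, 6)
ratio = len(raw) / len(compressed)

print(f"{len(raw)} bytes -> {len(compressed)} bytes, ratio {ratio:.1f}x")
```

The more regular the keys, the better the ratio, so it is worth running this kind of test against a capture of your own write traffic before committing to compression.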
After some hacking attempts to use endless pipes of lzop, we settled for the out-of-the-box solution:
With ssh you can create tunnels between servers. On the server hosting the Redis slave, type:
$ ssh -C -L 6280:localhost:6379 $MASTER_REDIS
Any connection to localhost:6280 will be tunneled to $MASTER_REDIS:6379. The data going through the ssh tunnel will be automatically compressed thanks to the -C option. As simple as that. The only thing left to do is to tell the Redis slave to sync to localhost:6280 instead of syncing directly to the master:
$ redis-cli slaveof localhost 6280
Note that you might have to enable compression in your ssh configuration (~/.ssh/config or /etc/ssh/ssh_config):
Compression yes
CompressionLevel 5
The CompressionLevel option is a trade-off between bandwidth and CPU (in OpenSSH it applies to protocol version 1 only). Needless to say, compression and encryption do not come for free. There is a CPU overhead, but it is negligible, and a Redis instance is unlikely to be CPU-bound anyway.
Before and After
With a screen capture of nload, we can see a one-minute snapshot of what we had before compression:
Current utilization is 23.7Mbps, keeping up with the replication throughput, with a minimum (within the minute) of 12.5Mbps and a maximum of 35.9Mbps. We can see that the connection fluctuates quite a bit. If the average stays below 23Mbps for long, we will start to see the replication fall behind.
Once we enable compression, the picture changes quite a bit (also a one-minute snapshot):
Current utilization is 1.94Mbps, average 1.80Mbps. Replication lag, zero. What did we get? A ten-fold reduction of the bandwidth with a marginal increase in CPU usage. And, more importantly, all Redis instances in perfect sync.
A lovely by-product of the compression is seeing how it affects your monthly bill (Amazon might not agree on this point). An average of 23Mbps gives you a monthly consumption of ~7 TB. The first GB is on the house. (Thanks to Nicholas for pointing out the error.) The rest of the data costs between $0.12 and $0.05 per GB, depending on your Amazon account’s volume. In any case, we are talking about hundreds of dollars that you can divide by 10 just by enabling compression.
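The bill can be approximated the same way. A rough sketch, assuming a 30-day month, decimal units, and a single flat price per GB at each end of the quoted range (real tiered pricing and the free first GB are ignored):

```python
def monthly_tb(mbps, days=30):
    """Data transferred in a month (decimal TB) at a sustained rate in Mbps."""
    return mbps * 1_000_000 / 8 * 86_400 * days / 1e12


uncompressed_tb = monthly_tb(23)
compressed_tb = uncompressed_tb / 10  # the ten-fold reduction reported above

for price_per_gb in (0.12, 0.05):
    before = uncompressed_tb * 1000 * price_per_gb
    after = compressed_tb * 1000 * price_per_gb
    print(f"${price_per_gb:.2f}/GB: ${before:.0f} -> ${after:.0f} per month")
```

With these assumptions 23Mbps is about 7.45 TB a month, so the outbound transfer bill lands somewhere between roughly $370 and $900 before compression and a tenth of that afterwards.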
Getting ready for Production
Before going to production we have to do some additional work. The ssh tunnel as described above is very fragile. If the connection breaks, and it will, the replication will break too.
For those cases, it’s advisable to set up autossh, which will automatically restart the ssh tunnel. You can use the following bash script to start the autossh process…
Better to combine the autossh with Monit to make sure it’s restarted if it dies for an unknown reason. The simplest approach:
Monit will check if there is an “autossh” process with the proper PID and, if not, it will start the autossh process. The use of Monit at 3scale deserves a post of its own. Note that the Monit setup above is the lowest of the low: it assumes that autossh runs as a bash script in /home/bender/bin. Obviously, it is better to create a service daemon so that you can do
Autossh has a somewhat bad reputation internally, since we have seen it fail multiple times in our Jenkins and magic IRC-driven development setup. Since we started to use compression we have not seen any issue with autossh for the Redis replication, but still, this is something that we are keeping a close eye on.
In favor of the Redis’ autossh setup we must say that the network connectivity between Amazon and Rackspace is way better than between Amazon and 3scale’s HQ (fiber). Another factor that can help autossh to behave itself is that the session never goes inactive. Redis is always pushing traffic through the ssh tunnel.
When using Redis in production on a high-throughput system, think of bottlenecks that are not the typical ones. In our case, we explored many options before we realized it was a network bandwidth issue. Redis is rock solid; it’s most likely your fault :-) Don’t assume anything.
Do not attempt to set up a slave from the master if you cannot guarantee replication with no delay. In our case, the slave in Rackspace was syncing from a master in Amazon that was itself a slave of another master in Amazon. That’s ok, because the one that crashed was a slave in a consistent state with the main master. But if the slave that is lagging behind gets automatically promoted to main master, you can end up in an inconsistent state that is difficult, if not impossible, to recover from.
Compression is cheap CPU-wise, and it can save costs if applied properly.
Please do not hesitate to leave a comment or ping us. If you are interested in the topic, you might also want to check the comments on Hacker News.