Redis Sentinel failover with no downtime. The hard way.

by Rhommel Lamas

We are heavy users of Redis; in fact, most of our applications are built on top of it. Last month we decided it was time for us to step up and upgrade one of our main (though not biggest) clusters, which is used to store Resque jobs, from Redis 2.6.x to Redis 2.8.8, and to start using the latest Redis Sentinel version for automatic failover.

TL;DR

Redis Sentinel is a useful system designed to manage your Redis instances; it is commonly used for monitoring and for automatic failover when your Redis master stops working the way it is supposed to. During this process we discovered that while Redis Sentinel is performing a failover, there is a period of time (from a few milliseconds up to several seconds, depending on the size of your Redis database) during which your instances are not able to handle any requests at all.

Depending on how you use Redis, this may cause unavailability: something you can afford during an outage, but not during a planned maintenance window such as a Redis version upgrade, scheduled Redis maintenance, or whenever Amazon declares your instance under maintenance or pending reboot.
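You can observe this window yourself by forcing a failover by hand. A minimal sketch, assuming a Sentinel listening on localhost:26379 that monitors a master group named resque, with the current master at the hypothetical address 10.0.0.1:

    # Ask Sentinel to fail over the "resque" master group:
    redis-cli -p 26379 SENTINEL failover resque

    # Meanwhile, ping the current master 100 times at 100 ms intervals; some
    # requests will fail until the promotion completes and clients reconnect:
    redis-cli -h 10.0.0.1 -p 6379 -r 100 -i 0.1 PING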

The Problem

While using Redis Sentinel, we discovered one nasty bug, which we will explain in more depth in our next post. Luckily for us, Antirez solved the issue right away and released Redis 2.8.10.

Because of this new release, we had to find a way to upgrade Redis from version 2.8.8 to version 2.8.10, and we found that there was no easy way to do this transparently, with no downtime. Even though our code had safety measures to handle retries and timeouts, we still reached a point where we got timeouts from our cluster, so we had to find a workaround to make this work.

The Solution

Imagine a scenario where you have a front end and a backend with Resque workers for asynchronous tasks, like the one described below, all working with Sentinel. Your application communicates with Sentinel locally so it can know the status of your cluster of instances.

[Diagram: front end and Resque workers connected through Sentinel]
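Sentinel speaks the plain Redis protocol on its own port (26379 by default), so each application host can ask its local Sentinel who the current master is. A minimal sketch, assuming the master group is named resque:

    # Ask the local Sentinel for the address of the current master:
    redis-cli -p 26379 SENTINEL get-master-addr-by-name resque
    # The reply is the master's IP and port (addresses hypothetical):
    # 1) "10.0.0.1"
    # 2) "6379"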

In order to upgrade your master Redis instance with no downtime, we discovered that there are two approaches. I will start with the easiest one (a sketch of the Sentinel commands involved follows the diagram below):

First approach

  1. Create 3 new Redis instances.
  2. Create a new cluster in Sentinel, let’s call it resquemaintenance.
  3. Since we have 2 ResqueJob instances, deploy one ResqueJob instance pointing to the Sentinel cluster resquemaintenance.
  4. Deploy all FrontEnds pointing to Sentinel’s resquemaintenance.
  5. Wait until there are no pending jobs in the old Sentinel Cluster and deploy the missing ResqueJob instance pointing to Sentinel’s resquemaintenance.
  6. Remove the old cluster from Sentinel’s configuration.
  7. Kill old Redis instances.

[Diagram: first approach, migrating to the resquemaintenance cluster]
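Steps 2 and 6 can be done at runtime on each Sentinel; no restarts are required. A rough sketch, where the new master's address (10.0.0.4), the Sentinel host, and the old cluster name cluster01 are all placeholders:

    # Step 2: register the new cluster on every Sentinel (quorum of 2 here):
    redis-cli -h sentinel01 -p 26379 SENTINEL MONITOR resquemaintenance 10.0.0.4 6379 2

    # Step 6: once nothing points at the old cluster, remove it from every Sentinel:
    redis-cli -h sentinel01 -p 26379 SENTINEL REMOVE cluster01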

Second approach

Assuming that you have 3 Redis instances like this:

  1. Redis01 => Master.
  2. Redis02 => Slave of Redis01.
  3. Redis03 => Slave of Redis01.
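Before touching anything, it is worth confirming which role each instance currently has. The hostnames here are hypothetical:

    # Expect role:master on Redis01, and role:slave (with master_host
    # pointing at Redis01) on the other two:
    redis-cli -h redis01 -p 6379 INFO replication
    redis-cli -h redis02 -p 6379 INFO replication
    redis-cli -h redis03 -p 6379 INFO replication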

You should follow these steps (a redis-cli sketch of the key commands follows the diagram below):

  1. Upgrade your Redis slaves (Redis02 and Redis03) one by one.
  2. Create a new Redis instance called Redis04.
  3. Attach this Redis04 as a slave of Redis02.
  4. Create a new cluster in Sentinel called resquemaintenance, pointing to Redis02, with a quorum bigger than the number of Sentinels in your infrastructure; that way the quorum can never be reached and Sentinel will not trigger an automatic failover mid-migration.
  5. Set slave-read-only no on Redis02.
  6. Deploy your ResqueJob instances pointing to the Sentinel cluster resquemaintenance.
  7. Deploy your FrontEnd instances pointing to the Sentinel cluster resquemaintenance.
  8. Remove the old cluster01 from Sentinel.
  9. Run slaveof no one on Redis02.
  10. Set slave-read-only yes on Redis02.
  11. Attach Redis03 as a slave of Redis02.
  12. Delete Redis01.

[Diagram: second approach, promoting Redis02]
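For reference, here is roughly what steps 4, 5 and 8 through 11 look like as redis-cli commands. All hostnames and IPs are placeholders, and cluster01 stands for the old master group's name:

    # Step 4: on each Sentinel, monitor Redis02 as a new master group with an
    # unreachable quorum (e.g. 10 when you only run 3 Sentinels), so that
    # Sentinel can never agree the master is down and start a failover:
    redis-cli -h 10.0.0.10 -p 26379 SENTINEL MONITOR resquemaintenance 10.0.0.2 6379 10

    # Step 5: allow writes on Redis02 while it is technically still a slave:
    redis-cli -h 10.0.0.2 -p 6379 CONFIG SET slave-read-only no

    # Step 8: remove the old master group from every Sentinel:
    redis-cli -h 10.0.0.10 -p 26379 SENTINEL REMOVE cluster01

    # Step 9: promote Redis02 to a proper master:
    redis-cli -h 10.0.0.2 -p 6379 SLAVEOF NO ONE

    # Step 10: restore the read-only default (it only applies while a slave):
    redis-cli -h 10.0.0.2 -p 6379 CONFIG SET slave-read-only yes

    # Step 11: point Redis03 at the new master:
    redis-cli -h 10.0.0.3 -p 6379 SLAVEOF 10.0.0.2 6379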

We ended up using the second approach, with some tooling built with Capistrano to automate the tasks for all Redis and Redis Sentinel configurations.

Conclusion

Before implementing it we knew about the good parts of Redis Sentinel, but we were not aware of all the corner cases we could face when it comes to maintaining Redis under this topology. This has helped us evaluate our next Redis migration more carefully.

During this process we experienced, one more time, the importance of task automation, which makes everything much easier and less error-prone.

Published: June 18 2014
