[Header image: a red/green off/on pair of old, rusty electrical buttons; the "on" button breaks under the user's finger.]

How (not) to shut down a Ceph cluster

Sometimes there is a need to shut down a whole Ceph cluster temporarily. For example, all of the infrastructure may have to be physically moved to a different building. Or the air conditioning system fails, and the datacenter staff tells you that all non-essential loads must be shut down to limit heat production. Either way, when people face this task, they naturally turn to search engines and follow the most widespread and seemingly credible instructions, echoed by well-recognized Ceph vendors. These instructions involve setting and unsetting multiple OSD flags, including nodown and pause.

The problem is that, in our experience, these instructions are incorrect and dangerous.

To save you from reading the whole article: nowadays, we recommend using the noout flag and no other flags when shutting down a cluster.

Incorrect Instructions

The instructions found in the majority of online sources boil down to shutting down all the clients, bringing down the RADOS gateways, filesystems, and metadata servers (MDSs), and then running this sequence of commands to set a number of OSD flags:

# Warning: Incorrect!
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause

…followed by shutting down OSDs, MGRs, and MONs.

When the cluster is to be brought up again, allegedly, starting all MONs, MGRs, and OSDs, unsetting all the flags, and then starting all other services would be enough. And yet, the failure of precisely this procedure was the subject of an urgent support request that we received a few days ago.
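For reference, “unsetting all the flags” means issuing the unset counterparts of the commands above, for example in reverse order (a sketch only; the exact order varies between sources):

# Warning: part of the same questionable procedure!
ceph osd unset pause
ceph osd unset nodown
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout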

The Failure Mode

The customer booted all servers, then attempted to unset the pause flag using the croit UI. The attempt failed with a timeout. Sadly, the ceph osd unset pause command, when issued from a terminal, also hung and did not return to the prompt.

The croit UI main page often, but not always, displayed the “Can't contact Ceph” message. On the command line, ceph -s was laggy as well. When it did return the cluster status, the output was similar to this:

(docker-croit)@ceph-cm ~ $ ceph -s
  cluster:
    id:     <withheld for privacy>
    health: HEALTH_WARN
            1/5 mons down, quorum ceph-rgw07,ceph-rgw08,ceph-stg01,ceph-stg02
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
            6 osds down
            Slow OSD heartbeats on front (longest 285869.538ms)
            Reduced data availability: 13187 pgs inactive, 1 pg down, 8674 pgs peering, 38 pgs stale
            Degraded data redundancy: 1109870/13247094866 objects degraded (0.008%), 7 pgs degraded, 7 pgs undersized
            125547 slow ops, oldest one blocked for 35 sec, mon.ceph-rgw07 has slow ops

  services:
    mon: 5 daemons, quorum ceph-rgw07,ceph-rgw08,ceph-stg01,ceph-stg02 (age 6s), out of quorum: ceph-rgw06
    mgr: ceph-rgw07(active, since 2h), standbys: ceph-rgw06, ceph-rgw08, ceph-stg01, ceph-stg02
    osd: 728 osds: 722 up (since 53s), 727 in (since 2w)
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

  data:
    pools:   10 pools, 14640 pgs
    objects: 1.44G objects, 2.6 PiB
    usage:   4.0 PiB used, 1.7 PiB / 5.7 PiB avail
    pgs:     30.697% pgs unknown
             59.378% pgs not active
             1109870/13247094866 objects degraded (0.008%)
             8636 peering
             4494 unknown
             1446 active+clean
             38   stale+peering
             18   activating
             7    active+undersized+degraded
             1    down

The number of down OSDs fluctuated but never reached zero. The MON quorum was similarly unstable.

On the servers, both OSDs and MONs consumed far more CPU than usual, leaving zero idle time. The system load reached as high as 96 (on servers with 64 CPU threads). Presumably, the OSDs were busy with peering, which is the process of exchanging the list of objects and agreeing on their state. They were so busy that they had no time to communicate with all their peers. Because of that, they were regularly auto-declared down, each time creating a new OSD map epoch, but the nodown flag prevented this from having any real effect. Nevertheless, the constant stream of new OSD map epochs contributed to the unsustainable MON load.
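If you ever need to confirm this failure mode yourself, one quick check (a hedged sketch; the exact output format varies between Ceph releases) is to watch how fast the OSD map epoch grows:

# Print the current OSD map epoch every few seconds; a rapidly growing
# number means a constant stream of new OSD maps is hitting the MONs.
watch -n 5 "ceph osd dump | grep ^epoch"

# The health detail also shows the slow ops piling up on the MONs.
ceph health detail | head -n 20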

Resolution

At that time, the support engineer handling this incident started chasing a (non-existent) network problem, thinking that something like this could prevent OSDs from communicating properly. The ceph osd unset pause command was still hanging in the background.

In the end, no magic was needed.

The ceph osd unset pause command finished after almost half an hour; this resolved neither the system load issue nor the unstable MON quorum, and ceph -s remained laggy. The majority of PGs were still in the peering state. We then proceeded with the ceph osd unset nodown command, just in case. When it eventually finished, the load disappeared and the cluster became snappy. We moved on to unset the other flags, and, in the end, the cluster became healthy.

So, in the end, the widely advertised procedure did work, although we would rather not count on it ever working again in a cluster of this size (700+ OSDs). It does, however, definitely work for smaller clusters.

We could probably have accelerated the wait by stopping all OSDs, issuing the ceph osd unset … commands while nobody was hammering the MONs, and starting the OSDs again. We don’t have a test cluster large enough to verify this theory.
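An untested sketch of that idea, assuming a package-based deployment where the OSDs are managed via the ceph-osd.target systemd unit (host names are placeholders; cephadm-based clusters use different unit names):

# 1. Stop all OSDs so that nothing hammers the MONs.
for host in ceph-osd01 ceph-osd02 ceph-osd03; do
    ssh "$host" sudo systemctl stop ceph-osd.target
done

# 2. With the MONs idle, the flag changes should go through quickly.
for flag in pause nodown nobackfill norebalance norecover; do
    ceph osd unset "$flag"
done

# 3. Start the OSDs again; keep noout set until they are all back up.
for host in ceph-osd01 ceph-osd02 ceph-osd03; do
    ssh "$host" sudo systemctl start ceph-osd.target
done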

Some Digital Archaeology

Out of curiosity, we tried to trace the origins of the nodown + pause advice. Apparently, according to a ceph-users mailing list post from April 2017, the advice already existed in the Red Hat knowledge base back then. It is still there, in a much more elaborate form, in IBM product documentation.

Judging by the complete git log of the upstream Ceph source code repository, which also contains all the documentation ever published on docs.ceph.com, this advice has never been part of the official Ceph documentation. That same official documentation does not currently contain any cluster shutdown procedure at all.

We could not find any justification in online sources for all the flags now mentioned by IBM. The closest match is a short article originally published by OpenAttic in 2018. Interestingly, it portrays the norebalance, nodown, and pause flags as optional, needed only “if you would like to pause your cluster completely” and not necessary for a safe power-down.

Another dissenter is SUSE Enterprise Storage, whose documentation mentions only the noout, nobackfill, and norecover flags in its cluster shutdown procedure.

Our Advice on Cluster Shutdown

We recommend the following procedure for shutting a Ceph cluster down; a short command sketch follows the list:

  1. Stop all clients.
  2. Check that the cluster is healthy.
  3. Stop all NFS, SMB, iSCSI, and NVMe-oF gateways.
  4. Stop all mirror daemons and RADOS gateways.
  5. Set the noout flag, so that the cluster does not attempt to redistribute the data by creating extra copies or EC shards elsewhere when OSDs go down.
  6. Shut down all nodes in any order.
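As an illustration, here are steps 2, 5, and 6 expressed as commands, assuming the nodes are reachable over SSH (host names are placeholders):

# Step 2: make sure the cluster is healthy before proceeding.
ceph status
ceph health detail

# Step 5: prevent OSDs from being marked "out" while they are down.
ceph osd set noout

# Step 6: shut down all nodes, in any order.
for host in ceph-node01 ceph-node02 ceph-node03; do
    ssh "$host" sudo shutdown -h now
done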

Please think of the noout flag as a promise that all the OSDs that go down while this flag is set will come back up, intact, in the future.

While the nobackfill and norecover flags recommended by SUSE are not harmful, they are not really needed either. They would only postpone the work required for OSDs to bring their copies of the data up to date from OSDs that were shut down later and started earlier. In any case, this work has to be done at some point, and it involves only a small amount of data movement.

When the cluster is brought up again by booting all nodes, it should eventually, without any external help, reach a state where the only remaining health warning is the one related to the noout flag. If this doesn't happen, you may need to find and restart a few misbehaving daemons. Finally, unset the noout flag.
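The bring-up side, as a matching sketch (again assuming a package-based deployment where individual OSDs run as ceph-osd@<id> systemd units; the host name and OSD ID are placeholders, and cephadm-based clusters use different unit names):

# Wait until the only remaining health warning is about the noout flag.
ceph status
ceph health detail

# If a daemon misbehaves, restart it on its host, e.g. a single OSD.
ssh ceph-node02 sudo systemctl restart ceph-osd@42

# Once everything is back up and stable, clear the flag.
ceph osd unset noout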

Closing Remarks

In fact, Ceph is very resilient to power-loss events. We have a customer for whom they happen regularly, and so far, they have never needed our help to restore proper operation once power came back. Ceph upstream also tests this exact scenario as part of its pre-release checklist. And it does not test the widely replicated procedure with the nodown and pause flags.