Discussion:
[ceph-users] unusual growth in cluster after replacing journal SSDs
Jogi Hofmüller
2017-11-16 12:36:50 UTC
Dear all,

for about a month we have been experiencing something strange in our
small cluster. Let me first describe what happened along the way.

On Oct 4th smartmon told us that the journal SSD in one of our two
ceph nodes was about to fail. Since getting replacements took much
longer than expected, we decided to place the journals on a spare HDD
rather than have the SSD fail and leave us in an uncertain state.

On Oct 17th we finally got the replacement SSDs. First we replaced the
failing SSD and moved the journals from the temporarily used HDD to
the new SSD. Then we also replaced the journal SSD in the other ceph
node, since it would probably fail sooner rather than later.

We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again. We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.
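
Roughly, the sequence per OSD was the following (OSD 5 and the new
journal partition's partuuid are just placeholders):

  ceph osd set noout
  systemctl stop ceph-osd@5
  ceph-osd -i 5 --flush-journal
  # swap the disk, recreate the journal partition, then point the
  # OSD's journal symlink at the new partition
  ln -sf /dev/disk/by-partuuid/<new-partuuid> /var/lib/ceph/osd/ceph-5/journal
  ceph-osd -i 5 --mkjournal
  systemctl start ceph-osd@5
  # wait for HEALTH_OK, repeat for the remaining OSDs, then
  ceph osd unset noout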

AFAIR mkjournal crashed once on the second node, so we ran the command
again and the journals were created.

The next morning at 6:25 (the time cron.daily jobs run on Debian
systems) we registered almost 2000 slow requests. We had seen slow
requests before, but never more than 900 per day, and even that was
rare.

Another odd thing we noticed is that the cluster had grown overnight
by 50 GB! We currently run 12 vservers from ceph images, and none of
them is particularly busy; used data would normally grow by 2 GB per
week or less. Network traffic between our three monitors roughly
doubled at the same time and has stayed at that level ever since.
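
To narrow down where the extra data ends up, per-pool and per-image
usage can be compared over time, e.g. (pool and image names are
placeholders):

  ceph df detail          # cluster-wide and per-pool usage
  rados df                # per-pool object counts and bytes
  rbd du <pool>/<image>   # provisioned vs. actual usage of one image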

We eventually got rid of all the slow requests by removing all but one
snapshot per image. We used to take nightly snapshots of all images
and keep 14 snapshots per image.

Now we take one snapshot per image per night, use export-diff to
offload the diff to storage outside of ceph, and remove the nightly
snapshot right away. The only snapshot we keep is the one that the
diffs are based on.
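
The nightly routine boils down to something like this (pool, image and
date are placeholders; "base" is the snapshot the diffs are built on):

  rbd snap create <pool>/<image>@nightly-2017-11-17
  rbd export-diff --from-snap base <pool>/<image>@nightly-2017-11-17 \
      /backup/<image>-2017-11-17.diff
  rbd snap rm <pool>/<image>@nightly-2017-11-17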

What remains is the growth of used data in the cluster.

I put background information about our cluster and some graphs of
different metrics on a wiki page:

https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.

So my main question is: what went wrong when we replaced the journal
disks? And of course: how can we fix it?

As always, any hint appreciated!

Regards,
--
J.Hofmüller

I quote like aspen leaves.
- https://twitter.com/TheGurkenkaiser/status/463444397678690304
Burkhard Linke
2017-11-16 12:44:46 UTC
Hi,
Post by Jogi Hofmüller
Dear all,
for about a month we have been experiencing something strange in our
small cluster. Let me first describe what happened along the way.
On Oct 4th smartmon told us that the journal SSD in one of our two
ceph nodes was about to fail. Since getting replacements took much
longer than expected, we decided to place the journals on a spare HDD
rather than have the SSD fail and leave us in an uncertain state.
On Oct 17th we finally got the replacement SSDs. First we replaced the
failing SSD and moved the journals from the temporarily used HDD to
the new SSD. Then we also replaced the journal SSD in the other ceph
node, since it would probably fail sooner rather than later.
We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again. We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.
AFAIR mkjournal crashed once on the second node, so we ran the command
again and the journals were created.
*snipsnap*
Post by Jogi Hofmüller
What remains is the growth of used data in the cluster.
I put background information about our cluster and some graphs of
different metrics on a wiki page:
https://wiki.mur.at/Dokumentation/CephCluster
Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.
Just a wild guess (wiki page is not accessible yet):

Are you sure that the journals were created on the new SSD? If the
journals were created as files in the OSD directory, their size might be
accounted for in the cluster size report (assuming OSDs report their
free space, not the sum of all object sizes).

Regards,
Burkhard
Jogi Hofmüller
2017-11-17 07:15:17 UTC
Hi,
Post by Jogi Hofmüller
What remains is the growth of used data in the cluster.
I put background information about our cluster and some graphs of
different metrics on a wiki page:
   https://wiki.mur.at/Dokumentation/CephCluster
Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.
Oh damn, sorry! Fixed that. The wiki page is accessible now.
Are you sure that the journals were created on the new SSD? If the
journals were created as files in the OSD directory, their size might be
accounted for in the cluster size report (assuming OSDs report their
free space, not the sum of all object sizes).
Yes, I am sure. I just checked, and all the journal symlinks point to
the correct devices. See OSD 5 as an example:

ls -l /var/lib/ceph/osd/ceph-5
total 64
-rw-r--r--   1 root root   481 Mar 30  2017 activate.monmap
-rw-r--r--   1 ceph ceph     3 Mar 30  2017 active
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 ceph_fsid
drwxr-xr-x 342 ceph ceph 12288 Apr  6  2017 current
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 fsid
lrwxrwxrwx   1 root root    58 Oct 17 14:43 journal -> /dev/disk/by-partuuid/f04832e3-2f09-460e-806f-4a6fe7aa1425
-rw-r--r--   1 ceph ceph    37 Oct 25 11:12 journal_uuid
-rw-------   1 ceph ceph    56 Mar 30  2017 keyring
-rw-r--r--   1 ceph ceph    21 Mar 30  2017 magic
-rw-r--r--   1 ceph ceph     6 Mar 30  2017 ready
-rw-r--r--   1 ceph ceph     4 Mar 30  2017 store_version
-rw-r--r--   1 ceph ceph    53 Mar 30  2017 superblock
-rw-r--r--   1 ceph ceph     0 Nov  7 11:45 systemd
-rw-r--r--   1 ceph ceph    10 Mar 30  2017 type
-rw-r--r--   1 ceph ceph     2 Mar 30  2017 whoami

Regards,
--
J.Hofmüller

Nisiti
- Abie Nathan, 1927-2008
Jogi Hofmüller
2018-02-06 11:18:58 UTC
Dear all,

we finally found the reason for the unexpected growth in our cluster.
The data was created by a collectd plugin [1] that measures latency by
running rados bench once a minute. Since our cluster was stressed for
a while, removing the objects created by rados bench failed. We
completely overlooked the log messages that should have given us a
hint much earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638
7f963389f700  0 -- IP:6802/1986 submit_message osd_op_reply(374
benchmark_data_ceph3_31746_object158 [delete] v21240'22867646
uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con,
dropping message 0x7f96672a6680

Over time we "collected" some 1.5TB of benchmark data :(

Furthermore, due to a misunderstanding, the collectd plugin that runs
the benchmarks was running on two machines, doubling the stress on the
cluster.

And finally we created benchmark data in our main production pool,
which also was a bad idea.
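
In case someone else runs into this: the leftover objects can be
listed by their prefix and removed with the cleanup subcommand of
rados, something like (the pool name is a placeholder):

  rados -p <pool> ls | grep '^benchmark_data' | wc -l   # how many are left
  rados -p <pool> cleanup --prefix benchmark_data       # remove them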

Hope this info will be useful for someone :)

[1] https://github.com/rochaporto/collectd-ceph

Cheers,
--
J.Hofmüller
We are all idiots with deadlines.
- Mike West