Discussion:
[ceph-users] unusual growth in cluster after replacing journal SSDs
Jogi Hofmüller
2017-11-16 12:36:50 UTC
Dear all,

for about a month we have been experiencing something strange in our
small cluster. Let me first describe what happened along the way.

On Oct 4th smartmon told us that the journal SSD in one of our two
ceph nodes was about to fail. Since getting replacements took much
longer than expected, we decided to place the journals on a spare HDD
rather than have the SSD fail and leave us in an uncertain state.

On Oct 17th we finally got the replacement SSDs. First we replaced the
failing SSD and moved the journals from the temporarily used HDD to
the new SSD. Then we also replaced the journal SSD in the other ceph
node, since it would probably fail sooner rather than later.

We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again. We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.
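
Roughly, the sequence per OSD was the following (OSD 5 and the new
journal partition's partuuid are just placeholders):

  ceph osd set noout
  systemctl stop ceph-osd@5
  ceph-osd -i 5 --flush-journal
  # swap the disk, recreate the journal partition, then point the
  # OSD's journal symlink at the new partition
  ln -sf /dev/disk/by-partuuid/<new-partuuid> /var/lib/ceph/osd/ceph-5/journal
  ceph-osd -i 5 --mkjournal
  systemctl start ceph-osd@5
  # wait for HEALTH_OK, repeat for the remaining OSDs, then
  ceph osd unset noout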

AFAIR mkjournal crashed once on the second node, so we ran the command
again and the journals were created.

The next morning at 6:25 (the time cron.daily jobs run on Debian
systems) we registered almost 2000 slow requests. We had seen slow
requests before, but never more than 900 per day, and even that was
rare.

Another odd thing we noticed is that the cluster had grown overnight
by 50 GB! We currently run 12 vservers from ceph images, and none of
them is particularly busy; used data would normally grow by 2 GB per
week or less. Network traffic between our three monitors roughly
doubled at the same time and has stayed at that level ever since.
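
To narrow down where the extra data ends up, per-pool and per-image
usage can be compared over time, e.g. (pool and image names are
placeholders):

  ceph df detail          # cluster-wide and per-pool usage
  rados df                # per-pool object counts and bytes
  rbd du <pool>/<image>   # provisioned vs. actual usage of one image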

We eventually got rid of all the slow requests by removing all but one
snapshot per image. We used to take nightly snapshots of all images
and keep 14 snapshots per image.

Now we take one snapshot per image per night, use export-diff to
offload the diff to storage outside of ceph, and remove the nightly
snapshot right away. The only snapshot we keep is the one that the
diffs are based on.
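
The nightly routine boils down to something like this (pool, image and
date are placeholders; "base" is the snapshot the diffs are built on):

  rbd snap create <pool>/<image>@nightly-2017-11-17
  rbd export-diff --from-snap base <pool>/<image>@nightly-2017-11-17 \
      /backup/<image>-2017-11-17.diff
  rbd snap rm <pool>/<image>@nightly-2017-11-17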

What remains is the growth of used data in the cluster.

I put background information about our cluster and some graphs of
different metrics on a wiki page:

https://wiki.mur.at/Dokumentation/CephCluster

Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.

So my main question is: what went wrong when we replaced the journal
disks? And of course: how can we fix it?

As always, any hint appreciated!

Regards,
--
J.Hofmüller

I quote like aspen leaves.
- https://twitter.com/TheGurkenkaiser/status/463444397678690304
Burkhard Linke
2017-11-16 12:44:46 UTC
Hi,
Post by Jogi Hofmüller
Dear all,
for about a month we have been experiencing something strange in our
small cluster. Let me first describe what happened along the way.
On Oct 4th smartmon told us that the journal SSD in one of our two
ceph nodes was about to fail. Since getting replacements took much
longer than expected, we decided to place the journals on a spare HDD
rather than have the SSD fail and leave us in an uncertain state.
On Oct 17th we finally got the replacement SSDs. First we replaced the
failing SSD and moved the journals from the temporarily used HDD to
the new SSD. Then we also replaced the journal SSD in the other ceph
node, since it would probably fail sooner rather than later.
We performed all operations by setting noout first, then taking down
the OSDs, flushing journals, replacing disks, creating new journals and
starting OSDs again. We waited until the cluster was back in HEALTH_OK
state before we proceeded to the next node.
AFAIR mkjournal crashed once on the second node, so we ran the command
again and the journals were created.
*snipsnap*
Post by Jogi Hofmüller
What remains is the growth of used data in the cluster.
I put background information about our cluster and some graphs of
different metrics on a wiki page:
https://wiki.mur.at/Dokumentation/CephCluster
Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.
Just a wild guess (wiki page is not accessible yet):

Are you sure that the journals were created on the new SSD? If the
journals were created as files in the OSD directory, their size might be
accounted for in the cluster size report (assuming OSDs report their
free space, not the sum of all object sizes).

Regards,
Burkhard
Jogi Hofmüller
2017-11-17 07:15:17 UTC
Hi,
Post by Jogi Hofmüller
What remains is the growth of used data in the cluster.
I put background information about our cluster and some graphs of
different metrics on a wiki page:
   https://wiki.mur.at/Dokumentation/CephCluster
Basically we need to reduce the growth in the cluster, but since we
are not sure what causes it, we don't know where to start.
Oh damn, sorry! Fixed that. The wiki page is accessible now.
Are you sure that the journals were created on the new SSD? If the
journals were created as files in the OSD directory, their size might be
accounted for in the cluster size report (assuming OSDs report their
free space, not the sum of all object sizes).
Yes, I am sure. I just checked, and all the journal symlinks point to
the correct devices. See OSD 5 as an example:

ls -l /var/lib/ceph/osd/ceph-5
total 64
-rw-r--r--   1 root root   481 Mar 30  2017 activate.monmap
-rw-r--r--   1 ceph ceph     3 Mar 30  2017 active
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 ceph_fsid
drwxr-xr-x 342 ceph ceph 12288 Apr  6  2017 current
-rw-r--r--   1 ceph ceph    37 Mar 30  2017 fsid
lrwxrwxrwx   1 root root    58 Oct 17 14:43 journal -> /dev/disk/by-partuuid/f04832e3-2f09-460e-806f-4a6fe7aa1425
-rw-r--r--   1 ceph ceph    37 Oct 25 11:12 journal_uuid
-rw-------   1 ceph ceph    56 Mar 30  2017 keyring
-rw-r--r--   1 ceph ceph    21 Mar 30  2017 magic
-rw-r--r--   1 ceph ceph     6 Mar 30  2017 ready
-rw-r--r--   1 ceph ceph     4 Mar 30  2017 store_version
-rw-r--r--   1 ceph ceph    53 Mar 30  2017 superblock
-rw-r--r--   1 ceph ceph     0 Nov  7 11:45 systemd
-rw-r--r--   1 ceph ceph    10 Mar 30  2017 type
-rw-r--r--   1 ceph ceph     2 Mar 30  2017 whoami

Regards,
--
J.Hofmüller

Nisiti
- Abie Nathan, 1927-2008
Jogi Hofmüller
2018-02-06 11:18:58 UTC
Dear all,

we finally found the reason for the unexpected growth in our cluster.
The data was created by a collectd plugin [1] that measures latency by
running rados bench once a minute. Since our cluster was stressed for
a while, removing the objects created by rados bench failed. We
completely overlooked the log messages that should have given us a
hint much earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638
7f963389f700  0 -- IP:6802/1986 submit_message osd_op_reply(374
benchmark_data_ceph3_31746_object158 [delete] v21240'22867646
uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con,
dropping message 0x7f96672a6680

Over time we "collected" some 1.5TB of benchmark data :(

Furthermore, due to a misunderstanding, the collectd plugin that runs
the benchmarks was running on two machines, doubling the stress on the
cluster.

And finally we created benchmark data in our main production pool,
which also was a bad idea.
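
In case someone else runs into this: the leftover objects can be
listed by their prefix and removed with the cleanup subcommand of
rados, something like (the pool name is a placeholder):

  rados -p <pool> ls | grep '^benchmark_data' | wc -l   # how many are left
  rados -p <pool> cleanup --prefix benchmark_data       # remove them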

Hope this info will be useful for someone :)

[1] https://github.com/rochaporto/collectd-ceph

Cheers,
--
J.Hofmüller
We are all idiots with deadlines.
- Mike West