Discussion:
[ceph-users] Unexplainable high memory usage OSD with BlueStore
Wido den Hollander
2018-11-08 08:52:48 UTC
Hi,

Recently I've seen a Ceph cluster experience a few outages due to memory
issues.

The machines:

- Intel Xeon E3 CPU
- 32GB Memory
- 8x 1.92TB SSD
- Ubuntu 16.04
- Ceph 12.2.8

Looking at one of the machines:

***@ceph22:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         22G        8.2G        9.8M        809M        8.5G
Swap:          1.9G          0B        1.9G
***@ceph22:~#

As you can see, it's using 22GB of the 32GB in the system.

[osd]
bluestore_cache_size_ssd = 1G

The BlueStore cache size for SSDs has been set to 1GB, so the OSDs
shouldn't use more than that.

When dumping the mem pools each OSD claims to be using between 1.8GB and
2.2GB of memory.

$ ceph daemon osd.X dump_mempools|jq '.total.bytes'

Summing up all the values I get a total of 15.8GB, while the system is
using 22GB.
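
Summing those per-daemon totals is easy to script; a rough sketch,
assuming the default Ubuntu admin-socket location under /var/run/ceph:

```shell
# Sum dump_mempools totals over every OSD admin socket on this host
# (sketch -- the socket path is the distro default, adjust as needed)
total=0
for sock in /var/run/ceph/ceph-osd.*.asok; do
    bytes=$(ceph daemon "$sock" dump_mempools | jq '.total.bytes')
    total=$((total + bytes))
done
echo "mempool total: $((total / 1024 / 1024)) MiB"
```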

Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
memory, which would be ~3GB for a single daemon.
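
A quick way to get those per-daemon numbers directly, as a sketch:

```shell
# Resident set size per ceph-osd process, largest first, in MiB (sketch)
ps -C ceph-osd -o pid=,rss= --sort=-rss | \
    awk '{printf "osd pid %s: %.0f MiB\n", $1, $2/1024}'
```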

Sometimes these machines go OOM without a really good reason. After all
the OSDs have been restarted the machine runs fine, but about 10 days
later the same problem arises again.

I know that BlueStore uses more memory, but 32GB with 8 OSDs should be a
workable setup.
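
The back-of-the-envelope budget behind that statement, with the ~1GB of
per-OSD overhead outside the cache being my own assumption:

```shell
# Rough budget: 8 OSDs x (1 GiB cache + ~1 GiB assumed overhead)
osds=8; cache_gib=1; overhead_gib=1
echo "expected: $((osds * (cache_gib + overhead_gib))) GiB of 32 GiB"
```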

Any ideas?

Wido
Stefan Kooman
2018-11-08 10:34:17 UTC
Post by Wido den Hollander
Hi,
Recently I've seen a Ceph cluster experience a few outages due to memory
issues.
- Intel Xeon E3 CPU
- 32GB Memory
- 8x 1.92TB SSD
- Ubuntu 16.04
- Ceph 12.2.8
What kernel version is running? What network card is being used?

We hit a memory-leak bug in the Intel i40e driver (Intel X710) which has
been (mostly) fixed in kernels 4.13 and up [1].

Gr. Stefan

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1748408
--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / ***@bit.nl
Wido den Hollander
2018-11-08 10:37:33 UTC
Post by Stefan Kooman
Post by Wido den Hollander
Hi,
Recently I've seen a Ceph cluster experience a few outages due to memory
issues.
- Intel Xeon E3 CPU
- 32GB Memory
- 8x 1.92TB SSD
- Ubuntu 16.04
- Ceph 12.2.8
What kernel version is running? What network card is being used?
4.15.0-38-generic

The NIC in this case is 1GbE, an Intel I210.

It's this SuperMicro mainboard:
https://www.supermicro.com/products/motherboard/xeon/c220/x10sl7-f.cfm

So the kernel is already at 4.15 and it's not using any of those NICs.

Thanks!

Wido
Post by Stefan Kooman
We hit a memory-leak bug in the Intel i40e driver (Intel X710) which has
been (mostly) fixed in kernels 4.13 and up [1].
Gr. Stefan
[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1748408
Hector Martin
2018-11-08 11:28:26 UTC
Post by Wido den Hollander
[osd]
bluestore_cache_size_ssd = 1G
The BlueStore Cache size for SSD has been set to 1GB, so the OSDs
shouldn't use more than that.
When dumping the mem pools each OSD claims to be using between 1.8GB and
2.2GB of memory.
$ ceph daemon osd.X dump_mempools|jq '.total.bytes'
Summing up all the values I get to a total of 15.8GB and the system is
using 22GB.
Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
memory, which would be ~3GB for a single daemon.
This is similar to what I see on a memory-starved host with the OSDs
configured with very little cache:

[osd]
bluestore cache size = 180000000

$ ceph daemon osd.13 dump_mempools|jq '.mempool.total.bytes'
163117861

That adds up, but ps says:

USER       PID %CPU %MEM     VSZ    RSS TTY  STAT START TIME COMMAND
ceph    234576  2.6  6.2 1236200 509620 ?    Ssl  20:10 0:16 /usr/bin/ceph-osd -i 13 --pid-file /run/ceph/osd.13.pid -c /etc/ceph/ceph.conf --foreground

So ~500MB RSS for this one. Due to an emergency situation that made me
lose half of the RAM on this host, I'm actually resorting to killing the
oldest OSD every 5 minutes right now to keep the server from OOMing
(this will be fixed soon).

I would very much like to know if this OSD memory usage outside of the
bluestore cache size can be bounded or reduced somehow. I don't
particularly care about performance, so it would be useful to be able to
tune it lower. This would help single-host and smaller Ceph use cases; I
think Ceph's properties make it a very interesting alternative to things
like btrfs and zfs, but dedicating several GB of RAM per disk/OSD is not
always viable. Right now it seems that, besides the cache, OSDs creep up
in memory usage to some threshold, and I'm not sure what determines that
baseline usage or whether it can be controlled.
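
For what it's worth, Ceph releases newer than the 12.2.8 in this thread
grew an osd_memory_target option that autotunes the BlueStore caches
toward an overall per-process byte budget; something along these lines
(the exact value is illustrative):

```
[osd]
# Overall per-OSD memory budget the cache autotuner aims for
# (available in releases newer than the 12.2.8 discussed here)
osd_memory_target = 2147483648
```
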
--
Hector Martin (***@marcansoft.com)
Public Key: https://mrcn.st/pub
Wido den Hollander
2018-11-08 12:36:11 UTC
Post by Hector Martin
Post by Wido den Hollander
[osd]
bluestore_cache_size_ssd = 1G
The BlueStore Cache size for SSD has been set to 1GB, so the OSDs
shouldn't use more than that.
When dumping the mem pools each OSD claims to be using between 1.8GB and
2.2GB of memory.
$ ceph daemon osd.X dump_mempools|jq '.total.bytes'
Summing up all the values I get to a total of 15.8GB and the system is
using 22GB.
Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
memory, which would be ~3GB for a single daemon.
This is similar to what I see on a memory-starved host with the OSDs
[osd]
bluestore cache size = 180000000
$ ceph daemon osd.13 dump_mempools|jq '.mempool.total.bytes'
163117861
Interesting. Looking at my OSD in this case (cache = 1GB) I see
BlueStore reporting 1548288000 bytes at bluestore_cache_data.

That's 1.5GB while 1GB has been set.

This OSD claims to be using 2GB in total at mempool.total.bytes.

So that's 1.5GB for BlueStore's cache and then 512M for the rest?

PGLog and OSDMaps aren't using that much memory.
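
The per-pool split can be pulled straight from the dump; a sketch (the
JSON layout differs a bit between Luminous point releases, so the exact
keys may need adjusting):

```shell
# List mempool usage per pool, largest first (sketch)
ceph daemon osd.0 dump_mempools | \
    jq -r 'to_entries
           | map(select(.key != "total"))
           | sort_by(-.value.bytes)[]
           | "\(.key): \(.value.bytes)"'
```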

Wido
Post by Hector Martin
USER       PID %CPU %MEM     VSZ    RSS TTY  STAT START TIME COMMAND
ceph    234576  2.6  6.2 1236200 509620 ?    Ssl  20:10 0:16 /usr/bin/ceph-osd -i 13 --pid-file /run/ceph/osd.13.pid -c /etc/ceph/ceph.conf --foreground
So ~500MB RSS for this one. Due to an emergency situation that made me
lose half of the RAM on this host, I'm actually resorting to killing the
oldest OSD every 5 minutes right now to keep the server from OOMing
(this will be fixed soon).
I would very much like to know if this OSD memory usage outside of the
bluestore cache size can be bounded or reduced somehow. I don't
particularly care about performance, so it would be useful to be able to
tune it lower. This would help single-host and smaller Ceph use cases; I
think Ceph's properties make it a very interesting alternative to things
like btrfs and zfs, but dedicating several GB of RAM per disk/OSD is not
always viable. Right now it seems that besides the cache, OSDs will
creep up in memory usage up to some threshold, and I'm not sure what
determines what that baseline usage is or whether it can be controlled.