Wido den Hollander
2018-11-08 08:52:48 UTC
Hi,
Recently I've seen a Ceph cluster experience a few outages due to memory
issues.
The machines:
- Intel Xeon E3 CPU
- 32GB Memory
- 8x 1.92TB SSD
- Ubuntu 16.04
- Ceph 12.2.8
Looking at one of the machines:
***@ceph22:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         22G        8.2G        9.8M        809M        8.5G
Swap:          1.9G          0B        1.9G
***@ceph22:~#
As you can see, it's using 22GB of the 32GB in the system.
[osd]
bluestore_cache_size_ssd = 1G
The BlueStore Cache size for SSD has been set to 1GB, so the OSDs
shouldn't use more than that.
When dumping the mem pools each OSD claims to be using between 1.8GB and
2.2GB of memory.
$ ceph daemon osd.X dump_mempools | jq '.total.bytes'
Summing up all the values I get to a total of 15.8GB and the system is
using 22GB.
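The comparison above can be scripted. This is a minimal sketch, assuming the dump_mempools JSON has the total.bytes field that the jq filter reads. The function names are illustrative only, and the units are treated as loosely as in the figures above.

```python
import json
import subprocess
from typing import Iterable


def mempool_total_bytes(dump_json: str) -> int:
    """Return total.bytes from one OSD's dump_mempools JSON output."""
    return int(json.loads(dump_json)["total"]["bytes"])


def dump_mempools(osd_id: int) -> str:
    """Run the admin-socket command shown above for one local OSD."""
    return subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"], text=True
    )


def unaccounted_gb(mempool_bytes: Iterable[int], system_used_gb: float) -> float:
    """Memory in use on the host that the mempools do not explain, in GB."""
    return system_used_gb - sum(mempool_bytes) / 1e9
```

With the numbers above, unaccounted_gb would return roughly 22 - 15.8 = 6.2GB that the mempools do not account for.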
Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
memory, which would be ~3GB for a single daemon.
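To get the resident size per daemon directly instead of estimating it from a percentage, something like the sketch below could be used. It assumes the input comes from 'ps -C ceph-osd -o pid=,rss=' (procps reports rss in KiB); the helper names are made up for illustration.

```python
def parse_ps_rss(ps_output: str) -> dict:
    """Map pid -> RSS in GiB from `ps -o pid=,rss=` output (rss is in KiB)."""
    result = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        pid, rss_kib = int(fields[0]), int(fields[1])
        result[pid] = rss_kib / (1024 * 1024)
    return result


def percent_mem_to_gib(percent: float, total_gib: float) -> float:
    """Convert a %MEM value from ps into GiB of a host with total_gib of RAM."""
    return percent / 100.0 * total_gib
```

For example, percent_mem_to_gib(10, 31) gives about 3.1, which matches the ~3GB per daemon estimated above.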
Sometimes these machines go OOM without a really good reason. After all
the OSDs have been restarted the machine runs fine, but about 10 days
later the same problem arises again.
I know that BlueStore uses more memory, but 32GB with 8 OSDs should be a
workable setup.
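As a rough check of that budget, here is a small sketch of the arithmetic. The function names are hypothetical and it ignores what the kernel and other processes need.

```python
def per_osd_budget_gb(total_ram_gb: float, num_osds: int) -> float:
    """Even share of host RAM per OSD, before any OS overhead."""
    return total_ram_gb / num_osds


def headroom_gb(total_ram_gb: float, num_osds: int, rss_per_osd_gb: float) -> float:
    """RAM left on the host if every OSD sits at the given RSS."""
    return total_ram_gb - num_osds * rss_per_osd_gb
```

With 32GB and 8 OSDs the even share is 4GB per OSD. At ~3GB RSS per daemon that leaves about 8GB for everything else, so modest further growth per OSD is enough to exhaust the machine.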
Any ideas?
Wido