Discussion: [ceph-users] MDS in read-only mode
Dmitriy Lysenko
2016-08-08 08:26:42 UTC
Good day.

My CephFS switched to read-only mode.
This problem previously occurred on Hammer; I recreated the CephFS and upgraded to Jewel, which solved it, but the problem reappeared after some time.

ceph.log
2016-08-07 18:11:31.226960 mon.0 192.168.13.100:6789/0 148601 : cluster [INF] HEALTH_WARN; mds0: MDS in read-only mode

ceph-mds.log:
2016-08-07 18:10:58.699731 7f9fa2ba6700 1 mds.0.cache.dir(10000000afe) commit error -22 v 1
2016-08-07 18:10:58.699755 7f9fa2ba6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000000afe object, errno -22
2016-08-07 18:10:58.699763 7f9fa2ba6700 -1 mds.0.2271 unhandled write error (22) Invalid argument, force readonly...
2016-08-07 18:10:58.699773 7f9fa2ba6700 1 mds.0.cache force file system read-only
2016-08-07 18:10:58.699777 7f9fa2ba6700 0 log_channel(cluster) log [WRN] : force file system read-only

I found this object:
$ rados --pool metadata ls | grep 10000000afe
10000000afe.00000000

and successfully got it:
$ rados --pool metadata get 10000000afe.00000000 obj
$ echo $?
0

How do I switch the MDS out of read-only mode?
Are there any tools to check the CephFS file system for errors?

$ ceph -v
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

$ ceph fs ls
name: cephfs, metadata pool: metadata, data pools: [data ]

$ ceph mds stat
e2283: 1/1/1 up {0=drop-03=up:active}, 3 up:standby

$ ceph osd lspools
0 data,1 metadata,6 one,

$ ceph osd dump | grep 'replicated size'
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 45647 crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 45649 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 6 'one' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 53462 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0


Thank you for help.

--
Dmitry Lysenko
ISP Sovtest, Kursk, Russia
jabber: ***@jabber.sovtest.ru
John Spray
2016-08-08 10:49:14 UTC
Post by Dmitriy Lysenko
Good day.
My CephFS switched to read-only mode.
This problem previously occurred on Hammer; I recreated the CephFS and upgraded to Jewel, which solved it, but the problem reappeared after some time.
ceph.log
2016-08-07 18:11:31.226960 mon.0 192.168.13.100:6789/0 148601 : cluster [INF] HEALTH_WARN; mds0: MDS in read-only mode
2016-08-07 18:10:58.699731 7f9fa2ba6700 1 mds.0.cache.dir(10000000afe) commit error -22 v 1
2016-08-07 18:10:58.699755 7f9fa2ba6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000000afe object, errno -22
2016-08-07 18:10:58.699763 7f9fa2ba6700 -1 mds.0.2271 unhandled write error (22) Invalid argument, force readonly...
2016-08-07 18:10:58.699773 7f9fa2ba6700 1 mds.0.cache force file system read-only
2016-08-07 18:10:58.699777 7f9fa2ba6700 0 log_channel(cluster) log [WRN] : force file system read-only
The MDS is going read only because it received an error (22, aka
EINVAL) from an OSD when trying to write a metadata object. You need
to investigate why the error occurred. Are your OSDs using the same
Ceph version as your MDS? Look in the OSD logs for the time at which
the error happened to see if there is more detail about why.
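
Something like this can help locate the failing write (a rough sketch; it
assumes the default /var/log/ceph log locations and the ~18:10 timestamp
from your MDS log):

$ grep '2016-08-07 18:1' /var/log/ceph/ceph-osd.*.log | grep -iE 'error|einval|-22'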

The readonly flag will clear if you restart your MDS (but it will get
set again if it keeps encountering errors writing to OSDs).
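
A rough sketch of that (assuming systemd-managed daemons; drop-03 is the
active MDS from your "ceph mds stat" output):

$ sudo systemctl restart ceph-mds@drop-03
$ ceph health detail

One of your standbys should take over while it restarts.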

John
Wido den Hollander
2016-08-08 10:51:30 UTC
Post by John Spray
Post by Dmitriy Lysenko
Good day.
My CephFS switched to read-only mode.
This problem previously occurred on Hammer; I recreated the CephFS and upgraded to Jewel, which solved it, but the problem reappeared after some time.
ceph.log
2016-08-07 18:11:31.226960 mon.0 192.168.13.100:6789/0 148601 : cluster [INF] HEALTH_WARN; mds0: MDS in read-only mode
2016-08-07 18:10:58.699731 7f9fa2ba6700 1 mds.0.cache.dir(10000000afe) commit error -22 v 1
2016-08-07 18:10:58.699755 7f9fa2ba6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000000afe object, errno -22
2016-08-07 18:10:58.699763 7f9fa2ba6700 -1 mds.0.2271 unhandled write error (22) Invalid argument, force readonly...
2016-08-07 18:10:58.699773 7f9fa2ba6700 1 mds.0.cache force file system read-only
2016-08-07 18:10:58.699777 7f9fa2ba6700 0 log_channel(cluster) log [WRN] : force file system read-only
The MDS is going read only because it received an error (22, aka
EINVAL) from an OSD when trying to write a metadata object. You need
to investigate why the error occurred. Are your OSDs using the same
Ceph version as your MDS? Look in the OSD logs for the time at which
the error happened to see if there is more detail about why.
You might want to add this to the mds config:

debug_rados = 20

That should show you which RADOS operations it is performing and you can also figure out which one failed.
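
If you would rather not restart the MDS just to change logging, the same
setting can usually be injected at runtime (a sketch; it assumes the MDS
daemon is named after the host, drop-03):

$ ceph tell mds.drop-03 injectargs '--debug_rados 20'

or, on the MDS host itself:

$ sudo ceph daemon mds.drop-03 config set debug_rados 20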

Like John said, it might be an issue with a specific OSD.

Wido
Dmitriy Lysenko
2016-08-08 12:17:47 UTC
Post by Wido den Hollander
Post by John Spray
Post by Dmitriy Lysenko
Good day.
My CephFS switched to read-only mode.
This problem previously occurred on Hammer; I recreated the CephFS and upgraded to Jewel, which solved it, but the problem reappeared after some time.
ceph.log
2016-08-07 18:11:31.226960 mon.0 192.168.13.100:6789/0 148601 : cluster [INF] HEALTH_WARN; mds0: MDS in read-only mode
2016-08-07 18:10:58.699731 7f9fa2ba6700 1 mds.0.cache.dir(10000000afe) commit error -22 v 1
2016-08-07 18:10:58.699755 7f9fa2ba6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000000afe object, errno -22
2016-08-07 18:10:58.699763 7f9fa2ba6700 -1 mds.0.2271 unhandled write error (22) Invalid argument, force readonly...
2016-08-07 18:10:58.699773 7f9fa2ba6700 1 mds.0.cache force file system read-only
2016-08-07 18:10:58.699777 7f9fa2ba6700 0 log_channel(cluster) log [WRN] : force file system read-only
The MDS is going read only because it received an error (22, aka
EINVAL) from an OSD when trying to write a metadata object. You need
to investigate why the error occurred. Are your OSDs using the same
Ceph version as your MDS? Look in the OSD logs for the time at which
the error happened to see if there is more detail about why.
All OSDs are using the same version:

# ceph tell osd.* version
osd.0: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.1: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.2: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.3: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.4: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.5: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.6: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.7: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.8: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.9: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.10: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.11: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.12: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}
osd.14: {
"version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)"
}

I did not find any errors in the OSD logs from 18:00 to 19:00 (logs included in this message).
Post by Wido den Hollander
debug_rados = 20
That should show you which RADOS operations it is performing and you can also figure out which one failed.
Like John said, it might be an issue with a specific OSD.
Wido
I added debug_rados to the [mds] section in ceph.conf.
But I've already fixed the error by following the steps at http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/:

cephfs-journal-tool event recover_dentries summary
cephfs-table-tool all reset session
cephfs-journal-tool journal reset
cephfs-data-scan init
cephfs-data-scan scan_extents data
cephfs-data-scan scan_inodes data
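
As a follow-up sanity check (a sketch, using the same tool family as above),
the rewritten journal can be inspected for integrity before restarting the
MDS, and as John noted, the read-only flag itself only clears after an MDS
restart:

$ cephfs-journal-tool journal inspect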
--
Dmitry Lysenko
ISP Sovtest, Kursk, Russia
jabber: ***@jabber.sovtest.ru