[ceph-users] Ceph S3 multisite replication issue
Rémi Buisson
2018-12-07 10:05:15 UTC
Hello,
 
 
 
I'm using Ceph 12.2.10 on Debian stretch.
 
I have two clusters in two different datacenters, connected by a link with ~7 ms latency.
 
 
 
I set up S3 replication between those DCs and it works fine except when I enable SSL.
 
 
 
My setup is the following:
 
- 2 radosgw on each site
 
- Nginx in front of each radosgw to handle SSL termination (I also use Nginx when the replication flow is not encrypted)
 
- 3 GSLB: storage.mydomain.local, storage-dc1.mydomain.local, storage-dc2.mydomain.local
 
 
 
Replication works with or without SSL, but with SSL enabled, after some time (1 hour on average) the CPU usage of the radosgw instances on the replicated site climbs to 100% within about a minute.
 
Looking at the logs, it seems to loop waiting for some operations to complete:
 
 
 
2018-12-06 10:25:36.743088 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bfa943800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:36.743108 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.743109 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.743119 7f48f30bb700 20 enqueued request req=0x563bf638c600
2018-12-06 10:25:36.743120 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:36.743121 7f48f30bb700 20 req: 0x563bf638c600
2018-12-06 10:25:36.743124 7f48f30bb700 20 run: stack=0x563c186c9590 is io blocked
2018-12-06 10:25:36.743173 7f48f92cd700 20 dequeued request req=0x563bf638c600
2018-12-06 10:25:36.743176 7f48f92cd700 20 RGWWQ: empty
2018-12-06 10:25:36.748138 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748154 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748155 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748156 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:36.748161 7f48f30bb700 20 cr:s=0x563c186c9590:op=0x563bfa943800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:36.748169 7f48f30bb700 20 run: stack=0x563c186c9590 is io blocked
2018-12-06 10:25:37.824409 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf72d3000:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.824425 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.824427 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.824440 7f48f30bb700 20 enqueued request req=0x563bf638c600
2018-12-06 10:25:37.824442 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:37.824442 7f48f30bb700 20 req: 0x563bf638c600
2018-12-06 10:25:37.824447 7f48f30bb700 20 run: stack=0x563c0f3a2690 is io blocked
2018-12-06 10:25:37.824528 7f48fead8700 20 dequeued request req=0x563bf638c600
2018-12-06 10:25:37.824531 7f48fead8700 20 RGWWQ: empty
2018-12-06 10:25:37.826461 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf78d9800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.826474 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.826476 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.826485 7f48f30bb700 20 enqueued request req=0x563bf28d6a00
2018-12-06 10:25:37.826487 7f48f30bb700 20 RGWWQ:
2018-12-06 10:25:37.826487 7f48f30bb700 20 req: 0x563bf28d6a00
2018-12-06 10:25:37.826492 7f48f30bb700 20 run: stack=0x563c0f3a4ee0 is io blocked
2018-12-06 10:25:37.826569 7f48ffada700 20 dequeued request req=0x563bf28d6a00
2018-12-06 10:25:37.826574 7f48ffada700 20 RGWWQ: empty
2018-12-06 10:25:37.827819 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827826 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827827 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827828 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf211e300:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.827837 7f48f30bb700 20 cr:s=0x563c0f3a2690:op=0x563bf72d3000:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.827844 7f48f30bb700 20 run: stack=0x563c0f3a2690 is io blocked
2018-12-06 10:25:37.829124 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829132 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829134 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829134 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf7633c00:20RGWSimpleRadosLockCR: operate()
2018-12-06 10:25:37.829141 7f48f30bb700 20 cr:s=0x563c0f3a4ee0:op=0x563bf78d9800:20RGWContinuousLeaseCR: operate()
2018-12-06 10:25:37.829147 7f48f30bb700 20 run: stack=0x563c0f3a4ee0 is io blocked
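
A rough way to quantify this loop is to count the operate() calls per second and per coroutine class. The following is only a sketch, assuming the debug-level-20 line format shown in the excerpt above; the function name count_operate_calls and the regular expression are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Matches coroutine lines such as:
#   2018-12-06 10:25:36.743088 7f48f30bb700 20 cr:s=0x...:op=0x...:20RGWContinuousLeaseCR: operate()
# The digits right before the class name (here "20") are skipped; they look
# like a length prefix of the type name, but that is an assumption.
CR_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\s+"
    r"(?P<thread>[0-9a-f]+)\s+\d+\s+"
    r"cr:s=(?P<stack>0x[0-9a-f]+):op=(?P<op>0x[0-9a-f]+):"
    r"\d*(?P<cls>[A-Za-z_]\w*): operate\(\)"
)


def count_operate_calls(lines):
    """Count operate() calls per (second, coroutine class).

    Lines that do not match the coroutine format (enqueued/dequeued
    requests, "io blocked" messages, ...) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = CR_LINE.match(line.strip())
        if match:
            counts[(match["ts"], match["cls"])] += 1
    return counts
```

Feeding it the lines of a radosgw log (for example count_operate_calls(open(path)), where path is wherever the radosgw log is written) and comparing the per-second counts before and after the CPU spike should show whether the number of RGWSimpleRadosLockCR / RGWContinuousLeaseCR iterations really grows.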
 
 
 
I have the same behavior on a new cluster or on a single cluster migrated to multisite.
 
I have tested multiple radosgw configurations (rgw_curl*), but the results were not conclusive.
 
Any thoughts?
 
 
 
Thanks in advance.
 
 
 
Rémi
