[Gluster-users] Geo-replication stops every few days

Matt matt at mattlantis.com
Mon Mar 30 16:10:35 UTC 2015


Hi List,

I do geo-replication on a half a dozen volumes or so across several 
locations. It works fine except for one, our largest volume, a 2x2 
distributed-replicated volume with about 40TB of mixed media on it. All 
of these servers/volumes are running 3.4.6 on CentOS 6.

Every few days, geo-replication on this volume will stop, though the 
status command shows OK.
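
For reference, the status check I'm running is just the usual one
(<slave-url> here is a placeholder for our actual ssh:// slave URL,
which you can see in the log file names below):

$ gluster volume geo-replication media-vol <slave-url> status

It still reports OK even though nothing is being synced.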

I've checked all of the things like clock drift that the geo-rep 
troubleshooting guide recommends, and everything seems to be fine.
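
Concretely, the clock check amounted to roughly the following on each
master and slave node (the slave IP is the one from the logs):

$ ntpq -p
$ date; ssh root@192.168.78.91 date

Nothing looked out of line there.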

To avoid spamming the list with too many logs, here are the last few
lines from each log that I hope are relevant, from around the time the
logging stops.

Any ideas would be much appreciated; I've been running up against this
intermittently for months.

On the master:

$ cat 
ssh%3A%2F%2Froot%40192.168.78.91%3Agluster%3A%2F%2F127.0.0.1%3Amedia-vol.log
[2015-03-19 20:13:15.269819] I [master:669:crawl] _GMaster: completed 
60 crawls, 0 turns
[2015-03-19 20:14:16.218323] I [master:669:crawl] _GMaster: completed 
60 crawls, 0 turns
[2015-03-19 20:15:17.171961] I [master:669:crawl] _GMaster: completed 
60 crawls, 0 turns
[2015-03-19 20:16:18.112601] I [master:669:crawl] _GMaster: completed 
60 crawls, 0 turns
[2015-03-19 20:17:19.52232] I [master:669:crawl] _GMaster: completed 60 
crawls, 0 turns
[2015-03-19 20:18:19.991274] I [master:669:crawl] _GMaster: completed 
60 crawls, 0 turns
[2015-03-19 20:20:06.722600] I [master:669:crawl] _GMaster: completed 
23 crawls, 0 turns
[2015-03-19 20:42:40.970180] I [master:669:crawl] _GMaster: completed 1 
crawls, 0 turns
[2015-03-19 20:47:14.961935] I [master:669:crawl] _GMaster: completed 1 
crawls, 0 turns
[2015-03-19 20:48:22.333839] E [syncdutils:179:log_raise_exception] 
<top>: connection to peer is broken


$ cat 
ssh%3A%2F%2Froot%40192.168.78.91%3Agluster%3A%2F%2F127.0.0.1%3Amedia-vol.gluster.log
[2015-03-19 20:44:17.172597] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-2: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/9c0ea93e-46b2-4a27-b31a-ce26897bd299.jpg 
(0aadc99c-c3e1-455d-a5ac-4bcf04541482)
[2015-03-19 20:44:38.314659] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-2: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/28ed6b8f-ec29-4066-02be-d7ae9a5a7bb6.jpg 
(7e97920c-f165-44c3-9868-8dd13cc2b8d0)
[2015-03-19 20:44:38.314738] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-3: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/28ed6b8f-ec29-4066-02be-d7ae9a5a7bb6.jpg 
(7e97920c-f165-44c3-9868-8dd13cc2b8d0)
[2015-03-19 20:44:53.029449] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-2: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/5d150965-f687-43bb-9gae-8f2d04cb02de.jpg 
(40676e42-47e0-4c2b-a7b9-9e4101f2e32d)
[2015-03-19 20:44:53.029557] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-3: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/5d150965-f687-43bb-9gae-8f2d04cb02de.jpg 
(40676e42-47e0-4c2b-a7b9-9e4101f2e32d)
[2015-03-19 20:45:39.031436] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-2: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/38067d80-6a2b-498d-b319-ce6c77354151.jpg 
(10e53310-0c88-43a7-aa3f-2b48f0720cc7)
[2015-03-19 20:45:39.031552] W 
[client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-media-vol-client-3: 
remote operation failed: Stale file handle. Path: 
/data/v0.1/images/38067d80-6a2b-498d-b319-ce6c77354151.jpg 
(10e53310-0c88-43a7-aa3f-2b48f0720cc7)

On the slave:

$ cat 
8ed3c554-8d5d-4304-9cd1-cb9a17c0fd64:gluster%3A%2F%2F127.0.0.1%3Agtmmedia-storage.gluster.log
[2015-03-20 01:42:39.715954] I 
[dht-common.c:1000:dht_lookup_everywhere_done] 0-media-vol-dht: STATUS: 
hashed_subvol media-vol-replicate-1 cached_subvol null
[2015-03-20 01:42:39.716792] I 
[dht-common.c:1000:dht_lookup_everywhere_done] 0-media-vol-dht: STATUS: 
hashed_subvol media-vol-replicate-1 cached_subvol null
[2015-03-20 01:42:39.730075] I 
[dht-common.c:1000:dht_lookup_everywhere_done] 0-media-vol-dht: STATUS: 
hashed_subvol media-vol-replicate-1 cached_subvol null
[2015-03-20 01:42:39.730179] I [dht-rename.c:1159:dht_rename] 
0-media-vol-dht: renaming 
/data/v0.1/images-add/.D6C7BEBC-3D2B-413F-96EC-5AE7A44B36C4.jpg.Mmu2MZ 
(hash=media-vol-replicate-1/cache=media-vol-replicate-1) => 
/data/v0.1/images-add/D6C7BEBC-3D2B-413F-96EC-5AE7A44B36C4.jpg 
(hash=media-vol-replicate-1/cache=<nul>)
[2015-03-20 01:51:56.502512] I [fuse-bridge.c:4669:fuse_thread_proc] 
0-fuse: unmounting /tmp/gsyncd-aux-mount-6Hn48F
[2015-03-20 01:51:56.596291] W [glusterfsd.c:1002:cleanup_and_exit] 
(-->/lib64/libc.so.6(clone+0x6d) [0x3455ae88fd] 
(-->/lib64/libpthread.so.0() [0x3455e079d1] 
(-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x40533d]))) 0-: 
received signum (15), shutting down
[2015-03-20 01:51:56.596309] I [fuse-bridge.c:5301:fini] 0-fuse: 
Unmounting '/tmp/gsyncd-aux-mount-6Hn48F'.

$ cat 
8ed3c554-8d5d-4304-9cd1-cb9a17c0fd64:gluster%3A%2F%2F127.0.0.1%3Agtmmedia-storage.log
[2015-03-16 13:19:54.624100] I [gsyncd(slave):404:main_i] <top>: 
syncing: gluster://localhost:media-vol
[2015-03-16 13:19:55.752953] I [resource(slave):483:service_loop] 
GLUSTER: slave listening
[2015-03-16 14:02:40.654535] I [repce(slave):78:service_loop] 
RepceServer: terminating on reaching EOF.
[2015-03-16 14:02:40.661967] I [syncdutils(slave):148:finalize] <top>: 
exiting.
[2015-03-16 14:02:51.987721] I [gsyncd(slave):404:main_i] <top>: 
syncing: gluster://localhost:media-vol
[2015-03-16 14:02:53.141150] I [resource(slave):483:service_loop] 
GLUSTER: slave listening
[2015-03-17 17:31:25.696300] I [repce(slave):78:service_loop] 
RepceServer: terminating on reaching EOF.
[2015-03-17 17:31:25.703775] I [syncdutils(slave):148:finalize] <top>: 
exiting.
[2015-03-17 17:31:37.139935] I [gsyncd(slave):404:main_i] <top>: 
syncing: gluster://localhost:media-vol
[2015-03-17 17:31:38.228033] I [resource(slave):483:service_loop] 
GLUSTER: slave listening
[2015-03-19 20:51:55.965342] I [resource(slave):489:service_loop] 
GLUSTER: connection inactive for 120 seconds, stopping
[2015-03-19 20:51:55.979207] I [syncdutils(slave):148:finalize] <top>: 
exiting.
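
In case it matters, when it gets into this state I bring it back by
restarting the session, roughly like this (slave URL abbreviated as
above):

$ gluster volume geo-replication media-vol <slave-url> stop
$ gluster volume geo-replication media-vol <slave-url> start

After that it syncs again until the next stall a few days later.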

Thanks,

-Matt