[Gluster-users] Debugging georeplication failures

Audrius Butkevicius audrius.butkevicius at gmail.com
Tue Nov 24 21:38:32 UTC 2015


The rsync version is 3.1.0, but the bug mentioned only applies to large
files, whereas in my case the files are all smaller than 1 MB.

I've started digging through the logs and found a bunch of these on the
slave:

[2015-11-20 11:40:46.730805] W [fuse-bridge.c:1978:fuse_create_cbk]
0-glusterfs-fuse: 1882288: /.gfid/31d66429-c700-4a10-bb32-35e1b36a479f =>
-1 (Operation not permitted)
[2015-11-20 12:39:59.269844] W [fuse-bridge.c:1978:fuse_create_cbk]
0-glusterfs-fuse: 1918306: /.gfid/6802a0c6-1f62-4213-a70d-7b46d9ff8f3a =>
-1 (Operation not permitted)

So something went wrong for about an hour on 2015-11-20, four days ago.
Given the volume is on EBS, maybe there was a glitch there.

I can also find the corresponding failures on the master:

[2015-11-20 11:40:14.93090] W [master(/data/media):803:log_failures]
_GMaster: ENTRY FAILED: ({'uid': 33, 'gfid':
'31d66429-c700-4a10-bb32-35e1b36a479f', 'gid': 33, 'mode': 33206, 'entry':
'.gfid/b1dc6c6d-dac7-4da9-9577-4614942a72a0/official-nightmare-before-christmas-vampire-teddy-girls-dress-body-web.jpg',
'op': 'CREATE'}, 17, 'df0e67f5-f2ce-45c3-b4f1-224aa3059ec7')
[2015-11-20 11:40:14.265054] W [master(/data/media):803:log_failures]
_GMaster: META FAILED: ({'go':
'.gfid/31d66429-c700-4a10-bb32-35e1b36a479f', 'stat': {'atime':
1448019600.232466, 'gid': 33, 'mtime': 1448019600.316466, 'mode': 33279,
'uid': 33}, 'op': 'META'}, 2)
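
The numeric fields in these two entries are easier to read once decoded. Here
is a minimal Python sketch, assuming the integer that follows the dict is a
standard Linux errno value and that 'mode' is a raw st_mode value. The log
itself does not confirm either assumption, and describe_failure is a made-up
helper name.

```python
import errno
import os
import stat


def describe_failure(mode, err):
    """Return (permission string, errno name, errno message).

    Assumes `mode` is a raw st_mode and `err` is an errno value.
    """
    name = errno.errorcode.get(err, "UNKNOWN")
    return stat.filemode(mode), name, os.strerror(err)


if __name__ == "__main__":
    # Values copied from the ENTRY FAILED and META FAILED lines above.
    for label, mode, err in (("ENTRY", 33206, 17), ("META", 33279, 2)):
        print(label, *describe_failure(mode, err))
```

Under those assumptions, the CREATE failed with EEXIST (17) for a regular file
with mode 0666, and the META operation on the same GFID failed with ENOENT (2).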

If I grep for SKIPPED GFID I get the following:

[2015-11-20 11:40:40.704817] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
192632af-28c5-4e03-a62d-458fe7f3b5f9,7ea8d7a8-524b-4dd0-b97a-dc7d3481f341,204f6112-0e8d-4f6d-855b-bf10f9c63b62,7e626e8f-edad-4f39-a6c6-547a1da34aa1,1f0d0208-1962-4eb1-91d4-cf7ed297d8e3,95d389c4-3258-4ca0-8fc4-26b8427b1eaf,425cedc6-6343-4326-8540-996d2d56dc9c,5955928b-2b8f-4cc9-a336-3eac4382789b,8932efcd-ba90-46ec-84c8-5e9e51cc84e9,2530275d-5f03-4143-9abf-d07cc79bf80a,73574466-86f3-4ab2-b5da-c31ac28c27c1,776e5e8f-5c6a-46b1-ad54-733e157d2097,008a69f3-217c-4dbc-a469-5a5bc8ecd589,dca8d8d9-03cf-4793-92e4-bfcfddd262f6,c85b7a29-73af-4f44-a07e-a44082d7a93a,6c1f56d6-4ea6-4910-9677-ea33edd35d28,0ea56588-87fa-4355-9403-e311525454fc,c8ce76c9-e21d-46ce-a2b5-14dfd0070f64,db9e6484-0e5e-4f6e-815b-3c2b273deee5,35d10752-43b5-4398-be5f-17cb9de73a6b,396e5faf-74a1-4849-97e3-009dbfb22836,d148e7d5-c2f3-4d06-8cd6-8588e6aac196,404d20c5-1c6c-4aad-98be-2c23930173b3,f1fae11c-db8e-4cd5-8e47-a3870316f89c,d8daa413-e57f-44fb-b907-b1a497f2dcfa,5f6ee8c2-84fb-432e-95cd-e428ab256e83,6bf54dcd-c3b4-4187-a390-eca841e46570,335c07ca-d339-4d3a-aa88-3b5753d24fbf,8fdbac00-6628-4f22-8fb4-b7a6524cae49,31d66429-c700-4a10-bb32-35e1b36a479f
[2015-11-20 11:41:35.907850] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
03069c7f-8eaa-45b0-92ed-50cb648cd912,788f5ed1-923e-4b86-9696-2a6de07ebb2e,43d12b40-b6e2-43c4-8883-85e89dc81321
[2015-11-20 12:11:55.492068] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
eb02369f-7ca8-480a-b00c-768964410ed8,17045ac9-27dd-4bf9-9f90-d7b146070dd5,265e3d9c-1657-45cb-bbf6-db439eb18ccf,553c420f-b3cc-47f2-8d5f-cfc2ffdd1a92
[2015-11-20 12:12:53.372432] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
66c5878e-8c00-4f7d-a3ad-4adec84a5e22,f4dc086d-9c2b-449c-9e31-bbae9ebcdea7,f99317b2-72e8-49e3-b676-647abad508b1
[2015-11-20 12:37:55.773813] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
4af54f1c-e8e1-4915-9328-a458d5d35d5d,acbe1f12-87e8-4192-b864-d90030269bba,7d27a795-da63-4742-9e91-abd8fa543612,8d4e642d-fd40-44d6-8419-8d3459df7ce3
[2015-11-20 12:39:28.852575] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
d90dc121-02e7-4a79-bc03-1bd8fddd9f48,54bb563f-ab44-4e91-a46b-764a122ce7fa,088141de-7545-40f9-b776-751738a89740,2dab3faf-4a6c-407a-88cd-cddef6f55299,d887806f-23b4-4389-a4dc-f9027702a2df,fc5a9bc8-ea62-4677-baed-16510541373a,33136ad2-c5b4-448c-991d-1e72fefef021,cf3e2675-e41b-4782-9478-91773eb0a4aa,6412d878-e0f1-4700-84df-05f4af35962f,ec3cf6e1-7f27-4650-b978-8a5a7f620389,d3651bb9-cd2d-4c5f-93e6-fe4fb1cdf5db,ecb0415e-1524-40f4-870e-1fd0f8371b1d,a118aaae-bd3e-4b19-a0e0-891aa9edb09a,7642d3f3-f1e5-4aca-bcfe-bdb3c44779a9,2e29f3f8-c460-48eb-9db5-b281b67cc2bf,e61db54b-3979-488a-8789-a5d0615c5a97,4212d840-9c22-4d9e-b61b-5e35271dfe80,dad1c60b-9da6-4e57-b014-daa1aca73ce3,93699a3d-40b8-4bbd-b78f-aabf965df57f,4fad7468-91f2-4deb-aaf7-6401068c9e6d,c9738295-46cc-4fe7-b359-dc94f5815ce9,91853c5c-4877-4c9e-9481-c86368942f78,59deed8e-d3d0-4ab7-854e-53a8dd455de0,20b86c13-7df1-4d13-bac1-7d628a00d6ce,b7b86a2d-7963-41a4-a423-14e25d1e78c4,3c17d7fe-bb7f-489c-a525-5c8b7bb93c3e,e230d207-7c68-4983-a958-f2dcfc1ce694,fa8bf3c0-abae-446c-83c5-45ef8bcaa4b8,14089102-8106-45d9-a3f1-d1446b568f4e,6802a0c6-1f62-4213-a70d-7b46d9ff8f3a,0a253bbc-ef98-4da0-951f-e17c5a7f5858,ef054b76-986b-4a89-b8e6-b4988221aaa2,48c0a153-708c-44ee-b186-cf255936a02b,fa2646a6-807c-4e9d-8f2b-a9cdf2674e0c,1ed4a563-4f6a-4b5a-9866-89025fe7afd5,0f293cf7-bc32-4f8a-87d5-388a4bffb4af,f4126726-667b-451d-8214-a18bb3f468cd,e23dc8b3-da1c-4d18-aec9-22e0aa174d81,40b9f10d-7304-4c0b-8498-bef23b305d03,15c25d1e-2a62-495e-887f-14d0cb0527b1,67371804-9084-4801-b664-44e88bea8ac3,4750fa3f-d1a4-4472-b10d-3f75d0b451dc
[2015-11-23 09:18:10.43391] W [master(/data/media):1014:process] _GMaster:
SKIPPED GFID =
228843f3-62f0-4687-b5eb-6d1e21257ad0,b0078359-fbf0-4709-8f40-8383a11d7875,60cff4d5-8b5d-4f7f-8bc1-27081a011458,bedb6ac4-208d-47e1-812c-5547c84ab841,da6810d9-4883-45e1-b73e-55a7ff17b5e7,e03b5c03-b25c-49ba-86f0-8a709a9c2658,053673a0-c1cc-4057-83fa-f97740cb5d4f,dbd6ea84-8f24-4a47-ac41-22c3fd788ecf,43caa3e7-ca04-47ab-b950-105606b313a4,62d8b1d0-fc89-4fb1-a41a-957dcb34d325,4e8fe1fa-60cd-47fa-bad6-f617c312f53b,6c3d6cf3-62ae-4ab8-9dc3-7815552401fe,f79be814-7e78-4985-bcdd-688da23d1808,c4186455-0f06-4b5d-89be-3c5ccbdeb6f0,f9c4ccdb-2337-479d-845d-ee4d85b69ece,bcd14726-1bab-4d97-8915-ec8bbe8faf8c,cca82341-a430-4a59-a900-1af66dcf7bb8,b7043a8e-4286-4831-91ec-c146e40bc6be,995ffeb6-a906-4078-88c6-404a2b38aad4,227f9987-5057-4133-848a-2b22aca5dde1,90b35242-32db-4570-8070-cf9dd49322a5,c6863c8f-1914-4a2d-814b-6e5853134faf,e2d19b1a-fc07-441c-b110-ca816b46fc40,9a3d0c0b-7d84-416f-9f3e-21b32a11ba1d,d8163f6b-8c40-418c-9c06-b3743af24e4e,522d7247-a75b-4af9-acb2-52a99eeced89,4b56ea9d-413a-4e24-b44e-433f7603ad6d
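
To turn these records into a plain list of GFIDs, something like the following
Python sketch may help. It assumes the "SKIPPED GFID =" format shown above,
followed by a comma-separated run of UUIDs. The log file path is passed on the
command line because the exact file name is not given here, and skipped_gfids
is a hypothetical helper name.

```python
import re
import sys

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# The marker, then a run of UUIDs separated by commas. Whitespace is also
# tolerated so that mail-wrapped copies of the log still parse.
SKIPPED = re.compile(r"SKIPPED GFID\s*=\s*((?:%s[,\s]*)+)" % UUID)


def skipped_gfids(text):
    """Return the GFIDs listed in SKIPPED GFID records, in first-seen order."""
    gfids = []
    for match in SKIPPED.finditer(text):
        gfids.extend(re.findall(UUID, match.group(1)))
    return list(dict.fromkeys(gfids))  # de-duplicate, keep order


if __name__ == "__main__":
    # Usage: python skipped.py <master geo-replication log file> [...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for gfid in skipped_gfids(handle.read()):
                print(gfid)
```

GFIDs that appear only in other kinds of log lines (for example the
split-brain errors) are deliberately ignored.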

There are also the following lines on the master, which might have some
impact:

E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done]
0-media-replicate-0: Failing READ on gfid
abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output
error]

E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done]
0-media-replicate-0: Failing GETXATTR on gfid
abdc7d5e-9187-4916-ae83-a8b615e32a17: split-brain observed. [Input/output
error]

E [mem-pool.c:417:mem_get0]
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x809a2) [0x7f79e436b9a2]
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f)
[0x7f79e430cb1f]
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81)
[0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]

E [mem-pool.c:417:mem_get0]
(-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(recursive_rmdir+0x192)
[0x7f79e4329b32]
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg+0x79f)
[0x7f79e430cb1f]
-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(mem_get0+0x81)
[0x7f79e433e4a1] ) 0-mem-pool: invalid argument [Invalid argument]

E [resource(/data/media):222:errlog] Popen: command "ssh
-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S
/tmp/gsyncd-aux-ssh-dpY5cI/8216bb7da58a00926f369bb7ac8c7e03.sock
root at us-west-gluster.server.com /usr/lib/x86_64-linux-gnu/glusterfs/gsyncd
--session-owner 6922055e-49a1-4afd-a3a0-a47960d6ba54 -N --listen --timeout
120 gluster://localhost:media" returned with 143, saying:
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.772896] I [cli.c:721:main] 0-cli: Started running
/usr/sbin/gluster with version 3.7.5
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.772955] I [cli.c:608:cli_rpc_init] 0-cli: Connecting to remote
glusterd at localhost
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.871930] I [MSGID: 101190]
[event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.872018] I [socket.c:2355:socket_event_handler] 0-transport:
disconnecting now
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.872898] I [cli-rpc-ops.c:6348:gf_cli_getwd_cbk] 0-cli: Received
resp to getwd
E [resource(/data/media):226:logerr] Popen: ssh> [2015-11-18
21:57:19.872963] I [input.c:36:cli_batch] 0-: Exiting with: 0

Status detail shows the following:

root at eu-gluster-1:/var/log/glusterfs/geo-replication/media# gluster volume
geo-replication media root at us-west-gluster.websitewebsitewebs.com::media
status detail

MASTER NODE:                 eu-gluster-1.websitewebsitewebs.com
MASTER VOL:                  media
MASTER BRICK:                /data/media
SLAVE USER:                  root
SLAVE:                       us-west-gluster.websitewebsitewebs.com::media
SLAVE NODE:                  us-west-gluster.websitewebsitewebs.com
STATUS:                      Active
CRAWL STATUS:                Changelog Crawl
LAST_SYNCED:                 2015-11-24 20:59:25
ENTRY:                       0
DATA:                        0
META:                        0
FAILURES:                    633
CHECKPOINT TIME:             N/A
CHECKPOINT COMPLETED:        N/A
CHECKPOINT COMPLETION TIME:  N/A

MASTER NODE:                 eu-gluster-2.websitewebsitewebs.com
MASTER VOL:                  media
MASTER BRICK:                /data/media
SLAVE USER:                  root
SLAVE:                       us-west-gluster.websitewebsitewebs.com::media
SLAVE NODE:                  us-west-gluster.websitewebsitewebs.com
STATUS:                      Passive
CRAWL STATUS:                N/A
LAST_SYNCED:                 N/A
ENTRY:                       N/A
DATA:                        N/A
META:                        N/A
FAILURES:                    N/A
CHECKPOINT TIME:             N/A
CHECKPOINT COMPLETED:        N/A
CHECKPOINT COMPLETION TIME:  N/A




What is the right way to retry the failed items?
Can I get a list of them somehow, so that I could touch them in the hope of
fixing this?
And why are the failed items not retried automatically?
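
For the "get a list of them" part, here is a rough Python sketch that maps a
GFID back to file names on a brick. It assumes the usual brick backend layout,
where a regular file has a hard link at
<brick>/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid>. It must run
on the brick host, and it handles regular files only. Both function names are
hypothetical, and whether touching the resulting files makes geo-replication
retry them is not established here.

```python
import os


def gfid_backend_path(brick, gfid):
    """Path of the GFID entry under the brick's .glusterfs directory."""
    return os.path.join(brick, ".glusterfs", gfid[:2], gfid[2:4], gfid)


def paths_for_gfid(brick, gfid):
    """Return brick-relative paths that are hard links of the GFID entry.

    Raises FileNotFoundError if the brick has no entry for this GFID.
    """
    target = os.stat(gfid_backend_path(brick, gfid))
    found = []
    for root, dirs, files in os.walk(brick):
        # Do not descend into the top-level .glusterfs bookkeeping directory.
        dirs[:] = [d for d in dirs
                   if not (root == brick and d == ".glusterfs")]
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                found.append(os.path.relpath(path, brick))
    return found


if __name__ == "__main__":
    # Brick path and GFID taken from the status output and logs above.
    print(paths_for_gfid("/data/media",
                         "31d66429-c700-4a10-bb32-35e1b36a479f"))
```

Walking the whole brick once per GFID is slow with a few hundred thousand
files, so for a long list it would be better to build an inode-to-path map in
a single pass.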


On Tue, Nov 24, 2015 at 6:11 AM, Venky Shankar <vshankar at redhat.com> wrote:

> On Tue, Nov 24, 2015 at 1:23 AM, Audrius Butkevicius
> <audrius.butkevicius at gmail.com> wrote:
> > Hi,
> >
> > I've got a geo-replicated gluster volume, with a few hundred thousand
> > images, which get generated on demand.
> >
> > I started getting replication failures in the status detail view, but it's
> > not obvious to me where to find the actual errors or how to actually fix
> > them.
>
> Chris mentioned a bug in rsync here[1] (thanks!). Could that be
> the issue here?
>
> Would you mind checking which rsync version is in use?
>
> [1]:
> http://www.gluster.org/pipermail/gluster-users/2015-November/024423.html
>
> >
> > The docs seem to be secretive about this as well. It seems if I tear the
> > geo-replication down, and do a force create from scratch, it goes back in
> > sync again, but as the files get generated, it starts getting failures again
> > at some point.
> >
> > Can someone provide me with information on how to check which files are
> > causing failures, and what are the actual failures? Or point me to the
> > relevant part in the docs?
> >
> > Version 3.7.5-ubuntu1~trusty1
> >
> > Related SO question:
> >
> > http://stackoverflow.com/questions/33839056/gluster-geo-replication-debugging-failures
> >
> > Thanks,
> >
> > Audrius.