[Gluster-users] Issues with Geo-replication (GlusterFS 6.3 on Ubuntu 18.04)
Alexander Iliev
ailiev+gluster at mamul.org
Thu Oct 10 22:31:58 UTC 2019
Hi all,
I ended up reinstalling the nodes with CentOS 7.5 and GlusterFS 6.5
(installed from the SIG).
Now when I try to create a replication session I get the following:
> # gluster volume geo-replication store1 <slave-host>::store2 create push-pem
> Unable to mount and fetch slave volume details. Please check the log: /var/log/glusterfs/geo-replication/gverify-slavemnt.log
> geo-replication command failed
You can find the contents of gverify-slavemnt.log below, but the initial
error seems to be:
> [2019-10-10 22:07:51.578519] E [fuse-bridge.c:5211:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
I only found one related bug report
(https://bugzilla.redhat.com/show_bug.cgi?id=1659824), which doesn't seem
to help: it describes a failure to mount a volume on a regular GlusterFS
client, whereas in my case I need geo-replication, which implies that the
client (the geo-replication master) is on a different network.
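For what it's worth, I believe the mount that gverify.sh attempts can be
reproduced manually from the master node with something like the following
(the mount point path is just an example; the other arguments are taken
from the log below):
    # mkdir -p /mnt/gverify-test
    # glusterfs --volfile-server <slave-host> --volfile-id store2 \
          -l /tmp/gverify-test.log /mnt/gverify-test
I would expect this to fail the same way, since (as far as I understand)
the FUSE client has to reach not only glusterd on <slave-host> but also
every brick process of store2.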
Any help will be appreciated.
Thanks!
gverify-slavemnt.log:
> [2019-10-10 22:07:40.571256] I [MSGID: 100030]
[glusterfsd.c:2847:main] 0-glusterfs: Started running glusterfs version
6.5 (args: glusterfs --xlator-option=*dht.lookup-unhashed=off
--volfile-server <slave-host> --volfile-id store2 -l
/var/log/glusterfs/geo-replication/gverify-slavemnt.log
/tmp/gverify.sh.5nFlRh)
> [2019-10-10 22:07:40.575438] I [glusterfsd.c:2556:daemonize]
0-glusterfs: Pid of current running process is 6021
> [2019-10-10 22:07:40.584282] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 0
> [2019-10-10 22:07:40.584299] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
> [2019-10-10 22:07:40.928094] I [MSGID: 114020] [client.c:2393:notify]
0-store2-client-0: parent translators are ready, attempting connect on
transport
> [2019-10-10 22:07:40.931121] I [MSGID: 114020] [client.c:2393:notify]
0-store2-client-1: parent translators are ready, attempting connect on
transport
> [2019-10-10 22:07:40.933976] I [MSGID: 114020] [client.c:2393:notify]
0-store2-client-2: parent translators are ready, attempting connect on
transport
> Final graph:
>
+------------------------------------------------------------------------------+
> 1: volume store2-client-0
> 2: type protocol/client
> 3: option ping-timeout 42
> 4: option remote-host 172.31.36.11
> 5: option remote-subvolume /data/gfs/store1/1/brick-store2
> 6: option transport-type socket
> 7: option transport.address-family inet
> 8: option transport.socket.ssl-enabled off
> 9: option transport.tcp-user-timeout 0
> 10: option transport.socket.keepalive-time 20
> 11: option transport.socket.keepalive-interval 2
> 12: option transport.socket.keepalive-count 9
> 13: option send-gids true
> 14: end-volume
> 15:
> 16: volume store2-client-1
> 17: type protocol/client
> 18: option ping-timeout 42
> 19: option remote-host 172.31.36.12
> 20: option remote-subvolume /data/gfs/store1/1/brick-store2
> 21: option transport-type socket
> 22: option transport.address-family inet
> 23: option transport.socket.ssl-enabled off
> 24: option transport.tcp-user-timeout 0
> 25: option transport.socket.keepalive-time 20
> 26: option transport.socket.keepalive-interval 2
> 27: option transport.socket.keepalive-count 9
> 28: option send-gids true
> 29: end-volume
> 30:
> 31: volume store2-client-2
> 32: type protocol/client
> 33: option ping-timeout 42
> 34: option remote-host 172.31.36.13
> 35: option remote-subvolume /data/gfs/store1/1/brick-store2
> 36: option transport-type socket
> 37: option transport.address-family inet
> 38: option transport.socket.ssl-enabled off
> 39: option transport.tcp-user-timeout 0
> 40: option transport.socket.keepalive-time 20
> 41: option transport.socket.keepalive-interval 2
> 42: option transport.socket.keepalive-count 9
> 43: option send-gids true
> 44: end-volume
> 45:
> 46: volume store2-replicate-0
> 47: type cluster/replicate
> 48: option afr-pending-xattr
store2-client-0,store2-client-1,store2-client-2
> 49: option use-compound-fops off
> 50: subvolumes store2-client-0 store2-client-1 store2-client-2
> 51: end-volume
> 52:
> 53: volume store2-dht
> 54: type cluster/distribute
> 55: option lookup-unhashed off
> 56: option lock-migration off
> 57: option force-migration off
> 58: subvolumes store2-replicate-0
> 59: end-volume
> 60:
> 61: volume store2-write-behind
> 62: type performance/write-behind
> 63: subvolumes store2-dht
> 64: end-volume
> 65:
> 66: volume store2-read-ahead
> 67: type performance/read-ahead
> 68: subvolumes store2-write-behind
> 69: end-volume
> 70:
> 71: volume store2-readdir-ahead
> 72: type performance/readdir-ahead
> 73: option parallel-readdir off
> 74: option rda-request-size 131072
> 75: option rda-cache-limit 10MB
> 76: subvolumes store2-read-ahead
> 77: end-volume
> 78:
> 79: volume store2-io-cache
> 80: type performance/io-cache
> 81: subvolumes store2-readdir-ahead
> 82: end-volume
> 83:
> 84: volume store2-open-behind
> 85: type performance/open-behind
> 86: subvolumes store2-io-cache
> 87: end-volume
> 88:
> 89: volume store2-quick-read
> 90: type performance/quick-read
> 91: subvolumes store2-open-behind
> 92: end-volume
> 93:
> 94: volume store2-md-cache
> 95: type performance/md-cache
> 96: subvolumes store2-quick-read
> 97: end-volume
> 98:
> 99: volume store2
> 100: type debug/io-stats
> 101: option log-level INFO
> 102: option latency-measurement off
> 103: option count-fop-hits off
> 104: subvolumes store2-md-cache
> 105: end-volume
> 106:
> 107: volume meta-autoload
> 108: type meta
> 109: subvolumes store2
> 110: end-volume
> 111:
>
+------------------------------------------------------------------------------+
> [2019-10-10 22:07:51.578287] I [fuse-bridge.c:5142:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24
kernel 7.22
> [2019-10-10 22:07:51.578356] I [fuse-bridge.c:5753:fuse_graph_sync]
0-fuse: switched to graph 0
> [2019-10-10 22:07:51.578467] I [MSGID: 108006]
[afr-common.c:5666:afr_local_init] 0-store2-replicate-0: no subvolumes up
> [2019-10-10 22:07:51.578519] E [fuse-bridge.c:5211:fuse_first_lookup]
0-fuse: first lookup on root failed (Transport endpoint is not connected)
> [2019-10-10 22:07:51.578709] W [fuse-bridge.c:1266:fuse_attr_cbk]
0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
> [2019-10-10 22:07:51.578687] I [MSGID: 108006]
[afr-common.c:5666:afr_local_init] 0-store2-replicate-0: no subvolumes up
> [2019-10-10 22:09:48.222459] E [MSGID: 108006]
[afr-common.c:5318:__afr_handle_child_down_event] 0-store2-replicate-0:
All subvolumes are down. Going offline until at least one of them comes
back up.
> The message "E [MSGID: 108006]
[afr-common.c:5318:__afr_handle_child_down_event] 0-store2-replicate-0:
All subvolumes are down. Going offline until at least one of them comes
back up." repeated 2 times between [2019-10-10 22:09:48.222459] and
[2019-10-10 22:09:48.222891]
>
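Looking at the volume graph above, the client does fetch the volfile from
<slave-host>, but the bricks are published with their private addresses
(remote-host 172.31.36.11/12/13), so my suspicion is that the FUSE client
on the master simply cannot connect to the brick processes, hence the "no
subvolumes up" messages. A rough check from a master node would be
something like this (a sketch; brick ports are typically 49152 and up, and
the exact ports can be read from "gluster volume status store2" on the
slave):
    # nc -zv 172.31.36.11 24007
    # nc -zv 172.31.36.11 49152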
alexander iliev
On 9/8/19 4:50 PM, Alexander Iliev wrote:
> Hi all,
>
> Sunny, thank you for the update.
>
> I have applied the patch locally on my slave system and now the
> mountbroker setup is successful.
>
> I am facing another issue though - when I try to create a replication
> session between the two sites I am getting:
>
> # gluster volume geo-replication store1
> glustergeorep@<slave-host>::store1 create push-pem
> Error : Request timed out
> geo-replication command failed
>
> It is still unclear to me if my setup is expected to work at all.
>
> Reading the geo-replication documentation at [1] I see this paragraph:
>
> > A password-less SSH connection is also required for gsyncd between
> every node in the master to every node in the slave. The gluster
> system:: execute gsec_create command creates secret-pem files on all the
> nodes in the master, and is used to implement the password-less SSH
> connection. The push-pem option in the geo-replication create command
> pushes these keys to all the nodes in the slave.
>
> It is not clear to me whether connectivity from each master node to each
> slave node is a requirement in terms of networking. In my setup the
> slave nodes form the Gluster pool over a private network which is not
> reachable from the master site.
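>
> To make the question concrete, from a master node the only path that can
> work today is SSH, e.g. something like
>
>     ssh glustergeorep@<slave-host>
>
> while a direct connection to the slave's Gluster ports, e.g.
>
>     nc -zv 172.31.36.11 24007
>
> would go to the slave's private network and cannot succeed.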
>
> Any ideas how to proceed from here will be greatly appreciated.
>
> Thanks!
>
> Links:
> [1]
> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-preparing_to_deploy_geo-replication
>
>
> Best regards,
> --
> alexander iliev
>
> On 9/3/19 2:50 PM, Sunny Kumar wrote:
>> Thank you for the explanation Kaleb.
>>
>> Alexander,
>>
>> This fix will be available with the next release for all supported versions.
>>
>> /sunny
>>
>> On Mon, Sep 2, 2019 at 6:47 PM Kaleb Keithley <kkeithle at redhat.com>
>> wrote:
>>>
>>> Fixes on master (before or after the release-7 branch was taken)
>>> almost certainly warrant a backport IMO to at least release-6, and
>>> probably release-5 as well.
>>>
>>> We used to have a "tracker" BZ for each minor release (e.g. 6.6) to
>>> keep track of backports by cloning the original BZ and changing the
>>> Version, and adding that BZ to the tracker. I'm not sure what
>>> happened to that practice. The last ones I can find are for 6.3 and
>>> 5.7; https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-6.3 and
>>> https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-5.7
>>>
>>> It isn't enough to just backport recent fixes on master to release-7.
>>> We are supposedly continuing to maintain release-6 and release-5
>>> after release-7 GAs. If that has changed, I haven't seen an
>>> announcement to that effect. I don't know why our developers don't
>>> automatically backport to all the actively maintained releases.
>>>
>>> Even if there isn't a tracker BZ, you can always create a backport BZ
>>> by cloning the original BZ and changing the release to 6. That'd be a
>>> good place to start.
>>>
>>> On Sun, Sep 1, 2019 at 8:45 AM Alexander Iliev
>>> <ailiev+gluster at mamul.org> wrote:
>>>>
>>>> Hi Strahil,
>>>>
>>>> Yes, this might be right, but I would still expect fixes like this to
>>>> be released for all supported major versions (which should include 6).
>>>> At least that's how I understand
>>>> https://www.gluster.org/release-schedule/.
>>>>
>>>> Anyway, let's wait for Sunny to clarify.
>>>>
>>>> Best regards,
>>>> alexander iliev
>>>>
>>>> On 9/1/19 2:07 PM, Strahil Nikolov wrote:
>>>>> Hi Alex,
>>>>>
>>>>> I'm not very deep into bugzilla stuff, but for me NEXTRELEASE means
>>>>> v7.
>>>>>
>>>>> Sunny,
>>>>> Am I understanding it correctly?
>>>>>
>>>>> Best Regards,
>>>>> Strahil Nikolov
>>>>>
>>>>> On Sunday, 1 September 2019 at 14:27:32 GMT+3, Alexander Iliev
>>>>> <ailiev+gluster at mamul.org> wrote:
>>>>>
>>>>>
>>>>> Hi Sunny,
>>>>>
>>>>> Thank you for the quick response.
>>>>>
>>>>> It's not clear to me, however, whether the fix has already been
>>>>> released or not.
>>>>>
>>>>> The bug status is CLOSED NEXTRELEASE and according to [1] the
>>>>> NEXTRELEASE resolution means that the fix will be included in the next
>>>>> supported release. The bug is logged against the mainline version
>>>>> though, so I'm not sure what this means exactly.
>>>>>
>>>>> From the 6.4[2] and 6.5[3] release notes it seems it hasn't been
>>>>> released yet.
>>>>>
>>>>> Ideally I would prefer not to patch my systems locally, so if you
>>>>> have an ETA on when this will be released officially I would really
>>>>> appreciate it.
>>>>>
>>>>> Links:
>>>>> [1] https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_status
>>>>> [2] https://docs.gluster.org/en/latest/release-notes/6.4/
>>>>> [3] https://docs.gluster.org/en/latest/release-notes/6.5/
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>>
>>>>> alexander iliev
>>>>>
>>>>> On 8/30/19 9:22 AM, Sunny Kumar wrote:
>>>>> > Hi Alexander,
>>>>> >
>>>>> > Thanks for pointing that out!
>>>>> >
>>>>> > But this issue is fixed now; you can see the BZ and patch links
>>>>> > below.
>>>>> >
>>>>> > BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1709248
>>>>> >
>>>>> > Patch - https://review.gluster.org/#/c/glusterfs/+/22716/
>>>>> >
>>>>> > Hope this helps.
>>>>> >
>>>>> > /sunny
>>>>> >
>>>>> > On Fri, Aug 30, 2019 at 2:30 AM Alexander Iliev
>>>>> > <ailiev+gluster at mamul.org> wrote:
>>>>> >>
>>>>> >> Hello dear GlusterFS users list,
>>>>> >>
>>>>> >> I have been trying to set up geo-replication between two clusters
>>>>> >> for some time now. The desired state is Cluster #1 being replicated
>>>>> >> to Cluster #2.
>>>>> >>
>>>>> >> Here are some details about the setup:
>>>>> >>
>>>>> >> Cluster #1: three nodes connected via a local network
>>>>> (172.31.35.0/24),
>>>>> >> one replicated (3 replica) volume.
>>>>> >>
>>>>> >> Cluster #2: three nodes connected via a local network
>>>>> (172.31.36.0/24),
>>>>> >> one replicated (3 replica) volume.
>>>>> >>
>>>>> >> The two clusters are connected to the Internet via separate
>>>>> network
>>>>> >> adapters.
>>>>> >>
>>>>> >> Only SSH (port 22) is open on cluster #2 nodes' adapters
>>>>> connected to
>>>>> >> the Internet.
>>>>> >>
>>>>> >> All nodes are running Ubuntu 18.04 and GlusterFS 6.3 installed
>>>>> from [1].
>>>>> >>
>>>>> >> The first time I followed the guide[2] everything went fine up until
>>>>> >> I reached the "Create the session" step. That was about a month ago;
>>>>> >> then I had to temporarily stop working on this, and now I am coming
>>>>> >> back to it.
>>>>> >>
>>>>> >> Currently, if I try to see the mountbroker status I get the
>>>>> following:
>>>>> >>
>>>>> >>> # gluster-mountbroker status
>>>>> >>> Traceback (most recent call last):
>>>>> >>> File "/usr/sbin/gluster-mountbroker", line 396, in <module>
>>>>> >>> runcli()
>>>>> >>> File
>>>>> "/usr/lib/python3/dist-packages/gluster/cliutils/cliutils.py", line
>>>>> 225,
>>>>> in runcli
>>>>> >>> cls.run(args)
>>>>> >>> File "/usr/sbin/gluster-mountbroker", line 275, in run
>>>>> >>> out = execute_in_peers("node-status")
>>>>> >>> File
>>>>> "/usr/lib/python3/dist-packages/gluster/cliutils/cliutils.py",
>>>>> >> line 127, in execute_in_peers
>>>>> >>> raise GlusterCmdException((rc, out, err, " ".join(cmd)))
>>>>> >>> gluster.cliutils.cliutils.GlusterCmdException: (1, '',
>>>>> 'Unable to
>>>>> >> end. Error : Success\n', 'gluster system:: execute mountbroker.py
>>>>> >> node-status')
>>>>> >>
>>>>> >> And in /var/log/glusterfs/glusterd.log I have:
>>>>> >>
>>>>> >>> [2019-08-10 15:24:21.418834] E [MSGID: 106336]
>>>>> >> [glusterd-geo-rep.c:5413:glusterd_op_sys_exec] 0-management:
>>>>> Unable to
>>>>> >> end. Error : Success
>>>>> >>> [2019-08-10 15:24:21.418908] E [MSGID: 106122]
>>>>> >> [glusterd-syncop.c:1445:gd_commit_op_phase] 0-management:
>>>>> Commit of
>>>>> >> operation 'Volume Execute system commands' failed on localhost
>>>>> : Unable
>>>>> >> to end. Error : Success
>>>>> >>
>>>>> >> So, I have two questions right now:
>>>>> >>
>>>>> >> 1) Is there anything wrong with my setup (networking, open
>>>>> ports, etc.)?
>>>>> >> Is it expected to work with this setup or should I redo it in a
>>>>> >> different way?
>>>>> >> 2) How can I troubleshoot the current status of my setup? Can
>>>>> I find out
>>>>> >> what's missing/wrong and continue from there or should I just
>>>>> start from
>>>>> >> scratch?
>>>>> >>
>>>>> >> Links:
>>>>> >> [1] http://ppa.launchpad.net/gluster/glusterfs-6/ubuntu
>>>>> >> [2]
>>>>> >>
>>>>> https://docs.gluster.org/en/latest/Administrator%20Guide/Geo%20Replication/
>>>>>
>>>>> >>
>>>>> >> Thank you!
>>>>> >>
>>>>> >> Best regards,
>>>>> >> --
>>>>> >> alexander iliev