<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">


    <div class="moz-cite-prefix">On 06/05/19 6:43 PM, Łukasz Michalski

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:4376d725-a451-7b18-a7a1-c5285c3570b3@zork.pl">Hi,

      <br>

      <br>

      I have a problem resolving a split-brain in one of my installations.

      <br>

      <br>

      CentOS 7, glusterfs 3.10.12, replica on two nodes:

      <br>

      <br>

      [root@ixmed1 iscsi]# gluster volume status cluster

      <br>

      <pre>
Status of volume: cluster
Gluster process                                 TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ixmed2:/glusterfs-bricks/cluster/cluster  49153     0          Y       3028
Brick ixmed1:/glusterfs-bricks/cluster/cluster  49153     0          Y       2917
Self-heal Daemon on localhost                   N/A       N/A        Y       112929
Self-heal Daemon on ixmed2                      N/A       N/A        Y       57774

Task Status of Volume cluster
------------------------------------------------------------------------------
There are no active volume tasks
</pre>

      <br>

      <br>

      When I try to access one file, glusterd reports split-brain:

      <br>

      <br>

      [2019-05-06 12:36:43.785098] E [MSGID: 108008]

      [afr-read-txn.c:90:afr_read_txn_refresh_done]

      0-cluster-replicate-0: Failing READ on gfid

      2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed.

      [Input/output error]

      <br>

      [2019-05-06 12:36:43.787952] E [MSGID: 108008]

      [afr-read-txn.c:90:afr_read_txn_refresh_done]

      0-cluster-replicate-0: Failing FGETXATTR on gfid

      2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed.

      [Input/output error]

      <br>

      [2019-05-06 12:36:43.788778] W [MSGID: 108027]

      [afr-common.c:2722:afr_discover_done] 0-cluster-replicate-0: no

      read subvols for (null)

      <br>

      [2019-05-06 12:36:43.790123] W [fuse-bridge.c:2254:fuse_readv_cbk]

      0-glusterfs-fuse: 3352501: READ =&gt; -1

      gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde0803f390

      (Input/output error)

      <br>

      [2019-05-06 12:36:43.794979] W [fuse-bridge.c:2254:fuse_readv_cbk]

      0-glusterfs-fuse: 3352506: READ =&gt; -1

      gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0

      (Input/output error)

      <br>

      [2019-05-06 12:36:43.800468] W [fuse-bridge.c:2254:fuse_readv_cbk]

      0-glusterfs-fuse: 3352508: READ =&gt; -1

      gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0

      (Input/output error)

      <br>

      <br>

      The problem is that "gluster volume heal info" hangs for 10

      seconds and returns:

      <br>

      <br>

          Not able to fetch volfile from glusterd

      <br>

          Volume heal failed

      <br>

      <br>

      glfsheal.log contains:

      <br>

      <br>

      [2019-05-06 12:40:25.589879] I [afr.c:94:fix_quorum_options]

      0-cluster-replicate-0: reindeer: incoming qtype = none

      <br>

      [2019-05-06 12:40:25.589967] I [afr.c:116:fix_quorum_options]

      0-cluster-replicate-0: reindeer: quorum_count = 0

      <br>

      [2019-05-06 12:40:25.593294] W [MSGID: 101174]

      [graph.c:361:_log_if_unknown_option] 0-cluster-readdir-ahead:

      option 'parallel-readdir' is not recognized

      <br>

      [2019-05-06 12:40:25.593895] I [MSGID: 104045]

      [glfs-master.c:91:notify] 0-gfapi: New graph

      69786d65-6431-2d32-3037-3739322d3230 (0) coming up

      <br>

      [2019-05-06 12:40:25.593972] I [MSGID: 114020]

      [client.c:2352:notify] 0-cluster-client-0: parent translators are

      ready, attempting connect on transport

      <br>

      [2019-05-06 12:40:25.607836] I [MSGID: 114020]

      [client.c:2352:notify] 0-cluster-client-1: parent translators are

      ready, attempting connect on transport

      <br>

      [2019-05-06 12:40:25.608556] I [rpc-clnt.c:2000:rpc_clnt_reconfig]

      0-cluster-client-0: changing port to 49153 (from 0)

      <br>

      [2019-05-06 12:40:25.618167] I [rpc-clnt.c:2000:rpc_clnt_reconfig]

      0-cluster-client-1: changing port to 49153 (from 0)

      <br>

      [2019-05-06 12:40:25.629595] I [MSGID: 114057]

      [client-handshake.c:1451:select_server_supported_programs]

      0-cluster-client-0: Using Program GlusterFS 3.3, Num (1298437),

      Version (330)

      <br>

      [2019-05-06 12:40:25.632031] I [MSGID: 114046]

      [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-0:

      Connected to cluster-client-0, attached to remote volume

      '/glusterfs-bricks/cluster/cluster'.

      <br>

      [2019-05-06 12:40:25.632100] I [MSGID: 114047]

      [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-0:

      Server and Client lk-version numbers are not same, reopening the

      fds

      <br>

      [2019-05-06 12:40:25.632263] I [MSGID: 108005]

      [afr-common.c:4817:afr_notify] 0-cluster-replicate-0: Subvolume

      'cluster-client-0' came back up; going online.

      <br>

      [2019-05-06 12:40:25.637707] I [MSGID: 114057]

      [client-handshake.c:1451:select_server_supported_programs]

      0-cluster-client-1: Using Program GlusterFS 3.3, Num (1298437),

      Version (330)

      <br>

      [2019-05-06 12:40:25.639285] I [MSGID: 114046]

      [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-1:

      Connected to cluster-client-1, attached to remote volume

      '/glusterfs-bricks/cluster/cluster'.

      <br>

      [2019-05-06 12:40:25.639341] I [MSGID: 114047]

      [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-1:

      Server and Client lk-version numbers are not same, reopening the

      fds

      <br>

      [2019-05-06 12:40:31.564407] C

      [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]

      0-cluster-client-0: server 10.0.104.26:49153 has not responded in

      the last 5 seconds, disconnecting.

      <br>

      [2019-05-06 12:40:31.565764] C

      [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]

      0-cluster-client-1: server 10.0.7.26:49153 has not responded in

      the last 5 seconds, disconnecting.

      <br>

    </blockquote>

    <p>These ping-timer disconnects, after only 5 seconds, seem to be
      the problem. Have you changed the value of ping-timeout? Could
      you share the output of <code>gluster volume info</code>?</p>
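    <p>As a rough way to check this from the logs, the sketch below
      pulls the timeout out of the <code>rpc_clnt_ping_timer_expired</code>
      messages quoted above and compares it with 42 seconds, the default
      for <code>network.ping-timeout</code>. The function names and the
      sample string are made up for illustration; the sketch only assumes
      that the number in the message reflects the timeout the client is
      using.</p>

<pre>
```python
import re

# Default value of the network.ping-timeout volume option, in seconds.
DEFAULT_PING_TIMEOUT = 42

# Matches the ping-timer expiry message shown in glfsheal.log.
PING_EXPIRED = re.compile(
    r"server (\S+) has not responded in the last (\d+) seconds"
)


def ping_timeouts(log_text):
    """Return (server, seconds) pairs for every ping-timer expiry found."""
    # Collapse whitespace first so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [(m.group(1), int(m.group(2))) for m in PING_EXPIRED.finditer(flat)]


def non_default(pairs):
    """Keep only the entries whose timeout differs from the default."""
    return [p for p in pairs if p[1] != DEFAULT_PING_TIMEOUT]


# Hypothetical sample: one message copied from the log above.
sample = (
    "[2019-05-06 12:40:31.564407] C "
    "[rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] "
    "0-cluster-client-0: server 10.0.104.26:49153 has not responded in "
    "the last 5 seconds, disconnecting."
)

print(non_default(ping_timeouts(sample)))
```
</pre>

    <p>If this reports a value other than 42, the option has most
      likely been reconfigured on the volume, and it should then show up
      under the reconfigured options in the volume info output.</p>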

    <p>Does the same issue occur if you try to resolve the split-brain

      on the gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 using the <code>gluster

        volume heal &lt;VOLNAME&gt; split-brain </code>CLI?</p>
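    <p>For reference, here is a small sketch that assembles such a
      command for the gfid from the log without running it. It assumes
      the documented forms <code>split-brain bigger-file &lt;FILE&gt;</code>
      and <code>split-brain source-brick &lt;HOSTNAME:BRICKNAME&gt;
        &lt;FILE&gt;</code>, with the file given as
      <code>gfid:&lt;gfid-string&gt;</code>; please confirm the syntax
      against the CLI help on 3.10.12 before using it. The helper name
      and its parameters are hypothetical.</p>

<pre>
```python
import re

# Canonical textual form of a gfid (lower-case hex UUID).
GFID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def heal_split_brain_argv(volume, gfid, policy="bigger-file", source_brick=None):
    """Build, but do not run, a split-brain resolution command for one gfid."""
    if not GFID_RE.fullmatch(gfid):
        raise ValueError("not a gfid: " + gfid)
    argv = ["gluster", "volume", "heal", volume, "split-brain"]
    if policy == "source-brick":
        # This policy needs the brick to trust, e.g. HOST:/brick/path.
        if source_brick is None:
            raise ValueError("source-brick policy needs HOST:/brick/path")
        argv += ["source-brick", source_brick]
    elif policy == "bigger-file":
        argv.append("bigger-file")
    else:
        raise ValueError("unsupported policy: " + policy)
    # The file is addressed by gfid instead of by path.
    argv.append("gfid:" + gfid)
    return argv


# Volume name and gfid are taken from the report above.
cmd = heal_split_brain_argv("cluster", "2584a0e2-c0fa-4fde-8537-5d5b6a5a4635")
print(" ".join(cmd))
```
</pre>

    <p>Which policy is appropriate depends on which copy of the file
      you consider good; with <code>source-brick</code> you name that
      brick explicitly.</p>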

    <p>-Ravi<br>

    </p>

    <blockquote type="cite"

      cite="mid:4376d725-a451-7b18-a7a1-c5285c3570b3@zork.pl">[2019-05-06

      12:40:35.645545] I [MSGID: 114018]

      [client.c:2276:client_rpc_notify] 0-cluster-client-0: disconnected

      from cluster-client-0. Client process will keep trying to connect

      to glusterd until brick's port is available

      <br>

      [2019-05-06 12:40:35.645683] I

      [socket.c:3534:socket_submit_request] 0-cluster-client-0: not

      connected (priv-&gt;connected = -1)

      <br>

      [2019-05-06 12:40:35.645755] W [rpc-clnt.c:1693:rpc_clnt_submit]

      0-cluster-client-0: failed to submit rpc-request (XID: 0x7

      Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport

      (cluster-client-0)

      <br>

      [2019-05-06 12:40:35.645807] W [MSGID: 114031]

      [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-0:

      remote operation failed [Drugi koniec nie jest połączony]

      <br>

      [2019-05-06 12:40:35.645887] I

      [socket.c:3534:socket_submit_request] 0-cluster-client-1: not

      connected (priv-&gt;connected = -1)

      <br>

      [2019-05-06 12:40:35.645918] W [rpc-clnt.c:1693:rpc_clnt_submit]

      0-cluster-client-1: failed to submit rpc-request (XID: 0x7

      Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport

      (cluster-client-1)

      <br>

      [2019-05-06 12:40:35.645955] W [MSGID: 114031]

      [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-1:

      remote operation failed [Drugi koniec nie jest połączony]

      <br>

      [2019-05-06 12:40:35.646008] W [MSGID: 109075]

      [dht-diskusage.c:44:dht_du_info_cbk] 0-cluster-dht: failed to get

      disk info from cluster-replicate-0 [Drugi koniec nie jest

      połączony]

      <br>

      [2019-05-06 12:40:35.647846] I [MSGID: 114018]

      [client.c:2276:client_rpc_notify] 0-cluster-client-1: disconnected

      from cluster-client-1. Client process will keep trying to connect

      to glusterd until brick's port is available

      <br>

      [2019-05-06 12:40:35.647895] E [MSGID: 108006]

      [afr-common.c:4842:afr_notify] 0-cluster-replicate-0: All

      subvolumes are down. Going offline until atleast one of them comes

      back up.

      <br>

      [2019-05-06 12:40:35.647989] I [MSGID: 108006]

      [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no

      subvolumes up

      <br>

      [2019-05-06 12:40:35.648051] I [MSGID: 108006]

      [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no

      subvolumes up

      <br>

      [2019-05-06 12:40:35.648122] I [MSGID: 104039]

      [glfs-resolve.c:902:__glfs_active_subvol] 0-cluster: first lookup

      on graph 69786d65-6431-2d32-3037-3739322d3230 (0) failed (Drugi

      koniec nie jest połączony) [Drugi koniec nie jest połączony]

      <br>

      <br>

      "Drugi koniec nie jest połączony" -&gt; Transport endpoint not

      connected

      <br>

      <br>

      On the brick process side there is a connection attempt:

      <br>

      <br>

      [2019-05-06 12:40:25.638032] I [addr.c:182:gf_auth]

      0-/glusterfs-bricks/cluster/cluster: allowed = "*", received addr

      = "10.0.7.26"

      <br>

      [2019-05-06 12:40:25.638080] I [login.c:111:gf_auth] 0-auth/login:

      allowed user names: e2f4c8f4-d040-4856-b6e3-62611fbab0ea

      <br>

      [2019-05-06 12:40:25.638109] I [MSGID: 115029]

      [server-handshake.c:695:server_setvolume] 0-cluster-server:

      accepted client from

      ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

      (version: 3.10.12)

      <br>

      [2019-05-06 12:40:31.565931] I [MSGID: 115036]

      [server.c:559:server_rpc_notify] 0-cluster-server: disconnecting

      connection from

      ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

      <br>

      [2019-05-06 12:40:31.566420] I [MSGID: 101055]

      [client_t.c:436:gf_client_unref] 0-cluster-server: Shutting down

      connection

      ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

      <br>

      <br>

      I am not able to use any heal command because of this problem.

      <br>

      <br>

      I have three volumes configured on those nodes. The configuration
      is identical, and the "gluster volume heal" command fails for all
      of them.

      <br>

      <br>

      Can anyone help?

      <br>

      <br>

      Thanks,

      <br>

      Łukasz

      <br>

      <br>

      <br>


    </blockquote>

  </body>

</html>