[Gluster-users] Gfid mismatch (Transport endpoint is not connected) after long degeneracy

i_j_e_x_a at yahoo.co.jp i_j_e_x_a at yahoo.co.jp
Wed Aug 22 11:11:54 UTC 2018


I am new to GlusterFS. I ran a failover evaluation and hit a "Transport endpoint is not connected"
error. Can someone explain what state the volume is in, and how to recover?

Here are the details.

The Gluster version is 4.1.1. The volume info is shown below. There are 9 nodes (machines), running
CentOS 7.5, each of which has one brick. The gluster volume type is "Distributed-Replicate" with
replica 3 (one of the three bricks in each replica set is an arbiter).

  # gluster volume info

  Volume Name: vol0
  Type: Distributed-Replicate
  Volume ID: 2c450ca8-d385-43a3-8761-7d227ee61d37
  Status: Started
  Snapshot Count: 0
  Number of Bricks: 3 x (2 + 1) = 9
  Transport-type: tcp
  Brick1: host01:/glusterfs/vol0/brick0
  Brick2: host02:/glusterfs/vol0/brick0
  Brick3: host03:/glusterfs/vol0/brick0 (arbiter)
  Brick4: host04:/glusterfs/vol0/brick0
  Brick5: host05:/glusterfs/vol0/brick0
  Brick6: host06:/glusterfs/vol0/brick0 (arbiter)
  Brick7: host07:/glusterfs/vol0/brick0
  Brick8: host08:/glusterfs/vol0/brick0
  Brick9: host09:/glusterfs/vol0/brick0 (arbiter)
  Options Reconfigured:
  transport.address-family: inet
  nfs.disable: on
  performance.client-io-threads: off

The evaluation consists of the following two simultaneous (independent) operations.

(1) On a gluster native client, which is not one of the gluster server machines, files from a local
    filesystem are copied to a directory under the gluster mount point, the copied files are read
    back, and then deleted. This copy-read-delete cycle is repeated continuously. The number of
    files is 1000, and each file is 10 MB in size.
(2) Stop and restart the gluster server OSes one after another. The stop-start period is about
    3 minutes. (At any time, no more than one server machine is stopped.) The stop-start targets
    are selected sequentially ( host01 -> host02 -> ... -> host09 -> host01 -> ... ).
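For reference, step (1) above can be sketched as a small shell loop. SRC and MOUNT here are local
temporary directories standing in for the source filesystem and the glusterfs mount point (and the
files are tiny, not 10 MB), so the sketch is self-contained rather than a reproduction of the
actual workload:

```shell
#!/bin/sh
# Sketch of one copy-read-delete iteration; local temp dirs stand in
# for the real source directory and the glusterfs mount point.
SRC=$(mktemp -d)
MOUNT=$(mktemp -d)

# Create a few small stand-in files (the real test used 1000 x 10 MB).
for i in $(seq 1 3); do
    printf 'data%d' "$i" > "$SRC/file$i"
done

cp "$SRC"/* "$MOUNT"/        # copy into the (stand-in) mount point
cat "$MOUNT"/* > /dev/null   # read the copied files back
rm "$MOUNT"/*                # delete them
echo "iteration done"
```

In the real test this loop runs continuously while step (2) cycles the servers.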

I ran the workload for a long time; after more than 48 hours, the client started printing messages
containing "Transport endpoint is not connected" to stderr. At the same time, the following logs
appeared in /var/log/glusterfs/<mountpoint>.log on the client.

[2018-08-20 01:37:58.372075] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file990>, aca9c945-8a3b-4ffe-af45-d07c4cea355a on vol0-client-2 and 8518500f-fb45-4466-9878-10a25545aa48 on vol0-client-0.
[2018-08-20 01:37:58.372234] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.
[2018-08-20 01:37:58.381751] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file998>, 0b401c35-70d7-4255-a4ca-71b27f3fd7e7 on vol0-client-2 and 45a14259-9d4f-43e3-8c55-4c6dc080a97c on vol0-client-0.
[2018-08-20 01:37:58.381931] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.

... repeated many more times (with different GFIDs in each message)

The only way I have found to recover from this error is to remove all of the affected files
directly from the bricks under /glusterfs/vol0/brick0.
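(Aside: my understanding is that deleting a file directly from a brick leaves behind its GFID
hard link under the brick's hidden .glusterfs directory, which should be removed as well; the
link path is derived from the first two byte pairs of the GFID. A minimal sketch of computing
that path, using a GFID taken from the first log line above:)

```shell
#!/bin/sh
# Compute the backend .glusterfs hard-link path for a given GFID.
# The path is .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>.
GFID=aca9c945-8a3b-4ffe-af45-d07c4cea355a
LINK=".glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$LINK"
```

On a brick one would then remove both the file itself and $LINK relative to the brick root
(e.g. /glusterfs/vol0/brick0/) before triggering a heal; please verify this against the
documentation for your version before relying on it.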

I would like to understand what this error state means, how to recover from it, and how to
prevent it from happening in the first place.
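In case it helps the discussion: recent GlusterFS releases expose split-brain inspection and
resolution through the heal CLI. Whether these commands handle GFID split-brain (as opposed to
data/metadata split-brain) may depend on the version, so treat the following as a sketch to
verify against the 4.1 documentation; <FILE> and the chosen source brick are placeholders:

```shell
# List entries currently in split-brain (inspection only).
gluster volume heal vol0 info split-brain

# Resolve a single entry by policy (keep the copy with the latest mtime) ...
gluster volume heal vol0 split-brain latest-mtime <FILE>

# ... or by explicitly choosing which brick's copy is the good one.
gluster volume heal vol0 split-brain source-brick host02:/glusterfs/vol0/brick0 <FILE>
```

If these work for GFID mismatches in 4.1.1, they would avoid hand-editing the bricks.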


P.S. As far as we have tested, there is no memory leak in this version.
