[Gluster-users] Gfid mismatch (Transport endpoint is not connected) after long degeneracy
i_j_e_x_a at yahoo.co.jp
Wed Aug 22 11:11:54 UTC 2018
Hi.
I am new to GlusterFS. While running some failover evaluation I ran into "Transport endpoint is not connected"
errors. Can someone explain what this state means and tell me how to recover from it?
Here are the details.
The Gluster version is 4.1.1. The volume info is shown below. There are 9 nodes (machines), running CentOS 7.5,
each of which has one brick. The gluster volume type is "Distributed-Replicate" with replica count 3 (one of the
three bricks in each replica set is an arbiter).
# gluster volume info
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 2c450ca8-d385-43a3-8761-7d227ee61d37
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: host01:/glusterfs/vol0/brick0
Brick2: host02:/glusterfs/vol0/brick0
Brick3: host03:/glusterfs/vol0/brick0 (arbiter)
Brick4: host04:/glusterfs/vol0/brick0
Brick5: host05:/glusterfs/vol0/brick0
Brick6: host06:/glusterfs/vol0/brick0 (arbiter)
Brick7: host07:/glusterfs/vol0/brick0
Brick8: host08:/glusterfs/vol0/brick0
Brick9: host09:/glusterfs/vol0/brick0 (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
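For completeness, a volume with this layout would be created with something like the following (reconstructed
from the info above; it may not be the exact command we used):

# gluster volume create vol0 replica 3 arbiter 1 \
    host01:/glusterfs/vol0/brick0 host02:/glusterfs/vol0/brick0 host03:/glusterfs/vol0/brick0 \
    host04:/glusterfs/vol0/brick0 host05:/glusterfs/vol0/brick0 host06:/glusterfs/vol0/brick0 \
    host07:/glusterfs/vol0/brick0 host08:/glusterfs/vol0/brick0 host09:/glusterfs/vol0/brick0
# gluster volume start vol0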
The evaluation consists of the following two simultaneous (independent) operations.
(1) On a gluster native client, which is not one of the gluster server machines, files from a local filesystem
are copied to a directory under the gluster mount point, the copied files are read back, and then they are
deleted. This copy-read-delete cycle is repeated continuously. There are 1000 files, each 10 MB in size.
(A rough sketch of this loop is shown after this list.)
(2) Stop and restart the gluster server OSs one after another. The stop-start period is about 3 minutes.
(At any time, no more than one server machine is stopped.) The stop-start target OSs are selected sequentially
( host01 -> host02 -> ... -> host09 -> host01 -> ... ).
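For reference, the client-side loop in (1) is roughly equivalent to the sketch below. The paths /data/src and
/mnt/vol0/test are only placeholders, not the actual directories we used:

#!/bin/bash
# copy-read-delete loop on the gluster native client (illustrative only)
SRC=/data/src        # local directory holding the 1000 files of 10MB each
DST=/mnt/vol0/test   # directory under the glusterfs mount point
while true; do
    cp "$SRC"/* "$DST"/          # copy the files onto the volume
    cat "$DST"/* > /dev/null     # read the copied files back
    rm -f "$DST"/*               # delete them again
done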
I kept this running for a long time, and after more than 48 hours the client started emitting messages to stderr
containing "Transport endpoint is not connected". At the same time, the following log entries were generated in
/var/log/glusterfs/<mountpoint>.log on the client.
[2018-08-20 01:37:58.372075] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file990>, aca9c945-8a3b-4ffe-af45-d07c4cea355a on vol0-client-2 and 8518500f-fb45-4466-9878-10a25545aa48 on vol0-client-0.
[2018-08-20 01:37:58.372234] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.
[2018-08-20 01:37:58.381751] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-vol0-replicate-0: Gfid mismatch detected for <gfid:fd685a4f-3b6a-4307-b7ef-79356b0802a2>/file998>, 0b401c35-70d7-4255-a4ca-71b27f3fd7e7 on vol0-client-2 and 45a14259-9d4f-43e3-8c55-4c6dc080a97c on vol0-client-0.
[2018-08-20 01:37:58.381931] E [MSGID: 108008] [afr-self-heal-entry.c:260:afr_selfheal_detect_gfid_and_type_mismatch] 0-vol0-replicate-0: Skipping conservative merge on the file.
... repeated many times (with different GFID strings)
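In case it helps, the mismatch can also be seen directly on the servers by comparing the trusted.gfid extended
attribute of the backend copies (the path below is only an example):

# on each server that holds a replica of the affected file
getfattr -d -m . -e hex /glusterfs/vol0/brick0/<path-to-file>
# the hex value of trusted.gfid differs between the bricks that disagree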
The only way I have found to recover from this error is to remove all the associated files directly under
/glusterfs/vol0/brick0 on the servers, roughly as in the sketch below.
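(For illustration only; the file path is hypothetical and the GFID is taken from the log above. My understanding
is that the copy on the offending brick and its hard link under .glusterfs both have to go, and then a heal is
triggered, so something like:)

# on the server whose brick holds the unwanted copy (illustrative path/GFID)
BRICK=/glusterfs/vol0/brick0
GFID=aca9c945-8a3b-4ffe-af45-d07c4cea355a      # taken from the mismatch log
rm "$BRICK"/testdir/file990                    # the offending copy (path is hypothetical)
rm "$BRICK"/.glusterfs/ac/a9/"$GFID"           # its gfid hard link
gluster volume heal vol0                       # let self-heal recreate it from the good copy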
I would like to understand what this error state means, how to recover from it properly, and how to avoid
getting into it in the first place.
Thanks.
P.S. As far as our testing went, there was no memory leak with this version.