[Gluster-users] transport endpoint not connected and sudden unmount

Brian Andrus toomuchit at gmail.com
Wed Jun 27 14:19:25 UTC 2018


All,

I have a Gluster filesystem (glusterfs-4.0.2-1, Type:
Distributed-Replicate, Number of Bricks: 5 x 3 = 15).

I have one directory that is used for Slurm state files, and it seems to
get out of sync fairly often. Certain files in it end up never healing.

Since the files are ephemeral, I'm ok with losing them (for now).
Following some advice, I deleted the UUID-named files that were in
/GLUSTER/brick1/.glusterfs/indices/xattrop/.
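
Roughly, what I ran on each brick was something like the sketch below
(the exact entry names varied; my understanding of the advice was to
remove only the GFID-named index entries and to leave the xattrop-*
base file alone, so treat that detail as my assumption rather than
something from the docs):

  # on each brick host, using my brick path
  cd /GLUSTER/brick1/.glusterfs/indices/xattrop/
  ls -l
  # delete only the GFID-named entries; keep the xattrop-<uuid> base file
  find . -maxdepth 1 -type f ! -name 'xattrop-*' -delete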

After that, gluster volume heal GDATA statistics heal-count shows no
issues, but the problem is still there. Even though nothing shows up in
gluster volume heal GDATA info, there are some files and directories
that give me "Transport endpoint is not connected" whenever I try to
access them.
There is even an empty directory that I cannot remove: 'rmdir' on it
fails with "rmdir: failed to remove ‘/DATA/slurmstate.old/slurm/’:
Software caused connection abort" and the mount goes bad; I have to
umount/mount it to get it back.
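
In case it helps, this is roughly how I have been comparing the copies
of that directory on the bricks (the brick path and the directory name
come from the attached log; the commands are just my own checks):

  # on each server holding a replica of this directory
  getfattr -d -m . -e hex /GLUSTER/brick1/slurmstate.old/slurm
  stat /GLUSTER/brick1/slurmstate.old/slurm
  # and from a node with the gluster CLI
  gluster volume heal GDATA info
  gluster volume status GDATA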

The portion of the log file covering the crash is attached below.

How do I clean this up? And what is the 'proper' way to handle a file
that will not heal, even in a 3-way replicate?

Brian Andrus

-------------- next part --------------
[2018-06-27 14:16:00.075738] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-GDATA-client-12: Connected to GDATA-client-12, attached to remote volume '/GLUSTER/brick1'.
[2018-06-27 14:16:00.075755] I [MSGID: 108005] [afr-common.c:5081:__afr_handle_child_up_event] 0-GDATA-replicate-4: Subvolume 'GDATA-client-12' came back up; going online.
[2018-06-27 14:16:00.076274] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-14: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.076468] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-14: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.076582] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-GDATA-client-14: changing port to 49152 (from 0)
[2018-06-27 14:16:00.076772] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-13: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.076922] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-13: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.077407] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-GDATA-client-13: Connected to GDATA-client-13, attached to remote volume '/GLUSTER/brick1'.
[2018-06-27 14:16:00.077422] I [MSGID: 108002] [afr-common.c:5378:afr_notify] 0-GDATA-replicate-4: Client-quorum is met
[2018-06-27 14:16:00.079479] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-14: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.079723] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-GDATA-client-14: error returned while attempting to connect to host:(null), port:0
[2018-06-27 14:16:00.080249] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-GDATA-client-14: Connected to GDATA-client-14, attached to remote volume '/GLUSTER/brick1'.
[2018-06-27 14:16:00.081176] I [fuse-bridge.c:4234:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2018-06-27 14:16:00.081196] I [fuse-bridge.c:4864:fuse_graph_sync] 0-fuse: switched to graph 0
[2018-06-27 14:16:00.088870] I [MSGID: 109005] [dht-selfheal.c:2328:dht_selfheal_directory] 0-GDATA-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-06-27 14:16:03.675890] W [MSGID: 108027] [afr-common.c:2255:afr_attempt_readsubvol_set] 0-GDATA-replicate-1: no read subvols for /slurmstate.old/slurm
[2018-06-27 14:16:03.675921] I [MSGID: 109063] [dht-layout.c:693:dht_layout_normalize] 0-GDATA-dht: Found anomalies in /slurmstate.old/slurm (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2018-06-27 14:16:03.675936] W [MSGID: 109005] [dht-selfheal.c:2303:dht_selfheal_directory] 0-GDATA-dht: Directory selfheal failed: 1 subvolumes down.Not fixing. path = /slurmstate.old/slurm, gfid = 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748
[2018-06-27 14:16:03.679061] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-GDATA-replicate-2: performing entry selfheal on 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748
[2018-06-27 14:16:03.681899] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-GDATA-replicate-2: expunging file 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748/heartbeat (00000000-0000-0000-0000-000000000000) on GDATA-client-6
[2018-06-27 14:16:03.683080] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: /slurmstate.old/slurm/qos_usage (848b3d5e-3492-4343-a1b2-a86cc975b3c2) [No data available]
[2018-06-27 14:16:03.683624] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2018-06-27 14:16:03.684098] I [MSGID: 108026] [afr-self-heal-entry.c:887:afr_selfheal_entry_do] 0-GDATA-replicate-1: performing entry selfheal on 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748
[2018-06-27 14:16:03.684982] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: /slurmstate.old/slurm/job_state.old (05eb1a4e-6b6e-4e6f-a13b-a016431f0e1f) [No data available]
[2018-06-27 14:16:03.686319] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: /slurmstate.old/slurm/resv_state (6658195b-9a22-4663-89ab-c7e3f4fc5258) [No data available]
[2018-06-27 14:16:03.690118] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2018-06-27 14:16:03.690283] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2018-06-27 14:16:03.690807] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: /slurmstate.old/slurm/node_state (91e1ee64-154e-4aff-be2d-fcd2ac616dbe) [No data availabl
e]
[2018-06-27 14:16:03.692947] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: /slurmstate.old/slurm/node_state.old (d1a4ce4f-fc99-418e-ac0e-78a191795275) [No data available]
[2018-06-27 14:16:03.693357] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]" repeated 2 times between [2018-06-27 1
4:16:03.693357] and [2018-06-27 14:16:03.694530]
[2018-06-27 14:16:03.694560] W [MSGID: 109048] [dht-common.c:9736:dht_rmdir_cached_lookup_cbk] 0-GDATA-dht: /slurmstate.old/slurm/qos_usage not found on cached subvol GDATA-replicate-2 [No data available]
[2018-06-27 14:16:03.695685] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2018-06-27 14:16:03.697547] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-GDATA-replicate-1: expunging file 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748/node_state.old (00000000-0000-0000-0000-000000000000) on GDATA-client-4
[2018-06-27 14:16:03.697770] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-GDATA-replicate-1: expunging file 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748/node_state (00000000-0000-0000-0000-000000000000) on GDATA-client-4
[2018-06-27 14:16:03.697954] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-GDATA-replicate-1: expunging file 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748/resv_state (00000000-0000-0000-0000-000000000000) on GDATA-client-4
[2018-06-27 14:16:03.698160] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-GDATA-replicate-1: expunging file 8ed6a9e9-2820-40bd-8d9d-77b7f79c7748/job_state.old (00000000-0000-0000-0000-000000000000) on GDATA-client-4
[2018-06-27 14:16:03.698950] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
[2018-06-27 14:16:03.699007] W [MSGID: 108027] [afr-common.c:2255:afr_attempt_readsubvol_set] 0-GDATA-replicate-1: no read subvols for /slurmstate.old/slurm/node_state.old
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(1) op(RMDIR)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-06-27 14:16:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 4.0.2
[2018-06-27 14:16:03.699346] W [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_lookup_cbk] 0-GDATA-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
/lib64/libglusterfs.so.0(+0x244b0)[0x7f3db028c4b0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f3db02963e4]
/lib64/libc.so.6(+0x362f0)[0x7f3dae8f22f0]
/usr/lib64/glusterfs/4.0.2/xlator/cluster/distribute.so(+0x5d840)[0x7f3da2214840]
/usr/lib64/glusterfs/4.0.2/xlator/cluster/replicate.so(+0x69c90)[0x7f3da24cfc90]
/usr/lib64/glusterfs/4.0.2/xlator/cluster/replicate.so(+0x6aaaa)[0x7f3da24d0aaa]
/usr/lib64/glusterfs/4.0.2/xlator/cluster/replicate.so(+0x6ae43)[0x7f3da24d0e43]
/lib64/libglusterfs.so.0(+0x60840)[0x7f3db02c8840]
/lib64/libc.so.6(+0x48030)[0x7f3dae904030]
---------
[2018-06-27 14:16:03.699540] W [MSGID: 108027] [afr-common.c:2255:afr_attempt_readsubvol_set] 0-GDATA-replicate-1: no read subvols for /slurmstate.old/slurm/node_state

