[Gluster-users] gluster volume heal info hang on 3.10.6 & 3.12.11

Ning Adam frostyplanet at gmail.com
Fri Sep 14 07:45:01 UTC 2018


(My last mail seems to have been dropped by the list, so I am reposting in plain-text format)

Hello:

I recently did a rolling upgrade of some of my 3.7.6 gluster
(AFR + DHT) deployments to 3.10.6, and have had two problems since:
1)  occasional IO errors
2)  "gluster volume heal vol0 info" hangs after a decent amount of
files has been written to the cluster (the time it takes varies from hours to days)

All 7 of my upgraded 3.10 clusters reproduce these problems, so I
then tried upgrading them to 3.12.11, but nothing improved.
I have submitted a bug at
https://bugzilla.redhat.com/show_bug.cgi?id=1618932 , but so far no one
seems to have looked into it.

The heal hang is the most annoying problem, because it makes it
impossible for me to monitor my clusters. So I tried to read the code
and analyze the DEBUG & TRACE logs, and found the following.

1) I have "network.ping-timeout: 120" in volume option . If I modify
network.ping-timeout to 0,
"gluster volume heal vol0 info" will finish eventually after 10~20
minutes (8x3 bricks setup). And it says no file to heal.
It takes even longer when cluster is larger (36x3 bricks setup).
It does not matter even I stop all my IO or restart all glusterd &
glusterfsd process.  heal info still hangs.
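
For reference, I checked and changed the option with the stock gluster CLI:

    # show the current value
    gluster volume get vol0 network.ping-timeout
    # disable the ping timeout as an experiment
    gluster volume set vol0 network.ping-timeout 0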

"gluster volume status"  is all normal.  All my brick servers are in
low latency LAN enviroment.

2) Since one of the clusters having this problem is a testing setup, I
can do some debugging there.
I looked into heal/src/glfs-heal.c, compiled the code with some extra
logging added: the initialization call "ret = glfs_init (fs)" hangs.
I traced it to " ret = glfs_chdir (fs, "/");" and then "dht_lookup(
path='/' )"; it never gets to "dht_readdirp_cbk()".

I stopped all client IO on the gluster volume and then ran
"glfsheal vol0".
In /var/log/glusterfs/glfsheal-vol0.log there is massive
looping of reconnect attempts on all bricks, e.g.:

    [2018-09-14 05:56:09.395869] T [rpc-clnt.c:407:rpc_clnt_reconnect] 0-vol0-client-12: attempting reconnect
    [2018-09-14 05:56:09.396160] T [socket.c:3125:socket_connect] (--> /lib/libglusterfs.so.0(_gf_log_callingfn+0x1ee)[0x7f10d4e79778] (--> /usr/lib/glusterfs//rpc-transport/socket.so(+0xd442)[0x7f10c647c442] (--> /lib/libgfrpc.so.0(rpc_transport_connect+0x93)[0x7f10d437de43] (--> /lib/libgfrpc.so.0(rpc_clnt_reconnect+0x136)[0x7f10d438121b] (--> /lib/libglusterfs.so.0(+0x3e4e9)[0x7f10d4e894e9] ))))) 0-vol0-client-12: connect () called on transport already connected
    [2018-09-14 05:56:09.396194] T [rpc-clnt.c:407:rpc_clnt_reconnect] 0-vol0-client-10: attempting reconnect
    [2018-09-14 05:56:09.396424] T [socket.c:3125:socket_connect] (--> /lib/libglusterfs.so.0(_gf_log_callingfn+0x1ee)[0x7f10d4e79778] (--> /usr/lib/glusterfs//rpc-transport/socket.so(+0xd442)[0x7f10c647c442] (--> /lib/libgfrpc.so.0(rpc_transport_connect+0x93)[0x7f10d437de43] (--> /lib/libgfrpc.so.0(rpc_clnt_reconnect+0x136)[0x7f10d438121b] (--> /lib/libglusterfs.so.0(+0x3e4e9)[0x7f10d4e894e9] ))))) 0-vol0-client-10: connect () called on transport already connected

It's also weird that 0-vol0-client-5 shows a huge RPC ping latency
(29 seconds, which cannot be real network latency: a simple ICMP ping
to that server takes less than 0.1 seconds):

    root@storage0-cluster0:~# grep ping /var/log/glusterfs/glfsheal-vol0.log | grep laten
    [2018-09-14 05:13:16.008866] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
    [2018-09-14 05:13:16.016669] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
    [2018-09-14 05:13:16.021769] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
    [2018-09-14 05:13:16.033761] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
    [2018-09-14 05:13:16.101962] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms
    [2018-09-14 05:13:46.105502] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-5: Ping latency is 29996ms
    [2018-09-14 05:20:13.895285] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
    [2018-09-14 05:20:13.901476] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
    [2018-09-14 05:20:13.908373] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
    [2018-09-14 05:20:13.918782] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
    [2018-09-14 05:20:13.987162] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms
    [2018-09-14 05:20:43.992528] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-5: Ping latency is 29998ms
    [2018-09-14 05:25:11.983259] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
    [2018-09-14 05:25:11.991199] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
    [2018-09-14 05:25:11.996933] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
    [2018-09-14 05:25:12.007867] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
    [2018-09-14 05:25:12.076057] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms
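
A quick way to pull only the slow pings out of that log (the 1000ms
threshold is arbitrary):

    grep 'Ping latency' /var/log/glusterfs/glfsheal-vol0.log | \
        awk '{ ms = $NF; sub(/ms$/, "", ms); if (ms + 0 > 1000) print }'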

I checked vol0-client-5: nothing special compared to the other bricks.
CPU usage is not high and the disk has no problems.
There are no warnings or errors in /var/log/glusterfs/brick/data-brk01-src.log.
After I turned on TRACE level logging on the brick,
the output was too much to analyze, and showed no apparent difference
from the other bricks.
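
(For completeness: I raised the log level with the standard diagnostics
options; the client side works the same way.)

    # brick-side (glusterfsd) log level
    gluster volume set vol0 diagnostics.brick-log-level TRACE
    # client-side (fuse / gfapi, including glfsheal) log level
    gluster volume set vol0 diagnostics.client-log-level TRACE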

I hope someone has an idea about this; even a pointer in the right
direction to look would help.

Thanks very much

