[Bugs] [Bug 1414287] repeated operation failed warnings in gluster mount logs with disperse volume

bugzilla at redhat.com bugzilla at redhat.com
Wed Jan 18 09:12:30 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1414287



--- Comment #1 from Ashish Pandey <aspandey at redhat.com> ---
+++ This bug was initially created as a clone of Bug #1406322 +++

Description of problem:

Tests in the Scale Lab targeting 36-drive-per-server JBOD-mode support are
showing repeated warnings in the gluster mount logs when running on a
distributed-disperse volume.

From: /var/log/glusterfs/<mount-point>.log:
<excerpt>
[2016-12-20 06:12:09.768372] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse:
switched to graph 0
[2016-12-20 06:12:09.768423] I [MSGID: 114035]
[client-handshake.c:201:client_set_lk_version_cbk] 0-perfvol-client-215: Server
lk version = 1
[2016-12-20 06:12:09.768528] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse:
FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2016-12-20 06:23:44.204340] W [MSGID: 122053]
[ec-common.c:116:ec_check_status] 0-perfvol-disperse-4: Operation failed on
some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-12-20 06:23:44.213433] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do]
0-perfvol-disperse-4: /smf1/file_srcdir/h17-priv/thrd_05/d_001/d_002/d_009:
name heal successful on 3F
[2016-12-20 06:28:30.254713] W [MSGID: 122053]
[ec-common.c:116:ec_check_status] 0-perfvol-disperse-19: Operation failed on
some subvolumes (up=3F, mask=3F, remaining=0, good=3C, bad=3)
[2016-12-20 06:28:30.256043] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do]
0-perfvol-disperse-19: /smf1/file_srcdir/h17-priv/thrd_07/d_001/d_007/d_002:
name heal successful on 3F
[2016-12-20 06:39:22.749543] W [MSGID: 122053]
[ec-common.c:116:ec_check_status] 0-perfvol-disperse-13: Operation failed on
some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-12-20 06:39:22.751316] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do]
0-perfvol-disperse-13: /smf1/file_srcdir/h17-priv/thrd_06/d_002/d_008/d_002:
name heal successful on 3F
</excerpt>
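For reference, the up/mask/good/bad values in these warnings are hexadecimal
bitmasks over the 6 bricks of the (4+2) disperse subvolume (assuming bit i
corresponds to brick i of that subvolume). A minimal way to decode them, using
the values from the first warning above as an example:

# decode "up=3F, mask=3F, good=1F, bad=20" into per-brick bitmaps
for v in 3F 1F 20; do
    printf '%s -> %06d\n' "$v" "$(echo "obase=2; ibase=16; $v" | bc)"
done
# 3F -> 111111  (all 6 bricks up / in the operation mask)
# 1F -> 011111  (bricks 0-4 reported good)
# 20 -> 100000  (brick 5 reported bad)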

One of the concerns with 36-drive JBOD mode is the large number of brick
processes and the memory and CPU pressure from them. The above warnings seem to
cause the self-heal process to become active, which in turn adds to the stress
on the system:

<top output excerpt>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
[...]
17676 root      20   0 17.661g 1.015g   3236 S  0.0  0.4   0:00.00 glusterfs
</top output excerpt>
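One way to correlate these warnings with self-heal activity is to watch the
pending heal entries and the self-heal daemon's resource usage while the
benchmark runs (a sketch; 'perfvol' is the volume name seen in the logs):

# entries queued for heal on the disperse subvolumes
gluster volume heal perfvol info

# resource usage of the self-heal daemon (its command line contains glustershd)
ps -C glusterfs -o pid,rss,pcpu,args | grep glustershd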

Version-Release number of selected component (if applicable):
glusterfs-*-3.8.4-8.el7rhgs.x86_64
kernel-3.10.0-514.el7.x86_64 (RHEL 7.3)

How reproducible:

The Scale Lab servers being used have 36 SATA drives, 40GbE, and 256GB RAM. The
servers also serve as clients to run the performance benchmarks. There are 6
servers, forming a 36x(4+2) distributed-disperse gluster volume. In this
configuration, the problem is consistently reproducible with the smallfile
benchmark.

Steps to Reproduce:
The warnings seen above are from a smallfile run:
smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 \
    --files 32768 --file-size 32 --record-size 32 --fsync Y \
    --response-times N --operation create

Additional info:

Volume options changed:
cluster.lookup-optimize: on
client.event-threads: 8
performance.client-io-threads: on
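For context, a sketch of how a volume matching the 6-server, 36x(4+2) layout
and the options above could be assembled; the host names (h01-priv ... h06-priv),
brick paths and mount point below are placeholders, not taken from the actual
setup:

# 216 bricks, interleaved across hosts so each (4+2) set spans all 6 servers
gluster volume create perfvol disperse-data 4 redundancy 2 \
    $(for d in $(seq -w 1 36); do
          for h in h01 h02 h03 h04 h05 h06; do
              echo ${h}-priv:/bricks/d${d}/brick
          done
      done)

# options changed from defaults for these runs
gluster volume set perfvol cluster.lookup-optimize on
gluster volume set perfvol client.event-threads 8
gluster volume set perfvol performance.client-io-threads on

gluster volume start perfvol
mount -t glusterfs h01-priv:/perfvol /mnt/glustervol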

More Additional info:
The situation is much worse in tests with the more complex SPECsfs2014 VDA
benchmark, which generates a lot more data, adding to memory pressure from page
cache usage. I'll add info from those runs later, but briefly, there we see a
steady stream of "operation failed" warnings in the gluster mount log.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-12-20
04:30:21 EST ---

This bug is automatically being proposed for the current release of Red Hat
Gluster Storage 3 under active development, by setting the release flag
'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Manoj Pillai on 2016-12-22 23:30:20 EST ---


Update on this bz: I was able to reproduce this issue on our BAGL setup --
12 drives per server, 10GbE. So this is not specific to the Scale Lab setup and
the 36-drive config I am currently working with.

I will send an email to Pranith and Nag with access details to the BAGL setup
so they can proceed with RCA. I need to continue with the Scale Lab work.

--- Additional comment from Manoj Pillai on 2016-12-22 23:32:51 EST ---


Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1358606.

I am wondering if a fix for the above bz would help here as well.

--- Additional comment from Pranith Kumar K on 2016-12-23 05:10:32 EST ---

We see a lot of disconnects based on the following logs:
erd until brick's port is available
[2016-12-22 10:11:00.615628] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-40: disconnected from
perfvol-client-40. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615675] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-41: disconnected from
perfvol-client-41. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615777] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-42: disconnected from
perfvol-client-42. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615804] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-43: disconnected from
perfvol-client-43. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615824] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-44: disconnected from
perfvol-client-44. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615844] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-45: disconnected from
perfvol-client-45. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615875] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-46: disconnected from
perfvol-client-46. Client process will keep trying to connect to glusterd until
brick's port is available
[2016-12-22 10:11:00.615896] I [MSGID: 114018]
[client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from
perfvol-client-47. Client process will keep trying to connect to glusterd until
brick's port is available

I also checked /var/log/messages to see if there were any problems; it looked
fine. We need to find the root cause of why the disconnections are happening.

--- Additional comment from Atin Mukherjee on 2016-12-23 05:26:28 EST ---

Manoj - could you specify the exact build version with which you hit the issue?

--- Additional comment from Manoj Pillai on 2016-12-23 06:16:55 EST ---

(In reply to Pranith Kumar K from comment #4)
> We see a lot of disconnects based on the following logs:
> erd until brick's port is available
> [2016-12-22 10:11:00.615628] I [MSGID: 114018]
[...]
> [2016-12-22 10:11:00.615896] I [MSGID: 114018]
> [client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from
> perfvol-client-47. Client process will keep trying to connect to glusterd
> until brick's port is available
> 
> I also checked /var/log/messages to find if there are any problems it looked
> fine. We need to find RC why the disconnections are happening.

These BAGL systems had an older upstream gluster version on them. For ease of
analysis, I put them on a recent downstream version today, i.e. 12-23. Please
look at the logs from today; I don't see any such disconnect messages in them.

The disconnects might be due to the tear-down of an existing volume rather than
anything else.

--- Additional comment from Manoj Pillai on 2016-12-23 06:29:41 EST ---

(In reply to Atin Mukherjee from comment #5)
> Manoj - could you specify the exact build version with which you hit the
> issue?

glusterfs-api-3.8.4-8.el7rhgs.x86_64
glusterfs-3.8.4-8.el7rhgs.x86_64
glusterfs-server-3.8.4-8.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-8.el7rhgs.x86_64
glusterfs-cli-3.8.4-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-8.el7rhgs.x86_64
glusterfs-fuse-3.8.4-8.el7rhgs.x86_64

--- Additional comment from Pranith Kumar K on 2016-12-23 06:41:48 EST ---

Changing back the component as I was not looking at the correct logs.

--- Additional comment from Raghavendra G on 2016-12-23 06:45:02 EST ---

Some data from logs:

* The number of disconnects was 240
[root at gprfc069 ~]# grep "disconnected from"
/var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 |  sort |  wc
-l
240

* Of which 144 were from 1st graph
[root at gprfc069 ~]# grep "disconnected from"
/var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "0-" |
sort |  wc -l
144

* and 96 were from 3rd graph
[root at gprfc069 ~]# grep "disconnected from"
/var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "2-" |
sort |  wc -l
96

(where did 2nd graph go?)

Some random interesting facts (I manually verified these by looking through the
logs too):

* there were 3 disconnects per transport in the 1st graph - 144/48 = 3.
* there were 2 disconnects per transport in the 3rd graph - 96/48 = 2.

Among these, 1 disconnect per transport can be attributed to the graph switch.

[root at gprfc069 ~]# grep "current graph is no longer active"
/var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "0-" |
wc -l
48

[root at gprfc069 ~]# grep "current graph is no longer active"
/var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "2-" |
wc -l
48

There were a bunch of messages indicating that the client was not able to get
the remote brick's port number:

[root at gprfc069 ~]# grep "failed to get the port number for"
/var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep
"0-" | wc -l
48

[root at gprfc069 ~]# grep "failed to get the port number for"
/var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep
"2-" | wc -l
48

Note that we disconnect when we cannot get the port number and then start a
fresh connection sequence. So, 1 disconnect per transport in both graphs can be
attributed to port-map failure.
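A quick way to double-check the per-transport breakdown above, along the same
lines as the commands already used (the log path is the one from this comment):

# disconnect count per client translator; expect 3 for 0-* and 2 for 2-* names
grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 \
    | cut -d" " -f 7 | sort | uniq -c | sort -rn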

--- Additional comment from Pranith Kumar K on 2016-12-23 06:46:49 EST ---

(In reply to Raghavendra G from comment #9)
> Some data from logs:
> 
> * The number of disconnects were 240
> [root at gprfc069 ~]# grep "disconnected from"
> /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 |  sort | 
> wc -l
> 240
> 
> * Of which 144 were from 1st graph
> [root at gprfc069 ~]# grep "disconnected from"
> /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "0-"
> | sort |  wc -l
> 144
> 
> * and 96 were from 3rd graph
> [root at gprfc069 ~]# grep "disconnected from"
> /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "2-"
> | sort |  wc -l
> 96
> 
> (where did 2nd graph go?)
> 
> some random interesting facts (I manually verified it looking through logs
> too):
> 
> * there were 3 disconnects per transport in 1st graph - 144/48 = 3.
> * there were 2 disconnects per transport in 3rd graph - 96/48 = 2
> 
> Among these 1 disconnect per transport can be attributed to graph switch.
> 
> [root at gprfc069 ~]# grep "current graph is no longer active"
> /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "0-"
> | wc -l
> 48
> 
> [root at gprfc069 ~]# grep "current graph is no longer active"
> /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "2-"
> | wc -l
> 48
> 
> There were bunch of msgs indicating that client was not able to get remote
> brick's port number
> 
> [root at gprfc069 ~]# grep "failed to get the port number for"
> /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep
> "0-" | wc -l
> 48
> 
> [root at gprfc069 ~]# grep "failed to get the port number for"
> /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep
> "2-" | wc -l
> 48
> 
> Note that we disconnect when we cannot get port number and start a fresh
> connection sequence. So, 1 disconnect per transport for both transport can
> be attributed to port-map failure

Raghavendra,
     This is not RPC issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6

We can't believe old logs :-).

Pranith

--- Additional comment from Raghavendra G on 2016-12-23 06:51:51 EST ---

> Raghavendra,
>      This is not RPC issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6
> 
> We can't believe old logs :-).

I know. Just posted other evidence to corroborate comment 6 :).

> 
> Pranith

--- Additional comment from Rejy M Cyriac on 2016-12-26 08:27:13 EST ---

At the 'RHGS 3.2.0 - Blocker Bug Triage' meeting on 26 December, it was decided
that this BZ is ACCEPTED AS BLOCKER for the RHGS 3.2.0 release.

--- Additional comment from Manoj Pillai on 2017-01-03 01:11:44 EST ---


Subsequent tests have shown that the issue is not specific to the 36-drive
setup. Changing summary to reflect that.

--- Additional comment from Ashish Pandey on 2017-01-18 02:27:10 EST ---

I reproduced this issue easily on my laptop.

What I have been seeing is that while files are being created and written on
the mount point, some lookups are also happening. These lookups happen without
locks.

At this point, if a parallel version or size update is happening, there is a
possibility that some bricks will have different xattrs. Now ec_check_status,
called via ec_lookup->ec_complete, will trigger this message logging.
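One way to observe the mismatch described above is to compare the
erasure-coding xattrs of the same entry across the bricks of one (4+2)
subvolume; the brick roots below are placeholders, and the path is the first
directory named in the heal messages above:

# the trusted.ec.* xattrs (e.g. trusted.ec.version, trusted.ec.dirty) should
# normally match on all bricks of the subvolume once in-flight updates settle
for b in /bricks/d01/brick /bricks/d02/brick; do      # placeholder brick roots
    echo "== $b =="
    getfattr -d -m 'trusted.ec' -e hex \
        "$b/smf1/file_srcdir/h17-priv/thrd_05/d_001/d_002/d_009" 2>/dev/null
done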
