[Bugs] [Bug 1402621] New: High load one node, gluster fuse clients hang, heal info does not complete

Thu Dec 8 00:22:04 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1402621

            Bug ID: 1402621
           Summary: High load one node, gluster fuse clients hang, heal
                    info does not complete
           Product: GlusterFS
           Version: 3.7.16
         Component: glusterd
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: tu2Bgone at gmail.com
                CC: bugs at gluster.org

Created attachment 1229281
  --> https://bugzilla.redhat.com/attachment.cgi?id=1229281&action=edit
ftp gluster fuse client log (redacted personal information)

Description of problem:

We have a recurring problem: it has occurred twice in the last two days and
more than once before that.

3-node Fedora cluster in AWS (m4.xlarge, Fedora 23 Cloud Edition)
2.5 TB volume
Volume Name: marketplace_nfs
Type: Distributed-Replicate
Volume ID: 528de1b5-0bd5-488b-83cf-c4f3f747e6cd
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.90.5.105:/data/data0/marketplace_nfs
Brick2: 10.90.3.14:/data/data3/marketplace_nfs
Brick3: 10.90.4.195:/data/data0/marketplace_nfs
Brick4: 10.90.5.105:/data/data1/marketplace_nfs
Brick5: 10.90.3.14:/data/data1/marketplace_nfs
Brick6: 10.90.4.195:/data/data1/marketplace_nfs
Options Reconfigured:
server.outstanding-rpc-limit: 128
cluster.self-heal-readdir-size: 16KB
cluster.self-heal-window-size: 3
diagnostics.brick-log-level: INFO
network.ping-timeout: 15
cluster.quorum-type: none
performance.readdir-ahead: on
cluster.self-heal-daemon: enable
performance.cache-size: 512MB
cluster.lookup-optimize: on
cluster.data-self-heal-algorithm: diff
cluster.server-quorum-ratio: 51%
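
For reference, the "2 x 3 = 6" layout above reads as two replica sets of three bricks each. The following is a minimal sketch, assuming that consecutive bricks in the listed order form one replica set; the helper names `replica_sets` and `hosts_per_set` are hypothetical and only illustrate the grouping.

```python
# Sketch: group the bricks from this report into replica sets.
# Assumption: consecutive bricks in the listed order form one replica set
# (replica count 3, taken from "Number of Bricks: 2 x 3 = 6").
BRICKS = [
    "10.90.5.105:/data/data0/marketplace_nfs",
    "10.90.3.14:/data/data3/marketplace_nfs",
    "10.90.4.195:/data/data0/marketplace_nfs",
    "10.90.5.105:/data/data1/marketplace_nfs",
    "10.90.3.14:/data/data1/marketplace_nfs",
    "10.90.4.195:/data/data1/marketplace_nfs",
]


def replica_sets(bricks, replica_count):
    """Split the brick list into consecutive groups of replica_count."""
    if replica_count <= 0 or len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a positive multiple of replica_count")
    return [bricks[i:i + replica_count] for i in range(0, len(bricks), replica_count)]


def hosts_per_set(sets):
    """Return the distinct hosts that appear in each replica set."""
    return [{brick.split(":", 1)[0] for brick in group} for group in sets]


if __name__ == "__main__":
    for index, group in enumerate(replica_sets(BRICKS, 3)):
        print(f"replica set {index}: {', '.join(group)}")
```

Under that assumption, each replica set has one brick on each of the three nodes, so every node holds a full copy of both halves of the distributed volume.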

Status of volume: marketplace_nfs
Gluster process                                  TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------------
Brick 10.90.5.105:/data/data0/marketplace_nfs    49152     0          Y       3426
Brick 10.90.3.14:/data/data3/marketplace_nfs     49154     0          Y       3402
Brick 10.90.4.195:/data/data0/marketplace_nfs    49152     0          Y       4868
Brick 10.90.5.105:/data/data1/marketplace_nfs    49153     0          Y       31636
Brick 10.90.3.14:/data/data1/marketplace_nfs     49153     0          Y       348
Brick 10.90.4.195:/data/data1/marketplace_nfs    49153     0          Y       31238
NFS Server on localhost                          2049      0          Y       3999
Self-heal Daemon on localhost                    N/A       N/A        Y       4008
NFS Server on ip-10-90-5-105.ec2.internal        2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal  N/A       N/A        Y       1496
NFS Server on ip-10-90-4-195.ec2.internal        2049      0          Y       20526
Self-heal Daemon on ip-10-90-4-195.ec2.internal  N/A       N/A        Y       20534

Task Status of Volume marketplace_nfs
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):
3.7.16

How reproducible:

Cannot reproduce on demand but occurs frequently.

Actual results:

Client processes hang and cannot list the GlusterFS mount.
$ gluster volume heal marketplace_nfs info
also hangs and does not list any heal information.
To recover, we shut down the clients (halting them, not unmounting). Once the
clients are down, the heal info command completes, the load starts to drop, and
we can remount.
Recovery takes around 20 minutes and causes significant problems.
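
The hang described above could be detected from a monitoring script by running the heal info command under a time limit. This is only a sketch under stated assumptions: the helper name `run_with_timeout` and the 60-second limit are hypothetical and not part of this report, and only the `gluster volume heal marketplace_nfs info` command itself comes from the text.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, stdout); finished is False on timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False, ""
    return True, result.stdout


if __name__ == "__main__":
    # Volume name from this report; the 60 s limit is an arbitrary assumption.
    finished, output = run_with_timeout(
        ["gluster", "volume", "heal", "marketplace_nfs", "info"], timeout_s=60
    )
    if finished:
        print(output)
    else:
        print("heal info did not complete within 60 s")
```

A timeout here only shows that the command did not return in time; it does not by itself identify which node or brick is responsible.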

Expected results:
Clients do not hang, and heal info completes.

Additional info:

The average file size is 13 MB; the largest files are around 5 GB. We do some
post-processing after the initial upload (mv, unzip, mv, delete). We have logs
from the FTP server. Web servers also mount and work off this volume, but we do
not have logs from them.

The Gluster servers log nothing useful during this time. I will attach
statedumps as well as the client log.
