[Bugs] [Bug 1311354] New: File operation hangs in 26 node cluster under heavy load

bugzilla at redhat.com bugzilla at redhat.com
Wed Feb 24 02:28:33 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1311354

            Bug ID: 1311354
           Summary: File operation hangs in 26 node cluster under heavy
                    load
           Product: GlusterFS
           Version: 3.5.5
         Component: fuse
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: wymonsoon at gmail.com
                CC: bugs at gluster.org



Created attachment 1130004
  --> https://bugzilla.redhat.com/attachment.cgi?id=1130004&action=edit
client side log

Description of problem:
We are using GlusterFS 3.5.5.
The server side is a 26-node cluster, with one brick per node.
The client side is a 32-node cluster (including the 26 server nodes) that runs
distributed video transcoding.  GlusterFS is the shared filesystem across the
32 machines, mounted via FUSE.
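
For reference, each client mounts the volume with the native FUSE client
roughly as follows (the mount point /mnt/gfs is hypothetical, and any of the
26 servers can be named as the mount target; hzsq-encode-33 is simply the
first in the brick list):

  # /mnt/gfs is an assumed mount point
  mount -t glusterfs hzsq-encode-33:/hzsq_encode_02 /mnt/gfs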

We found that when the workload is high, clients often hang on file operations
on the GlusterFS mount.  The client log indicates that the client loses the
ping from the server, which leads to a series of "Transport endpoint is not
connected" errors in the log.
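
The relevant messages can be pulled from the client log like so (these are
the usual ping-timeout and ENOTCONN messages; GlusterFS names the client log
after the mount point, so the path below assumes a /mnt/gfs mount):

  grep -E "has not responded|Transport endpoint is not connected" \
      /var/log/glusterfs/mnt-gfs.log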

Version-Release number of selected component (if applicable):
3.5.5

How reproducible:
Dozens of times per hour under heavy load


Steps to Reproduce:
1. Run the distributed video-transcoding workload across the 32 client nodes.
2. Drive the workload high; file operations on the GlusterFS mount begin to
   hang, dozens of times per hour.

Actual results:
File operations on the GlusterFS mount hang; the client log fills with
"Transport endpoint is not connected" errors.

Expected results:
All file operations complete correctly, with no hangs.


Additional info:
OS: Debian 8.2
Kernel:  3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64
GNU/Linux

TCP connectivity between the client and the servers remains intact during the
hang period.
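
This can be cross-checked by confirming that the TCP connections to glusterd
and the brick processes stay ESTABLISHED during a hang (24007 is the glusterd
port; bricks listen on 49152 and up in this release):

  ss -ant | grep -E ':(24007|49[0-9]{3})'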

Our volume info:
Volume Name: hzsq_encode_02
Type: Distributed-Replicate
Volume ID: 653b554b-47aa-4f25-a102-7ac6858f41e1
Status: Started
Number of Bricks: 13 x 2 = 26
Transport-type: tcp
Bricks:
Brick1: hzsq-encode-33:/data/gfs-brk
Brick2: hzsq-encode-34:/data/gfs-brk
Brick3: hzsq-encode-41:/data/gfs-brk
Brick4: hzsq-encode-42:/data/gfs-brk
Brick5: hzsq-encode-43:/data/gfs-brk
Brick6: hzsq-encode-44:/data/gfs-brk
Brick7: hzsq-encode-45:/data/gfs-brk
Brick8: hzsq-encode-46:/data/gfs-brk
Brick9: hzsq-encode-47:/data/gfs-brk
Brick10: hzsq-encode-48:/data/gfs-brk
Brick11: hzsq-encode-49:/data/gfs-brk
Brick12: hzsq-encode-50:/data/gfs-brk
Brick13: hzsq-encode-51:/data/gfs-brk
Brick14: hzsq-encode-52:/data/gfs-brk
Brick15: hzsq-encode-53:/data/gfs-brk
Brick16: hzsq-encode-54:/data/gfs-brk
Brick17: hzsq-encode-55:/data/gfs-brk
Brick18: hzsq-encode-56:/data/gfs-brk
Brick19: hzsq-encode-57:/data/gfs-brk
Brick20: hzsq-encode-58:/data/gfs-brk
Brick21: hzsq-encode-59:/data/gfs-brk
Brick22: hzsq-encode-60:/data/gfs-brk
Brick23: hzsq-encode-61:/data/gfs-brk
Brick24: hzsq-encode-62:/data/gfs-brk
Brick25: hzsq-encode-63:/data/gfs-brk
Brick26: hzsq-encode-64:/data/gfs-brk
Options Reconfigured:
nfs.disable: On
performance.io-thread-count: 32
performance.cache-refresh-timeout: 1
performance.write-behind-window-size: 1MB
performance.cache-size: 128MB
performance.flush-behind: On
server.outstanding-rpc-limit: 0
performance.read-ahead: On
performance.io-cache: On
performance.quick-read: off
nfs.outstanding-rpc-limit: 0
network.ping-timeout: 20
server.statedump-path: /tmp
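
Note that network.ping-timeout above is set to 20 seconds, below the GlusterFS
default of 42, so a server that is slow to answer pings under heavy load is
declared dead that much sooner.  Two standard gluster CLI commands that may
help with triage (an experiment, not a confirmed fix):

  # restore the default ping-timeout to give loaded servers more headroom
  gluster volume set hzsq_encode_02 network.ping-timeout 42
  # capture a statedump (written to /tmp per server.statedump-path) during a hang
  gluster volume statedump hzsq_encode_02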
