[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd

陈陈 chenchen at smartquerier.com
Fri Mar 25 03:29:04 UTC 2016

Hi Everyone,

I have a "2 x (4 + 2) = 12 Distributed-Disperse" volume. After upgrading 
to 3.7.8, I noticed that the volume frequently goes out of service. 
glustershd.log is flooded with:

[ec-combine.c:866:ec_combine_check] 0-mainvol-disperse-1: Mismatching 
xdata in answers of 'LOOKUP'"
[ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed 
on some subvolumes (up=3F, mask=3F, remaining=0, good=1E, bad=21)
[ec-common.c:71:ec_heal_report] 0-mainvol-disperse-1: Heal failed 
[Invalid argument]
[ec-combine.c:206:ec_iatt_combine] 0-mainvol-disperse-0: Failed to 
combine iatt (inode: xxx, links: 1-1, uid: 1000-1000, gid: 1000-1000, 
rdev: 0-0, size: xxx-xxx, mode: 100600-100600)

during normal operation, and sometimes with 1000+ lines of:

[client-rpc-fops.c:466:client3_3_open_cbk] 0-mainvol-client-7: remote 
operation failed. Path: <gfid:xxxx> (xxxx) [Too many open files]

and the brick went offline. "top open" showed "Max open fds: 899195".
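
For reference, the up/mask/good/bad values in the ec_check_status line look 
like hexadecimal bitmasks with one bit per brick of the disperse subvolume. 
The sketch below decodes them under that assumption; the helper name 
set_bits is made up for illustration, and whether bit 0 corresponds to the 
first brick of the subvolume is also an assumption, not something the log 
itself states.

```python
def set_bits(mask_hex):
    """Return the zero-based positions of the bits set in a hex mask string."""
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Values copied from the ec_check_status log line quoted above.
status = {"up": "3F", "mask": "3F", "good": "1E", "bad": "21"}

for name, value in status.items():
    # Show each mask as 6 binary digits (4 + 2 bricks) and as bit positions.
    print(name, format(int(value, 16), "06b"), set_bits(value))
```

Read this way, good=1E covers bit positions 1 to 4 and bad=21 covers 
positions 0 and 5, so two of the six answers in that subvolume disagreed 
with the other four.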

Can anyone suggest what happened and what I should do? I was trying 
to fix a terrible IOPS problem, but things got even worse.

Each server has 2 x E5-2630v3 (32 threads/server) and 32GB RAM. Additional 
information is in the attachments. Many thanks.

Volume Name: mainvol
Type: Distributed-Disperse
Volume ID: 2e190c59-9e28-43a5-b22a-24f75e9a580b
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Brick1: sm11:/mnt/disk1/mainvol
Brick2: sm12:/mnt/disk1/mainvol
Brick3: sm13:/mnt/disk1/mainvol
Brick4: sm14:/mnt/disk2/mainvol
Brick5: sm15:/mnt/disk2/mainvol
Brick6: sm16:/mnt/disk2/mainvol
Brick7: sm11:/mnt/disk2/mainvol
Brick8: sm12:/mnt/disk2/mainvol
Brick9: sm13:/mnt/disk2/mainvol
Brick10: sm14:/mnt/disk1/mainvol
Brick11: sm15:/mnt/disk1/mainvol
Brick12: sm16:/mnt/disk1/mainvol
Options Reconfigured:
server.outstanding-rpc-limit: 256
network.remote-dio: false
performance.io-cache: true
performance.readdir-ahead: on
auth.allow: 172.16.135.*
performance.cache-size: 16GB
client.event-threads: 8
server.event-threads: 8
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
nfs.disable: on
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
cluster.lookup-optimize: on
cluster.readdir-optimize: on
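
To relate the names in the log messages (mainvol-client-7, 
mainvol-disperse-1) to the bricks listed above, here is a small sketch. It 
assumes that "client-N" is the N-th brick of the volume info listing counted 
from zero, that "disperse-K" groups six consecutive bricks, and that bit i 
of an ec mask refers to the i-th brick of that group. All function names are 
hypothetical helpers, not part of any Gluster tool.

```python
# Brick list copied from the volume info output above (Brick1..Brick12).
BRICKS = [
    "sm11:/mnt/disk1/mainvol", "sm12:/mnt/disk1/mainvol",
    "sm13:/mnt/disk1/mainvol", "sm14:/mnt/disk2/mainvol",
    "sm15:/mnt/disk2/mainvol", "sm16:/mnt/disk2/mainvol",
    "sm11:/mnt/disk2/mainvol", "sm12:/mnt/disk2/mainvol",
    "sm13:/mnt/disk2/mainvol", "sm14:/mnt/disk1/mainvol",
    "sm15:/mnt/disk1/mainvol", "sm16:/mnt/disk1/mainvol",
]
DATA, REDUNDANCY = 4, 2  # from "Number of Bricks: 2 x (4 + 2) = 12"


def client_brick(client_index):
    """Assumption: 'mainvol-client-N' is the N-th brick, counted from zero."""
    return BRICKS[client_index]


def disperse_bricks(subvol_index):
    """Assumption: 'mainvol-disperse-K' holds consecutive bricks, K from zero."""
    size = DATA + REDUNDANCY
    return BRICKS[subvol_index * size:(subvol_index + 1) * size]


def bricks_for_mask(subvol_index, mask):
    """Assumption: bit i of an ec mask is the i-th brick of the subvolume."""
    group = disperse_bricks(subvol_index)
    return [brick for i, brick in enumerate(group) if (mask >> i) & 1]


print(client_brick(7))            # brick behind 0-mainvol-client-7
print(bricks_for_mask(1, 0x21))   # bricks flagged by bad=21 on disperse-1
```

If those assumptions hold, mainvol-client-7 would be sm12:/mnt/disk2/mainvol, 
and bad=21 on mainvol-disperse-1 would point at sm11:/mnt/disk2/mainvol and 
sm16:/mnt/disk1/mainvol.
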
Status of volume: mainvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick sm11:/mnt/disk1/mainvol               49152     0          Y       16501
Brick sm12:/mnt/disk1/mainvol               49152     0          Y       15007
Brick sm13:/mnt/disk1/mainvol               49154     0          Y       13123
Brick sm14:/mnt/disk2/mainvol               49154     0          Y       14947
Brick sm15:/mnt/disk2/mainvol               49152     0          Y       13236
Brick sm16:/mnt/disk2/mainvol               49152     0          Y       14762
Brick sm11:/mnt/disk2/mainvol               49153     0          Y       23039
Brick sm12:/mnt/disk2/mainvol               49153     0          Y       19614
Brick sm13:/mnt/disk2/mainvol               49155     0          Y       15387
Brick sm14:/mnt/disk1/mainvol               49155     0          Y       23231
Brick sm15:/mnt/disk1/mainvol               49153     0          Y       28494
Brick sm16:/mnt/disk1/mainvol               49153     0          Y       17656
Self-heal Daemon on localhost               N/A       N/A        Y       25029
Self-heal Daemon on sm11                    N/A       N/A        Y       23634
Self-heal Daemon on sm13                    N/A       N/A        Y       17394
Self-heal Daemon on sm14                    N/A       N/A        Y       31322
Self-heal Daemon on sm12                    N/A       N/A        Y       19609
Self-heal Daemon on hw10                    N/A       N/A        Y       14926
Self-heal Daemon on sm16                    N/A       N/A        Y       17648
