[Gluster-users] gluster server overload; recovers, now "Transport endpoint is not connected" for some files

Harry Mangalam hjmangalam at gmail.com
Thu Aug 2 01:20:54 UTC 2012


Hi All,

I'm using 3.3-1 on IPoIB servers that are serving native gluster
clients, mostly over GbE:

$ gluster volume info

Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl
Brick3: pbs3ib:/bducgl
Brick4: pbs4ib:/bducgl
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: off
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on

Everything was working fine until a few hours ago, when a user noticed
that gluster performance was very slow. It turns out that one (and only
one) of the 4 gluster servers had a load of >30 and its glusterfsd was
consuming >3GB of RAM:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20953 root      20   0 3862m 3.2g 1240 R  382 20.7  19669:20
glusterfsd
  443 root      20   0     0    0    0 S    4  0.0   6017:19 md127_raid5
 ...
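
For what it's worth, the per-brick state can also be checked with the
gluster CLI rather than via top; a sketch of what I believe works on
3.3, using the 'gli' volume above:

$ gluster volume status gli
$ gluster volume status gli detail
# 'status' lists each brick's port, online state and PID;
# 'detail' adds per-brick disk and inode usage.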

The init.d scripts are meant for RH distros, so they wouldn't initiate
a restart on this Ubuntu server. Rather than go through the rewrite
process then, I killed off the offending glusterfsd and restarted it
manually (perhaps ill-advised).
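
In hindsight, it would probably have been cleaner to let glusterd
restart the brick rather than killing it off by hand; my understanding
(untested here) is that a forced start only spawns brick processes
that aren't already running:

$ gluster volume start gli force
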
The node immediately re-joined the others:
root@pbs1:~
730 $ gluster peer status
Number of Peers: 3

Hostname: pbs4ib
Uuid: 2a593581-bf45-446c-8f7c-212c53297803
State: Peer in Cluster (Connected)

Hostname: pbs2ib
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
State: Peer in Cluster (Connected)

Hostname: pbs3ib
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
State: Peer in Cluster (Connected)

and the volume did appear to be reconstituted. The load dropped almost
to zero and responsiveness improved, but almost immediately users
started reporting locked or missing files. As you might guess, the
files appear to be on the bricks, but on the client they appear to be
locked, even when they're not visible per se:

$ cp -vR DeNovoGenome/* /gl/iychang/DeNovoGenome/
`DeNovoGenome/contigs_k51.fa' -> `/gl/iychang/DeNovoGenome/contigs_k51.fa'
`DeNovoGenome/L7-17-1-Sequences.fastq' ->
`/gl/iychang/DeNovoGenome/L7-17-1-Sequences.fastq'
`DeNovoGenome/L7-9-1-Sequences.fastq' ->
`/gl/iychang/DeNovoGenome/L7-9-1-Sequences.fastq'
`DeNovoGenome/y_lipolitica.1.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.1.ebwt'
`DeNovoGenome/y_lipolitica.2.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.2.ebwt'
cp: cannot create regular file
`/gl/iychang/DeNovoGenome/y_lipolitica.2.ebwt': Transport endpoint is
not connected
`DeNovoGenome/y_lipolitica.3.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.3.ebwt'
`DeNovoGenome/y_lipolitica.4.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.4.ebwt'
`DeNovoGenome/y_lipolitica.fa' -> `/gl/iychang/DeNovoGenome/y_lipolitica.fa'
`DeNovoGenome/y_lipolitica.rev.1.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.rev.1.ebwt'
`DeNovoGenome/y_lipolitica.rev.2.ebwt' ->
`/gl/iychang/DeNovoGenome/y_lipolitica.rev.2.ebwt'

In the case of the problem file above,
$ ls -l /gl/iychang/DeNovoGenome/y_lipolitica.2.ebwt
ls: /gl/iychang/DeNovoGenome/y_lipolitica.2.ebwt: No such file or directory

it doesn't appear to exist on the gluster fs (/gl), yet the replacement
file can't be copied there.
That file DOES exist on the bricks, though:
root@pbs3:/bducgl/iychang/DeNovoGenome
295 $ ls -l y_lipolitica.2.ebwt
-rw-rw-r-- 2 7424 7424 2528956 2012-08-01 17:38 y_lipolitica.2.ebwt

so I suspect that the problem is a messed-up volume index or something like it.
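
If it helps, the layout that fix-layout rewrites is stored in extended
attributes on the brick-side directories, so it can be inspected
directly; a sketch, assuming getfattr is available on the brick nodes:

# dump the DHT layout of the parent directory on a brick server ...
$ getfattr -d -m . -e hex /bducgl/iychang/DeNovoGenome
# ... and the attributes of the file itself
$ getfattr -d -m . -e hex /bducgl/iychang/DeNovoGenome/y_lipolitica.2.ebwt
# trusted.glusterfs.dht is the hash range this brick owns for the
# directory; trusted.gfid is the file's volume-wide ID.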

How do you resolve this issue? I read that gluster 3.3 does
self-healing, and in fact we had an instance where an entire gluster
node went down for several hours (at night, in a low-usage environment)
and, when we brought it back, it immediately re-joined the gluster
group and we did not see any such problems (although it may have been
during an idle time).

The client log is filled with lines like this:

[2012-08-01 18:00:11.976198] W
[client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gli-client-2: remote
operation failed: Transport endpoint is not connected. Path:
/fsaizpoy/input/emim-tf2n/consecutive/8thcSi/output/libc.so.6
(00000000-0000-0000-0000-000000000000)
[2012-08-01 18:00:50.809517] W
[client3_1-fops.c:763:client3_1_statfs_cbk] 0-gli-client-2: remote
operation failed: Transport endpoint is not connected
[2012-08-01 18:01:30.885114] W
[client3_1-fops.c:763:client3_1_statfs_cbk] 0-gli-client-2: remote
operation failed: Transport endpoint is not connected
[2012-08-01 18:02:10.964532] W
[client3_1-fops.c:763:client3_1_statfs_cbk] 0-gli-client-2: remote
operation failed: Transport endpoint is not connected
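
If the client-N numbering follows the brick order in 'volume info',
0-gli-client-2 would be the third brick, pbs3ib:/bducgl, which suggests
this client never re-established its connection to that brick after the
restart. A sketch of what I'd check (the port below is only a
placeholder; use whatever 'volume status' actually reports):

# on a server: note the port the pbs3ib brick listens on
$ gluster volume status gli
# on the affected client: verify that port is reachable
$ nc -zv pbs3ib 24009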

As a best guess, I initiated a 'fix-layout' (the commands are sketched
after the log excerpt below), and while the 'status' printout says that
no files have been rebalanced (expected, since no bricks have been
added), there have been lots of fixes:

[2012-08-01 18:04:31.149116] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/EBGM_CSUNG
[2012-08-01 18:04:31.462275] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/EBGM_CSU_FG
[2012-08-01 18:04:31.778421] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots
[2012-08-01 18:04:31.885009] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/normSep2002sfi
[2012-08-01 18:04:32.337981] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source
[2012-08-01 18:04:32.441383] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/csuScrapShots/source/pgm
[2012-08-01 18:04:32.558827] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/faceGraphsWiskott
[2012-08-01 18:04:32.617823] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/ALPBench/Face_Rec/data/novelGraphsWiskott
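
For reference, the fix-layout was started and monitored with roughly
these commands, run from one of the servers:

$ gluster volume rebalance gli fix-layout start
$ gluster volume rebalance gli status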

Unfortunately, I'm also seeing this:

[2012-08-01 18:07:26.104859] I [dht-layout.c:593:dht_layout_normalize]
0-gli-dht: found anomalies in
/nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input.
holes=1 overlaps=0
[2012-08-01 18:07:26.104910] W
[dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes
down -- not fixing
[2012-08-01 18:07:26.104996] I [dht-common.c:2337:dht_setxattr]
0-gli-dht: fixing the layout of
/nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/input
[2012-08-01 18:07:26.189403] I [dht-layout.c:593:dht_layout_normalize]
0-gli-dht: found anomalies in
/nlduong/benchmarks/SPEC2K6-org/benchspec/CPU2006/403.gcc/data/test/output.
holes=1 overlaps=0
[2012-08-01 18:07:26.189457] W
[dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes
down -- not fixing

which implies that some of the layout errors are not being fixed
because one subvolume still appears to be down.
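
Once all four bricks show as online again, my plan is to re-run the
fix-layout and, if a given client still shows the phantom/locked files,
remount that client to force it to reconnect to all the bricks; a
sketch, assuming the client mounts the volume from pbs1ib:

$ umount /gl
$ mount -t glusterfs pbs1ib:/gli /gl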

Is there a best-practices solution  for this problem?  I suspect this
is one of the most common problems to affect an operating gluster fs.

hjm

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)


