[Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour

harry mangalam harry.mangalam at uci.edu
Sun Jan 5 03:09:46 UTC 2014


There are also some other anomalies.  Even when the files are visible and readable, many 
dirs are unwritable and/or undeletable.

for example: 
====
Sat Jan 04 18:36:17 [0.02 0.08 0.12]  root at hpc-s:/bio/mmacchie
1104 $ mkdir hjmtest
mkdir: cannot create directory `hjmtest': Invalid argument

Sat Jan 04 18:36:23 [0.02 0.08 0.12]  root at hpc-s:/bio/mmacchie
====
The client log says this for that operation (note the timestamp offset - the log is UTC, the prompt is local time):
<http://pastie.org/8602365>

And in many subdirs, new dirs can be made but not deleted:

Sat Jan 04 18:41:45 [0.00 0.04 0.09]  root at hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1109 $ mkdir j1

Sat Jan 04 18:42:00 [0.00 0.03 0.09]  root at hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered
1110 $ rmdir j1
rmdir: failed to remove `j1': Transport endpoint is not connected

Sat Jan 04 18:42:09 [0.08 0.05 0.09]  root at hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

With the client log saying:
====
[2014-01-05 02:42:09.548263] W [client-rpc-fops.c:526:client3_3_stat_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-05 02:42:09.549314] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.550124] W [client-rpc-fops.c:2541:client3_3_opendir_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)
[2014-01-05 02:42:09.552439] W [fuse-bridge.c:1193:fuse_unlink_cbk] 0-glusterfs-fuse: 5805445: RMDIR() /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 => -1 (Transport endpoint is not connected)
[2014-01-05 02:42:12.175860] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:15.181365] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-05 02:42:18.186668] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
====

This is odd - how can a dir be created OK, but the fs then lose track of it when I try to delete it?

And inside that dir (j1), /files/ can be created and deleted, but not other 
/dirs/ (same result as in the parent dir).
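
To see what each backend actually has for that path, my next step will 
probably be to dump the xattrs for it directly on the bricks - a rough sketch, 
assuming /raid1 and /raid2 are plain local mounts on each server (as the 
'volume status' output below suggests):
====
# run on each of the 4 servers: dump all gluster xattrs (gfid, dht layout)
# for the problem dir on both local bricks
for b in /raid1 /raid2; do
    echo "== $(hostname):$b =="
    getfattr -d -m . -e hex \
      "$b/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1" \
      2>/dev/null
done
====
That should at least show whether j1 exists on every brick and whether its 
gfid and layout xattrs agree.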

In looking thru the client log, I see instances of this:
====
[2014-01-05 02:27:20.721043] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes (00000000-0000-0000-0000-000000000000)
[2014-01-05 02:27:20.769058] I [dht-layout.c:630:dht_layout_normalize] 0-gl-dht: found anomalies in /bio/mmacchie/Nematodes. holes=2 overlaps=0
[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing
[2014-01-05 02:27:20.784335] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: 
====
more at: <http://pastie.org/8602381>

This is alarming, since it says: 
[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing

All my servers and bricks appear to be up and online:

Sat Jan 04 18:54:09 [0.76 0.30 0.20]  root at biostor1:~
1003 $ gluster volume status gl detail | egrep "Brick|Online"
Brick                : Brick bs2:/raid1    
Online               : Y                   
Brick                : Brick bs2:/raid2    
Online               : Y                   
Brick                : Brick bs3:/raid1    
Online               : Y                   
Brick                : Brick bs3:/raid2    
Online               : Y                   
Brick                : Brick bs4:/raid1    
Online               : Y                   
Brick                : Brick bs4:/raid2    
Online               : Y                   
Brick                : Brick bs1:/raid1    
Online               : Y                   
Brick                : Brick bs1:/raid2    
Online               : Y                   
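
Though given that every failure in the client log names gl-client-2, maybe the 
problem is this one client's connection to a single brick rather than the 
bricks themselves. If I understand the naming, gl-client-N is the Nth brick 
(counting from 0) in 'gluster volume info gl', so something like this - just a 
sketch, with the host and port filled in from the status output above - should 
show whether that connection is actually up from the client:
====
# which brick does the gl-client-2 subvolume point at?
# (assuming client subvols are numbered in brick order, starting at 0)
gluster volume info gl | grep '^Brick[0-9]'

# then, on the client: is there an ESTABLISHED TCP connection to that
# brick's host and port (port as reported by 'volume status gl detail')?
netstat -tn | grep '<brick-host-ip>:<brick-port>'
====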


The gluster server logs seem to be fairly quiet thru this.  The link below 
contains the last day or so of logs from the 4 servers, reduced with the 
following command to eliminate the 'socket.c:2788' errors:

grep -v socket.c:2788 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

<http://pastie.org/8602412>
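
If it's useful, the same reduction can be run across all 4 servers in one pass 
- a sketch, assuming the bs1-bs4 names resolve from here and root ssh works:
====
for h in bs1 bs2 bs3 bs4; do
    echo "== $h =="
    ssh root@$h \
      "grep -v socket.c:2788 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log"
done
====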

hjm


On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:
> On 01/04/2014 07:21 AM, harry mangalam wrote:
> > This is a distributed-only glusterfs on 4 servers with 2 bricks each on
> > an IPoIB network.
> > 
> > Thanks to a misconfigured autoupdate script, when 3.4.2 was released
> > today, my gluster servers tried to update themselves. 2 succeeded, but
> > then failed to restart, the other 2 failed to update and kept running.
> > 
> > Not realizing the sequence of events, I restarted the 2 that failed to
> > restart, which gave my fs 2 servers running 3.4.1 and 2 running 3.4.2.
> > 
> > When I realized this after about 30m, I shut everything down and updated
> > the 2 remaining to 3.4.2 and then restarted but now I'm getting lots of
> > reports of file errors of the type 'endpoints not connected' and the like:
> > 
> > [2014-01-04 01:31:18.593547] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
> > 
> > [2014-01-04 01:31:18.594928] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
> > 
> > [2014-01-04 01:31:18.595818] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/.#test_cuffdiff.sh (14c3b612-e952-4aec-ae18-7f3dbb422dcc)
> > 
> > [2014-01-04 01:31:18.597381] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
> > 
> > [2014-01-04 01:31:18.598212] W [client-rpc-fops.c:814:client3_3_statfs_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
> > 
> > [2014-01-04 01:31:18.598236] W [dht-diskusage.c:45:dht_du_info_cbk] 0-gl-dht: failed to get disk info from gl-client-2
> > 
> > [2014-01-04 01:31:19.912210] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
> > 
> > [2014-01-04 01:31:22.912717] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
> > 
> > [2014-01-04 01:31:25.913208] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
> > 
> > The servers at the same time provided the following error 'E' messages:
> > 
> > Fri Jan 03 17:46:42 [0.20 0.12 0.13] root at biostor1:~
> > 
> > 1008 $ grep ' E ' /var/log/glusterfs/bricks/raid1.log |grep '2014-01-03'
> > 
> > [2014-01-03 06:11:36.251786] E [server-helpers.c:751:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103) [0x3161e090d3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245) [0x3161e08f85] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/server.so(server3_3_lookup+0xa0) [0x7fa60e577170]))) 0-server: invalid argument: conn
> > 
> > [2014-01-03 06:11:36.251813] E [rpcsvc.c:450:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
> > 
> > [2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
> > 
> > [2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
> 
> rdma.so seems to be missing here. Is glusterfs-rdma-3.4.2-1 rpm
> installed on the servers?
> 
> -Vijay
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---