[Gluster-devel] Broken glusterfs-3.3.1 volume

Wed Apr 2 03:52:31 UTC 2014

Hi

I have been running a 4x2 distributed/replicated cluster on glusterfs
3.3.1 for a while now. It was quite stable, but now one of the servers
has ceased to work correctly. It still serves clients, but on the server,
gluster volume status will say "Connection failed. Please check if
gluster daemon is operational."

glusterd runs and is looping on this error:
[2014-04-02 05:30:03.921659] E [rpcsvc.c:491:rpcsvc_handle_rpc_call]
0-glusterd: Request received from non-privileged port. Failing request
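
For context, "non-privileged port" conventionally means a client source
port outside the range reserved for root on Unix systems (below 1024).
A minimal sketch of that kind of check follows. The function names and
the allow_insecure parameter are hypothetical, chosen for illustration;
this is not the actual glusterd code.

```python
# Illustrative only: 1024 is the conventional Unix boundary below which
# only root may bind a port. This is NOT taken from the glusterd source.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged_port(port: int) -> bool:
    """Return True if the source port lies in the reserved (root-only) range."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def accept_request(source_port: int, allow_insecure: bool = False) -> bool:
    """Decide whether a request would be accepted.

    allow_insecure is a hypothetical switch standing in for whatever
    setting relaxes the check on a real server.
    """
    return allow_insecure or is_privileged_port(source_port)


if __name__ == "__main__":
    # A request from an ephemeral port is refused unless the check is relaxed.
    print(accept_request(49152))
    print(accept_request(49152, allow_insecure=True))
    print(accept_request(1023))
```

Under this assumption, a client connecting from a high source port is
rejected while the service itself keeps running, which would be
consistent with the looping error above.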

The glustershd log loops on:
[2014-04-02 05:30:54.791893] W
[socket.c:1521:__socket_proto_state_machine] 0-bacasable-var-client-1:
reading from socket failed. Error (Socket is not connected), peer
(192.0.2.109:24007)
[2014-04-02 05:30:54.791996] E [rpc-clnt.c:373:saved_frames_unwind]
0-bacasable-var-client-1: forced unwinding frame type(GF-DUMP)
op(DUMP(1)) called at 2014-04-02 05:30:54.791717 (xid=0x26461x)
[2014-04-02 05:30:54.792037] W
[client-handshake.c:1819:client_dump_version_cbk]
0-bacasable-var-client-1: received RPC status error
[2014-04-02 05:30:54.792072] I [client.c:2090:client_rpc_notify]
0-bacasable-var-client-1: disconnected
[2014-04-02 05:30:54.792175] W
[socket.c:1521:__socket_proto_state_machine] 0-bacasable-var-client-0:
reading from socket failed. Error (Socket is not connected), peer
(192.0.2.110:24007)
[2014-04-02 05:30:54.792235] E [rpc-clnt.c:373:saved_frames_unwind]
0-bacasable-var-client-0: forced unwinding frame type(GF-DUMP)
op(DUMP(1)) called at 2014-04-02 05:30:54.790368 (xid=0x6037x)
[2014-04-02 05:30:54.792268] W
[client-handshake.c:1819:client_dump_version_cbk]
0-bacasable-var-client-0: received RPC status error
[2014-04-02 05:30:54.792300] I [client.c:2090:client_rpc_notify]
0-bacasable-var-client-0: disconnected
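
To see at a glance which translators are affected and how often, the
log can be summarized per domain and severity. The following is a rough
sketch, assuming the wrapped lines shown above have been rejoined into
one entry per line. The regular expression is inferred from these
excerpts only, and the names LOG_RE, summarize and SAMPLE are
hypothetical.

```python
import re
from collections import Counter

# Header layout inferred from the excerpts above (an assumption, not a
# documented format): [timestamp] LEVEL [file:line:function] domain: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count log entries per (domain, level); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            counts[(match.group("domain"), match.group("level"))] += 1
    return counts


# Entries copied from the logs above, with the mail wrapping undone.
SAMPLE = [
    "[2014-04-02 05:30:54.792072] I [client.c:2090:client_rpc_notify] "
    "0-bacasable-var-client-1: disconnected",
    "[2014-04-02 05:30:54.792300] I [client.c:2090:client_rpc_notify] "
    "0-bacasable-var-client-0: disconnected",
    "[2014-04-02 05:30:03.921659] E [rpcsvc.c:491:rpcsvc_handle_rpc_call] "
    "0-glusterd: Request received from non-privileged port. Failing request",
]

if __name__ == "__main__":
    for (domain, level), count in sorted(summarize(SAMPLE).items()):
        print(domain, level, count)
```

Run against the full log file, a summary like this would show whether
only client-0 and client-1 are disconnecting or whether other bricks
are involved as well.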

Now, I am not sure whether it is related or not, but if I copy a large
file to the glusterfs volume, everything is fine at first, then after
some time it suddenly gets extremely slow. An ls at the top of the
volume takes seconds to complete. Stopping the copy (which does not copy
much at that point, as it crawls too) lets the volume recover to its
normal state.

Any idea of what is going on?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu at netbsd.org