[Gluster-users] Instable server with server/server encryption

Mon Dec 7 09:25:57 UTC 2015

Hello,

I'm having problems with glusterfs and server/server encryption.

I have 2 servers (sto1 & sto2) with latest stable version (3.6.7-1 from 
gluster repo) on Debian 8.2 (amd64), with one single volume with 
replication.

Without /var/lib/glusterd/secure-access all works as expected.

Then I shut down both servers (without any client mounting the volume), 
touch /var/lib/glusterd/secure-access on both servers, and start service 
on one of the servers:
root at sto2:~# /etc/init.d/glusterfs-server stop
[ ok ] Stopping glusterfs-server (via systemctl): glusterfs-server.service.

I touch the file:
root at sto2:~# touch /var/lib/glusterd/secure-access

I start the service (the other server is still down):
root at sto2:~# /etc/init.d/glusterfs-server start
[ ok ] Starting glusterfs-server (via systemctl): glusterfs-server.service.
root at sto2:~# ps aux | grep glus
root     22538  1.3  0.4 402828 18668 ?        Ssl  10:07   0:00 
/usr/sbin/glusterd -p /var/run/glusterd.pid
-> it is running.

I check the pool:
root at sto2:~# gluster pool list
UUID                    Hostname              State
5fdb629d-886f-43cb-9a71-582051b0dbb2    sto1...    Disconnected
8f51f101-254e-43f9-82a3-ec02591110b5    localhost Connected

It is what expected at this point.
But now the gluster daemon is dead:
root at sto2:~# gluster pool list
Connection failed. Please check if gluster daemon is operational.

I can stop and start again the service, and it dies after the 1st 
command, whatever the command (tested with 'gluster volume status' which 
answers 'Volume HOME is not started' which is the correct state as I 
stoped the only volume before activating server/server encryption).

Note that at this point the other server is still down and no client is 
started.
See at the end the "crash log" from the server.

I guess it is not the expected behavior, and it is clearly a different 
behavior than without server/server encryption. For example if I remove 
the secure-access file:

root at sto2:~# /etc/init.d/glusterfs-server stop
[ ok ] Stopping glusterfs-server (via systemctl): glusterfs-server.service.
root at sto2:~# rm /var/lib/glusterd/secure-access
root at sto2:~# /etc/init.d/glusterfs-server start
[ ok ] Starting glusterfs-server (via systemctl): glusterfs-server.service.
root at sto2:~# gluster pool list
UUID                    Hostname              State
5fdb629d-886f-43cb-9a71-582051b0dbb2    sto1...    Disconnected
8f51f101-254e-43f9-82a3-ec02591110b5    localhost Connected

And whatever I do the daemon is still alive and responding.

Is this a bug or I missed something needed when moving to server/server 
encryption?

Moreover if I try to start the other server without performing any 
action on the 1st (to prevent crash I have a "ping-pong" crash (start at 
sto2 then start at sto1):
root at sto2:~# /etc/init.d/glusterfs-server start
[ ok ] Starting glusterfs-server (via systemctl): glusterfs-server.service.
root at sto1:~# /etc/init.d/glusterfs-server start
[ ok ] Starting glusterfs-server (via systemctl): glusterfs-server.service.
root at sto1:~# gluster pool list
UUID                    Hostname              State
8f51f101-254e-43f9-82a3-ec02591110b5    sto2.liris.cnrs.fr Disconnected
5fdb629d-886f-43cb-9a71-582051b0dbb2    localhost Connected
-> here daemon is dead on sto2. Let restart sto2 daemon:
root at sto2:~# /etc/init.d/glusterfs-server restart
[ ok ] Restarting glusterfs-server (via systemctl): 
glusterfs-server.service.
root at sto2:~# gluster pool list
UUID                    Hostname              State
5fdb629d-886f-43cb-9a71-582051b0dbb2    sto1.liris.cnrs.fr Disconnected
8f51f101-254e-43f9-82a3-ec02591110b5    localhost Connected
-> here daemon is dead on sto1.
root at sto1:~# gluster pool list
Connection failed. Please check if gluster daemon is operational.

If I restart both daemons (mostly) at the same time it works fine:
root at sto1:~# /etc/init.d/glusterfs-server restart
[ ok ] Restarting glusterfs-server (via systemctl): 
glusterfs-server.service.
root at sto2:~# /etc/init.d/glusterfs-server restart
[ ok ] Restarting glusterfs-server (via systemctl): glusterfs-server.service
root at sto1:~# gluster pool list
UUID                    Hostname              State
8f51f101-254e-43f9-82a3-ec02591110b5    sto2.liris.cnrs.fr Connected
5fdb629d-886f-43cb-9a71-582051b0dbb2    localhost Connected
root at sto2:~# gluster pool list
UUID                    Hostname              State
5fdb629d-886f-43cb-9a71-582051b0dbb2    sto1.liris.cnrs.fr Connected
8f51f101-254e-43f9-82a3-ec02591110b5    localhost Connected

Of course this is not an expected behavior as after a global shutdown 
servers may not restart at the same time. Moreover it is a real problem 
when shuting down a single server (i.e. for maintenance) as I get again 
the "ping-pong" problem.

Any help would be appreciate.

Note : before that these 2 servers were used for testing replicated 
volumes (without encryption) without any problem.

Regards,
--
Y.

Log from sto2:

cat /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

[2015-12-07 09:09:43.345640] I [MSGID: 100030] [glusterfsd.c:2035:main] 
0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.6.7 
(args: /usr/sbin/glusterd -p /var/run/glusterd.pid)
[2015-12-07 09:09:43.352452] I [glusterd.c:1214:init] 0-management: 
Maximum allowed open file descriptors set to 65536
[2015-12-07 09:09:43.352516] I [glusterd.c:1259:init] 0-management: 
Using /var/lib/glusterd as working directory
[2015-12-07 09:09:43.359063] I [socket.c:3880:socket_init] 
0-socket.management: SSL support on the I/O path is ENABLED
[2015-12-07 09:09:43.359102] I [socket.c:3883:socket_init] 
0-socket.management: SSL support for glusterd is ENABLED
[2015-12-07 09:09:43.359138] I [socket.c:3900:socket_init] 
0-socket.management: using private polling thread
[2015-12-07 09:09:43.361848] W [rdma.c:4440:__gf_rdma_ctx_create] 
0-rpc-transport/rdma: rdma_cm event channel creation failed (Aucun 
périphérique de ce type)
[2015-12-07 09:09:43.361885] E [rdma.c:4744:init] 0-rdma.management: 
Failed to initialize IB Device
[2015-12-07 09:09:43.361902] E [rpc-transport.c:333:rpc_transport_load] 
0-rpc-transport: 'rdma' initialization failed
[2015-12-07 09:09:43.362023] W [rpcsvc.c:1524:rpcsvc_transport_create] 
0-rpc-service: cannot create listener, initing the transport failed
[2015-12-07 09:09:43.362267] I [socket.c:3883:socket_init] 
0-socket.management: SSL support for glusterd is ENABLED
[2015-12-07 09:09:46.812491] I 
[glusterd-store.c:2048:glusterd_restore_op_version] 0-glusterd: 
retrieved op-version: 30603
[2015-12-07 09:09:47.192205] I 
[glusterd-handler.c:3179:glusterd_friend_add_from_peerinfo] 
0-management: connect returned 0
[2015-12-07 09:09:47.192321] I [rpc-clnt.c:969:rpc_clnt_connection_init] 
0-management: setting frame-timeout to 600
[2015-12-07 09:09:47.192564] I [socket.c:3880:socket_init] 0-management: 
SSL support on the I/O path is ENABLED
[2015-12-07 09:09:47.192585] I [socket.c:3883:socket_init] 0-management: 
SSL support for glusterd is ENABLED
[2015-12-07 09:09:47.192601] I [socket.c:3900:socket_init] 0-management: 
using private polling thread
[2015-12-07 09:09:47.195831] E [socket.c:3016:socket_connect] 
0-management: connection attempt on  failed, (Connexion refusée)
[2015-12-07 09:09:47.196341] I [MSGID: 106004] 
[glusterd-handler.c:4398:__glusterd_peer_rpc_notify] 0-management: Peer 
5fdb629d-886f-43cb-9a71-582051b0dbb2, in Peer in Cluster state, has 
disconnected from glusterd.
[2015-12-07 09:09:47.196413] E [socket.c:384:ssl_setup_connection] 
0-management: SSL connect error
[2015-12-07 09:09:47.196480] E [socket.c:2386:socket_poller] 
0-management: client setup failed
[2015-12-07 09:09:47.196534] E [glusterd-utils.c:181:glusterd_unlock] 
0-management: Cluster lock not held!
[2015-12-07 09:09:47.196642] I [mem-pool.c:545:mem_pool_destroy] 
0-management: size=588 max=0 total=0
[2015-12-07 09:09:47.196671] I [mem-pool.c:545:mem_pool_destroy] 
0-management: size=124 max=0 total=0
[2015-12-07 09:09:47.196787] I [glusterd.c:146:glusterd_uuid_init] 
0-management: retrieved UUID: 8f51f101-254e-43f9-82a3-ec02591110b5
Final graph:
+------------------------------------------------------------------------------+
   1: volume management
   2:     type mgmt/glusterd
   3:     option transport.socket.ssl-enabled on
   4:     option rpc-auth.auth-glusterfs on
   5:     option rpc-auth.auth-unix on
   6:     option rpc-auth.auth-null on
   7:     option transport.socket.listen-backlog 128
   8:     option ping-timeout 30
   9:     option transport.socket.read-fail-log off
  10:     option transport.socket.keepalive-interval 2
  11:     option transport.socket.keepalive-time 10
  12:     option transport-type rdma
  13:     option working-directory /var/lib/glusterd
  14: end-volume
  15:
+------------------------------------------------------------------------------+
[2015-12-07 09:09:50.348636] E [socket.c:2859:socket_connect] (--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x147)[0x7f1b5a951497] 
(--> 
/usr/lib/x86_64-linux-gnu/glusterfs/3.6.7/rpc-transport/socket.so(+0x6c32)[0x7f1b545c3c32] 
(--> 
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f1b5a723469] 
(--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xcd)[0x7f1b5a96b40d] 
(--> /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f1b5a0e50a4] 
))))) 0-socket: invalid argument: this->private
[2015-12-07 09:09:53.349724] E [socket.c:2859:socket_connect] (--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x147)[0x7f1b5a951497] 
(--> 
/usr/lib/x86_64-linux-gnu/glusterfs/3.6.7/rpc-transport/socket.so(+0x6c32)[0x7f1b545c3c32] 
(--> 
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f1b5a723469] 
(--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xcd)[0x7f1b5a96b40d] 
(--> /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f1b5a0e50a4] 
))))) 0-socket: invalid argument: this->private
[2015-12-07 09:09:55.604061] W 
[glusterd-op-sm.c:4073:glusterd_op_modify_op_ctx] 0-management: op_ctx 
modification failed
[2015-12-07 09:09:55.604797] I 
[glusterd-handler.c:3836:__glusterd_handle_status_volume] 0-management: 
Received status volume req for volume HOME
[2015-12-07 09:09:55.605488] E 
[glusterd-syncop.c:1184:gd_stage_op_phase] 0-management: Staging of 
operation 'Volume Status' failed on localhost : Volume HOME is not started
[2015-12-07 09:09:47.196634] I [MSGID: 106004] 
[glusterd-handler.c:4398:__glusterd_peer_rpc_notify] 0-management: Peer 
5fdb629d-886f-43cb-9a71-582051b0dbb2, in Peer in Cluster state, has 
disconnected from glusterd.
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-12-07 09:09:56
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.7
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb1)[0x7f1b5a9522a1]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f1b5a96919d]
/lib/x86_64-linux-gnu/libc.so.6(+0x35180)[0x7f1b5996e180]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_connect+0x8)[0x7f1b5a721f48]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9)[0x7f1b5a723469]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xcd)[0x7f1b5a96b40d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f1b5a0e50a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f1b59a1f04d]
---------

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3369 bytes
Desc: Signature cryptographique S/MIME
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20151207/adae8c19/attachment.p7s>