[Gluster-users] Unable to make HA work; mounts hang on remote node reboot

Ravishankar N ravishankar at redhat.com
Wed Apr 8 04:26:37 UTC 2015



On 04/07/2015 10:11 PM, CJ Baar wrote:
> Then, I issue “init 0” on node2, and the mount on node1 becomes unresponsive. This is the log from node1
> [2015-04-07 16:36:04.250693] W [glusterd-op-sm.c:4021:glusterd_op_modify_op_ctx] 0-management: op_ctx modification failed
> [2015-04-07 16:36:04.251102] I [glusterd-handler.c:3803:__glusterd_handle_status_volume] 0-management: Received status volume req for volume test1
> The message "I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd." repeated 39 times between [2015-04-07 16:34:40.609878] and [2015-04-07 16:36:37.752489]
> [2015-04-07 16:36:40.755989] I [MSGID: 106004] [glusterd-handler.c:4365:__glusterd_peer_rpc_notify] 0-management: Peer 1069f037-13eb-458e-a9c4-0e7e79e595d0, in Peer in Cluster state, has disconnected from glusterd.
This is the glusterd log. Could you also share the mount log from the 
healthy node, covering the interval from when the mount became 
unresponsive until it recovered?
If this is indeed the ping-timer issue, you should see something like: 
"server xxx has not responded in the last 42 seconds, disconnecting."
Have you, for testing's sake, tried reducing the network.ping-timeout 
value and checking whether the hang lasts only that long?
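For example, something like the following (the 10-second value is just 
for the test; I am assuming the volume name test1 from your logs):

    # temporarily lower the ping timeout from the 42-second default
    gluster volume set test1 network.ping-timeout 10

    # confirm the reconfigured value under "Options Reconfigured"
    gluster volume info test1

If the mount on node1 recovers after roughly 10 seconds instead of 42, 
that tells us the hang is just the ping-timeout window.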
>
> This does not seem like desired behaviour. I was trying to create this cluster because I was under the impression it would be more resilient than a single-point-of-failure NFS server. However, if the mount halts when one node in the cluster dies, then I’m no better off.
>
> I also can’t seem to figure out how to bring a volume online if only one node in the cluster is running; again, not really functioning as HA. The gluster service runs and the volume “starts”, but it is not “online” or mountable until both nodes are running. In a situation where a node fails and we need storage online before we can troubleshoot the cause of the node failure, how do I get a volume to go online?
This is expected behavior. In a two-node cluster, if only one node is 
powered on, glusterd will not start the other gluster processes (brick, 
nfs, shd) until the glusterd on the other node is also up, i.e. until 
quorum is met. If you want to override this behavior, run `gluster vol 
start <volname> force` on the node that is up.
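A rough example on the surviving node (the hostname and mount point 
below are only illustrative; again assuming the volume is test1):

    # bring the bricks on this node online even though quorum is not met
    gluster vol start test1 force

    # verify the brick process now shows as Online
    gluster volume status test1

    # clients can then mount against the surviving node
    mount -t glusterfs node1:/test1 /mnt/test1

Since this overrides server quorum, treat it as an emergency measure 
while the other node is genuinely down.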

-Ravi
>
> Thanks.


