[Gluster-users] GlusterFS cluster stalls if one server from the cluster goes down and then comes back up

Wed Mar 23 15:52:59 UTC 2016

On 03/23/2016 02:01 PM, Daniel Kanchev wrote:
> Hi, everyone.
>
> We are using GlusterFS configured in the following way:
>
> [root@web1 ~]# gluster volume info
>
> Volume Name: share
> Type: Replicate
> Volume ID: hidden data on purpose
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: c10839:/gluster
> Brick2: c10840:/gluster
> Brick3: web3:/gluster
> Options Reconfigured:
> cluster.consistent-metadata: on
> performance.readdir-ahead: on
> nfs.disable: true
> cluster.self-heal-daemon: on
> cluster.metadata-self-heal: on
> auth.allow: hidden data on purpose
> performance.cache-size: 256MB
> performance.io-thread-count: 8
> performance.cache-refresh-timeout: 3
>
> Here is the output of the status command for the volume and the peers:
>
> [root@web1 ~]# gluster volume status
> Status of volume: share
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick c10839:/gluster                       49152     0          Y       540
> Brick c10840:/gluster                       49152     0          Y       533
> Brick web3:/gluster                         49152     0          Y       782
> Self-heal Daemon on localhost               N/A       N/A        Y       602
> Self-heal Daemon on web3                    N/A       N/A        Y       790
> Self-heal Daemon on web4                    N/A       N/A        Y       636
> Self-heal Daemon on web2                    N/A       N/A        Y       523
>
> Task Status of Volume share
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> [root@web1 ~]# gluster peer status
> Number of Peers: 3
>
> Hostname: web3
> Uuid: b138b4d5-8623-4224-825e-1dfdc3770743
> State: Peer in Cluster (Connected)
>
> Hostname: web2
> Uuid: b3926959-3ae8-4826-933a-4bf3b3bd55aa
> State: Peer in Cluster (Connected)
> Other names:
> c10840.sgvps.net
>
> Hostname: web4
> Uuid: f7553cba-c105-4d2c-8b89-e5e78a269847
> State: Peer in Cluster (Connected)
>
> All in all, three of our servers host bricks and actually store the
> data, and a fourth server is just a peer that is connected to one of
> the other servers.
>
> The Problem: If any of the 4 servers goes down, the cluster continues
> to work as expected. However, once that server comes back up, the
> whole cluster stalls for 30-120 seconds. During this period no I/O
> operations can be executed, and the apps that use the data on the
> GlusterFS volume simply go down because they cannot read or write any
> data.
>
> We suspect that the issue is related to the self-heal daemons, but we
> are not sure. Could you please advise how to debug this issue and what
> could be causing the whole cluster to stall? If it is the self-heal,
> as we suspect, do you think it is OK to disable it? If some of the
> settings are causing this problem, could you please advise how to
> configure the cluster to avoid it?
>

What version of gluster is this?
Do you observe the problem even when only the 4th 'non data' server 
comes up? In that case it is unlikely that self-heal is the issue.
Are the clients using FUSE or NFS mounts?
-Ravi
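
For anyone gathering the details asked for above, here is a minimal
sketch, not a definitive procedure. It assumes the volume name "share"
from the quoted output and a Linux host whose `mount` output has the
form "<device> on <mountpoint> type <fstype> (<options>)". The helper
name classify_mounts is made up for illustration. Run the gluster
commands on one of the servers and the mount check on a client.

```shell
#!/bin/sh
# Sketch only: collects the answers to the three questions above.
# VOLUME is an assumption taken from the quoted `gluster volume info`.
VOLUME="share"

# Hypothetical helper: reads `mount`-style lines on stdin and prints
# "fuse: <mountpoint>" for GlusterFS FUSE mounts and "nfs: <mountpoint>"
# for NFS mounts. Mount points containing spaces are not handled.
classify_mounts() {
  awk '$4 == "type" && $5 == "fuse.glusterfs" { print "fuse: " $3 }
       $4 == "type" && $5 ~ /^nfs4?$/         { print "nfs: " $3 }'
}

# FUSE or NFS? Classify whatever is mounted on this host.
if command -v mount >/dev/null 2>&1; then
  mount | classify_mounts
fi

# The remaining checks need the gluster CLI, so skip them elsewhere.
if command -v gluster >/dev/null 2>&1; then
  # Installed GlusterFS version (first line of the version banner).
  gluster --version | head -n 1

  # Entries still pending self-heal, listed per brick. Running this
  # right after a server rejoins shows whether a heal is in progress.
  gluster volume heal "$VOLUME" info
fi
```

To answer the second question, bring only the non-data peer (web4 in
the quoted output) down and back up and see whether the stall still
occurs. If it does, the heal listing should stay empty, which would
point away from self-heal.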
> If any info from the logs is needed, please let us know what you
> need.
>
> Thanks in advance!
>
> Regards,
> Daniel
>
>