[Gluster-users] 3.3.1 Replicate only replicating one way

Wed Mar 6 08:02:35 UTC 2013

In our recent testing, we saw all kinds of weird problems while testing
rebuilding a failed brick in the same 2 node replicate cluster.  Several times
we had to kill off all gluster processes and restart things from scratch to
get the two sides talking correctly again (where both sides thought they were
happily talking to the other side, but self-heal wasn't doing anything).  We'd
run a full heal or stat some files and they wouldn't replicate back to the
other side.  After restarting the processes (not just glusterd, but all of the
glusterfs ones too), things would start working.  Once things were running and
the nodes were properly replicating, it appeared to flow both ways nicely.

We also saw an lstat of a client mount hang once for 105 seconds while we were
rsyncing data into our cluster.  No idea why things would lock up for that
long.  It was an lstat of a directory full of 4GB iso files, so maybe it was
waiting for the isos to copy to both boxes.  At gigabit speed (~950Mbps),
though, 105 seconds is something like 12GB of data.  And not sure why it would
lock out lstat calls.

I'm new to glusterfs, so I don't really have anything more to add.  I just
wanted you to know I've seen similar weirdness with 3.3.1 in a relatively
simple replicate configuration.

Todd

On Fri, Mar 01, 2013 at 01:37:42AM +0100, Marcus Bointon wrote:
> I've given up on trying to upgrade a 3.2.5 installation to 3.3.1 directly, so I'm scrapping it and starting again. I'm on Ubuntu Lucid, using stock packages from the semiosis ppa.
> 
> My config is very simple - 2 nodes running replicate on a single volume with 4G of small files, created like this:
> 
> gluster volume create shared replica 2 transport tcp 192.168.0.8:/var/shared 192.168.0.34:/var/shared
> 
> I copied off all files from the gluster volume, removed all signs of gluster 3.2.5, installed 3.3.1, reconfigured using the same commands as for 3.2.5. Install, peer probe, volume creation and mount (via NFS) all reported working correctly. The problem I'm now seeing is that I can touch a file on one side and it appears on the other, but not the other way around.
> 
> If I ask for heal info on the volume, both nodes report zero differences, but ls shows there are! If I request a full heal, the files appear correctly and the fixed files appear in the healed list. Something is clearly not talking...
> 
> I doubt it's a firewall issue since this was previously a working setup and the firewall hasn't been touched.
> 
> I'm finding it hard to track down since gluster's logs are spread across so many places - just this simple config has 20+ logs - and I've not found anything to explain this behaviour.
> 
> Node 1:
> 
> # gluster peer status
> Number of Peers: 1
> 
> Hostname: 192.168.0.8
> Uuid: 8f30902f-f125-47bc-87dd-fa48e583efd3
> State: Peer in Cluster (Connected)
> 
> # gluster volume status
> Status of volume: shared
> Gluster process                                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.0.8:/var/shared                          24010   Y       22440
> Brick 192.168.0.34:/var/shared                          24009   Y       16957
> NFS Server on localhost                                 38467   Y       16963
> Self-heal Daemon on localhost                           N/A     Y       16969
> NFS Server on 192.168.0.8                              38467   Y       22446
> Self-heal Daemon on 192.168.0.8                        N/A     Y       22452
> 
> Node 2:
> 
> # gluster peer status
> Number of Peers: 1
> 
> Hostname: 192.168.0.34
> Uuid: cf6d4c23-a5a2-4c35-859c-52410b6429e1
> State: Peer in Cluster (Connected)
> 
> # gluster volume status
> Status of volume: shared
> Gluster process                                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.0.8:/var/shared                          24010   Y       22440
> Brick 192.168.0.34:/var/shared                          24009   Y       16957
> NFS Server on localhost                                 38467   Y       22446
> Self-heal Daemon on localhost                           N/A     Y       22452
> NFS Server on 192.168.0.34                              38467   Y       16963
> Self-heal Daemon on 192.168.0.34                        N/A     Y       16969
> 
> Having said all that, I've just noticed that files *are* appearing on the other node in the direction I thought they were not - but it's *really* slow; I copied about 10,000 files onto it and they are all visible on one node, but after 30 minutes only 10% of them are present on the other node, and they are all listed in the 'info healed' output. This sounds to me as if the replication is only happening in one direction via self-heal, and not through the normal replication route - it's certainly not synchronous. Any idea what could be amiss?
> 
> Marcus
> -- 
> Marcus Bointon
> Synchromedia Limited: Creators of http://www.smartmessages.net/
> UK info at hand CRM solutions
> marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users