> In our recent testing, we saw all kinds of weird problems while testing
> rebuilding a failed brick in the same 2 node replicate cluster.  Several times
> we had to kill off all gluster processes and restart things from scratch to
> get the two sides talking correctly again (where both sides thought they were
> happily talking to the other side, but self-heal wasn't doing anything).  We'd
> run a full heal or stat some files and they wouldn't replicate back to the
> other side.  After restarting the processes (not just glusterd, but all of the
> glusterfs ones too), things would start working.  Once things were running and
> the nodes were properly replicating, it appeared to flow both ways nicely.

Thanks. I've managed to fix it by deleting every bit of gluster I could find, reinstalling and copying all the data back on.

I also saw the bug report about NFS hangs with 3.3.1, so I downgraded to 3.3.0 (which also meant switching to the older PPA).

It would be good to have a definitive list of where gluster puts everything - after uninstalling and deleting everything I could find, it still clearly had some info about my old config. I did this after unmounting the volume and stopping all gluster services:

aptitude remove glusterfs-client glusterfs-server
aptitude purge glusterfs-client glusterfs-server
rm -rf /etc/glusterfs
rm -rf /var/log/glusterfs
rm -rf /var/lib/glusterfs
rm -rf /usr/lib/glusterfs
rm -rf /var/shared    # my gluster storage area

and yet when I reinstalled, I saw entries in the logs mentioning the brick storage area from my old installation - I've no idea where that info was lurking for it to find it again.
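
For anyone chasing the same leftovers, here is a minimal sketch of a check for directories that may still hold gluster state. The candidate list is an assumption, not a definitive inventory: /var/lib/glusterd and /etc/glusterd are not among the paths I removed above, but as far as I know they are where glusterd of this era keeps its working state, so they are worth a look. The function name list_gluster_leftovers is made up for illustration.

```shell
# List candidate gluster leftovers that still exist.
# Optional argument: a root prefix to search under (default: the real root).
# ASSUMPTION: this path list is a guess, not an official inventory.
list_gluster_leftovers() {
    root="${1:-}"
    for p in \
        /etc/glusterfs \
        /etc/glusterd \
        /var/lib/glusterd \
        /var/lib/glusterfs \
        /var/log/glusterfs \
        /usr/lib/glusterfs \
        /var/run/glusterd.pid
    do
        # Print the path only if something is actually there.
        if [ -e "${root}${p}" ]; then
            echo "${p}"
        fi
    done
}
```

Running list_gluster_leftovers with no argument after a purge prints whichever of these paths survived. Nothing is deleted, so it is safe to run before deciding what to remove.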

Incidentally, that also caused a bug of sorts. When starting glusterd it hung up, filling /var/log/glusterfs/etc-glusterfs-glusterd.vol.log with repeats of this:

[2013-03-05 19:24:21.137209] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
[2013-03-05 19:24:21.137393] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)

I couldn't find anything on Google relating to this error, but it seems to be caused when gluster can't find its storage area. In my case, creating /var/shared fixed the problem. I've no idea why it would report that as an issue with the pid file, but hopefully this will help someone else.
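
One other thing that may be worth ruling out when this message appears: "Resource temporarily unavailable" is the usual text for EAGAIN, which a non-blocking lock call returns when another process already holds the lock. So a glusterd that is still running might produce the same log line. That is a guess, not something confirmed here. A minimal sketch for checking whether the PID recorded in the pidfile belongs to a live process (the function name pidfile_status is made up for illustration):

```shell
# Report whether the PID stored in a pidfile belongs to a live process.
# Argument: path to the pidfile (default: /var/run/glusterd.pid, from the log above).
# Run as root: kill -0 also fails for processes owned by another user,
# which this sketch would misreport as stale.
pidfile_status() {
    pidfile="${1:-/var/run/glusterd.pid}"
    # A missing or empty pidfile has nothing to check.
    if [ ! -s "$pidfile" ]; then
        echo "no pid recorded"
        return 1
    fi
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running: $pid"
    else
        echo "stale: $pid"
        return 1
    fi
}
```

If this prints "running", stopping that process before starting glusterd again would be the first thing to try.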

