[Gluster-users] Odd error with older Gluster/NFS3 tail swallowing

Tue Apr 9 13:44:26 UTC 2013

Had some data loss with an older - 3.1.4 - Gluster share last night. Now
trying to see what the best lessons are to learn from it. Obviously it's too
old a version for a bug report to matter. Wondering if anyone recognizes
this particular sort of error condition though. 

It's a 300 G replicated share, mounted by Gluster's NFS3 to several systems.
It was getting fuller than we like it to be, at over 80%, so I copied a
directory containing 26 G off of it. Checked the copy and it was good. Then I
went and "rm -r"'d that directory. After a few minutes it complained "Cannot
delete directory, directory not empty," citing a subdirectory. Strange.

So I stopped the process and looked in that subdirectory. The subdirectory
had within it ... the whole of the Gluster share. Damn. Yes, the "rm -r"
had managed to wipe out over half of the share, because through this
illusion it had descended into the other directories at the share's root
level.

And it's an illusion. It wasn't some bizarre tail-eating self-mounting. The
copy of the directory shows it populated by three subdirectories with a few
files in them. The version of the directory on the Gluster NFS share has
none of those three. Instead it has the whole of the root-level directories of
the share visible there. Both of the backing shares also have those three
directories properly populated - no tail-eating mount. Another system with
the same NFS3 mount of this share shows those three directories properly. So
it's an illusion based in some error in how Gluster and/or the NFS client is
presenting the directory structure on the one client. The backing stores are
ext4; the kernel is old enough to be fully compatible with Gluster on ext4.
Anyway, no apparent error at that level.

I'm a generalist as a sysadmin, in a smallish shop, half-ignorant on
filesystems. Aside from putting more weight on the "pursuing the cutting
edge may sometimes be safer than staying with the apparent time-proven
stability of an older version" side of the perennial debate between
progressive and conservative approaches, what should I learn from this? The
NFS client is running a 2.6.24 kernel - yeah, that's old, but it's been very
reliable up until this. Is this some known NFS client problem since fixed?
Or a really-bizarre one-off?

If it is something that can still "just happen," what's a safe procedure?
Looking through every subdirectory of a tree about to be deleted, to make
sure it hasn't somehow mounted the whole filesystem it's within onto itself
(virtually, and without anything showing up in "mount"), seems like a lot.
I did have, fortunately, an rsnapshot backup of the whole thing, so I have
been able to restore. But I'd like to avoid the whole experience next time.
What's the wisest way to go?
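
One possible pre-delete check, as a rough sketch only: walk the tree before
running "rm -r" and flag any subdirectory that looks like the share root or
like one of its own ancestors. The function name find_suspect_dirs and both of
its parameters are made up for illustration. It assumes the fault would show up
either as a repeated (device, inode) pair or as a subdirectory whose entry
names exactly match the share's root listing. Whether the client in this
incident would have reported matching inode numbers is not known, so the
name comparison is there as a second, cruder signal.

```python
import os


def find_suspect_dirs(tree, share_root):
    """Return directories under `tree` that resemble the share root or an ancestor.

    Hypothetical helper: `tree` is the directory about to be deleted and
    `share_root` is the mount point of the share it lives on.
    """
    root_stat = os.stat(share_root)
    root_id = (root_stat.st_dev, root_stat.st_ino)
    root_names = set(os.listdir(share_root))
    suspects = []

    def walk(path, ancestors):
        st = os.lstat(path)
        ident = (st.st_dev, st.st_ino)
        names = set(os.listdir(path))
        looks_like_root = bool(root_names) and names == root_names
        if ident == root_id or ident in ancestors or looks_like_root:
            suspects.append(path)
            return  # never descend into a directory that looks wrong
        for name in sorted(names):
            child = os.path.join(path, name)
            # Only real directories are walked; symlinks are skipped.
            if os.path.isdir(child) and not os.path.islink(child):
                walk(child, ancestors | {ident})

    walk(tree, frozenset())
    return suspects
```

The idea would be to call it with the directory to be removed and the mount
point, and to go ahead with the delete only if it returns an empty list. It
reads the whole tree once, so it is slow on a big directory, and an empty
result is not proof that the tree is sane.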

Thanks
Whit