[Gluster-users] Problems with .gluster structure - bad symlinks

Tue Mar 11 03:22:15 UTC 2014

On 3/9/2014 10:39 AM, Shawn Heisey wrote:
> On 3/8/2014 7:45 PM, Shawn Heisey wrote:
>> cat:
>> /bricks/d00v00/mdfs/.glusterfs/65/30/6530ce82-310d-4c7c-8d14-135655328a77:
>> Too many levels of symbolic links
>>
>> What do I need to do to fix this problem?  Is there something I can do
>> for each of the bad symlinks?  Would a 'heal full' do anything useful?
>> Do I need to do something more drastic, like take the volume down and
>> entirely remove (or rename) the .glusterfs structure from all 32 bricks
>> (16x2 distributed-replicate)?  I don't want to cause myself more
>> problems, but I want to get the volume in a completely pristine state
>> and NOT risk losing any of the 52 terabytes of data that's in the volume.
> 
> Some additional info:
> 
> http://fpaste.org/83806/43825451/
> 
> This is from nfs.log on the server that all my clients contact for NFS
> mounts.  It is peered with the other servers, but has no bricks.
> 
> So far I have determined the following about my bricks:
> 
> * There are no stray directories under .glusterfs/??/??/
> * There is nothing remaining with nonzero trusted.afr* attributes
> * There *are* broken symlinks (too many levels)
> 
> I will run another check to make sure there are no files with one
> hardlink outside of the indices directory.  I will also check for files
> that have more than two hardlinks.  I do not use hardlinks in my data,
> so I think that this should never happen.
> 
> Is there anything else I can look for, and if I find something, where
> can I go for information about how to fix it?
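
The brick checks listed above could be scripted roughly as follows.
This is only a sketch under stated assumptions, not a Gluster tool: the
function name scan_gfid_entries is hypothetical, it only looks at
entries under .glusterfs/??/??/, and it treats any regular file whose
hardlink count is not exactly 2 as suspicious, which only makes sense
when the data itself contains no hardlinks. It reads metadata and
changes nothing on the brick.

```python
import errno
import glob
import os
import stat


def scan_gfid_entries(brick_root):
    """Report suspicious entries under <brick_root>/.glusterfs/??/??/.

    Returns a dict of sorted path lists:
      loops      - symlinks that fail to resolve with ELOOP
      dangling   - symlinks whose target does not exist
      odd_links  - regular files whose hardlink count is not 2
                   (assumption: the data contains no hardlinks)
      stray_dirs - real directories where only files/symlinks are expected
    """
    report = {"loops": [], "dangling": [], "odd_links": [], "stray_dirs": []}
    pattern = os.path.join(brick_root, ".glusterfs", "??", "??", "*")
    for path in glob.iglob(pattern):
        st = os.lstat(path)  # examine the entry itself, not its target
        if stat.S_ISLNK(st.st_mode):
            try:
                os.stat(path)  # follows the link; fails if it is broken
            except OSError as exc:
                if exc.errno == errno.ELOOP:
                    report["loops"].append(path)
                elif exc.errno == errno.ENOENT:
                    report["dangling"].append(path)
                else:
                    raise
        elif stat.S_ISDIR(st.st_mode):
            report["stray_dirs"].append(path)
        elif stat.S_ISREG(st.st_mode) and st.st_nlink != 2:
            report["odd_links"].append(path)
    for paths in report.values():
        paths.sort()
    return report
```

Calling scan_gfid_entries("/bricks/d00v00/mdfs") on each brick would
give one report per brick to compare across replica pairs.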

Pointed question: Is it safe to rebalance with these broken symlinks?

I've gotten no response, either on IRC or to this email.

Here's what I found when I traced one of these broken symlinks through
manually:

http://apaste.info/m03v

I am in a situation where I have to get a rebalance done, but after one
attempt on 3.3.1 with absolutely spectacular failures, and another on
3.4.2 where heal problems surfaced, I'm somewhat terrified to start
another rebalance as long as there are potential problems with my
volume.  We've got 52 terabytes of data on this volume.  The 3.3.1
failure involved data that simply went missing and can't be recovered.

It is looking like the glibc max symlink limit may be at work here.  My
best guess is that the .glusterfs infrastructure fails unless you stick
to paths with 8 directory levels, because with 9 levels, the .glusterfs
structure has too many symlinks.
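
One way to test that guess is to count how many symlinks a path
actually passes through during resolution. The sketch below walks a
path one component at a time in user space instead of relying on the
system's own resolver, so it keeps counting past whatever limit
produces the "Too many levels of symbolic links" error. The function
name count_symlink_hops and the default limit of 1000 are assumptions
made for illustration; the limit exists only to stop on a true loop.

```python
import os


def count_symlink_hops(path, limit=1000):
    """Resolve `path` component by component, counting symlinks followed.

    Returns (resolved_path, hops). Raises RuntimeError once more than
    `limit` symlinks have been followed, which indicates a real loop.
    """
    path = os.path.abspath(path)
    # Stack of components still to process; the next one is on top.
    pending = [p for p in path.split(os.sep) if p]
    pending.reverse()
    resolved = os.sep
    hops = 0
    while pending:
        part = pending.pop()
        if part == ".":
            continue
        if part == "..":
            resolved = os.path.dirname(resolved)
            continue
        candidate = os.path.join(resolved, part)
        if os.path.islink(candidate):
            hops += 1
            if hops > limit:
                raise RuntimeError("more than %d symlinks followed" % limit)
            target = os.readlink(candidate)
            if os.path.isabs(target):
                resolved = os.sep  # absolute target restarts at the root
            # Queue the target's components ahead of the remaining ones.
            pending.extend(reversed([p for p in target.split(os.sep) if p]))
        else:
            resolved = candidate
    return resolved, hops
```

Running this against one of the failing .glusterfs entries and against
one that resolves normally would show whether the failing ones really
need more hops, and how the count grows with directory depth.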

On a related note, is there any way to increase the max symlinks without
recompiling glibc?  My research suggests that the Linux kernel (at least
as of the 2.6 versions in CentOS/RHEL 6) no longer has an inherent
limit, and that glibc is what enforces it.

Thanks,
Shawn