[Bugs] [Bug 1565844] New: 65000 file heal limit on ext* file systems

bugzilla at redhat.com bugzilla at redhat.com
Tue Apr 10 22:12:00 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1565844

            Bug ID: 1565844
           Summary: 65000 file heal limit on ext* file systems
           Product: GlusterFS
           Version: 4.0
         Component: selfheal
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: jaco at uls.co.za
                CC: bugs at gluster.org



Description of problem:

When the underlying bricks are formatted with ext4, it's impossible for the
heal process to have more than 65000 (64999?) entries queued for heal.

This is (presumably) due to the maximum link count in ext* being 65000 (one
of those links is xattrop-${SOME_UUID}, which presumably serves as the common
link target).  The link-count field in the ext inode structure itself is
16-bit; I can't remember why 65000 was chosen as the hard limit, but I'm sure
that looked sane by design.
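
A quick way to see that ceiling (a minimal sketch, not from the original
report; run it in a scratch directory on an ext4 mount):

    # keep hard-linking one inode until ext4 refuses with EMLINK
    touch base
    i=1
    while ln base "link.$i" 2>/dev/null; do
        i=$((i + 1))
    done
    stat -c '%h' base    # on ext4 this prints 65000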

I found that all files in .glusterfs/indices/xattrop reference the same
inode, and that a stat of any of those files shows a link count of nearly
65000 (at least 64000) at any point in time.
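
This is easy to confirm on a live brick (the brick path below is a
placeholder, as in the reproduction steps further down):

    # %h = hard-link count, %i = inode number, %n = file name;
    # every index entry should report the same inode and a similar count
    stat -c '%h %i %n' /path/to/brick/.glusterfs/indices/xattrop/*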

I've got a 2x2 distribute-replicate cluster holding approximately 9.8M files
in total (per df -i).  Monday morning I was forced into a situation where I
had to replace the partition on which my bricks were residing (disk failure;
neither pvmove nor rsync could get the data off onto the replacement drive).
As such I was forced into a reset-brick to get the bricks to rebuild.  This
immediately triggered a heal, and I could see directory structures and the
like being recreated.  I also noted that many (most) of the files were
created owned by root:root and with no content.  I discovered this after
noticing that the space used (df -h) grew much more slowly than the inode
count (df -i): the former reached approximately 6% of the healthy brick's
usage in the time the inode count took to reach 26%.  Currently these stand
at 127G/1.4T (~9%) and 3.5M/9.8M (~36%).

How reproducible:

This is not the first time I've seen root:root owned, 0-size files.
Previously I fixed it by using find to rm them all (including the gfid
hard-linked file under .glusterfs) and then running stat on the file via the
FUSE mount.
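
Such a cleanup could look roughly like this (a hedged bash sketch; BRICK and
MOUNT are placeholder paths, and the gfid-to-path translation assumes the
usual .glusterfs/aa/bb/<gfid> layout):

    BRICK=/path/to/brick
    MOUNT=/mnt/glustervol   # FUSE mount of the volume
    find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \
         -type f -user root -size 0 -print |
    while read -r f; do
        # read the file's gfid xattr and format it as a UUID
        hex=$(getfattr -n trusted.gfid -e hex "$f" 2>/dev/null |
              sed -n 's/^trusted.gfid=0x//p')
        uuid=$(echo "$hex" | sed \
            's/\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)/\1-\2-\3-\4-\5/')
        # remove both the named file and its gfid hard link, then stat the
        # path via FUSE so the good replica heals it back
        rm -f "$f" "$BRICK/.glusterfs/${uuid:0:2}/${uuid:2:2}/$uuid"
        stat "$MOUNT/${f#$BRICK/}" >/dev/null
    done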

Steps to Reproduce:

1.  Set up glusterfs with ext4 backing the bricks (using replicate).
2.  Create a directory structure containing more than 65k files (possibly a
lot more).  The files should not be owned by root:root and should not be
empty.
3.  Down a brick, reformat and recover using reset-brick.
4.  Watch the heal count increase and cap out at 65k (a brick-side check is
sketched after the list).
5.  find /path/to/brick -user root -size 0

Step 5 will hopefully reveal some files, which would demonstrate the problem:
such files should not exist.
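
For step 4, the brick-side view of the ceiling can be watched directly (a
sketch; the brick path is again a placeholder):

    # the link count of the shared xattrop inode climbs with the heal
    # queue and then sticks just below the ext4 ceiling of 65000
    watch -n 5 "stat -c '%h %n' /path/to/brick/.glusterfs/indices/xattrop/*"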

Possible fixes:

For regular files, it may be possible to hard-link against the gfid file
that is to be healed instead.  This merely shifts the risk to that file
reaching the hard-link limit, but hitting it there should be far less
likely.

For directories (I'm not sure whether they end up in xattrop for healing),
the current strategy may be fine.

One could, when the xattrop-${UUID} file reaches the maximum link count,
unlink it and create a new file (inode) to link against, but that would make
calculating the heal count much harder, since it would no longer be a simple
stat on this one file (see the sketch below).
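
To illustrate the difference (a hedged sketch; the path is a placeholder):

    XATTROP=/path/to/brick/.glusterfs/indices/xattrop
    # current scheme: pending heals = link count of the single base file,
    # minus one for the base file itself
    echo $(( $(stat -c '%h' "$XATTROP"/xattrop-*) - 1 ))
    # multi-inode scheme: count index entries, excluding the base files
    ls -1 "$XATTROP" | grep -vc '^xattrop-'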

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

