[Gluster-devel] Need advice re some major issues with glusterfind

Sincock, John [FLCPTY] J.Sincock at fugro.com
Wed Oct 21 00:23:23 UTC 2015


Hi Everybody,

We have recently upgraded our 220 TB gluster to 3.7.4 and have been trying to use the new glusterfind feature, but we have run into some serious problems with it. Overall, glusterfind looks very promising, so I don't want to offend anyone by raising these issues.

If these issues can be resolved or worked around, glusterfind will be a great feature.  So I would really appreciate any information or advice:

1) What can be done about the vast number of tiny changelogs? We often see 5+ small, 89-byte changelog files per minute on EACH brick (larger files when the brick is busier). We have been generating these changelogs for a few weeks and already have in excess of 10,000-12,000 on most bricks. This makes glusterfinds very, very slow, especially on a node with a lot of bricks, and it looks unsustainable. Why are these files so small, why are there so many of them, and how are they supposed to be managed in the long run? The sheer number of them looks sure to hurt performance over time.
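
From what I can tell, the rate at which these files appear matches the changelog rollover interval, which I believe defaults to 15 seconds, so one thing we are considering is lengthening that interval. A rough sketch of what I mean (vol00 is our volume; I haven't tested what a longer interval does to glusterfind itself, so treat this as an idea rather than a recommendation):

    # Roll the changelog over every 5 minutes instead of the 15-second default,
    # which should mean far fewer (but larger) changelog files per brick:
    gluster volume set vol00 changelog.rollover-time 300

Even with a longer interval, though, I'd still like to know how these files are meant to be archived or pruned over time.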

2) The pgfid extended attribute is wreaking havoc with our backup scheme - when gluster adds this xattr to a file it changes the file's ctime, which we were using to determine which files need to be archived. A warning should be added to the release notes and upgrade notes so people can plan to manage this if required.
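
For anyone else checking whether their files have been hit by this yet, here is roughly how we have been spot-checking on the bricks (the brick path below is just an example from our setup):

    # Labelled files carry one trusted.pgfid.<parent-gfid> xattr per hard link;
    # run this directly against the file's path on a brick:
    getfattr -m . -d -e hex /data/brick01/some/dir/somefile

    # Compare against the ctime we are using for backup selection:
    stat -c '%z  %n' /data/brick01/some/dir/somefile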

Also, we ran a rebalance immediately after the 3.7.4 upgrade. The rebalance took about 5 days to complete, which looks like a major speed improvement over the older, more serial rebalance algorithm, so that's good. I was hoping the rebalance would also have the side-effect of labelling every file with the pgfid attribute by the time it completed, or failing that, that building an mlocate database across our entire gluster would (since that should have accessed every file, unless it gets the information it needs from directory inodes alone). But ctimes are still being modified, and I think this can only be caused by files still being labelled with pgfids.

How can we force gluster to get this pgfid labelling over and done with for all files that are already on the volume? We can't have gluster continuing to add pgfids in bursts here and there, e.g. when files are read for the first time since the upgrade - we need it done in one go. For now we have had to turn off pgfid creation on the volume until we can find a way to do that.
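
In case it helps anyone suggest a fix: what I would like to do is something along these lines, but I don't know whether a plain crawl from a client is actually guaranteed to trigger the labelling (the hostname and mount point below are just an illustration, and the readdirp part in particular is a guess on my part):

    # Re-enable the pgfid labelling we turned off:
    gluster volume set vol00 storage.build-pgfid on

    # Then crawl the whole volume from a dedicated client mount, forcing an
    # explicit stat of every file by name. Disabling readdirp is my attempt to
    # make sure each file really gets a named lookup rather than being answered
    # from a bulk readdirp reply:
    mount -t glusterfs -o use-readdirp=no gluster-server1:/vol00 /mnt/vol00-crawl
    find /mnt/vol00-crawl ! -type d -exec stat {} + > /dev/null

If that kind of crawl is not enough, a pointer to whatever does reliably trigger the labelling would be much appreciated.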

3) Files modified just before a glusterfind pre are often not included in the changed-files list unless the pre command is run again a bit later. I think the changelogs are missing very recent changes and need to be flushed or rolled over before the pre command consumes them?
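
Our workaround at the moment is just to wait at least one rollover interval (plus some slack) before running the pre, roughly like this (session and output-file names are ours; the sleep length is a guess based on the default 15-second rollover):

    # Give the changelog time to roll over so the most recent changes have
    # actually been published before the pre command consumes them:
    sleep 30
    glusterfind pre TestSession1 vol00 /tmp/TestSession1-changes.txt

But that is obviously a kludge, and it would be good to know whether the pre command is supposed to flush or roll the changelog itself.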

4) BUG: Glusterfind follows symlinks off the bricks and onto NFS-mounted directories (and will cause those shares to be mounted if you have autofs enabled). Glusterfind should definitely not follow symlinks, but it does. For now we are working around this by turning off autofs while we run glusterfinds, but this should not be necessary. Glusterfind must be fixed so it never follows symlinks and never leaves the brick it is currently searching.
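
For reference, the workaround we are using looks like this (assuming autofs is run as a normal service on the node):

    # Stop autofs so a stray symlink can't trigger an NFS automount mid-run:
    service autofs stop
    glusterfind pre TestSession1 vol00 /tmp/TestSession1-changes.txt
    service autofs start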

5) One of our nodes has 16 bricks, and on this machine the glusterfind pre command seems to get stuck pegging all 8 cores at 100%. An strace of one of the offending processes gives an endless stream of these lseeks and reads and very little else. What is going on here? It doesn't look right...:

lseek(13, 17188864, SEEK_SET)           = 17188864
read(13, "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17189888, SEEK_SET)           = 17189888
read(13, "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17190912, SEEK_SET)           = 17190912
read(13, "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17191936, SEEK_SET)           = 17191936
read(13, "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17192960, SEEK_SET)           = 17192960
read(13, "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
lseek(13, 17193984, SEEK_SET)           = 17193984
read(13, "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024

I captured one of these straces for 20-30 seconds and then did a quick analysis of it:
    cat ~/strace.glusterfind-lseeks2.txt | wc -l
    2719285
2.7 million system calls, and grepping to exclude all the lseeks and reads leaves only 24 other syscalls:

cat ~/strace.glusterfind-lseeks2.txt | grep -v lseek | grep -v read
Process 28076 attached - interrupt to quit
write(13, "\r\0\0\0\4\0\317\0\3N\2\241\1\322\0\317\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
write(13, "\r\0\0\0\4\0_\0\3\5\2\34\1I\0_\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
write(13, "\r\0\0\0\4\0\24\0\3\10\2\f\1\34\0\24\0\0\0\0\202\3\203\324?\f\0!\31UU?"..., 1024) = 1024
close(15)                               = 0
munmap(0x7f3570b01000, 4096)            = 0
lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history", {st_mode=S_IFDIR|0600, st_size=4096, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing", {st_mode=S_IFDIR|0600, st_size=249856, ...}) = 0
lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", {st_mode=S_IFREG|0644, st_size=5793, ...}) = 0
write(6, "[2015-10-16 02:59:53.437769] D ["..., 273) = 273
rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") = 0
open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", O_RDONLY) = 15
fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3570b01000
write(13, "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
Process 28076 detached

That seems like an enormous number of system calls to process just one changelog - especially when most of these changelogs are only 89 bytes long, few are larger than about 5 KB, and the largest is about 20 KB. We only upgraded to 3.7.4 several weeks ago and already have 12,000 or so changelogs to process on each brick, all of which will have to be processed if I want to generate a listing that goes back to the time we did the upgrade - which I do... If each changelog is being processed in this apparently inefficient way, it must be making the whole run a lot slower than it needs to be.
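
For what it's worth, a quick per-syscall tally of the same trace file (a rough sketch against my saved capture):

    # Count occurrences of each syscall name at the start of every strace line:
    grep -o '^[a-z_]*(' ~/strace.glusterfind-lseeks2.txt | sort | uniq -c | sort -rn

confirms it is essentially nothing but lseek and read on fd 13. My guess - and it is only a guess - is that fd 13 is the sqlite database glusterfind builds for the session, being walked 1 KB at a time, but I'd welcome confirmation from someone who knows the code.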

This is a big problem and makes it almost impossible to use glusterfind for what we need to use it for... 

Again, I'm not intending to be negative, just hoping these issues can be addressed if possible, and seeking advice or info re managing these issues and making glusterfind usable in the meantime.

Many thanks for any advice.

John


