[Gluster-devel] Need advice re some major issues with glusterfind

Vijaikumar Mallikarjuna vmallika at redhat.com
Wed Oct 21 08:07:08 UTC 2015


On Wed, Oct 21, 2015 at 5:53 AM, Sincock, John [FLCPTY] <J.Sincock at fugro.com
> wrote:

> Hi Everybody,
>
> We have recently upgraded our 220 TB gluster to 3.7.4, and we've been
> trying to use the new glusterfind feature but have been having some serious
> problems with it. Overall, glusterfind looks very promising, so I don't
> want to offend anyone by raising these issues.
>
> If these issues can be resolved or worked around, glusterfind will be a
> great feature.  So I would really appreciate any information or advice:
>
> 1) What can be done about the vast number of tiny changelogs? We often
> see 5+ small, 89-byte changelog files per minute on EACH brick, and
> larger files when the brick is busier. We've been generating these
> changelogs for a few weeks and already have in excess of 10,000 or
> 12,000 on most bricks. This makes glusterfinds very, very slow,
> especially on a node with many bricks. Why are these files so small,
> why are there so many of them, and how are they supposed to be managed?
> The sheer number of them looks sure to impact performance in the long run.
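[Editorial aside, not from the thread: the file-per-~12-seconds rate above matches the changelog translator's default 15-second rollover interval. If that interval is the driver, one mitigation would be to lengthen it so each brick produces fewer, larger files. `<volname>` is a placeholder for the affected volume; test the impact on glusterfind's change granularity before relying on it.]

```shell
# Hedged suggestion: raise the changelog rollover interval from its 15 s
# default (300 s here) so each brick produces fewer, larger changelog files.
gluster volume set <volname> changelog.rollover-time 300
```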
>
> 2) The pgfid extended attribute is wreaking havoc with our backup
> scheme - when gluster adds this xattr to a file it changes the file's
> ctime, which we were using to determine which files need to be
> archived. A warning should be added to the release and upgrade notes so
> people can plan for this if required.
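[Editorial aside, not from the thread: a metadata-only change such as adding an xattr bumps ctime but leaves mtime untouched, so an mtime-based selection would survive the pgfid labelling. The sketch below uses chmod as a stand-in for the xattr write, since both are metadata-only updates.]

```shell
# Metadata-only changes (chmod here, standing in for gluster's pgfid
# xattr write) bump ctime but leave mtime untouched.
tmp=$(mktemp -d)
touch "$tmp/f"
mtime1=$(stat -c %Y "$tmp/f"); ctime1=$(stat -c %Z "$tmp/f")
sleep 2                      # ctime has one-second granularity
chmod 600 "$tmp/f"           # metadata-only update
mtime2=$(stat -c %Y "$tmp/f"); ctime2=$(stat -c %Z "$tmp/f")
[ "$mtime1" = "$mtime2" ] && echo "mtime unchanged"
[ "$ctime1" != "$ctime2" ] && echo "ctime changed"
rm -rf "$tmp"
```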
>
> Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the
> rebalance took 5 days or so to complete, which looks like a major speed
> improvement over the more serial rebalance algorithm, so that's good. I
> was hoping the rebalance would have the side-effect of labelling every
> file with the pgfid attribute by the time it completed - or, failing
> that, that building an mlocate database across our entire gluster would
> (since that accesses every file, unless it gets the info it needs from
> directory inodes alone). But ctimes are still being modified, and I
> think this can only be caused by files still being labelled with pgfids.
>
> How can we force gluster to get this pgfid labelling over and done with,
> for all files that are already on the volume? We can't have gluster
> continuing to add pgfids in bursts here and there, e.g. when files are read
> for the first time since the upgrade. We need to get it over and done with.
> We have just had to turn off pgfid creation on the volume until we can
> force gluster to get it over and done with in one go.
>
Hi John,

Was quota turned on or off before or after performing the rebalance? If
the pgfid is missing, it can be healed by running 'find <mount_point> |
xargs stat'; every file will then be looked up once and the pgfid
healing will happen.
Also, could you please provide all the volume files under
'/var/lib/glusterd/vols/<volname>/*.vol'?
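[Editorial sketch, not from the thread: a whitespace-safe form of that heal avoids the word-splitting a bare `find | xargs` suffers on paths containing spaces or newlines. The mount path in the usage line is a placeholder.]

```shell
# Look up every file once so the brick-side posix translator can add the
# missing pgfid xattr. -print0/-0 keeps odd filenames intact.
heal_pgfid() {
    find "$1" -print0 | xargs -0 stat >/dev/null
}
# usage (placeholder path): heal_pgfid /mnt/glustervol
```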

Thanks,
Vijay


> 3) Files modified just before a glusterfind pre are often not included
> in the changed-files list unless the pre command is run again a bit
> later - I think the changelogs are missing very recent changes and need
> to be flushed somehow before the pre command uses them?
>
> 4) BUG: Glusterfind follows symlinks off bricks and onto NFS-mounted
> directories (and will cause these shares to be mounted if you have
> autofs enabled). Glusterfind should definitely not follow symlinks, but
> it does. For now, we are working around this by turning off autofs when
> we run glusterfinds, but this should not be necessary. Glusterfind must
> be fixed so it never follows symlinks and never leaves the brick it is
> currently searching.
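[Editorial sketch, not glusterfind's actual code: the expected crawl behaviour can be demonstrated with GNU find. By default (-P) find never dereferences symlinks; only the -L flag reproduces the buggy off-brick traversal described above.]

```shell
# Default find (-P) lists the symlink itself and never descends into its
# target; -L reproduces the off-brick traversal described above.
tmp=$(mktemp -d)
mkdir "$tmp/brick" "$tmp/outside"
touch "$tmp/outside/secret"
ln -s "$tmp/outside" "$tmp/brick/link"
inside=$(find "$tmp/brick" -name secret | wc -l)     # 0: symlink not followed
outside=$(find -L "$tmp/brick" -name secret | wc -l) # 1: followed off the brick
echo "default: $inside, with -L: $outside"
rm -rf "$tmp"
```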
>
> 5) One of our nodes has 16 bricks, and on this machine the glusterfind
> pre command seems to get stuck, pegging all 8 cores at 100%. An strace
> of one of the offending processes gives an endless stream of these
> lseeks and reads and very little else. What is going on here? It
> doesn't look right... :
>
> lseek(13, 17188864, SEEK_SET)           = 17188864
> read(13,
> "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> lseek(13, 17189888, SEEK_SET)           = 17189888
> read(13,
> "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> lseek(13, 17190912, SEEK_SET)           = 17190912
> read(13,
> "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> lseek(13, 17191936, SEEK_SET)           = 17191936
> read(13,
> "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> lseek(13, 17192960, SEEK_SET)           = 17192960
> read(13,
> "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> lseek(13, 17193984, SEEK_SET)           = 17193984
> read(13,
> "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024)
> = 1024
>
> I saved one of these straces for 20 or 30 secs or so, and then doing a
> quick analysis of it:
>     cat ~/strace.glusterfind-lseeks2.txt | wc -l
>     2719285
> 2.7 million system calls, and grepping to exclude all the lseeks and reads
> leaves only 24 other syscalls:
>
> cat ~/strace.glusterfind-lseeks2.txt | grep -v lseek | grep -v read
> Process 28076 attached - interrupt to quit
> write(13,
> "\r\0\0\0\4\0\317\0\3N\2\241\1\322\0\317\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 1024) = 1024
> write(13,
> "\r\0\0\0\4\0_\0\3\5\2\34\1I\0_\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024)
> = 1024
> write(13,
> "\r\0\0\0\4\0\24\0\3\10\2\f\1\34\0\24\0\0\0\0\202\3\203\324?\f\0!\31UU?"...,
> 1024) = 1024
> close(15)                               = 0
> munmap(0x7f3570b01000, 4096)            = 0
> lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd", {st_mode=S_IFDIR|0755, st_size=4096,
> ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind", {st_mode=S_IFDIR|0755,
> st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1",
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00",
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e",
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history",
> {st_mode=S_IFDIR|0600, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing",
> {st_mode=S_IFDIR|0600, st_size=249856, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354",
> {st_mode=S_IFREG|0644, st_size=5793, ...}) = 0
> write(6, "[2015-10-16 02:59:53.437769] D ["..., 273) = 273
> rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354",
> "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354")
> = 0
> open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369",
> O_RDONLY) = 15
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7f3570b01000
> write(13,
> "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024)
> = 1024
> Process 28076 detached
>
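[Editorial aside: `strace -c` would produce a syscall summary live; for an already-saved trace like the one above, tallying the whole file at once is quicker than repeated `grep -v` passes. A small helper (the trace filename in the usage line is the one from this post):]

```shell
# Count syscalls by name in a saved strace capture: keep everything up to
# the opening parenthesis at line start, then tally. Noise lines (e.g.
# "Process 28076 attached") don't match the pattern and are dropped.
syscall_histogram() {
    grep -oE '^[a-z_0-9]+\(' "$1" | tr -d '(' | sort | uniq -c | sort -rn
}
# usage: syscall_histogram ~/strace.glusterfind-lseeks2.txt
```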
> That seems like an enormous number of system calls to process just one
> changelog - especially when most of these changelogs are only 89 bytes
> long, few are larger than about 5 KB, and the largest is about 20 KB.
> We only upgraded to 3.7.4 several weeks ago and already have 12,000 or
> so changelogs to process on each brick, all of which will have to be
> processed if I want to generate a listing that goes back to the time we
> did the upgrade - which I do... If each changelog is being processed in
> this apparently inefficient way, it must be making the process a lot
> slower than it needs to be.
>
> This is a big problem and makes it almost impossible to use glusterfind
> for what we need to use it for...
>
> Again, I'm not intending to be negative - just hoping these issues can
> be addressed if possible, and seeking advice or info on managing them
> and making glusterfind usable in the meantime.
>
> Many thanks for any advice.
>
> John
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>

