[Gluster-devel] Need advice re some major issues with glusterfind

Kotresh Hiremath Ravishankar khiremat at redhat.com
Fri Oct 23 06:54:29 UTC 2015


Hi John,

The changelog files are generated every 15 secs, recording the changes that happened to the
filesystem within that span. So every 15 secs, once a new changelog file is generated, it is
ready to be consumed by glusterfind or any other consumer. The 15 sec time period is tunable,
e.g.,
     gluster vol set <VOLNAME> changelog.rollover-time 300

The above will generate a new changelog file every 300 sec instead of every 15 sec, hence
reducing the number of changelogs. But glusterfind will then come to know about changes to
the filesystem only after 300 secs!
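
If your gluster build has the 'volume get' command, you can check the currently effective
value, and revert to the 15 sec default with 'volume set', e.g.,
     gluster volume get <VOLNAME> changelog.rollover-time
     gluster volume set <VOLNAME> changelog.rollover-time 15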

Deleting these changelogs at .glusterfs/changelog/... is not recommended. It will affect any
new glusterfind session that is going to be established.


Thanks and Regards,
Kotresh H R

----- Original Message -----
> From: "John Sincock [FLCPTY]" <J.Sincock at fugro.com>
> To: "Vijaikumar Mallikarjuna" <vmallika at redhat.com>
> Cc: gluster-devel at gluster.org
> Sent: Friday, October 23, 2015 9:54:25 AM
> Subject: Re: [Gluster-devel] Need advice re some major issues with glusterfind
> 
> 
> Hi Vijay, pls see below again (I'm wondering if top-posting would be easier,
> that's usually what I do, though I know some ppl don’t like it)
> 
>  
> On Wed, Oct 21, 2015 at 5:53 AM, Sincock, John [FLCPTY] <J.Sincock at fugro.com>
> wrote:
> Hi Everybody,
> 
> We have recently upgraded our 220 TB gluster to 3.7.4, and we've been trying
> to use the new glusterfind feature but have been having some serious
> problems with it. Overall, glusterfind looks very promising, so I don't want
> to offend anyone by raising these issues.
> 
> If these issues can be resolved or worked around, glusterfind will be a great
> feature.  So I would really appreciate any information or advice:
> 
> 1) What can be done about the vast number of tiny changelogs? We often see
> 5+ small 89-byte changelog files per minute on EACH brick (larger files when
> busier). We've been generating these changelogs for a few weeks and have in
> excess of 10,000 or 12,000 on most bricks. This makes glusterfinds very,
> very slow, especially on a node which has a lot of bricks, and looks
> unsustainable. Why are these files so small, why are there so many of them,
> and how are they supposed to be managed? The sheer number of these files
> looks sure to impact performance in the long run.
> 
> 2) The pgfid extended attribute is wreaking havoc with our backup scheme:
> when gluster adds this xattr to files it changes the ctime, which we were
> using to determine which files need to be archived. There should be a
> warning added to the release notes & upgrade notes, so people can make a
> plan to manage this if required.
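> 
> (As an aside, something along these lines should show whether a given file on
> a brick backend already carries a trusted.pgfid.* xattr; the brick path below
> is just an illustrative example:)
> 
>      getfattr -d -m . -e hex /mnt/glusterfs/bricks/1/path/to/some/file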
> 
> Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the
> rebalance took 5 days or so to complete, which looks like a major speed
> improvement over the more serial rebalance algorithm, so that's good. But I
> was hoping that the rebalance would also have had the side-effect of
> triggering all files to be labelled with the pgfid attribute by the time the
> rebalance completed, or failing that, after creation of an mlocate database
> across our entire gluster (which would have accessed every file, unless it
> is getting the info it needs only from directory inodes). Now it looks like
> ctimes are still being modified, and I think this can only be caused by
> files still being labelled with pgfids.
> 
> How can we force gluster to get this pgfid labelling over and done with, for
> all files that are already on the volume? We can't have gluster continuing
> to add pgfids in bursts here and there, e.g. when files are read for the
> first time since the upgrade. We have just had to turn off pgfid creation on
> the volume until we can force gluster to do it all in one go.
>  
>  
> Hi John,
>  
> Was quota turned on/off before/after performing the re-balance? If the pgfid
> is missing, this can be healed by performing 'find <mount_point> | xargs
> stat'; all the files will get looked up once and the pgfid healing will
> happen.
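> 
> For what it's worth, a slightly more defensive form of the same thing (just a
> sketch; -print0/-0 guard against unusual filenames, and the stat output is
> discarded since only the lookup matters) would be:
> 
>      find <mount_point> -print0 | xargs -0 stat > /dev/null
> 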
> Also could you please provide all the volume files under
> '/var/lib/glusterd/vols/<volname>/*.vol'?
>  
> Thanks,
> Vijay
>  
>  
> Hi Vijay
>  
> Quota has never been turned on in our gluster, so it can’t be any
> quota-related xattrs that are resetting our ctimes; I’m pretty sure it
> must be due to pgfids still being added.
>  
> Thanks for the tip re using stat; if that triggers the pgfid build on each
> file, then I will run it when I have a chance. We’ll have to get our
> archiving of data back up to date, re-enable the pgfid build option, and
> then run the stat over a weekend or something, as it will take a while.
>  
> I’m still quite concerned about the number of changelogs being generated. Do
> you know if there are any plans to change the way changelogs are generated so
> there aren’t so many of them, and to process them more efficiently? I think
> this will be vital to improving performance of glusterfind in future, as
> there are currently an enormous number of these small changelogs being
> generated on each of our gluster bricks.
>   
> Below is the volfile for one brick; the others are all equivalent. We haven’t
> tweaked the volume options much, besides increasing the io thread count to
> 32 and the client/event threads to 6, since we have a lot of small files on
> our gluster (30 million files, many of which are small, and some of which
> are large to very large):
>  
> 
> Hi John,
> 
> PGFID xattrs are updated only when update-link-count-parent is enabled in the
> brick volume file. This option is enabled when quota is enabled on a volume.
> The volume file you provided below has update-link-count-parent disabled, so
> I am wondering why PGFID xattrs are being updated.
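> 
> (If you want to double-check this on a server, then assuming the usual
> volfile location, something like
> 
>      grep -i update-link-count-parent /var/lib/glusterd/vols/<volname>/*.vol
> 
> will show whether the option appears in the brick volfiles.)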
> 
> Thanks,
> Vijay
>  
> 
> Hi Vijay,
> somewhere in the 3.7.5 upgrade instructions or the glusterfind documentation,
> there was a mention that we should enable a server option called
> storage.build-pgfid, which we did, as it speeds up glusterfinds. You cannot
> see this in the volfile, but you can see it when you do 'gluster volume info
> <volname>'. So for our volume we currently have:
> 
> Options Reconfigured:
> server.allow-insecure: on
> nfs.disable: false
> performance.io-thread-count: 32
> features.quota: off
> client.bind-insecure: on
> 
> storage.build-pgfid: off
> 
> changelog.changelog: on
> changelog.capture-del-path: on
> server.event-threads: 6
> client.event-threads: 6
> 
> We've turned storage.build-pgfid OFF now, but we turned it on when we did the
> upgrade to 3.7.4, and we had it on until a few days ago. So, for us, with
> update-link-count-parent off, storage.build-pgfid would've been the thing
> responsible for adding the pgfids to files on our volume.
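> 
> (For reference, turning the option back on and then verifying it is just:
> 
>      gluster volume set <volname> storage.build-pgfid on
>      gluster volume info <volname> | grep build-pgfid
> 
> though of course we'll only do that once we're ready to run the full stat
> over the volume.)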
> 
> I should've realised the best thing to do would've been to stat every file
> in order to trigger the pgfid build, but at first I thought the pgfids would
> be added to every file during the rebalance, which was a priority at the
> time (we had just added 40TB of new bricks to a very full volume), and then
> we hit the pgfid/backup issues etc.
> 
> I think we can get the pgfid issue resolved now that you've confirmed a stat
> will do it (thanks :-) We'll just have to stop our clients writing to the
> volume for a day or so while we stat every file on the volume. Then, since
> our clients won't have been writing during that time, we can re-jig our
> backups to safely ignore any ctimes that changed during the day or so we
> were statting the volume.
> 
> I'll let you know how things go with the pgfids once we can get them turned
> back on and added to every file, hopefully as soon as possible.
> 
> I'm definitely more concerned now about the changelog issue. As mentioned, we
> have an enormous number of these; e.g. as of now (about 25 days since
> upgrading to 3.7.4), we have 13,000 or so changelogs on each of our bricks:
> 
> ls -la /mnt/glusterfs/bricks/1/.glusterfs/changelogs/ | wc -l
> 13096
> 
> And they are very small: about 5 KB on average, ranging from 89 bytes (many
> are just that) up to 20 KB or so for the larger ones:
> du -hs /mnt/glusterfs/bricks/1/.glusterfs/changelogs/
> 68M     /mnt/glusterfs/bricks/1/.glusterfs/changelogs/
> 
> The size of the changelogs is not an issue (68M for almost a month's worth
> of changes is nothing), but the sheer number of files is, as is the fact
> that it seems to be very CPU-intensive to process these files (e.g. an
> strace showed glusterfind taking 2.7 million system calls to process just
> one of these small changelogs).
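> 
> (For anyone who wants to reproduce that sort of measurement, something like
> 'strace -c -f' around a glusterfind run gives a per-syscall count summary,
> e.g.
> 
>      strace -c -f glusterfind pre <sessionname> <volname> /tmp/outfile.txt
> 
> where the session name, volume name and output file are of course
> placeholders for your own values.)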
> 
> Do you know if anyone is working on reducing the number of these changelogs
> and/or processing them more efficiently?
> 
> Thanks again for any info!
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel

