[Gluster-devel] Need advice re some major issues with glusterfind

Vijaikumar Mallikarjuna vmallika at redhat.com
Fri Oct 23 04:41:07 UTC 2015


Hi Kotresh/Venky,

Could you please provide your inputs on the change-log issues mentioned
below?

Thanks,
Vijay


On Fri, Oct 23, 2015 at 9:54 AM, Sincock, John [FLCPTY] <J.Sincock at fugro.com> wrote:

>
> Hi Vijay, please see below again (I'm wondering if top-posting would be
> easier; that's usually what I do, though I know some people don't like it)
>
>
> On Wed, Oct 21, 2015 at 5:53 AM, Sincock, John [FLCPTY] <J.Sincock at fugro.com> wrote:
> Hi Everybody,
>
> We have recently upgraded our 220 TB gluster to 3.7.4, and we've been
> trying to use the new glusterfind feature but have been having some serious
> problems with it. Overall the glusterfind looks very promising, so I don't
> want to offend anyone by raising these issues.
>
> If these issues can be resolved or worked around, glusterfind will be a
> great feature.  So I would really appreciate any information or advice:
>
> 1) What can be done about the vast number of tiny changelogs? We often see
> 5+ small, 89-byte changelog files per minute on EACH brick (larger files
> when the brick is busier). After generating these changelogs for a few
> weeks, we have in excess of 10,000-12,000 on most bricks. This makes
> glusterfinds very, very slow, especially on a node with many bricks, and
> looks unsustainable. Why are these files so small, why are there so many
> of them, and how are they supposed to be managed? The sheer number of
> files looks certain to hurt performance in the long run.
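>
> As a side note, if this rate is simply driven by a changelog rollover
> timer, then a rough sketch of how to check and (if safe) stretch it would
> be the following - the option name changelog.rollover-time and its
> 15-second default are only my reading of the docs, so please correct me:
>
> gluster volume info <volname> | grep rollover   # shown only if reconfigured
> gluster volume set <volname> changelog.rollover-time 300   # e.g. 5 minutes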
>
> 2) The pgfid xattr is wreaking havoc with our backup scheme: when gluster
> adds this extended attribute to a file it changes the file's ctime, which
> we were using to determine which files need to be archived. A warning
> should be added to the release and upgrade notes so people can plan for
> this if required.
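>
> For anyone else checking whether a given file has already been labelled,
> a rough sketch (run against a brick backend path rather than the client
> mount; the file path and the trusted.pgfid.* name pattern are my
> assumptions):
>
> stat -c '%n  ctime=%z' /mnt/glusterfs/bricks/1/path/to/file
> getfattr -d -m 'trusted.pgfid' -e hex /mnt/glusterfs/bricks/1/path/to/file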
>
> Also, we ran a rebalance immediately after the 3.7.4 upgrade, and it took
> 5 days or so to complete, which looks like a major speed improvement over
> the older, more serial rebalance algorithm, so that's good. I was hoping
> the rebalance would also have the side effect of labelling every file with
> the pgfid attribute by the time it completed, or, failing that, that
> building an mlocate database across our entire gluster would do so (that
> would have accessed every file, unless it gets the info it needs only from
> directory inodes). But ctimes are still being modified now, and I think
> this can only be caused by files still being labelled with pgfids.
>
> How can we force gluster to finish this pgfid labelling for all files that
> are already on the volume? We can't have gluster continuing to add pgfids
> in bursts here and there, e.g. when files are read for the first time
> since the upgrade; we need to get it over and done with in one go. For now
> we have had to turn off pgfid creation on the volume until we can do that.
>
>
> Hi John,
>
> Was quota turned on or off before or after performing the re-balance? If
> the pgfid is missing, it can be healed by running 'find <mount_point> |
> xargs stat'; every file gets looked up once and the pgfid healing will
> happen.
> Also, could you please provide all the volume files under
> '/var/lib/glusterd/vols/<volname>/*.vol'?
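>
> A slightly more defensive form of that heal, as a sketch in case any
> filenames contain spaces or newlines (the mount point below is just a
> placeholder):
>
> find /mnt/volume -print0 | xargs -0 stat > /dev/null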
>
> Thanks,
> Vijay
>
>
> Hi Vijay
>
> Quota has never been turned on in our gluster, so it can't be any
> quota-related xattrs resetting our ctimes; I'm pretty sure it must be the
> pgfids still being added.
>
> Thanks for the tip re using stat; if that triggers the pgfid build on each
> file, then I will run it when I have a chance. We'll have to get our
> archiving of data back up to date, re-enable the pgfid build option, and
> then run the stat over a weekend or something, as it will take a while.
>
> I'm still quite concerned about the number of changelogs being generated.
> Do you know if there are any plans to change the way changelogs are
> generated so there aren't so many of them, and to process them more
> efficiently? I think this will be vital to improving the performance of
> glusterfind in future, as an enormous number of these small changelogs is
> currently being generated on each of our gluster bricks.
>
> Below is the volfile for one brick; the others are all equivalent. We
> haven't tweaked the volume options much, besides increasing the IO thread
> count to 32 and the client/server event threads to 6, since we have a lot
> of small files on our gluster (30 million files, many of which are small,
> and some of which are large to very large).
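>
> In concrete terms those tweaks amount to commands along these lines, with
> <volname> standing in for our volume name:
>
> gluster volume set <volname> performance.io-thread-count 32
> gluster volume set <volname> server.event-threads 6
> gluster volume set <volname> client.event-threads 6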
>
>
> Hi John,
>
> PGFID xattrs are updated only when update-link-count-parent is enabled in
> the brick volume file. This option is enabled when quota is enabled on a
> volume.
> The volume file you provided below has update-link-count-parent disabled,
> so I am wondering why the PGFID xattrs are being updated at all.
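>
> To double-check, something like this on one of the server nodes should
> show whether the option appears in the brick volfiles (a rough sketch,
> using the path mentioned earlier):
>
> grep -n update-link-count-parent /var/lib/glusterd/vols/<volname>/*.vol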
>
> Thanks,
> Vijay
>
>
> Hi Vijay,
> Somewhere in the 3.7.5 upgrade instructions or the glusterfind
> documentation there was a mention that we should enable a server option
> called storage.build-pgfid, which we did, as it speeds up glusterfinds.
> You cannot see this in the volfile, but you can see it when you run
> 'gluster volume info <volname>'. So for our volume we currently have:
>
> Options Reconfigured:
> server.allow-insecure: on
> nfs.disable: false
> performance.io-thread-count: 32
> features.quota: off
> client.bind-insecure: on
>
> storage.build-pgfid: off
>
> changelog.changelog: on
> changelog.capture-del-path: on
> server.event-threads: 6
> client.event-threads: 6
>
> We've turned storage.build-pgfid OFF now, but we turned it on when we did
> the upgrade to 3.7.4 and had it on until a few days ago. So, for us, with
> update-link-count-parent off, storage.build-pgfid must have been what was
> adding the pgfids to files on our volume.
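>
> For reference, checking and later re-enabling it should just be a matter
> of something like:
>
> gluster volume info <volname> | grep build-pgfid
> gluster volume set <volname> storage.build-pgfid on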
>
> I should've realised the best thing to do would've been to stat every file
> in order to trigger the pgfid build. But at first I thought the pgfids
> would be added to every file during the rebalance, which was the priority
> at the time (we had just added 40 TB of new bricks to a very full volume),
> and then we hit the pgfid/backup issues etc.
>
> I think we can get the pgfid issue resolved now that you've confirmed a
> stat will do it (thanks :-). We'll just have to stop our clients writing
> to the volume for a day or so while we stat every file on it. Then, since
> clients won't have been writing during that time, we can re-jig our
> backups to safely ignore any ctimes that changed during the day or so we
> were stat-ing the volume.
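>
> Roughly what I have in mind for that run, as a sketch only (the mount
> point and the degree of parallelism are placeholders we'd still tune):
>
> find /mnt/volume -print0 | xargs -0 -n 64 -P 8 stat > /dev/null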
>
> I'll let you know how things go with the pgfids once we can get them
> turned back on and added to every file, hopefully as soon as possible.
>
> I'm definitely more concerned now about the changelog issue. As mentioned,
> we have an enormous number of these, e.g. as of now (about 25 days since
> upgrading to 3.7.4) we have 13,000 or so changelogs on each of our bricks:
>
> ls -la /mnt/glusterfs/bricks/1/.glusterfs/changelogs/ | wc -l
> 13096
>
> And they are very small: about 5 KB on average, ranging from 89 bytes
> (many are just that) up to 20 KB or so for the larger ones:
> du -hs /mnt/glusterfs/bricks/1/.glusterfs/changelogs/
> 68M     /mnt/glusterfs/bricks/1/.glusterfs/changelogs/
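>
> For what it's worth, a rough sketch of one way to see the generation rate
> per day, assuming the CHANGELOG.<epoch-timestamp> naming on the bricks:
>
> for f in /mnt/glusterfs/bricks/1/.glusterfs/changelogs/CHANGELOG.*; do
>     date -d "@${f##*.}" +%F
> done | sort | uniq -c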
>
> The size of the changelogs is not an issue (68M for almost a month's worth
> of changes is nothing), but the sheer number of files is, as is the fact
> that processing them seems to be very CPU-intensive (e.g. an strace showed
> glusterfind making 2.7 million system calls to process just one of these
> small changelogs).
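>
> For anyone who wants to reproduce that measurement, something along these
> lines should do it (the session and output file names are placeholders):
>
> strace -c -f glusterfind pre <session> <volname> /tmp/changed-files.txt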
>
> Do you know if anyone is working on reducing the number of these
> changelogs and/or processing them more efficiently?
>
> Thanks again for any info!
>
>
>