[Gluster-devel] Need advice re some major issues with glusterfind
Sincock, John [FLCPTY]
J.Sincock at fugro.com
Thu Oct 22 03:11:15 UTC 2015
Please see below.
From: Vijaikumar Mallikarjuna [mailto:vmallika at redhat.com]
Sent: Wednesday, 21 October 2015 6:37 PM
To: Sincock, John [FLCPTY]
Cc: gluster-devel at gluster.org
Subject: Re: [Gluster-devel] Need advice re some major issues with glusterfind
On Wed, Oct 21, 2015 at 5:53 AM, Sincock, John [FLCPTY] <J.Sincock at fugro.com> wrote:
Hi Everybody,
We have recently upgraded our 220 TB gluster to 3.7.4 and have been trying to use the new glusterfind feature, but we have been having some serious problems with it. Overall, glusterfind looks very promising, so I don't want to offend anyone by raising these issues.
If these issues can be resolved or worked around, glusterfind will be a great feature. So I would really appreciate any information or advice:
1) What can be done about the vast number of tiny changelogs? We often see 5+ small, 89-byte changelog files per minute on EACH brick, and larger files when the brick is busier. After a few weeks of generating these changelogs we have more than 10,000-12,000 of them on most bricks. This makes glusterfind runs very, very slow, especially on a node with a lot of bricks, and it looks unsustainable. Why are these files so small, why are there so many of them, and how are they supposed to be managed in the long run? The sheer number of them looks certain to hurt performance over time.
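A quick way to get a feel for how fast they accumulate on a brick is something like the following (a rough sketch, using the changelog path from the volfile further down):
  find /mnt/glusterfs/bricks/1/.glusterfs/changelogs -maxdepth 1 -type f | wc -l
  ls -lh /mnt/glusterfs/bricks/1/.glusterfs/changelogs | head -20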
2) The pgfid extended attribute is wreaking havoc with our backup scheme - when gluster adds this xattr to a file it changes the file's ctime, which we were using to determine which files need to be archived. A warning should be added to the release notes & upgrade notes so people can plan for this if required.
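For anyone wanting to check their own bricks, something like this should show whether a file has been labelled yet and what its ctime now is (run against the brick path, not the mount; the file path here is just an example):
  getfattr -d -m trusted.pgfid -e hex /mnt/glusterfs/bricks/1/path/to/some/file
  stat -c '%z  %n' /mnt/glusterfs/bricks/1/path/to/some/file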
Also, we ran a rebalance immediately after the 3.7.4 upgrade, and it took about 5 days to complete, which looks like a major speed improvement over the old, more serial rebalance algorithm, so that's good. I was hoping the rebalance would also have the side-effect of labelling every file with the pgfid attribute by the time it completed, or failing that, that building an mlocate database across our entire gluster would do so (since that should access every file, unless it gets the information it needs from directory inodes alone). But ctimes are still being modified, and I think this can only be caused by files still being labelled with pgfids.
How can we force gluster to get this pgfid labelling over and done with for all files that are already on the volume? We can't have gluster continuing to add pgfids in bursts here and there, e.g. whenever a file is read for the first time since the upgrade - we need to get it done in one pass. For now we have turned off pgfid creation on the volume until we can force gluster to do it all in one go.
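(For reference, the option we toggled - assuming I have the name right - is storage.build-pgfid:
  gluster volume set vol00 storage.build-pgfid off
and we will set it back to on before running the one-off crawl.)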
Hi John,
Was quota turned on or off before or after performing the rebalance? If the pgfid is missing, it can be healed by running 'find <mount_point> | xargs stat' - every file will be looked up once and the pgfid healing will happen.
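If there are filenames with spaces, the null-terminated form is safer, and the stat output itself can be discarded since only the lookup matters:
  find <mount_point> -print0 | xargs -0 stat > /dev/null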
Also, could you please provide all the volfiles under '/var/lib/glusterd/vols/<volname>/*.vol'?
Thanks,
Vijay
Hi Vijay
Quota has never been turned on in our gluster, so it can't be any quota-related xattrs resetting our ctimes - I'm pretty sure it must be the pgfids still being added.
Thanks for the tip re using stat - if that triggers the pgfid build on each file, I will run it when I have a chance. We'll have to get our archiving of data back up to date, re-enable the pgfid build option, and then run the stat crawl over a weekend or something, as it will take a while.
I'm still quite concerned about the number of changelogs being generated. Do you know if there are any plans to change the way changelogs are generated so there aren't so many of them, and to process them more efficiently? I think this will be vital to improving glusterfind performance, as an enormous number of these small changelogs is currently being generated on each of our bricks.
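In the meantime, is the changelog rollover interval the only knob on our side to make the files fewer and bigger? i.e. something like the following (option name from memory, and I believe the default interval is 15 seconds):
  gluster volume set vol00 changelog.rollover-time 300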
Below is the volfile for one brick; the others are all equivalent. We haven't tweaked the volume options much, besides increasing the io thread count to 32 and the client/server event threads to 6, since we have a lot of files on our gluster (around 30 million, many of them small and some large to very large). The volume-set commands we used are sketched after the volfile.
[root@g-unit-1 sbin]# cat /var/lib/glusterd/vols/vol00/vol00.g-unit-1.mnt-glusterfs-bricks-1.vol
volume vol00-posix
type storage/posix
option update-link-count-parent off
option volume-id 292b8701-d394-48ee-a224-b5a20ca7ce0f
option directory /mnt/glusterfs/bricks/1
end-volume
volume vol00-trash
type features/trash
option trash-internal-op off
option brick-path /mnt/glusterfs/bricks/1
option trash-dir .trashcan
subvolumes vol00-posix
end-volume
volume vol00-changetimerecorder
type features/changetimerecorder
option record-counters off
option ctr-enabled off
option record-entry on
option ctr_inode_heal_expire_period 300
option ctr_hardlink_heal_expire_period 300
option ctr_link_consistency off
option record-exit off
option db-path /mnt/glusterfs/bricks/1/.glusterfs/
option db-name 1.db
option hot-brick off
option db-type sqlite3
subvolumes vol00-trash
end-volume
volume vol00-changelog
type features/changelog
option capture-del-path on
option changelog-barrier-timeout 120
option changelog on
option changelog-dir /mnt/glusterfs/bricks/1/.glusterfs/changelogs
option changelog-brick /mnt/glusterfs/bricks/1
subvolumes vol00-changetimerecorder
end-volume
volume vol00-bitrot-stub
type features/bitrot-stub
option export /mnt/glusterfs/bricks/1
subvolumes vol00-changelog
end-volume
volume vol00-access-control
type features/access-control
subvolumes vol00-bitrot-stub
end-volume
volume vol00-locks
type features/locks
subvolumes vol00-access-control
end-volume
volume vol00-upcall
type features/upcall
option cache-invalidation off
subvolumes vol00-locks
end-volume
volume vol00-io-threads
type performance/io-threads
option thread-count 32
subvolumes vol00-upcall
end-volume
volume vol00-marker
type features/marker
option inode-quota off
option quota off
option gsync-force-xtime off
option xtime off
option timestamp-file /var/lib/glusterd/vols/vol00/marker.tstamp
option volume-uuid 292b8701-d394-48ee-a224-b5a20ca7ce0f
subvolumes vol00-io-threads
end-volume
volume vol00-barrier
type features/barrier
option barrier-timeout 120
option barrier disable
subvolumes vol00-marker
end-volume
volume vol00-index
type features/index
option index-base /mnt/glusterfs/bricks/1/.glusterfs/indices
subvolumes vol00-barrier
end-volume
volume vol00-quota
type features/quota
option deem-statfs off
option timeout 0
option server-quota off
option volume-uuid vol00
subvolumes vol00-index
end-volume
volume vol00-worm
type features/worm
option worm off
subvolumes vol00-quota
end-volume
volume vol00-read-only
type features/read-only
option read-only off
subvolumes vol00-worm
end-volume
volume /mnt/glusterfs/bricks/1
type debug/io-stats
option count-fop-hits off
option latency-measurement off
subvolumes vol00-read-only
end-volume
volume vol00-server
type protocol/server
option event-threads 6
option rpc-auth-allow-insecure on
option auth.addr./mnt/glusterfs/bricks/1.allow *
option auth.login.dc3d05ba-40ce-47ee-8f4c-a729917784dc.password 58c2072b-8d1c-4921-9270-bf4b477c4126
option auth.login./mnt/glusterfs/bricks/1.allow dc3d05ba-40ce-47ee-8f4c-a729917784dc
option transport-type tcp
subvolumes /mnt/glusterfs/bricks/1
end-volume
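For completeness, the tuning mentioned above was applied with volume-set commands roughly like these (from memory, so the exact option names may need double-checking):
  gluster volume set vol00 performance.io-thread-count 32
  gluster volume set vol00 client.event-threads 6
  gluster volume set vol00 server.event-threads 6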