[Gluster-users] Gluster 3.12.12: performance during heal and in general

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jul 26 08:17:16 UTC 2018


On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert <revirii at googlemail.com> wrote:

> Hi Pranith,
>
> thanks a lot for your efforts and for tracking "my" problem with an issue.
> :-)
>

> I've set these params on the gluster volume and will start the
> 'find...' command within a short time. I'll probably add another
> answer to the list to document the progress.
>
> btw. - you had some typos:
> gluster volume set <volname> cluster.cluster.heal-wait-queue-length
> 10000 => cluster is doubled
> gluster volume set <volname> cluster.data-self-heal-window-size 16 =>
> it's actually cluster.self-heal-window-size
>
> but actually no problem :-)
>

Sorry, bad copy/paste :-(.
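
To set the record straight, the corrected commands (per the note above)
should read:

gluster volume set <volname> cluster.background-self-heal-count 256
gluster volume set <volname> cluster.heal-wait-queue-length 10000
gluster volume set <volname> cluster.self-heal-window-size 16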


>
> Just curious: would gluster 4.1 improve the performance for healing
> and in general for "my" scenario?
>

No, this issue is present in all existing releases, but it is solvable.
You can follow that issue to track the progress and see when it gets fixed.


>
> 2018-07-26 8:56 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> > Thanks a lot for the detailed write-up, this helps find the bottlenecks easily.
> > On a high level, to handle this directory hierarchy, i.e. lots of
> > directories with files, we need to improve the healing algorithms.
> > Based on the data you provided, we need to make the following
> > enhancements:
> >
> > 1) At the moment directories are healed one at a time, but files can be
> > healed up to 64 in parallel per replica subvolume. So if you have nX2
> > or nX3 distributed subvolumes, it can heal 64n files in parallel.
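> >
> > For example: the volume status further down in this thread shows a
> > 4 x 3 distributed-replicate layout, i.e. n = 4 replica subvolumes, so
> > up to 4 * 64 = 256 files could be healed in parallel - while
> > directories are still healed strictly one at a time.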
> >
> > I raised https://github.com/gluster/glusterfs/issues/477 to track this.
> > In the meanwhile you can use the following workaround:
> > a) Increase background heals on the mount:
> > gluster volume set <volname> cluster.background-self-heal-count 256
> > gluster volume set <volname> cluster.cluster.heal-wait-queue-length 10000
> > find <mnt> -type d | xargs stat
> >
> > One 'find' will trigger heals on up to 10256 directories (256 running
> > in the background plus 10000 queued), so you may have to do this
> > periodically until all directories are healed.
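> >
> > A rough sketch of such a periodic trigger (hypothetical script; volume
> > name and mount point taken from this thread, adjust to your setup):
> >
> > #!/bin/bash
> > # Re-run the directory crawl while any brick still reports pending heals.
> > # Assumes a node that has both the gluster CLI and the client mount.
> > while gluster volume heal shared info | grep -q 'Number of entries: [1-9]'; do
> >     find /data/repository/shared -type d | xargs stat > /dev/null
> >     sleep 600   # pause 10 minutes between crawls
> > done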
> >
> > 2) Self-heal heals a file 128KB at a time (data-self-heal-window-size).
> > I think for your environment bumping it up to MBs is better. Say 2MB,
> > i.e. 16*128KB?
> >
> > Command to do that is:
> > gluster volume set <volname> cluster.data-self-heal-window-size 16
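> >
> > You can read the value back to confirm it took effect (using the
> > corrected option name from the note at the top of this thread):
> >
> > gluster volume get <volname> cluster.self-heal-window-size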
> >
> >
> > On Thu, Jul 26, 2018 at 10:40 AM, Hu Bert <revirii at googlemail.com> wrote:
> >>
> >> Hi Pranith,
> >>
> >> Sorry, it took a while to count the directories. I'll try to answer
> >> your questions as well as possible.
> >>
> >> > What kind of data do you have?
> >> > How many directories in the filesystem?
> >> > On average how many files per directory?
> >> > What is the depth of your directory hierarchy on average?
> >> > What is average filesize?
> >>
> >> We have mostly images (more than 95% of disk usage, 90% of file
> >> count), some text files (like css, jsp, gpx etc.) and some binaries.
> >>
> >> There are about 190,000 directories in the file system; maybe there
> >> are some more because we're hit by bug 1512371 (parallel-readdir =
> >> TRUE prevents directory listing). But the number of directories
> >> could/will rise in the future (maybe millions).
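> >>
> >> (For reference: such a count can be obtained with something along the
> >> lines of "find /data/repository/shared -type d | wc -l" - a hypothetical
> >> one-liner using the mount point described below.)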
> >>
> >> files per directory: ranges from 0 to 100, on average it should be 20
> >> files per directory (well, at least in the deepest dirs, see
> >> explanation below).
> >>
> >> Average filesize: ranges from a few hundred bytes up to 30 MB, on
> >> average it should be 2-3 MB.
> >>
> >> Directory hierarchy: maximum depth as seen from within the volume is
> >> 6, the average should be 3.
> >>
> >> volume name: shared
> >> mount point on clients: /data/repository/shared/
> >> below /shared/ there are 2 directories:
> >> - public/: mainly calculated images (file sizes from a few KB up to
> >> max 1 MB) and some resources (small PNGs with a size of a few hundred
> >> bytes).
> >> - private/: mainly source images; file sizes from 50 KB up to 30MB
> >>
> >> We migrated from an NFS server (SPOF) to glusterfs and simply copied
> >> our files. The images (which have an ID) are stored in the deepest
> >> directories of the dir tree. I'd better explain it :-)
> >>
> >> Directory structure for the images (I'll omit some other miscellaneous
> >> stuff, but it looks quite similar):
> >> - the ID of an image has 7 or 8 digits
> >> - /shared/private/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID.jpg
> >> - /shared/public/: /(first 3 digits of ID)/(next 3 digits of ID)/$ID/$misc_formats.jpg
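> >>
> >> For example, a hypothetical image with ID 1234567 would be stored as
> >> /shared/private/123/456/1234567.jpg, and its derived formats under
> >> /shared/public/123/456/1234567/.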
> >>
> >> That's why we have that many (sub-)directories. Files are only stored
> >> at the lowest level of the directory hierarchy. I hope that makes our
> >> structure at least a bit more transparent.
> >>
> >> I hope there's something we can do to raise performance a bit. Thanks
> >> in advance :-)
> >>
> >>
> >> 2018-07-24 10:40 GMT+02:00 Pranith Kumar Karampuri <pkarampu at redhat.com>:
> >> >
> >> >
> >> > On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert <revirii at googlemail.com> wrote:
> >> >>
> >> >> Well, over the weekend about 200GB were copied, so now there are
> >> >> ~400GB copied to the brick. That's far below a speed of 10GB per
> >> >> hour. If I copied the 1.6 TB directly, that would be done within
> >> >> max 2 days. But with the self heal this will take at least 20 days.
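> >> >>
> >> >> (Rough math: ~200GB over a weekend is roughly 100GB per day, so the
> >> >> remaining ~1.2TB alone would need another ~12 days at this rate.)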
> >> >>
> >> >> Why is the performance that bad? No chance of speeding this up?
> >> >
> >> >
> >> > What kind of data do you have?
> >> > How many directories in the filesystem?
> >> > On average how many files per directory?
> >> > What is the depth of your directory hierarchy on average?
> >> > What is average filesize?
> >> >
> >> > Based on this data we can see if anything can be improved, or if
> >> > there are some enhancements that need to be implemented in gluster
> >> > to address this kind of data layout.
> >> >>
> >> >>
> >> >> 2018-07-20 9:41 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> >> > hmm... no one any idea?
> >> >> >
> >> >> > Additional question: the hdd on server gluster12 was changed, so far
> >> >> > ~220 GB were copied. On the other 2 servers I see a lot of entries in
> >> >> > glustershd.log - about 312,000 and 336,000 entries there yesterday -
> >> >> > most of them (current log output) looking like this:
> >> >> >
> >> >> > [2018-07-20 07:30:49.757595] I [MSGID: 108026]
> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> >> > Completed data selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
> >> >> > sources=0 [2]  sinks=1
> >> >> > [2018-07-20 07:30:49.992398] I [MSGID: 108026]
> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> >> >> > 0-shared-replicate-3: performing metadata selfheal on
> >> >> > 0d863a62-0dd8-401c-b699-2b642d9fd2b6
> >> >> > [2018-07-20 07:30:50.243551] I [MSGID: 108026]
> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> >> > Completed metadata selfheal on 0d863a62-0dd8-401c-b699-2b642d9fd2b6.
> >> >> > sources=0 [2]  sinks=1
> >> >> >
> >> >> > or like this:
> >> >> >
> >> >> > [2018-07-20 07:38:41.726943] I [MSGID: 108026]
> >> >> > [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do]
> >> >> > 0-shared-replicate-3: performing metadata selfheal on
> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> >> > [2018-07-20 07:38:41.855737] I [MSGID: 108026]
> >> >> > [afr-self-heal-common.c:1724:afr_log_selfheal] 0-shared-replicate-3:
> >> >> > Completed metadata selfheal on 9276097a-cdac-4d12-9dc6-04b1ea4458ba.
> >> >> > sources=[0] 2  sinks=1
> >> >> > [2018-07-20 07:38:44.755800] I [MSGID: 108026]
> >> >> > [afr-self-heal-entry.c:887:afr_selfheal_entry_do]
> >> >> > 0-shared-replicate-3: performing entry selfheal on
> >> >> > 9276097a-cdac-4d12-9dc6-04b1ea4458ba
> >> >> >
> >> >> > Is this behaviour normal? I'd expect these messages on the server
> >> >> > with the failed brick, not on the other ones.
> >> >> >
> >> >> > 2018-07-19 8:31 GMT+02:00 Hu Bert <revirii at googlemail.com>:
> >> >> >> Hi there,
> >> >> >>
> >> >> >> I sent this mail yesterday, but somehow it didn't work? It wasn't
> >> >> >> archived, so please be indulgent if you receive this mail again :-)
> >> >> >>
> >> >> >> We are currently running a replicate setup and are experiencing
> >> >> >> quite poor performance. It got even worse when, within a couple of
> >> >> >> weeks, 2 bricks (disks) crashed. Here is some general information
> >> >> >> about our setup:
> >> >> >>
> >> >> >> 3 Dell PowerEdge R530 (Xeon E5-1650 v3 Hexa-Core, 64 GB DDR4, OS on
> >> >> >> separate disks); each server has 4 10TB disks -> each is a brick;
> >> >> >> replica 3 setup (see gluster volume status below). Debian stretch,
> >> >> >> kernel 4.9.0, gluster version 3.12.12. Servers and clients are
> >> >> >> connected via 10 GBit ethernet.
> >> >> >>
> >> >> >> About a month ago and 2 days ago a disk died (on different
> >> >> >> servers); the disks were replaced, brought back into the volume,
> >> >> >> and a full self heal started. But the speed for this is quite...
> >> >> >> disappointing. Each brick has ~1.6TB of data on it (mostly the
> >> >> >> infamous small files). The full heal I started yesterday copied
> >> >> >> only ~50GB within 24 hours (48 hours: about 100GB) - with this
> >> >> >> rate it would take weeks until the self heal finishes.
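> >> >> >>
> >> >> >> (At ~50GB per day, healing ~1.6TB works out to roughly 32 days.)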
> >> >> >>
> >> >> >> After the first heal (started on gluster13 about a month ago, took
> >> >> >> about 3 weeks) finished we had terrible performance; CPU usage on
> >> >> >> one or two of the nodes (gluster11, gluster12) was up to 1200%,
> >> >> >> consumed by the brick process of the former crashed brick
> >> >> >> (bricksdd1) - interestingly not on the server with the failed disk,
> >> >> >> but on the other 2 ones...
> >> >> >>
> >> >> >> Well... am I doing something wrong? Some options wrongly
> >> >> >> configured? Terrible setup? Anyone got an idea? Any additional
> >> >> >> information needed?
> >> >> >>
> >> >> >>
> >> >> >> Thx in advance :-)
> >> >> >>
> >> >> >> gluster volume status
> >> >> >>
> >> >> >> Volume Name: shared
> >> >> >> Type: Distributed-Replicate
> >> >> >> Volume ID: e879d208-1d8c-4089-85f3-ef1b3aa45d36
> >> >> >> Status: Started
> >> >> >> Snapshot Count: 0
> >> >> >> Number of Bricks: 4 x 3 = 12
> >> >> >> Transport-type: tcp
> >> >> >> Bricks:
> >> >> >> Brick1: gluster11:/gluster/bricksda1/shared
> >> >> >> Brick2: gluster12:/gluster/bricksda1/shared
> >> >> >> Brick3: gluster13:/gluster/bricksda1/shared
> >> >> >> Brick4: gluster11:/gluster/bricksdb1/shared
> >> >> >> Brick5: gluster12:/gluster/bricksdb1/shared
> >> >> >> Brick6: gluster13:/gluster/bricksdb1/shared
> >> >> >> Brick7: gluster11:/gluster/bricksdc1/shared
> >> >> >> Brick8: gluster12:/gluster/bricksdc1/shared
> >> >> >> Brick9: gluster13:/gluster/bricksdc1/shared
> >> >> >> Brick10: gluster11:/gluster/bricksdd1/shared
> >> >> >> Brick11: gluster12:/gluster/bricksdd1_new/shared
> >> >> >> Brick12: gluster13:/gluster/bricksdd1_new/shared
> >> >> >> Options Reconfigured:
> >> >> >> cluster.shd-max-threads: 4
> >> >> >> performance.md-cache-timeout: 60
> >> >> >> cluster.lookup-optimize: on
> >> >> >> cluster.readdir-optimize: on
> >> >> >> performance.cache-refresh-timeout: 4
> >> >> >> performance.parallel-readdir: on
> >> >> >> server.event-threads: 8
> >> >> >> client.event-threads: 8
> >> >> >> performance.cache-max-file-size: 128MB
> >> >> >> performance.write-behind-window-size: 16MB
> >> >> >> performance.io-thread-count: 64
> >> >> >> cluster.min-free-disk: 1%
> >> >> >> performance.cache-size: 24GB
> >> >> >> nfs.disable: on
> >> >> >> transport.address-family: inet
> >> >> >> performance.high-prio-threads: 32
> >> >> >> performance.normal-prio-threads: 32
> >> >> >> performance.low-prio-threads: 32
> >> >> >> performance.least-prio-threads: 8
> >> >> >> performance.io-cache: on
> >> >> >> server.allow-insecure: on
> >> >> >> performance.strict-o-direct: off
> >> >> >> transport.listen-backlog: 100
> >> >> >> server.outstanding-rpc-limit: 128
> >> >> _______________________________________________
> >> >> Gluster-users mailing list
> >> >> Gluster-users at gluster.org
> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Pranith
> >
> >
> >
> >
> > --
> > Pranith
>



-- 
Pranith