[Gluster-users] performance.cache-size for high-RAM clients/servers, other tweaks for performance, and improvements to Gluster docs
Ravishankar N
ravishankar at redhat.com
Wed Apr 18 04:58:12 UTC 2018
On 04/18/2018 10:14 AM, Artem Russakovskii wrote:
> Following up here on a related issue that is very serious for us.
>
> I took down one of the 4 replicate gluster servers for maintenance
> today. There are 2 gluster volumes totaling about 600GB - not that
> much data. After the server came back online, it started auto-healing,
> and pretty much all operations on gluster froze for many minutes.
>
> For example, I was trying to run an ls -alrt in a folder with 7300
> files, and it took a good 15-20 minutes before returning.
>
> During this time, iostat shows 100% utilization on the brick, heal
> status takes many minutes to return, and glusterfsd uses up tons of
> CPU (I saw it spike to 600%). gluster already has massive performance
> issues for me, but healing after a 4-hour downtime is on another level
> of bad performance.
>
> For example, this command took many minutes to run:
>
> gluster volume heal androidpolice_data3 info summary
> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 91
> Number of entries in heal pending: 90
> Number of entries in split-brain: 0
> Number of entries possibly healing: 1
>
> Brick forge:/mnt/forge_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 87
> Number of entries in heal pending: 86
> Number of entries in split-brain: 0
> Number of entries possibly healing: 1
>
> Brick hive:/mnt/hive_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 87
> Number of entries in heal pending: 86
> Number of entries in split-brain: 0
> Number of entries possibly healing: 1
>
> Brick citadel:/mnt/citadel_block4/androidpolice_data3
> Status: Connected
> Total Number of entries: 0
> Number of entries in heal pending: 0
> Number of entries in split-brain: 0
> Number of entries possibly healing: 0
>
>
> Statistics showed a diminishing number of failed heals:
> ...
> Ending time of crawl: Tue Apr 17 21:13:08 2018
>
> Type of crawl: INDEX
> No. of entries healed: 2
> No. of entries in split-brain: 0
> No. of heal failed entries: 102
>
> Starting time of crawl: Tue Apr 17 21:13:09 2018
>
> Ending time of crawl: Tue Apr 17 21:14:30 2018
>
> Type of crawl: INDEX
> No. of entries healed: 4
> No. of entries in split-brain: 0
> No. of heal failed entries: 91
>
> Starting time of crawl: Tue Apr 17 21:14:31 2018
>
> Ending time of crawl: Tue Apr 17 21:15:34 2018
>
> Type of crawl: INDEX
> No. of entries healed: 0
> No. of entries in split-brain: 0
> No. of heal failed entries: 88
> ...
>
> Eventually, everything heals and gets back to a point where at least
> the roof isn't on fire anymore.
>
> The server stats and volume options were given in one of the previous
> replies to this thread.
>
> Any ideas or things I could run and show the output of to help
> diagnose? I'm also very open to working with someone on the team on a
> live debugging session if there's interest.
It is likely that self-heal is causing the CPU spike, due to the flood
of lookup, lock, and checksum fops that the self-heal daemon sends to
the bricks.
There's a script that uses cgroups to control shd's CPU usage; it
should help in regulating self-heal traffic:
https://review.gluster.org/#/c/18404/ (see extras/control-cpu-load.sh)
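If you'd rather do it by hand, here's a rough sketch of the cgroups
(v1) approach that the script automates (the cgroup name and quota
below are illustrative; substitute the actual shd PID on your node):

  # Find the self-heal daemon's PID:
  ps aux | grep glustershd

  # Create a cgroup capped at ~25% of one CPU
  # (25ms of CPU time per 100ms period):
  mkdir -p /sys/fs/cgroup/cpu/gluster_shd
  echo 25000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_quota_us
  echo 100000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_period_us

  # Move the shd process into the cgroup:
  echo <SHD_PID> > /sys/fs/cgroup/cpu/gluster_shd/tasks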
Other self-heal-related volume options that you could change are setting
'cluster.data-self-heal-algorithm' to 'full' and 'granular-entry-heal'
to 'enable'. `gluster volume set help` should give you more information
about these options.
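For example (substituting your volume name; depending on the release,
granular entry heal may need to be toggled via the heal command instead
of volume set - `gluster volume set help` will tell you):

  gluster volume set androidpolice_data3 cluster.data-self-heal-algorithm full
  gluster volume heal androidpolice_data3 granular-entry-heal enable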
Thanks,
Ravi
>
> Thank you.
>
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net <http://beerpla.net/> | +ArtemRussakovskii
> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
> <http://twitter.com/ArtemR>
>
> On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii
> <archon810 at gmail.com> wrote:
>
> Hi Vlad,
>
> I actually saw that post already and even asked a question 4 days
> ago
> (https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
> The accepted answer also seems to go against your suggestion to
> enable direct-io-mode, as it says it should be disabled for better
> performance when gluster is used just for file access.
>
> It'd be great if someone from the Gluster team chimed in about
> this thread.
>
>
> Sincerely,
> Artem
>
>
> On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov
> <vladkopy at gmail.com> wrote:
>
> I wish I knew, or was able to get, a detailed description of those
> options myself.
> Here is direct-io-mode:
> https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode
> Like you, I ran tests on a large volume of files and found that
> the main delays are in attribute calls, which is how I ended up
> with those mount options for performance.
> I discovered those options basically by googling this user list,
> where people share their tests.
> Not sure I share your optimism; rather than going up,
> I downgraded to 3.12 and have no directory-view issue now, though
> I had to recreate the cluster and re-add bricks with
> existing data.
>
> On Tue, Apr 10, 2018 at 1:47 AM, Artem Russakovskii
> <archon810 at gmail.com> wrote:
>
> Hi Vlad,
>
> I'm using only localhost: mounts.
>
> Can you please explain what effect each option has on the
> performance issues shown in my posts?
> "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
> From what I remember, direct-io-mode=enable didn't make a
> difference in my tests, but I suppose I can try again. The
> explanations of direct-io-mode in various guides on the web
> are quite confusing, saying enabling it could make performance
> worse in some situations and better in others, due to the OS
> file cache.
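> For reference, here's how those options would look as a single
> /etc/fstab entry (a hypothetical line based on my volume and
> mount point; not something I've tested):
>
> localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev,negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5 0 0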
>
> There are also these gluster volume settings, adding to
> the confusion:
> Option: performance.strict-o-direct
> Default Value: off
> Description: This option when set to off, ignores the
> O_DIRECT flag.
>
> Option: performance.nfs.strict-o-direct
> Default Value: off
> Description: This option when set to off, ignores the
> O_DIRECT flag.
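> (To check what a given volume is actually using, I believe
> there's `gluster volume get`, e.g.:
>
> gluster volume get apkmirror_data1 performance.strict-o-direct
>
> though that shows the value, not an explanation of it.)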
>
> Re: 4.0: I moved to 4.0 after finding out that it fixes
> the disappearing-dirs bug related to
> cluster.readdir-optimize, if you remember
> (http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
> I was already on 3.13 by then, and 4.0 resolved the issue.
> It's been stable for me so far, thankfully.
>
>
> Sincerely,
> Artem
>
>
> On Mon, Apr 9, 2018 at 10:38 PM, Vlad Kopylov
> <vladkopy at gmail.com> wrote:
>
> You definitely need mount options in /etc/fstab;
> use the ones from here:
> http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html
>
> I went with using local mounts to achieve
> performance as well.
>
> Also, the 3.12 or 3.10 branches would be preferable for
> production.
>
> On Fri, Apr 6, 2018 at 4:12 AM, Artem Russakovskii
> <archon810 at gmail.com> wrote:
>
> Hi again,
>
> I'd like to expand on the performance issues and
> plead for help. Here's one case which shows these
> odd hiccups: https://i.imgur.com/CXBPjTK.gifv.
>
> In this GIF, where I switch back and forth between
> copy operations on 2 servers, I'm copying a 10GB
> directory full of .apk and image files.
>
> On server "hive" I'm copying straight from the
> main disk to an attached volume block (xfs). As
> you can see, the transfers are relatively speedy
> and don't hiccup.
> On server "citadel" I'm copying the same set of
> data to a 4-replicate gluster which uses block
> storage as a brick. As you can see, performance is
> much worse, and there are frequent pauses for many
> seconds where nothing seems to be happening - just
> freezes.
>
> All 4 servers have the same specs, and all of them
> have performance issues with gluster and no such
> issues when raw xfs block storage is used.
>
> hive has long finished copying the data, while
> citadel is barely chugging along and will
> probably take another half an hour to an hour. I
> have over 1TB of data to migrate, and if we went
> live with that, I'm not even sure gluster would
> be able to keep up rather than bringing the
> machines and services down.
>
>
>
> Here's the cluster config, though it didn't seem
> to make any performance difference before vs.
> after I applied the customizations.
>
> Volume Name: apkmirror_data1
> Type: Replicate
> Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: nexus2:/mnt/nexus2_block1/apkmirror_data1
> Brick2: forge:/mnt/forge_block1/apkmirror_data1
> Brick3: hive:/mnt/hive_block1/apkmirror_data1
> Brick4: citadel:/mnt/citadel_block1/apkmirror_data1
> Options Reconfigured:
> cluster.quorum-count: 1
> cluster.quorum-type: fixed
> network.ping-timeout: 5
> network.remote-dio: enable
> performance.rda-cache-limit: 256MB
> performance.readdir-ahead: on
> performance.parallel-readdir: on
> network.inode-lru-limit: 500000
> performance.md-cache-timeout: 600
> performance.cache-invalidation: on
> performance.stat-prefetch: on
> features.cache-invalidation-timeout: 600
> features.cache-invalidation: on
> cluster.readdir-optimize: on
> performance.io-thread-count: 32
> server.event-threads: 4
> client.event-threads: 4
> performance.read-ahead: off
> cluster.lookup-optimize: on
> performance.cache-size: 1GB
> cluster.self-heal-daemon: enable
> transport.address-family: inet
> nfs.disable: on
> performance.client-io-threads: on
>
>
> The mounts are done as follows in /etc/fstab:
> /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1
> /mnt/citadel_block1 xfs defaults 0 2
> localhost:/apkmirror_data1 /mnt/apkmirror_data1
> glusterfs defaults,_netdev 0 0
>
> I'm really not sure if direct-io-mode mount tweaks
> would do anything here, what the value should be
> set to, and what it is by default.
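> If I understand the docs correctly, it's toggled at mount
> time, something like this (hypothetical; I haven't
> verified it on my setup):
>
> mount -t glusterfs -o direct-io-mode=enable localhost:/apkmirror_data1 /mnt/apkmirror_data1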
>
> The OS is OpenSUSE 42.3, 64-bit. 80GB of RAM, 20
> CPUs, hosted by Linode.
>
> I'd really appreciate any help in the matter.
>
> Thank you.
>
>
> Sincerely,
> Artem
>
>
> On Thu, Apr 5, 2018 at 11:13 PM, Artem
> Russakovskii <archon810 at gmail.com> wrote:
>
> Hi,
>
> I'm trying to squeeze performance out of
> gluster on four 80GB-RAM, 20-CPU machines where
> Gluster runs on attached block storage
> (Linode) as 4 replicate bricks, and so far
> everything I've tried results in sub-optimal
> performance.
>
> There are many files - mostly images, several
> million of them - and many operations take
> minutes. Copying multiple files (even small
> ones) suddenly freezes for seconds at a time,
> then continues; iostat frequently shows large
> r_await and w_await values with 100% utilization
> for the attached block device; and so on.
>
> Anyway, there are many guides out there
> for small-file performance improvements, but
> more explanation is needed, and I think more
> tweaks should be possible.
>
> My question today is
> about performance.cache-size. Is this the size
> of a cache in RAM? If so, how do I view the
> current cache usage to see whether it gets full
> and I should increase its size? Is it advisable
> to bump it up if I have many tens of gigs of
> RAM free?
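> For what it's worth, getting and setting the option itself
> seems straightforward (4GB below is just an illustrative
> value); what I can't find is a way to see the cache's
> current *usage*:
>
> gluster volume get apkmirror_data1 performance.cache-size
> gluster volume set apkmirror_data1 performance.cache-size 4GB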
>
>
>
> More generally, in the last 2 months since I
> first started working with gluster and set a
> production system live, I've been feeling
> frustrated because Gluster has a lot of
> poorly-documented and confusing options. I
> really wish the documentation could be improved
> with examples and better explanations.
>
> Specifically, it'd be absolutely amazing if
> the docs offered a strategy for setting each
> value and ways of determining more optimal
> values. For example,
> for performance.cache-size, if the docs said
> something like "run command abc to see your
> current cache size, and if it's hurting, up
> it, but be aware that it's limited by RAM,"
> that alone would be a huge improvement. And
> so on with other options.
>
>
>
> The gluster team is quite helpful on this
> mailing list, but in a reactive rather than
> proactive way. Perhaps it's the tunnel vision
> that sets in once you've worked on a project
> for so long, where less-technical explanations
> and even proper documentation of options take
> a back seat, but I encourage you to be more
> proactive about helping us understand and
> optimize Gluster.
>
> Thank you.
>
> Sincerely,
> Artem
>
>
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users