[Gluster-users] performance.cache-size for high-RAM clients/servers, other tweaks for performance, and improvements to Gluster docs
Artem Russakovskii
archon810 at gmail.com
Wed Apr 18 06:29:18 UTC 2018
Btw, I've now noticed at least 5 variations for toggling binary option
values. Are they all interchangeable, or will using the wrong one simply
not work in some cases?
yes/no
true/false
True/False
on/off
enable/disable
It's quite a confusing and inconsistent practice, especially given that many
options will accept any value without validation or an error.
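For what it's worth, the only sanity check I've found is to set an option and
then read it back, e.g. (using one of my volumes, and assuming gluster volume
get reflects what was actually parsed rather than just what I typed):

gluster volume set apkmirror_data1 cluster.readdir-optimize on
gluster volume get apkmirror_data1 cluster.readdir-optimize

but that still doesn't tell me whether "on" and "enable" are treated
identically internally.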
Sincerely,
Artem
--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | +ArtemRussakovskii
<https://plus.google.com/+ArtemRussakovskii> | @ArtemR
<http://twitter.com/ArtemR>
On Tue, Apr 17, 2018 at 11:22 PM, Artem Russakovskii <archon810 at gmail.com>
wrote:
> Thanks for the link. Looking at the status of that doc, it isn't quite
> ready yet, and there's no mention of the option.
>
> Does it mean that whatever is ready now in 4.0.1 is incomplete but can be
> enabled via granular-entry-heal=on, and when it is complete, it'll become
> the default and the flag will simply go away?
>
> Is there any risk enabling the option now in 4.0.1?
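> (For reference, I'm assuming the way to flip it would just be something like
> gluster volume set apkmirror_data1 cluster.granular-entry-heal enable
> unless there's a dedicated heal command for it.)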
>
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
> <http://www.apkmirror.com/>, Illogical Robot LLC
> beerpla.net | +ArtemRussakovskii
> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
> <http://twitter.com/ArtemR>
>
> On Tue, Apr 17, 2018 at 11:16 PM, Ravishankar N <ravishankar at redhat.com>
> wrote:
>
>>
>>
>> On 04/18/2018 10:35 AM, Artem Russakovskii wrote:
>>
>> Hi Ravi,
>>
>> Could you please expand on how these would help?
>>
>> By forcing 'full' here, we move the work from the CPU to the network, thus
>> decreasing CPU utilization. Is that right?
>>
>> Yes, 'diff' employs the rchecksum FOP, which computes a sha256 checksum and
>> can consume CPU. So yes, it is sort of shifting the load from the CPU to the
>> network. But if your average file size is small, it would make sense to
>> copy the entire file instead of computing checksums.
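>> If you want to sanity-check your average file size, a rough way to do it is
>> to run something like this directly on a brick (the path is just an example,
>> point it at one of your bricks):
>> find /mnt/citadel_block1/apkmirror_data1 -type f -printf '%s\n' | awk '{s+=$1; n++} END {if (n) print s/n, "bytes on average over", n, "files"}'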
>>
>> This is assuming the CPU and disk utilization are caused by the diff
>> algorithm and not by lstat and other calls.
>>
>>> Option: cluster.data-self-heal-algorithm
>>> Default Value: (null)
>>> Description: Select between "full", "diff". The "full" algorithm copies
>>> the entire file from source to sink. The "diff" algorithm copies to sink
>>> only those blocks whose checksums don't match with those of source. If no
>>> option is configured the option is chosen dynamically as follows: If the
>>> file does not exist on one of the sinks or empty file exists or if the
>>> source file size is about the same as page size the entire file will be
>>> read and written i.e "full" algo, otherwise "diff" algo is chosen.
>>
>>
>> I really have no idea what this means and how/why it would help. Any more
>> info on this option?
>>
>>
>> https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.8/granular-entry-self-healing.md
>> should help.
>> Regards,
>> Ravi
>>
>>
>> Option: cluster.granular-entry-heal
>>> Default Value: no
>>> Description: If this option is enabled, self-heal will resort to
>>> granular way of recording changelogs and doing entry self-heal.
>>
>>
>> Thank you.
>>
>>
>> Sincerely,
>> Artem
>>
>> --
>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>> <http://www.apkmirror.com/>, Illogical Robot LLC
>> beerpla.net | +ArtemRussakovskii
>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>> <http://twitter.com/ArtemR>
>>
>> On Tue, Apr 17, 2018 at 9:58 PM, Ravishankar N <ravishankar at redhat.com>
>> wrote:
>>
>>>
>>> On 04/18/2018 10:14 AM, Artem Russakovskii wrote:
>>>
>>> Following up here on a related issue that is very serious for us.
>>>
>>> I took down one of the 4 replicate gluster servers for maintenance
>>> today. There are 2 gluster volumes totaling about 600GB - not that much
>>> data. After the server came back online, it started auto healing, and pretty
>>> much all operations on gluster froze for many minutes.
>>>
>>> For example, I was trying to run an ls -alrt in a folder with 7300
>>> files, and it took a good 15-20 minutes before returning.
>>>
>>> During this time, iostat shows 100% utilization on the brick, heal status
>>> takes many minutes to return, and glusterfsd uses up tons of CPU (I saw it
>>> spike to 600%). Gluster already has massive performance issues for me, but
>>> healing after a 4-hour downtime is on another level of bad performance.
>>>
>>> For example, this command took many minutes to run:
>>>
>>> gluster volume heal androidpolice_data3 info summary
>>> Brick nexus2:/mnt/nexus2_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 91
>>> Number of entries in heal pending: 90
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 1
>>>
>>> Brick forge:/mnt/forge_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 87
>>> Number of entries in heal pending: 86
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 1
>>>
>>> Brick hive:/mnt/hive_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 87
>>> Number of entries in heal pending: 86
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 1
>>>
>>> Brick citadel:/mnt/citadel_block4/androidpolice_data3
>>> Status: Connected
>>> Total Number of entries: 0
>>> Number of entries in heal pending: 0
>>> Number of entries in split-brain: 0
>>> Number of entries possibly healing: 0
>>>
>>>
>>> Statistics showed a diminishing number of failed heals:
>>> ...
>>> Ending time of crawl: Tue Apr 17 21:13:08 2018
>>>
>>> Type of crawl: INDEX
>>> No. of entries healed: 2
>>> No. of entries in split-brain: 0
>>> No. of heal failed entries: 102
>>>
>>> Starting time of crawl: Tue Apr 17 21:13:09 2018
>>>
>>> Ending time of crawl: Tue Apr 17 21:14:30 2018
>>>
>>> Type of crawl: INDEX
>>> No. of entries healed: 4
>>> No. of entries in split-brain: 0
>>> No. of heal failed entries: 91
>>>
>>> Starting time of crawl: Tue Apr 17 21:14:31 2018
>>>
>>> Ending time of crawl: Tue Apr 17 21:15:34 2018
>>>
>>> Type of crawl: INDEX
>>> No. of entries healed: 0
>>> No. of entries in split-brain: 0
>>> No. of heal failed entries: 88
>>> ...
>>>
>>> Eventually, everything heals and goes back to at least where the roof
>>> isn't on fire anymore.
>>>
>>> The server stats and volume options were given in one of the previous
>>> replies to this thread.
>>>
>>> Any ideas or things I could run and show the output of to help diagnose?
>>> I'm also very open to working with someone on the team on a live debugging
>>> session if there's interest.
>>>
>>>
>>> It is likely that self-heal is causing the CPU spike due to the flood of
>>> lookups, locks, and checksum fops that the self-heal daemon sends to the
>>> bricks.
>>> There's a script to control shd's CPU usage using cgroups, which should help
>>> in regulating self-heal traffic:
>>> https://review.gluster.org/#/c/18404/ (see extras/control-cpu-load.sh)
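>>> Roughly what the script automates, assuming cgroup v1 and that capping shd at
>>> about 25% of one core is acceptable (the script itself is the better-tested
>>> route):
>>> # find the self-heal daemon pid
>>> pgrep -f glustershd
>>> # create a cpu cgroup capped at ~25% of one core
>>> mkdir /sys/fs/cgroup/cpu/glustershd_throttle
>>> echo 100000 > /sys/fs/cgroup/cpu/glustershd_throttle/cpu.cfs_period_us
>>> echo 25000 > /sys/fs/cgroup/cpu/glustershd_throttle/cpu.cfs_quota_us
>>> # put the shd pid into it
>>> echo <shd_pid> > /sys/fs/cgroup/cpu/glustershd_throttle/cgroup.procs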
>>> Other self-heal related volume options that you could change are setting
>>> 'cluster.data-self-heal-algorithm' to 'full' and 'granular-entry-heal'
>>> to 'enable'. `gluster volume set help` should give you more information
>>> about these options.
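>>> For example, with the volume from your output (a sketch; adjust the volume
>>> name as needed):
>>> gluster volume set androidpolice_data3 cluster.data-self-heal-algorithm full
>>> gluster volume set androidpolice_data3 cluster.granular-entry-heal enable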
>>> Thanks,
>>> Ravi
>>>
>>>
>>>
>>> Thank you.
>>>
>>>
>>> Sincerely,
>>> Artem
>>>
>>> --
>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>> beerpla.net | +ArtemRussakovskii
>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>>> <http://twitter.com/ArtemR>
>>>
>>> On Tue, Apr 10, 2018 at 9:56 AM, Artem Russakovskii <archon810 at gmail.com
>>> > wrote:
>>>
>>>> Hi Vlad,
>>>>
>>>> I actually saw that post already and even asked a question 4 days ago (
>>>> https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode#comment1172497_540917).
>>>> The accepted answer also seems to go
>>>> against your suggestion to enable direct-io-mode as it says it should be
>>>> disabled for better performance when used just for file accesses.
>>>>
>>>> It'd be great if someone from the Gluster team chimed in about this
>>>> thread.
>>>>
>>>>
>>>> Sincerely,
>>>> Artem
>>>>
>>>> --
>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>> beerpla.net | +ArtemRussakovskii
>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>>>> <http://twitter.com/ArtemR>
>>>>
>>>> On Tue, Apr 10, 2018 at 7:01 AM, Vlad Kopylov <vladkopy at gmail.com>
>>>> wrote:
>>>>
>>>>> Wish I knew, or was able to get a detailed description of those options,
>>>>> myself. Here is direct-io-mode:
>>>>> https://serverfault.com/questions/517775/glusterfs-direct-i-o-mode
>>>>> Same as you, I ran tests on a large volume of files and found that the main
>>>>> delays are in attribute calls, which is how I ended up with those mount
>>>>> options to improve performance.
>>>>> I discovered those options basically by googling this user list, where
>>>>> people share their tests.
>>>>> Not sure I share your optimism; rather than going up, I downgraded to 3.12
>>>>> and have no dir view issue now, though I had to recreate the cluster and
>>>>> re-add bricks with existing data.
>>>>>
>>>>> On Tue, Apr 10, 2018 at 1:47 AM, Artem Russakovskii <
>>>>> archon810 at gmail.com> wrote:
>>>>>
>>>>>> Hi Vlad,
>>>>>>
>>>>>> I'm using only localhost: mounts.
>>>>>>
>>>>>> Can you please explain what effect each option has on performance
>>>>>> issues shown in my posts?
>>>>>> "negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5"
>>>>>> From what I remember, direct-io-mode=enable didn't make a difference in my
>>>>>> tests, but I suppose I can try again. The explanations about direct-io-mode
>>>>>> are quite confusing on the web in various guides, saying enabling it could
>>>>>> make performance worse in some situations and better in others due to OS
>>>>>> file cache.
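>>>>>> (If I retest it, I'm assuming a one-off mount like this is enough to
>>>>>> compare, without touching fstab:
>>>>>> mount -t glusterfs -o direct-io-mode=enable localhost:/apkmirror_data1 /mnt/dio_test
>>>>>> where /mnt/dio_test is just a scratch mountpoint.)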
>>>>>>
>>>>>> There are also these gluster volume settings, adding to the confusion:
>>>>>> Option: performance.strict-o-direct
>>>>>> Default Value: off
>>>>>> Description: This option when set to off, ignores the O_DIRECT flag.
>>>>>>
>>>>>> Option: performance.nfs.strict-o-direct
>>>>>> Default Value: off
>>>>>> Description: This option when set to off, ignores the O_DIRECT flag.
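>>>>>> (Side note: gluster volume get at least shows what's currently in effect,
>>>>>> e.g.
>>>>>> gluster volume get apkmirror_data1 performance.strict-o-direct
>>>>>> gluster volume get apkmirror_data1 network.remote-dio
>>>>>> though it doesn't explain how these interact with direct-io-mode.)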
>>>>>>
>>>>>> Re: 4.0. I moved to 4.0 after finding out that it fixes the
>>>>>> disappearing dirs bug related to cluster.readdir-optimize if you remember (
>>>>>> http://lists.gluster.org/pipermail/gluster-users/2018-April/033830.html).
>>>>>> I was already on 3.13 by then, and 4.0 resolved the
>>>>>> issue. It's been stable for me so far, thankfully.
>>>>>>
>>>>>>
>>>>>> Sincerely,
>>>>>> Artem
>>>>>>
>>>>>> --
>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>> beerpla.net | +ArtemRussakovskii
>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>>>>>> <http://twitter.com/ArtemR>
>>>>>>
>>>>>> On Mon, Apr 9, 2018 at 10:38 PM, Vlad Kopylov <vladkopy at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You definitely need mount options in /etc/fstab; use the ones from here:
>>>>>>> http://lists.gluster.org/pipermail/gluster-users/2018-April/033811.html
>>>>>>>
>>>>>>> I also ended up using local mounts to get better performance.
>>>>>>>
>>>>>>> Also, 3.12 or 3.10 branches would be preferable for production
>>>>>>>
>>>>>>> On Fri, Apr 6, 2018 at 4:12 AM, Artem Russakovskii <
>>>>>>> archon810 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi again,
>>>>>>>>
>>>>>>>> I'd like to expand on the performance issues and plead for help.
>>>>>>>> Here's one case which shows these odd hiccups:
>>>>>>>> https://i.imgur.com/CXBPjTK.gifv.
>>>>>>>>
>>>>>>>> In this GIF where I switch back and forth between copy operations
>>>>>>>> on 2 servers, I'm copying a 10GB dir full of .apk and image files.
>>>>>>>>
>>>>>>>> On server "hive" I'm copying straight from the main disk to an
>>>>>>>> attached block volume (xfs). As you can see, the transfers are relatively
>>>>>>>> speedy and don't hiccup.
>>>>>>>> On server "citadel" I'm copying the same set of data to a
>>>>>>>> 4-replicate gluster which uses block storage as a brick. As you can see,
>>>>>>>> performance is much worse, and there are frequent pauses for many seconds
>>>>>>>> where nothing seems to be happening - just freezes.
>>>>>>>>
>>>>>>>> All 4 servers have the same specs, and all of them have performance
>>>>>>>> issues with gluster and no such issues when raw xfs block storage is used.
>>>>>>>>
>>>>>>>> hive has long finished copying the data, while citadel is barely
>>>>>>>> chugging along and is expected to take probably half an hour to an hour. I
>>>>>>>> have over 1TB of data to migrate, and once we go live, I'm not even sure
>>>>>>>> gluster would be able to keep up rather than bringing the machines and
>>>>>>>> services down.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Here's the cluster config, though performance didn't seem to change
>>>>>>>> before vs. after I applied the customizations.
>>>>>>>>
>>>>>>>> Volume Name: apkmirror_data1
>>>>>>>> Type: Replicate
>>>>>>>> Volume ID: 11ecee7e-d4f8-497a-9994-ceb144d6841e
>>>>>>>> Status: Started
>>>>>>>> Snapshot Count: 0
>>>>>>>> Number of Bricks: 1 x 4 = 4
>>>>>>>> Transport-type: tcp
>>>>>>>> Bricks:
>>>>>>>> Brick1: nexus2:/mnt/nexus2_block1/apkmirror_data1
>>>>>>>> Brick2: forge:/mnt/forge_block1/apkmirror_data1
>>>>>>>> Brick3: hive:/mnt/hive_block1/apkmirror_data1
>>>>>>>> Brick4: citadel:/mnt/citadel_block1/apkmirror_data1
>>>>>>>> Options Reconfigured:
>>>>>>>> cluster.quorum-count: 1
>>>>>>>> cluster.quorum-type: fixed
>>>>>>>> network.ping-timeout: 5
>>>>>>>> network.remote-dio: enable
>>>>>>>> performance.rda-cache-limit: 256MB
>>>>>>>> performance.readdir-ahead: on
>>>>>>>> performance.parallel-readdir: on
>>>>>>>> network.inode-lru-limit: 500000
>>>>>>>> performance.md-cache-timeout: 600
>>>>>>>> performance.cache-invalidation: on
>>>>>>>> performance.stat-prefetch: on
>>>>>>>> features.cache-invalidation-timeout: 600
>>>>>>>> features.cache-invalidation: on
>>>>>>>> cluster.readdir-optimize: on
>>>>>>>> performance.io-thread-count: 32
>>>>>>>> server.event-threads: 4
>>>>>>>> client.event-threads: 4
>>>>>>>> performance.read-ahead: off
>>>>>>>> cluster.lookup-optimize: on
>>>>>>>> performance.cache-size: 1GB
>>>>>>>> cluster.self-heal-daemon: enable
>>>>>>>> transport.address-family: inet
>>>>>>>> nfs.disable: on
>>>>>>>> performance.client-io-threads: on
>>>>>>>>
>>>>>>>>
>>>>>>>> The mounts are done as follows in /etc/fstab:
>>>>>>>> /dev/disk/by-id/scsi-0Linode_Volume_citadel_block1
>>>>>>>> /mnt/citadel_block1 xfs defaults 0 2
>>>>>>>> localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs
>>>>>>>> defaults,_netdev 0 0
>>>>>>>>
>>>>>>>> I'm really not sure if direct-io-mode mount tweaks would do
>>>>>>>> anything here, what the value should be set to, and what it is by default.
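>>>>>>>> (If I understand it correctly, the knob would go on the mount line,
>>>>>>>> something like
>>>>>>>> localhost:/apkmirror_data1 /mnt/apkmirror_data1 glusterfs defaults,_netdev,direct-io-mode=enable 0 0
>>>>>>>> but I haven't confirmed the default or tried flipping it here yet.)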
>>>>>>>>
>>>>>>>> The OS is OpenSUSE 42.3, 64-bit. 80GB of RAM, 20 CPUs, hosted by
>>>>>>>> Linode.
>>>>>>>>
>>>>>>>> I'd really appreciate any help in the matter.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Artem
>>>>>>>>
>>>>>>>> --
>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>>> beerpla.net | +ArtemRussakovskii
>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>>>>>>>> <http://twitter.com/ArtemR>
>>>>>>>>
>>>>>>>> On Thu, Apr 5, 2018 at 11:13 PM, Artem Russakovskii <
>>>>>>>> archon810 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to squeeze performance out of gluster on 4 machines (80GB RAM,
>>>>>>>>> 20 CPUs each) where Gluster runs on attached block storage (Linode) in 4
>>>>>>>>> replicate bricks, and so far everything I've tried results in sub-optimal
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> There are many files - mostly images, several million - and many
>>>>>>>>> operations take minutes. Copying multiple files (even if they're small)
>>>>>>>>> suddenly freezes for seconds at a time, then continues, and iostat
>>>>>>>>> frequently shows large r_await and w_await values with 100% utilization on
>>>>>>>>> the attached block device, etc.
>>>>>>>>>
>>>>>>>>> Anyway, there are many guides out there for small-file performance
>>>>>>>>> improvements, but more explanation is needed, and I think more tweaks
>>>>>>>>> should be possible.
>>>>>>>>>
>>>>>>>>> My question today is about performance.cache-size. Is this a size
>>>>>>>>> of cache in RAM? If so, how do I view the current cache size to see if it
>>>>>>>>> gets full and I should increase its size? Is it advisable to bump it up if
>>>>>>>>> I have many tens of gigs of RAM free?
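>>>>>>>>> (The closest I've found so far, and I'm not sure it answers the question:
>>>>>>>>> gluster volume get apkmirror_data1 performance.cache-size
>>>>>>>>> shows the configured value, and sending SIGUSR1 to the fuse client, e.g.
>>>>>>>>> kill -USR1 $(pgrep -f 'glusterfs.*apkmirror_data1')
>>>>>>>>> is supposed to drop a statedump under /var/run/gluster with io-cache stats
>>>>>>>>> in it, but I don't know how to read actual cache occupancy out of it.)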
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> More generally, in the last 2 months since I first started working
>>>>>>>>> with gluster and set a production system live, I've been feeling frustrated
>>>>>>>>> because Gluster has a lot of poorly-documented and confusing options. I
>>>>>>>>> really wish documentation could be improved with examples and better
>>>>>>>>> explanations.
>>>>>>>>>
>>>>>>>>> Specifically, it'd be absolutely amazing if the docs offered a
>>>>>>>>> strategy for setting each value and ways of determining more optimal
>>>>>>>>> values. For example, for performance.cache-size, if the docs said something
>>>>>>>>> like "run command abc to see your current cache size, and if it's hurting,
>>>>>>>>> raise it, but be aware that it's limited by RAM," that alone would already
>>>>>>>>> be a huge improvement. And so on with other options.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The gluster team is quite helpful on this mailing list, but in a
>>>>>>>>> reactive rather than proactive way. Perhaps it's tunnel vision after
>>>>>>>>> working on a project for so long, where less technical explanations and
>>>>>>>>> even proper documentation of options take a back seat, but I encourage you
>>>>>>>>> to be more proactive about helping us understand and optimize Gluster.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Artem
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Founder, Android Police <http://www.androidpolice.com>, APK Mirror
>>>>>>>>> <http://www.apkmirror.com/>, Illogical Robot LLC
>>>>>>>>> beerpla.net | +ArtemRussakovskii
>>>>>>>>> <https://plus.google.com/+ArtemRussakovskii> | @ArtemR
>>>>>>>>> <http://twitter.com/ArtemR>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gluster-users mailing list
>>>>>>>> Gluster-users at gluster.org
>>>>>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>
>>
>