[Gluster-users] performance

Thu Aug 20 00:46:41 UTC 2020

Hi Strahil,

Over the last two weeks the system has been relatively stable.  I 
have powered off both servers at least once, for about 5 minutes each 
time.  Each server came up and auto-healed what it needed to, so all of 
that part is working as expected.

I will answer inline and follow up with more questions:

>>> Hm...  OK. I guess you can try 7.7 whenever it's possible.
>>
>> Acknowledged.

Still on my list.

> It could be a bad firmware also. If you get the opportunity,  flash the firmware and bump the OS to the max.

The datacenter says everything was up to date as of installation, and I 
don't really want them to take the servers offline for long enough to 
redo all the hardware.

>>>> more number of CPU cycles than needed, increasing the event thread
>>>> count would enhance the performance of the Red Hat Storage Server."
>>>> which is why I had it at 8.
>>> Yeah, but you got only 6 cores  and they are not dedicated for
>>> gluster only. I think that you need to test with lower values.

I worked out that my magic number for client/server event threads 
should be 5.  I set it to 5 and observed no change I could attribute to 
it, so I tried 4 and got the same thing: no visible effect.

>>>> right now the only suggested parameter I haven't played with is the
>>>> performance.io-thread-count, which I currently have at 64.
>> not really sure what would be a reasonable value for my system.
> I guess you can try to increase it a little bit and check how is it going.

Turns out that if you try to set this higher than 64, you get an error 
saying 64 is the max.

>>> What I/O scheduler are you using for the SSDs (you can check via
>>> 'cat /sys/block/sdX/queue/scheduler')?
>>
>> # cat /sys/block/vda/queue/scheduler
>> [mq-deadline] none
> 
> Deadline prioritizes  reads in a 2:1 ratio /default tunings/ . You can consider testing 'none' if your SSDs are good.

I did this.  I would say it did have a positive effect, but it was a 
minimal one.
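
For anyone following along, checking and switching the scheduler looks 
roughly like the sketch below.  vda is the device from the output above; 
active_scheduler is a made-up helper name, not an existing tool.  As far 
as I know a change made this way does not persist across a reboot.

```shell
#!/bin/sh
# Print the active scheduler, i.e. the name inside [brackets] in the
# text read from /sys/block/<dev>/queue/scheduler.
# active_scheduler is a hypothetical helper, not part of any package.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage (as root; vda is the device shown above):
#   active_scheduler "$(cat /sys/block/vda/queue/scheduler)"
#   echo none > /sys/block/vda/queue/scheduler
#   active_scheduler "$(cat /sys/block/vda/queue/scheduler)"
```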

> I see vda , please share details on the infra as this is very important. Virtual disks have their limitations and if you are on a VM,  then there might be chance to increase the CPU count.
> If you are on a VM, I would recommend you to use more (in numbers)  and smaller disks in stripe sets (either raid0  via mdadm,  or pure striped LV).
> Also, if you are  on a VM -> there is no reason to reorder  your I/O requests  in the VM, just to do  it again on the Hypervisour. In such case 'none' can bring better performance,  but this varies on the workload.

Hm, this is a good question, and one I have been asking the datacenter 
for a while, but they are a little bit slippery about what exactly they 
have going on there.  They advertise the servers as metal with a virtual 
layer.  The virtual layer is so you can log into a site and power the 
server down or up, mount an ISO to boot from, access a console, and some 
other nifty things.  You can't any more, but when they first introduced 
the system, you could even access the BIOS of the server.  But apparently, 
and they swear up and down by this, it is a physical server, with real 
dedicated SSDs and real sticks of RAM.  I have found virtio and qemu as 
loaded kernel modules, so certainly there is something virtual involved, 
but other than that and their nifty little tools, it has always acted 
and worked like a metal server to me.

> All necessary data is in the file attributes on the brick. I doubt you will need to have access times on the brick itself. Another possibility is to use 'relatime'.

I remounted all bricks with noatime; no significant difference.
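
A rough sketch of the remount and a check that the option took effect.  
The mount point /var/GlusterBrick/replset-0 is an assumption based on 
the brick paths in the volume info at the bottom (the real mount point 
may be different), and has_mount_opt is a made-up helper name.

```shell
#!/bin/sh
# Succeed if the comma-separated mount option list $1 contains option $2.
# has_mount_opt is a hypothetical helper, not an existing command.
has_mount_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage (as root; the mount point is an assumption, adjust to your layout):
#   mount -o remount,noatime /var/GlusterBrick/replset-0
#   has_mount_opt "$(findmnt -n -o OPTIONS /var/GlusterBrick/replset-0)" noatime \
#       && echo "noatime is active"
```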

>> cache unless flush-behind is on.  So seems that is a way to throw ram
>> to it?  I put performance.write-behind-window-size: 512MB and
>> performance.flush-behind: on and the whole system calmed down pretty
>> much immediately.  could be just timing, though, will have to see
>> tomorrow during business hours whether the system stays at a reasonable

I tried increasing write-behind-window-size to its max of 1GB; no 
noticeable change from 512MB.

The 2nd server is not behaving in line with the first server.  Its 
glusterfsd processes are running at 50-80% of a core each, with one 
brick often going over 200%, whereas they usually stick to 30-45% on the 
first server.  Apache processes consume as much as 90% of a core, 
whereas they rarely go over 15% on the first server, and they frequently 
stack up to more than 100 running at once, which drives the load average 
up to 40-60.  It's very much like the first server was before I found the 
flush-behind setting, but not as bad; at least it isn't going completely 
non-responsive.
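
To put numbers on the difference between the two servers, something 
like the following could be run on each.  This is only a sketch: 
total_cpu is a made-up helper name, and ps reports CPU averaged over 
each process's lifetime, so the figures will not match what top shows at 
any given moment.

```shell
#!/bin/sh
# Sum a column of CPU percentages read from stdin, one number per line.
# total_cpu is a hypothetical helper, not part of gluster or procps.
total_cpu() {
    awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

# Usage on each server:
#   ps -C glusterfsd -o pid=,pcpu=,args=      # per-brick breakdown
#   ps -C glusterfsd -o pcpu= | total_cpu     # total across all bricks
#   ps -C apache2 -o pcpu= | total_cpu        # process name is an assumption
#
# Gluster's own per-brick statistics may also help when comparing:
#   gluster volume profile webisms start
#   gluster volume profile webisms info
```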

Additionally, it is still taking an excessive time to load the first 
page of most sites.  I am guessing I need to increase read speeds to fix 
this, so I have played with 
performance.io-cache/cache-max-file-size (slight positive change), 
read-ahead/read-ahead-page-count (negative change until the page count 
was set to its max of 16, then no noticeable difference), and 
rda-cache-limit/rda-request-size (minimal positive effect).  I still have 
RAM to spare, so it would be nice if I could use it to improve the read 
side of things, but I have found no magic bullet like flush-behind was.

I found a good number of additional options to try and have been going 
a little crazy with them; the current settings are posted at the bottom.  
I also found a post that suggested mount options are important:

https://lists.gluster.org/pipermail/gluster-users/2018-September/034937.html

I confirmed these are in the man pages, so I tried unmounting and 
re-mounting with the -o option to include them, like so:

mount -t glusterfs moogle:webisms /Computerisms/ -o \
negative-timeout=10,attribute-timeout=30,fopen-keep-cache,direct-io-mode=enable,fetch-attempts=5

But I don't think they are working:

/# mount | grep glus
moogle:webisms on /Computerisms type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
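
One other place to look, as a sketch: my understanding (an assumption, 
not verified on this system) is that the mount.glusterfs helper hands 
options like these to the glusterfs client process as command-line 
arguments rather than to the kernel, in which case they would not be 
listed by mount at all.  client_opts is a made-up helper name.

```shell
#!/bin/sh
# Print every long option (--name or --name=value) found in the given
# command-line text, one per line.
# client_opts is a hypothetical helper, not an existing command.
client_opts() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -e '^--'
}

# Usage: list the arguments of the running glusterfs client process(es).
# Other gluster daemons may show up under the same process name.
#   client_opts "$(ps -C glusterfs -o args=)"
```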

I would be grateful for any other suggestions anyone can think of.

root@moogle:/# gluster v info

Volume Name: webisms
Type: Distributed-Replicate
Volume ID: 261901e7-60b4-4760-897d-0163beed356e
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: mooglian:/var/GlusterBrick/replset-0/webisms-replset-0
Brick2: moogle:/var/GlusterBrick/replset-0/webisms-replset-0
Brick3: moogle:/var/GlusterBrick/replset-0-arb/webisms-replset-0-arb (arbiter)
Brick4: moogle:/var/GlusterBrick/replset-1/webisms-replset-1
Brick5: mooglian:/var/GlusterBrick/replset-1/webisms-replset-1
Brick6: mooglian:/var/GlusterBrick/replset-1-arb/webisms-replset-1-arb (arbiter)
Options Reconfigured:
performance.rda-cache-limit: 1GB
performance.client-io-threads: off
nfs.disable: on
storage.fips-mode-rchecksum: off
transport.address-family: inet
performance.stat-prefetch: on
network.inode-lru-limit: 200000
performance.write-behind-window-size: 1073741824
performance.readdir-ahead: on
performance.io-thread-count: 64
performance.cache-size: 12GB
server.event-threads: 4
client.event-threads: 4
performance.nl-cache-timeout: 600
auth.allow: xxxxxx
performance.open-behind: off
performance.quick-read: off
cluster.lookup-optimize: off
cluster.rebal-throttle: lazy
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
performance.flush-behind: on
cluster.read-hash-mode: 0
performance.strict-o-direct: on
cluster.readdir-optimize: on
cluster.lookup-unhashed: off
performance.cache-refresh-timeout: 30
performance.enable-least-priority: off
cluster.choose-local: on
performance.rda-request-size: 128KB
performance.read-ahead: on
performance.read-ahead-page-count: 16
performance.cache-max-file-size: 5MB
performance.io-cache: on
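
Since the list above is long, a small helper like this can pull a 
single value out of the gluster v info output when comparing settings 
between test runs.  A sketch only; vol_opt is a made-up helper name.

```shell
#!/bin/sh
# Read `gluster v info` text on stdin and print the value of the option
# named in $1, matching "name: value" lines exactly on the name.
# vol_opt is a hypothetical helper, not a gluster subcommand.
vol_opt() {
    awk -F': ' -v opt="$1" '$1 == opt { print $2 }'
}

# Usage:
#   gluster v info webisms | vol_opt performance.io-thread-count
#   gluster v info webisms > before.txt   # snapshot before changing anything
```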