[Gluster-users] poor performance

Mon Dec 19 07:15:57 UTC 2022

Hi Joe,

I've read your blog extensively and frequently reference it to 
correlate my own findings.  It has been one of the better sources 
of information over the years.  Sorry for the really, really long email 
below, but I reckon it's required at this stage to explain what's going 
on from what we can see.

Some of the replies I've received are of the form "use VMs for serving 
content and use glusterfs for the backing store only".  The problem with 
this is that running 1000+ VMs for websites that in some cases don't 
even serve more than 10 users a day is an extreme waste of 
resources, in particular with respect to RAM.  Docker may limit the 
impact, but that's more complex to achieve.

Varnish and squid only really help if the content is set to be cached, 
otherwise all requests hit the backend servers anyway.  That said, yes, 
we should deploy varnish/squid as a reverse proxy at some point, so 
perhaps this should be step one.  So effectively haproxy => 
varnish/squid => haproxy => apache/php (the second haproxy can probably 
be eliminated since varnish/squid should know how to load balance 
between multiple back-end servers, plus SSL can then be offloaded away 
from apache too).

None of this solves the underlying problem though:  with nl-cache, 
performance is good (enough) but the filesystem is inconsistent; without 
nl-cache, performance is terrible to the point where we are considering 
shelving redundancy.  Merely migrating to VMs doesn't actually solve the 
redundancy problem, as the VM still remains a single point of failure.
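
To make the trade-off concrete, here is a toy model of a negative-lookup 
cache with a TTL.  It is only a sketch in Python and NOT how glusterfs 
implements nl-cache; the class, the path and the simulated clock are all 
made up for illustration.  It shows why a cached ENOENT hides a file 
created elsewhere, and how a working TTL (or an invalidation message) 
would bound that window.

```python
import time


class NegativeLookupCache:
    """Toy model of a negative-lookup cache; NOT glusterfs's nl-cache code."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._missing = {}  # path -> time the "does not exist" result was cached

    def lookup(self, path, backend_exists):
        """Return True if path exists, consulting the cache before the backend."""
        cached_at = self._missing.get(path)
        if cached_at is not None:
            if self.clock() - cached_at < self.ttl:
                return False  # served from cache, no backend round trip
            del self._missing[path]  # entry expired, ask the backend again
        exists = backend_exists(path)
        if not exists:
            self._missing[path] = self.clock()
        return exists

    def invalidate(self, path):
        """What a create notification from the server side should trigger."""
        self._missing.pop(path, None)


# Simulated backend and clock so the example runs without a real filesystem.
files = set()
now = [0.0]
cache = NegativeLookupCache(ttl_seconds=60, clock=lambda: now[0])

first = cache.lookup("/htdocs/wp-config.php", files.__contains__)  # miss, cached
files.add("/htdocs/wp-config.php")                                 # created elsewhere
stale = cache.lookup("/htdocs/wp-config.php", files.__contains__)  # still "missing"
now[0] = 61.0
fresh = cache.lookup("/htdocs/wp-config.php", files.__contains__)  # TTL expired
print(first, stale, fresh)
```

This prints "False False True":  the second lookup is the inconsistency 
we see, and in this model it lasts at most one TTL.  What we observe on 
the real volume is as if the expiry branch never runs.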

One consideration could be to use docker instances instead, with 
exactly one docker instance per virtual host.  I'm not sure this solves 
the performance issue though, since each docker instance will still need 
to access the filesystem, unless I can export a *block* device via 
gfapi.  That works for KVM, but KVM is too RAM intensive:  it requires a 
VM per virtual host, each with at least 1GB RAM, which adds up to at 
least 1TB of RAM required per physical node, and I'm fairly certain CPU 
usage will increase significantly too.

One other solution currently being contemplated is to use lsync to 
maintain a cold standby host instead of a load-balanced setup.  
Switch-over will have to be manual, and the risks w.r.t. data 
consistency (how up to date the standby is) are also not something I 
really want to contemplate.  This would allow us to leave most of the 
rest of the configuration intact.  Here, however, lies the problem, as 
per the github page:

"synchronize a local directory tree with low profile of expected changes 
to a remote mirror." ... this is definitely NOT low profile.

First prize:  Sort out the filesystem inconsistency when using nl-cache, 
or at least dramatically reduce the time-period of the inconsistency 
from infinite to a relatively short period (eg, 30 to 60 seconds).

Second prize:  Get close to nl-cache performance without nl-cache.  This 
doesn't seem feasible whilst still using php.

Third prize:  Sort out php to not have as many negative filesystem 
hits.  realpath_cache_size doesn't seem to make sufficient difference.  
The default incidentally is no longer 16KB but 64KB (combined with a 
default realpath_cache_ttl=120, which could go up to say 86400), so I'm 
guessing I can push this to 512KB or even 1MB, and so spend 1-2GB of RAM 
on this.  We may also need to switch the php-fpm process manager to keep 
per-vhost processes around for longer, but this isn't a major concern 
since we've got a reasonable amount of RAM available.  Unless, that is, 
this realpath_cache is persistent over multiple php-fpm processes.
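
A rough sketch of the RAM arithmetic behind that 1-2GB estimate, in 
Python.  It assumes the realpath cache is held separately by each 
php-fpm worker (which the paragraph above leaves open), and the worker 
count of 2048 is purely an assumed figure chosen for illustration, not a 
measurement from our nodes.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def realpath_cache_budget(cache_size_bytes, worker_count):
    """Worst-case RAM if every php-fpm worker fills its own realpath cache."""
    return cache_size_bytes * worker_count


# ASSUMPTION: 2048 php-fpm workers per node; this figure is illustrative only.
workers = 2048
for size in (64 * KIB, 512 * KIB, 1 * MIB):
    total = realpath_cache_budget(size, workers)
    print(f"{size // KIB:>5} KiB x {workers} workers = {total / GIB:.3f} GiB")
```

With that assumed worker count, 512KB per worker comes to 1GB and 1MB 
per worker to 2GB, and only if every worker actually fills its cache.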

https://pecl.php.net/apcu just came onto my radar now, can definitely 
also investigate that.  APC itself is dead from the looks of it.  
Looking at the docs though, the mechanism to avoid that stat() call is 
no longer present either.  And the primary goal of avoiding the stat() 
call was to avoid self-heal (which is nowadays off on glusterfs side by 
default anyway).  So not sure this will make a significant difference.

Otherwise, I've read through that specific blog entry so many times 
that I can mostly recall the recommendations from memory.  You 
still reference glusterfs 3.2.6 ... we're at 10.2, and we're running 
with an extra inode-table-size patch by yours truly which helps avoid 
lock contention when you have >64k files in the active set.  There are 
other tricks and hacks too, such as limiting invalidate-limit to 16 or 
32.  Recommendations currently seem to be in the 128-256 region, but we 
found that anything over 32 is simply untenable if lru-limit >> 
inode-table-size.  At 16 we pretty much avoid all latency spikes, with 
the caveat that it's quite possible for the number of entries in the 
inode table to exceed lru-limit for reasonable periods of time.  We 
reason that's just an indicator that you should probably be increasing 
lru-limit, and quite possibly inode-table-size too (patches on 
github).  The recommendation regarding RDMA over Infiniband is also no 
longer possible, since infiniband support in glusterfs has been abandoned.

One other option that has not been mentioned is to use cluster-lvm and 
basically export PVs from glusterfs, which can then be divided into 
cluster-aware VGs, such that they're only active on one node at a time, 
and then run some posix filesystem directly on those.  We would 
otherwise retain the current setup, with the caveat that each vhost will 
be active only on one specific node, which means we will need a 
suitable mechanism to ensure that all requests for a vhost always hit 
the right physical node.
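
The kind of mechanism meant here could be as simple as a lookup table 
from vhost to the node where its VG is currently active.  A minimal 
sketch in Python follows; the vhost and node names are hypothetical, and 
in practice this table would more likely live in the load balancer 
configuration than in application code.

```python
def build_routes(active_node_by_vhost):
    """Return a function mapping a Host header to the node holding its VG."""
    routes = {vhost.lower(): node for vhost, node in active_node_by_vhost.items()}

    def route(host_header):
        host = host_header.split(":", 1)[0].strip().lower()  # drop any :port
        return routes.get(host)  # None means "no node known for this vhost"

    return route


# Hypothetical vhost and node names, for illustration only.
route = build_routes({"example.org": "node-a", "shop.example.net": "node-b"})
print(route("Example.org:443"))
print(route("unknown.example.com"))
```

The hard part is not the lookup but keeping the table in step with 
whichever node has the VG active after a switch-over.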

Kind Regards,
Jaco

On 2022/12/14 17:37, Joe Julian wrote:
> PHP is not a good filesystem user. I've written about this a while 
> back: 
> https://joejulian.name/post/optimizing-web-performance-with-glusterfs/
>
> On December 14, 2022 6:16:54 AM PST, Jaco Kroon <jaco at uls.co.za> wrote:
>
>     Hi Peter,
>
>     Yes, we could, but with ~1000 vhosts that gets extremely
>     cumbersome to maintain and to get clients to be able to manage
>     their own stuff.  Essentially, unless the htdocs/ folder is on a
>     single filesystem we're going to need to get involved with each
>     and every update, which isn't feasible.  Then I'd rather partition
>     the vhosts such that half run on one server and the other half on
>     the other server and risk downtime.
>
>     Our experience indicates that the slow part is in fact not the
>     execution of the php code but php locating the files.  It
>     tries a bunch of folders with stat() and/or open() and gets the
>     ordering wrong, resulting in numerous ENOENT errors before hitting
>     the right locations, after which it actually does quite well.  On
>     code I wrote, which does NOT suffer this problem quite as badly as
>     wordpress, we find that from a local filesystem we get 200ms on
>     full processing (idle system, nvme physical disk, although I doubt
>     this matters since the fs layer should have most of this cached in
>     RAM anyway) vs 300ms on top of glusterfs.  The bricks barely ever
>     go to disk (fs layer caching) according to the system stats we
>     gathered.
>
>     How do big hosting entities like wordpress.org (iirc) deal with
>     this?  Because honestly, I doubt they do single-server setups.
>     Then again, I reckon that if you ONLY host wordpress (based on
>     experience) it's possible to have a single master copy of
>     wordpress on each server, with a lsync'ed themes/ folder for each
>     vhost and a shared (glusterfs) uploads folder.  Enter things like
>     wordfence that insist on being able to write to alternative
>     locations.
>
>     Anyway, barring using glusterfs we can certainly come up with
>     solutions, which may even include having *some* sites run on the
>     shared setup, and others on single-host, possibly with something
>     like lsync keeping a "semi hot standby" up to date.  That does
>     get complex though.
>
>     Our ideal solution remains a fairly performant clustered
>     filesystem such as glusterfs (with which we have a lot of
>     experience, including using it for large email clusters where its
>     performance is excellent, but I would have LOVED inotify
>     support).  With nl-cache the performance is adequate, however, the
>     cache-invalidation doesn't seem to function properly, which I
>     believe can be solved, either by fixing settings, or by fixing
>     code bugs.  Basically whenever a file is modified or a new file is
>     created, clients should be alerted in order to invalidate cache.
>     Since this cluster is mostly-read, some write, and there are only
>     two clients, this should be perfectly manageable, and there seem
>     to be hints of this in the gluster volume options already:
>
>     # gluster volume get volname all | grep invalid
>     performance.quick-read-cache-invalidation  false (DEFAULT)
>     performance.ctime-invalidation             false (DEFAULT)
>     performance.cache-invalidation             on
>     performance.global-cache-invalidation      true (DEFAULT)
>     features.cache-invalidation                on
>     features.cache-invalidation-timeout        600
>
>     Kind Regards,
>     Jaco
>
>     On 2022/12/14 14:56, Péter Károly JUHÁSZ wrote:
>
>>     We did this with WordPress too. It uses tons of static files;
>>     executing them is the slow part. You can rsync them and use the
>>     upload dir from glusterfs.
>>
>>     On Wed, 14 Dec 2022 at 13:20, Jaco Kroon <jaco at uls.co.za> wrote:
>>
>>         Hi,
>>
>>         The problem is files generated by wordpress, and uploads etc
>>         ... so copying them to frontend hosts, whilst making perfect
>>         sense, assumes I have control over the code to not write to
>>         the local front-end, else we could have relied on something
>>         like lsync.
>>
>>         As it stands, performance is acceptable with nl-cache
>>         enabled, but the fact that we get those ENOENT errors is
>>         highly problematic.
>>
>>
>>         Kind Regards,
>>         Jaco Kroon
>>
>>
>>         On 2022/12/14 14:04, Péter Károly JUHÁSZ wrote:
>>
>>>         When we used glusterfs for websites, we copied the web dir
>>>         from gluster to local on frontend boots, then served it from
>>>         there.
>>>
>>>         On Wed, 14 Dec 2022 at 12:49, Jaco Kroon <jaco at uls.co.za> wrote:
>>>
>>>             Hi All,
>>>
>>>             We've got a glusterfs cluster that houses some php web
>>>             sites.
>>>
>>>             This is generally considered a bad idea and we can see why.
>>>
>>>             With performance.nl-cache on it actually turns out to be
>>>             very reasonable, however, with this turned off performance
>>>             is roughly 5x worse, meaning a request that would take sub
>>>             500ms now takes 2500ms.  In other cases we see far, far
>>>             worse, eg, with nl-cache a request takes ~1500ms, without
>>>             it takes ~30s (20x worse).
>>>
>>>             So why not use nl-cache?  Well, it results in readdir
>>>             reporting files which then fail to open with ENOENT.  The
>>>             cache also never clears even though the configuration says
>>>             nl-cache entries should only be cached for 60s.  Even for
>>>             "ls -lah" in affected folders you'll notice ???? mark
>>>             entries for attributes on files.  If this recovered in a
>>>             reasonable time (say, a few seconds), sure.
>>>
>>>             # gluster volume info
>>>             Type: Replicate
>>>             Volume ID: cbe08331-8b83-41ac-b56d-88ef30c0f5c7
>>>             Status: Started
>>>             Snapshot Count: 0
>>>             Number of Bricks: 1 x 2 = 2
>>>             Transport-type: tcp
>>>             Options Reconfigured:
>>>             performance.nl-cache: on
>>>             cluster.readdir-optimize: on
>>>             config.client-threads: 2
>>>             config.brick-threads: 4
>>>             config.global-threading: on
>>>             performance.iot-pass-through: on
>>>             storage.fips-mode-rchecksum: on
>>>             cluster.granular-entry-heal: enable
>>>             cluster.data-self-heal-algorithm: full
>>>             cluster.locking-scheme: granular
>>>             client.event-threads: 2
>>>             server.event-threads: 2
>>>             transport.address-family: inet
>>>             nfs.disable: on
>>>             cluster.metadata-self-heal: off
>>>             cluster.entry-self-heal: off
>>>             cluster.data-self-heal: off
>>>             cluster.self-heal-daemon: on
>>>             server.allow-insecure: on
>>>             features.ctime: off
>>>             performance.io-cache: on
>>>             performance.cache-invalidation: on
>>>             features.cache-invalidation: on
>>>             performance.qr-cache-timeout: 600
>>>             features.cache-invalidation-timeout: 600
>>>             performance.io-cache-size: 128MB
>>>             performance.cache-size: 128MB
>>>
>>>             Are there any other recommendations short of abandoning
>>>             all hope of redundancy and reverting to a single-server
>>>             setup (for the web code at least)?  Currently the cost of
>>>             the redundancy seems to outweigh the benefit.
>>>
>>>             Glusterfs version 10.2, with a patch for
>>>             --inode-table-size.  Mounts happen with:
>>>
>>>             /usr/sbin/glusterfs --acl --reader-thread-count=2 \
>>>                 --lru-limit=524288 --inode-table-size=524288 \
>>>                 --invalidate-limit=16 --background-qlen=32 \
>>>                 --fuse-mountopts=nodev,nosuid,noexec,noatime \
>>>                 --process-name fuse --volfile-server=127.0.0.1 \
>>>                 --volfile-id=gv_home \
>>>                 --fuse-mountopts=nodev,nosuid,noexec,noatime /home
>>>
>>>             Kind Regards,
>>>             Jaco
>>>