<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi Joe,</p>
<p>I've read your blog extensively and frequently reference it to
correlate my own findings with. It has been one of the better
sources of information over the years. Sorry for the really,
really long email below, but I reckon it's required at this stage
to explain what's going on from what we can see.<br>
</p>
<p>Some of the replies I've received are of the form "use VMs for
serving content and use glusterfs for the backing store only". The
problem with this is that running 1000+ VMs for websites that in
some cases don't serve more than 10 users a day is an extreme
waste of resources, in particular with respect to RAM. Docker may
limit the impact, but that's more complex to achieve.<br>
</p>
<p>Varnish and squid only really help if the content is set to be
cached; otherwise all requests hit the backend servers anyway.
That said, yes, we should deploy varnish/squid as a reverse proxy
at some point, so perhaps this should be step one. So effectively
haproxy => varnish/squid => haproxy => apache/php
(probably the second haproxy can be eliminated since varnish/squid
should know how to load balance between multiple back-end servers,
plus SSL can then be offloaded away from apache too).</p>
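<p>A minimal sketch of the varnish hop in that chain, assuming apache
moves to 127.0.0.1:8080 and varnish takes over :6081 (ports are
assumptions; balancing across multiple back-ends would need a small
VCL with a director instead of -b):<br>
<br>
# single-backend reverse proxy with a 2GB in-memory cache<br>
varnishd -a 127.0.0.1:6081 -b 127.0.0.1:8080 -s malloc,2g<br>
<br>
haproxy would then just terminate SSL on :443 and forward to :6081.<br>
</p>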
<p>None of this solves the underlying problem though: with nl-cache
performance is good (enough), but the filesystem is inconsistent;
without nl-cache, performance is terrible to the point where we
are considering shelving redundancy. Merely migrating to VMs
doesn't actually solve the redundancy problem either, as the VM
then remains the single point of failure.</p>
<p>One consideration could be to rather use docker instances, such
that there is exactly one docker instance per virtual host. I'm
not sure this solves the performance issue though, in that each
docker instance will still need to access the filesystem, so it
only helps if I can export a *block* device via gfapi (as per
KVM, but that route is too RAM intensive since it requires a VM
per virtual host; at 1GB+ RAM each that adds up to at least 1TB of
RAM per physical node, and I'm fairly certain CPU usage will be
significantly higher too).</p>
<p>One other solution currently being contemplated is to use lsync
with a cold standby host rather than a load-balanced setup.
Switch-over will have to be manual, and the risks w.r.t. data
consistency (how up to date the standby is) are also not
something I really want to contemplate. This would allow us to
leave most of the rest of the configuration intact. Here however
lies the problem as per the github page:<br>
<br>
"synchronize a local directory tree with low profile of expected
changes to a remote mirror." ... this is definitely NOT low
profile.</p>
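<p>For what it's worth, a rough sketch of the lsync idea using
lsyncd's command-line rsync-over-ssh layer (hostname and paths are
assumptions; a real setup would want a proper config file with delay
and delete handling):<br>
<br>
# continuously mirror the tree to the cold standby via rsync over ssh<br>
lsyncd -rsyncssh /home standby.example.com /home<br>
</p>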
<p>First prize: Sort out the filesystem inconsistency when using
nl-cache, or at least dramatically reduce the window of
inconsistency from infinite to a relatively short period (e.g. 30
to 60 seconds).</p>
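<p>For reference, these are the volume options that should bound how
long a stale negative entry can live (gv_home is taken from the mount
command quoted further down; the values are illustrative, not tested
recommendations):<br>
<br>
# keep negative-lookup entries for at most 60s and rely on upcall invalidation<br>
gluster volume set gv_home performance.nl-cache-timeout 60<br>
gluster volume set gv_home features.cache-invalidation on<br>
gluster volume set gv_home features.cache-invalidation-timeout 600<br>
</p>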
<p>Second prize: Get close to nl-cache performance without
nl-cache. This doesn't seem feasible whilst still using php.</p>
<p>Third prize: sort out php to not generate as many negative
filesystem hits. realpath_cache_size doesn't seem to make a
sufficient difference; the default incidentally is no longer 16KB but
64KB (combined with the realpath_cache_ttl default of 120, which
could be raised to, say, 86400), so I'm guessing I can push this to
512KB or even 1MB, i.e. spend 1-2GB of RAM on this in total. We may
also need to switch the php-fpm process manager to keep per-vhost
processes around for longer, but this isn't a major concern; we've
got a reasonable amount of RAM available. Unless this realpath_cache
is persistent over multiple php-fpm processes.</p>
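<p>A sketch of what I have in mind for the third prize (the ini path
is an assumption, the sizes are the guesses above; as far as I know
the realpath cache is per php-fpm worker process rather than shared,
hence the 1-2GB total):<br>
<br>
# bump the realpath cache for every php-fpm worker<br>
echo 'realpath_cache_size = 1M' >> /etc/php/php.ini<br>
echo 'realpath_cache_ttl = 86400' >> /etc/php/php.ini<br>
# how much of it a single process actually ends up using<br>
php -r 'var_dump(realpath_cache_size());'<br>
</p>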
<p><a class="moz-txt-link-freetext" href="https://pecl.php.net/apcu">https://pecl.php.net/apcu</a> just came onto my radar now, can
definitely also investigate that. APC itself is dead from the
looks of it. Looking at the docs though, the mechanism to avoid
that stat() call is no longer present either. And the primary
goal of avoiding the stat() call was to avoid self-heal (which is
nowadays off on glusterfs side by default anyway). So not sure
this will make a significant difference.</p>
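<p>If the thing being chased is the per-request stat() on the .php
files themselves, my understanding is that that knob moved from APC
to OPcache rather than APCu; a sketch (ini path is an assumption, and
with timestamps off a code deploy needs an explicit opcache reset or
php-fpm reload):<br>
<br>
# stop OPcache re-stat()ing cached scripts on every request<br>
echo 'opcache.enable = 1' >> /etc/php/php.ini<br>
echo 'opcache.validate_timestamps = 0' >> /etc/php/php.ini<br>
</p>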
<p>Otherwise, I've read through that specific blog entry so many
times I can mostly recall the recommendations from memory. You
still reference glusterfs 3.2.6 ... we're at 10.2, and we're
running with an extra inode-table-size patch by yours truly which
helps avoid lock contention when you have >64k files in the
active set. Other tricks and hacks too, such as limiting the
invalidate-limit to 16 or 32 (recommendations currently seem to be
in the 128-256 region, but we found that anything over 32 is simply
untenable if lru-limit >> inode-table-size; at 16 we pretty much
avoid all latency spikes, with the caveat that it's quite possible
for the number of entries in the inode table to exceed lru-limit
for reasonable periods of time, but we reason that's just an
indicator that you should probably be increasing lru-limit, and
quite possibly inode-table-size too - patches on github). The
recommendation regarding RDMA over Infiniband is also no longer
applicable, since infiniband support in glusterfs has been
abandoned.<br>
</p>
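<p>In case it's useful, one way to watch those inode-table counters on
a live client is a statedump (default dump location assumed; the
pgrep pattern matches the mount command quoted further down):<br>
<br>
# SIGUSR1 asks the fuse client to write a statedump under /var/run/gluster<br>
kill -USR1 $(pgrep -f 'volfile-id=gv_home')<br>
# then inspect the fuse inode table counters in the dump<br>
grep -E 'lru_size|active_size' /var/run/gluster/*.dump.*<br>
</p>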
<p>One other option that has not been mentioned is to use
cluster-lvm and basically export PVs from glusterfs, which can
then be sectored into cluster-aware VGs such that they're only
active on one node at a time, and then run some posix filesystem
directly on those, basically retaining the current setup
otherwise. The caveat is that each vhost will be active only on
one specific node, which means we will need a relevant mechanism
to ensure that all requests for a vhost always hit the right
physical node.<br>
</p>
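<p>One way I could see that shaping up, as a very rough sketch (every
name below is an assumption, and a shared VG needs lvmlockd with a
started lockspace before it can actually be used): back a PV with a
file on the gluster mount, carve LVs out of a shared VG, and only
ever activate a given LV on one node.<br>
<br>
# file-backed PV living on the gluster mount<br>
truncate -s 1T /home/pv/vhosts-pv0.img<br>
losetup /dev/loop0 /home/pv/vhosts-pv0.img<br>
pvcreate /dev/loop0<br>
vgcreate --shared vg_vhosts /dev/loop0<br>
lvcreate -L 200G -n lv_vhosts vg_vhosts<br>
mkfs.xfs /dev/vg_vhosts/lv_vhosts<br>
</p>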
<p>Kind Regards,<br>
Jaco<br>
</p>
On 2022/12/14 17:37, Joe Julian wrote:<br>
<blockquote type="cite"
cite="mid:009FA163-24F6-4848-B9AF-40D0A9A482E9@julianfamily.org">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
PHP is not a good filesystem user. I've written about this a while
back: <a
href="https://joejulian.name/post/optimizing-web-performance-with-glusterfs/"
moz-do-not-send="true" class="moz-txt-link-freetext">https://joejulian.name/post/optimizing-web-performance-with-glusterfs/</a><br>
<br>
<div class="gmail_quote">On December 14, 2022 6:16:54 AM PST, Jaco
Kroon <a class="moz-txt-link-rfc2396E" href="mailto:jaco@uls.co.za"><jaco@uls.co.za></a> wrote:
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<p>Hi Peter,</p>
<p>Yes, we could, but with ~1000 vhosts that gets extremely
cumbersome to maintain and to get clients able to manage
their own stuff. Essentially, unless the htdocs/ folder
is on a single filesystem, we're going to need to get
involved with each and every update, which isn't feasible.
Then I'd rather partition the vhosts such that half runs on
one server and the other half on the other server and risk
downtime.</p>
<p>Our experience indicates that the slow part is in fact not
the execution of the php code but php locating the
files. It tries a bunch of folders with stat() and/or
open() and gets the ordering wrong, resulting in numerous
ENOENT errors before hitting the right locations, after
which it actually does quite well. On code I wrote which
does NOT suffer this problem quite as badly as wordpress we
find that from a local filesystem we get 200ms for full
processing (idle system, nvme physical disk, although I
doubt this matters since the fs layer should have most of
this cached in RAM anyway) vs 300ms on top of glusterfs.
The bricks barely ever go to disk (fs layer caching)
according to the system stats we gathered.<br>
</p>
<p>How do big hosting entities like wordpress.org (iirc)
deal with this? Because honestly, I doubt they do
single-server setups. Then again, I reckon that if you ONLY
host wordpress (based on experience) it's possible to have a
single master copy of wordpress on each server, with an
lsync'ed themes/ folder for each vhost and a shared
(glusterfs) uploads folder. Enter things like wordfence,
which insist on being able to write to alternative
locations.<br>
</p>
<p>Anyway, barring using glusterfs we can certainly come up
with solutions, which may even include having *some* sites
run on the shared setup, and others on single-host, possibly
with something like lsync keeping a "semi hot standby" up to
date. That does get complex though.</p>
<p>Our ideal solution remains a fairly performant clustered
filesystem such as glusterfs (with which we have a lot of
experience, including using it for large email clusters
where its performance is excellent, but I would have LOVED
inotify support). With nl-cache the performance is
adequate; however, the cache-invalidation doesn't seem to
function properly. Which I believe can be solved, either by
fixing settings, or by fixing code bugs. Basically whenever
a file is modified or a new file is created, clients should
be alerted in order to invalidate the cache. Since this cluster
is mostly-read, some write, and there are only two clients,
this should be perfectly manageable, and there seem to be
hints of this in the gluster volume options already:<br>
<br>
# gluster volume get volname all | grep invalid<br>
performance.quick-read-cache-invalidation false (DEFAULT)<br>
performance.ctime-invalidation false (DEFAULT)<br>
performance.cache-invalidation on<br>
performance.global-cache-invalidation true (DEFAULT)<br>
features.cache-invalidation on<br>
features.cache-invalidation-timeout 600<br>
<br>
</p>
<p>Kind Regards,<br>
Jaco</p>
<p> On 2022/12/14 14:56, Péter Károly JUHÁSZ wrote:<br>
</p>
<blockquote type="cite"
cite="mid:CAAA01izvqKNdikAby07bjVja58_ogjjcSzT_=mYc5oWC=1ZEVA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html;
charset=UTF-8">
<div dir="auto">We did this with WordPress too. It uses a
tons of static files, executing them is the slow part. You
can rsync them and use the upload dir from glusterfs.</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Jaco Kroon <<a
href="mailto:jaco@uls.co.za" moz-do-not-send="true"
class="moz-txt-link-freetext">jaco@uls.co.za</a>> 于
2022年12月14日周三 13:20写道:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<p>Hi,</p>
<p>The problem is files generated by wordpress, uploads,
etc ... so copying them to frontend hosts, whilst making
perfect sense, assumes I have control over the code so it
doesn't write to the local front-end; otherwise we could
have relied on something like lsync.</p>
<p>As it stands, performance is acceptable with
nl-cache enabled, but the fact that we get those
ENOENT errors is highly problematic.<br>
</p>
<p><br>
</p>
<div>
<p>Kind Regards,<br>
Jaco Kroon<br>
</p>
<p><br>
</p>
<p>On 2022/12/14 14:04, Péter Károly JUHÁSZ wrote:<br>
</p>
</div>
<blockquote type="cite">
<div dir="auto">When we used glusterfs for websites,
we copied the web dir from gluster to local on
frontend boots, then served it from there.</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Jaco Kroon <<a
href="mailto:jaco@uls.co.za" target="_blank"
rel="noreferrer" moz-do-not-send="true"
class="moz-txt-link-freetext">jaco@uls.co.za</a>>
于 2022年12月14日周三 12:49写道:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">Hi All,<br>
<br>
We've got a glusterfs cluster that houses some
php web sites.<br>
<br>
This is generally considered a bad idea and we
can see why.<br>
<br>
With performance.nl-cache on it actually turns
out to be very <br>
reasonable, however, with this turned off
performance is roughly 5x <br>
worse, meaning a request that would take sub-500ms
now takes 2500ms. <br>
In other cases we see far, far worse cases, eg,
with nl-cache takes <br>
~1500ms, without takes ~30s (20x worse).<br>
<br>
So why not use nl-cache? Well, it results in
readdir reporting files <br>
which then fail to open with ENOENT. The cache
also never clears even <br>
though the configuration says nl-cache entries
should only be cached for <br>
60s. Even for "ls -lah" in affected folders
you'll notice ???? marks <br>
for attributes on files. If this
recovered in a reasonable time <br>
(say, a few seconds), sure.<br>
<br>
# gluster volume info<br>
Type: Replicate<br>
Volume ID: cbe08331-8b83-41ac-b56d-88ef30c0f5c7<br>
Status: Started<br>
Snapshot Count: 0<br>
Number of Bricks: 1 x 2 = 2<br>
Transport-type: tcp<br>
Options Reconfigured:<br>
performance.nl-cache: on<br>
cluster.readdir-optimize: on<br>
config.client-threads: 2<br>
config.brick-threads: 4<br>
config.global-threading: on<br>
performance.iot-pass-through: on<br>
storage.fips-mode-rchecksum: on<br>
cluster.granular-entry-heal: enable<br>
cluster.data-self-heal-algorithm: full<br>
cluster.locking-scheme: granular<br>
client.event-threads: 2<br>
server.event-threads: 2<br>
transport.address-family: inet<br>
nfs.disable: on<br>
cluster.metadata-self-heal: off<br>
cluster.entry-self-heal: off<br>
cluster.data-self-heal: off<br>
cluster.self-heal-daemon: on<br>
server.allow-insecure: on<br>
features.ctime: off<br>
performance.io-cache: on<br>
performance.cache-invalidation: on<br>
features.cache-invalidation: on<br>
performance.qr-cache-timeout: 600<br>
features.cache-invalidation-timeout: 600<br>
performance.io-cache-size: 128MB<br>
performance.cache-size: 128MB<br>
<br>
Are there any other recommendations short of
abandoning all hope of <br>
redundancy and reverting to a single-server
setup (for the web code at <br>
least)? Currently the cost of the redundancy
seems to outweigh the benefit.<br>
<br>
Glusterfs version 10.2. With patch for
--inode-table-size, mounts <br>
happen with:<br>
<br>
/usr/sbin/glusterfs --acl
--reader-thread-count=2 --lru-limit=524288 <br>
--inode-table-size=524288 --invalidate-limit=16
--background-qlen=32 <br>
--fuse-mountopts=nodev,nosuid,noexec,noatime
--process-name fuse <br>
--volfile-server=127.0.0.1 --volfile-id=gv_home
<br>
--fuse-mountopts=nodev,nosuid,noexec,noatime
/home<br>
<br>
Kind Regards,<br>
Jaco<br>
<br>
________<br>
<br>
<br>
<br>
Community Meeting Calendar:<br>
<br>
Schedule -<br>
Every 2nd and 4th Tuesday at 14:30 IST / 09:00
UTC<br>
Bridge: <a
href="https://meet.google.com/cpu-eiue-hvk"
rel="noreferrer noreferrer noreferrer"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">https://meet.google.com/cpu-eiue-hvk</a><br>
Gluster-users mailing list<br>
<a href="mailto:Gluster-users@gluster.org"
rel="noreferrer noreferrer" target="_blank"
moz-do-not-send="true"
class="moz-txt-link-freetext">Gluster-users@gluster.org</a><br>
<a
href="https://lists.gluster.org/mailman/listinfo/gluster-users"
rel="noreferrer noreferrer noreferrer"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">https://lists.gluster.org/mailman/listinfo/gluster-users</a><br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</blockquote>
</div>
<div class="k9mail-signature">-- <br>
Sent from my Android device with K-9 Mail. Please excuse my
brevity.</div>
</blockquote>
</body>
</html>