[Gluster-devel] replicate background threads
Ian Latter
ian.latter at midnightcode.org
Sun Mar 11 10:17:15 UTC 2012
Hello,
My mate Michael and I have been steadily advancing our Gluster testing,
and today we finally reached some heavier conditions. The outcome was
different from the expectations built from our more basic testing, so I
think we have a couple of questions regarding the AFR/Replicate
background threads that may need a developer's contribution. Any help
appreciated.

The setup is a 3 box environment, but let's start with two:
SaturnM (Server)
- 6core CPU, 16GB RAM, 1Gbps net
- 3.2.6 Kernel (custom distro)
- 3.2.5 Gluster (32bit)
- 3x2TB drives, CFQ, EXT3
- Bricked up into a single local 6TB
"distribute" brick
- "brick" served to the network
MMC (Client)
- 4core CPU, 8GB RAM, 1Gbps net
- Ubuntu
- 3.2.5 Gluster (32bit)
- Don't recall the disk space locally
- "brick" from SaturnM mounted
500 x 15Gbyte files were copied from MMC to a single sub-directory on
the brick served from SaturnM; all went well and dandy. So then we
moved on to a 3 box environment (items marked with "=" below are the
additions/changes relative to the two-box setup):
SaturnI (Server)
= 1core CPU, 1GB RAM, 1Gbps net
= 3.2.6 Kernel (custom distro)
= 3.2.5 Gluster (32bit)
= 4x2TB drives, CFQ, EXT3
= Bricked up into a single local 8TB
"distribute" brick
= "brick" served to the network
SaturnM (Server/Client)
- 6core CPU, 16GB RAM, 1Gbps net
- 3.2.6 Kernel (custom distro)
- 3.2.5 Gluster (32bit)
- 3x2TB drives, CFQ, EXT3
- Bricked up into a single local 6TB
"distribute" brick
= Replicate brick added to sit over
the local distribute brick and a
client "brick" mapped from SaturnI
- Replicate "brick" served to the network
MMC (Client)
- 4core CPU, 8GB RAM, 1Gbps net
- Ubuntu
- 3.2.5 Gluster (32bit)
- Don't recall the disk space locally
- "brick" from SaturnM mounted
= "brick" from SaturnI mounted
Now, in lighter testing of this scenario all was well - any files on
SaturnI would appear on SaturnM (not a functional part of our test) and
the content on SaturnM would appear on SaturnI (the real objective).
Earlier testing used a handful of smaller files (10s to 100s of Mbytes)
and a single 15Gbyte file. The 15Gbyte file would be stat'ed via an
"ls", which would kick off a background replication ("ls" appeared
unblocked) and the file would be transferred. Also, interrupting the
transfer (pulling the LAN cable) would result in the partial 15Gbyte
file being corrected by a subsequent background process on another stat.
*However* .. when confronted with 500 x 15Gbyte files in a single
directory (but not the root directory), things don't quite work out as
nicely.

- First, the "ls" (at MMC against the SaturnM brick) blocks and hangs
  the terminal (ctrl-c doesn't kill it).

- Then, looking from MMC at the SaturnI file system ("ls -s") once per
  second, and comparing the output ("diff ls1.txt ls2.txt | grep -v '>'"),
  we can see that between 10 and 17 files are being updated
  simultaneously by the background process.

- Further, a request at MMC for a single file that was originally in
  the 500 x 15Gbyte sub-dir on SaturnM (which should return unblocked
  with correct results) will:
    a) work as expected if there are fewer than 17 active background
       file tasks;
    b) block/hang if there are more.

- Whereas a stat ("ls") outside of the 500 x 15Gbyte sub-directory,
  such as the root of that brick, would always work as expected
  (return immediately, unblocked).
Thus, to us, it appears as though there is a queue feeding a set of
(around) 16 worker threads in AFR. If your request is to the loaded
directory, you are blocked until a worker becomes available; if your
request is to any other location, it returns unblocked regardless of
the state of the worker pool.
The only thread metric that we could find to tweak was
performance/io-threads (which was set to 16 per physical disk - well,
per locks + posix brick stack), but increasing this to 64 per stack
didn't change the outcome (still 10 to 17 active background transfers).
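Concretely, that tweak was just the thread-count option on each
io-threads volume in the export stack, along these lines (sketch;
volume names as in the earlier sketches):

  volume iot1
    type performance/io-threads
    # raised from 16 to 64; didn't change the 10-17 concurrent
    # background transfers we observed
    option thread-count 64
    subvolumes locks1
  end-volume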
So, given the above, is our analysis sound, and if so, is there a way
to increase the size of the pool of active worker threads? The
objective is to allow unblocked access to an existing repository of
files (on SaturnM) while a secondary/back-up copy is being filled, via
GlusterFS. Note that I understand that performance (throughput) will be
an issue in the described environment: this replication process is
estimated to run for between 10 and 40 hours, which is acceptable so
long as it isn't blocking (there's a production-capable file set in
place).
Any help appreciated.
Thanks,
--
Ian Latter
Late night coder ..
http://midnightcode.org/