[Gluster-devel] replicate background threads

Tue Mar 13 14:22:40 UTC 2012

Hello,

  Well we've been privy to our first true error in 
Gluster now, and we're not sure of the cause.

  The SaturnI machine, with 1Gbyte of RAM, was
exhausting its memory and crashing, and we saw
core dumps on SaturnM and MMC.  After replacing
the SaturnI hardware with hardware identical to
SaturnM's, but retaining SaturnI's original disks
(so fixing the memory capacity problem), we saw
crashes randomly at all nodes.

  Looking for irregularities at the file system
we noticed that (we'd estimate) about 60% of
the files at the OS/EXT3 layer of SaturnI 
(sourced via replicate from SaturnM) were of 
size 2147483648 (2^31) where they should 
have been substantially larger.  While we would 
happily accept "you shouldn't be using a 32bit
gluster package" as the answer, we note two
deltas;
  1) All files used in testing were copied on from
       32 bit clients to 32 bit servers, with no
       observable errors
  2) Of the files that were replicated, not all were
       corrupted (capped at 2G -- note that we 
       confirmed that this was the first 2G of the
       source file contents).
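
For anyone wanting to repeat the check, here is a minimal sketch of how
files capped at exactly 2^31 bytes could be listed at the OS/EXT3 layer.
It is an illustration only, not the procedure we actually used; the
function name and the directory argument are hypothetical.

```python
import os

# 2147483648 bytes: the size at which the truncated replicas were observed.
CAP = 2 ** 31


def find_capped_files(root, cap=CAP):
    """Return a sorted list of paths under root whose size is exactly cap bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                # File vanished or is unreadable; skip it rather than abort the scan.
                continue
            if size == cap:
                hits.append(path)
    return sorted(hits)
```

Calling find_capped_files() on a brick's backing directory returns the
suspect paths.  A file that is legitimately 2^31 bytes long would also
be reported, so the result is a list of candidates rather than proof of
corruption.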

So is there a known replicate issue with files
greater than 2GB?  Has anyone done any 
serious testing with significant numbers of files 
of this size?  Are there configurations specific
to files/structures of these dimensions?

We noted that when the configuration is reversed, such
that SaturnI provides the replicate brick over
a local distribute and a remote map to SaturnM,
and SaturnM simply serves a local distribute,
the data served to MMC is accurate (it
continues to show 15GB files, even where there
is a local 2GB copy).  Further, a client "cp" at
MMC, of a file with a 2GB local replicate of a 
15GB file, will result in a 15GB file being 
created and replicated via Gluster (i.e. the 
correct specification at both server nodes).

So my other question is: is it possible that we've
managed to corrupt something in this
environment, i.e. during the initial memory
exhaustion events?  And is there a robust way
to have the replicate files revalidated by gluster?
A stat doesn't seem to be correcting files in
this state (i.e. replicate on SaturnM results in
daemon crashes, replicate on SaturnI results
in files being left in the bad state).
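
To make "bad state" concrete, the sketch below shows one way the
affected files could be enumerated by comparing file sizes between the
source tree and the replica tree.  It assumes both trees are reachable
as ordinary directories with the same relative layout (which may not
hold for raw distribute back-ends, where files are spread across
disks); the function names are hypothetical.  It only reports
mismatches and does not ask gluster to repair anything.

```python
import os


def size_map(root):
    """Map each file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes[os.path.relpath(full, root)] = os.lstat(full).st_size
            except OSError:
                # Unreadable or vanished file; leave it out of the map.
                continue
    return sizes


def size_mismatches(source_root, replica_root):
    """Return sorted (relpath, source_size, replica_size) tuples for files
    whose replica size differs from the source; replica_size is None when
    the replica copy is missing."""
    source = size_map(source_root)
    replica = size_map(replica_root)
    mismatches = []
    for rel, src_size in source.items():
        rep_size = replica.get(rel)
        if rep_size != src_size:
            mismatches.append((rel, src_size, rep_size))
    return sorted(mismatches)
```

In our case the first argument would be the SaturnM copy and the second
the SaturnI copy; a 15GB file capped at 2GB would then show up with a
replica size of 2147483648.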

Also, I'm not a member of the users list; if these
questions are better posed there then let me 
know and I'll re-post them there.

Thanks,

----- Original Message -----
>From: "Ian Latter" <ian.latter at midnightcode.org>
>To: <gluster-devel at nongnu.org>
>Subject:  [Gluster-devel] replicate background threads
>Date: Sun, 11 Mar 2012 20:17:15 +1000
>
> Hello,
> 
> 
>   My mate Michael and I have been steadily 
> advancing our Gluster testing and today we finally 
> reached some heavier conditions.  The outcome 
> was different from expectations built from our more 
> basic testing so I think we have a couple of 
> questions regarding the AFR/Replicate background 
> threads that may need a developer's contribution.  
> Any help appreciated.
> 
> 
>   The setup is a 3 box environment, but let's start
> with two;
> 
>     SaturnM (Server)
>        - 6core CPU, 16GB RAM, 1Gbps net
>        - 3.2.6 Kernel (custom distro)
>        - 3.2.5 Gluster (32bit)
>        - 3x2TB drives, CFQ, EXT3
>        - Bricked up into a single local 6TB 
>           "distribute" brick 
>        - "brick" served to the network
> 
>     MMC (Client)
>        - 4core CPU, 8GB RAM, 1Gbps net
>        - Ubuntu
>        - 3.2.5 Gluster (32bit)
>        - Don't recall the disk space locally
>        - "brick" from SaturnM mounted
> 
>     500 x 15Gbyte files were copied from MMC
> to a single sub-directory on the brick served from 
> SaturnM, all went well and dandy.  So then we 
> moved on to a 3 box environment;
> 
>     SaturnI (Server)
>        = 1core CPU, 1GB RAM, 1Gbps net
>        = 3.2.6 Kernel (custom distro)
>        = 3.2.5 Gluster (32bit)
>        = 4x2TB drives, CFQ, EXT3
>        = Bricked up into a single local 8TB 
>           "distribute" brick 
>        = "brick" served to the network
> 
>     SaturnM (Server/Client)
>        - 6core CPU, 16GB RAM, 1Gbps net
>        - 3.2.6 Kernel (custom distro)
>        - 3.2.5 Gluster (32bit)
>        - 3x2TB drives, CFQ, EXT3
>        - Bricked up into a single local 6TB 
>           "distribute" brick 
>        = Replicate brick added to sit over 
>           the local distribute brick and a
>           client "brick" mapped from SaturnI
>        - Replicate "brick" served to the network
> 
>     MMC (Client)
>        - 4core CPU, 8GB RAM, 1Gbps net
>        - Ubuntu
>        - 3.2.5 Gluster (32bit)
>        - Don't recall the disk space locally
>        - "brick" from SaturnM mounted
>        = "brick" from SaturnI mounted
> 
> 
>   Now, in lesser testing in this scenario all was
> well - any files on SaturnI would appear on SaturnM 
> (not a functional part of our test) and the content on
> SaturnM would appear on SaturnI (the real 
> objective).  
> 
>   Earlier testing used a handful of smaller files (10s 
> to 100s of Mbytes) and a single 15Gbyte file.  The
> 15Gbyte file would be "stat" via an "ls", which would
> kick off a background replication (ls appeared
> unblocked) and the file would be transferred.  Also,
> interrupting the transfer (pulling the LAN cable)
> would result in a partial 15Gbyte file being corrected
> in a subsequent background process on another
> stat.
> 
>   *However* .. when confronted with 500 x 15Gbyte 
> files, in a single directory (but not the root directory) 
> things don't quite work out as nicely.  
>   - First, the "ls" (at MMC against the SaturnM brick) 
>     is blocking and hangs the terminal (ctrl-c doesn't 
>     kill it).  
>   - Then, looking from MMC at the SaturnI file 
>      system (ls -s) once per second, and then 
>      comparing the output (diff ls1.txt ls2.txt | 
>      grep -v '>') we can see that between 10 and 17 
>      files are being updated simultaneously by the
>      background process
>   - Further, a request at MMC for a single file that 
>     was originally in the 500 x 15Gbyte sub-dir on
>     SaturnM (which should return unblocked with 
>     correct results) will;
>       a) work as expected if there are fewer than 17
>           active background file tasks
>       b) block/hang if there are more
>   - Whereas a stat (ls) outside of the 500 x 15
>      sub-directory, such as the root of that brick,
>      would always work as expected (return
>      immediately, unblocked).
> 
> 
>   Thus, to us, it appears as though there is a 
> queue feeding a set of (around) 16 worker threads 
> in AFR.  If your request was to the loaded directory
> then you would be blocked until a worker was 
> available, and if your request was to any other
> location, it would return unblocked regardless of
> the worker pool state.
> 
>   The only thread metric that we could find to tweak
> was performance/io-threads (which was set to
> 16 per physical disk; well per locks + posix brick 
> stacks) but increasing this to 64 per stack didn't
> change the outcome (10 to 17 active background
> transfers).
> 
> 
> 
>   So, given the above, is our analysis sound, and
> if so, is there a way to increase the size of the 
> pool of active worker threads?  The objective 
> being to allow unblocked access to an existing
> repository of files (on SaturnM) while a 
> secondary/back-up is being filled, via GlusterFS?
> 
>   Note that I understand that performance 
> (through-put) will be an issue in the described 
> environment: this replication process is 
> estimated to run for between 10 and 40 hours, 
> which is acceptable so long as it isn't blocking
> (there's a production-capable file set in place).
> 
> 
> 
> 
> 
> Any help appreciated.
> 
> 
> 
> Thanks,
> 
> 
> 
> 
> 
> 
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
> 

--
Ian Latter
Late night coder ..
http://midnightcode.org/