[Gluster-devel] replicate background threads

Tue Mar 13 15:37:53 UTC 2012

On 03/13/2012 07:52 PM, Ian Latter wrote:
> Hello,
>
>
>    Well we've been privy to our first true error in
> Gluster now, and we're not sure of the cause.
>
>    The SaturnI machine with 1Gbyte of RAM was
> exhausting its memory and crashing and we saw
> core dumps on SaturnM and MMC.  Replacing
> the SaturnI hardware with identical hardware to
> SaturnM, but retaining SaturnI's original disks,
> (so fixing the memory capacity problem) we saw
> crashes randomly at all nodes.
>
>    Looking for irregularities at the file system
> we noticed that (we'd estimate) about 60% of
> the files at the OS/EXT3 layer of SaturnI
> (sourced via replicate from SaturnM) were of
> size 2147483648 (2^31) where they should
> have been substantially larger.  While we would
> happily accept "you shouldn't be using a 32bit
> gluster package" as the answer, we note two
> deltas;
>    1) All files used in testing were copied on from
>         32 bit clients to 32 bit servers, with no
>         observable errors
>    2) Of the files that were replicated, not all were
>         corrupted (capped at 2G -- note that we
>         confirmed that this was the first 2G of the
>         source file contents).
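
A minimal sketch for listing the affected files on a brick's backing file system, assuming the symptom is exactly as described above: a regular file whose size is exactly 2^31 bytes. The function name and the path in the usage note are hypothetical, and the scan only reports candidates; it does not repair anything.

```python
import os
import stat

# 2147483648 bytes: one byte past the largest signed 32-bit offset (2^31 - 1).
TRUNCATED_SIZE = 2 ** 31


def find_capped_files(root, size=TRUNCATED_SIZE):
    """Return the sorted paths of regular files under root whose size equals `size`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # the file vanished or cannot be read; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == size:
                hits.append(path)
    return sorted(hits)
```

Usage would be along the lines of `find_capped_files("/path/to/brick")`, run directly on the server that holds the suspect copies. A source file that really is 2^31 bytes long would also be reported, so each hit still needs comparing against the source.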
>
>
> So is there a known replicate issue with files
> greater than 2GB?  Has anyone done any
> serious testing with significant numbers of files
> of this size?  Are there configurations specific
> to files/structures of these dimensions?
>
> We noted that after reversing the configuration, so
> that SaturnI provides the replicate brick over
> a local distribute and a remote map to SaturnM,
> while SaturnM simply serves a local distribute,
> the data served to MMC is accurate (it
> continues to show 15GB files, even where there
> is a local 2GB copy).  Further, a client "cp" at
> MMC, of a file with a 2GB local replicate of a
> 15GB file, will result in a 15GB file being
> created and replicated via Gluster (i.e. the
> correct specification at both server nodes).
>
> So my other question is: is it possible that we've
> managed to corrupt something in this
> environment, i.e. during the initial memory
> exhaustion events?  And is there a robust way
> to have the replicate files revalidated by gluster,
> as a stat doesn't seem to be correcting files in
> this state (i.e. replicate on SaturnM results in
> daemon crashes, replicate on SaturnI results
> in files being left in the bad state)?
>
>
> Also, I'm not a member of the users list; if these
> questions are better posed there then let me
> know and I'll re-post them there.
>
>
>
> Thanks,
>
>
>
>
>
> ----- Original Message -----
>> From: "Ian Latter"<ian.latter at midnightcode.org>
>> To:<gluster-devel at nongnu.org>
>> Subject:  [Gluster-devel] replicate background threads
>> Date: Sun, 11 Mar 2012 20:17:15 +1000
>>
>> Hello,
>>
>>
>>    My mate Michael and I have been steadily
>> advancing our Gluster testing and today we finally
>> reached some heavier conditions.  The outcome
>> was different from expectations built from our more
>> basic testing so I think we have a couple of
>> questions regarding the AFR/Replicate background
>> threads that may need a developer's contribution.
>> Any help appreciated.
>>
>>
>>    The setup is a 3 box environment, but let's start
>> with two;
>>
>>      SaturnM (Server)
>>         - 6core CPU, 16GB RAM, 1Gbps net
>>         - 3.2.6 Kernel (custom distro)
>>         - 3.2.5 Gluster (32bit)
>>         - 3x2TB drives, CFQ, EXT3
>>         - Bricked up into a single local 6TB
>>            "distribute" brick
>>         - "brick" served to the network
>>
>>      MMC (Client)
>>         - 4core CPU, 8GB RAM, 1Gbps net
>>         - Ubuntu
>>         - 3.2.5 Gluster (32bit)
>>         - Don't recall the disk space locally
>>         - "brick" from SaturnM mounted
>>
>>      500 x 15Gbyte files were copied from MMC
>> to a single sub-directory on the brick served from
>> SaturnM, all went well and dandy.  So then we
>> moved on to a 3 box environment;
>>
>>      SaturnI (Server)
>>         = 1core CPU, 1GB RAM, 1Gbps net
>>         = 3.2.6 Kernel (custom distro)
>>         = 3.2.5 Gluster (32bit)
>>         = 4x2TB drives, CFQ, EXT3
>>         = Bricked up into a single local 8TB
>>            "distribute" brick
>>         = "brick" served to the network
>>
>>      SaturnM (Server/Client)
>>         - 6core CPU, 16GB RAM, 1Gbps net
>>         - 3.2.6 Kernel (custom distro)
>>         - 3.2.5 Gluster (32bit)
>>         - 3x2TB drives, CFQ, EXT3
>>         - Bricked up into a single local 6TB
>>            "distribute" brick
>>         = Replicate brick added to sit over
>>            the local distribute brick and a
>>            client "brick" mapped from SaturnI
>>         - Replicate "brick" served to the network
>>
>>      MMC (Client)
>>         - 4core CPU, 8GB RAM, 1Gbps net
>>         - Ubuntu
>>         - 3.2.5 Gluster (32bit)
>>         - Don't recall the disk space locally
>>         - "brick" from SaturnM mounted
>>         = "brick" from SaturnI mounted
>>
>>
>>    Now, in lesser testing in this scenario all was
>> well - any files on SaturnI would appear on SaturnM
>> (not a functional part of our test) and the content on
>> SaturnM would appear on SaturnI (the real
>> objective).
>>
>>    Earlier testing used a handful of smaller files (10s
>> to 100s of Mbytes) and a single 15Gbyte file.  The
>> 15Gbyte file would be "stat" via an "ls", which would
>> kick off a background replication (ls appeared un-
>> blocked) and the file would be transferred.  Also,
>> interrupting the transfer (pulling the LAN cable)
>> would result in a partial 15Gbyte file being corrected
>> in a subsequent background process on another
>> stat.
>>
>>    *However* .. when confronted with 500 x 15Gbyte
>> files, in a single directory (but not the root directory)
>> things don't quite work out as nicely.
>>    - First, the "ls" (at MMC against the SaturnM brick)
>>      is blocking and hangs the terminal (ctrl-c doesn't
>>      kill it).
<pranithk> At most 16 files can be self-healed in the background in 
parallel. The self-heal of a 17th file will happen in the foreground.
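
The limit described in the reply above can be pictured with a toy model. This is an illustration only, not the AFR code: the class and method names are hypothetical, and the only fact taken from the thread is that 16 heals run in the background and any further heal runs in the foreground, blocking the caller.

```python
class HealDispatcher:
    """Toy model: up to `limit` heals run in the background, extras in the foreground."""

    def __init__(self, limit=16):
        self.limit = limit
        self.background = set()  # names of files currently healing in the background

    def request(self, name):
        """Return 'background' if a slot is free, else 'foreground' (the caller blocks)."""
        if name in self.background:
            return "background"
        if len(self.background) < self.limit:
            self.background.add(name)
            return "background"
        return "foreground"

    def finish(self, name):
        """Release the background slot held by `name`, if any."""
        self.background.discard(name)
```

With 500 files pending, the first 16 requests take the background slots and every later request reports "foreground" until a slot is released, which would match the blocking "ls" described above.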
>>    - Then, looking from MMC at the SaturnI file
>>       system (ls -s) once per second, and then
>>       comparing the output (diff ls1.txt ls2.txt |
>>       grep -v '>') we can see that between 10 and 17
>>       files are being updated simultaneously by the
>>       background process
<pranithk> This is expected.
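
The once-per-second comparison described above can be sketched as follows. It assumes that a change in file size between two snapshots is a good enough sign that a file is being written; the original procedure used "ls -s", which reports allocated blocks rather than size. Both function names are hypothetical.

```python
import os


def snapshot_sizes(directory):
    """Map each regular file name in `directory` to its current size in bytes."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def changed_files(before, after):
    """Return the sorted names that are new in `after` or whose size differs."""
    return sorted(name for name, size in after.items() if before.get(name) != size)
```

Taking one snapshot, sleeping for a second, taking another and passing both to `changed_files` gives the list of files updated in that interval; its length is the figure reported above as between 10 and 17.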
>>    - Further, a request at MMC for a single file that
>>      was originally in the 500 x 15Gbyte sub-dir on
>>      SaturnM (which should return unblocked with
>>      correct results) will;
>>        a) work as expected if there are fewer than 17
>>            active background file tasks
>>        b) block/hang if there are more
>>    - Where-as a stat (ls) outside of the 500 x 15
>>       sub-directory, such as the root of that brick,
>>       would always work as expected (return
>>       immediately, unblocked).
<pranithk> A stat on the directory only creates the files, with a file 
size of 0. The self-heal of a file is triggered when you ls/stat that 
file itself.
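
Going by the reply above, each file has to be stat'ed individually before its self-heal starts. A minimal sketch of doing that for a whole tree, assuming it is run against the GlusterFS client mount and not against a brick's backing directory; the function name is hypothetical.

```python
import os


def stat_all(mount_point):
    """Stat every entry under mount_point and return how many stats succeeded."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # entry vanished or is unreachable; carry on with the rest
    return count
```

Given the 16-file background limit mentioned earlier in the thread, such a walk over a large directory should be expected to block on individual files while the heals ahead of it complete.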
>>
>>
>>    Thus, to us, it appears as though there is a
>> queue feeding a set of (around) 16 worker threads
>> in AFR.  If your request was to the loaded directory
>> then you would be blocked until a worker was
>> available, and if your request was to any other
>> location, it would return unblocked regardless of
>> the worker pool state.
>>
>>    The only thread metric that we could find to tweak
>> was performance/io-threads (which was set to
>> 16 per physical disk; well per locks + posix brick
>> stacks) but increasing this to 64 per stack didn't
>> change the outcome (10 to 17 active background
>> transfers).
<pranithk> The option that raises the maximum number of background 
self-heals is cluster.background-self-heal-count; its default value is 
16. I assume you are aware of what increasing this number does to the 
performance of the system.
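
A hedged sketch of applying that option, assuming the volume is managed through the gluster command line ("gluster volume set VOLNAME KEY VALUE"). The volume name "testvol" and the function name are hypothetical. With hand-written volume files, as the setup in this thread appears to use, the setting would instead belong in the replicate translator's stanza, which is not shown here.

```python
import subprocess

OPTION = "cluster.background-self-heal-count"  # option name taken from the reply above


def set_background_heal_count(volume, count, run=False):
    """Build the 'gluster volume set' command, and execute it only if run=True."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    argv = ["gluster", "volume", "set", volume, OPTION, str(count)]
    if run:
        subprocess.run(argv, check=True)  # needs the gluster CLI on this host
    return argv
```

Calling `set_background_heal_count("testvol", 32)` only returns the argument list, so the command can be inspected before it is run with `run=True`.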
>>
>>
>>    So, given the above, is our analysis sound, and
>> if so, is there a way to increase the size of the
>> pool of active worker threads?  The objective
>> being to allow unblocked access to an existing
>> repository of files (on SaturnM) while a
>> secondary/back-up is being filled, via GlusterFS?
>>
>>    Note that I understand that performance
>> (through-put) will be an issue in the described
>> environment: this replication process is
>> estimated to run for between 10 and 40 hours,
>> which is acceptable so long as it isn't blocking
>> (there's a production-capable file set in place).
>>
>>
>>
>>
>>
>> Any help appreciated.
>>
Please let us know how it goes.
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>> --
>> Ian Latter
>> Late night coder ..
>> http://midnightcode.org/
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
Hi Ian,
      My replies are inline, marked with <pranithk>.

Pranith.