[Gluster-devel] replicate background threads
Pranith Kumar K
pranithk at gluster.com
Wed Mar 14 02:03:32 UTC 2012
On 03/14/2012 01:47 AM, Ian Latter wrote:
> Thanks for the info Pranith;
>
> <pranithk> the option to increase the max num of background
> self-heals
> is cluster.background-self-heal-count. Default value of
> which is 16. I
> assume you know what you are doing to the performance of the
> system by
> increasing this number.
>
>
> I didn't know this. Is there a queue length for what
> is yet to be handled by the background self heal
> count? If so, can it also be adjusted?
>
>
> ----- Original Message -----
>> From: "Pranith Kumar K"<pranithk at gluster.com>
>> To: "Ian Latter"<ian.latter at midnightcode.org>
>> Subject: Re: [Gluster-devel] replicate background threads
>> Date: Tue, 13 Mar 2012 21:07:53 +0530
>>
>> On 03/13/2012 07:52 PM, Ian Latter wrote:
>>> Hello,
>>>
>>>
>>> Well we've been privy to our first true error in
>>> Gluster now, and we're not sure of the cause.
>>>
>>> The SaturnI machine with 1Gbyte of RAM was
>>> exhausting its memory and crashing and we saw
>>> core dumps on SaturnM and MMC. Replacing
>>> the SaturnI hardware with identical hardware to
>>> SaturnM, but retaining SaturnI's original disks,
>>> (so fixing the memory capacity problem) we saw
>>> crashes randomly at all nodes.
>>>
>>> Looking for irregularities at the file system
>>> we noticed that (we'd estimate) about 60% of
>>> the files at the OS/EXT3 layer of SaturnI
>>> (sourced via replicate from SaturnM) were of
>>> size 2147483648 (2^31) where they should
>>> have been substantially larger. While we would
>>> happily accept "you shouldn't be using a 32bit
>>> gluster package" as the answer, we note two
>>> deltas;
>>> 1) All files used in testing were copied on from
>>> 32 bit clients to 32 bit servers, with no
>>> observable errors
>>> 2) Of the files that were replicated, not all were
>>> corrupted (capped at 2G -- note that we
>>> confirmed that this was the first 2G of the
>>> source file contents).
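A quick way to locate truncated copies like the ones described above is to search the backend (EXT3) brick directories for files of exactly 2^31 bytes; this is only a sketch, and the brick path is a placeholder:

    # count backend files stuck at exactly 2147483648 bytes (2^31)
    find /data/brick -type f -size 2147483648c | wc -l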
>>>
>>>
>>> So is there a known replicate issue with files
>>> greater than 2GB? Has anyone done any
>>> serious testing with significant numbers of files
>>> of this size? Are there configurations specific
>>> to files/structures of these dimensions?
>>>
>>> We noted that when the configuration is reversed,
>>> such that SaturnI provides the replicate brick over
>>> its local distribute and a remote map to SaturnM
>>> (where SaturnM simply serves a local distribute),
>>> the data served to MMC is accurate (it
>>> continues to show 15GB files, even where there
>>> is a local 2GB copy). Further, a client "cp" at
>>> MMC of a file with a 2GB local replica of a
>>> 15GB file will result in a 15GB file being
>>> created and replicated via Gluster (i.e. the
>>> correct size at both server nodes).
>>>
>>> So my other question is; Is it possible that we've
>>> managed to corrupt something in this
>>> environment? I.e. during the initial memory
>>> exhaustion events? And is there a robust way
>>> to have the replicate files revalidated by gluster
>>> as a stat doesn't seem to be correcting files in
>>> this state (i.e. replicate on SaturnM results in
>>> daemon crashes, replicate on SaturnI results
>>> in files being left in the bad state).
>>>
>>>
>>> Also, I'm not a member of the users list; if these
>>> questions are better posed there then let me
>>> know and I'll re-post them there.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Ian Latter"<ian.latter at midnightcode.org>
>>>> To:<gluster-devel at nongnu.org>
>>>> Subject: [Gluster-devel] replicate background threads
>>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
>>>>
>>>> Hello,
>>>>
>>>>
>>>> My mate Michael and I have been steadily
>>>> advancing our Gluster testing and today we finally
>>>> reached some heavier conditions. The outcome
>>>> was different from expectations built from our more
>>>> basic testing so I think we have a couple of
>>>> questions regarding the AFR/Replicate background
>>>> threads that may need a developer's contribution.
>>>> Any help appreciated.
>>>>
>>>>
>>>> The setup is a 3 box environment, but let's start
>>>> with two;
>>>>
>>>> SaturnM (Server)
>>>> - 6core CPU, 16GB RAM, 1Gbps net
>>>> - 3.2.6 Kernel (custom distro)
>>>> - 3.2.5 Gluster (32bit)
>>>> - 3x2TB drives, CFQ, EXT3
>>>> - Bricked up into a single local 6TB
>>>> "distribute" brick
>>>> - "brick" served to the network
>>>>
>>>> MMC (Client)
>>>> - 4core CPU, 8GB RAM, 1Gbps net
>>>> - Ubuntu
>>>> - 3.2.5 Gluster (32bit)
>>>> - Don't recall the disk space locally
>>>> - "brick" from SaturnM mounted
>>>>
>>>> 500 x 15Gbyte files were copied from MMC
>>>> to a single sub-directory on the brick served from
>>>> SaturnM, all went well and dandy. So then we
>>>> moved on to a 3 box environment;
>>>>
>>>> SaturnI (Server)
>>>> = 1core CPU, 1GB RAM, 1Gbps net
>>>> = 3.2.6 Kernel (custom distro)
>>>> = 3.2.5 Gluster (32bit)
>>>> = 4x2TB drives, CFQ, EXT3
>>>> = Bricked up into a single local 8TB
>>>> "distribute" brick
>>>> = "brick" served to the network
>>>>
>>>> SaturnM (Server/Client)
>>>> - 6core CPU, 16GB RAM, 1Gbps net
>>>> - 3.2.6 Kernel (custom distro)
>>>> - 3.2.5 Gluster (32bit)
>>>> - 3x2TB drives, CFQ, EXT3
>>>> - Bricked up into a single local 6TB
>>>> "distribute" brick
>>>> = Replicate brick added to sit over
>>>> the local distribute brick and a
>>>> client "brick" mapped from SaturnI
>>>> - Replicate "brick" served to the network
>>>>
>>>> MMC (Client)
>>>> - 4core CPU, 8GB RAM, 1Gbps net
>>>> - Ubuntu
>>>> - 3.2.5 Gluster (32bit)
>>>> - Don't recall the disk space locally
>>>> - "brick" from SaturnM mounted
>>>> = "brick" from SaturnI mounted
>>>>
>>>>
>>>> Now, in lesser testing in this scenario all was
>>>> well - any files on SaturnI would appear on SaturnM
>>>> (not a functional part of our test) and the content on
>>>> SaturnM would appear on SaturnI (the real
>>>> objective).
>>>>
>>>> Earlier testing used a handful of smaller files (10s
>>>> to 100s of Mbytes) and a single 15Gbyte file. The
>>>> 15Gbyte file would be "stat" via an "ls", which would
>>>> kick off a background replication (ls appeared
>>>> unblocked) and the file would be transferred. Also,
>>>> interrupting the transfer (pulling the LAN cable)
>>>> would result in a partial 15Gbyte file being corrected
>>>> in a subsequent background process on another
>>>> stat.
>>>>
>>>> *However* .. when confronted with 500 x 15Gbyte
>>>> files, in a single directory (but not the root directory)
>>>> things don't quite work out as nicely.
>>>> - First, the "ls" (at MMC against the SaturnM brick)
>>>> is blocking and hangs the terminal (ctrl-c doesn't
>>>> kill it).
>> <pranithk> At max 16 files can be self-healed in the background in
>> parallel. 17th file self-heal will happen in the foreground.
>>>> - Then, looking from MMC at the SaturnI file
>>>> system (ls -s) once per second, and then
>>>> comparing the output (diff ls1.txt ls2.txt |
>>>> grep -v '>') we can see that between 10 and 17
>>>> files are being updated simultaneously by the
>>>> background process
>> <pranithk> This is expected.
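A small loop along these lines can make that comparison easier to repeat; it is only a sketch of the technique described above, and the backend path is a placeholder:

    # snapshot the backend directory once per second and count entries
    # whose size changed since the previous snapshot
    : > /tmp/ls1.txt
    while true; do
        ls -s /data/brick/subdir > /tmp/ls2.txt
        diff /tmp/ls1.txt /tmp/ls2.txt | grep -c '^<'
        mv /tmp/ls2.txt /tmp/ls1.txt
        sleep 1
    done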
>>>> - Further, a request at MMC for a single file that
>>>> was originally in the 500 x 15Gbyte sub-dir on
>>>> SaturnM (which should return unblocked with
>>>> correct results) will;
>>>> a) work as expected if there are less than 17
>>>> active background file tasks
>>>> b) block/hang if there are more
>>>> - Whereas a stat (ls) outside of the 500 x 15
>>>> sub-directory, such as the root of that brick,
>>>> would always work as expected (return
>>>> immediately, unblocked).
>> <pranithk> stat on the directory will only create the files with '0'
>> file size. Then when you ls/stat the actual file, the self-heal for the
>> file gets triggered.
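In that case, a walk of the client mount that stats every file should trigger self-heal for each one, rather than waiting for individual lookups; a minimal sketch, with the mount point as a placeholder:

    # stat every file on the client mount to trigger AFR self-heal per file
    find /mnt/saturnm -type f -print0 | xargs -0 stat > /dev/null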
>>>>
>>>> Thus, to us, it appears as though there is a
>>>> queue feeding a set of (around) 16 worker threads
>>>> in AFR. If your request was to the loaded directory
>>>> then you would be blocked until a worker was
>>>> available, and if your request was to any other
>>>> location, it would return unblocked regardless of
>>>> the worker pool state.
>>>>
>>>> The only thread metric that we could find to tweak
>>>> was performance/io-threads (which was set to
>>>> 16 per physical disk -- or rather, per locks + posix
>>>> brick stack), but increasing this to 64 per stack didn't
>>>> change the outcome (10 to 17 active background
>>>> transfers).
>> <pranithk> the option to increase the max num of background self-heals
>> is cluster.background-self-heal-count. Default value of which is 16. I
>> assume you know what you are doing to the performance of the system by
>> increasing this number.
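For reference, on glusterd-managed volumes these tunables can usually be set from the CLI roughly as follows (the volume name is a placeholder); with hand-written volfiles the corresponding options would instead go in the cluster/replicate and performance/io-threads sections of the volfile:

    # allow more parallel background self-heals (default 16)
    gluster volume set VOLNAME cluster.background-self-heal-count 32
    # raise the io-threads worker count per brick stack
    gluster volume set VOLNAME performance.io-thread-count 64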
>>>>
>>>> So, given the above, is our analysis sound, and
>>>> if so, is there a way to increase the size of the
>>>> pool of active worker threads? The objective
>>>> being to allow unblocked access to an existing
>>>> repository of files (on SaturnM) while a
>>>> secondary/back-up is being filled, via GlusterFS?
>>>>
>>>> Note that I understand that performance
>>>> (through-put) will be an issue in the described
>>>> environment: this replication process is
>>>> estimated to run for between 10 and 40 hours,
>>>> which is acceptable so long as it isn't blocking
>>>> (there's a production-capable file set in place).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Any help appreciated.
>>>>
>> Please let us know how it goes.
>>>> Thanks,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ian Latter
>>>> Late night coder ..
>>>> http://midnightcode.org/
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at nongnu.org
>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>>
>>> --
>>> Ian Latter
>>> Late night coder ..
>>> http://midnightcode.org/
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>> hi Ian,
>> inline replies with <pranithk>.
>>
>> Pranith.
>>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
hi Ian,
Maintaining a queue of files that need to be self-healed does not
scale in practice when there are millions of files that need
self-heal, so such a queue is not implemented. The idea is to make
self-heal run in the foreground once a certain limit
(background-self-heal-count) is reached, so there is no need for such a queue.
Pranith.