[Gluster-devel] replicate background threads
Pranith Kumar K
pranithk at gluster.com
Wed Mar 14 02:03:32 UTC 2012
On 03/14/2012 01:47 AM, Ian Latter wrote:
> Thanks for the info Pranith;
>
> <pranithk> the option to increase the max num of background
> self-heals
> is cluster.background-self-heal-count. Default value of
> which is 16. I
> assume you know what you are doing to the performance of the
> system by
> increasing this number.
>
>
> I didn't know this. Is there a queue length for what
> is yet to be handled by the background self heal
> count? If so, can it also be adjusted?
>
>
> ----- Original Message -----
>> From: "Pranith Kumar K"<pranithk at gluster.com>
>> To: "Ian Latter"<ian.latter at midnightcode.org>
>> Subject: Re: [Gluster-devel] replicate background threads
>> Date: Tue, 13 Mar 2012 21:07:53 +0530
>>
>> On 03/13/2012 07:52 PM, Ian Latter wrote:
>>> Hello,
>>>
>>>
>>> Well we've been privy to our first true error in
>>> Gluster now, and we're not sure of the cause.
>>>
>>> The SaturnI machine with 1Gbyte of RAM was
>>> exhausting its memory and crashing and we saw
>>> core dumps on SaturnM and MMC. Replacing
>>> the SaturnI hardware with identical hardware to
>>> SaturnM, but retaining SaturnI's original disks,
>>> (so fixing the memory capacity problem) we saw
>>> crashes randomly at all nodes.
>>>
>>> Looking for irregularities at the file system
>>> we noticed that (we'd estimate) about 60% of
>>> the files at the OS/EXT3 layer of SaturnI
>>> (sourced via replicate from SaturnM) were of
>>> size 2147483648 (2^31) where they should
>>> have been substantially larger. While we would
>>> happily accept "you shouldn't be using a 32bit
>>> gluster package" as the answer, we note two
>>> deltas;
>>> 1) All files used in testing were copied on from
>>> 32 bit clients to 32 bit servers, with no
>>> observable errors
>>> 2) Of the files that were replicated, not all were
>>> corrupted (capped at 2G -- note that we
>>> confirmed that this was the first 2G of the
>>> source file contents).
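A quick way to locate truncated copies like the ones described above is to search the backend (EXT3) brick directories for files of exactly 2^31 bytes; this is only a sketch, and the brick path is a placeholder:

    # count backend files stuck at exactly 2147483648 bytes (2^31)
    find /data/brick -type f -size 2147483648c | wc -l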
>>>
>>>
>>> So is there a known replicate issue with files
>>> greater than 2GB? Has anyone done any
>>> serious testing with significant numbers of files
>>> of this size? Are there configurations specific
>>> to files/structures of these dimensions?
>>>
>>> We noted that when the configuration is reversed,
>>> such that SaturnI provides the replicate brick over
>>> its local distribute and a remote map to SaturnM
>>> (where SaturnM simply serves a local distribute),
>>> the data served to MMC is accurate (it
>>> continues to show 15GB files, even where there
>>> is a local 2GB copy). Further, a client "cp" at
>>> MMC of a file with a 2GB local replica of a
>>> 15GB file will result in a 15GB file being
>>> created and replicated via Gluster (i.e. the
>>> correct size at both server nodes).
>>>
>>> So my other question is; Is it possible that we've
>>> managed to corrupt something in this
>>> environment? I.e. during the initial memory
>>> exhaustion events? And is there a robust way
>>> to have the replicate files revalidated by gluster
>>> as a stat doesn't seem to be correcting files in
>>> this state (i.e. replicate on SaturnM results in
>>> daemon crashes, replicate on SaturnI results
>>> in files being left in the bad state).
>>>
>>>
>>> Also, I'm not a member of the users list; if these
>>> questions are better posed there then let me
>>> know and I'll re-post them there.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Ian Latter"<ian.latter at midnightcode.org>
>>>> To:<gluster-devel at nongnu.org>
>>>> Subject: [Gluster-devel] replicate background threads
>>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
>>>>
>>>> Hello,
>>>>
>>>>
>>>> My mate Michael and I have been steadily
>>>> advancing our Gluster testing and today we finally
>>>> reached some heavier conditions. The outcome
>>>> was different from expectations built from our more
>>>> basic testing so I think we have a couple of
>>>> questions regarding the AFR/Replicate background
>>>> threads that may need a developer's contribution.
>>>> Any help appreciated.
>>>>
>>>>
>>>> The setup is a 3 box environment, but let's start
>>>> with two;
>>>>
>>>> SaturnM (Server)
>>>> - 6core CPU, 16GB RAM, 1Gbps net
>>>> - 3.2.6 Kernel (custom distro)
>>>> - 3.2.5 Gluster (32bit)
>>>> - 3x2TB drives, CFQ, EXT3
>>>> - Bricked up into a single local 6TB
>>>> "distribute" brick
>>>> - "brick" served to the network
>>>>
>>>> MMC (Client)
>>>> - 4core CPU, 8GB RAM, 1Gbps net
>>>> - Ubuntu
>>>> - 3.2.5 Gluster (32bit)
>>>> - Don't recall the disk space locally
>>>> - "brick" from SaturnM mounted
>>>>
>>>> 500 x 15Gbyte files were copied from MMC
>>>> to a single sub-directory on the brick served from
>>>> SaturnM, all went well and dandy. So then we
>>>> moved on to a 3 box environment;
>>>>
>>>> SaturnI (Server)
>>>> = 1core CPU, 1GB RAM, 1Gbps net
>>>> = 3.2.6 Kernel (custom distro)
>>>> = 3.2.5 Gluster (32bit)
>>>> = 4x2TB drives, CFQ, EXT3
>>>> = Bricked up into a single local 8TB
>>>> "distribute" brick
>>>> = "brick" served to the network
>>>>
>>>> SaturnM (Server/Client)
>>>> - 6core CPU, 16GB RAM, 1Gbps net
>>>> - 3.2.6 Kernel (custom distro)
>>>> - 3.2.5 Gluster (32bit)
>>>> - 3x2TB drives, CFQ, EXT3
>>>> - Bricked up into a single local 6TB
>>>> "distribute" brick
>>>> = Replicate brick added to sit over
>>>> the local distribute brick and a
>>>> client "brick" mapped from SaturnI
>>>> - Replicate "brick" served to the network
>>>>
>>>> MMC (Client)
>>>> - 4core CPU, 8GB RAM, 1Gbps net
>>>> - Ubuntu
>>>> - 3.2.5 Gluster (32bit)
>>>> - Don't recall the disk space locally
>>>> - "brick" from SaturnM mounted
>>>> = "brick" from SaturnI mounted
>>>>
>>>>
>>>> Now, in lesser testing in this scenario all was
>>>> well - any files on SaturnI would appear on SaturnM
>>>> (not a functional part of our test) and the content on
>>>> SaturnM would appear on SaturnI (the real
>>>> objective).
>>>>
>>>> Earlier testing used a handful of smaller files (10s
>>>> to 100s of Mbytes) and a single 15Gbyte file. The
>>>> 15Gbyte file would be "stat" via an "ls", which would
>>>> kick off a background replication (ls appeared
>>>> unblocked) and the file would be transferred. Also,
>>>> interrupting the transfer (pulling the LAN cable)
>>>> would result in a partial 15Gbyte file being corrected
>>>> in a subsequent background process on another
>>>> stat.
>>>>
>>>> *However* .. when confronted with 500 x 15Gbyte
>>>> files, in a single directory (but not the root directory)
>>>> things don't quite work out as nicely.
>>>> - First, the "ls" (at MMC against the SaturnM brick)
>>>> is blocking and hangs the terminal (ctrl-c doesn't
>>>> kill it).
>> <pranithk> At max 16 files can be self-healed in the background in
>> parallel. 17th file self-heal will happen in the foreground.
>>>> - Then, looking from MMC at the SaturnI file
>>>> system (ls -s) once per second, and then
>>>> comparing the output (diff ls1.txt ls2.txt |
>>>> grep -v '>') we can see that between 10 and 17
>>>> files are being updated simultaneously by the
>>>> background process
>> <pranithk> This is expected.
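A small loop along these lines can make that comparison easier to repeat; it is only a sketch of the technique described above, and the backend path is a placeholder:

    # snapshot the backend directory once per second and count entries
    # whose size changed since the previous snapshot
    : > /tmp/ls1.txt
    while true; do
        ls -s /data/brick/subdir > /tmp/ls2.txt
        diff /tmp/ls1.txt /tmp/ls2.txt | grep -c '^<'
        mv /tmp/ls2.txt /tmp/ls1.txt
        sleep 1
    done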
>>>> - Further, a request at MMC for a single file that
>>>> was originally in the 500 x 15Gbyte sub-dir on
>>>> SaturnM (which should return unblocked with
>>>> correct results) will;
>>>> a) work as expected if there are less than 17
>>>> active background file tasks
>>>> b) block/hang if there are more
>>>> - Whereas a stat (ls) outside of the 500 x 15
>>>> sub-directory, such as the root of that brick,
>>>> would always work as expected (return
>>>> immediately, unblocked).
>> <pranithk> stat on the directory will only create the files with '0'
>> file size. Then when you ls/stat the actual file, the self-heal for the
>> file gets triggered.
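In that case, a walk of the client mount that stats every file should trigger self-heal for each one, rather than waiting for individual lookups; a minimal sketch, with the mount point as a placeholder:

    # stat every file on the client mount to trigger AFR self-heal per file
    find /mnt/saturnm -type f -print0 | xargs -0 stat > /dev/null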
>>>>
>>>> Thus, to us, it appears as though there is a
>>>> queue feeding a set of (around) 16 worker threads
>>>> in AFR. If your request was to the loaded directory
>>>> then you would be blocked until a worker was
>>>> available, and if your request was to any other
>>>> location, it would return unblocked regardless of
>>>> the worker pool state.
>>>>
>>>> The only thread metric that we could find to tweak
>>>> was performance/io-threads (which was set to
>>>> 16 per physical disk -- or rather, per locks + posix
>>>> brick stack), but increasing this to 64 per stack didn't
>>>> change the outcome (10 to 17 active background
>>>> transfers).
>> <pranithk> the option to increase the max num of background self-heals
>> is cluster.background-self-heal-count. Default value of which is 16. I
>> assume you know what you are doing to the performance of the system by
>> increasing this number.
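For reference, on glusterd-managed volumes these tunables can usually be set from the CLI roughly as follows (the volume name is a placeholder); with hand-written volfiles the corresponding options would instead go in the cluster/replicate and performance/io-threads sections of the volfile:

    # allow more parallel background self-heals (default 16)
    gluster volume set VOLNAME cluster.background-self-heal-count 32
    # raise the io-threads worker count per brick stack
    gluster volume set VOLNAME performance.io-thread-count 64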
>>>>
>>>> So, given the above, is our analysis sound, and
>>>> if so, is there a way to increase the size of the
>>>> pool of active worker threads? The objective
>>>> being to allow unblocked access to an existing
>>>> repository of files (on SaturnM) while a
>>>> secondary/back-up is being filled, via GlusterFS?
>>>>
>>>> Note that I understand that performance
>>>> (through-put) will be an issue in the described
>>>> environment: this replication process is
>>>> estimated to run for between 10 and 40 hours,
>>>> which is acceptable so long as it isn't blocking
>>>> (there's a production-capable file set in place).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Any help appreciated.
>>>>
>> Please let us know how it goes.
>>>> Thanks,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ian Latter
>>>> Late night coder ..
>>>> http://midnightcode.org/
>>>>
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at nongnu.org
>>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>>>
>>> --
>>> Ian Latter
>>> Late night coder ..
>>> http://midnightcode.org/
>>>
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel at nongnu.org
>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>> hi Ian,
>> inline replies with <pranithk>.
>>
>> Pranith.
>>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
hi Ian,
Maintaining a queue of files that need to be self-healed does not
scale in practice when there are millions of files that need
self-heal, so such a queue is not implemented. The idea is to make
self-heal run in the foreground once a certain limit
(background-self-heal-count) is reached, so there is no need for such a queue.
Pranith.