[Gluster-devel] replicate background threads
Ian Latter
ian.latter at midnightcode.org
Wed Mar 14 09:36:24 UTC 2012
Hello,
> hi Ian,
> Maintaining a queue of files that need to be
> self-healed does not scale in practice, in cases
> where there are millions of files that need self-
> heal. So such a thing is not implemented. The
> idea is to make self-heal foreground after a
> certain-limit (background-self-heal-count) so
> there is no necessity for such a queue.
>
> Pranith.
Ok, I understand - it will be interesting to observe
the system with the new knowledge from your
messages - thanks for your help, appreciate it.
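For the record, here is roughly what raising that limit would look like
on our setup. This is only a sketch, not something we have verified: the
volume names are ours, and the option syntax is what we would expect the
3.2-era cluster/replicate translator to accept.

    # In the hand-written replicate volume on SaturnM (names are ours;
    # option name per Pranith's cluster.background-self-heal-count hint).
    volume replicate
      type cluster/replicate
      option background-self-heal-count 64   # default is 16
      subvolumes distribute saturni
    end-volume

On a glusterd-managed volume the equivalent would presumably be a single
"gluster volume set <volname> cluster.background-self-heal-count 64".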
Cheers,
----- Original Message -----
>From: "Pranith Kumar K" <pranithk at gluster.com>
>To: "Ian Latter" <ian.latter at midnightcode.org>
>Subject: Re: [Gluster-devel] replicate background threads
>Date: Wed, 14 Mar 2012 07:33:32 +0530
>
> On 03/14/2012 01:47 AM, Ian Latter wrote:
> > Thanks for the info Pranith;
> >
> > <pranithk> the option to increase the max num of background self-heals
> > is cluster.background-self-heal-count. Default value of which is 16. I
> > assume you know what you are doing to the performance of the system by
> > increasing this number.
> >
> >
> > I didn't know this. Is there a queue length for what
> > is yet to be handled by the background self heal
> > count? If so, can it also be adjusted?
> >
> >
> > ----- Original Message -----
> >> From: "Pranith Kumar K"<pranithk at gluster.com>
> >> To: "Ian Latter"<ian.latter at midnightcode.org>
> >> Subject: Re: [Gluster-devel] replicate background threads
> >> Date: Tue, 13 Mar 2012 21:07:53 +0530
> >>
> >> On 03/13/2012 07:52 PM, Ian Latter wrote:
> >>> Hello,
> >>>
> >>>
> >>> Well we've been privy to our first true error in
> >>> Gluster now, and we're not sure of the cause.
> >>>
> >>> The SaturnI machine, with 1Gbyte of RAM, was
> >>> exhausting its memory and crashing, and we saw
> >>> core dumps on SaturnM and MMC. Replacing
> >>> the SaturnI hardware with hardware identical to
> >>> SaturnM's, but retaining SaturnI's original disks
> >>> (so fixing the memory capacity problem), we saw
> >>> random crashes at all nodes.
> >>>
> >>> Looking for irregularities at the file system
> >>> we noticed that (we'd estimate) about 60% of
> >>> the files at the OS/EXT3 layer of SaturnI
> >>> (sourced via replicate from SaturnM) were of
> >>> size 2147483648 (2^31) where they should
> >>> have been substantially larger. While we would
> >>> happily accept "you shouldn't be using a 32bit
> >>> gluster package" as the answer, we note two
> >>> deltas;
> >>> 1) All files used in testing were copied on from
> >>> 32 bit clients to 32 bit servers, with no
> >>> observable errors
> >>> 2) Of the files that were replicated, not all were
> >>> corrupted (capped at 2G -- note that we
> >>> confirmed that this was the first 2G of the
> >>> source file contents).
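A quick way to gauge how widespread the 2^31 cap is would be to look on a
backend brick for files of exactly 2147483648 bytes; a sketch, with the
brick path assumed:

    # Count backend files stuck at exactly 2^31 bytes (path assumed).
    find /data/brick1 -type f -size 2147483648c | wc -l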
> >>>
> >>>
> >>> So is there a known replicate issue with files
> >>> greater than 2GB? Has anyone done any
> >>> serious testing with significant numbers of files
> >>> of this size? Are there configurations specific
> >>> to files/structures of these dimensions?
> >>>
> >>> We noted that when we reverse the configuration,
> >>> so that SaturnI provides the replicate brick over
> >>> its local distribute and a remote map to SaturnM,
> >>> and SaturnM simply serves a local distribute, the
> >>> data served to MMC is accurate (it continues to
> >>> show 15GB files, even where there is a local 2GB
> >>> copy). Further, a client "cp" at MMC of a 15GB
> >>> file that has only a 2GB local replicate will
> >>> result in a 15GB file being created and replicated
> >>> via Gluster (i.e. the correct size at both server
> >>> nodes).
> >>>
> >>> So my other question is: is it possible that we've
> >>> managed to corrupt something in this
> >>> environment, i.e. during the initial memory
> >>> exhaustion events? And is there a robust way
> >>> to have the replicate files revalidated by gluster,
> >>> since a stat doesn't seem to correct files in
> >>> this state (i.e. replicate on SaturnM results in
> >>> daemon crashes, replicate on SaturnI results
> >>> in files being left in the bad state)?
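The usual full-volume revalidation in this era is a recursive stat from a
client mount, which makes replicate inspect every file; a sketch, with the
mount point assumed, and no guarantee it recovers files already truncated
at 2GB:

    # Touch every file on the client mount so AFR self-heal examines it.
    find /mnt/brick -noleaf -print0 | xargs --null stat > /dev/null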
> >>>
> >>>
> >>> Also, I'm not a member of the users list; if these
> >>> questions are better posed there then let me
> >>> know and I'll re-post them there.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Ian Latter"<ian.latter at midnightcode.org>
> >>>> To:<gluster-devel at nongnu.org>
> >>>> Subject: [Gluster-devel] replicate background threads
> >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
> >>>>
> >>>> Hello,
> >>>>
> >>>>
> >>>> My mate Michael and I have been steadily
> >>>> advancing our Gluster testing and today we finally
> >>>> reached some heavier conditions. The outcome
> >>>> was different from expectations built from our more
> >>>> basic testing so I think we have a couple of
> >>>> questions regarding the AFR/Replicate background
> >>>> threads that may need a developer's contribution.
> >>>> Any help appreciated.
> >>>>
> >>>>
> >>>> The setup is a 3 box environment, but let's start
> >>>> with two;
> >>>>
> >>>> SaturnM (Server)
> >>>> - 6core CPU, 16GB RAM, 1Gbps net
> >>>> - 3.2.6 Kernel (custom distro)
> >>>> - 3.2.5 Gluster (32bit)
> >>>> - 3x2TB drives, CFQ, EXT3
> >>>> - Bricked up into a single local 6TB
> >>>> "distribute" brick
> >>>> - "brick" served to the network
> >>>>
> >>>> MMC (Client)
> >>>> - 4core CPU, 8GB RAM, 1Gbps net
> >>>> - Ubuntu
> >>>> - 3.2.5 Gluster (32bit)
> >>>> - Don't recall the disk space locally
> >>>> - "brick" from SaturnM mounted
> >>>>
> >>>> 500 x 15Gbyte files were copied from MMC
> >>>> to a single sub-directory on the brick served from
> >>>> SaturnM; all went well and dandy. So then we
> >>>> moved on to a 3 box environment;
> >>>>
> >>>> SaturnI (Server)
> >>>> = 1core CPU, 1GB RAM, 1Gbps net
> >>>> = 3.2.6 Kernel (custom distro)
> >>>> = 3.2.5 Gluster (32bit)
> >>>> = 4x2TB drives, CFQ, EXT3
> >>>> = Bricked up into a single local 8TB
> >>>> "distribute" brick
> >>>> = "brick" served to the network
> >>>>
> >>>> SaturnM (Server/Client)
> >>>> - 6core CPU, 16GB RAM, 1Gbps net
> >>>> - 3.2.6 Kernel (custom distro)
> >>>> - 3.2.5 Gluster (32bit)
> >>>> - 3x2TB drives, CFQ, EXT3
> >>>> - Bricked up into a single local 6TB
> >>>> "distribute" brick
> >>>> = Replicate brick added to sit over
> >>>> the local distribute brick and a
> >>>> client "brick" mapped from SaturnI
> >>>> - Replicate "brick" served to the network
> >>>>
> >>>> MMC (Client)
> >>>> - 4core CPU, 8GB RAM, 1Gbps net
> >>>> - Ubuntu
> >>>> - 3.2.5 Gluster (32bit)
> >>>> - Don't recall the disk space locally
> >>>> - "brick" from SaturnM mounted
> >>>> = "brick" from SaturnI mounted
> >>>>
> >>>>
> >>>> Now, in lesser testing in this scenario all was
> >>>> well - any files on SaturnI would appear on SaturnM
> >>>> (not a functional part of our test) and the content on
> >>>> SaturnM would appear on SaturnI (the real
> >>>> objective).
> >>>>
> >>>> Earlier testing used a handful of smaller files (10s
> >>>> to 100s of Mbytes) and a single 15Gbyte file. The
> >>>> 15Gbyte file would be "stat" via an "ls", which would
> >>>> kick off a background replication (ls appeared
> >>>> unblocked) and the file would be transferred. Also,
> >>>> interrupting the transfer (pulling the LAN cable)
> >>>> would result in a partial 15Gbyte file being corrected
> >>>> in a subsequent background process on another
> >>>> stat.
> >>>>
> >>>> *However* .. when confronted with 500 x 15Gbyte
> >>>> files, in a single directory (but not the root directory)
> >>>> things don't quite work out as nicely.
> >>>> - First, the "ls" (at MMC against the SaturnM brick)
> >>>> is blocking and hangs the terminal (ctrl-c doesn't
> >>>> kill it).
> >> <pranithk> At max 16 files can be self-healed in the background in
> >> parallel. 17th file self-heal will happen in the foreground.
> >>>> - Then, looking from MMC at the SaturnI file
> >>>> system (ls -s) once per second, and then
> >>>> comparing the output (diff ls1.txt ls2.txt |
> >>>> grep -v '>') we can see that between 10 and 17
> >>>> files are being updated simultaneously by the
> >>>> background process
> >> <pranithk> This is expected.
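A rough reconstruction of that once-per-second comparison, with the
SaturnI mount point and directory names assumed:

    # Snapshot file sizes on the SaturnI mount each second and count how
    # many entries changed since the previous pass (lines starting '<').
    ls -s /mnt/saturni/bigdir > ls1.txt
    while sleep 1; do
        ls -s /mnt/saturni/bigdir > ls2.txt
        diff ls1.txt ls2.txt | grep -v '>' | grep -c '^<'
        mv ls2.txt ls1.txt
    done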
> >>>> - Further, a request at MMC for a single file that
> >>>> was originally in the 500 x 15Gbyte sub-dir on
> >>>> SaturnM (which should return unblocked with
> >>>> correct results) will;
> >>>> a) work as expected if there are less than 17
> >>>> active background file tasks
> >>>> b) block/hang if there are more
> >>>> - Whereas a stat (ls) outside of the 500 x 15
> >>>> sub-directory, such as the root of that brick,
> >>>> would always work as expected (return
> >>>> immediately, unblocked).
> >> <pranithk> stat on the directory will only create the files with '0'
> >> file size. Then when you ls/stat the actual file the self-heal for the
> >> file gets triggered.
> >>>>
> >>>> Thus, to us, it appears as though there is a
> >>>> queue feeding a set of (around) 16 worker threads
> >>>> in AFR. If your request was to the loaded directory
> >>>> then you would be blocked until a worker was
> >>>> available, and if your request was to any other
> >>>> location, it would return unblocked regardless of
> >>>> the worker pool state.
> >>>>
> >>>> The only thread metric that we could find to tweak
> >>>> was performance/io-threads (which was set to
> >>>> 16 per physical disk; well per locks + posix brick
> >>>> stacks) but increasing this to 64 per stack didn't
> >>>> change the outcome (10 to 17 active background
> >>>> transfers).
> >> <pranithk> the option to increase the max num of background self-heals
> >> is cluster.background-self-heal-count. Default value of which is 16. I
> >> assume you know what you are doing to the performance of the system by
> >> increasing this number.
> >>>>
> >>>> So, given the above, is our analysis sound, and
> >>>> if so, is there a way to increase the size of the
> >>>> pool of active worker threads? The objective
> >>>> being to allow unblocked access to an existing
> >>>> repository of files (on SaturnM) while a
> >>>> secondary/back-up is being filled, via GlusterFS?
> >>>>
> >>>> Note that I understand that performance
> >>>> (through-put) will be an issue in the described
> >>>> environment: this replication process is
> >>>> estimated to run for between 10 and 40 hours,
> >>>> which is acceptable so long as it isn't blocking
> >>>> (there's a production-capable file set in place).
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Any help appreciated.
> >>>>
> >> Please let us know how it goes.
> >>>> Thanks,
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ian Latter
> >>>> Late night coder ..
> >>>> http://midnightcode.org/
> >>>>
> >>>> _______________________________________________
> >>>> Gluster-devel mailing list
> >>>> Gluster-devel at nongnu.org
> >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >>>>
> >>> --
> >>> Ian Latter
> >>> Late night coder ..
> >>> http://midnightcode.org/
> >>>
> >>> _______________________________________________
> >>> Gluster-devel mailing list
> >>> Gluster-devel at nongnu.org
> >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >> hi Ian,
> >> inline replies with <pranithk>.
> >>
> >> Pranith.
> >>
> >
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> hi Ian,
> Maintaining a queue of files that need to be self-healed does not
> scale in practice, in cases where there are millions of files that need
> self-heal. So such a thing is not implemented. The idea is to make
> self-heal foreground after a certain-limit (background-self-heal-count)
> so there is no necessity for such a queue.
>
> Pranith.
>
--
Ian Latter
Late night coder ..
http://midnightcode.org/