[Gluster-devel] Re: Rant... WAS: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.

Mon Jan 5 13:23:03 UTC 2009

Swank iest wrote:
> It's a shame zresearch does not care to include the community in design. 
>  Am I mistakenly under the impression that gluster is an Open Source 
> project?

Maybe I miscommunicated. We discuss only the intricate implementation
details in person or over the phone. We do include the community and
strongly value its feedback. If you go through the mailing list and IRC
archives, you will see a number of architectural discussions. Even the
public roadmap page has a placeholder for suggestions. The project itself
is hosted on Savannah, and the source is under the GPLv3 license. We are
trying our best with the limited resources we have.

> For instance, you may find there is a large portion of the community who 
> will feel that removing the file system's ability to heal itself is a 
> bad thing.  I, for example, would find having to manually monitor the 
> state of my clustered file system a rather expensive task.
> 
> I do, however, appreciate that it is a hard problem to solve.

We are not removing any ability; we are replacing it with a better one.
The self-healing code will be re-implemented in a synchronous model as
an external tool. Currently it is the most complicated and unstable code
inside the file system. Stability is the #1 priority for everyone.

You can launch this tool from a cron job. We are also planning to add
"daemon mode" support so that it can receive notifications and handle
events in real time (active healing).

> 
> I also believe that 
> 
> 1) Being told that FreeBSD is only supported with version 7.0 and only 
> with glusterfs 1.4 (which isn't released) is a bad thing.  Where is the 
> stable code base?  Has development stopped on 1.3?  I feel pressure to 
> be running 1.4, but it's not released yet.

Yes, only critical bug fixes happen on 1.3.

Release 2.0 (formerly 1.4) should happen this month. It is relatively
more stable than 1.3.

> 2) Being told that 1.4 release candidates are not a good "framework" to 
> be solving problems in is scary.  If 1.4 isn't the correct place, where 
> is?  Is there a 1.5 that hasn't been made public yet?  Is the AFR 
> self-heal code going to be ripped out of 1.4?  When will it be ripped 
> out?  I thought there was going to be a 1.4 release soon.  If 1.3 isn't 
> stable, and 1.4 isn't a good framework, what should someone use in 
> production?  Can only code that has been contracted from zresearch be 
> used in production?  How much does this cost?

Self-heal code will not be removed until it is replaced with a better
alternative. The upcoming 2.0 release will still have self-heal turned on
by default. Once we feel that glusterfs-heal is ready, we will turn
self-heal off by default. We will not remove features without discussing
it with the community.

GlusterFS code is the same for both commercial and gratis users. We do not
hold any code as proprietary. Commercial users pay for the subscription
package, which is support and service for GlusterFS. We deploy, hand-hold
and maintain. (Similar to Red Hat, except we don't restrict redistribution
of binaries.)

> 3) Talking about features in a public forum may lead to a better end 
> result.  For instance it may lead to feedback such as:

We always do that. The healing tool is already on the roadmap. It was
not supposed to be introduced in this release, but we are planning to
make it available as part of a 2.0.x minor release instead of waiting
for 2.1.

This discussion came up because you requested an optimization that
requires a hacky implementation. I won't complicate the current self-heal
design any further. It is easily achievable with the new design.

> AFR is broken in a number of ways right now
> 
> 1) AFR blocks on self-heal.  ls -lR will not return until the heal is 
> complete.  On large directories, this will make many applications break 
> in wonderfully weird ways.  I'm imagining users of web applications that 
> have files backed on gluster clicking refresh for 30 minutes.
> 
> 2) AFR self-heal is incredibly slow.  I have tracked this down to the 
> use of 4kb "chunks" being sent at a time.  The explanation for this is 
> to allow "sparse file replication".  However, the additional TCP overhead 
> that using such small chunks causes means that self-heal will run at 
> speeds less than 1MBps in my environment (I'm attempting to run gluster 
> over a VPN between data centers.)  I believe that the tcp chunk should 
> be tied to the TCP window size.  I have set the 4kb size to 131072 in my 
> environment to get things to work a bit better (however, without 
> aggregation of small files, there is still an unnecessary amount of TCP 
> overhead which causes small files to be replicated really slowly.)

This was one of the reasons to implement a healing tool. It gives more
control to the user. Currently it is hard for the user to track when and
how the healing happens.

4KB chunk healing is fixable. I will look into it.
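
As a rough illustration of why the chunk size matters on a high-latency
link, here is a back-of-the-envelope sketch in Python. It assumes a simple
stop-and-wait model in which every chunk costs one full round trip before
the next chunk is sent. The 50 ms round-trip time, the 100 Mbit/s link and
the function name are assumptions made up for this example; they are not
measurements of AFR.

```python
def stop_and_wait_rate(chunk_bytes, rtt_s, link_bytes_per_s):
    """Bytes per second when each chunk waits for one round trip."""
    wire_time = chunk_bytes / link_bytes_per_s  # time to push the chunk out
    return chunk_bytes / (rtt_s + wire_time)


RTT_S = 0.050          # assumed round-trip time between data centers
LINK_BPS = 12_500_000  # assumed 100 Mbit/s link, in bytes per second

# Compare the current 4 KB chunk with the 131072-byte chunk from the report.
rates = {
    chunk: stop_and_wait_rate(chunk, RTT_S, LINK_BPS)
    for chunk in (4096, 131072)
}
for chunk, rate in rates.items():
    print(f"{chunk:>6}-byte chunks: about {rate / 1e6:.2f} MB/s")
```

Under these assumptions the 4 KB case stays well below 1 MB/s because
almost all of the time is spent waiting on the round trip, not on the wire.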

I really appreciate your feedback and in-depth details. Your bug reports
are also very useful. The more you contribute, the more attention you will
gain :).

> 3) AFR only lists files that exist on the first brick listed in the AFR 
> configuration.  This can lead to really awkward situations where a file 
> doesn't exist on the first brick but does on subsequent bricks.  Now, 
> I've been explained that this occurs because AFR does not require a 
> metadata server.  In fact, this was one of the draws of gluster to me 
> (not having to find some way to make the metadata server highly 
> available.)  I did not understand (from any of the documentation 
> available) that it's not that gluster doesn't *require* a metadata 
> server, it's that it doesn't solve name space problem at all.

AFR uses two-phase commit for atomic write operations. For read operations
it load-balances across the volumes. What you are asking for is an atomic
read/readdir from all the volumes that verifies they are the same. We can
implement that, but it will impact performance.
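
To make the trade-off concrete, here is a minimal Python sketch. It is not
GlusterFS code: each brick is modelled as a plain list of file names, and
both function names are hypothetical. The cheap path trusts a single
brick's listing, while the verified path has to read every brick before it
can answer.

```python
def readdir_single(bricks):
    """Cheap path: return the listing of one brick (here, the first)."""
    return sorted(bricks[0])


def readdir_verified(bricks):
    """Costly path: read every brick, merge, and report disagreements."""
    union = set().union(*bricks)
    common = set.intersection(*map(set, bricks))
    # Entries that are missing from at least one brick.
    inconsistent = union - common
    return sorted(union), sorted(inconsistent)


# Hypothetical state: "b.txt" never reached the first brick.
bricks = [["a.txt"], ["a.txt", "b.txt"]]

print(readdir_single(bricks))    # the first brick alone does not show b.txt
print(readdir_verified(bricks))  # the merged view does, and flags it
```

The verified path costs one readdir per brick on every call, which is
where the performance impact comes from.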

GlusterFS does not have a metadata server, even for file-level (distributed
hash) or block-level (stripe) distribution.

> 4) AFR does not work reliably above unify or DHT.  It crashes a lot. 
>  Now, I can understand that gluster was not designed to operate in this 
> fashion, however, I cannot think of any other way to put live data into 
> a gluster file system.  (read this as, it would not be my final config, 
> but without having real-time replication of data into my "proper" 
> config... I would need to turn off live servers for days if not weeks to 
> move the data around by hand.  If I were to move data around by hand, 
> why would I need a replicated file system?)  If it were the case that 
> gluster is not designed to solve these problems, perhaps that should be 
> listed in the documentation somewhere rather than instructions on how to 
> do it (perhaps this is already the case with the 1.4 documentation?). 
>  Preferably, we could just fix the problems that cause it to not be possible

AFR is very much intended to work with DHT or Unify. We will look into
your bug reports.

> Now it's really naive of me to even attempt a design of a working 
> system, but if I were to try...
> 
> I would break AFR into three code paths
> 
> 1) WRITE
> 
> on write, files are written to all available bricks.  Bricks that are 
> not available are queued until they become available again.
> 
> 2) READ
> 
> on read, lookups happen on all bricks.  If a file doesn't exist on a 
> particular brick, it is added to the queue for replication.  The file is 
> returned from a valid brick.  (this is complicated by a server not being 
> available when a delete occurs.  If after it comes back up after a 
> deletion and that file is requested, that file would be replicated 
> again.)  This would, of course, not scale linearly with the addition of 
> bricks.
> 
> 3) REPLICATION
> 
> Process the queue.  I don't know where this queue should exist.  But 
> replication ought to occur in its own thread/process independent of 
> read/write.  Somewhere into this could be added code to "balance" files 
> across bricks (should a certain number of bricks only be required for a 
> file.  example: 5 bricks, but only two bricks require the file.)
> 
> </rant>

Queuing writes from multiple clients raises a lot of coherency issues. It
is a complicated design. We have thought of implementing a spare volume
concept for this purpose. I will discuss it with you when the time is right.
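
One way such a coherency issue can arise is sketched below in Python. This
is only an illustration, not the AFR design: the per-client queues, the
`replay` function and the file names are all hypothetical. Two clients each
queue a write to the same file while a brick is offline, and the content
the brick ends up with depends on the order in which the queues are
replayed.

```python
def replay(queues, order):
    """Apply each client's queued writes to an empty brick, client by client."""
    brick = {}
    for client in order:
        for path, data in queues[client]:
            brick[path] = data  # a later write overwrites an earlier one
    return brick


# Writes queued by two clients while the brick was unavailable.
queues = {
    "client1": [("/f", "v1")],
    "client2": [("/f", "v2")],
}

first = replay(queues, ["client1", "client2"])
second = replay(queues, ["client2", "client1"])
print(first, second, first == second)
```

Without a single agreed ordering of the queued writes, the recovered brick
can disagree with the bricks that stayed online.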

> Is there an automated build process for arch somewhere?  If not, I would 
> be willing to build one for the project so that developers would be 
> warned of build errors as were introduced and fixed for FreeBSD 
> recently.  It would be a convenient place to add unit tests as well.
> 
>  Christopher Owen.

Automated build for FreeBSD? We don't even have an in-house FreeBSD server.
It would be a big help for us.

Thanks a lot. Happy Hacking!
--
Anand Babu

> 
>  > Date: Mon, 5 Jan 2009 02:30:29 -0800
>  > From: ab at zresearch.com
>  > To: swankier at msn.com
>  > CC: list-hacking at zresearch.com; gluster-users at gluster.org; Gluster-devel at nongnu.org
>  > Subject: Re: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.
>  >
>  > Christopher, the main issue with self-heal is its complexity. Handling
>  > self-healing logic in a non-blocking asynchronous code path is
>  > difficult. Replicating a missing file sounds simple, but holding off a
>  > lookup call, initiating a new series of calls to heal the file, and
>  > then resuming normal operation is tricky. Many of the bugs we faced in
>  > 1.3 are related to self-heal. We have handled most of these cases over
>  > a period of time. Self-healing is decent now, but not good enough. We
>  > feel that it has only complicated the code base. It is hard to test
>  > and maintain this part of the code base.
>  >
>  > The plan is to drop the self-heal code altogether once the active
>  > healing tool is ready. Unlike self-healing, this active healing can be
>  > run by the user on a mounted file system (online) at any time. By
>  > moving the code out of the file system and into a tool (that is
>  > synchronous and linear), we can implement sophisticated healing
>  > techniques.
>  >
>  > The code is not in the repository yet. Hopefully it will be ready for
>  > use in a month. You can simply turn off self-heal and run this utility
>  > while the file system is mounted.
>  >
>  > List-hacking is an internal company list, mostly junk :). We don't
>  > discuss technical / architectural stuff there. Those discussions
>  > mostly happen over the phone and in in-person meetings. We do want to
>  > actively involve the community right from the design phase. A mailing
>  > list is cumbersome and slow for interactively brainstorming designs.
>  > We can organize IRC sessions once in a while for this purpose.
>  >
>  > --
>  > Anand Babu
>  >
>  > Swank iest wrote:
>  >> Well,
>  >>
>  >> I guess this is getting outside of the bug. I suppose you are going to
>  >> mark it as not going to fix?
>  >>
>  >> I'm trying to put gluster into production right now, so may I ask:
>  >>
>  >> 1) What are the current issues with self-heal that require a full
>  >> re-write? Is there a place in the Wiki or elsewhere where it's being
>  >> documented?
>  >> 2) May I see the new code? I must not be looking in the correct place
>  >> in TLA?
>  >> 3) If it's not written yet, may I be included in the design discussion?
>  >> (As I haven't put gluster into production yet, now would be a good time
>  >> to know if it's not going to work in the near future.)
>  >> 4) May I be placed on the list-hacking at zresearch.com mailing list, please?
>  >>
>  >> Christopher.
>  >>
>  >> > Date: Mon, 5 Jan 2009 01:36:14 -0800
>  >> > From: ab at zresearch.com
>  >> > To: krishna at zresearch.com
>  >> > CC: swankier at msn.com; list-hacking at zresearch.com
>  >> > Subject: Re: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.
>  >> >
>  >> > Krishna, leave it as is. Once self-heal ensures that the volumes are
>  >> > intact, rm will remove both the copies anyway. It is inefficient, but
>  >> > optimizing it in the current framework will be hacky.
>  >> >
>  >> > Swankier, we are replacing the current self-healing framework with an
>  >> > active healing tool. We can take care of it then.
>  >> >
>  >> >
>  >> > Krishna Srinivas wrote:
>  >> >> The current self-heal logic is built into the lookup of a file, and a
>  >> >> lookup is issued just before any operation on a file. So the lookup
>  >> >> call does not know whether an open or an rm is going to be done on
>  >> >> the file. Will get back to you if we can do anything about this,
>  >> >> i.e. to avoid the redundant copy of the file when it is going to be
>  >> >> rm'ed.
>  >> >>
>  >> >> Krishna
>  >> >>
>  >> >> On Mon, Jan 5, 2009 at 12:19 PM, swankier <INVALID.NOREPLY at gnu.org> wrote:
>  >> >>> Follow-up Comment #2, bug #25207 (project gluster):
>  >> >>>
>  >> >>> I am:
>  >> >>>
>  >> >>> 1) delete file from posix system beneath afr on one side
>  >> >>> 2) run rm on gluster file system
>  >> >>>
>  >> >>> file is then replicated followed by deletion
>  >> >>>
>  >> >>> _______________________________________________________
>  >> >>>
>  >> >>> Reply to this item at:
>  >> >>>
>  >> >>> <http://savannah.nongnu.org/bugs/?25207>
>  >> >
>  >> > --
>  >> > Anand Babu Periasamy
>  >> > GPG Key ID: 0x62E15A31
>  >> > Blog [http://ab.freeshell.org]
>  >> > GlusterFS [http://www.gluster.org]
>  >> > The GNU Operating System [http://www.gnu.org]
>  >> >
>  >>
>  >
> 

-- 
Anand Babu Periasamy
GPG Key ID: 0x62E15A31
Blog [http://ab.freeshell.org]
GlusterFS [http://www.gluster.org]
The GNU Operating System [http://www.gnu.org]