[Gluster-users] Rant... WAS: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.
Swank iest
swankier at msn.com
Mon Jan 5 11:37:33 UTC 2009
It's a shame zresearch does not care to include the community in design. Am I mistakenly under the impression that gluster is an Open Source project? For instance, you may find there is a large portion of the community who will feel that removing the file system's ability to heal itself is a bad thing. I, for example, would find having to manually monitor the state of my clustered file system a rather expensive task. I do, however, appreciate that it is a hard problem to solve.

I also believe that:

1) Being told that FreeBSD is only supported with version 7.0 and only with glusterfs 1.4 (which isn't released) is a bad thing. Where is the stable code base? Has development stopped on 1.3? I feel pressure to be running 1.4, but it's not released yet.

2) Being told that 1.4 release candidates are not a good "framework" to be solving problems in is scary. If 1.4 isn't the correct place, where is? Is there a 1.5 that hasn't been made public yet? Is the AFR self-heal code going to be ripped out of 1.4? When will it be ripped out? I thought there was going to be a 1.4 release soon. If 1.3 isn't stable, and 1.4 isn't a good framework, what should someone use in production? Can only code that has been contracted from zresearch be used in production? How much does this cost?

3) Talking about features in a public forum may lead to a better end result. For instance, it may lead to feedback such as:

AFR is broken in a number of ways right now:

1) AFR blocks on self-heal. ls -lR will not return until the heal is complete. On large directories, this will make many applications break in wonderfully weird ways. I'm imagining users of web applications that have files backed on gluster clicking refresh for 30 minutes.

2) AFR self-heal is incredibly slow. I have tracked this down to the use of 4kb "chunks" being sent one at a time. The explanation for this is to allow "sparse file replication". However, the additional TCP overhead that such small chunks cause means that self-heal runs at less than 1MBps in my environment (I'm attempting to run gluster over a VPN between data centers). I believe the chunk size should be tied to the TCP window size. I have raised the 4kb size to 131072 in my environment to get things to work a bit better (however, without aggregation of small files, there is still an unnecessary amount of TCP overhead, which causes small files to be replicated really slowly). See the back-of-the-envelope sketch after this list.

3) AFR only lists files that exist on the first brick listed in the AFR configuration. This can lead to really awkward situations where a file doesn't exist on the first brick but does on subsequent bricks. Now, it has been explained to me that this occurs because AFR does not require a metadata server. In fact, this was one of the draws of gluster for me (not having to find some way to make the metadata server highly available). I did not understand (from any of the documentation available) that it's not that gluster doesn't *require* a metadata server, it's that it doesn't solve the namespace problem at all.

4) AFR does not work reliably above unify or DHT. It crashes a lot. Now, I can understand that gluster was not designed to operate in this fashion; however, I cannot think of any other way to put live data into a gluster file system. (Read this as: it would not be my final config, but without real-time replication of data into my "proper" config, I would need to turn off live servers for days if not weeks to move the data around by hand. And if I were to move data around by hand, why would I need a replicated file system?)
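To put numbers on point 2, here is a minimal back-of-the-envelope sketch. It assumes each chunk effectively waits one round trip before the next is sent (stop-and-wait), and assumes a 40 ms round trip on the inter-DC VPN; the RTT is my guess for illustration, not a measurement:

    # Stop-and-wait throughput is capped at chunk_size / RTT.
    rtt = 0.040  # seconds; assumed round trip for a VPN between data centers
    for chunk in (4096, 131072):
        print(f"{chunk:>7} B chunks -> at most {chunk / rtt / 1e6:.2f} MB/s")
    #   4096 B chunks -> at most 0.10 MB/s  (under 1 MBps, matching what I see)
    # 131072 B chunks -> at most 3.28 MB/s

The exact RTT doesn't matter much; the point is that throughput scales with chunk size when every chunk is serialized behind an acknowledgement.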
If it is the case that gluster is not designed to solve these problems, perhaps that should be stated in the documentation somewhere, rather than instructions on how to do it (perhaps this is already the case with the 1.4 documentation?). Preferably, we could just fix the problems that make these things impossible.

Now, it's really naive of me to even attempt a design of a working system, but if I were to try, I would break AFR into three code paths:

1) WRITE
On write, files are written to all available bricks. Writes to bricks that are not available are queued until those bricks become available again.

2) READ
On read, lookups happen on all bricks. If a file doesn't exist on a particular brick, it is added to the queue for replication. The file is returned from a valid brick. (This is complicated by a server being unavailable when a delete occurs: if the server comes back up after the deletion and that file is requested, the file would be replicated again.) This would, of course, not scale linearly with the addition of bricks.

3) REPLICATION
Process the queue. I don't know where this queue should exist, but replication ought to occur in its own thread/process, independent of read/write. Somewhere into this could be added code to "balance" files across bricks (should only a certain number of bricks be required to hold a file; for example: 5 bricks, but only two bricks require the file). A sketch of what I mean follows below.

</rant>
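Here is a minimal sketch of those three paths, just to make the proposal concrete. Everything in it (the class names, the in-memory queue, the dict-backed bricks) is hypothetical illustration on my part, not code from the gluster tree:

    import queue

    class Brick:
        def __init__(self, name):
            self.name = name
            self.online = True
            self.files = {}  # path -> data, standing in for a posix backend

    class ReplicatedVolume:
        def __init__(self, bricks):
            self.bricks = bricks
            self.pending = queue.Queue()  # (path, brick-to-heal) pairs

        def write(self, path, data):
            # 1) WRITE: send to every brick; queue the ones that are down.
            for b in self.bricks:
                if b.online:
                    b.files[path] = data
                else:
                    self.pending.put((path, b))

        def read(self, path):
            # 2) READ: look up on all bricks, queue any online brick that
            # is missing the file, and return data from a valid copy.
            holders = [b for b in self.bricks if b.online and path in b.files]
            if not holders:
                raise FileNotFoundError(path)
            for b in self.bricks:
                if b.online and path not in b.files:
                    self.pending.put((path, b))
            return holders[0].files[path]

        def replicate(self):
            # 3) REPLICATION: drain the queue, independent of read/write.
            # In the real design this would run in its own thread/process.
            retry = []
            while not self.pending.empty():
                path, target = self.pending.get()
                source = next((b for b in self.bricks
                               if b.online and path in b.files), None)
                if source is None or not target.online:
                    retry.append((path, target))  # try again on the next pass
                    continue
                target.files[path] = source.files[path]
            for item in retry:
                self.pending.put(item)

    # Example: brick2 is down during a write, comes back, and gets healed.
    b1, b2 = Brick("brick1"), Brick("brick2")
    vol = ReplicatedVolume([b1, b2])
    b2.online = False
    vol.write("/files/a", b"hello")
    b2.online = True
    vol.replicate()
    assert b2.files["/files/a"] == b"hello"

The deliberate cost in this design is that the READ path's lookup-on-all-bricks does not scale linearly, as noted above; the gain is that healing never blocks a file operation.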
Is there an automated build process for arch somewhere? If not, I would be willing to build one for the project, so that developers would be warned of build errors like the ones that were introduced and fixed for FreeBSD recently. It would be a convenient place to add unit tests as well.

Christopher Owen.

> Date: Mon, 5 Jan 2009 02:30:29 -0800
> From: ab at zresearch.com
> To: swankier at msn.com
> CC: list-hacking at zresearch.com; gluster-users at gluster.org; Gluster-devel at nongnu.org
> Subject: Re: [List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.
>
> Christopher, the main issue with self-heal is its complexity. Handling self-healing
> logic in a non-blocking asynchronous code path is difficult. Replicating a missing
> file sounds simple, but holding off a lookup call, initiating a new series of calls
> to heal the file, and then resuming normal operation is tricky. Many of the
> bugs we faced in 1.3 were related to self-heal. We have handled most of these cases
> over a period of time. Self-healing is decent now, but not good enough. We feel that
> it has only complicated the code base. It is hard to test and maintain this part of
> the code base.
>
> The plan is to drop the self-heal code altogether once the active healing tool is ready.
> Unlike self-healing, this active healing can be run by the user on a mounted file system
> (online) at any time. By moving the code out of the file system into a tool (which is
> synchronous and linear), we can implement sophisticated healing techniques.
>
> The code is not in the repository yet. Hopefully in a month it will be ready for use.
> You can simply turn off self-heal and run this utility while the file system is mounted.
>
> List-hacking is an internal company list, mostly junk :). We don't discuss technical /
> architectural stuff there. That is mostly done over the phone and in in-person meetings.
> We do want to actively involve the community right from the design phase. A mailing list
> is cumbersome and slow for interactively brainstorming design discussions. We can once
> in a while organize IRC sessions for this purpose.
>
> --
> Anand Babu
>
> Swank iest wrote:
>> Well,
>>
>> I guess this is getting outside of the bug. I suppose you are going to
>> mark it as not going to fix?
>>
>> I'm trying to put gluster into production right now, so may I ask:
>>
>> 1) What are the current issues with self-heal that require a full
>> re-write? Is there a place in the Wiki or elsewhere where this is being
>> documented?
>> 2) May I see the new code? I must not be looking in the correct place
>> in TLA?
>> 3) If it's not written yet, may I be included in the design discussion?
>> (As I haven't put gluster into production yet, now would be a good time
>> to know if it's not going to work in the near future.)
>> 4) May I be placed on the list-hacking at zresearch.com mailing list, please?
>>
>> Christopher.
>>
>> > Date: Mon, 5 Jan 2009 01:36:14 -0800
>> > From: ab at zresearch.com
>> > To: krishna at zresearch.com
>> > CC: swankier at msn.com; list-hacking at zresearch.com
>> > Subject: Re: [List-hacking] [bug #25207] an rm of a file should not
>> > cause that file to be replicated with afr self-heal.
>> >
>> > Krishna, leave it as is. Once self-heal ensures that the volumes are
>> > intact, rm will remove both copies anyway. It is inefficient, but
>> > optimizing it in the current framework will be hacky.
>> >
>> > Swankier, we are replacing the current self-healing framework with an
>> > active healing tool. We can take care of it then.
>> >
>> >
>> > Krishna Srinivas wrote:
>> >> The current self-heal logic is built into the lookup of a file, and a
>> >> lookup is issued just before any file operation on a file, so the
>> >> lookup call does not know whether an open or an rm is going to be done
>> >> on the file. Will get back to you if we can do anything about this,
>> >> i.e. to save the redundant copy of the file when it is going to be rm'ed.
>> >>
>> >> Krishna
>> >>
>> >> On Mon, Jan 5, 2009 at 12:19 PM, swankier <INVALID.NOREPLY at gnu.org> wrote:
>> >>> Follow-up Comment #2, bug #25207 (project gluster):
>> >>>
>> >>> What I am doing:
>> >>>
>> >>> 1) delete the file from the posix file system beneath afr on one side
>> >>> 2) run rm on the gluster file system
>> >>>
>> >>> The file is then replicated, followed by the deletion.
>> >>>
>> >>> _______________________________________________________
>> >>>
>> >>> Reply to this item at:
>> >>>
>> >>> <http://savannah.nongnu.org/bugs/?25207>
>> >
>> > --
>> > Anand Babu Periasamy
>> > GPG Key ID: 0x62E15A31
>> > Blog [http://ab.freeshell.org]
>> > GlusterFS [http://www.gluster.org]
>> > The GNU Operating System [http://www.gnu.org]
>
> --
> Anand Babu Periasamy
> GPG Key ID: 0x62E15A31
> Blog [http://ab.freeshell.org]
> GlusterFS [http://www.gluster.org]
> The GNU Operating System [http://www.gnu.org]
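P.S. If the plan, once the active-healing tool ships, is to turn self-heal off in the volfile and run the tool against the mounted file system, I assume the client-side change would look roughly like the following. The option names are my guess at the 1.4 replicate/afr translator's toggles, so treat this as a sketch to verify against the actual source, not a tested configuration:

    volume replicated
      type cluster/afr
      # Guessed option names -- verify against the 1.4 translator before use.
      option data-self-heal off
      option metadata-self-heal off
      option entry-self-heal off
      subvolumes brick1 brick2
    end-volume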