Fwd: [Gluster-devel] proposals to afr

Thu Oct 25 18:06:49 UTC 2007

> > My thought above was a simple flag as to whether or not the file was
> > being written just to denote whether it should be considered in a
> > consistent state if a crash happens.

This will not work because setting the flag before write() and resetting
it after write() will be expensive in terms of CPU cycles (debatable).

Also, suppose we have set the flag before write(). Another client that sees
the flag set on open() would take the wrong action (returning EIO), which is
not correct behavior.

On 10/25/07, Krishna Srinivas <krishna at zresearch.com> wrote:
> Kevan, Alexey,
>
> Good points. I did not go through each line, so please let me know if
> something is still left unclear after this mail.
>
> Whenever we increment versions or compare versions,
> we do it while holding locks, so the race condition that you
> mentioned regarding versions is taken care of.
>
> There is another good point mentioned by you:
> > Imagine the following order of operations:
> > On subvolume A
> > First op: request(1).write("this is a request from program a") &&
> > opCounterIncrement
> > Second op: request(2).write("this is a request from program b") &&
> > opCounterIncrement
> > crash
> >
> > On subvolume B
> > First op: request(2).write("this is a request from program b") &&
> > opCounterIncrement
> > Second op: request(1).write("this is a request from program a") &&
> > opCounterIncrement
> > crash
>
> This can happen when AFR is loaded on the client side. During
> implementation we had to choose whether to lock on every write
> or leave it to the application's responsibility in case it is really worried about
> data consistency between its children. As locking for every write would
> be *very* inefficient, we decided not to lock for writes. If AFR is loaded
> on the server, this problem will not arise, as all the clients will have to
> go through the same point of contact (i.e. the server-side AFR).
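A minimal sketch of why a single point of contact avoids the reordering problem. It is not GlusterFS code: the class `SerializingReplicator`, its in-memory "children" and the use of one lock are all assumptions made for illustration. The point is only that when every write passes through one ordering point, all children see the writes in the same order.

```python
import threading

class SerializingReplicator:
    """Hypothetical single point of contact: one lock orders all writes."""

    def __init__(self, n_children):
        # Each child is modelled as a list recording the order of writes it saw.
        self.children = [[] for _ in range(n_children)]
        self.lock = threading.Lock()

    def write(self, payload):
        # While the lock is held, no other write can interleave between children.
        with self.lock:
            for child in self.children:
                child.append(payload)

rep = SerializingReplicator(2)
threads = [threading.Thread(target=rep.write, args=(f"req{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every child applied the eight requests in the same order.
same_order = rep.children[0] == rep.children[1]
print(same_order)
```

The cost is the one named above: every write is serialized, which is why taking such a lock on each write from the client side was judged too expensive.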
>
> > This whole conversation's gotten into somewhat esoteric territory that
> > > requires more input from the GlusterFS team on whether it's even worth
> > > considering doing stuff this way.  Maybe they have a better solution in
> > > the works?  Any team members care to comment on their thoughts on this?
> >
> >
> > > they are waiting for us to forget this problem ;)
>
> No. If we do that, it will work against glusterfs and us and users.
> Please follow it up if the developers have missed answering any
> of your questions.
>
> To handle the case where data is written to the 1st child and not
> the 2nd child but the versions still remain the same, we can do further
> checks during open, like:
> do they have the same file size?
> do their mtime stamps differ by more than a certain number of seconds?
> <anything more can be added here?>
> ...in which case the open can return EIO and leave it to the discretion
> of the administrator to fix it up.
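The open-time checks could look roughly like the sketch below. It is only an illustration on local files: the function name `check_replicas_on_open` and the 5-second tolerance are assumptions (the mail does not name a threshold), and a real AFR translator would compare its children's attributes rather than call `os.stat` on two paths.

```python
import errno
import os
import tempfile

MTIME_TOLERANCE_SECONDS = 5  # assumed value; the thread does not specify one

def check_replicas_on_open(path_a, path_b, tolerance=MTIME_TOLERANCE_SECONDS):
    """Return 0 if the two copies look consistent, errno.EIO otherwise."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    if st_a.st_size != st_b.st_size:                      # different file size
        return errno.EIO
    if abs(st_a.st_mtime - st_b.st_mtime) > tolerance:    # mtimes too far apart
        return errno.EIO
    return 0

with tempfile.TemporaryDirectory() as d:
    first, second = os.path.join(d, "child1"), os.path.join(d, "child2")
    for path, content in ((first, b"12345"), (second, b"123")):
        with open(path, "wb") as f:
            f.write(content)
    size_mismatch = check_replicas_on_open(first, second)

    with open(second, "wb") as f:
        f.write(b"12345")
    os.utime(first, (1000, 1000))
    os.utime(second, (1000, 1100))   # same size, mtime 100 seconds apart
    mtime_mismatch = check_replicas_on_open(first, second)

    os.utime(second, (1000, 1002))   # within the tolerance
    consistent = check_replicas_on_open(first, second)
```

Neither check catches two copies of equal size written at almost the same time with different contents, so this narrows the window rather than closing it.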
>
> Thanks for inputs,
> Krishna
>
> On 10/25/07, Alexey Filin <alexey.filin at gmail.com> wrote:
> > On 10/24/07, Kevan Benson <kbenson at a-1networks.com> wrote:
> > >
> > > Alexey Filin wrote:
> > > > On 10/23/07, Kevan Benson <kbenson at a-1networks.com> wrote:
> > > >
> > > >> Actually, I just thought of a major problem with this.  I think the
> > > >> extended attributes need to be set as atomic operations.  Imagine the
> > > >> case where two processes are writing the file at the same time, the op
> > > >> counters could get very messed up.
> > > >>
> > > >
> > > >
> > > > > atomic operations are an ideal that is sometimes not possible in practice;
> > > > > ideal hardware exists in the mind only. Developers always choose a compromise
> > > > > between complexity, performance, reliability, flexibility etc. on existing
> > > > > hardware.
> > > >
> > > > > to keep the operation counter (or the version, if it is updated after each
> > > > > operation) consistent, concurrent access to the same file has to be done:
> > > > >
> > > > > * with one thread (so that concurrent operations on _one_ file are
> > > > > serviced by _one_ thread only), which can provide atomicity with explicit
> > > > > queuing
> > > > > * or with sync primitive(s) for many threads.
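A small sketch of the first option (one thread per file with explicit queuing). The `FileWorker` class and its in-memory "file" are hypothetical. Because only the worker thread ever touches the counter, the increment needs no lock of its own.

```python
import queue
import threading

class FileWorker:
    """Hypothetical per-file worker: one thread applies queued writes in order."""

    def __init__(self):
        self.requests = queue.Queue()
        self.data = []          # stands in for the file contents
        self.op_counter = 0     # stands in for the extended attribute
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            payload = self.requests.get()
            if payload is None:       # sentinel: stop the worker
                break
            self.data.append(payload)
            self.op_counter += 1      # only this thread updates the counter

    def submit(self, payload):
        self.requests.put(payload)    # Queue.put is thread-safe

    def close(self):
        self.requests.put(None)
        self.thread.join()

worker = FileWorker()
clients = [threading.Thread(target=worker.submit, args=(i,)) for i in range(100)]
for c in clients:
    c.start()
for c in clients:
    c.join()
worker.close()
print(worker.op_counter)
```

The second option (sync primitives for many threads) would replace the queue with a lock around the write and the counter update.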
> > > >
> > > > > io threads help to decrease latencies when many clients use the same brick
> > > > > (as e.g. a glfs doc says) or to overlap network/disk io to increase
> > > > > performance per client (is it implemented in glfs?)
> > > >
> > >
> > > It depends on the context.  Really, an atomic operation just means that
> > > nothing else can interrupt the action while it's doing this one thing.
> > > In the case of glusterfs, it could easily be achieved with a short
> > > thread lock if it isn't already (I suspect it probably is).  I'm
> > > referring to atomic in the sense that there's only one extended
> > > attribute, not multiple across threads.  If two separate threads
> > > (serving two separate requests) are acting on the same file, the op
> > > counter as you defined it (an extended attribute) could itself become
> > > inconsistent between different AFR subvolumes, depending on the order
> > > the write requests are processed.
> > >
> > > Imagine the following order of operations:
> > > On subvolume A
> > > First op: request(1).write("this is a request from program a") &&
> > > opCounterIncrement
> > > Second op: request(2).write("this is a request from program b") &&
> > > opCounterIncrement
> > > crash
> > >
> > > On subvolume B
> > > First op: request(2).write("this is a request from program b") &&
> > > opCounterIncrement
> > > Second op: request(1).write("this is a request from program a") &&
> > > opCounterIncrement
> > > crash
> > >
> > > At this point the opcounters would be the same, the trusted_afr_version
> > > the same, the data different, and no self-heal would be triggered.
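The scenario above can be reproduced with a toy model. This is a sketch under stated assumptions, not GlusterFS behaviour: the `Replica` class is hypothetical, and both requests are assumed to write at offset 0 so that the later write overwrites the earlier one.

```python
class Replica:
    """Hypothetical subvolume: a byte buffer plus an operation counter."""

    def __init__(self):
        self.data = b""
        self.op_counter = 0

    def write(self, payload):
        # Assumption: every write starts at offset 0, so the later write
        # overwrites the overlapping bytes of the earlier one.
        self.data = payload + self.data[len(payload):]
        self.op_counter += 1

req1 = b"this is a request from program a"
req2 = b"this is a request from program b"

sub_a, sub_b = Replica(), Replica()
sub_a.write(req1)
sub_a.write(req2)   # subvolume A: request 1, then request 2
sub_b.write(req2)
sub_b.write(req1)   # subvolume B: request 2, then request 1

counters_match = sub_a.op_counter == sub_b.op_counter
data_match = sub_a.data == sub_b.data
print(counters_match, data_match)
```

The counters agree while the contents differ, so a comparison based on counters alone sees nothing to heal.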
> > >
> > > Now, I'm not familiar enough with the internals of GlusterFS to tell you
> > > whether what I outlined above is even possible, but it is a race
> > > condition I can see causing problems unless files are implicitly locked
> > > by AFR writes.  I'm not sure.
> >
> >
> > order of operations can be guaranteed on ideal hardware and software only,
> > for example:
> >
> > * operations over a file and its extended attributes can be considered
> > independent by the backend FS and reordered (I don't know whether this is
> > true on all FSs),
> > * the disk driver can reorder operations over different files,
> > * the hard disk controller caches and reorders operations to optimize disk
> > head movement
> >
> > io synchronization has too many unknown variables (in the event of a crash) to be
> > discussed here without the glfs devs, as you said below. Moreover, a 100% absence of
> > sync problems after a crash could be guaranteed by the glfs devs only if they provided
> > us disks with firmware and drivers tuned for glfs; that problem can't be
> > fixed completely even by giants like Oracle as long as they don't sell hardware
> > with transactional io.
> >
> > The problem being discussed in this thread is simple and common enough that it
> > could be considered and fixed by the glfs team (I don't insist that it be fixed
> > immediately).
> >
> > > > >> Another solution comes to mind.  Just set another extended attribute
> > > > >> denoting that the file is being written to currently (and unset it
> > > > >> afterwards).  If the AFR subvolume notices that the file is listed as
> > > > >> being written to but no clients have it open (I hope this is easily
> > > > >> determinable) a flag is returned for the file.  If all subvolumes return
> > > > >> this flag for the file in the AFR (and all the trusted_afr_versions are
> > > > >> the same), choose one version of the file (for example from the first
> > > > >> AFR subvolume) as the legit copy and copy it to the other AFR nodes.  It
> > > > >> doesn't matter which version is the most up to date, they will all be
> > > > >> fairly close, and since this is from a failed write operation there was
> > > > >> no guarantee the file was in a valid state after the write.  It
> > > > >> doesn't matter which copy you get, as long as it's consistent across AFR
> > > > >> members.
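The proposed recovery rule can be sketched as follows. All names here are hypothetical (`Copy`, `heal_after_crash`, the `being_written` field); `version` merely stands in for trusted_afr_version, and clearing the flag after the copy is an assumption the proposal does not spell out.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    """Hypothetical view of one subvolume's copy of a file."""
    data: bytes
    version: int            # stands in for trusted_afr_version
    being_written: bool     # the proposed "write in progress" attribute
    open_clients: int = 0

def heal_after_crash(copies):
    """Make all copies identical if every one carries a stale write flag."""
    stale_flag = all(c.being_written and c.open_clients == 0 for c in copies)
    same_version = len({c.version for c in copies}) == 1
    if not (stale_flag and same_version):
        return False
    source = copies[0]      # arbitrary choice: any copy is as valid as another
    for c in copies[1:]:
        c.data = source.data
    for c in copies:
        c.being_written = False   # assumption: the flag is cleared after healing
    return True

crashed = [Copy(b"program b", 7, True), Copy(b"program a", 7, True)]
healed = heal_after_crash(crashed)

in_use = [Copy(b"x", 7, True, open_clients=1), Copy(b"y", 7, True)]
skipped = heal_after_crash(in_use)
print(healed, skipped)
```

The second call shows the guard: while a client still has the file open, the flag is not stale and nothing is copied.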
> > > >>
> > > >
> > > >
> > > > > I like it more than the op counter. Its advantage over the op counter is
> > > > > that the flag is set only two times (open()/close()), so the overhead is
> > > > > minimal (concurrent access to the flag has to be synchronized). The
> > > > > disadvantage is that if a file that was not closed is big enough, it
> > > > > sometimes has to be copied when that is not required; this is
> > > > > acceptable if afr crashes are rare.
> > > >
> > >
> > > Wait, I assumed by operation you meant every specific write to the file,
> > > so this opcounter could be incremented quite a bit, but you just stated
> > > it would only be set once as a flag, so maybe I'm misunderstanding you.
> > > If it's incremented per actual file operation, quite a lot of increments
> > > might happen.  For example, using wget to save a remote file to disk
> > > doesn't write everything at once, it does many writes as its buffer
> > > fills with enough information to be worth writing to disk.
> > >
> > > My thought above was a simple flag as to whether or not the file was
> > > being written just to denote whether it should be considered in a
> > > consistent state if a crash happens.
> >
> >
> > I understood you correctly. I proposed incrementing the op counter for each
> > operation; your flag is set only twice per io "session" (being
> > started/finished with open()/close()). Op counter increments fix the same
> > problem the flag does, but the disk io overhead with the op counter is much higher
> > than with the flag. The overhead is disk only; the network is not required,
> > because set/unset is done by each slave independently during open()/close()
> > (I believe you think so).
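To put rough numbers on that comparison, here is a worked count for one io session. The file size and buffer size are invented for illustration, and the function `attribute_updates` is hypothetical; it only counts attribute updates per slave and ignores what each update costs on disk.

```python
import math

def attribute_updates(file_size, buffer_size):
    """Count extended-attribute updates for one open()/write()*/close() session."""
    writes = math.ceil(file_size / buffer_size)   # one write() per filled buffer
    return {
        "op_counter": writes,   # one increment per write()
        "flag": 2,              # set on open(), unset on close()
    }

# Assumed example: a 10 MiB download written in 8 KiB chunks.
updates = attribute_updates(file_size=10 * 1024 * 1024, buffer_size=8 * 1024)
print(updates)
```

The op counter's cost grows with the number of writes, while the flag's cost stays constant per session.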
> >
> > This whole conversation's gotten into somewhat esoteric territory that
> > > requires more input from the GlusterFS team on whether it's even worth
> > > considering doing stuff this way.  Maybe they have a better solution in
> > > the works?  Any team members care to comment on their thoughts on this?
> >
> >
> > they are waiting for us to forget this problem ;)
> >
> > --
> > >
> > > -Kevan Benson
> > > -A-1 Networks
> > >
> >
> > Regards, Alexey.
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
>