[Gluster-devel] Wrong behavior on fsync of md-cache ?

Raghavendra Gowdappa rgowdapp at redhat.com
Tue Nov 25 11:59:23 UTC 2014



----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Emmanuel Dreyfus" <manu at netbsd.org>
> Sent: Tuesday, November 25, 2014 2:05:25 PM
> Subject: Re: Wrong behavior on fsync of md-cache ?
> 
> On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
> > ----- Original Message -----
> >> From: "Xavier Hernandez" <xhernandez at datalab.es>
> >> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> >> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Emmanuel Dreyfus"
> >> <manu at netbsd.org>
> >> Sent: Tuesday, November 25, 2014 12:49:03 AM
> >> Subject: Re: Wrong behavior on fsync of md-cache ?
> >>
> >> I think the problem is here: the first thing wb_fsync()
> >> checks is whether there's an error in the fd (wb_fd_err()). If that's
> >> the case, the call is immediately unwound with that error. The error
> >> seems to be set in wb_fulfill_cbk(). I don't know the internals of the
> >> write-behind xlator, but this seems to be the problem.
> >
> > Yes, your analysis is correct. Once the error is hit, fsync is not
> > queued behind the unfulfilled writes. Whether this can be considered a
> > bug is debatable. Since there is already an error in one of the writes
> > that was written behind, fsync should return that error. I am not sure
> > whether it should wait until we have tried to flush _all_ the writes
> > that were written behind. Any suggestions on what the expected
> > behaviour is here?
> >
> 
> I think it should wait for all pending writes. In the test case I
> used, all pending writes fail in the same way as the first one, but
> in other situations it's possible to have one write fail (for example
> due to a damaged block on disk) while the following writes succeed.
> 
> From the man page of fsync(2):
> 
>      fsync() transfers ("flushes") all modified in-core data of (i.e.,
>      modified buffer cache pages for) the file referred to by the file
>      descriptor fd to the disk device (or other permanent storage
>      device) so that all changed information can be retrieved even after
>      the system crashed or was rebooted. This includes writing through
>      or flushing a disk cache if present. The call blocks until the
>      device reports that the transfer has completed. It also flushes
>      metadata information associated with the file (see stat(2)).
> 
> As I understand it, when fsync is received all queued writes must be
> sent to the device, regardless of whether a previous write has failed.
> It also says that the call blocks until the device has finished all
> the operations.
> 
> However, it's not clear to me how to control file consistency, because
> this allows some writes to succeed after a failed one.

Though fsync doesn't wait on queued writes after a failure, the queued writes are flushed to disk even in the existing codebase. Can you file a bug to make fsync wait for the completion of queued writes, irrespective of whether flushing any of them failed? I'll send a patch to fix the issue. Just to prioritise this: how important is the fix?
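Roughly, the change would look like this. This is a simplified,
self-contained sketch of the two behaviours, not the actual xlator
code; the struct and helper names below are made up for illustration
(the real implementation works with STACK_WIND/STACK_UNWIND and per-fd
request queues):

#include <errno.h>
#include <stdio.h>

struct wb_fd_ctx {
    int err;            /* sticky error set when a background write fails */
    int pending_writes; /* writes still queued in write-behind */
};

/* Current behaviour: unwind immediately with the sticky error. The
 * queued writes are still flushed later, but fsync does not wait
 * for them. */
static int fsync_current(struct wb_fd_ctx *ctx)
{
    if (ctx->err)
        return -ctx->err;   /* early unwind: pending writes not waited on */
    /* ... otherwise queue the fsync behind the pending writes ... */
    return 0;
}

/* Proposed behaviour: always wait for the queue to drain, then
 * report the sticky error (if any) to the application. */
static int fsync_proposed(struct wb_fd_ctx *ctx)
{
    while (ctx->pending_writes > 0) {
        /* flush one queued write; a failure would set ctx->err but
         * must not stop the remaining writes from being attempted */
        ctx->pending_writes--;
    }
    return ctx->err ? -ctx->err : 0;
}

int main(void)
{
    struct wb_fd_ctx ctx = { .err = EIO, .pending_writes = 3 };

    printf("current : ret=%d, pending left=%d\n",
           fsync_current(&ctx), ctx.pending_writes);
    printf("proposed: ret=%d, pending left=%d\n",
           fsync_proposed(&ctx), ctx.pending_writes);
    return 0;
}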

> I assume that
> controlling this is the responsibility of the calling application,
> which should issue fsyncs at critical points to guarantee consistency.
> 
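That matches my understanding: write() returning success only means the
data reached the cache (or the write-behind queue), and fsync is the
consistency point where a deferred write error gets reported. For
reference, a minimal POSIX example of that pattern (nothing
Gluster-specific; the file name below is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* A successful write() only means the data reached the cache or a
     * write-behind queue; the I/O error may surface later. */
    if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
        perror("write");
        return 1;
    }

    /* fsync() is the consistency point: a failure here means the data
     * may not be on stable storage, and the application must handle it. */
    if (fsync(fd) < 0) {
        perror("fsync");
        return 1;
    }

    close(fd);
    return 0;
}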
> Anyway, it seems that there's a difference between Linux and NetBSD,
> because this test only fails on NetBSD. Is it possible that Linux's
> FUSE implementation delays the fsync request until all pending writes
> have been answered? That would explain why this problem has not
> manifested until now. NetBSD seems to send fsync (probably as the
> first step of a close() call) when the first write fails.
> 
> Xavi
> 

