[Gluster-devel] Wrong behavior on fsync of md-cache ?

Tue Nov 25 08:35:25 UTC 2014

On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
> ----- Original Message -----
>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Emmanuel Dreyfus" <manu at netbsd.org>
>> Sent: Tuesday, November 25, 2014 12:49:03 AM
>> Subject: Re: Wrong behavior on fsync of md-cache ?
>>
>> I think the problem is here: the first thing wb_fsync()
>> checks is if there's an error in the fd (wb_fd_err()). If that's the
>> case, the call is immediately unwound with that error. The error seems
>> to be set in wb_fulfill_cbk(). I don't know the internals of the
>> write-behind xlator, but this seems to be the problem.
>
> Yes, your analysis is correct. Once the error is hit, fsync is not
> queued behind unfulfilled writes. Whether this can be considered a bug
> is debatable. Since there is already an error in one of the writes that
> was written behind, fsync should return the error. I am not sure whether
> it should wait until we try to flush _all_ the writes that were written
> behind. Any suggestions on what the expected behaviour is here?
>

I think that it should wait for all pending writes. In the test case I
used, all pending writes will fail the same way as the first one, but
in other situations it's possible for one write to fail (for example
due to a damaged block on disk) and for the following writes to succeed.

From the man page of fsync:

     fsync() transfers ("flushes") all modified in-core data of (i.e.,
     modified buffer cache pages for) the file referred to by the file
     descriptor fd to the disk device (or other permanent storage
     device) so that all changed information can be retrieved even after
     the system crashed or was rebooted. This includes writing through
     or flushing a disk cache if present. The call blocks until the
     device reports that the transfer has completed. It also flushes
     metadata information associated with the file (see stat(2)).

As I understand it, when fsync is received all queued writes must be
sent to the device (regardless of whether a previous write has failed).
It also says that the call blocks until the device has finished all the
operations.
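
As a rough sketch of that reading (plain C, not the write-behind code;
the struct and function names are made up for illustration, and the
backend is simulated by a precomputed result per write), fsync would
attempt every queued write and report the first error it saw:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical pending write: 'result' is what the backend would
 * return for it (0 on success, a positive errno value on failure). */
struct pending_write {
    int result;
    int attempted;
};

/* Hypothetical fsync over a queue of deferred writes. Every write is
 * attempted, even after an earlier one has failed, and the first error
 * seen is what the caller gets back (0 if all writes succeeded). */
int sketch_fsync(struct pending_write *queue, size_t count)
{
    int first_error = 0;

    for (size_t i = 0; i < count; i++) {
        queue[i].attempted = 1;            /* do not stop at a failure */
        if (queue[i].result != 0 && first_error == 0)
            first_error = queue[i].result; /* remember only the first */
    }
    return first_error;
}
```

With a queue of { ok, EIO, ok }, this returns EIO but still attempts the
third write, which is the case of a single damaged block described above.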

However, it's not clear to me how to control file consistency, because
this allows some writes to succeed after a failed one. I assume that
controlling this is the responsibility of the calling application, which
should issue fsyncs at critical points to guarantee consistency.
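
On the application side, that could look like the following sketch. It
uses only POSIX write() and fsync(); write_and_sync() is a hypothetical
helper name, and what the caller does after a failure is left open:

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, then fsync. Returns 0 on success, or -1 with
 * errno set by the failing call. */
int write_and_sync(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry the write */
            return -1;
        }
        buf += n;                  /* handle short writes */
        len -= (size_t)n;
    }
    /* A write that returned success may only have been cached; an error
     * in flushing it can surface here, so this result must be checked. */
    return fsync(fd) == 0 ? 0 : -1;
}
```

A caller would treat a -1 from this helper as "the data up to this point
may not be on disk" and decide for itself how to recover.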

Anyway, it seems that there's a difference between Linux and NetBSD,
because this test only fails on NetBSD. Is it possible that Linux's FUSE
implementation delays the fsync request until all pending writes have
been answered? This would explain why this problem has not manifested
until now. NetBSD seems to send fsync (probably as the first step of a
close() call) when the first write fails.

Xavi