[Gluster-devel] Wrong behavior on fsync of md-cache ?

Xavier Hernandez xhernandez at datalab.es
Tue Nov 25 12:42:21 UTC 2014


On 11/25/2014 12:59 PM, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Emmanuel Dreyfus" <manu at netbsd.org>
>> Sent: Tuesday, November 25, 2014 2:05:25 PM
>> Subject: Re: Wrong behavior on fsync of md-cache ?
>>
>> On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote:
>>> ----- Original Message -----
>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Emmanuel Dreyfus"
>>>> <manu at netbsd.org>
>>>> Sent: Tuesday, November 25, 2014 12:49:03 AM
>>>> Subject: Re: Wrong behavior on fsync of md-cache ?
>>>>
>>>> I think the problem is here: the first thing wb_fsync()
>>>> checks is whether there is an error stored in the fd (wb_fd_err()).
>>>> If there is, the call is immediately unwound with that error. The
>>>> error seems to be set in wb_fulfill_cbk(). I don't know the
>>>> internals of the write-behind xlator, but this seems to be the
>>>> problem.
>>>
>>> Yes, your analysis is correct. Once the error is hit, fsync is not
>>> queued behind unfulfilled writes. Whether this can be considered a
>>> bug is debatable. Since there is already an error in one of the
>>> writes that was written behind, fsync should return the error. I am
>>> not sure whether it should wait until we have tried to flush _all_
>>> the writes that were written behind. Any suggestions on what the
>>> expected behaviour is here?
>>>
>>
>> I think it should wait for all pending writes. In the test case I
>> used, all pending writes fail the same way as the first one, but in
>> other situations it's possible for one write to fail (for example,
>> due to a damaged block on disk) while subsequent writes succeed.
>>
>>   From the man page of fsync:
>>
>>       fsync() transfers ("flushes") all modified in-core data of (i.e.,
>>       modified buffer cache pages for) the file referred to by the file
>>       descriptor fd to the disk device (or other permanent storage
>>       device) so that all changed information can be retrieved even after
>>       the system crashed or was rebooted. This includes writing through
>>       or flushing a disk cache if present. The call blocks until the
>>       device reports that the transfer has completed. It also flushes
>>       metadata information associated with the file (see stat(2)).
>>
>> As I understand it, when fsync is received, all queued writes must be
>> sent to the device (regardless of whether a previous write failed).
>> The man page also says that the call blocks until the device has
>> finished all the operations.
>>
>> However, it's not clear to me how to guarantee file consistency,
>> because this allows some writes to succeed after a failed one.
>
> Though fsync doesn't wait on queued writes after a failure, the queued writes are still flushed to disk even in the existing codebase. Can you file a bug to make fsync wait for the completion of queued writes, irrespective of whether flushing any of them failed? I'll send a patch to fix the issue.
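[Editor's note: the fix proposed here can be sketched as follows: instead of unwinding on the first recorded error, fsync processes every queued write and only then reports the first failure, if any. This is an illustrative model, not the actual patch.]

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical queued write whose err field holds the result of
 * flushing it to disk (0 on success, an errno value on failure). */
struct queued_write {
    int err;
};

/* Proposed behaviour: every queued write is flushed, the first error
 * seen is remembered, and it is returned (as a negative errno) only
 * after the whole queue has been processed. */
static int model_fsync_proposed(const struct queued_write *queue, size_t n)
{
    int first_err = 0;
    for (size_t i = 0; i < n; i++) {
        /* later writes are still attempted after an earlier failure */
        if (queue[i].err != 0 && first_err == 0)
            first_err = queue[i].err;
    }
    return first_err != 0 ? -first_err : 0;
}
```

This matches the fsync(2) wording quoted earlier in the thread: all modified in-core data is transferred, and the call blocks until the device reports that the transfer has completed, with the error (if any) reported at the end.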

I filed bug #1167793.

> Just to prioritise this, how important is the fix?

It seems to fail only on NetBSD. I'm not sure what priority it should
have. Emmanuel is trying to create a regression test for new patches
that runs all the tests in tests/basic, and tests/basic/ec/quota.t hits
this issue.

An alternative would be to temporarily remove or change this test to 
avoid the problem.

Xavi
