[Gluster-devel] Handling Failed flushes in write-behind

Raghavendra Gowdappa rgowdapp at redhat.com
Mon Oct 5 06:07:00 UTC 2015



----- Original Message -----
> From: "Prashanth Pai" <ppai at redhat.com>
> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Thiago da Silva" <thiago at redhat.com>
> Sent: Wednesday, September 30, 2015 11:38:38 AM
> Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind
> 
> > > > As far as 2 goes, the application can checkpoint by doing fsync
> > > > and, on write failures, roll back to the last checkpoint and
> > > > replay writes from that checkpoint. Or, glusterfs can retry the
> > > > writes on behalf of the application. However, glusterfs retrying
> > > > writes cannot be a complete solution, as the error condition
> > > > we've run into might never get resolved (e.g., running out of
> > > > space). So, glusterfs has to give up after some time.
> 
> The application should not be expected to replay writes. glusterfs must
> retry the failed writes.

Well, writes can fail due to two categories of errors:

1. Transient: the error condition can clear up on its own, or the filesystem can do something to alleviate it.
2. Permanent: the filesystem has no control over how to recover from the failure condition. For example, a network failure.

The best a filesystem can do in scenario 1 is:
1. try to alleviate the error, and
2. retry the writes.

For example, ext4, on seeing a writeback failure with ENOSPC, tries to alleviate it by freeing some extents (extents, again, are managed by the filesystem) and retries. This retry happens only once per failure; after that, the page is marked with an error.

As for failure scenario 2, there is no point in retrying, and it is difficult to have a well-defined policy on how long to keep retrying. The purpose of this mail is to identify errors that fall into scenario 1 above and to define a recovery policy for them. I am afraid glusterfs cannot do much in scenario 2; if you have ideas that can help there, I am open to incorporating them.
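To make the distinction concrete, here is a minimal sketch (not actual glusterfs code; the function name and the exact errno buckets are my assumptions) of how write-behind could classify errnos before deciding whether a bounded retry is worthwhile:

#include <errno.h>
#include <stdbool.h>

/* Hypothetical classifier: scenario 1 errors may clear up (or the
 * filesystem can help them clear up), so a bounded retry makes
 * sense. Scenario 2 errors cannot be fixed from within the
 * filesystem, so retrying them is pointless. */
static bool
wb_error_is_retriable (int op_errno)
{
        switch (op_errno) {
        case ENOSPC:    /* space may be freed by another application */
        case EDQUOT:    /* quota may be raised or usage reduced */
                return true;
        case ENOTCONN:  /* network failure: scenario 2 */
        case EIO:       /* backend/media error: no local remedy */
        default:
                return false;
        }
}

A policy like ext4's (retry once, then surface the error) would then bound the cost of retries even for the retriable class.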

I took a quick look at how various filesystems handle writeback failures (this was not extensive research, so there may be some inaccuracies):

1. FUSE:
   ======
 FUSE gained writeback caching in kernel version 3.15. In its current version, it doesn't replay the writes at all on writeback failure.

2. xfs:
   ====
 xfs seems to have an intelligent failure-handling mechanism for writeback failures: for some errors it marks the pages dirty again (so they are retried), while for other errors it doesn't retry. I couldn't look into the details of which errors are retried and which are not.

3. ext4:
   =====
  Only ENOSPC errors are retried, and that too only once.

Also, please note that, to the best of my knowledge, POSIX only guarantees that writes checkpointed by fsync have been persisted. Given these constraints, I am curious how applications handle similar issues on other filesystems.
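For illustration, the checkpoint-and-replay approach sketched earlier could look like this on the application side (plain POSIX calls; the helper name and retry bound are made up):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer at a fixed offset and checkpoint it with fsync.
 * POSIX only guarantees persistence for data that fsync has
 * confirmed, so the caller keeps the buffer around and replays it
 * from the last checkpoint if the write or the fsync fails. */
static int
checkpointed_write (int fd, const char *buf, size_t len, off_t off,
                    int max_attempts)
{
        int attempt;

        for (attempt = 0; attempt < max_attempts; attempt++) {
                if (pwrite (fd, buf, len, off) == (ssize_t) len &&
                    fsync (fd) == 0)
                        return 0;   /* checkpoint reached */
                fprintf (stderr, "attempt %d failed: %s, replaying\n",
                         attempt + 1, strerror (errno));
                /* the data is still in 'buf', so the loop simply
                 * replays the write from the last checkpoint */
        }
        return -1;                  /* give up after max_attempts */
}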

> In gluster-swift, we hit a case where the application would get EIO
> but the write had actually failed because of ENOSPC.

From the Linux kernel source tree:

static inline void mapping_set_error(struct address_space *mapping, int error)
{
        if (unlikely(error)) {
                /* Only ENOSPC is remembered as-is; every other
                 * writeback error is collapsed into AS_EIO and later
                 * reported back to the application as EIO. */
                if (error == -ENOSPC)
                        set_bit(AS_ENOSPC, &mapping->flags);
                else
                        set_bit(AS_EIO, &mapping->flags);
        }
}

It seems only ENOSPC is preserved; the rest of the errors are collapsed into EIO. Again, we are ready to comply with whatever the standard practice is.
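So an application on such a kernel can only ever distinguish ENOSPC; everything else surfaces as EIO when the cached error is reported back, typically at fsync() or close(). A small sketch of what the application sees (illustrative only; the function name is made up):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Per mapping_set_error() above, an underlying EDQUOT, ENOTCONN,
 * etc. during writeback is reported here as plain EIO; only
 * ENOSPC survives with its original value. */
static void
report_writeback_error (int fd)
{
        if (fsync (fd) < 0) {
                if (errno == ENOSPC)
                        fprintf (stderr, "writeback failed: out of space\n");
                else
                        fprintf (stderr, "writeback failed: EIO\n");
        }
}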

> https://bugzilla.redhat.com/show_bug.cgi?id=986812
> 
> Regards,
>  -Prashanth Pai
> 
> ----- Original Message -----
> > From: "Raghavendra Gowdappa" <rgowdapp at redhat.com>
> > To: "Vijay Bellur" <vbellur at redhat.com>
> > Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Ben Turner"
> > <bturner at redhat.com>, "Ira Cooper" <icooper at redhat.com>
> > Sent: Tuesday, September 29, 2015 4:56:33 PM
> > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind
> > 
> > + gluster-devel
> > 
> > > 
> > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote:
> > > > Hi All,
> > > >
> > > > Currently, on failure to flush the writeback cache, we mark the
> > > > fd bad. The rationale is that since the application doesn't know
> > > > which of the cached writes failed, the fd is in a bad state and
> > > > cannot possibly serve a meaningful/correct read. However, this
> > > > approach (though posix-compliant) is not acceptable for
> > > > long-standing applications like QEMU [1]. So, a two-part
> > > > solution was decided:
> > > >
> > > > 1. No longer mark the fd bad during failures while flushing
> > > > data to the backend from the write-behind cache.
> > > > 2. Retry the writes.
> > > >
> > > > As far as 2 goes, the application can checkpoint by doing fsync
> > > > and, on write failures, roll back to the last checkpoint and
> > > > replay writes from that checkpoint. Or, glusterfs can retry the
> > > > writes on behalf of the application. However, glusterfs retrying
> > > > writes cannot be a complete solution, as the error condition
> > > > we've run into might never get resolved (e.g., running out of
> > > > space). So, glusterfs has to give up after some time.
> > > >
> > > > It would be helpful if you could give your inputs on how other
> > > > writeback systems (e.g., kernel page-cache, nfs, samba, ceph,
> > > > lustre, etc.) behave in this scenario and what would be a sane
> > > > policy for glusterfs.
> > > >
> > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1200862
> > > >
> > > > regards,
> > > > Raghavendra
> > > >
> > > 
> > > 
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
> 

