[Gluster-devel] Barrier design issues wrt volume snapshot

Thu Mar 6 19:44:49 UTC 2014

On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi <kparthas at redhat.com> wrote:

>
>
> ----- Original Message -----
> > On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur <vbellur at redhat.com> wrote:
> >
> > > Adding gluster-devel.
> > >
> > >
> > > On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:
> > >
> > >> All,
> > >>
> > >> In recent discussions around the design (and implementation) of the
> > >> barrier feature, a couple of things came to light.
> > >>
> > >> 1) changelog xlator needs barrier xlator to block unlink and rename FOPs
> > >>     in the call path. This is apart from the current list of FOPs that
> > >>     are blocked in their callback path.
> > >>     This is to make sure that the changelog has a bounded queue of unlink
> > >>     and rename FOPs, from the time barriering is enabled, to be drained,
> > >>     committed to the changelog file and published.
> > >>
> > >
> > Why is this necessary?
>
> The only consumer of changelog today, geo-replication, can't tolerate
> missing unlink/rename entries from changelog, even with the initial
> xsync based crawl, until changelog entries are available for the
> master volume.
> So, changelog xlator needs to ensure that the last rotated
> (publishable) changelog has entries for all the unlink(s)/rename(s)
> that made it to the snapshot. For this, changelog needs barrier xlator
> to block unlink/rename FOPs in the call path too. Hope that helps.
>

This sounds like a very changelog-specific requirement, and it is best
addressed in the changelog translator itself. If unlink/rmdir/rename
should not be "in progress" during a snapshot, then we need to hold off new
ops in the call path and trigger a log rotation, and the rotation should wait
for completion of ongoing fops anyway.
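
To make the ordering concrete, here is a minimal sketch in Python (not
GlusterFS code). The class CallPathBarrier, the function
rotate_for_snapshot, and the rotate_log / take_snapshot callables are all
hypothetical names used only for illustration. It assumes each
unlink/rmdir/rename is bracketed by begin_op()/end_op(), and that the gate
stays closed until the snapshot has been taken.

```python
import threading


class CallPathBarrier:
    """Hypothetical gate: holds off new ops and waits for in-flight ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._barrier_on = False
        self._in_flight = 0

    def begin_op(self):
        # Call path: a new op waits here while the barrier is enabled.
        with self._cond:
            while self._barrier_on:
                self._cond.wait()
            self._in_flight += 1

    def end_op(self):
        # Callback path: the op has completed on the brick.
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def enable_and_drain(self):
        # Stop admitting new ops, then wait for ongoing ones to finish.
        with self._cond:
            self._barrier_on = True
            while self._in_flight > 0:
                self._cond.wait()

    def disable(self):
        # Reopen the gate and wake any ops held in begin_op().
        with self._cond:
            self._barrier_on = False
            self._cond.notify_all()


def rotate_for_snapshot(barrier, rotate_log, take_snapshot):
    """Assumed ordering: hold new ops, drain, rotate, snapshot, release."""
    barrier.enable_and_drain()
    try:
        rotate_log()      # every op that reached the disk is already logged
        take_snapshot()   # no unlink/rename is in progress at this point
    finally:
        barrier.disable()
```

With this shape, the rotated log and the snapshot see the same set of
completed operations, because nothing new is admitted between the drain and
the release.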

>
> >
> >
> > >> 2) It is possible in a pure distribute volume that the following
> > >>     sequence of FOPs could result in snapshots of bricks disagreeing
> > >>     on inode type for a file or directory.
> > >>
> > >>     t1: snap b1
> > >>     t2: unlink /a
> > >>     t3: mkdir /a
> > >>     t4: snap b2
> > >>
> > >> where, b1 and b2 are bricks of a pure distribute volume V.
> > >>
> > >> The above sequence can happen with the current barrier xlator design,
> > >> since we allow unlink FOPs
> > >> to go through to the disk and only block their acknowledgement to the
> > >> application. This implies
> > >> a concurrent mkdir on the same name could succeed, since DHT doesn't
> > >> serialize unlink and mkdir FOPs,
> > >> unlike AFR.
> > >>
> > >> Avati,
> > >>
> > >> I hear that you have a solution for problem 2). Could you please start
> > >> the discussion on this thread?
> > >> It would help us to decide how to go about with the barrier xlator
> > >> implementation.
> > >>
> > >
> >
> > The solution is really a long pending implementation of dentry
> > serialization in the resolver of protocol server. Today we allow multiple
> > FOPs to happen in parallel which modify the same dentry. This results in
> > hairy races (including non atomicity of rename) and has been kept open for
> > a while now. Implementing the dentry serialization in the resolver will
> > "solve" 2 as a side effect. Hence that is a better approach than making
> > changes in the barrier translator.
> >
>
> I am not sure I understood how this works from the brief introduction above.
> Could you explain a bit?
>

By dentry serialization, I mean we should have only one operation modifying
a given <pargfid>/bname at a time. This needs changes in the resolver of
protocol server and possibly some changes in the inode table. This is
really for solving rare races, and I think it is something we need to work on
independently of the snapshot requirements.
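
As a rough illustration of the idea (a Python sketch, not the protocol
server resolver), one lock per (pargfid, bname) key is enough to make an
unlink and a mkdir on the same name run one after the other, while
operations on other names proceed in parallel. DentrySerializer and its
modifying() method are hypothetical names.

```python
import threading
from contextlib import contextmanager


class DentrySerializer:
    """Hypothetical: at most one modifier per (pargfid, bname) at a time."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}  # (pargfid, bname) -> [lock, reference count]

    @contextmanager
    def modifying(self, pargfid, bname):
        key = (pargfid, bname)
        with self._guard:
            # Create the per-dentry lock on first use and count this user.
            entry = self._locks.setdefault(key, [threading.Lock(), 0])
            entry[1] += 1
        entry[0].acquire()  # blocks while another op modifies this dentry
        try:
            yield
        finally:
            entry[0].release()
            with self._guard:
                entry[1] -= 1
                if entry[1] == 0:
                    # Nobody holds or waits on this dentry: drop its lock.
                    del self._locks[key]
```

Usage would look like "with serializer.modifying(pargfid, bname): ..."
around the unlink or mkdir. A rename touches two dentries, so in a sketch
like this it would presumably need to take both keys in a fixed order to
avoid deadlock; that part is not shown here.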

Avati