[Gluster-devel] Barrier design issues wrt volume snapshot

Fri Mar 7 04:07:39 UTC 2014

----- Original Message -----

> From: "Anand Avati" <avati at gluster.org>
> To: "Vijay Bellur" <vbellur at redhat.com>
> Cc: "Krishnan Parthasarathi" <kparthas at redhat.com>, "Anand Avati"
> <aavati at redhat.com>, "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Varun
> Shastry" <vshastry at redhat.com>, "Pranith Kumar Karampuri"
> <pkarampu at redhat.com>, "Venky Shankar" <vshankar at redhat.com>, "Kaushal M"
> <kaushal at redhat.com>, "Rajesh Joseph" <rjoseph at redhat.com>, "Kotresh
> Hiremath Ravishankar" <khiremat at redhat.com>, gluster-devel at nongnu.org
> Sent: Friday, March 7, 2014 12:21:54 AM
> Subject: Re: [Gluster-devel] Barrier design issues wrt volume snapshot

> On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur <vbellur at redhat.com> wrote:
> > Adding gluster-devel.
> >
> > On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:
> > > All,
> > >
> > > In recent discussions around the design (and implementation) of the
> > > barrier feature, a couple of things came to light.
> > >
> > > 1) The changelog xlator needs the barrier xlator to block unlink and
> > > rename FOPs in the call path. This is in addition to the current list of
> > > FOPs that are blocked in their callback path. This is to make sure that
> > > the changelog has a bounded queue of unlink and rename FOPs, from the
> > > time barriering is enabled, to be drained, committed to the changelog
> > > file and published.

> Why is this necessary?

FOPs that still come through after the barrier is enabled (assuming that barrier is done in the call path) would end up in a non-consumable changelog. For these operations, geo-rep would fall back to an xtime-based FS crawl, which does not handle unlinks and renames.
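
To make the bounded-queue argument concrete, here is a minimal toy sketch in Python (not GlusterFS code). CallPathBarrier, submit and release are hypothetical names, and a plain list stands in for the changelog's pending queue. It assumes the barrier sits above changelog in the call path, so unlink and rename FOPs that arrive while barriering is on never reach changelog until release.

```python
from collections import deque


class CallPathBarrier:
    """Hypothetical toy model: holds unlink/rename before they reach changelog."""

    BARRIERED = {"unlink", "rename"}

    def __init__(self, changelog_queue):
        self.enabled = False
        self.held = deque()                     # FOPs parked in the call path
        self.changelog_queue = changelog_queue  # stand-in for changelog's pending work

    def submit(self, fop, path):
        # While enabled, barriered FOPs are parked instead of being passed down.
        if self.enabled and fop in self.BARRIERED:
            self.held.append((fop, path))
            return "held"
        self.changelog_queue.append((fop, path))
        return "passed"

    def release(self):
        # Disable the barrier and pass the parked FOPs down in arrival order.
        self.enabled = False
        while self.held:
            self.changelog_queue.append(self.held.popleft())


def demo():
    queue = []
    barrier = CallPathBarrier(queue)
    barrier.submit("unlink", "/x")      # arrives before barriering
    barrier.enabled = True
    size_at_enable = len(queue)
    results = [barrier.submit("rename", "/y"), barrier.submit("unlink", "/z")]
    size_while_barriered = len(queue)   # unchanged: nothing new to drain
    barrier.release()
    return size_at_enable, size_while_barriered, len(queue), results
```

In this model the queue that changelog has to drain is fixed at the moment the barrier is enabled, which is the property point 1) asks for. Blocking only in the callback path would let the two later FOPs reach the queue.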

> > > 2) It is possible in a pure distribute volume that the following
> > > sequence of FOPs could result in snapshots of bricks disagreeing on the
> > > inode type for a file or directory.
> > >
> > > t1: snap b1
> > > t2: unlink /a
> > > t3: mkdir /a
> > > t4: snap b2
> > >
> > > where b1 and b2 are bricks of a pure distribute volume V.
> > >
> > > The above sequence can happen with the current barrier xlator design,
> > > since we allow unlink FOPs to go through to the disk and only block
> > > their acknowledgement to the application. This implies a concurrent
> > > mkdir on the same name could succeed, since DHT doesn't serialize unlink
> > > and mkdir FOPs, unlike AFR.
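
The t1..t4 sequence above can be replayed with a small toy model in Python. This is only a sketch: each brick is a dict mapping a path to an inode type, /a is assumed to be a regular file that hashes to b1, and mkdir is assumed to create the directory on every brick of the distribute volume. run_sequence is a hypothetical name.

```python
import copy


def run_sequence():
    # Live state of each brick: path -> inode type.
    b1 = {"/a": "file"}  # assumption: the file /a lives on b1
    b2 = {}

    snap_b1 = copy.deepcopy(b1)   # t1: snap b1
    b1.pop("/a", None)            # t2: unlink /a reaches disk; only the ack is held
    for brick in (b1, b2):        # t3: concurrent mkdir /a, created on every brick
        brick["/a"] = "dir"
    snap_b2 = copy.deepcopy(b2)   # t4: snap b2

    return snap_b1, snap_b2


snap_b1, snap_b2 = run_sequence()
print(snap_b1["/a"], snap_b2["/a"])
```

The snapshot of b1 records /a as a file while the snapshot of b2 records it as a directory, which is the disagreement described in 2).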

> > > Avati,
> > >
> > > I hear that you have a solution for problem 2). Could you please start
> > > the discussion on this thread? It would help us decide how to go about
> > > the barrier xlator implementation.

> The solution is really a long-pending implementation of dentry serialization
> in the resolver of protocol server. Today we allow multiple FOPs that modify
> the same dentry to happen in parallel. This results in hairy races
> (including non-atomicity of rename) and has been kept open for a while now.
> Implementing the dentry serialization in the resolver will "solve" 2 as a
> side effect. Hence that is a better approach than making changes in the
> barrier translator.
>
> Avati
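
As a rough illustration of what dentry serialization means, here is a minimal Python sketch. It is not the protocol server resolver: DentrySerializer and max_overlap are hypothetical names, and a dentry is assumed to be identified by a (parent, basename) pair. FOPs on the same dentry run one at a time, while FOPs on different dentries are not blocked by each other.

```python
import threading
import time
from collections import defaultdict


class DentrySerializer:
    """Hypothetical helper: one lock per (parent, basename) dentry."""

    def __init__(self):
        self._guard = threading.Lock()            # protects the lock table itself
        self._locks = defaultdict(threading.Lock)

    def run(self, parent, name, fop):
        with self._guard:
            lock = self._locks[(parent, name)]
        with lock:                                # FOPs on this dentry run one at a time
            return fop()


def max_overlap(n_fops=4):
    """Run n_fops on the same dentry; return the peak number in flight."""
    ser = DentrySerializer()
    state = {"in_flight": 0, "max": 0}
    counter = threading.Lock()

    def fop():
        with counter:
            state["in_flight"] += 1
            state["max"] = max(state["max"], state["in_flight"])
        time.sleep(0.01)                          # pretend to modify the dentry
        with counter:
            state["in_flight"] -= 1

    threads = [
        threading.Thread(target=ser.run, args=("parent-gfid", "a", fop))
        for _ in range(n_fops)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["max"]
```

With this in place, max_overlap() never sees more than one FOP in flight on the dentry. My reading of the side effect on 2) is that a mkdir on /a would queue behind an unlink on /a that has not yet completed, but that depends on how the serialization interacts with the held acknowledgement.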