[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design

Tue Feb 24 06:49:01 UTC 2015

----- Original Message -----

> From: "Vijay Bellur" <vbellur at redhat.com>
> To: "Krutika Dhananjay" <kdhananj at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, February 24, 2015 11:35:28 AM
> Subject: Re: [Gluster-devel] Sharding - Inode write fops - recoverability
> from failures - design

> On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> >
> > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel"
> > <gluster-devel at gluster.org>
> > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > recoverability from failures - design
> >
> > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > Hi,
> > >
> > > Please find the design doc for one of the problems in sharding which
> > > Pranith and I are trying to solve and its solution @
> > > http://review.gluster.org/#/c/9723/1.
> > > Reviews and feedback are much appreciated.
> > >
> >
> > Can this feature be made optional? I think there are use cases like
> > virtual machine image storage, hdfs etc. where the number of metadata
> > queries might not be very high. It would be an acceptable tradeoff in
> > such cases to not be very efficient for answering metadata queries but
> > be very efficient for data operations.
> >
> > IOW, can we have two possible modes of operation for the sharding
> > translator to answer metadata queries?
> >
> > 1. One that behaves like a regular filesystem where we expect a mix of
> > data and metadata operations. Your document seems to cover that part
> > well. We can look at optimizing behavior for multi-threaded single
> > writer use cases after an initial implementation is in place. Techniques
> > like eager locking can be applied here.
> >
> > 2. Another mode where we do not expect a lot of metadata queries. In
> > this mode, we can visit all nodes where we have shards to answer these
> > queries.
> >
> > But for the sharding translator to be able to visit all shards, it needs
> > to know the last shard number. Without this, it will never know when to
> > stop looking up shards. For this to happen, we still need to maintain the
> > size attribute for each file.
> >

> Wouldn't maintaining the total number of shards in the metadata shard be
> sufficient?
Keeping "total number of shards" correct would incur the same cost as
maintaining size or any other metadata attribute, because a client or brick
can crash in the middle of a write fop before the attribute is committed to
disk. In other words, we would again need to maintain a "dirty" and a
"committed" copy of the shard_count to ensure its correctness.
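
To illustrate the point, here is a minimal sketch of the dirty/committed idea as a toy in-memory model. It is not GlusterFS code and not necessarily what the design doc specifies: the class ShardedFile, its fields, and the crash_before_commit flag are all hypothetical names made up for this example, and recovery is assumed to work by recounting the shards that actually exist.

```python
class ShardedFile:
    """Toy in-memory model of a sharded file (hypothetical, not GlusterFS code)."""

    def __init__(self):
        self.shards = {}          # shard index -> data; stands in for the bricks
        self.committed_count = 0  # last count known to match the shards on disk
        self.dirty_count = 0      # count that an in-flight write intends to commit

    def write_shard(self, index, data, crash_before_commit=False):
        # Pre-op: record the intended count before touching any data.
        self.dirty_count = max(self.committed_count, index + 1)
        # Data operation: create or overwrite the shard.
        self.shards[index] = data
        if crash_before_commit:
            # Simulate a client/brick crash in the middle of the write fop.
            raise RuntimeError("simulated crash before commit")
        # Post-op: the write landed, so publish the new count.
        self.committed_count = self.dirty_count

    def needs_recovery(self):
        # A mismatch means some write never reached its post-op.
        return self.dirty_count != self.committed_count

    def recover(self):
        # Assumed recovery step: recount from the shards that really exist.
        actual = max(self.shards, default=-1) + 1
        self.committed_count = actual
        self.dirty_count = actual
```

The extra pre-op and post-op updates on every count-changing write are exactly the cost being discussed: they are the same bookkeeping that the size attribute would need.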

-Krutika 

> -Vijay