[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design

Tue Feb 24 08:23:16 UTC 2015

----- Original Message -----

> From: "Vijay Bellur" <vbellur at redhat.com>
> To: "Krutika Dhananjay" <kdhananj at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, February 24, 2015 12:26:58 PM
> Subject: Re: [Gluster-devel] Sharding - Inode write fops - recoverability
> from failures - design

> On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
> >
> >
> > ------------------------------------------------------------------------
> >
> > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> > *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> > *Sent: *Tuesday, February 24, 2015 11:35:28 AM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > recoverability from failures - design
> >
> > On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> > >
> > >
> > >
> > ------------------------------------------------------------------------
> > >
> > > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel"
> > > <gluster-devel at gluster.org>
> > > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > > recoverability from failures - design
> > >
> > > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > > Hi,
> > > >
> > > > Please find the design doc for one of the problems in sharding
> > > > which Pranith and I are trying to solve and its solution @
> > > > http://review.gluster.org/#/c/9723/1.
> > > > Reviews and feedback are much appreciated.
> > > >
> > >
> > > Can this feature be made optional? I think there are use cases like
> > > virtual machine image storage, hdfs etc. where the number of metadata
> > > queries might not be very high. It would be an acceptable tradeoff in
> > > such cases to not be very efficient for answering metadata queries but
> > > be very efficient for data operations.
> > >
> > > IOW, can we have two possible modes of operation for the sharding
> > > translator to answer metadata queries?
> > >
> > > 1. One that behaves like a regular filesystem where we expect a mix of
> > > data and metadata operations. Your document seems to cover that part
> > > well. We can look at optimizing behavior for multi-threaded single
> > > writer use cases after an initial implementation is in place.
> > > Techniques like eager locking can be applied here.
> > >
> > > 2. Another mode where we do not expect a lot of metadata queries. In
> > > this mode, we can visit all nodes where we have shards to answer these
> > > queries.
> > >
> > > But for sharding translator to be able to visit all shards, it is
> > > required to know the last shard number. Without this, it will never
> > > know when to stop looking up the different shards. For this to happen,
> > > we still need to maintain the size attribute for each file.
> > >
> >
> > Wouldn't maintaining the total number of shards in the metadata shard
> > be sufficient?
> >
> > Maintaining the correctness of "total number of shards" would again
> > incur the same cost as maintaining size or any other metadata attribute
> > if a client/brick crashes in the middle of a write fop before the
> > attribute is committed to disk.
> > In other words, we will again need to maintain a "dirty" and "committed"
> > copy of the shard_count to ensure its correctness.
> >

> I think the cost of maintaining "total number of shards" is not as
> expensive as maintaining size or any other metadata attribute. The shard
> count needs to be updated only when an extending operation results in
> the creation of a new shard or when a truncate operation results in the
> removal of a shard. Maintaining other metadata attributes would need a 5
> phase transaction for every write operation. Isn't that the case?
The size attribute, too, changes only on extending writes and truncates. In fact, Pranith and I had
initially chosen to persist the shard count rather than the size in the first design for inode write fops.
We went with size in the end to avoid an extra lookup on the last shard to find the total size of
the file (i.e., if N is the total number of shards, file size = (N-1)*shard_block_size + sizeof(last shard)).
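
The arithmetic behind this trade-off can be sketched as below. This is only
an illustration and not code from the sharding translator: the function
names, the 4 MiB block size, and the convention that an empty file still has
one shard are all assumptions made for the example. It shows the formula
above, the reverse derivation of the shard count from a persisted size, and
which of the two attributes a given write would change.

```python
from typing import Tuple

# Hypothetical shard block size, chosen only for this example.
SHARD_BLOCK_SIZE = 4 * 1024 * 1024


def file_size_from_shards(shard_count: int, shard_block_size: int,
                          last_shard_size: int) -> int:
    """file size = (N-1)*shard_block_size + sizeof(last shard).

    last_shard_size is what the extra lookup on the last shard would return.
    """
    if shard_count < 1:
        raise ValueError("a file has at least one shard")
    if not 0 <= last_shard_size <= shard_block_size:
        raise ValueError("last shard cannot exceed shard_block_size")
    return (shard_count - 1) * shard_block_size + last_shard_size


def shards_needed(file_size: int, shard_block_size: int) -> int:
    """Shard count derived from a persisted size (ceiling division).

    Assumption: an empty file is still backed by one shard.
    """
    return max(1, -(-file_size // shard_block_size))


def write_changes(old_size: int, offset: int, length: int,
                  shard_block_size: int) -> Tuple[bool, bool]:
    """Return (size_changes, shard_count_changes) for a single write."""
    new_size = max(old_size, offset + length)
    size_changes = new_size != old_size
    count_changes = (shards_needed(new_size, shard_block_size)
                     != shards_needed(old_size, shard_block_size))
    return size_changes, count_changes
```

Under these assumptions, an overwrite inside the file changes neither
attribute, an extending write within the last shard changes only the size,
and an extending write that crosses a shard boundary changes both.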

-Krutika 

> -Vijay