[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design

Krutika Dhananjay kdhananj at redhat.com
Tue Feb 24 10:49:17 UTC 2015


----- Original Message -----

> From: "Vijay Bellur" <vbellur at redhat.com>
> To: "Krutika Dhananjay" <kdhananj at redhat.com>
> Cc: "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, February 24, 2015 4:13:13 PM
> Subject: Re: [Gluster-devel] Sharding - Inode write fops - recoverability
> from failures - design

> On 02/24/2015 01:53 PM, Krutika Dhananjay wrote:
> >
> >
> > ------------------------------------------------------------------------
> >
> > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> > *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> > *Sent: *Tuesday, February 24, 2015 12:26:58 PM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > recoverability from failures - design
> >
> > On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> > > *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> > > *Sent: *Tuesday, February 24, 2015 11:35:28 AM
> > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > > recoverability from failures - design
> > >
> > > On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > > > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel"
> > > > <gluster-devel at gluster.org>
> > > > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
> > > > recoverability from failures - design
> > > >
> > > > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > > > Hi,
> > > > >
> > > > > Please find the design doc for one of the problems in sharding which
> > > > > Pranith and I are trying to solve and its solution @
> > > > > http://review.gluster.org/#/c/9723/1.
> > > > > Reviews and feedback are much appreciated.
> > > > >
> > > >
> > > > Can this feature be made optional? I think there are use cases like
> > > > virtual machine image storage, hdfs etc. where the number of metadata
> > > > queries might not be very high. It would be an acceptable tradeoff in
> > > > such cases to not be very efficient for answering metadata queries but
> > > > be very efficient for data operations.
> > > >
> > > > IOW, can we have two possible modes of operation for the sharding
> > > > translator to answer metadata queries?
> > > >
> > > > 1. One that behaves like a regular filesystem where we expect a mix of
> > > > data and metadata operations. Your document seems to cover that part
> > > > well. We can look at optimizing behavior for multi-threaded single
> > > > writer use cases after an initial implementation is in place.
> > > > Techniques like eager locking can be applied here.
> > > >
> > > > 2. Another mode where we do not expect a lot of metadata queries. In
> > > > this mode, we can visit all nodes where we have shards to answer these
> > > > queries.
> > > >
> > > > But for the sharding translator to be able to visit all shards, it
> > > > needs to know the last shard number. Without this, it will never know
> > > > when to stop looking up the different shards. For this to happen, we
> > > > still need to maintain the size attribute for each file.
> > > >
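As a rough sketch of why the size attribute is enough to bound those lookups (the helper below is made up for illustration, assumes a 4MB shard-block-size, and is not the actual xlator code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* shard_block_size is assumed to be the volume's configured shard size
 * (e.g. 4MB); last_shard_index is a hypothetical helper. */
static uint64_t
last_shard_index (uint64_t file_size, uint64_t shard_block_size)
{
        if (file_size == 0)
                return 0;
        return (file_size - 1) / shard_block_size;   /* 0-based index */
}

int
main (void)
{
        uint64_t shard_block_size = 4ULL * 1024 * 1024;      /* 4MB shards  */
        uint64_t file_size        = 10ULL * 1024 * 1024 + 1; /* 10MB + 1B   */

        /* Shards 0, 1 and 2 hold data; lookups can stop at index 2. */
        printf ("last shard index = %" PRIu64 "\n",
                last_shard_index (file_size, shard_block_size));
        return 0;
}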
> > >
> > > Wouldn't maintaining the total number of shards in the metadata shard
> > > be sufficient?
> > >
> > > Maintaining the correctness of "total number of shards" would again
> > > incur the same cost as maintaining size or any other metadata attribute
> > > if a client/brick crashes in the middle of a write fop before the
> > > attribute is committed to disk.
> > > In other words, we will again need to maintain a "dirty" and "committed"
> > > copy of the shard_count to ensure its correctness.
> > >
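A minimal sketch of that dirty/committed idea, using plain setxattr(2) on a local path; the xattr names are made up for illustration and this is not the actual sharding transaction:

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

static int
update_shard_count (const char *path, const char *new_count)
{
        /* Phase 1: mark the attribute dirty before touching data. */
        if (setxattr (path, "user.shard.count-dirty", "1", 1, 0) < 0)
                return -1;

        /* ... extending write / truncate that creates or removes shards ... */

        /* Phase 2: persist the new value and clear the dirty flag. A crash
         * between the two phases leaves the dirty flag set, telling a later
         * lookup that the count must be recomputed. */
        if (setxattr (path, "user.shard.count", new_count,
                      strlen (new_count), 0) < 0)
                return -1;
        return setxattr (path, "user.shard.count-dirty", "0", 1, 0);
}

int
main (int argc, char **argv)
{
        if (argc != 3) {
                fprintf (stderr, "usage: %s <file> <new-count>\n", argv[0]);
                return 1;
        }
        if (update_shard_count (argv[1], argv[2]) < 0) {
                perror ("update_shard_count");
                return 1;
        }
        return 0;
}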
> >
> > I think the cost of maintaining "total number of shards" is not as
> > expensive as maintaining size or any other metadata attribute. The shard
> > count needs to be updated only when an extending operation results in
> > the creation of a new shard or when a truncate operation results in the
> > removal of a shard. Maintaining other metadata attributes would need a
> > 5-phase transaction for every write operation. Isn't that the case?
> >
> > Even the size attribute changes only in the case of extending writes and
> > truncates. In fact, Pranith and I had initially chosen to persist the
> > shard count as opposed to the size in the first design for inode write
> > fops. But the reason we decided to go with size in the end is to avoid an
> > extra lookup on the last shard to find the total size of the file (i.e.,
> > if N is the total number of shards,
> > file size = (N-1)*shard_block_size + sizeof(last shard)).
> >
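For example, assuming a 4MB shard-block-size and N = 3 shards whose last shard holds 1MB of data, file size = (3-1)*4MB + 1MB = 9MB; arriving at that number requires an extra lookup/stat on shard 2 just to learn its size, which is the lookup we wanted to avoid.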

> I am probably confused about the definition of size.
By size, I mean the total size of the file in bytes. 

> For maintaining
> accurate size, wouldn't we need to account for truncates and writes that
> happen within the scope of one shard?
Correct. This particular increase/decrease in size can be deduced from the change in ia_size between postbuf and prebuf in the respective callback. 
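A simplified sketch of that computation, with struct iatt reduced to the one field that matters here (not the actual xlator code):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct iatt {
        uint64_t ia_size;       /* size of the shard, in bytes */
};

/* The delta is signed: extending writes grow the file, truncates shrink it. */
static int64_t
size_delta (const struct iatt *prebuf, const struct iatt *postbuf)
{
        return (int64_t)postbuf->ia_size - (int64_t)prebuf->ia_size;
}

int
main (void)
{
        struct iatt prebuf  = { .ia_size = 1048576 };  /* shard was 1MB    */
        struct iatt postbuf = { .ia_size = 1310720 };  /* shard now 1.25MB */
        uint64_t file_size  = 9437184;                 /* persisted: 9MB   */

        /* Apply the per-shard delta to the persisted total file size. */
        file_size += size_delta (&prebuf, &postbuf);
        printf ("new file size = %" PRIu64 "\n", file_size); /* 9699328 */
        return 0;
}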
-Krutika 

> -Vijay