[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design

Tue Feb 24 06:56:58 UTC 2015

On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
>
>
> ------------------------------------------------------------------------
>
>     *From: *"Vijay Bellur" <vbellur at redhat.com>
>     *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
>     *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
>     *Sent: *Tuesday, February 24, 2015 11:35:28 AM
>     *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
>     recoverability from failures - design
>
>     On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      >     *From: *"Vijay Bellur" <vbellur at redhat.com>
>      >     *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel"
>      >     <gluster-devel at gluster.org>
>      >     *Sent: *Monday, February 23, 2015 5:25:57 PM
>      >     *Subject: *Re: [Gluster-devel] Sharding - Inode write fops -
>      >     recoverability from failures - design
>      >
>      >     On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
>      >      > Hi,
>      >      >
>      >      > Please find the design doc for one of the problems in sharding
>      >      > which Pranith and I are trying to solve and its solution @
>      >      > http://review.gluster.org/#/c/9723/1.
>      >      > Reviews and feedback are much appreciated.
>      >      >
>      >
>      >     Can this feature be made optional? I think there are use cases
>      >     like virtual machine image storage, hdfs etc. where the number
>      >     of metadata queries might not be very high. It would be an
>      >     acceptable tradeoff in such cases to not be very efficient for
>      >     answering metadata queries but be very efficient for data
>      >     operations.
>      >
>      >     IOW, can we have two possible modes of operation for the sharding
>      >     translator to answer metadata queries?
>      >
>      >     1. One that behaves like a regular filesystem where we expect a
>      >     mix of data and metadata operations. Your document seems to
>      >     cover that part well. We can look at optimizing behavior for
>      >     multi-threaded single writer use cases after an initial
>      >     implementation is in place. Techniques like eager locking can
>      >     be applied here.
>      >
>      >     2. Another mode where we do not expect a lot of metadata
>      >     queries. In this mode, we can visit all nodes where we have
>      >     shards to answer these queries.
>      >
>      > But for sharding translator to be able to visit all shards, it is
>      > required to know the last shard number.
>      > Without this, it will never know when to stop looking up the
>      > different shards. For this to happen, we still need to maintain
>      > the size attribute for each file.
>      >
>
>     Wouldn't maintaining the total number of shards in the metadata shard
>     be sufficient?
>
> Maintaining the correctness of "total number of shards" would again
> incur the same cost as maintaining size or any other metadata attribute
> if a client/brick crashes in the middle of a write fop before the
> attribute is committed to disk.
> In other words, we will again need to maintain a "dirty" and "committed"
> copy of the shard_count to ensure its correctness.
>

I think maintaining the "total number of shards" is not as expensive as
maintaining size or any other metadata attribute. The shard count needs
to be updated only when an extending operation results in the creation
of a new shard or when a truncate operation results in the removal of a
shard. Maintaining other metadata attributes would need a 5-phase
transaction for every write operation. Isn't that the case?

-Vijay
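
The argument above (the shard count changes far less often than the size)
can be illustrated with a small sketch. This is not code from the sharding
translator. It assumes the shard count is derived purely from the file size
and a fixed shard block size, ignores holes, and uses hypothetical names
(shard_count, size_after_write, shard_count_changes, BLOCK) and a
hypothetical 4 MiB block size chosen only for the example.

```python
MIB = 1024 * 1024
BLOCK = 4 * MIB  # hypothetical shard block size, chosen for illustration


def shard_count(size: int, block_size: int = BLOCK) -> int:
    """Shards needed to cover `size` bytes (assumes no holes are skipped)."""
    if size < 0 or block_size <= 0:
        raise ValueError("size must be >= 0 and block_size must be > 0")
    # Ceiling division: an empty file needs 0 shards in this simplified model.
    return (size + block_size - 1) // block_size


def size_after_write(old_size: int, offset: int, length: int) -> int:
    """File size after writing `length` bytes at `offset`."""
    return max(old_size, offset + length)


def shard_count_changes(old_size: int, new_size: int,
                        block_size: int = BLOCK) -> bool:
    """True only if the operation adds or removes a shard."""
    return shard_count(old_size, block_size) != shard_count(new_size, block_size)


if __name__ == "__main__":
    # Extending write that stays inside the first shard: size changes,
    # shard count does not.
    new_size = size_after_write(1 * MIB, offset=1 * MIB, length=1 * MIB)
    print(new_size != 1 * MIB, shard_count_changes(1 * MIB, new_size))

    # Extending write that crosses a block boundary: a new shard is created.
    new_size = size_after_write(2 * MIB, offset=4 * MIB, length=1)
    print(shard_count_changes(2 * MIB, new_size))

    # Truncate from 9 MiB to 3 MiB: shards are removed.
    print(shard_count_changes(9 * MIB, 3 * MIB))
```

Under these assumptions, every extending write changes the size, but only
writes that cross a block boundary (or truncates that drop a shard) change
the count.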