[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
Vijay Bellur
vbellur at redhat.com
Tue Feb 24 10:43:13 UTC 2015
On 02/24/2015 01:53 PM, Krutika Dhananjay wrote:
>
> ------------------------------------------------------------------------
>
> *From: *"Vijay Bellur" <vbellur at redhat.com>
> *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> *Sent: *Tuesday, February 24, 2015 12:26:58 PM
> *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
>
> On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
> >
> > ------------------------------------------------------------------------
> >
> > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> > *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> > *Sent: *Tuesday, February 24, 2015 11:35:28 AM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
> >
> > On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> > > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
> > >
> > > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > > Hi,
> > > >
> > > > Please find the design doc for one of the problems in sharding
> > > > which Pranith and I are trying to solve and its solution @
> > > > http://review.gluster.org/#/c/9723/1.
> > > > Reviews and feedback are much appreciated.
> > > >
> > >
> > > Can this feature be made optional? I think there are use cases
> > > like virtual machine image storage, HDFS, etc. where the number
> > > of metadata queries might not be very high. In such cases it
> > > would be an acceptable tradeoff to be less efficient at answering
> > > metadata queries but very efficient for data operations.
> > >
> > > IOW, can we have two possible modes of operation for the sharding
> > > translator to answer metadata queries?
> > >
> > > 1. One that behaves like a regular filesystem where we expect a
> > > mix of data and metadata operations. Your document seems to cover
> > > that part well. We can look at optimizing behavior for
> > > multi-threaded single-writer use cases after an initial
> > > implementation is in place. Techniques like eager locking can be
> > > applied here.
> > >
> > > 2. Another mode where we do not expect a lot of metadata queries.
> > > In this mode, we can visit all nodes where we have shards to
> > > answer these queries.
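For concreteness, here is a minimal user-space sketch (plain C, not translator code) of the kind of aggregation mode 2 implies, using allocated blocks as the example query since that genuinely requires visiting every shard. The "<base>.<index>" naming scheme and the shard-count parameter are assumptions for illustration, not necessarily the shard translator's actual on-disk layout:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Sum allocated 512-byte blocks over the base file (shard 0) and
 * shards "<base>.1" .. "<base>.<n-1>". Shards missing because of
 * sparse regions simply contribute nothing. */
static long long aggregate_blocks(const char *base, unsigned long n)
{
    char path[4096];
    struct stat st;
    long long blocks = 0;

    if (stat(base, &st) == 0)
        blocks += st.st_blocks;

    for (unsigned long i = 1; i < n; i++) {
        snprintf(path, sizeof(path), "%s.%lu", base, i);
        if (stat(path, &st) == 0)
            blocks += st.st_blocks;
    }
    return blocks;
}

int main(int argc, char **argv)
{
    if (argc < 3)
        return 1;
    printf("%lld blocks\n",
           aggregate_blocks(argv[1], strtoul(argv[2], NULL, 10)));
    return 0;
}

The catch, as the reply below notes, is knowing how many shards to visit in the first place.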
> > >
> > > But for the sharding translator to be able to visit all shards,
> > > it needs to know the last shard number. Without this, it will
> > > never know when to stop looking up the different shards. For this
> > > to happen, we still need to maintain the size attribute for each
> > > file.
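A sketch of the dependency being described, under the assumption of fixed-size shard blocks (4MB is an arbitrary example value, not a claim about the actual default): the last shard number falls straight out of the size.

#include <stdio.h>

/* The index of the last shard is derivable from the file size
 * alone; without the size (or an equivalent attribute) there is
 * no safe point at which to stop looking up shards, because
 * sparse files can leave holes in the shard sequence. */
static unsigned long last_shard_index(unsigned long long size,
                                      unsigned long long block)
{
    return size ? (unsigned long)((size - 1) / block) : 0;
}

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */
    /* a 17MB file spans shards 0..4 */
    printf("%lu\n", last_shard_index(17ULL << 20, block));
    return 0;
}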
> > >
> >
> > Wouldn't maintaining the total number of shards in the metadata
> > shard be sufficient?
> >
> > Maintaining the correctness of "total number of shards" would again
> > incur the same cost as maintaining size or any other metadata
> > attribute if a client/brick crashes in the middle of a write fop
> > before the attribute is committed to disk. In other words, we will
> > again need to maintain a "dirty" and "committed" copy of the
> > shard_count to ensure its correctness.
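A minimal in-memory sketch of such a dirty/committed scheme. The struct and field names are hypothetical (on disk the two copies would presumably live in xattrs on the metadata shard), and the points where a synced write would be needed are marked in comments:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical two-copy metadata record. */
struct shard_md {
    unsigned long committed;  /* last value known to be consistent */
    unsigned long dirty;      /* in-flight value */
    bool          is_dirty;   /* true while an update is in flight */
};

/* Phase 1: record the intended new value and mark the record
 * dirty (persisted to disk before touching any shard). */
static void update_begin(struct shard_md *md, unsigned long new_count)
{
    md->dirty = new_count;
    md->is_dirty = true;
}

/* Phase 2: after the data operation succeeds, promote the dirty
 * value and clear the flag (again persisted to disk). */
static void update_commit(struct shard_md *md)
{
    md->committed = md->dirty;
    md->is_dirty = false;
}

/* Recovery: a set dirty flag after a crash means the committed
 * value may be stale and must be reconciled, e.g. by scanning
 * the shards. */
static unsigned long recover(const struct shard_md *md)
{
    if (md->is_dirty)
        printf("crash mid-update: reconciliation needed\n");
    return md->committed;
}

int main(void)
{
    struct shard_md md = { .committed = 4, .dirty = 4, .is_dirty = false };

    update_begin(&md, 5);   /* extending write will create shard 4 */
    /* ... the write to the new shard happens here ... */
    update_commit(&md);

    printf("shard count = %lu\n", recover(&md));
    return 0;
}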
> >
>
> I think maintaining "total number of shards" is not as expensive as
> maintaining size or any other metadata attribute. The shard count
> needs to be updated only when an extending operation results in the
> creation of a new shard or when a truncate operation results in the
> removal of a shard. Maintaining other metadata attributes would need
> a 5-phase transaction for every write operation. Isn't that the case?
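A sketch of the condition being described, again assuming fixed-size shard blocks with 4MB as an arbitrary example value: the shard count moves only when a write extends the file across a block boundary.

#include <stdbool.h>
#include <stdio.h>

/* Number of shards needed to hold a file of the given size. */
static unsigned long shards_for(unsigned long long size,
                                unsigned long long block)
{
    return size ? (unsigned long)((size - 1) / block + 1) : 1;
}

/* A write changes the shard count only if it extends the file
 * across a shard-block boundary. */
static bool write_adds_shards(unsigned long long old_size,
                              unsigned long long off,
                              unsigned long long len,
                              unsigned long long block)
{
    unsigned long long end = off + len;
    unsigned long long new_size = end > old_size ? end : old_size;
    return shards_for(new_size, block) > shards_for(old_size, block);
}

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */

    /* overwrite inside existing shards: count unchanged -> 0 */
    printf("%d\n", write_adds_shards(17ULL << 20, 0, 1ULL << 20, block));

    /* append past the last block boundary: count grows -> 1 */
    printf("%d\n", write_adds_shards(17ULL << 20, 20ULL << 20,
                                     1ULL << 20, block));
    return 0;
}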
>
> Even the size attribute changes only in the case of extending writes
> and truncates. In fact, Pranith and I had initially chosen to persist
> the shard count, as opposed to the size, in the first design for
> inode write fops. But the reason we decided to go with size in the
> end is to avoid an extra lookup on the last shard to find the total
> size of the file (i.e., if N is the total number of shards,
> file size = (N-1)*shard_block_size + sizeof(last shard)).
>
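The quoted formula as a worked example (same illustrative 4MB block size): persisting N alone still leaves the last term to be fetched with a stat on the last shard, whereas persisting the size answers the query directly.

#include <stdio.h>

/* file size = (N-1) * shard_block_size + sizeof(last shard) */
static unsigned long long size_from_count(unsigned long n_shards,
                                          unsigned long long block,
                                          unsigned long long last_shard_size)
{
    return (n_shards - 1) * block + last_shard_size;
}

int main(void)
{
    /* 5 shards of 4MB, the last one holding 1MB -> 17MB */
    printf("%llu\n", size_from_count(5, 4ULL << 20, 1ULL << 20));
    return 0;
}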
I am probably confused about the definition of size. For maintaining
accurate size, wouldn't we need to account for truncates and writes that
happen within the scope of one shard?
-Vijay
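A small worked example of the case being asked about, under the same illustrative assumptions as the sketches above: an extending write confined to the last shard changes the size while leaving the shard count untouched, so the two attributes are not interchangeable.

#include <stdio.h>

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */
    unsigned long long size  = 17ULL << 20;  /* 5 shards, 0..4    */

    /* append 1MB at EOF: still within shard 4 (16MB..20MB) */
    size += 1ULL << 20;

    unsigned long shards = (unsigned long)((size - 1) / block + 1);

    /* prints: size=18MB shards=5 -- size moved, the count did not */
    printf("size=%lluMB shards=%lu\n", size >> 20, shards);
    return 0;
}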