[Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
Vijay Bellur
vbellur at redhat.com
Tue Feb 24 10:43:13 UTC 2015
On 02/24/2015 01:53 PM, Krutika Dhananjay wrote:
>
> ------------------------------------------------------------------------
>
> *From: *"Vijay Bellur" <vbellur at redhat.com>
> *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> *Sent: *Tuesday, February 24, 2015 12:26:58 PM
> *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
>
> On 02/24/2015 12:19 PM, Krutika Dhananjay wrote:
> >
> > ------------------------------------------------------------------------
> >
> > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>
> > *Cc: *"Gluster Devel" <gluster-devel at gluster.org>
> > *Sent: *Tuesday, February 24, 2015 11:35:28 AM
> > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
> >
> > On 02/24/2015 10:36 AM, Krutika Dhananjay wrote:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > *From: *"Vijay Bellur" <vbellur at redhat.com>
> > > *To: *"Krutika Dhananjay" <kdhananj at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> > > *Sent: *Monday, February 23, 2015 5:25:57 PM
> > > *Subject: *Re: [Gluster-devel] Sharding - Inode write fops - recoverability from failures - design
> > >
> > > On 02/22/2015 06:08 PM, Krutika Dhananjay wrote:
> > > > Hi,
> > > >
> > > > Please find the design doc for one of the problems in sharding
> > > > which Pranith and I are trying to solve and its solution @
> > > > http://review.gluster.org/#/c/9723/1.
> > > > Reviews and feedback are much appreciated.
> > > >
> > >
> > > Can this feature be made optional? I think there are use cases
> > > like virtual machine image storage, HDFS, etc. where the number
> > > of metadata queries might not be very high. In such cases it
> > > would be an acceptable tradeoff to be less efficient at answering
> > > metadata queries but very efficient for data operations.
> > >
> > > IOW, can we have two possible modes of operation for the sharding
> > > translator to answer metadata queries?
> > >
> > > 1. One that behaves like a regular filesystem where we expect a
> > > mix of data and metadata operations. Your document seems to cover
> > > that part well. We can look at optimizing behavior for
> > > multi-threaded single-writer use cases after an initial
> > > implementation is in place. Techniques like eager locking can be
> > > applied here.
> > >
> > > 2. Another mode where we do not expect a lot of metadata queries.
> > > In this mode, we can visit all nodes where we have shards to
> > > answer these queries.
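For concreteness, here is a minimal user-space sketch (plain C, not translator code) of the kind of aggregation mode 2 implies, using allocated blocks as the example query since that genuinely requires visiting every shard. The "<base>.<index>" naming scheme and the shard-count parameter are assumptions for illustration, not necessarily the shard translator's actual on-disk layout:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Sum allocated 512-byte blocks over the base file (shard 0) and
 * shards "<base>.1" .. "<base>.<n-1>". Shards missing because of
 * sparse regions simply contribute nothing. */
static long long aggregate_blocks(const char *base, unsigned long n)
{
    char path[4096];
    struct stat st;
    long long blocks = 0;

    if (stat(base, &st) == 0)
        blocks += st.st_blocks;

    for (unsigned long i = 1; i < n; i++) {
        snprintf(path, sizeof(path), "%s.%lu", base, i);
        if (stat(path, &st) == 0)
            blocks += st.st_blocks;
    }
    return blocks;
}

int main(int argc, char **argv)
{
    if (argc < 3)
        return 1;
    printf("%lld blocks\n",
           aggregate_blocks(argv[1], strtoul(argv[2], NULL, 10)));
    return 0;
}

The catch, as the reply below notes, is knowing how many shards to visit in the first place.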
> > >
> > > But for the sharding translator to be able to visit all shards,
> > > it needs to know the last shard number. Without this, it will
> > > never know when to stop looking up the different shards. For this
> > > to happen, we still need to maintain the size attribute for each
> > > file.
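A sketch of the dependency being described, under the assumption of fixed-size shard blocks (4MB is an arbitrary example value, not a claim about the actual default): the last shard number falls straight out of the size.

#include <stdio.h>

/* The index of the last shard is derivable from the file size
 * alone; without the size (or an equivalent attribute) there is
 * no safe point at which to stop looking up shards, because
 * sparse files can leave holes in the shard sequence. */
static unsigned long last_shard_index(unsigned long long size,
                                      unsigned long long block)
{
    return size ? (unsigned long)((size - 1) / block) : 0;
}

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */
    /* a 17MB file spans shards 0..4 */
    printf("%lu\n", last_shard_index(17ULL << 20, block));
    return 0;
}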
> > >
> >
> > Wouldn't maintaining the total number of shards in the metadata
> > shard be sufficient?
> >
> > Maintaining the correctness of "total number of shards" would again
> > incur the same cost as maintaining size or any other metadata
> > attribute if a client/brick crashes in the middle of a write fop
> > before the attribute is committed to disk. In other words, we will
> > again need to maintain a "dirty" and "committed" copy of the
> > shard_count to ensure its correctness.
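A minimal in-memory sketch of such a dirty/committed scheme. The struct and field names are hypothetical (on disk the two copies would presumably live in xattrs on the metadata shard), and the points where a synced write would be needed are marked in comments:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical two-copy metadata record. */
struct shard_md {
    unsigned long committed;  /* last value known to be consistent */
    unsigned long dirty;      /* in-flight value */
    bool          is_dirty;   /* true while an update is in flight */
};

/* Phase 1: record the intended new value and mark the record
 * dirty (persisted to disk before touching any shard). */
static void update_begin(struct shard_md *md, unsigned long new_count)
{
    md->dirty = new_count;
    md->is_dirty = true;
}

/* Phase 2: after the data operation succeeds, promote the dirty
 * value and clear the flag (again persisted to disk). */
static void update_commit(struct shard_md *md)
{
    md->committed = md->dirty;
    md->is_dirty = false;
}

/* Recovery: a set dirty flag after a crash means the committed
 * value may be stale and must be reconciled, e.g. by scanning
 * the shards. */
static unsigned long recover(const struct shard_md *md)
{
    if (md->is_dirty)
        printf("crash mid-update: reconciliation needed\n");
    return md->committed;
}

int main(void)
{
    struct shard_md md = { .committed = 4, .dirty = 4, .is_dirty = false };

    update_begin(&md, 5);   /* extending write will create shard 4 */
    /* ... the write to the new shard happens here ... */
    update_commit(&md);

    printf("shard count = %lu\n", recover(&md));
    return 0;
}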
> >
>
> I think maintaining "total number of shards" is not as expensive as
> maintaining size or any other metadata attribute. The shard count
> needs to be updated only when an extending operation results in the
> creation of a new shard or when a truncate operation results in the
> removal of a shard. Maintaining other metadata attributes would need
> a 5-phase transaction for every write operation. Isn't that the case?
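A sketch of the condition being described, again assuming fixed-size shard blocks with 4MB as an arbitrary example value: the shard count moves only when a write extends the file across a block boundary.

#include <stdbool.h>
#include <stdio.h>

/* Number of shards needed to hold a file of the given size. */
static unsigned long shards_for(unsigned long long size,
                                unsigned long long block)
{
    return size ? (unsigned long)((size - 1) / block + 1) : 1;
}

/* A write changes the shard count only if it extends the file
 * across a shard-block boundary. */
static bool write_adds_shards(unsigned long long old_size,
                              unsigned long long off,
                              unsigned long long len,
                              unsigned long long block)
{
    unsigned long long end = off + len;
    unsigned long long new_size = end > old_size ? end : old_size;
    return shards_for(new_size, block) > shards_for(old_size, block);
}

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */

    /* overwrite inside existing shards: count unchanged -> 0 */
    printf("%d\n", write_adds_shards(17ULL << 20, 0, 1ULL << 20, block));

    /* append past the last block boundary: count grows -> 1 */
    printf("%d\n", write_adds_shards(17ULL << 20, 20ULL << 20,
                                     1ULL << 20, block));
    return 0;
}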
>
> Even the size attribute changes only in the case of extending writes
> and truncates. In fact, Pranith and I had initially chosen to persist
> the shard count, as opposed to the size, in the first design for
> inode write fops. But the reason we decided to go with size in the
> end is to avoid an extra lookup on the last shard to find the total
> size of the file (i.e., if N is the total number of shards,
> file size = (N-1)*shard_block_size + sizeof(last shard)).
>
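The quoted formula as a worked example (same illustrative 4MB block size): persisting N alone still leaves the last term to be fetched with a stat on the last shard, whereas persisting the size answers the query directly.

#include <stdio.h>

/* file size = (N-1) * shard_block_size + sizeof(last shard) */
static unsigned long long size_from_count(unsigned long n_shards,
                                          unsigned long long block,
                                          unsigned long long last_shard_size)
{
    return (n_shards - 1) * block + last_shard_size;
}

int main(void)
{
    /* 5 shards of 4MB, the last one holding 1MB -> 17MB */
    printf("%llu\n", size_from_count(5, 4ULL << 20, 1ULL << 20));
    return 0;
}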
I am probably confused about the definition of size. For maintaining
accurate size, wouldn't we need to account for truncates and writes that
happen within the scope of one shard?
-Vijay
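A small worked example of the case being asked about, under the same illustrative assumptions as the sketches above: an extending write confined to the last shard changes the size while leaving the shard count untouched, so the two attributes are not interchangeable.

#include <stdio.h>

int main(void)
{
    unsigned long long block = 4ULL << 20;   /* 4MB example value */
    unsigned long long size  = 17ULL << 20;  /* 5 shards, 0..4    */

    /* append 1MB at EOF: still within shard 4 (16MB..20MB) */
    size += 1ULL << 20;

    unsigned long shards = (unsigned long)((size - 1) / block + 1);

    /* prints: size=18MB shards=5 -- size moved, the count did not */
    printf("size=%lluMB shards=%lu\n", size >> 20, shards);
    return 0;
}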