[Bugs] [Bug 1341429] successful mkdir from "bad" subvolume should be ignored while propagating result to higher layer

bugzilla at redhat.com bugzilla at redhat.com
Mon Jun 6 10:43:31 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1341429



--- Comment #1 from Raghavendra G <rgowdapp at redhat.com> ---
The following is the transcript of a discussion I had with Xavi about whether
this problem is present in EC. The conclusion was that EC doesn't suffer from
this problem.

15:07 < raghug> xavih: my concern with patch #13885 is that pre-op checks
(xattr comparison) are done only at storage/posix
15:07 < raghug> but in a clustered scenario like afr/ec, the brick (by itself)
doesn't know the "correctness" of the xattrs present on it
15:08 < raghug> So, whatever decision the brick makes can't be trusted
15:08 < xavih> raghug: yes, it only knows its own state, and it should act
depending on it
15:09 < raghug> xavih: yes. 
15:09 < raghug> xavih: consider this scenario
15:09 < xavih> raghug: in this case afr/ec will combine the answers as
appropriate and give a "good" answer to the upper layer. Bricks not matching
the "good" answer will be marked as bad
15:12 < raghug> xavih: just for clarification, here is an example I gave out in
the bug report of bz 1341429
15:12 < raghug> [#gluster] xavih: what about rollback? If the posix has already
created the directory, but ec/afr considers the answer as bad, we need to
delete the directory
15:12 < raghug> raghug: sorry ignore last statement
15:12 < raghug> For pre-op checks like [1], dentry operations like mkdir etc.
would rely on xattrs on the parent (like the dht layout). However, a "bad"
subvolume of afr cannot make a correct decision during the pre-op check, as
its xattrs are not guaranteed to be correct. Imagine the following scenario:
15:12 < raghug> a. mkdir succeeds on non-readable subvolume
15:12 < raghug> b. fails on readable subvolume (maybe because the layout
xattrs didn't match).
15:12 < raghug> afr here would report the mkdir as a success to the parent
xlators (here dht). However, since the non-readable subvolume is not
guaranteed to have correct xattrs, the mkdir shouldn't have succeeded. Worse
still, if the mkdir on the readable subvolume had failed because the pre-op
check failed (the client's in-memory layout xattrs and the layout xattrs
persisted on disk didn't match), we would be ignoring a genuine failure and
instead transforming it into a success.
15:13 < raghug> you can use readable/good interchangeably
15:13 < raghug> same with non-readable/bad subvols
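
To make the hazard above concrete, here is a standalone sketch (not actual
afr code; the struct and function names are invented for illustration).
Combining replies naively by "any child succeeded" lets a success from a
non-readable subvolume mask a genuine pre-op failure on the readable one:

    #include <stdio.h>

    struct reply {
        int op_ret;   /* 0 on success, -1 on failure */
        int readable; /* 1 if this subvolume is known-good ("readable") */
    };

    /* naive combination: report success if any child succeeded */
    static int
    combine_naive(struct reply *r, int n)
    {
        for (int i = 0; i < n; i++)
            if (r[i].op_ret == 0)
                return 0;
        return -1;
    }

    /* safer combination: only replies from readable subvolumes count */
    static int
    combine_readable_only(struct reply *r, int n)
    {
        for (int i = 0; i < n; i++)
            if (r[i].readable)
                return r[i].op_ret;
        return -1;
    }

    int
    main(void)
    {
        /* mkdir succeeded on the bad subvol, pre-op failed on the good one */
        struct reply replies[2] = { { 0, 0 }, { -1, 1 } };

        /* prints 0: a genuine failure is reported upward as a success */
        printf("naive: %d\n", combine_naive(replies, 2));
        /* prints -1: the failure on the readable subvol is preserved */
        printf("readable-only: %d\n", combine_readable_only(replies, 2));
        return 0;
    }
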
15:13 < xavih> raghug: as I understand the changes in storage/posix, mkdir
*cannot* succeed on a bad brick
15:15 < xavih> raghug: if the xattr check fails, it returns an error without
creating the directory
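
The check xavih describes can be sketched roughly like this (a minimal,
simplified stand-in for the storage/posix change in patch #13885;
mkdir_with_preop_check and its arguments are illustrative, not the patch's
actual interface):

    #include <errno.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    static int
    mkdir_with_preop_check(const char *parent, const char *path,
                           const char *xattr_name,
                           const void *cached_val, size_t cached_len)
    {
        char ondisk[256];
        ssize_t len;

        /* read the xattr currently stored on the parent directory */
        len = lgetxattr(parent, xattr_name, ondisk, sizeof(ondisk));
        if (len < 0)
            return -errno;

        /* pre-op check: the client's cached value must match on-disk */
        if ((size_t)len != cached_len ||
            memcmp(ondisk, cached_val, cached_len) != 0)
            return -ESTALE; /* stale layout: fail, directory NOT created */

        /* only a brick whose xattr matched goes on to create the dir */
        if (mkdir(path, 0755) < 0)
            return -errno;

        return 0;
    }
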
15:15 < raghug> xavih: what if a "bad" brick has the "stale" xattrs
coincidentally cached by the client too?
15:16 < raghug> it can happen with frequent changes of xattrs, I suppose.
thinking of a concrete example..
15:17 < xavih> raghug: do you mean that directories with the same xattr can be
good and bad ? shouldn't that xattr value be unique in some way ?
15:17 < xavih> raghug: otherwise this preop check seems weak...
15:19 < raghug> For simplicity, let's consider the case of a 3-way replica:
b1, b2 and b3
15:19 < xavih> raghug: ok
15:19 < raghug> Let's say a client c1 has cached the xattr value x1
15:20 < xavih> raghug: when you say "cached" do you mean that it has stored
that value on the three bricks ?
15:20 < raghug> xavih: no. The client has stored the value in its memory
15:20 < raghug> in the inode-ctx of directory/file lets say
15:21 < xavih> raghug: without sending the request to the bricks and checking
the answer ?
15:21 < xavih> raghug: I think this is not possible
15:22 < raghug> xavih: it would have sent the request (say a getxattr/lookup).
But let's say there is a future fop (say mkdir) that relies on this xattr. In
this time window, there is a possibility that the value can change on the
brick
15:22 < raghug> that is the "staleness" 13885 is addressing
15:23 < xavih> raghug: yes, this is possible, but I still don't see the
problem...
15:23 < raghug> yes.. here is the scenario
15:24 < raghug> 15:19 < raghug> For simplicity, let's consider the case of a
3-way replica: b1, b2 and b3
15:24 < raghug> 15:19 < xavih> raghug: ok
15:24 < raghug> 15:19 < raghug> Let's say a client c1 has cached the xattr
value x1
15:24 < raghug> are you ok till now?
15:25 < raghug> so, we've three bricks b1, b2 and b3
15:25 < raghug> all three part of a replica
15:25 < raghug> Now imagine a client c1 has read an xattr with value x1 and it
has cached that value in its memory
15:25 < xavih> raghug: I don't understand the xattr being cached. AFAIK
neither afr nor ec caches regular xattrs beyond those needed for its own
internal use...
15:26 < xavih> raghug: anyway, let's assume it's cached...
15:26 < raghug> xavih: dht is caching them
15:26 < xavih> raghug: because dht needs it...
15:26 < raghug> the xattr being cached is layout of the directory
15:26 < raghug> xavih: yes
15:26 < raghug> xavih: any doubts till now?
15:26 < xavih> raghug: afr and ec don't need it, so they don't cache it
15:27 < xavih> raghug: no, it's ok
15:27 < raghug> xavih: ok. So, now another client c2 changes the same xattr
value to x2. But it succeeds only on b2 and b3
15:27 < xavih> raghug: ok
15:27 < raghug> so, b1 has x1. b2 and b3 have x2 
15:28 < raghug> also dht on client c1 has cached value x1 
15:28 < xavih> raghug: ok
15:29 < raghug> so now when dht (from client c1) tries to do a mkdir using the
same xattr, mkdir succeeds on "bad" brick b1 but fails on b2 and b3
15:29 < xavih> raghug: in that case (I will speak for ec, not so sure for afr),
when dht on c1 tries to create a directory it will receive an EIO error with
the xattr indicating that preop failed
15:30 < xavih> raghug: from the ec side, it will see b2 and b3 failing, and b1
succeeding
15:30 < xavih> raghug: b1 will be marked as bad, and self-heal will take care
of it
15:31 < raghug> xavih: yes. With quorum this problem is mitigated
15:31 < raghug> But with a two-way replica, this would be a problem
15:31 < raghug> I was also worried about "rollback"
15:31 < raghug> (here deleting directory from b1)
15:32 < xavih> raghug: I can't talk for afr. Pranith should know better
15:32 < xavih> raghug: the removal of directory will happen, but it could be
delayed
15:32 < raghug> xavih: thanks. Let me think over what we discussed. Will get
back to you if I find any issues. Thanks :)
15:33 < xavih> raghug: anyway, since the parent directory has been marked as
bad, future readdirs won't read contents from that brick
15:34 < xavih> raghug: so the "bad" newly created directory won't be visible to
the user (nor dht)
15:34 < raghug> xavih: ok. got it
15:35 < raghug> xavih: what if there is a quorum of "bad" bricks?
15:35 < xavih> raghug: this self-heal will also heal the "outdated" xattr that
caused the incorrect creation of the directory in the first place
15:35 < xavih> raghug: there cannot be a quorum of "bad" bricks
15:35 < raghug> in this case let's say for some reason b2 also had the same
"stale" xattr and the mkdir succeeded on it
15:36 < raghug> xavih: hmm..
15:36 < xavih> raghug: if there were a quorum of bad bricks, it means that
another client has failed on more than a half of the bricks
15:36 < raghug> then, I suppose that op itself is a failure
15:36 < raghug> (from another client)
15:36 < xavih> raghug: but in that case, the "bad" bricks will be the ones on
which the mkdir from the other client succeeded
15:37 < xavih> raghug: following your example, to try to have a majority of
"bad" bricks, c2 should have failed on b1 and b2, and succeeded only on b3,
right ?
15:37 < raghug> xavih: yes
15:38 < xavih> raghug: in that case, c2 will receive an error, indicating that
the mkdir has failed
15:38 < raghug> ok
15:38 < xavih> raghug: this means that the good bricks are really b1 and b2
15:38 < raghug> xavih: yes. Got it
15:39 < xavih> raghug: so c1 won't see them as bad, and it will succeed on
them without problems
15:39 < xavih> raghug: everything is consistent on both clients
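
The quorum argument worked through above reduces to a simple majority test (a
toy sketch; ec's real answer-combining logic is considerably more involved):

    #include <stdio.h>

    /* an answer is "good" only if more than half the bricks agree on it */
    static int
    have_quorum(int agreeing, int total)
    {
        return agreeing > total / 2;
    }

    int
    main(void)
    {
        /* 3-way replica: b1 succeeded (stale xattr), b2 and b3 failed */
        printf("b1 alone:  %s\n", have_quorum(1, 3) ? "good" : "bad"); /* bad  */
        printf("b2 and b3: %s\n", have_quorum(2, 3) ? "good" : "bad"); /* good */
        return 0;
    }
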
15:39 < raghug> xavih: on a slightly unrelated issue, I am planning to move
the locking to the bricks as future work. So, EC wouldn't be witnessing the
lock that synchronizes modification of the xattr (this lock is again taken by
dht). Do you think there would be any problem with that?
15:39 < raghug> (ref: patch 13885)
15:41 < raghug> with this lock, no dht will be able to modify the xattr for
the duration of the lock
15:41 < raghug> (no dht from other clients)
15:41 < xavih> raghug: that might be problematic
15:41 < raghug> xavih: ...
15:44 < xavih> raghug: ec takes the necessary inodelks to guarantee integrity,
but if you move the locking from dht to the bricks, it means that two clients
could try to change the same xattr. EC guarantees that all bricks will be
updated in the same order, but it cannot guarantee that a lock taken after it
(in the brick) will be honored
15:44 < xavih> raghug: how do you plan to move locking to the bricks ? maybe
this could be used by ec as well.
15:44 < xavih> raghug: I doubt this could be transparent to ec
15:45 < raghug> xavih: dht's directory self-heal takes a lock on _all_ its
subvols
15:46 < raghug> so, a lock on one subvol would prevent self-heal
15:46 < raghug> but the catch is whether an EC subvol would choose the same
brick as lock server
15:47 < raghug> In the context of #13885..
15:47 < raghug> the mkdir would've been wound to b1, b2 and b3
15:48 < xavih> raghug: from the point of view of DHT, an EC subvolume should be
seen as a single brick, however if a feature crosses EC "hidden" in an xdata,
special care must be taken...
15:48 < raghug> and all three bricks would 1. acquire an inodelk on the same
brick, 2. do the pre-op check (xattr comparison), 3. release the inodelk
15:49 < raghug> note that each brick is acquiring an inodelk only on itself
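
The per-brick sequence raghug outlines might look roughly like this (a sketch
that uses flock(2) on the parent directory as a stand-in for the brick-side
inodelk, which would really go through the locks xlator; it reuses the
hypothetical mkdir_with_preop_check helper from the earlier sketch):

    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* hypothetical helper from the earlier pre-op sketch */
    extern int mkdir_with_preop_check(const char *parent, const char *path,
                                      const char *xattr_name,
                                      const void *cached_val,
                                      size_t cached_len);

    static int
    locked_preop_mkdir(const char *parent, const char *path,
                       const char *xattr_name,
                       const void *cached_val, size_t cached_len)
    {
        int ret;
        int fd = open(parent, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
            return -errno;

        /* 1. acquire the lock on this brick only */
        if (flock(fd, LOCK_EX) < 0) {
            ret = -errno;
            goto out;
        }

        /* 2. pre-op check + mkdir while the xattr cannot change under us */
        ret = mkdir_with_preop_check(parent, path, xattr_name,
                                     cached_val, cached_len);

        /* 3. release the lock */
        flock(fd, LOCK_UN);
    out:
        close(fd);
        return ret;
    }
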
15:49 < xavih> raghug: in this special case for mkdir, it shouldn't be a
problem, but the locking issue should be analyzed again
15:50 < xavih> raghug: oh, I see
15:50 < raghug> xavih: it's not just mkdir. We've plans to expand this logic
to all dentry operations that depend on the layout (create, unlink, rename,
rmdir etc)
15:50 < xavih> raghug: I should analyze it deeper. At first glance, I think
there could be some deadlocks
15:50 < raghug> xavih: no problem. We can meet later to discuss
15:51 < raghug> It's sufficient for now that you understand the scope of the
problem/solution we are trying to implement in dht
15:51 < xavih> raghug: when you have something more detailed, I can look at it.
15:52 < raghug> xavih: roughly, you can expect patches similar to #13885 for
other dentry operations
15:52 < raghug> create, link, unlink, rmdir, symlink, mknod etc
15:53 < xavih> raghug: good. I don't foresee any problem with them, but you can
add me as a reviewer so that I'll be able to check each of them
15:53 < raghug> there is also the special case of lk, which I think we can
take up once you are clear about the dentry ops
15:54 < raghug> when lk is issued on a directory, dht has to choose a
"lock-server" that all the clients agree upon
15:54 < raghug> (as directory is present on all subvols)
15:54 < raghug> we are planning to have a "hashed-subvol" as the lock-server
15:54 < raghug> and this hashed-subvol is dependent on layout xattr
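
The hashed-subvol scheme can be sketched as follows (a toy example; dht's
actual lookup of the hashed subvolume goes through its layout code, e.g.
dht_layout_search(), and the hash and types below are simplified stand-ins).
Because the choice depends on the layout xattr, clients with different cached
layouts can disagree on the lock-server, which is the breakage mentioned at
the end of the discussion:

    #include <stdint.h>

    struct layout_range {
        uint32_t start;
        uint32_t stop;
        int      subvol; /* index of the subvolume owning this hash range */
    };

    /* toy hash; dht really uses its own hashing of the basename */
    static uint32_t
    toy_hash(const char *name)
    {
        uint32_t h = 5381;

        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h;
    }

    /* pick the subvol whose layout range covers the name's hash */
    static int
    pick_lock_server(const char *name, const struct layout_range *ranges,
                     int n)
    {
        uint32_t h = toy_hash(name);

        for (int i = 0; i < n; i++)
            if (h >= ranges[i].start && h <= ranges[i].stop)
                return ranges[i].subvol;
        return -1; /* anomalous layout: no range covers the hash */
    }
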
15:55 < xavih> raghug: wouldn't this cause problems if that subvol dies ?
15:55 < raghug> xavih: yes. it's a problem
15:55 < raghug> xavih: ideally we should acquire lock on _all_ subvols
15:55 < xavih> raghug: yes
15:55 < raghug> but it's complicated and poses scalability issues
15:55 < raghug> so, I assume that's a tradeoff we have to live with
15:56 < xavih> raghug: and using a subset of the subvolumes ? (2 or more)
15:56 < raghug> xavih: yes, that's also a variant we are thinking of
15:56 < raghug> but the bare minimum would be to agree on at least one subvol
as the "lock-server"
15:57 < xavih> ok
15:57 < raghug> the current code is broken there too
15:57 < raghug> and can sometimes lead to different "lock-server(s)" even when
all subvols are up
15:58 < raghug> note that these are corner cases
