[Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Dec 14 09:17:47 UTC 2016
On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez <xhernandez at datalab.es>
wrote:
> There's another issue with the patch that Ashish sent.
>
> The original problem is that a setattr on a symbolic link gets transformed
> to a regular file while the fop is being executed. Even if we apply the
> Ashish' patch to avoid the assert, the setattr fop will still succeed and
> incorrectly change the attributes of a gluster special file that shouldn't
> change.
>
> I think that's a bigger problem that needs to be addressed globally.
>
> I'm sure this is not an easy solution, but probably the best way would be
> to have distinct inodes for the gluster link files and the original file.
> This way most of these problems should be solved.
>
Is there any reason why there is a difference in type of the file on
hashed/cached subvols? We can have the same type of file on both dht
subvolumes? That will prevent unlink of regular file and recreate with the
actual type of the file?
>
> Xavi
>
>
> On 12/14/2016 09:02 AM, Xavier Hernandez wrote:
>
>> On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:
>>
>>>
>>>
>>> ----- Original Message -----
>>>
>>>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>> To: "Ashish Pandey" <aspandey at redhat.com>
>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Shyam Ranganathan"
>>>> <srangana at redhat.com>, "Nithya Balachandran"
>>>> <nbalacha at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>,
>>>> "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>>>> Sent: Tuesday, December 13, 2016 9:29:46 PM
>>>> Subject: Re: 1402538 : Assertion failure during rebalance of symbolic
>>>> links
>>>>
>>>> On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey <aspandey at redhat.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>>>
>>>>> We have been seeing an issue where re balancing symbolic links leads
>>>>> to an
>>>>> assertion failure in EC volume.
>>>>>
>>>>> The root cause of this is that while migrating symbolic links to
>>>>> other sub
>>>>> volume, it creates a link file (with attributes .........T) .
>>>>> This file is a regular file.
>>>>> Now, during migration a setattr comes to this link and because of
>>>>> possible
>>>>> race, posix_stat return stats of this "T" file.
>>>>> In ec_manager_seattr, we receive callbacks and check the type of
>>>>> entry. If
>>>>> it is a regular file we try to get size and if it is not there, we
>>>>> raise an
>>>>> assert.
>>>>> So, basically we are checking a size of the link (which will not have
>>>>> size) which has been returned as regular file and we are ending up when
>>>>> this condition
>>>>> becomes TRUE.
>>>>>
>>>>> Now, this looks like a problem with re balance and difficult to fix at
>>>>> this point (as per the discussion).
>>>>> We have an alternative to fix it in EC but that will be more like a
>>>>> hack
>>>>> than an actual fix. We should not modify EC
>>>>> to deal with an individual issue which is in other translator.
>>>>>
>>>>
>>> I am afraid, dht doesn't have a better way of handling this. While DHT
>>> maintains abstraction (of a symbolic link) to layers above, the layers
>>> below it cannot be shielded from seeing the details like a linkto file
>>> etc.
>>>
>>
>> That's ok, and I think it's the right thing to do. From the point of
>> view of EC, it's irrelevant how the file is seen by upper layers. It
>> only cares about the files below it.
>>
>> If the concern really is that the file is changing its type in a span
>>> of single fop, we can probably explore the option of locking (or other
>>> synchronization mechanisms) to prevent migration taking place, while a
>>> fop is in progress.
>>>
>>
>> That's the real problem. Some operations receive an inode referencing a
>> symbolic link on input but the iatt structures from the callback
>> reference a regular file. It's even worse because it's an asynchronous
>> race so some of the bricks may return a regular file and some may return
>> a symbolic link. If there are more than redundancy bricks returning a
>> different type, the most probably result will be an I/O error caused by
>> inconsistent answers.
>>
>> Ashish wrote a patch to check the type of the inode at the input instead
>> of relying on the answers. While this could avoid the assertion issued
>> by ec, it doesn't solve the race, leaving room for the I/O errors I
>> mentioned earlier.
>>
>> But, I assume there will be performance penalties for that too.
>>>
>>
>> Yes. I don't see any other way to really solve this problem. A lock is
>> needed.
>>
>> In ec we already have a problem that will need an additional lock on
>> rmdir, unlink and rename to avoid some races. This change will also need
>> support from locks xlator to avoid granting locks on deleted inodes. If
>> dht is using one of these operations to replace the symbolic link by the
>> gluster link file, I think this change could solve the I/O errors, but
>> I'm not sure we could completely solve the problem.
>>
>> I'm not sure how dht does the transform from a symbolic link to a
>> gluster link file, but if it involves more than one fop from the point
>> of view of ec, there's nothing that ec can do to solve the problem. If
>> another client accesses the file, ec can return any intermediate state.
>> DHT should take some lock to do all operations atomically and avoid
>> problems on other clients.
>>
>> I think that the mid-term approach to completely solve the problem
>> without a performance impact should be to implement some kind of
>> transaction mechanism that will reuse lock requests. This would allow,
>> among other things, that multiple atomic operations could be performed
>> by different xlators but sharing the locks instead of requiring each
>> xlator to take an inodelk on its own.
>>
>> Xavi
>>
>>
>>>
>>>>> Now the question is how to proceed with this? Any suggestions?
>>>>>
>>>>>
>>>> Raghavendra/Nithya,
>>>> Could one of you explain the difficulties in fixing this
>>>> issue in
>>>> DHT so that Xavi will also be caught up with why we should add this
>>>> change
>>>> in EC in the short term.
>>>>
>>>>
>>>>
>>>>> Details on this bug can be found here -
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1402538
>>>>>
>>>>> ----
>>>>> Ashish
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Pranith
>>>>
>>>>
>>
>
--
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.gluster.org/pipermail/gluster-devel/attachments/20161214/be5f5400/attachment.html>
More information about the Gluster-devel
mailing list