[Gluster-devel] 1402538 : Assertion failure during rebalance of symbolic links

Xavier Hernandez xhernandez at datalab.es
Wed Dec 14 08:02:04 UTC 2016


On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:
>
>
> ----- Original Message -----
>> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> To: "Ashish Pandey" <aspandey at redhat.com>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>, "Shyam Ranganathan" <srangana at redhat.com>, "Nithya Balachandran"
>> <nbalacha at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>, "Raghavendra Gowdappa" <rgowdapp at redhat.com>
>> Sent: Tuesday, December 13, 2016 9:29:46 PM
>> Subject: Re: 1402538 : Assertion failure during rebalance of symbolic links
>>
>> On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey <aspandey at redhat.com> wrote:
>>
>>> Hi All,
>>>
>>> We have been seeing an issue where rebalancing symbolic links leads to an
>>> assertion failure in an EC volume.
>>>
>>> The root cause is that while migrating a symbolic link to another
>>> subvolume, rebalance creates a link file (with attributes .........T).
>>> This file is a regular file.
>>> Now, during migration a setattr comes in for this link and, because of a
>>> possible race, posix_stat returns the stats of this "T" file.
>>> In ec_manager_setattr, we receive the callbacks and check the type of the
>>> entry. If it is a regular file we try to get its size, and if the size is
>>> not there we raise an assert.
>>> So, basically, we end up checking the size of a symbolic link (which has
>>> no size) because it has been returned as a regular file, and we hit the
>>> assert when that happens.
>>>
>>> Now, this looks like a problem with rebalance that is difficult to fix at
>>> this point (as per the discussion).
>>> We have an alternative fix in EC, but that would be more of a hack than an
>>> actual fix. We should not modify EC to deal with an issue that belongs to
>>> another translator.
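
For context, the check Ashish describes in ec_manager_setattr() is roughly
the following (a simplified sketch from memory, not the exact source;
'locked_inode' is just a placeholder for the inode of the lock held by the
fop):

    /* Approximate sketch of the check in ec_manager_setattr() that
     * raises the assert. */
    if (cbk->iatt[0].ia_type == IA_IFREG) {
        /* ec only keeps a size for regular files. The brick answered
         * with the linkto file (a regular file), but ec knows this
         * inode as a symlink, so no size is stored and the assert
         * fires. */
        GF_ASSERT(ec_get_inode_size(fop, locked_inode,
                                    &cbk->iatt[0].ia_size));
    }
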
>
> I am afraid dht doesn't have a better way of handling this. While DHT maintains the abstraction (of a symbolic link) towards the layers above, the layers below it cannot be shielded from seeing details like a linkto file.

That's ok, and I think it's the right thing to do. From the point of 
view of EC, it's irrelevant how the file is seen by upper layers. It 
only cares about the files below it.

> If the concern really is that the file changes its type within the span of a single fop, we can probably explore the option of locking (or other synchronization mechanisms) to prevent migration from taking place while a fop is in progress.

That's the real problem. Some operations receive an inode referencing a
symbolic link on input, but the iatt structures from the callback
reference a regular file. It's even worse because the race is
asynchronous, so some of the bricks may return a regular file and some
may return a symbolic link. If more than 'redundancy' bricks return a
different type, the most probable result will be an I/O error caused by
the inconsistent answers.
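
To put numbers on it (a hypothetical 4+2 configuration, not the exact
setup from the bug report), ec needs at least 'fragments' matching
answers to build a reply:

    /* Hypothetical 4+2 dispersed volume hit by the race. */
    int fragments  = 4;    /* matching answers needed for a good reply */
    int redundancy = 2;
    int bricks_seeing_symlink = 3; /* still see the original symlink   */
    int bricks_seeing_regular = 3; /* already see the linkto file      */

    if (bricks_seeing_symlink < fragments &&
        bricks_seeing_regular < fragments) {
        /* No group of matching answers reaches 'fragments', so the
         * combined answer is an I/O error (EIO). */
    }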

Ashish wrote a patch to check the type of the input inode instead of
relying on the answers. While this could avoid the assertion issued by
ec, it doesn't solve the race, leaving room for the I/O errors I
mentioned earlier.
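
If I understood the patch correctly, the idea is along these lines (my
simplified reading, not the literal change):

    /* Decide based on the type of the inode ec received in the fop,
     * not on the type reported back by the bricks. */
    if (fop->loc[0].inode->ia_type == IA_IFREG) {
        GF_ASSERT(ec_get_inode_size(fop, fop->loc[0].inode,
                                    &cbk->iatt[0].ia_size));
    }
    /* The assert is avoided for symlinks, but the bricks still
     * disagree, so the combined answer can still be EIO. */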

> But, I assume there will be performance penalties for that too.

Yes. I don't see any other way to really solve this problem. A lock is 
needed.

In ec we already have a problem that will need an additional lock on
rmdir, unlink and rename to avoid some races. That change will also need
support from the locks xlator to avoid granting locks on deleted inodes.
If dht is using one of these operations to replace the symbolic link
with the gluster link file, I think this change could solve the I/O
errors, but I'm not sure we could completely solve the problem.

I'm not sure how dht transforms a symbolic link into a gluster link
file, but if it involves more than one fop from the point of view of ec,
there's nothing ec can do to solve the problem. If another client
accesses the file, ec can return any intermediate state. DHT should take
some lock so that all the operations happen atomically and don't cause
problems on other clients.
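
Purely as an illustration of the shape I mean (the helper names below
are hypothetical, this is not an existing dht API):

    /* Hypothetical serialization of the symlink -> linkto replacement
     * against other fops on the same inode. */
    dht_migration_inodelk(this, loc);       /* take an inodelk first   */
    dht_create_linkto_file(this, loc);      /* create the ....T file   */
    dht_finish_symlink_migration(this, loc);
    dht_migration_unlock(this, loc);        /* only now do other fops
                                               see a consistent type   */

The cost is the extra lock round-trip per migrated entry, which is the
performance penalty Raghavendra mentions.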

I think the mid-term approach to completely solve the problem without a
performance impact should be to implement some kind of transaction
mechanism that reuses lock requests. This would allow, among other
things, multiple atomic operations to be performed by different xlators
while sharing the locks, instead of requiring each xlator to take an
inodelk of its own.
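
Something like this, just to give an idea (a hypothetical interface,
none of these functions exist today):

    /* Hypothetical transaction API where xlators share one inodelk. */
    gf_txn_t *txn;

    txn = gf_txn_start(frame, inode);     /* takes the inodelk once    */
    gf_txn_add(txn, dht_replace_symlink); /* dht's steps, under the    */
    gf_txn_add(txn, ec_update_metadata);  /* same lock reused by ec    */
    gf_txn_commit(txn);                   /* releases the lock at end  */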

Xavi

>
>>>
>>> Now the question is how to proceed with this? Any suggestions?
>>>
>>
>> Raghavendra/Nithya,
>>          Could one of you explain the difficulties in fixing this issue in
>> DHT, so that Xavi is also caught up on why we should add this change to
>> EC in the short term.
>>
>>
>>>
>>> Details on this bug can be found here -
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1402538
>>>
>>> ----
>>> Ashish
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Pranith
>>


