[Gluster-devel] dht: selfheal of missing directories on nameless (by GFID) LOOKUP
Anand Avati
avati at gluster.org
Mon May 5 21:15:40 UTC 2014
On Sun, May 4, 2014 at 10:10 PM, Raghavendra G <raghavendra at gluster.com> wrote:
> On Mon, May 5, 2014 at 12:32 AM, Anand Avati <avati at gluster.org> wrote:
>
>>
>>
>>
>> On Sun, May 4, 2014 at 9:22 AM, Niels de Vos <ndevos at redhat.com> wrote:
>>
>>> Hi,
>>>
>>> bug 1093324 has been opened and we have identified the following cause:
>>>
>>> 1. an NFS-client does a LOOKUP of a directory on a volume
>>> 2. the NFS-client receives a filehandle (contains volume-id + GFID)
>>> 3. add-brick is executed, but the new brick does not have any
>>> directories yet
>>> 4. the NFS-client creates a new file in the directory, this request is
>>> in the format of <filehandle>/<filename>, <filehandle> was received
>>> in step 2
>>> 5. the NFS-server does a LOOKUP on the parent directory identified by
>>> the filehandle - nameless LOOKUP, only GFID is known
>>> 6. the old brick(s) return successfully
>>> 7. the new brick returns ESTALE
>>> 8. the NFS-server returns ESTALE to the NFS-client
>>>
>>> In this case, the NFS-client should not receive an ESTALE. There is also
>>> no ESTALE error passed to the client when this procedure is done over
>>> FUSE or samba/libgfapi.
>>>
>>> Selfhealing a directory entry based only on a GFID is not always
>>> possible. Files do not have a unique filename (hardlinks), so it is not
>>> trivial to find a filename for a GFID (expensive operation, and the
>>> result could be a list). However, for a directory this is simpler.
>>> A directory is not hardlink'd in the .glusterfs directory, directories
>>> are maintained as symbolic-links. This makes it possible to find the
>>> name of a directory, when only the GFID is known.
>>>
>>> Currently DHT is not able to selfheal directories on a nameless LOOKUP.
>>> I think that it should be possible to change this, and to fix the ESTALE
>>> returned by the NFS-server.
>>>
>>> At least two changes would be needed, and this is where I would like to
>>> hear opinions from others about it:
>>>
>>> - The posix-xlator should be able to return the directory name when
>>> a GFID is given. This can be part of the LOOKUP-reply (dict), and that
>>> would add a readlink() syscall for each nameless LOOKUP that finds
>>> a directory. Or (suggested by Pranith) add a virtual xattr and handle
>>> this specific request with an additional FGETXATTR call.
>>>
>>
>> I think the LOOKUP-reply with readlink() is better, instead of a new
>> over-the-wire FOP.
>>
>>
>>>
>>> - DHT should selfheal the directory when at least one ESTALE is returned
>>> by the bricks.
>>
>>
>>
>> This also makes sense, except - if even the parent directory is missing
>> on that server (yet to be healed). Another important point to note is that,
>> the directories (with the same GFID) themselves may be present at various
>> locations as various dentries on the many servers. A lookup of
>> <dir-gfid>/"name" should succeed transparently independent of the differing
>> <dir-gfid>'s dentries across servers.
>>
>
> Just want to be sure, among the following two scenarios
> 1. Different <pargfid>/name combinations, having same gfid
> 2. Same <pargfid>/name combination, having different gfids
>
> are you saying 1 is legal (though only as a transient state during ops
> like rename etc)? How about 2, isn't it illegal even as a transient state
> (one should never ever see 2 at any point in time)?
>
Rename of dirs is what causes all the interesting things. 1 is inevitable,
as we just cannot rename the dir on all servers at the same time
(fundamental behavior of a distributed system). 2 is something we can avoid
if we make sure in our algorithms that the destination of a dir rename is
empty before we start. The more important thing is the observer's point of
view - if an observation on server 1 is made before a rename-dir() transaction
and an observation on server 2 is made after, you will still see mismatching
gfids etc - so it is important to be careful and not be alarmed by such
false positives.
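
To make the readlink() idea from earlier in the thread concrete, here is a minimal sketch of how a directory's name could be recovered from its GFID on a single brick. It is written in Python for brevity and is not the posix-xlator code. It assumes the handle lives at <brick>/.glusterfs/<first two hex chars>/<next two>/<gfid> and that, for a directory, this handle is a symlink whose target ends in <parent-gfid>/<basename>. The helper names gfid_handle_path and dir_name_from_gfid are hypothetical.

```python
import os


def gfid_handle_path(brick_root, gfid):
    """Build the .glusterfs handle path for a GFID (assumed on-disk layout)."""
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def dir_name_from_gfid(brick_root, gfid):
    """Return (parent_gfid, basename) for a directory GFID, or None.

    None means the handle is missing on this brick, or it is not a
    symlink (for example a regular file, whose handle is a hardlink
    and therefore has no single name).
    """
    handle = gfid_handle_path(brick_root, gfid)
    try:
        # Assumed target form: ../../<pp>/<qq>/<parent-gfid>/<basename>
        target = os.readlink(handle)
    except OSError:
        return None
    parent_path, basename = os.path.split(target)
    parent_gfid = os.path.basename(parent_path)
    return parent_gfid, basename
```

Note that the result is only this brick's view of the dentry. As discussed above, another brick may legitimately report a different <parent-gfid>/<basename> for the same GFID while a rename is in flight, so a caller should not treat such a mismatch as corruption.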