[Gluster-devel] dht: selfheal of missing directories on nameless (by GFID) LOOKUP

Raghavendra G raghavendra at gluster.com
Mon May 5 05:10:43 UTC 2014


On Mon, May 5, 2014 at 12:32 AM, Anand Avati <avati at gluster.org> wrote:

>
>
>
> On Sun, May 4, 2014 at 9:22 AM, Niels de Vos <ndevos at redhat.com> wrote:
>
>> Hi,
>>
>> bug 1093324 has been opened and we have identified the following cause:
>>
>> 1. an NFS-client does a LOOKUP of a directory on a volume
>> 2. the NFS-client receives a filehandle (contains volume-id + GFID)
>> 3. add-brick is executed, but the new brick does not have any
>>    directories yet
>> 4. the NFS-client creates a new file in the directory; this request is
>>    in the format <filehandle>/<filename>, where <filehandle> was
>>    received in step 2 (the filehandle layout is sketched after this
>>    list)
>> 5. the NFS-server does a LOOKUP on the parent directory identified by
>>    the filehandle - nameless LOOKUP, only GFID is known
>> 6. the old brick(s) return successfully
>> 7. the new brick returns ESTALE
>> 8. the NFS-server returns ESTALE to the NFS-client
>>
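For illustration, a simplified model of why step 5 is "nameless": the
filehandle held by the NFS-client carries only the volume identity and
the GFID, no path components. Field names below are illustrative, not
the exact gNFS structure.

    /* Simplified model of a Gluster NFS filehandle (illustrative only;
     * the real gNFS structure differs in layout and padding). */
    #include <stdint.h>

    typedef struct {
            uint8_t exportid[16];  /* identifies the exported volume */
            uint8_t gfid[16];      /* identifies the inode; no name/path */
    } nfs_fh_sketch_t;

    /* A CREATE of <filehandle>/<filename> therefore forces the server
     * to LOOKUP the parent by GFID alone (a "nameless" LOOKUP) before
     * it can resolve where the new file should go. */
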
>> In this case, the NFS-client should not receive an ESTALE. There is also
>> no ESTALE error passed to the client when this procedure is done over
>> FUSE or samba/libgfapi.
>>
>> Selfhealing a directory entry based only on a GFID is not always
>> possible. Files do not have unique filenames (hardlinks), so finding
>> a filename for a given GFID is not trivial (an expensive operation,
>> and the result could be a list). For a directory, however, this is
>> simpler: a directory is not hardlink'd in the .glusterfs directory,
>> it is maintained as a symbolic-link whose target encodes the parent's
>> GFID and the directory's own name. This makes it possible to find the
>> name of a directory when only the GFID is known, as sketched below.
>>
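For reference, a minimal user-space sketch of that resolution, assuming
the usual .glusterfs layout where a directory's GFID entry is a symlink
whose target ends in <parent-gfid>/<dirname> (error handling and GFID
formatting are simplified):

    /* Resolve a directory's basename from its GFID via the .glusterfs
     * symlink. Sketch only; the real logic would live in the posix
     * xlator and needs proper error handling. */
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int dir_name_from_gfid(const char *brick_path, const char *gfid_str,
                           char *name_out, size_t name_len)
    {
            char link_path[PATH_MAX];
            char target[PATH_MAX];
            ssize_t n;
            char *base;

            /* .glusterfs/<first 2 chars>/<next 2 chars>/<full gfid> */
            snprintf(link_path, sizeof(link_path),
                     "%s/.glusterfs/%.2s/%.2s/%s",
                     brick_path, gfid_str, gfid_str + 2, gfid_str);

            n = readlink(link_path, target, sizeof(target) - 1);
            if (n < 0)
                    return -1;
            target[n] = '\0';

            /* target looks like ../../<p1>/<p2>/<parent-gfid>/<dirname>;
             * the last component is the directory's name, and the one
             * before it is the parent's GFID. */
            base = strrchr(target, '/');
            if (base == NULL)
                    return -1;
            snprintf(name_out, name_len, "%s", base + 1);
            return 0;
    }

This is the one extra readlink() syscall per nameless directory LOOKUP
mentioned below.
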
>> Currently, DHT is not able to selfheal directories on a nameless LOOKUP.
>> I think it should be possible to change this and thereby fix the ESTALE
>> returned by the NFS-server.
>>
>> At least two changes would be needed, and this is where I would like to
>> hear opinions from others:
>>
>> - The posix-xlator should be able to return the directory name when
>>   a GFID is given. This can be part of the LOOKUP-reply (dict), and that
>>   would add a readlink() syscall for each nameless LOOKUP that finds
>>   a directory. Or (suggested by Pranith) add a virtual xattr and handle
>>   this specific request with an additional FGETXATTR call.
>>
>
> I think the LOOKUP-reply with readlink() is better, instead of a new
> over-the-wire FOP.
>
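A possible usage sketch, building on the dir_name_from_gfid() helper
above. In posix_lookup() the recovered name would be attached to the
reply dict under some agreed key; the key name used here is
hypothetical, and the brick path and GFID are examples only:

    /* What the posix xlator would do for a nameless LOOKUP that
     * resolves to a directory (standalone demo; in the xlator the
     * result would go into the LOOKUP reply dict instead). */
    #include <stdio.h>

    int dir_name_from_gfid(const char *brick_path, const char *gfid_str,
                           char *name_out, size_t name_len);  /* above */

    int main(void)
    {
            char name[256];

            if (dir_name_from_gfid("/bricks/brick1",
                                   "d4f1d2c0-1234-5678-9abc-def012345678",
                                   name, sizeof(name)) == 0)
                    printf("directory name for GFID: %s\n", name);
                    /* in posix_lookup() this would instead be, e.g.:
                     * dict_set_dynstr(xdata, "glusterfs.dht.dir-name",
                     *                 gf_strdup(name)); */
            return 0;
    }
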
>
>>
>> - DHT should selfheal the directory when at least one ESTALE is returned
>>   by the bricks.
>
>
>
> This also makes sense, except when even the parent directory is missing on
> that server (yet to be healed). Another important point to note is that
> directories with the same GFID may be present at various locations, as
> different dentries, on the many servers. A lookup of <dir-gfid>/"name"
> should succeed transparently, independent of how <dir-gfid>'s dentries
> differ across servers.
>

Just to be sure, between the following two scenarios:

1. different <pargfid>/name combinations having the same gfid
2. the same <pargfid>/name combination having different gfids

are you saying 1 is legal (though only as a transient state, during
operations like rename)? How about 2: isn't it illegal even as a
transient state (one should never see 2 at any point in time)?



>
> However, if you want to heal, the choice of server from which you
> select the dir's parent and name becomes important, as the self-heal
> will impose that choice on the other servers. For example, one of the
> AFR subvolumes may not yet have healed the parent directories. Or, the
> N-1 servers may each return a different par-gfid/dir-name in the LOOKUP
> reply. So it can quickly get hairy.
>
> As a general approach, using the LOOKUP-reply to send parent info from the
> posix level makes sense. But we also need a more detailed proposal on how
> that info is used at the cluster xlator levels to achieve a higher-level
> goal, like self-heal.
>
>
>> When all bricks return ESTALE, the ESTALE is valid and
>>   should be passed on to the upper layers (NFS-server -> NFS-client).
>>
>
> Yes.
>
> Thanks
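
To make that decision rule concrete, a standalone sketch of the
aggregation logic (types and names are simplified; in DHT this would
sit in the lookup callback once all subvolumes have replied):

    #include <errno.h>

    typedef enum { HEAL_NONE, HEAL_DIRECTORY } heal_action_t;

    /* estale_count: subvolumes that returned ESTALE
     * subvol_count: total subvolumes queried
     * *op_errno is set when the lookup must fail. */
    heal_action_t decide_dir_heal(int estale_count, int subvol_count,
                                  int *op_errno)
    {
            *op_errno = 0;

            if (estale_count == subvol_count) {
                    /* Unanimous ESTALE: the directory really is gone;
                     * propagate it (NFS-server -> NFS-client). */
                    *op_errno = ESTALE;
                    return HEAL_NONE;
            }
            if (estale_count > 0) {
                    /* Mixed replies (e.g. right after add-brick):
                     * selfheal the missing directory entries using the
                     * parent GFID + name recovered from the reply dict,
                     * then return success. */
                    return HEAL_DIRECTORY;
            }
            /* All subvolumes agree the directory exists: nothing to heal. */
            return HEAL_NONE;
    }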


-- 
Raghavendra G