[Gluster-devel] Feature support: development of metadata service xlator plugin in glusterfs.

Thu Jun 25 20:30:23 UTC 2015

I would like to frame this problem differently, rather than solving it
with a metadata server, and also describe some of the ongoing efforts
and wish list items that address problems of this nature (hence the
top post).

Comments welcome! (there is a lot of "hand waving" below BTW :) )

The problem you describe concerns the fan out of calls rather than the
calls themselves. I.e. if today we do lock(N+M)->record file
length(N+M)->write->unlock(N+M), then putting a metadata server in
place would only remove the fan out portions of that sequence, leaving
lock->record file length->write->unlock.

(I am not an expert on EC, so I am assuming the sequence is right. At
least lock->unlock is present in this form in both EC and AFR, so the
discussion can proceed with these in mind.)
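
To put numbers on the sequence above, here is a minimal sketch in
Python. It is a counting model only, not Gluster code: the function
name is made up, and it assumes that every step (including the write)
goes to all N+M bricks in the fan out case and to a single endpoint
otherwise.

```python
# Hypothetical counting model for one write on an N+M EC volume.
# Assumption: each step costs one network call per brick it is sent to.
STEPS = ["lock", "record file length", "write", "unlock"]


def network_calls(n, m, fan_out):
    """Return (calls per step, total calls) for a single write."""
    width = n + m if fan_out else 1
    per_step = {step: width for step in STEPS}
    return per_step, sum(per_step.values())


if __name__ == "__main__":
    # Example with a 4+2 volume: the steps stay the same, only the width changes.
    print(network_calls(4, 2, fan_out=True)[1])   # 24
    print(network_calls(4, 2, fan_out=False)[1])  # 4
```

The number of steps is identical in both cases, which is the point:
a metadata server would shrink the width of each step, not the length
of the sequence.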

A) Fan out slows each response down to that of the slowest brick, but
if we could remove some steps in the process (say the entire
lock->unlock pair) then we would be better placed to speed up the
stack. One possibility is to use the delegation support that has been
added to the Gluster stack for NFS Ganesha.

With piggybacked auto delegation support for a file open/creat/lookup
to a gluster client stack, the locks are local to the client and hence
need no network round trips. Some parts of this are in [1] and some
are discussed in [2].
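
As a rough illustration of (A), the sketch below models the latency of
one write. It is a toy model under stated assumptions, not measured
behaviour: steps run one after another, each fanned out step waits for
the slowest brick, every step sees the same round trip times, and a
lock held under a delegation costs no network time at all. The
function name is hypothetical.

```python
# Hypothetical latency model for one write; not measured Gluster behaviour.
def write_latency_ms(brick_rtts_ms, delegated):
    """Estimate the network time of one write in milliseconds."""
    # A fanned out step completes only when the slowest brick has answered.
    slowest = max(brick_rtts_ms)
    network_steps = ["record file length", "write"]
    if not delegated:
        # Without a delegation, lock and unlock are network steps too.
        network_steps = ["lock"] + network_steps + ["unlock"]
    return slowest * len(network_steps)


if __name__ == "__main__":
    rtts = [1.0, 1.5, 4.0]  # made-up round trip times, one slow brick
    print(write_latency_ms(rtts, delegated=False))  # 16.0
    print(write_latency_ms(rtts, delegated=True))   # 8.0
```

In this model the slow brick still dominates every remaining step;
delegation only helps by removing steps, which is why (B) below looks
at the fan out itself.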

B) For the fan out itself, NSR (see [3]) proposes server side leader
election. The leader could be the metadata server equivalent, but it
is nevertheless distributed, one for each EC/AFR set, thereby removing
any single MDS limitation and distributing the MDS load as well.

In this scheme of things, the leader takes the locks locally, rather
than the client having to send in lock requests, which again reduces
the number of FOPs. The leader can also record the file length etc.,
and failed transactions can be handled better, possibly reducing other
network FOPs/calls further.

If at all possible, with A+B we should be able to reach a 1:1 call
count between a FOP issued by the client and a network FOP to a brick
(with, on some occasions, a fan out of 1:k). That would mean
equivalence, for the most part, with any existing network file system
that is not distributed (e.g. NFS).
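
The 1:1 (or 1:k) call count can be sketched the same way. Again this
is only a counting model with a made-up function name. It assumes the
client driven case sends lock, record file length, write and unlock to
every brick in the set, and that a leader forwards the write to each
of the other bricks in its set; how NSR actually replicates is
described in [3], not here.

```python
# Hypothetical call count for one write to an EC/AFR set of `set_size` bricks.
def calls_for_write(set_size, leader_based):
    """Return (client-to-brick calls, leader-to-follower calls)."""
    if leader_based:
        # The client sends one FOP to the leader; locks are local to the leader.
        client_to_brick = 1
        # Assumption: the leader forwards the write to every other brick (1:k).
        leader_to_followers = set_size - 1
    else:
        # Client driven: lock, record file length, write, unlock to every brick.
        client_to_brick = 4 * set_size
        leader_to_followers = 0
    return client_to_brick, leader_to_followers


if __name__ == "__main__":
    print(calls_for_write(6, leader_based=False))  # (24, 0)
    print(calls_for_write(6, leader_based=True))   # (1, 5)
```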

C) For DHT related issues in the fan out of directory operations, work
on this is being discussed as DHT2 in [4].

The central theme of DHT2 is that a directory lives in one subvolume,
which eliminates the fan out and also brings better consistency to the
various FOPs that DHT performs.
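
A small sketch of the difference for mkdir follows. The subvolume
names and the function are hypothetical, and the CRC32 hash is only a
placeholder for "pick exactly one subvolume"; it is not the placement
rule DHT2 uses, which is discussed in [4].

```python
import zlib


# Hypothetical illustration: which subvolumes have to execute a mkdir.
def mkdir_targets(path, subvolumes, single_home):
    """Return the list of subvolumes that receive the mkdir."""
    if single_home:
        # Placeholder placement rule: hash the path to exactly one subvolume.
        index = zlib.crc32(path.encode("utf-8")) % len(subvolumes)
        return [subvolumes[index]]
    # Fan out case: the directory is created on every DHT child.
    return list(subvolumes)


if __name__ == "__main__":
    subvols = ["subvol-0", "subvol-1", "subvol-2", "subvol-3"]
    print(len(mkdir_targets("/a/b", subvols, single_home=False)))  # 4
    print(len(mkdir_targets("/a/b", subvols, single_home=True)))   # 1
```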

Overall, with these approaches, we (as in gluster) would/should aim for 
better consistency (first), with improved network utilization and 
reduced round trips to improve performance (next).

Footnote: None of this is ready yet, and it will take time. This is
just a *possible* direction that gluster core is taking to address
various problems at scale.

Shyam

[1] http://www.gluster.org/community/documentation/index.php/Features/caching
[2] http://www.gluster.org/pipermail/gluster-devel/2014-February/039900.html
[3] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
[4] http://www.gluster.org/community/documentation/index.php/Features/dht-scalability

On 06/21/2015 08:33 AM, 张兵 wrote:
> Thank you for your reply.
> In glusterfs, some metadata is recorded in the file's extended
> attributes on all bricks.
> For example, on an EC volume in N+M mode, a stat of a file requires
> N+M commands; a file write requires N+M locks, recording the file
> length (another N+M setattr commands), and finally N+M unlock
> commands.
> With a metadata server, every metadata related operation would need
> only one command to the metadata server.
> As in the old topic, mkdir has to be executed on all the DHT
> children.
> Another difficult problem is the lack of centralized metadata: disk
> recovery performance cannot be improved substantially. For example,
> on an EC N+M volume, only the N+M bricks take part in disk
> reconstruction, and rebuilding 1TB takes several hours.
> With metadata, the data can be dispersed across all the disks, so
> when a disk fails many disks can take part in the reconstruction.
> How can these difficulties be solved?
>
> At 2015-06-21 05:31:58, "Vijay Bellur" <vbellur at redhat.com> wrote:
>>On Friday 19 June 2015 10:43 PM, 张兵 wrote:
>>> Hi all
>>>      While using glusterfs, we found that it issues a lot of file
>>> system commands, such as stat, lookup and setfattr, which strongly
>>> affect system performance, especially with EC volumes. The idea is
>>> to reuse the glusterfs code architecture and add a metadata server
>>> xlator to achieve an architecture similar to GFS; that way, with the
>>> same set of software, users can choose whether or not to use a
>>> metadata server.
>>
>>How do you expect the metadata server to aid performance here? There
>>would be network trips to the metadata servers to set/fetch necessary
>>information. If the intention is to avoid the penalty of having to fetch
>>information from disk, we have been investigating the possibility of
>>loading md-cache as part of the brick process graph to avoid hitting the
>>disk for repetitive fetch of attributes & extended attributes. I expect
>>that to be mainlined soon.
>>
>>If you have other ideas on how a metadata server can improve
>>performance, that would be interesting to know.
>>
>>Regards,
>>Vijay