[Gluster-devel] RIO-Distribution: Status update

Shyam Ranganathan srangana at redhat.com
Fri Sep 22 16:44:18 UTC 2017


Here is a detailed (and lengthy, considering it has been a while) status 
update on RIO. As the mail is long, you may be interested in subsections 
of it, so here is a list of sections and their numbers,

1) General information
2) What does the graph look like
3) What works
4) What does not work
5) Major changes to common code
6) Problems needing attention
7) Next steps of interest

1) General information:

Main github issue tracking the feature: 
https://github.com/gluster/glusterfs/issues/243 (this has been updated 
just before this mail, and has quite a few updates to the issue 
description itself)

Where are we developing the code:
   - We are using the experimental branch to develop code for RIO
   - The list of current commits is in [1]

2) What does the graph look like:
   - It is intended to look like this in the (near) future 

3) What works:
   - Using the python volume file generator, we can create a RIO-based 
volume on a single node 
     - This volume is a bare-bones 
FUSE->RIO-Client->protocol->RIO-Server->POSIX2 graph, so other xlators 
are not integrated or active at present

   - This volume can be FUSE mounted, and operations such as create, 
mkdir, stat, and xattr (get/set/remove) can be performed on the volume
     - Creating files and directories deeper than one directory 
level requires a couple of unmerged patches:
       - Add mkdir FOP: https://review.gluster.org/#/c/18270/
       - Add ability to handle remote inodes in lookup: 
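To make the bare-bones graph above concrete, here is a rough sketch of what 
such a single-node volume file could look like. The RIO and POSIX2 xlator 
type names below (cluster/rio-server, cluster/rio-client, storage/posix2) 
are placeholders guessed from the description, not the actual names used in 
the experimental branch; only the overall volfile shape follows the usual 
GlusterFS conventions.

```
# Server-side graph (one node)
volume rio-posix2
    type storage/posix2            # placeholder type name
    option directory /bricks/rio/brick1
end-volume

volume rio-server
    type cluster/rio-server        # placeholder type name
    subvolumes rio-posix2
end-volume

volume brick-server
    type protocol/server
    option transport-type tcp
    subvolumes rio-server
end-volume

# Client-side graph (FUSE mounts this)
volume brick-client
    type protocol/client
    option transport-type tcp
    option remote-host node1
    option remote-subvolume rio-server
end-volume

volume rio-client
    type cluster/rio-client        # placeholder type name
    subvolumes brick-client
end-volume
```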

4) What does not work:
   - Data operations are still under development, so reading or writing 
files will not work (nor will things like fallocate, discard, etc.)
   - Directory listing does not work
   - unlink, rename, link, among a few other FOPs do not work

5) Major changes to common code:
   - The POSIX xlator has been *reorganized*, so that we can reuse all but 
the entry ops from the existing posix xlator code. This is still in the 
experimental branch, but we intend to bring it into master in about 2 
weeks, once we have a few data FOPs working, to ensure that the 
reorganization works and is worth the effort.
     - Commits of interest:
       - Reorganize posix xlator to prepare for reuse with rio: 
       - Further reorganize posix xlator code for rio : 
       - Some further reorganization of posix xlator: 

   - Added 2 new FOPs, icreate and namelink
     - In brief, icreate creates an inode 
and namelink links an inode to a basename. So in essence, icreate is a 
create without a name, and namelink completes the linking of the inode 
to its basename under the required parent GFID.
     - Some details can be found at [5]
     - More details regarding the FOPs will appear around the time we 
attempt to push this to master.
     - Commits of interest:
       - add icreate/namelink fop: https://review.gluster.org/#/c/18085/
       - io-threads: add icreate/namelink fop: 
       - protocol: add icreate/namelink: 
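To illustrate the split described above, here is a small toy model of the 
two-phase create (icreate followed by namelink). The placement scheme 
(hashing the GFID to place the inode, and the parent-GFID/basename to place 
the name) is an assumption for illustration only, not RIO's actual layout 
algorithm, and the table layout is purely a simulation.

```python
import hashlib
import uuid

MDS_COUNT = 4

def mds_for(key: str) -> int:
    """Pick an MDS subvolume by hashing a key (placement is illustrative)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % MDS_COUNT

# One table per MDS subvolume: gfid -> inode metadata, (pgfid, name) -> gfid
inodes = [dict() for _ in range(MDS_COUNT)]
names = [dict() for _ in range(MDS_COUNT)]

def icreate(gfid: str, mode: int) -> int:
    """Create a nameless inode on the MDS that owns the GFID."""
    mds = mds_for(gfid)
    inodes[mds][gfid] = {"mode": mode, "nlink": 0}
    return mds

def namelink(pgfid: str, basename: str, gfid: str) -> int:
    """Link a basename under the parent GFID to an already-created inode."""
    mds = mds_for(pgfid + "/" + basename)
    names[mds][(pgfid, basename)] = gfid
    return mds

# A "create" is icreate followed by namelink; note the two phases may land
# on different MDS subvolumes (the remote-inode case in section 6).
root_gfid = "00000000-0000-0000-0000-000000000001"
gfid = str(uuid.uuid4())
inode_mds = icreate(gfid, 0o100644)
name_mds = namelink(root_gfid, "file.txt", gfid)
```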

6) Problems needing attention:
   - Keeping time/size updated in the MDS (from the DS)
Once we enable data operations, the time and size information on the MDS 
needs to be synced/fetched from the DS for any iatt related data 
returned. This problem is well written out by Venky here [2] and, as 
noted earlier, has similar solution requirements as the utime xlator work 
that Rafi is currently engaged in [3]. We intend to leverage that work 
with RIO as well.
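As a toy sketch of the sync problem above: an iatt served from the MDS has 
to be patched with the size/times the DS saw after data operations. The 
field names and the merge policy below are illustrative assumptions, not 
the actual iatt structure or the approach chosen in [2].

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Iatt:
    """Toy stand-in for a GlusterFS iatt; only a few fields are modeled."""
    size: int
    mtime: int
    ctime: int
    mode: int

def merge_iatt(mds: Iatt, ds: Iatt) -> Iatt:
    """Return MDS metadata with size/times taken from the fresher DS copy."""
    return replace(mds, size=ds.size,
                   mtime=max(mds.mtime, ds.mtime),
                   ctime=max(mds.ctime, ds.ctime))

# MDS view is stale after a write landed on the DS; ownership/mode stays
# authoritative on the MDS, size/times come from the DS.
mds_view = Iatt(size=0, mtime=100, ctime=100, mode=0o100644)
ds_view = Iatt(size=4096, mtime=250, ctime=250, mode=0o100644)
merged = merge_iatt(mds_view, ds_view)
```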

   - Handling cases where basename and inode are on different MDS 
subvolumes (remote inodes)
There is an interesting case in RIO, where the name and inode of a 
filesystem object can be on 2 different MDS subvolumes. In such cases, 
we will get the GFID when looking up the name on the first MDS, and 
using the GFID we would look up the inode on the relevant MDS. This needs 
some thought, as currently this is plugged in as op_ret = -1 and 
op_errno = EREMOTE, with changes to the client/server protocol layers to 
return iatt information on this class of errors. This breaks the 
abstraction/assumption that a FOP does not return valid parameters 
(only NULLs) on errors, and hence needs a better fix. 
Suggestions are welcome; the code snippet that achieves this is in [4]
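The two-hop resolution can be sketched as a toy model like the one below. 
The EREMOTE return carrying the GFID mirrors the description above, but the 
table layout, return-tuple shape, and the hard-coded redirect target are all 
illustrative assumptions, not the actual client/server protocol changes.

```python
import errno

# Two MDS subvolumes: the name lives on mds 0, the inode on mds 1.
mds_names = [{("pgfid-1", "doc.txt"): "gfid-42"}, {}]
mds_inodes = [{}, {"gfid-42": {"mode": 0o100644, "size": 123}}]

def lookup_name(mds, pgfid, basename):
    """Hop 1: resolve basename -> GFID; EREMOTE when the inode is elsewhere."""
    gfid = mds_names[mds].get((pgfid, basename))
    if gfid is None:
        return -1, errno.ENOENT, None
    if gfid in mds_inodes[mds]:
        return 0, 0, {"gfid": gfid, **mds_inodes[mds][gfid]}
    # Inode lives on another MDS: fail, but still carry the GFID back
    # (the contentious "return iatt data on error" case described above).
    return -1, errno.EREMOTE, {"gfid": gfid}

def lookup_inode(mds, gfid):
    """Hop 2: resolve the inode by GFID on its owning MDS."""
    inode = mds_inodes[mds].get(gfid)
    if inode is None:
        return -1, errno.ENOENT, None
    return 0, 0, {"gfid": gfid, **inode}

def client_lookup(pgfid, basename):
    """Client-side two-hop resolution across MDS subvolumes."""
    op_ret, op_errno, stub = lookup_name(0, pgfid, basename)
    if op_ret == -1 and op_errno == errno.EREMOTE:
        # Redirect using the GFID; the owning MDS is hard-coded here,
        # whereas RIO would derive it from its layout.
        return lookup_inode(1, stub["gfid"])
    return op_ret, op_errno, stub
```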

   - Handling notify for the client and the server (given the way the 
graph is now)
When is the RIO-client or RIO-server ready to serve requests? IOW, how 
do we handle notify? Currently this is hacked into the code and will not 
survive any mishaps; we need a better understanding of the problem and 
the related events, and finally a solution that makes this happen 
correctly. Code that does this:
     - The server is ready only when its POSIX xlator is ready (in RIO, 
bricks connect to all other bricks, so an UP event from other bricks 
does not mean we are ready): 
     - The client is ready when all children are ready (do not judge me 
by this hacky code! ;-p): 
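The client-side readiness rule above can be sketched as a toy model: become 
ready only when every child subvolume has reported up, and propagate only on 
transitions. The event names mimic GlusterFS's CHILD_UP/CHILD_DOWN notify 
events, but the bookkeeping below is an illustrative sketch, not the actual 
notify code.

```python
CHILD_UP, CHILD_DOWN = "CHILD_UP", "CHILD_DOWN"

class RioClient:
    """Toy RIO-client readiness tracker over its child subvolumes."""

    def __init__(self, children):
        self.state = {child: False for child in children}
        self.ready = False

    def notify(self, child, event):
        """Track per-child state; flip readiness when all children are up."""
        if event == CHILD_UP:
            self.state[child] = True
        elif event == CHILD_DOWN:
            self.state[child] = False
        was_ready, self.ready = self.ready, all(self.state.values())
        # Propagate upward only on transitions, so the parent xlator is
        # not notified redundantly on every child event.
        if self.ready != was_ready:
            print("propagate", CHILD_UP if self.ready else CHILD_DOWN)

client = RioClient(["mds-0", "mds-1", "ds-0"])
client.notify("mds-0", CHILD_UP)
client.notify("ds-0", CHILD_UP)
client.notify("mds-1", CHILD_UP)   # all children up: propagates CHILD_UP
```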

7) Next steps of interest:
   - Handling dirty inodes
     Inodes that have had data operations, and hence have stale 
time/size information on the MDS
   - Adding the dentry backpointers to the inode
     Just like what is added today for POSIX using the xxhash named xattrs
   - Handling inheritance of parent bits
     How and when SUID/SGID and ACLs are handled when creating 
subdirectories, as we are not leveraging the hierarchy of the local FS

Shyam, Kotresh, Susant

[1] RIO experimental commits: 
[2] Times and size maintenance in RIO: 
[3] POSIX changes for utime xlator: https://review.gluster.org/#/c/17224/4
[4] Returning iatt even on errors:
[5] Notes on icreate/namelink: 
