[Gluster-devel] RIO-Distribution: Status update
Shyam Ranganathan
srangana at redhat.com
Fri Sep 22 16:44:18 UTC 2017
Hi,
Here is a detailed (and lengthy, considering it has been a while) status
update on RIO. As the mail is long, you may be interested in only some of
its subsections, so here is a list of sections and their numbers:
1) General information
2) What does the graph look like
3) What works
4) What does not work
5) Major changes to common code
6) Problems needing attention
7) Next steps of interest
1) General information:
Main github issue tracking the feature:
https://github.com/gluster/glusterfs/issues/243 (this has been updated
just before this mail, and has quite a few updates to the issue
description itself)
Where are we developing the code:
- We are using the experimental branch to develop code for RIO
- List of current commits would be as in [1]
2) What does the graph look like:
- It is intended to look like this in the (near) future
https://docs.google.com/document/d/1-1ibDCzHh0_U5KXz977MFtGxgNE8CADK5zmJm_f0_Dw/edit?usp=sharing
3) What works:
- Using the python volume file generator, we can create a RIO based
volume on a single node
(https://github.com/gluster/glusterfs/blob/experimental/tests/experimental/riocreate.t)
- This volume is a bare bones
FUSE->RIO-Client->protocol->RIO-Server->POSIX2 graph, so other xlators
are not integrated or active at present
- This volume can be FUSE mounted, and operations such as create,
mkdir, stat, and xattr (get/set/remove) can be performed on the volume
- Creating files and directories more than one directory level
deep requires a couple of unmerged patches:
- Add mkdir FOP: https://review.gluster.org/#/c/18270/
- Add ability to handle remote inodes in lookup:
https://review.gluster.org/#/c/18295/
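To make the list of working operations concrete, here is a minimal sketch that exercises only create, mkdir, stat and xattr get/set/remove at the top level of a mounted volume. The function name, the file names and the `user.demo` xattr key are made up for illustration, and the mount path is whatever you pass in. It deliberately does not read, write or list anything, since those are noted as not working yet.

```python
import os


def exercise_supported_ops(mount_point):
    """Run the operations the mail lists as working; return what was observed.

    Everything here is illustrative: names and the xattr key are hypothetical.
    """
    results = {}
    dir_path = os.path.join(mount_point, "dir1")
    file_path = os.path.join(mount_point, "file1")

    # mkdir, one level deep only
    os.mkdir(dir_path)
    results["mkdir"] = os.path.isdir(dir_path)

    # create, without writing any data
    fd = os.open(file_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    results["create"] = os.path.exists(file_path)

    # stat
    results["stat_size"] = os.stat(file_path).st_size

    # xattr set/get/remove; these calls are Linux-only and the backing
    # filesystem may reject user xattrs, so failures are recorded as None
    try:
        os.setxattr(file_path, "user.demo", b"value")
        results["xattr"] = os.getxattr(file_path, "user.demo")
        os.removexattr(file_path, "user.demo")
    except (OSError, AttributeError):
        results["xattr"] = None

    return results
```

Calling `exercise_supported_ops()` with the path of the FUSE mount (or any scratch directory, for a dry run) returns a small dict of what succeeded.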
4) What does not work:
- Data operations are still under development, so reading or writing
files will not work (nor will things like fallocate, discard, etc.)
- Directory listing does not work
- unlink, rename, link, among a few other FOPs do not work
5) Major changes to common code:
- POSIX xlator has been *reorganized*, so that we can reuse all but
entry ops from the existing posix xlator code. This is still in the
experimental branch, but we intend to bring it into master in about 2 weeks,
once we have a few data FOPs working, which would confirm that the approach
works and hence that the reorganization is worth the effort.
- Commits of interest:
- Reorganize posix xlator to prepare for reuse with rio:
https://review.gluster.org/#/c/17990/
- Further reorganize posix xlator code for rio :
https://review.gluster.org/#/c/17998/
- Some further reorganization of posix xlator:
https://review.gluster.org/#/c/18013/
- Added 2 new FOPs, icreate and namelink
- In brief, icreate creates an inode
and namelink links an inode to a basename. So in essence, icreate is a
create without a name, and namelink completes the linking of the inode
to its basename under the required parent GFID.
- Some details can be found at [5]
- More details regarding these FOPs will appear around the time we
attempt to push them to master.
- Commits of interest:
- add icreate/namelink fop: https://review.gluster.org/#/c/18085/
- io-threads: add icreate/namelink fop:
https://review.gluster.org/#/c/18086/
- protocol: add icreate/namelink:
https://review.gluster.org/#/c/18094/
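As a rough illustration of the icreate/namelink split described above, here is a toy in-memory model. It is not the Gluster implementation: `ToyStore`, its dictionaries and the `nlink` counter are hypothetical, and the real FOP signatures are in the linked reviews. It only shows that a regular create can be seen as icreate followed by namelink.

```python
import uuid


class ToyStore:
    """Hypothetical in-memory stand-in for a metadata store; not Gluster code."""

    def __init__(self):
        self.inodes = {}    # gfid -> inode attributes
        self.dentries = {}  # (parent_gfid, basename) -> gfid

    def icreate(self, mode):
        """Create an inode that has no name yet and return its GFID."""
        gfid = uuid.uuid4().hex
        self.inodes[gfid] = {"mode": mode, "nlink": 0}
        return gfid

    def namelink(self, parent_gfid, basename, gfid):
        """Link an existing inode to basename under parent_gfid."""
        if gfid not in self.inodes:
            raise FileNotFoundError(gfid)
        key = (parent_gfid, basename)
        if key in self.dentries:
            raise FileExistsError(basename)
        self.dentries[key] = gfid
        self.inodes[gfid]["nlink"] += 1

    def create(self, parent_gfid, basename, mode):
        """A named create expressed as icreate followed by namelink."""
        gfid = self.icreate(mode)
        self.namelink(parent_gfid, basename, gfid)
        return gfid
```

In this model an inode returned by `icreate()` alone is reachable only by GFID until some `namelink()` call gives it a name.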
6) Problems needing attention:
- Keeping time/size updated in the MDS (from the DS)
Once we enable data operations, the time and size information held on the DS
needs to be synced to, or fetched by, the MDS for any iatt related data
returned. This problem is well written out by Venky here [2] and, as
noted earlier, has similar solution requirements to the utime xlator work
that Rafi is currently engaged in [3]. We intend to leverage that work
in RIO as well.
- Handling cases where basename and inode are on different MDS
subvolumes (remote inodes)
There is an interesting case in RIO, where the name and the inode of a
filesystem object can be on 2 different MDS subvolumes. In such cases,
we get the GFID when looking up the name on the first MDS, and
using the GFID we then look up the inode on the relevant MDS. This needs
some thought, as currently this is plugged in as an op_ret = -1 and
op_errno = EREMOTE, with changes to the client/server protocol layers to
return iatt information on this class of errors. This changes the
abstraction/assumption around FOP errors, in that a FOP now has to return
parameters (instead of NULLs) even on errors, and hence needs a better fix.
Suggestions welcome; the code that achieves this is in [4].
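The two-step lookup for remote inodes can be sketched with a toy model. This is not the code in [4]: `ToyMDS`, `client_lookup` and the `locate_mds` callback are hypothetical, and how the relevant MDS is found from a GFID is not covered in this mail, so it is passed in as an assumption. What the sketch does mirror is the error path described above, where op_ret = -1 and op_errno = EREMOTE still carry iatt information (here just the GFID).

```python
import errno


class ToyMDS:
    """Hypothetical metadata subvolume holding some names and some inodes."""

    def __init__(self, name):
        self.name = name
        self.dentries = {}  # (parent_gfid, basename) -> gfid
        self.inodes = {}    # gfid -> iatt-like dict

    def lookup_name(self, parent_gfid, basename):
        """Return (op_ret, op_errno, iatt) for a name lookup."""
        gfid = self.dentries.get((parent_gfid, basename))
        if gfid is None:
            return -1, errno.ENOENT, None
        iatt = self.inodes.get(gfid)
        if iatt is None:
            # The name lives here but the inode does not: report an error,
            # yet still hand back the GFID so the caller can continue.
            return -1, errno.EREMOTE, {"gfid": gfid}
        return 0, 0, iatt

    def lookup_gfid(self, gfid):
        """Return (op_ret, op_errno, iatt) for a lookup by GFID."""
        iatt = self.inodes.get(gfid)
        if iatt is None:
            return -1, errno.ENOENT, None
        return 0, 0, iatt


def client_lookup(name_mds, parent_gfid, basename, locate_mds):
    """Look up the name first; on EREMOTE, follow the GFID to another MDS.

    locate_mds is a hypothetical callback mapping a GFID to its MDS.
    """
    op_ret, op_errno, iatt = name_mds.lookup_name(parent_gfid, basename)
    if op_ret == -1 and op_errno == errno.EREMOTE:
        return locate_mds(iatt["gfid"]).lookup_gfid(iatt["gfid"])
    return op_ret, op_errno, iatt
```

The awkward part is visible in `lookup_name()`: the EREMOTE return is the only error that carries a non-empty third value, which is the special case the paragraph above asks for suggestions on.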
- Handling notify for the client and the server (given the way the
graph is now)
When is the RIO-client or RIO-server ready to serve requests? IOW, how
should notify be handled? Currently this is hacked into the code and will
not survive any mishaps; we need a better understanding of the problem
and the related events, and finally a solution that makes this happen
correctly. Code that does this:
- Server is ready only when its POSIX xlator is ready (in RIO,
bricks connect to all other bricks, so an UP event from another brick
does not mean we are ready):
https://github.com/gluster/glusterfs/blob/experimental/xlators/experimental/rio/rio-server/src/rio-server-main.c#L41
- Client is ready when all children are ready (do not judge me by
this hacky code! ;-p):
https://github.com/gluster/glusterfs/blob/experimental/xlators/experimental/rio/rio-client/src/rio-client-main.c#L85
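The two readiness rules above can be restated as a small model. This is a sketch of the rules as described in this mail, not of the linked C code: the class names, the string child identifiers and the `notify_child_up` method are all hypothetical, and DOWN events and other mishaps are not modelled at all.

```python
class ToyRioServer:
    """Ready only once its own POSIX child reports UP."""

    def __init__(self, posix_child, other_bricks):
        self.posix_child = posix_child
        self.other_bricks = set(other_bricks)
        self.ready = False

    def notify_child_up(self, child):
        # UP events from other bricks are ignored for readiness purposes.
        if child == self.posix_child:
            self.ready = True
        return self.ready


class ToyRioClient:
    """Ready only once every child has reported UP."""

    def __init__(self, children):
        self.children = set(children)
        self.up = set()

    @property
    def ready(self):
        return self.up == self.children

    def notify_child_up(self, child):
        if child in self.children:
            self.up.add(child)
        return self.ready
```

Even in this toy form the gap is visible: nothing ever moves either object back to not-ready, which is the kind of mishap handling still to be worked out.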
7) Next steps of interest:
- Handling dirty inodes
Inodes that have had data operations, and hence have stale time/size
information on the MDS
- Adding the dentry backpointers to the inode
Just like what is added today for POSIX using the xxhash named xattrs
- Handling inheritance of parent bits
How and when SUID/SGID and ACLs are handled when creating
subdirectories, as we are not leveraging the hierarchy of the local FS
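For the dirty-inode item in the list above, one possible shape of the bookkeeping is sketched below. This is purely an assumption for illustration and not a design decision from this mail: the class names are hypothetical, refreshing on stat is just one policy among several, and when (or whether) the dirty flag gets cleared is deliberately left out.

```python
class ToyDataServer:
    """Hypothetical DS: holds the authoritative size/mtime per GFID."""

    def __init__(self):
        self.attrs = {}  # gfid -> {"size": int, "mtime": int}

    def write(self, gfid, offset, data, now):
        cur = self.attrs.setdefault(gfid, {"size": 0, "mtime": 0})
        cur["size"] = max(cur["size"], offset + len(data))
        cur["mtime"] = now


class ToyMetadataServer:
    """Hypothetical MDS: caches size/mtime and tracks dirty inodes."""

    def __init__(self, ds):
        self.ds = ds
        self.attrs = {}     # gfid -> {"size": int, "mtime": int}
        self.dirty = set()  # GFIDs whose cached size/mtime may be stale

    def add_inode(self, gfid):
        self.attrs[gfid] = {"size": 0, "mtime": 0}

    def mark_dirty(self, gfid):
        """Record that data operations have started on this inode."""
        self.dirty.add(gfid)

    def stat(self, gfid):
        if gfid in self.dirty:
            # Cached values may be stale: refresh them from the DS, falling
            # back to the cached copy if the DS has nothing for this GFID.
            self.attrs[gfid] = dict(self.ds.attrs.get(gfid, self.attrs[gfid]))
        return dict(self.attrs[gfid])
```

A clean inode is answered from the MDS alone, while a dirty one costs an extra trip to the DS on every stat until some sync step, not shown here, clears the flag.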
Shyam, Kotresh, Susant
[1] RIO experimental commits:
https://github.com/gluster/glusterfs/issues/243#issuecomment-331476032
[2] Times and size maintenance in RIO:
https://review.gluster.org/#/c/13395/3/design/DHT2/DHT2_Size_On_MDS
[3] POSIX changes for utime xlator: https://review.gluster.org/#/c/17224/4
[4] Returning iatt even on errors:
-
https://review.gluster.org/#/c/18295/3/xlators/protocol/server/src/server-rpc-fops.c
-
https://review.gluster.org/#/c/18295/3/xlators/protocol/client/src/client-rpc-fops.c
[5] Notes on icreate/namelink:
https://review.gluster.org/#/c/13395/3/design/DHT2/DHT2_Icreate_Namelink_Notes.md