[Gluster-users] User-serviceable snapshots design

Anand Subramanian ansubram at redhat.com
Fri May 2 14:35:26 UTC 2014


Attached is a basic write-up of the user-serviceable snapshot feature 
design (Avati's). Please take a look and let us know if you have 
questions of any sort...

We have a basic implementation up now; reviews and the upstream commit 
should follow over the next week.

Cheers,
Anand
-------------- next part --------------
User-serviceable Snapshots
==========================

Credits
=======
This brilliant design is Anand Avati's brainchild. The meta xlator is also to blame to some extent.


Terminology
===========

* gluster volume - a GlusterFS volume created by the gluster vol create cmd
* snapshot volume - a snapshot volume created by the gluster snapshot create cmd; it is based on the LVM2 thin-LV backend and is itself a thin-LV; a snapshot thin-LV is accessible as yet another GlusterFS volume in its own right

1. Introduction
===============

User-serviceable snapshots (USS) are a quick and easy way to access data stored in earlier snapshotted volumes. This feature builds on the core snapshot feature introduced in GlusterFS earlier. The key point is that USS allows end users to access their older data without any admin intervention. To that extent, this feature is about ease of use and ease of access to one's past data in snapshot volumes (which, in the gluster world today, are based on LVM2 thin-LVs as the backend).

This is not a replacement for bulk data access from an earlier snapshot volume; for that, the recommendation is to mount the snapshot volume as a GlusterFS volume and access it via the native FUSE client.
Rather, this is targeted at typical home directory scenarios where individual users can, at random points in time, access files and directories in their own home directories without admin intervention of any sort. The home directory use-case is only an example; several other use-cases, including other kinds of applications, could benefit from this feature.

2. Use-case
===========

Consider a user John with Unix id john and $HOME set to /home/john. Suppose John wants to access a file /home/john/Important/file_john.txt which existed in his home directory in November 2013 but was deleted in December 2013. To access the file (prior to the introduction of the user-serviceable snapshot feature), John's only option was to send a note to the admin asking for the gluster snapshot volume from Nov 2013 to be made available (activated
and mounted). The admin would then notify John of the availability of the snapshot volume, after which John could traverse his older home directory and copy over the file.

With USS, the need for admin intervention goes away. John is now free to execute the following steps and access the desired file whenever he needs to:

$pwd
/home/john

$ls
dir1/	dir2/	dir3/	file1	file2	Important/

$cd Important/

$ls

(No files present - this being his current view)

$cd .snaps

$pwd
/home/john/Important/.snaps

$ls
snapshot_jan2014/	snapshot_dec2013/	snapshot_nov2013/	snapshot_oct2013/	snapshot_sep2013/

$cd snapshot_nov2013/

$ls
file_john.txt	file_john_1.txt

$cp -p file_john.txt $HOME

As the above steps indicate, it is fairly easy to recover lost files, or even older versions of files or directories, using USS.


3. Design
==========

A new server-side xlator (snapview-server) and a client-side xlator (snapview-client) are introduced. On the client side, the xlator sits above the DHT xlator in the graph and redirects fops either to the DHT xlator or to the protocol-client xlator (both of which are children of the snapview-client xlator). On the server side, the protocol-server xlator and the snapview-server xlator form a graph that is hosted inside a separate daemon, snapd (a glusterfsd process). One such daemon is spawned for each gluster volume.
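
To make the split concrete, here is a minimal C sketch of the routing decision the client-side xlator has to make for every lookup: wind it down the regular volume graph, or redirect it to the protocol/client that talks to snapd. This is illustrative only; the function and enum names are made up for the sketch and this is not the actual snapview-client code.

/* Sketch (not the actual snapview-client code): decide which of the two
 * children a lookup should be wound to -- the regular DHT graph, or the
 * protocol/client connecting to the snapd daemon hosting snapview-server. */
#include <string.h>

enum sv_child { SV_CHILD_DHT = 0, SV_CHILD_SNAPD = 1 };

/* 'name' is the basename being looked up; 'parent_is_virtual' says whether
 * the parent inode was already classified as part of the .snaps namespace. */
enum sv_child
sv_client_route_lookup(const char *name, int parent_is_virtual)
{
        if (parent_is_virtual)
                return SV_CHILD_SNAPD;          /* already inside .snaps  */
        if (name && strcmp(name, ".snaps") == 0)
                return SV_CHILD_SNAPD;          /* entry point: redirect  */
        return SV_CHILD_DHT;                    /* normal namespace       */
}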

We rely on the fact that gfids are unique and are the same across all snapshotted volumes. Given a volume, we can access a file using its gfid alone, without knowing its filename. We accomplish this by taking the existing data filesystem namespace and overlaying a virtual gfid namespace on top of it.
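
The gfid-based access maps naturally onto the handle-based libgfapi calls discussed later in this document. The following is a small, self-contained sketch (the host name, port and error handling are placeholders, not part of the design) showing how a file on a snapshot volume can be reached from its gfid alone, without knowing its path:

/* Sketch: resolve a file on a snapshot volume by gfid, without its path.
 * Build with something like: gcc -o gfid_access gfid_access.c -lgfapi -luuid */
#include <stdio.h>
#include <sys/stat.h>
#include <uuid/uuid.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

int
access_by_gfid(const char *snap_volname, const char *gfid_str)
{
        struct glfs *fs = glfs_new(snap_volname);
        if (!fs)
                return -1;
        glfs_set_volfile_server(fs, "tcp", "localhost", 24007);
        if (glfs_init(fs) != 0)
                return -1;

        /* A gfid is a uuid; the handle-based API takes it as a raw
         * GFAPI_HANDLE_LENGTH (16) byte handle. */
        unsigned char gfid[GFAPI_HANDLE_LENGTH];
        if (uuid_parse(gfid_str, gfid) != 0)
                return -1;

        struct stat st;
        struct glfs_object *obj =
                glfs_h_create_from_handle(fs, gfid, GFAPI_HANDLE_LENGTH, &st);
        if (!obj)
                return -1;

        printf("gfid %s on %s: size=%lld bytes\n", gfid_str, snap_volname,
               (long long)st.st_size);

        glfs_h_close(obj);
        glfs_fini(fs);
        return 0;
}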

All files and directories remain accessible as they are (in the current state of the gluster volume). But in every directory we create, on the fly, a "virtual directory" called ".snaps". This ".snaps" directory lists all the available snapshots for the given volume and acts as a kind of wormhole into all the available snapshots of that volume, i.e. into the past.

When the .snaps dir is looked up, the client xlator, with its instrumented lookup(), detects that this is a reference to the virtual directory. It redirects the request to the snapd daemon and in turn to the snapview-server xlator, which generates a random gfid, fills up a pseudo stat structure with the necessary info and returns via STACK_UNWIND. Information about the directory is maintained in the server xlator's inode context, where inodes are classified as VIRTUAL, REAL or the
special "DOT_SNAPS_INODE", so that this info can be used in subsequent lookups. On the client xlator side too, this virtual-type info is maintained in the inode_ctx.

The user would typically do an "ls", which results in an opendir and a readdirp() on the inode returned. The server xlator queries the list of snapshots present in the system and presents each one as an entry in the directory, in the form of dirent entries. We also need to encode enough info in each of the respective inodes so that the next time a call happens on that inode, we can figure out where that inode fits in the big picture: whether it belongs to a snapshot volume, which one, and so on. Once a user does an ls inside one of the specific snapshot directories (hourly.0 etc.), we have to figure out the gfid of the original directory, pick the graph corresponding to that snapshot, and perform the access on that graph using that gfid. The inode information on the server xlator side is mapped to the gfapi world via the handle-based libgfapi APIs, which were introduced for the nfs-ganesha integration. These handle-based APIs allow a gfapi operation to be performed on a "gfid" handle, via a glfs-object that encodes the gfid and the inode returned from the gfapi world.
In this case, once the server xlator allocates an inode, we need to track it and map it to the corresponding glfs-object in the handle-based gfapi world, so that any glfs_h_XXX operation can be performed on it.
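
That per-inode tracking can be pictured as a small context object on the snapview-server side; a hedged sketch follows (the struct and function names are illustrative, the real xlator keeps this information in the inode_ctx):

/* Sketch: what snapview-server needs to remember per virtual inode so that
 * later fops can be turned into handle-based gfapi (glfs_h_XXX) calls. */
#include <string.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

typedef struct {
        struct glfs        *fs;       /* gfapi graph of the snapshot volume */
        struct glfs_object *object;   /* handle for this gfid in that graph */
        unsigned char       gfid[GFAPI_HANDLE_LENGTH];
} svs_gfapi_ctx_t;

/* Bind a freshly allocated virtual inode to its snapshot-volume handle. */
int
svs_bind_inode(svs_gfapi_ctx_t *ctx, struct glfs *snap_fs,
               const unsigned char *gfid)
{
        struct stat st;

        ctx->fs = snap_fs;
        memcpy(ctx->gfid, gfid, GFAPI_HANDLE_LENGTH);
        ctx->object = glfs_h_create_from_handle(snap_fs, ctx->gfid,
                                                GFAPI_HANDLE_LENGTH, &st);
        return ctx->object ? 0 : -1;
}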

For example, on the server xlator side, the _stat call would typically need to check the type of inode stored in the inode_ctx. If it is the ".snaps" inode, the iatt structure is filled in directly. If it is a subsequent lookup on a VIRTUAL inode, we obtain the glfs_t and glfs_object info from the inode_ctx (where they are already stored). The desired stat is then easily obtained using the glfs_h_stat(fs, object, &stat) call.
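
In code, that stat path for a virtual inode becomes a thin wrapper over the handle-based call, along these lines (a sketch; the wrapper name is made up, but glfs_h_stat is the call named above):

/* Sketch: stat of a VIRTUAL inode, given the glfs_t and glfs_object that
 * were stored in the inode_ctx at lookup time. */
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

int
svs_stat_virtual(struct glfs *fs, struct glfs_object *object, struct stat *buf)
{
        return glfs_h_stat(fs, object, buf);
}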

Considerations
==============
- A global option will be available to turn off USS for all volumes. A volume-level option will also be made available to enable USS per volume. There could be volumes for which USS access is not desirable at all.

- Disabling this feature removes the client-side graph changes, while the snapd daemons can continue to exist on the server side; they are never accessed without the client-side enablement. And since every access to the client gfapi graphs etc. is dynamic, done on the fly and cleaned up afterwards, the expectation is that such a snapd left behind would not hog resources at all.

- Today we are allowing the listing of all available snapshots in each ".snaps" directory. We plan to introduce a configurable option to limit the number of snapshots visible under the USS feature.

- There is no impact on existing fops from this feature. If enabled, it adds just an extra check in the client-side xlator to decide whether the fop should be redirected to the server-side xlator.

- With a large number of snapshot volumes made available or visible, one glfs_t * hangs off the snapd for each gfapi client call-graph. Along with that, if a large number of users start simultaneously accessing files on each of
the snapshot volumes (the maximum number of supported snapshots is 256 today), the RSS of snapd could go high. We are trying to get numbers for this before we can say for sure whether this is an issue at all (say, with the OOM killer).

- The list of snapshots is refreshed each time a snapshot is taken, added to the system or deleted. The snapd queries glusterd for the new list of snapshots and refreshes its in-memory list, appropriately cleaning up the glfs_t graphs of deleted snapshots and releasing any associated glfs_objects (see the sketch after this list).

- Again, this is not a performance-oriented feature. Rather, the goal is a seamless user experience, with easy and useful access to snapshotted volumes and the individual data stored in them.
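
The snapshot-list refresh mentioned above can be sketched as follows. This is illustrative only: the list handling and names are made up for the sketch, the real snapd gets the new list of snapshot names from glusterd, and any outstanding glfs_objects on a departing graph would also need to be released (glfs_h_close) before the graph is torn down.

/* Sketch: drop the gfapi graphs of snapshots that no longer exist. */
#include <string.h>
#include <glusterfs/api/glfs.h>

typedef struct {
        char         name[256];   /* snapshot volume name                 */
        struct glfs *fs;          /* gfapi graph, NULL if not initialized */
} snap_entry_t;

static int
still_present(const snap_entry_t *old, const char **new_names, int new_count)
{
        for (int i = 0; i < new_count; i++)
                if (strcmp(old->name, new_names[i]) == 0)
                        return 1;
        return 0;
}

/* Release the glfs_t of every snapshot missing from the refreshed list. */
void
refresh_snap_list(snap_entry_t *old_list, int old_count,
                  const char **new_names, int new_count)
{
        for (int i = 0; i < old_count; i++) {
                if (still_present(&old_list[i], new_names, new_count))
                        continue;
                if (old_list[i].fs) {
                        glfs_fini(old_list[i].fs);  /* tear down the graph */
                        old_list[i].fs = NULL;
                }
        }
}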






