[Gluster-devel] Introducing file based snapshots in gluster

Prasanna Kumar Kalever pkalever at redhat.com
Tue Mar 1 09:17:44 UTC 2016


On Tuesday, March 1, 2016 1:15:06 PM, Kaushal M wrote:
> On Tue, Mar 1, 2016 at 12:37 PM, Prasanna Kumar Kalever
> <pkalever at redhat.com> wrote:
> > Hello Gluster,
> >
> >
> > Introducing a new file-based snapshot feature in gluster, based on the
> > reflink feature that should be available in xfs in a couple of months
> > (downstream).
> >
> >
> > What is a reflink?
> >
> > You have surely used softlinks and hardlinks every day!
> >
> > Unlike soft/hardlinks, a reflink supports transparent copy-on-write, which
> > makes it useful for snapshotting. A reflink points to the same data blocks
> > that are used by the actual file (the blocks are shared between the real
> > file and the reflink file, hence it is space efficient), but it gets its
> > own inode number, so it can have different permissions for accessing the
> > same data blocks. Reflinks may look similar to hardlinks, but they are more
> > space efficient and can handle all operations that can be performed on a
> > regular file, unlike hardlinks, which are limited to unlink().
> >
> > Which filesystems support reflinks?
> > I think Btrfs was the first to implement them; now xfs is working hard to
> > make them available, and in the future we may see them in ext4 as well.
> >
> > You can get a feel for reflinks by following this tutorial:
> > https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/
> >
> >
> > POC in gluster: https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2
> >
> >
> > How are we doing it?
> > Currently there is no dedicated system call that gives a handle to
> > reflinks, so I decided to go with an ioctl() call using the XFS_IOC_CLONE
> > command.
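> >
> > A minimal sketch of what that ioctl call could look like (the file names
> > and the .fsnap/ path are only illustrative; FICLONE is the later generic
> > name for the same ioctl number as XFS_IOC_CLONE):
> >
> >     #include <stdio.h>
> >     #include <fcntl.h>
> >     #include <unistd.h>
> >     #include <sys/ioctl.h>
> >     #include <sys/stat.h>
> >     #include <linux/fs.h>          /* FICLONE */
> >
> >     #ifndef XFS_IOC_CLONE
> >     #define XFS_IOC_CLONE FICLONE  /* XFS-specific name, same ioctl number */
> >     #endif
> >
> >     int main(void)
> >     {
> >         int src = open("vm1.img", O_RDONLY);
> >         int dst = open(".fsnap/vm1.img.snap1", O_CREAT | O_WRONLY, 0444);
> >         if (src < 0 || dst < 0) {
> >             perror("open");
> >             return 1;
> >         }
> >
> >         /* share src's data blocks with dst: copy-on-write, no data copy */
> >         if (ioctl(dst, XFS_IOC_CLONE, src) == -1) {
> >             perror("ioctl(XFS_IOC_CLONE)");
> >             return 1;
> >         }
> >
> >         /* the clone gets its own inode even though the blocks are shared */
> >         struct stat a, b;
> >         fstat(src, &a);
> >         fstat(dst, &b);
> >         printf("file inode %llu, snapshot inode %llu\n",
> >                (unsigned long long)a.st_ino, (unsigned long long)b.st_ino);
> >
> >         close(src);
> >         close(dst);
> >         return 0;
> >     }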
> >
> > In the POC I have used setxattr/getxattr to create/delete/list snapshots.
> > The restore feature will use setxattr as well.
> >
> > We could have a dedicated fop, but since Fuse doesn't understand it, we
> > will manage with a setxattr at the Fuse mount point; from the client side
> > it will travel as a fop down to the posix xlator and then as an ioctl to
> > the underlying filesystem. Planning to expose APIs for create, delete,
> > list and restore.
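> >
> > To make that flow concrete, here is a rough sketch of how an application
> > could trigger a snapshot from the Fuse mount with setxattr. The key name
> > "glusterfs.file.snapshot.create" and the mount path are hypothetical
> > placeholders, not the actual names used in the POC:
> >
> >     #include <stdio.h>
> >     #include <string.h>
> >     #include <sys/xattr.h>
> >
> >     int main(void)
> >     {
> >         const char *path = "/mnt/glustervol/vm1.img"; /* hypothetical mount path */
> >         const char *snapname = "snap1";
> >
> >         /* setxattr on the Fuse mount; the client stack carries it as a fop
> >          * down to the posix xlator, which issues the reflink ioctl on the
> >          * brick's backend filesystem */
> >         if (setxattr(path, "glusterfs.file.snapshot.create", /* hypothetical key */
> >                      snapname, strlen(snapname), 0) == -1) {
> >             perror("setxattr");
> >             return 1;
> >         }
> >         printf("requested snapshot '%s' of %s\n", snapname, path);
> >         return 0;
> >     }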
> >
> > Are these snapshots internal or external?
> > We will have a separate file each time we create a snapshot; the snapshot
> > file will obviously have a different inode number and will be read-only.
> > All these files are kept in a ".fsnap/" directory under the parent
> > directory where the snapshotted/actual file resides, therefore they will
> > not be visible to the user (even with the ls -a option, just like USS).
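> >
> > For example, the on-disk layout on a brick could look like this (the file
> > and snapshot names are only illustrative):
> >
> >     dir1/vm1.img                 <- actual file
> >     dir1/.fsnap/vm1.img.snap1    <- snapshot 1 (read-only, own inode)
> >     dir1/.fsnap/vm1.img.snap2    <- snapshot 2 (read-only, own inode)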
> >
> > *** We can always restore to any snapshot available in the list, and the
> > best part is that we can delete any snapshot between snapshot1 and
> > snapshotN, because all of them are independent ***
> >
> > It is the application's duty to ensure the consistency of the file before
> > it tries to create a snapshot; for example, in the case of a VM file
> > snapshot it is the hypervisor that should freeze the IO and then request
> > the snapshot.
> >
> >
> >
> > Integration with gluster: (Initial state, need more investigation)
> >
> > Quota:
> > Since the snapshot files reside in the ".fsnap/" directory, which lives
> > under the same directory as the actual file, they fall under the same
> > user's quota :)
> >
> > DHT:
> > As said, the snapshot files will reside in the same directory as the
> > actual file, most likely in a ".fsnap/" directory.
> >
> > Re-balancing:
> > The simplest solution could be: copy the actual file as a whole, then for
> > the snapshot files rsync only the deltas, and recreate the snapshot
> > history by repeating the snapshot sequence after each snapshot-file rsync.
> >
> > AFR:
> > Mostly this will be the same as the write fop (inodelks and quorum). There
> > may be no way to recover or recreate a snapshot on a node (a brick, to be
> > precise) that was down while the snapshot was taken and comes back later.
> >
> > Disperse:
> > Mostly, taking the inodelk and snapshotting the file on each of the bricks
> > should work.
> >
> > Sharding:
> > Assume we have a file split into 4 shards. It would be sufficient to send
> > the snapshot-create fop to all the subvols holding the shards; each shard
> > will then have a snapshot of its own state.
> > The list-snapshots fop should be sent only to the main subvol where
> > shard 0 resides.
> > Deleting a snapshot should be similar to creating one.
> > Restore would be a little difficult because the metadata of the file needs
> > to be updated in the shard xlator.
> > <Needs more investigation>
> > Also, in the case of sharding, the bricks have a gfid-based flat
> > filesystem. Hence the snapshots created will also be in the shard
> > directory, so quota is not straightforward and needs additional work in
> > this case.
> >
> >
> > How can we make it better?
> > Discussion page: http://pad.engineering.redhat.com/kclYd9TPjr
> 
> This link is not accessible externally. Could you move the contents to
> a public location?

Thanks Kaushal, I have copied it to
https://public.pad.fsfe.org/p/Snapshots_in_glusterfs
Let's use this from now on.

-Prasanna
> 
> >
> >
> > Thanks to "Pranith Kumar Karampuri", "Raghavendra Talur", "Rajesh Joseph",
> > "Poornima Gurusiddaiah" and "Kotresh Hiremath Ravishankar"
> > for all initial discussions.
> >
> >
> > -Prasanna
> >
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 

