[Gluster-devel] File snapshot design proposals

Thu Sep 8 11:22:26 UTC 2016

> 1) Doing file snapshot using shards: (This is suggested by shyam, tried to
> keep the text as is)
> If a block for such a file is written to with a higher version then the brick
> xlators can perform a block copy and then change the new block to the new
> version, and let the older version be as is.

How large are shards, and how often will partial-shard writes occur?
Frequent copy-on-write (which is what this really is) could be costly.
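
To put a rough number on that cost, here is a minimal sketch.  It assumes
that the first write to a shard after a snapshot copies the whole shard
unless the write covers it completely, and that every touched shard already
exists in the older version.  The function name and the sizes are
hypothetical, not anything taken from the shard xlator.

```python
def cow_bytes_copied(offset, length, shard_size):
    """Bytes copied from old shards for one write after a snapshot.

    Assumption: a partially written shard is copied in full first;
    a shard that the write covers completely needs no copy.
    """
    if length <= 0:
        return 0
    first = offset // shard_size
    last = (offset + length - 1) // shard_size
    copied = 0
    for shard in range(first, last + 1):
        start = shard * shard_size
        end = start + shard_size
        # Number of bytes of this shard that the write overwrites.
        written = min(end, offset + length) - max(start, offset)
        if written < shard_size:
            copied += shard_size
    return copied


MiB = 1024 * 1024
# A 4 KiB write into a (hypothetical) 4 MiB shard copies the whole shard.
print(cow_bytes_copied(0, 4096, 4 * MiB))
```

Under those assumptions a 4 KiB write costs a 4 MiB copy, so the smaller
the typical write is relative to the shard size, the worse this gets.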

> This leaves blocks with various versions on disk, and when an older snap
> (version) is deleted, then the corresponding blocks are freed.

If a shard is shared between multiple snapshots, this garbage collection
problem becomes significantly more complicated.
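
One way to handle sharing would be per-block reference counts, so that a
block is freed only when the last snapshot using it goes away.  This is
only a sketch of that idea; the class and its methods are hypothetical, and
it ignores the hard parts (persistence, crash consistency, and keeping the
counts in sync across bricks).

```python
from collections import Counter


class ShardRefs:
    """Tracks which snapshots reference which on-disk blocks."""

    def __init__(self):
        self.snapshots = {}    # snapshot name -> {shard index: block id}
        self.refs = Counter()  # block id -> number of snapshots using it

    def add_snapshot(self, name, shard_map):
        self.snapshots[name] = dict(shard_map)
        for block in shard_map.values():
            self.refs[block] += 1

    def delete_snapshot(self, name):
        """Drop a snapshot; return the blocks that can now be freed."""
        freed = []
        for block in self.snapshots.pop(name).values():
            self.refs[block] -= 1
            if self.refs[block] == 0:
                del self.refs[block]
                freed.append(block)
        return sorted(freed)


store = ShardRefs()
store.add_snapshot("snap1", {0: "a", 1: "b"})
store.add_snapshot("snap2", {0: "a", 1: "c"})  # shard 0 is shared
print(store.delete_snapshot("snap1"))  # only "b" is free; "a" is still in use
```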

> 2) Doing a file snapshot using sparse files:
> This is sort of inspired from granular data self-heal idea we wanted to
> implement in afr, where we logically represent each block/shard used in the
> file by a bitmap stored either as an xattr or written to a metafile. So
> there is no physical division of the file into different shards. When a
> snapshot is taken, a new sparsefile is created of same size as before, new
> writes on the file are redirected to this file instead of the original file,
> thus preserving the old file. When a write is performed on this file, we
> mark which block is going to be written, copy out this block from older
> shard, overwrite the buffer and then write to the new version and mark the
> block as used either in xattr/metafile.

Again, this will become a lot more complicated with N snapshots, because
we'd need to record which of N files contains the relevant version of a
block.
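
To illustrate the lookup, assume each version of the file keeps a set of
the block numbers written to it (standing in for the bitmap), ordered
oldest to newest.  The names below are hypothetical.  Reading a block as of
a given version means walking back through the versions until one of them
has the block, which costs up to N checks per read.

```python
def find_block(versions, block, as_of=None):
    """Return the index of the version file holding `block`.

    versions: list of sets of written block numbers, oldest first
              (index 0 is the original file).
    as_of:    index of the version to read from; defaults to the newest.
    Returns None if no version up to `as_of` has written the block,
    i.e. the block is a hole.
    """
    if as_of is None:
        as_of = len(versions) - 1
    for index in range(as_of, -1, -1):
        if block in versions[index]:
            return index
    return None


# Original file has blocks 0-3; two later versions each rewrote one block.
versions = [{0, 1, 2, 3}, {1}, {3}]
print(find_block(versions, 1))           # newest copy of block 1
print(find_block(versions, 3, as_of=1))  # block 3 as seen by version 1
```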

> 3) Doing file snapshots by using reflink functionality given by the underlying
> FS:
> When a snapshot request comes, we just do a reflink of the earlier file to
> the latest version and new writes are redirected to this new version of the
> file.

For the sake of completeness:

4) Log-structured merge tree approach.  This is actually similar to 2, but
using append-only logs instead of bitmaps.  New writes are simply added to
the most recent log.  Reads are satisfied by checking one or more logs
before going to the base file (if any).  A snapshot is simply base plus
logs up to the snapshot time, minus any logs after.
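
A toy in-memory sketch of the idea follows; all names are hypothetical.
For brevity it replays the logs oldest to newest over a copy of the base
rather than checking the newest log first, which gives the same bytes but
is not how a real implementation would want to serve reads.

```python
class LoggedFile:
    """Base contents plus a list of append-only write logs."""

    def __init__(self, base=b""):
        self.base = bytes(base)
        self.logs = [[]]  # each log is a list of (offset, data) records

    def write(self, offset, data):
        # New writes only ever go to the most recent log.
        self.logs[-1].append((offset, bytes(data)))

    def snapshot(self):
        """Seal the current log; return how many logs the snapshot covers."""
        self.logs.append([])
        return len(self.logs) - 1

    def read(self, offset, length, snapshot=None):
        # A snapshot is the base plus the logs up to it, minus later ones.
        count = len(self.logs) if snapshot is None else snapshot
        buf = bytearray(self.base)
        for log in self.logs[:count]:
            for off, data in log:
                end = off + len(data)
                if end > len(buf):
                    buf.extend(b"\0" * (end - len(buf)))
                buf[off:end] = data
        return bytes(buf[offset:offset + length])


f = LoggedFile(b"aaaaaaaa")
f.write(2, b"BB")
snap = f.snapshot()
f.write(0, b"CC")
print(f.read(0, 8))        # current contents
print(f.read(0, 8, snap))  # contents as of the snapshot
```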

This is similar to some elements of JBR, to Cassandra, and to qcow2.  With
specific reference to qcow2 (which is cross-platform), is there a way we
can use that instead of implementing our own version of 2 or 4?

Lastly, how do *writable* snapshots (clones) affect the tradeoffs between
these approaches?