[Gluster-devel] Glusterfs Snapshot feature
rjoseph at redhat.com
Mon Dec 23 03:48:24 UTC 2013
Thanks Paul for going through the Snapshot design wiki and providing your valuable inputs. The wiki needs a lot of updates; I will update it soon and circulate it here again.
The CLI status and list commands will report the space consumed by each snapshot. The snapshot scheduler is planned for phase II. I will update the document with details of phase I and phase II.
As of now the auto-delete feature is based on the maximum allowed snapshot count. Admins can set a limit on the number of snapshots allowed per volume. More details will be provided in the wiki.
We still have to iron out the details of replica-3 and quorum support, but the intention of providing the feature is to allow admins to take snapshots even when some bricks are down, without compromising the integrity of the volume.
When the snapshot count reaches the maximum limit, an auto-delete is triggered. Before deletion the snapshot is stopped, and then it is deleted.
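A rough sketch of that policy in Python (the class and method names here are my own invention for illustration, not the actual glusterd code):

```python
from collections import deque

class SnapshotManager:
    """Illustrative auto-delete policy: keep at most max_snaps per volume."""

    def __init__(self, max_snaps):
        self.max_snaps = max_snaps
        self.snaps = deque()  # oldest snapshot sits at the left end

    def stop(self, snap):
        # Placeholder: the real feature stops the snapshot volume first.
        pass

    def create(self, name):
        # Reaching the limit triggers auto-delete of the oldest snapshot:
        # it is stopped first, then deleted (dropped from the queue).
        if len(self.snaps) >= self.max_snaps:
            oldest = self.snaps.popleft()
            self.stop(oldest)
        self.snaps.append(name)

mgr = SnapshotManager(max_snaps=3)
for n in ["s1", "s2", "s3", "s4"]:
    mgr.create(n)
print(list(mgr.snaps))  # ['s2', 's3', 's4'] -- s1 was auto-deleted
```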
User-serviceable snapshots are planned for phase II, and yes, the .snap directory will be listed in accordance with the snapshot time-stamp.
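A small sketch of that listing rule (the snapshot tuples below are made-up sample data): entries are ordered by creation time-stamp, not by the user-chosen name.

```python
# (user-chosen name, creation time as epoch seconds) -- made-up sample data
snaps = [
    ("nightly-backup", 1387100000),
    ("before-upgrade", 1387200000),
    ("adhoc", 1387150000),
]

# The .snap directory lists by creation time, not lexical name order.
listing = [name for name, created in sorted(snaps, key=lambda s: s[1])]
print(listing)  # ['nightly-backup', 'adhoc', 'before-upgrade']
```

So a snapshot with a name that sorts "later" alphabetically can still appear earlier in the listing if it was created earlier.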
A snapshot can be restored only when the volume is stopped; it is a completely offline activity. More details will be provided in the wiki.
Thanks & Regards,
----- Original Message -----
From: "Paul Cuzner" <pcuzner at redhat.com>
To: "Rajesh Joseph" <rjoseph at redhat.com>
Cc: "gluster-devel" <gluster-devel at nongnu.org>
Sent: Monday, December 23, 2013 3:50:45 AM
Subject: Re: [Gluster-devel] Glusterfs Snapshot feature
I've just read through the snapshot spec on the wiki page in the forge, and have been looking at it through a storage admin's eyes.
There are a couple of items that are perhaps already addressed but not listed on the wiki; just in case, here are my thoughts:
The CLI definition doesn't define a snap usage command, i.e. a way for the admin to understand which snap is consuming the most space. With snaps come misuse and capacity issues, so our implementation should provide the admin with the information to make the right 'call'.
I think I've mentioned this before, but how are snaps orchestrated across the cluster? I see a CLI to start it, but what's needed is a cron-like ability associated per volume. I think in a previous post I called this a "snap schedule" command.
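To make the idea concrete, a per-volume schedule could be as simple as mapping each volume to an interval and snapping whichever volumes are due (everything here is hypothetical; no such command exists in the current design):

```python
# Hypothetical per-volume schedule: volume name -> snapshot interval (seconds).
schedules = {"vol0": 3600, "vol1": 86400}   # hourly and daily
last_snap = {"vol0": 0, "vol1": 0}          # epoch time of each volume's last snapshot

def due_volumes(now):
    """Return the volumes whose next scheduled snapshot is overdue."""
    return [vol for vol, interval in schedules.items()
            if now - last_snap[vol] >= interval]

# Two hours in, only the hourly volume is due; the daily one is not.
print(due_volumes(7200))  # ['vol0']
```

A real implementation would presumably hang this off cron or an internal timer and update `last_snap` after each successful create.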
A common thing I've seen on other platforms is unexpected space usage, due to changes in data access patterns generating more deltas. I've seen this a lot in virtual environments, when service packs/maintenance get rolled out, for example. In these situations capacity can soon run out, so an auto-delete feature to drop snaps and ensure the real volume stays online would seem like a sensible approach.
The comments around quorum and replica 3 enable an exception to a basic rule: fail a snap request if the cluster is not in a healthy state. I would argue against making exceptions and keep things simple: if a node/brick is unavailable, or there is a cluster reconfig in progress, raise an alert and fail the snap request. I think this is a more straightforward approach for admins to get their heads around than thinking about specific volume types and potential cleanup activity.
It's not clear from the wiki how snaps are deleted. For example, when snap-max-limit is reached, does creating a new snapshot automatically trigger the delete of the oldest snap? If so, presumably the delete will only be actioned once the barrier is in place across all affected bricks.
The snap create allows a user-defined name, which is a great idea. However, in what order will the snaps be resolved when the user opens the .snap directory? Will the snapshots' create time determine the order the user sees regardless of name, or could a name for a more recent time appear lower in the list, causing a potential recovery point to be missed by the end user?
snap restore presumably requires the volume to be offline and the bricks unmounted? There's no detail in the scoping document about how the restore capability is intended to be implemented. Restore will be a drastic action, so understanding the implications for the various protocols (SMB, NFS, native, Swift and gfapi) will be key. Perhaps the simple answer is that the volume must be in a stopped state first?
There is some mention of a phased implementation for snapshots; phase-1 is mentioned towards the end of the doc, for example. Perhaps it would be beneficial to define the phases at the start of the article and list the features likely to be in each phase. This may help focus feedback specific to the initial implementation.
Like I said - I'm looking at this through the eyes of an "old admin" ;)
----- Original Message -----
> From: "Rajesh Joseph" <rjoseph at redhat.com>
> To: "gluster-devel" <gluster-devel at nongnu.org>
> Sent: Friday, 6 December, 2013 1:05:29 AM
> Subject: [Gluster-devel] Glusterfs Snapshot feature
> Hi all,
> We are implementing snapshot support for glusterfs volumes in release-3.6.
> The design document can be found at
> The document needs some update and I will be doing that in the coming weeks.
> All the work done till now can be seen at review.gluster.com. The project
> name for the snapshot work is "glusterfs-snapshot".
> All suggestions and comments are welcome.
> Thanks & Regards,
> Gluster-devel mailing list
> Gluster-devel at nongnu.org