[Gluster-devel] Stateless Nodes - HowTo - was Re: glusterfs-3.3.0qa34 released
Ian Latter
ian.latter at midnightcode.org
Wed May 29 11:25:39 UTC 2013
Hello,
Following up on this thread I upgraded to GlusterFS 3.3.1 where the glusterd behavior was slightly different.
In 3.3.0 I observed;
> When you do this in the startup process you can skip the "gluster
> peer probe" and simply call the "gluster volume create" as we did in
> a non clustered environment, on every node as it boots (on every
> boot, including the first). The nodes that are late to the party will be
> told that the configuration already exists, and the clustered volume
> *should* come up.
In 3.3.1 this was reversed: a clustered "glusterd" would refuse to accept a volume create (for a distributed volume) if one of the nodes was down. When all nodes were up, the last booted node could run the "gluster volume create" to kick off the cluster, and any node could then issue the "gluster volume start".
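To illustrate, the working order on my two-node setup looks roughly like this (volume name and brick paths are the ones used in my configuration further below):
  # On the last node to boot, once both peers are reachable:
  gluster volume create myvolume \
      192.168.179.101:/glusterfs/exports/hda \
      192.168.179.102:/glusterfs/exports/hda
  # Then, from any node:
  gluster volume start myvolume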
Further to this, so long as one node "exists", the in-memory configuration of the clustered volumes persists within glusterd. I.e. for a two-node cluster;
- boot node 1 - clean drives, establish cluster relations and sit
- boot node 2 - clean drives, establish cluster relations and configure a
cluster volume
- reboot node 1 - clean drives, establish cluster relations and ..
.. at this point node 1 and node 2 have NFS shares successfully running but only node 2 has its bricks successfully serving the DHT volume (observable via "gluster volume status"). Performing a "gluster volume stop" and "gluster volume start" on the clustered volume will reassert both nodes' bricks in the volume.
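In CLI terms that recovery is just (volume name again taken from my configuration below):
  gluster volume status myvolume    # node 1's brick is absent/offline
  gluster volume stop myvolume      # prompts for confirmation
  gluster volume start myvolume
  gluster volume status myvolume    # both bricks now listed as online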
I have now codified this into a 28Mbyte firmware image - here;
http://midnightcode.org/projects/saturn/code/midnightcode-saturn-0.1-20130526230106-usb.burn.gz
I have documented the installation, configuration and operations of that image in a sizable manual - here;
http://midnightcode.org/papers/Saturn%20Manual%20-%20Midnight%20Code%20-%20v0.1.pdf
What I'm not happy about is my understanding of the recovery strategy, from a native GlusterFS perspective, for a DHT volume.
I.e. when a node that is providing a brick to a gluster cluster volume reboots, what is the intended recovery strategy for that distributed volume with respect to the lost/found brick?
Is it recommended that the volume be stopped and started to rejoin the lost/found brick? Or is there a method for re-introducing the lost/found brick that is transparent from a client perspective, such as a "brick delete" then "brick add" for that node in the volume? As it wasn't clear to me what the impact of removing and adding a brick would be on a DHT volume (either regarding the on-disk data/attr state or the volume's future performance), I didn't pursue this path. If you can "brick delete", is that the preferred method for shutting down a node in order to cleanly umount the disk under the brick, rather than shutting down the entire volume whenever a single node drops out?
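For reference, the untested sequence I was contemplating would presumably look something like this - noting that the actual 3.3 sub-commands are "remove-brick" and "add-brick" rather than "brick delete"/"brick add":
  # Untested sketch: drain and drop one node's brick ahead of its reboot ...
  gluster volume remove-brick myvolume 192.168.179.101:/glusterfs/exports/hda start
  gluster volume remove-brick myvolume 192.168.179.101:/glusterfs/exports/hda status
  gluster volume remove-brick myvolume 192.168.179.101:/glusterfs/exports/hda commit
  # ... then re-add it after the reboot and rebalance the layout
  gluster volume add-brick myvolume 192.168.179.101:/glusterfs/exports/hda
  gluster volume rebalance myvolume start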
I recognise that DHT is not an HA (highly available) solution; I'm simply looking for the recommended operational and recovery strategies for DHT volumes in multi-node configurations.
My timing isn't good - I understand that everyone here is (rightfully) focused on 3.4.0. When time is available further down the track, could someone in the know please steer me through the current gluster architecture with respect to the above query?
Thanks,
----- Original Message -----
>From: "Ian Latter" <ian.latter at midnightcode.org>
>To: <gluster-devel at nongnu.org>
>Subject: [Gluster-devel] Stateless Nodes - HowTo - was Re: glusterfs-3.3.0qa34 released
>Date: Fri, 17 May 2013 00:37:25 +1000
>
> Hello,
>
>
> Well I can't believe that it's been more than a year since I started looking into a stateless cluster implementation for GlusterFS .. time flies eh.
>
> First - what do I mean by "stateless"? I mean that;
> - the user configuration of the operating environment is maintained outside of the OS
> - the operating system is destroyed on reboot or power off, and all OS and application configuration is irrecoverably lost
> - on each boot we want to get back to the user's preferred/configured operating environment through the most normal methods possible (preferably, the same commands and JIT-built config files that were used to configure the system the first time should be used every time).
>
> In this way, you could well argue that the OE (operating environment) state is maintained in a type of provisioning or orchestration tool, outside of the OS and application instances (or, in my case, in the Saturn configuration file that is the only persistent data maintained between running OE instances).
>
> Per the thread below, to get a stateless node (no clustering involved) we would remove the xattr values from each shared brick, on boot;
> removexattr(mount_point, "trusted.glusterfs.volume-id")
> removexattr(mount_point, "trusted.gfid")
>
> And then we would populate glusterd/glusterd.info with an externally stored UUID (to make it consistent across boots). These three actions would allow the CLI "gluster volume create" commands to run unimpeded - thanks to Amar for that detail.
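>
> On a running system those three actions amount to roughly the following, where the brick path and UUID are the ones from my examples further below, and setfattr (from the attr package) is the userland equivalent of the removexattr() calls above;
>   # Clear the brick's identity xattrs so "volume create" will accept it again
>   setfattr -x trusted.glusterfs.volume-id /glusterfs/exports/hda
>   setfattr -x trusted.gfid /glusterfs/exports/hda
>   # Re-seed glusterd with the externally stored server UUID
>   echo "UUID=6b481ebb-859a-4c2b-8b5f-8f0bba7c3b9a" > /etc/glusterd/glusterd.info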
>
> Note 1: we've only been experimenting with DHT/Distribute, so I don't know if other Gluster xlator modules have pedantic needs in addition to the above.
> Note 2: my glusterd directory is in /etc (/etc/glusterd/glusterd.info), whereas the current location in the popular distros is, I believe, /var/lib (/var/lib/glusterd/glusterd.info), so I will refer to the relative path in this message.
>
>
> But we have finally scaled out beyond the limits of our largest chassis (down to 1TB free) and need to cluster to add on more capacity via the next chassis. Over the past three nights I've had a chance to experiment with GlusterFS 3.3.0 (I will be looking at 3.4.0 shortly) and create a "distribute" volume between two clustered nodes. To get a stateless outcome we then need to be able to boot one node from scratch and have it re-join the cluster and volume from only the "gluster" CLI command/s.
>
> For what it's worth, I couldn't find a way to do this. The peer probing model doesn't seem to allow an old node to rejoin the cluster.
>
> So many thanks to Mike of FunWithLinux for this post and steering me in the right direction;
> http://funwithlinux.net/2013/02/glusterfs-tips-and-tricks-centos/
>
> The trick seems to be (in addition to the non-cluster configs, above) to manage the cluster membership outside of GlusterFS. On boot, we automatically populate the relevant peer file
> (glusterd/peers/{uuid}) with the UUID, state=3, and hostname/IP address; one file for each other node in the cluster (excluding the local node). I.e.
>
> # cat /etc/glusterd/peers/ab2d5444-5a01-427a-a322-c16592676d29
> uuid=ab2d5444-5a01-427a-a322-c16592676d29
> state=3
> hostname1=192.168.179.102
>
> Note that if you're using IP addresses as your node handle (as opposed to host names) then you must retain the same IP address across boots for this to work; otherwise you will have to make modifications to the existing/running cluster nodes that will require glusterd to be restarted.
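>
> In Saturn something like the following in the boot script (run before glusterd starts, with the UUID, state and address from the example peer file above) is enough to recreate the peer entry;
>   mkdir -p /etc/glusterd/peers
>   {
>     echo "uuid=ab2d5444-5a01-427a-a322-c16592676d29"
>     echo "state=3"
>     echo "hostname1=192.168.179.102"
>   } > /etc/glusterd/peers/ab2d5444-5a01-427a-a322-c16592676d29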
>
> When you do this in the startup process you can skip the "gluster peer probe" and simply call the "gluster volume create" as we did in a non clustered environment, on every node as it boots (on every boot, including the first). The nodes that are late to the party will be told that the configuration already exists, and the clustered volume *should* come up.
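>
> In the boot script this means the same create/start pair can be fired unconditionally on every boot, tolerating the "already exists" responses from the late nodes; roughly (volume name and bricks as per the state dumps below);
>   gluster volume create myvolume \
>       192.168.179.101:/glusterfs/exports/hda \
>       192.168.179.102:/glusterfs/exports/hda || true
>   gluster volume start myvolume || true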
>
> I am still experimenting, but I say "should" because you can sometimes see a delay in the re-establishment of the clustered volume, and you can sometimes see the clustered volume fail to re-establish. When it fails to re-establish, the solution seems to be a "gluster volume start" for that volume, on any node. FWIW I believe I'm seeing this locally because Saturn tries to nicely stop all Gluster volumes on reboot, which is affecting the cluster (of course) - lol - a little more integration work to do.
>
>
> The external state needed then looks like this on the first node (101);
>
> set gluster server uuid 6b481ebb-859a-4c2b-8b5f-8f0bba7c3b9a
> set gluster peer0 uuid ab2d5444-5a01-427a-a322-c16592676d29
> set gluster peer0 ipv4_address 192.168.179.102
> set gluster volume0 name myvolume
> set gluster volume0 is_enabled 1
> set gluster volume0 uuid 00000000-0000-0000-0000-000000000000
> set gluster volume0 interface eth0
> set gluster volume0 type distribute
> set gluster volume0 brick0 /dev/hda
> set gluster volume0 brick1 192.168.179.102:/glusterfs/exports/hda
>
> And the external state needed looks like this on the second node (102);
>
> set gluster server uuid ab2d5444-5a01-427a-a322-c16592676d29
> set gluster peer0 uuid 6b481ebb-859a-4c2b-8b5f-8f0bba7c3b9a
> set gluster peer0 ipv4_address 192.168.179.101
> set gluster volume0 name myvolume
> set gluster volume0 is_enabled 1
> set gluster volume0 uuid 00000000-0000-0000-0000-000000000000
> set gluster volume0 interface eth0
> set gluster volume0 type distribute
> set gluster volume0 brick0 192.168.179.101:/glusterfs/exports/hda
> set gluster volume0 brick1 /dev/hda
>
> Note that I assumed there was a per-volume UUID (currently all zeros) that I would need to reinstate, but I haven't seen it yet (presumably it's one of the values currently being removed from the mount point xattrs on each boot).
>
>
> I hope that this information helps others who are trying to dynamically provision and re-provision virtual/infrastructure environments. I note that this information covers a topic that has not been written up on the Gluster site;
>
> HowTo - GlusterDocumentation
> http://www.gluster.org/community/documentation/index.php/HowTo
> [...]
> Articles that need to be written
> Troubleshooting
> - UUID's and cloning Gluster instances
> - Verifying cluster integrity
> [...]
>
>
> Please feel free to use this content to help contribute to that FAQ/HowTo document.
>
>
> Cheers,
>
>
> ----- Original Message -----
> >From: "Ian Latter" <ian.latter at midnightcode.org>
> >To: "Amar Tumballi" <amarts at redhat.com>
> >Subject: Re: [Gluster-devel] glusterfs-3.3.0qa34 released
> >Date: Wed, 18 Apr 2012 18:55:46 +1000
> >
> >
> > ----- Original Message -----
> > >From: "Amar Tumballi" <amarts at redhat.com>
> > >To: "Ian Latter" <ian.latter at midnightcode.org>
> > >Subject: Re: [Gluster-devel] glusterfs-3.3.0qa34 released
> > >Date: Wed, 18 Apr 2012 13:42:45 +0530
> > >
> > > On 04/18/2012 12:26 PM, Ian Latter wrote:
> > > > Hello,
> > > >
> > > >
> > > > I've written a work around for this issue (in 3.3.0qa35)
> > > > by adding a new configuration option to glusterd
> > > > (ignore-strict-checks) but there are additional checks
> > > > within the posix brick/xlator. I can see that volume starts
> > > > but the bricks inside it fail shortly there-after, and that
> > > > of the 5 disks in my volume three of them have one
> > > > volume_id and two of them have another - so this isn't going
> > > > to be resolved without some human intervention.
> > > >
> > > > However, while going through the posix brick/xlator I
> > > > found the "volume-id" parameter. I've tracked it back
> > > > to the volinfo structure in the glusterd xlator.
> > > >
> > > > So before I try to code up a posix inheritance for my
> > > > glusterd work around (ignoring additional checks so
> > > > that a new volume_id is created on-the-fly / as-needed),
> > > > does anyone know of a CLI method for passing the
> > > > volume-id into glusterd (either via "volume create" or
> > > > "volume set")? I don't see one from the code ...
> > > > glusterd_handle_create_volume does a uuid_generate
> > > > and its not a feature of glusterd_volopt_map ...
> > > >
> > > > Is a user defined UUID init method planned for the CLI
> > > > before 3.3.0 is released? Is there a reason that this
> > > > shouldn't be permitted from the CLI "volume create" ?
> > > >
> > > >
> > > We don't want to bring in this option to CLI. That is
> > > because we don't think it is right to confuse USER with
> > > more options/values. 'volume-id' is an internal thing for the
> > > user, and we don't want him to know about it in normal use
> > > cases.
> > >
> > > In case of 'power-users' like you, if you know what you
> > > are doing, the better solution is to do 'setxattr -x
> > > trusted.volume-id $brick' before starting the brick, so the
> > > posix translator anyway doesn't get bothered.
> > >
> > > Regards,
> > > Amar
> > >
> >
> >
> > Hello Amar,
> >
> > I wouldn't go so far as to say that I know what I'm
> > doing, but I'll take the compliment ;-)
> >
> > Thanks for the advice. I'm going to assume that I'll
> > be revisiting this issue when we can get back into
> > clustering (replicating distributed volumes). I.e. I'm
> > assuming that on this path we'll end up driving out
> > issues like split brain;
> >
> > https://github.com/jdarcy/glusterfs/commit/8a45a0e480f7e8c6ea1195f77ce3810d4817dc37
> >
> >
> > Cheers,
> >
> >
> >
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel at nongnu.org
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
>
>
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
--
Ian Latter
Late night coder ..
http://midnightcode.org/