[Gluster-users] A bunch of comments about brick management

James purpleidea at gmail.com
Tue Nov 12 00:05:03 UTC 2013


On Mon, 2013-11-11 at 12:50 +0530, Kaushal M wrote:
> Hey James,
Hey,

> I'm replying inline about your observations. Some of your observations
> are valid bugs. But most of them are such because they haven't been
> implemented and are feature requests.
That's what I expected in some cases. My replies are also inline, below.

> 
> On Wed, Oct 30, 2013 at 4:10 PM, James <purpleidea at gmail.com> wrote:
> > Hey there,
> >
> > I've been madly hacking on cool new puppet-gluster features... In my
> > sleep-deprived state, I've put together some comments about the gluster
> > add/remove-brick features. Hopefully they are useful and make sense.
> > These are sort of "bugs". Have a look, and let me know if I should
> > formally report any of them...
> >
> > Cheers...
> > James
> >
> > PS: this is also mirrored here:
> > http://paste.fedoraproject.org/50402/12956713
> > because email has destroyed formatting :P
> >
> >
> > All tests were done on gluster 3.4.1, using CentOS 6.4 on VMs.
> > The firewall was disabled for testing purposes.
> > gluster --version
> > glusterfs 3.4.1 built on Sep 27 2013 13:13:58
> >
> >
> > ### 1) simple operations shouldn't fail
> > # running the following commands in succession without files:
> > # gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
> > # gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start ... status
> >
> > shows a failure:
> >
> > [root@vmx1 ~]# gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
> > volume add-brick: success
> > [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        0         0           not started              0.00
> >     vmx2.example.com                 0  0Bytes        0         0           not started              0.00
> > [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start
> > volume remove-brick start: success
> > ID: ecbcc2b6-4351-468a-8f53-3a09159e4059
> > [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        8         0             completed              0.00
> >     vmx2.example.com                 0  0Bytes        0         1                failed              0.00
> 
> I don't know why the process failed on one node. This is a
> rebalance issue, not a CLI one, and is a valid bug. If you can
> reproduce it consistently, then please file a bug for this.

Done: https://bugzilla.redhat.com/show_bug.cgi?id=1029235
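
For anyone who wants to try reproducing it, the sequence boils down to
roughly this sketch. It assumes the same two-node setup and examplevol
volume as above, with the brick directories already present on both
hosts:

#!/bin/bash
# Rough reproduction sketch: add a pair of empty bricks, then immediately
# start removing them again and check the migration status on each node.
set -e
B1='vmx1.example.com:/tmp/foo9'
B2='vmx2.example.com:/tmp/foo9'

gluster volume add-brick examplevol $B1 $B2
gluster volume remove-brick examplevol $B1 $B2 start
sleep 5   # give the (trivial) migration a moment to run
gluster volume remove-brick examplevol $B1 $B2 status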


> 
> > [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 commit
> > Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
> > volume remove-brick commit: success
> > [root@vmx1 ~]#
> >
> > ### 1b) on the other node, the output shows an extra row (also including
> > the failure)
> >
> > [root@vmx2 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        0         0             completed              0.00
> >            localhost                 0  0Bytes        0         0             completed              0.00
> >     vmx1.example.com                 0  0Bytes        0         1                failed              0.00
> >
> 
> This is a bug, which I've seen a few other times as well, but I haven't
> gotten around to filing or fixing it. So file a bug for this as well.

Filed: https://bugzilla.redhat.com/show_bug.cgi?id=1029237
Feel free to CC yourself and add an "ack, I've seen this too" if you
like.

> 
> >
> > ### 2) formatting:
> >
> > # the "skipped" column doesn't seem to have any data; as a result, the
> > # formatting is broken...
> > # this formatting problem is obviously not seen in the more useful --xml
> > # output below, but the 'skipped' column is missing there too.
> >
> > [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        8         0             completed              0.00
> >     vmx2.example.com                 0  0Bytes        8         0             completed              0.00
> >
> >
> > <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> > <cliOutput>
> >   <opRet>0</opRet>
> >   <opErrno>115</opErrno>
> >   <opErrstr/>
> >   <volRemoveBrick>
> >     <task-id>d99cab76-cd7d-4579-80ae-c1e6faff3d1d</task-id>
> >     <nodeCount>2</nodeCount>
> >     <node>
> >       <nodeName>localhost</nodeName>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>8</lookups>
> >       <failures>0</failures>
> >       <status>3</status>
> >       <statusStr>completed</statusStr>
> >     </node>
> >     <node>
> >       <nodeName>vmx2.example.com</nodeName>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>8</lookups>
> >       <failures>0</failures>
> >       <status>3</status>
> >       <statusStr>completed</statusStr>
> >     </node>
> >     <aggregate>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>16</lookups>
> >       <failures>0</failures>
> >       <status>3</status>
> >       <statusStr>completed</statusStr>
> >     </aggregate>
> >   </volRemoveBrick>
> > </cliOutput>
> >
> 
> The skipped count wasn't available in the rebalance status xml output in
> 3.4.0, but it has been added recently and should be available in 3.4.2
> (http://review.gluster.org/6000).
> Regarding the missing skipped count in the remove-brick status output, it
> might be because for remove-brick a skipped file is treated as a
> failure. This might also have been fixed in 3.4.2, but that needs to be
> checked to confirm.

Not sure how you'd like me to proceed, so for now I'll just set this
aside and hope I won't have any issues in 3.4.2. If I do, and I notice
them, I'll repost this if I remember :P
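
If it helps to check a given build, a quick sketch like this should show
whether the per-node skipped count is present in the xml (xmllint is
just what I happen to have handy; adjust the volume and brick names):

# prints the number of <skipped> elements in the remove-brick status xml;
# 0 means the build doesn't emit the field (yet)
gluster volume remove-brick examplevol \
    vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status --xml \
    | xmllint --xpath 'count(//node/skipped)' -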


> 
> >
> > ### 3)
> > [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        8         0             completed              0.00
> >     vmx2.example.com                 0  0Bytes        8         0             completed              0.00
> > [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 commit
> > Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
> > volume remove-brick commit: success
> >
> > This shouldn't warn you that you might experience data loss. If the
> > rebalance has completed successfully, and the bricks are no longer
> > accepting new files, then gluster should know this and just let you
> > commit safely.
> > I guess you can consider this a UI bug, as long as it checks beforehand
> > that it's safe.
> >
> 
> The warning is given by the CLI, before even contacting glusterd. The
> CLI doesn't have any information regarding the status of a remove-brick
> command's rebalance process. It gives the warning for all 'remove-brick
> commit' commands, be it a forceful removal or a clean removal with
> rebalancing. This would be a nice feature to have though, but it
> requires changes to the way the CLI and glusterd communicate.

Thanks for the explanation. I suppose this makes me feel better about
the remove-brick process, although an (atomic?) safety mechanism built
into the CLI command would be great. That way I'd only answer "okay" to
commands that will actually cause danger.

If you think this needs an RFE bug, I can open it, but with your deeper
understanding of the situation you're probably better suited to do it.
If you open something, point me to the URL and I can add the above
comments.
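
In the meantime, that sort of guard can at least be approximated outside
the CLI. A sketch of what I mean, using the <aggregate><statusStr> field
from the --xml status output shown earlier (I'm assuming xmllint is
available, and guessing that a failed migration also shows up as
'failed' in the aggregate):

#!/bin/bash
# Sketch: only commit the remove-brick once the data migration reports
# 'completed'; bail out instead of committing if it reports 'failed'.
VOL='examplevol'
BRICKS='vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9'

while true; do
    state=$(gluster volume remove-brick $VOL $BRICKS status --xml \
            | xmllint --xpath 'string(//aggregate/statusStr)' -)
    [ "$state" = 'completed' ] && break
    [ "$state" = 'failed' ] && { echo 'migration failed, not committing' >&2; exit 1; }
    sleep 10
done

# The CLI still asks the (unconditional) data-loss question, hence the piped 'y'.
echo y | gluster volume remove-brick $VOL $BRICKS commit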


> 
> >
> > ### 4)
> > Aggregate "totals", aka the <aggregate> </aggregate> data isn't shown in
> > the normal command line output.
> >
> 
> The aggregate totals were added to the xml output because the oVirt
> team, which was the driver behind having XML output, requested it. It
> should be simple enough to add them to the normal output. If it is
> desired, please raise an RFE bug.
puppet-gluster is a big consumer of the --xml output too ;) If there's
somewhere I should subscribe for API breakage/changes, I'd love to be
CC-ed.

In any case, I don't really need the aggregate information in the
non-xml output; I just figured I'd mention it. Thanks for the
explanation.
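
For what it's worth, pulling the aggregate numbers out of the xml is
already easy enough for scripts, e.g. this sketch grabs the aggregate
failure count from a rebalance status (again assuming xmllint):

gluster volume rebalance examplevol status --xml \
    | xmllint --xpath 'string(//aggregate/failures)' -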

> 
> >
> > ### 5) the volume shouldn't have to be "started" for a rebalance to
> > work... we might want to do a rebalance, but keep the volume "stopped"
> > so that clients can't mount it.
> > This is probably due to gluster needing it "online" to rebalance, but
> > nonetheless, it doesn't match what users/sysadmins expect.
> >
> 
> Gluster requires the bricks to be running for a rebalance to happen, so
> we cannot start a rebalance with the volume stopped. But we could
> introduce a mechanism to barrier client access to the volume during a
> rebalance. This kind of barriering is being considered for the
> volume-level snapshot feature planned for 3.6. But since a rebalance is
> a long-running process compared to a snapshot, there might be certain
> difficulties in barriering for its duration.
Maybe the semantics of volumes being "started" or "stopped" should be
changed to something like this:

If glusterd is running, then the volume is "started" (in the plumbing
sort of sense), but when you run gluster volume "start" you make it
available for access by clients, etc...

If there's an RFE to comment on, let me know. This would be a very
useful feature for users doing automatic rebalances... Maybe that's a
dangerous operation ATM, but ultimately it should be made safe enough
that automatic scripts don't have to worry about borking the volume.
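
As a stop-gap, I wonder if the existing IP-based access-control options
could fake that "plumbing up, but no clients" state. An untested sketch
of what I mean (I'm assuming auth.reject '*' blocks new mounts, and I
don't know what happens to clients that are already connected):

# block new client mounts, rebalance, then open the volume back up
gluster volume set examplevol auth.reject '*'
gluster volume rebalance examplevol start
# ... wait for 'gluster volume rebalance examplevol status' to complete ...
gluster volume reset examplevol auth.reject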

> 
> >
> > ### 6) in the command: gluster volume rebalance myvolume status ;
> > gluster volume rebalance myvolume status --xml && echo t
> > Nowhere does it mention the volume, or the specific bricks which are
> > being [re-]balanced. In particular, a volume name would be especially
> > useful in the --xml output.
> > This would be useful if multiple rebalances are going on... I realize
> > this is because the rebalance command only allows you to specify one
> > volume at a time, but to be consistent with other commands, a volume
> > rebalance status command should let you get info on many volumes.
> > Also, per-brick information is still missing.
> >
> >
> >
> >                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run time in secs
> >            ---------  ----------------  ------  -------  --------  -------  -----------  ----------------
> >            localhost                 0  0Bytes        2         0        0  in progress              1.00
> >     vmx2.example.com                 0  0Bytes        7         0        0  in progress              1.00
> > volume rebalance: examplevol: success:
> > <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> > <cliOutput>
> >   <opRet>0</opRet>
> >   <opErrno>115</opErrno>
> >   <opErrstr/>
> >   <volRebalance>
> >     <task-id>c5e9970b-f96a-4a28-af14-5477cf90d638</task-id>
> >     <op>3</op>
> >     <nodeCount>2</nodeCount>
> >     <node>
> >       <nodeName>localhost</nodeName>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>2</lookups>
> >       <failures>0</failures>
> >       <status>1</status>
> >       <statusStr>in progress</statusStr>
> >     </node>
> >     <node>
> >       <nodeName>vmx2.example.com</nodeName>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>7</lookups>
> >       <failures>0</failures>
> >       <status>1</status>
> >       <statusStr>in progress</statusStr>
> >     </node>
> >     <aggregate>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>9</lookups>
> >       <failures>0</failures>
> >       <status>1</status>
> >       <statusStr>in progress</statusStr>
> >     </aggregate>
> >   </volRebalance>
> > </cliOutput>
> >
> 
> Having the volume name in the xml output is a valid enhancement. Go
> ahead and open an RFE bug for it.
> The rebalance process on each node crawls the whole volume to find
> files which need to be migrated and which are present on bricks of the
> volume belonging to that node. So the rebalance status of a node can
> be considered the status of the brick. But if a node contains more
> than one brick of the volume being rebalanced, we don't have a way to
> differentiate, and I'm not sure if we could do that.

Fair enough. Opened:
https://bugzilla.redhat.com/show_bug.cgi?id=1029239
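
Until the volume name lands in the xml itself, anything polling several
volumes has to carry the name along out-of-band, roughly like this
sketch (assuming 'gluster volume list' is available to enumerate them):

# poll rebalance status for every volume, tagging each xml blob with the
# volume name ourselves, since the xml doesn't carry it
for vol in $(gluster volume list); do
    echo "=== ${vol} ==="
    gluster volume rebalance "${vol}" status --xml 2>/dev/null || true
done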


Thanks for your replies.
Cheers,
James

