[Gluster-users] A bunch of comments about brick management

Kaushal M kshlmster at gmail.com
Mon Nov 11 07:20:52 UTC 2013


Hey James,
I'm replying inline to your observations. Some of them are valid bugs,
but most of them describe behaviour that simply hasn't been implemented
yet, and so are really feature requests.

On Wed, Oct 30, 2013 at 4:10 PM, James <purpleidea at gmail.com> wrote:
> Hey there,
>
> I've been madly hacking on cool new puppet-gluster features... In my
> sleep-deprived state, I've put together some comments about gluster
> add/remove brick features. Hopefully they are useful and make sense.
> These are sort of "bugs". Have a look, and let me know if I should formally report
> any of these...
>
> Cheers...
> James
>
> PS: this is also mirrored here:
> http://paste.fedoraproject.org/50402/12956713
> because email has destroyed formatting :P
>
>
> All tests are done on gluster 3.4.1, using CentOS 6.4 on vm's.
> Firewall has been disabled for testing purposes.
> gluster --version
> glusterfs 3.4.1 built on Sep 27 2013 13:13:58
>
>
> ### 1) simple operations shouldn't fail
> # running the following commands in succession without files:
> # gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
> # gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start ... status
>
> shows a failure:
>
> [root@vmx1 ~]# gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
> volume add-brick: success
> [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        0         0           not started              0.00
>     vmx2.example.com                 0  0Bytes        0         0           not started              0.00
> [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start
> volume remove-brick start: success
> ID: ecbcc2b6-4351-468a-8f53-3a09159e4059
> [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        8         0             completed              0.00
>     vmx2.example.com                 0  0Bytes        0         1                failed              0.00

I don't know why the process failed on one node. This is a rebalance
issue, not a CLI one, and is a valid bug. If you can reproduce it
consistently, please file a bug for it.
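
If it happens again, the rebalance log on the node that reported the
failure usually records which file or directory the migration tripped
on, and attaching that to the bug report helps. A quick way to pull the
relevant lines (assuming the default log location,
/var/log/glusterfs/<volname>-rebalance.log):

  # On the node that reported the failure (path assumes the default log directory):
  grep -i ' E ' /var/log/glusterfs/examplevol-rebalance.log | tail -n 20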

> [root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 commit
> Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
> volume remove-brick commit: success
> [root@vmx1 ~]#
>
> ### 1b) on the other node, the output shows an extra row (also including
> the failure)
>
> [root@vmx2 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        0         0             completed              0.00
>            localhost                 0  0Bytes        0         0             completed              0.00
>     vmx1.example.com                 0  0Bytes        0         1                failed              0.00
>

This is a bug which I've seen a few times myself, but I haven't gotten
around to filing or fixing it. Please file a bug for this as well.

>
> ### 2) formatting:
>
> # the "skipped" column doesn't seem to have any data, as a result
> formatting is broken...
> # this problem is obviously not seen in the more useful --xml output
> below. neither is the 'skipped' column.
>
> [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        8         0             completed              0.00
>     vmx2.example.com                 0  0Bytes        8         0             completed              0.00
>
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <cliOutput>
>   <opRet>0</opRet>
>   <opErrno>115</opErrno>
>   <opErrstr/>
>   <volRemoveBrick>
>     <task-id>d99cab76-cd7d-4579-80ae-c1e6faff3d1d</task-id>
>     <nodeCount>2</nodeCount>
>     <node>
>       <nodeName>localhost</nodeName>
>       <files>0</files>
>       <size>0</size>
>       <lookups>8</lookups>
>       <failures>0</failures>
>       <status>3</status>
>       <statusStr>completed</statusStr>
>     </node>
>     <node>
>       <nodeName>vmx2.example.com</nodeName>
>       <files>0</files>
>       <size>0</size>
>       <lookups>8</lookups>
>       <failures>0</failures>
>       <status>3</status>
>       <statusStr>completed</statusStr>
>     </node>
>     <aggregate>
>       <files>0</files>
>       <size>0</size>
>       <lookups>16</lookups>
>       <failures>0</failures>
>       <status>3</status>
>       <statusStr>completed</statusStr>
>     </aggregate>
>   </volRemoveBrick>
> </cliOutput>
>

The skipped count wasn't available in the rebalance status XML output
in 3.4.0, but it has been added recently and should be available in
3.4.2 (http://review.gluster.org/6000).
Regarding the missing skipped count in the remove-brick status output:
it might be because, for remove-brick, a skipped file is treated as a
failure. That may also have been fixed in 3.4.2, but it needs to be
checked to confirm.
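
For scripts that need this information in the meantime, the --xml
output is safer to consume than the table. A minimal sketch that pulls
the per-node status out of the remove-brick status XML with xmllint
(assuming libxml2's xmllint is installed; the element names match the
sample output above):

  gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 \
      vmx2.example.com:/tmp/foo3 status --xml > status.xml
  # Walk the <node> entries and print "name: status" for each one.
  count=$(xmllint --xpath 'string(//volRemoveBrick/nodeCount)' status.xml)
  for i in $(seq 1 "$count"); do
      name=$(xmllint --xpath "string(//volRemoveBrick/node[$i]/nodeName)" status.xml)
      stat=$(xmllint --xpath "string(//volRemoveBrick/node[$i]/statusStr)" status.xml)
      echo "$name: $stat"
  done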

>
> ### 3)
> [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run-time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        8         0             completed              0.00
>     vmx2.example.com                 0  0Bytes        8         0             completed              0.00
> [root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 commit
> Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
> volume remove-brick commit: success
>
> This shouldn't warn you that you might experience data loss. If the
> rebalance has completed successfully, and the bricks are no longer
> accepting new files, then gluster should know this and just let you
> commit safely. I guess you can consider this a UI bug, as long as it
> checks beforehand that it's safe.
>

The warning is given by the CLI before it even contacts glusterd. The
CLI doesn't have any information about the status of a remove-brick
command's rebalance process, so it gives the warning for every
'remove-brick commit' command, be it a forceful removal or a clean
removal with rebalancing. This would be a nice feature to have, but it
requires changes to the way the CLI and glusterd communicate.
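
For automation (puppet-gluster being the obvious case), the prompt can
at least be answered non-interactively with --mode=script. A rough
sketch, assuming you only want to commit once every node reports the
remove-brick rebalance as completed:

  status_out=$(gluster volume remove-brick examplevol \
      vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status)
  if ! echo "$status_out" | grep -qE 'failed|in progress|not started'; then
      # --mode=script suppresses the interactive y/n confirmation.
      gluster --mode=script volume remove-brick examplevol \
          vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 commit
  fi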

>
> ### 4)
> Aggregate "totals", aka the <aggregate> </aggregate> data isn't shown in
> the normal command line output.
>

The aggregate totals were added to the XML output because the oVirt
team, which was the driver behind having XML output, requested it. It
should be simple enough to add them to the normal output; if that is
desired, please raise an RFE bug.
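
In the meantime the totals are already present in the --xml output you
pasted, so they can be read directly; for example (reusing the
status.xml saved above):

  xmllint --xpath 'string(//volRemoveBrick/aggregate/lookups)' status.xml; echo
  xmllint --xpath 'string(//volRemoveBrick/aggregate/failures)' status.xml; echo
  xmllint --xpath 'string(//volRemoveBrick/aggregate/statusStr)' status.xml; echo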

>
> ### 5) the volume shouldn't have to be "started" for a rebalance to
> work... we might want to do a rebalance, but keep it "stopped" so that
> clients can't mount.
> This is probably due to gluster needing it "online" to rebalance, but
> nonetheless, it doesn't match what users/sysadmins expect.
>

Gluster requires the bricks to be running for a rebalance to happen, so
we cannot start a rebalance with the volume stopped. But we could
introduce a mechanism to barrier client access to the volume during a
rebalance. This kind of barriering is being considered for the
volume-level snapshot feature planned for 3.6. However, since a
rebalance is a long-running process compared to a snapshot, there might
be certain difficulties in barriering for its duration.
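
As a stop-gap, client access can be restricted while the volume stays
started by using the auth.allow/auth.reject volume options. A rough
sketch (the addresses are hypothetical, and this only prevents new
client mounts; already-connected clients are unaffected, so it is an
approximation of a barrier, not a real one):

  # Allow only the server nodes themselves for the duration of the rebalance,
  # so regular clients cannot mount the volume (example/hypothetical addresses).
  gluster volume set examplevol auth.allow '192.168.1.101,192.168.1.102'
  gluster volume rebalance examplevol start
  # ...wait until 'gluster volume rebalance examplevol status' reports completed...
  gluster volume reset examplevol auth.allow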

>
> ### 6) in the command: gluster volume rebalance myvolume status ;
> gluster volume rebalance myvolume status --xml && echo t
> Nowhere does it mention the volume, or the specific bricks which are
> being [re-]balanced. In particular, a volume name would be especially
> useful in the --xml output.
> This would be useful if multiple rebalances are going on... I realize
> this is because the rebalance command only allows you to specify one
> volume at a time, but to be consistent with other commands, a volume
> rebalance status command should let you get info on many volumes.
> Also, per-brick information is still missing.
>
>
>
>                 Node  Rebalanced-files    size  scanned  failures  skipped       status  run time in secs
>            ---------       -----------  ------  -------  --------  -------  -----------  ----------------
>            localhost                 0  0Bytes        2         0        0  in progress              1.00
>     vmx2.example.com                 0  0Bytes        7         0        0  in progress              1.00
> volume rebalance: examplevol: success:
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <cliOutput>
>   <opRet>0</opRet>
>   <opErrno>115</opErrno>
>   <opErrstr/>
>   <volRebalance>
>     <task-id>c5e9970b-f96a-4a28-af14-5477cf90d638</task-id>
>     <op>3</op>
>     <nodeCount>2</nodeCount>
>     <node>
>       <nodeName>localhost</nodeName>
>       <files>0</files>
>       <size>0</size>
>       <lookups>2</lookups>
>       <failures>0</failures>
>       <status>1</status>
>       <statusStr>in progress</statusStr>
>     </node>
>     <node>
>       <nodeName>vmx2.example.com</nodeName>
>       <files>0</files>
>       <size>0</size>
>       <lookups>7</lookups>
>       <failures>0</failures>
>       <status>1</status>
>       <statusStr>in progress</statusStr>
>     </node>
>     <aggregate>
>       <files>0</files>
>       <size>0</size>
>       <lookups>9</lookups>
>       <failures>0</failures>
>       <status>1</status>
>       <statusStr>in progress</statusStr>
>     </aggregate>
>   </volRebalance>
> </cliOutput>
>

Having the volume name in the XML output is a valid enhancement; go
ahead and open an RFE bug for it.
The rebalance process on each node crawls the whole volume to find
files which need to be migrated and which are present on that node's
bricks of the volume. So the rebalance status of a node can be
considered the status of its brick. But if a node contains more than
one brick of the volume being rebalanced, we don't have a way to
differentiate between them, and I'm not sure we could.
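
Until the volume name makes it into the XML, a wrapper has to carry it
itself. A small sketch that reports the rebalance status of every
volume, tagged with the volume name (assuming 'gluster volume list' is
available; otherwise the names can be scraped from 'gluster volume
info'):

  # Print a per-volume rebalance status block, prefixed with the volume name,
  # since the status output itself doesn't include it.
  for vol in $(gluster volume list); do
      echo "=== rebalance status for volume: $vol ==="
      gluster volume rebalance "$vol" status
  done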

>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users


