[Bugs] [Bug 1416327] Failure on remove-brick when each server has a lot of bricks
bugzilla at redhat.com
Mon Jan 30 07:33:06 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1416327
--- Comment #1 from Xavier Hernandez <xhernandez at datalab.es> ---
The discussion I've had with Pranith and Ashish about this problem:
On 30/01/17 08:23, Xavier Hernandez wrote:
> Hi Ashish,
>
> On 30/01/17 08:04, Ashish Pandey wrote:
>>
>> Hi Xavi,
>>
>> Our QA team has also filed a bug similar to this. The only difference is
>> that the setup they used is only 3 * (4+2),
>> and they removed 6 bricks.
>> https://bugzilla.redhat.com/show_bug.cgi?id=1417535
>
> The messages look quite similar.
>
>>
>> Do you think it is only possible with lots of bricks, or could it also
>> happen with a smaller number of bricks?
>
> The issue happens when the system is unable to establish the connections
> in less time than the timeout (currently 10 seconds). Probably this can
> happen if there are a lot of connections to make or the system is very
> busy.
>
>> Also, could you please update the upstream bug with your solution and
>> this mail discussion?
>
> I'll do that.
>
> Xavi
>
>>
>> Ashish
>>
>> ------------------------------------------------------------------------
>> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
>> *Cc: *"Ashish Pandey" <aspandey at redhat.com>
>> *Sent: *Wednesday, January 25, 2017 6:52:57 PM
>> *Subject: *Re: Start timeout for ec/afr
>>
>>
>>
>> On Wed, Jan 25, 2017 at 5:17 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 25/01/17 12:28, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Wed, Jan 25, 2017 at 4:49 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 25/01/17 12:08, Pranith Kumar Karampuri wrote:
>>
>> Wow, scale problem :-).
>>
>> It can happen this way with mounts also, right? Why are we only
>> considering the rebalance process?
>>
>>
>> The problem can happen with mounts also, but it's less visible.
>> Waiting a little solves the problem. However, an automated task that
>> mounts the volume and does something immediately afterwards can
>> suffer the same problem.
>>
>> The reason we introduced this timeout is to
>> prevent users from getting frustrated at mount time while waiting
>> for the mount to complete. Rebalance can take that extra minute or
>> two, until it gets a ping timeout, before becoming operational. So if
>> this issue is only with rebalance we can do something different.
>>
>>
>> Is there a way to detect if we are running as a mount, self-heal,
>> rebalance, ... ?
>>
>>
>> Glusterd launches all these processes. So we can launch them with custom
>> options. Check glusterd_handle_defrag_start() for example.
>>
>>
>> That's an interesting option.
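A minimal sketch of that idea, in plain C with invented names (not the real
glusterd or ec code): glusterd would pass an extra option only to the
processes it launches itself (e.g. when building the rebalance command line
in glusterd_handle_defrag_start()), and ec/afr would fall back to the current
10 second default when the option is absent, as for a regular fuse mount.

    #include <stdio.h>

    #define DEFAULT_CHILD_UP_TIMEOUT 10 /* current hardcoded value, seconds */

    static unsigned int
    pick_child_up_timeout(int option_present, int option_value)
    {
        /* option_present/option_value stand in for a lookup on the
         * xlator's options; plain mounts keep the historical default. */
        if (option_present && option_value > 0)
            return (unsigned int)option_value;
        return DEFAULT_CHILD_UP_TIMEOUT;
    }

    int
    main(void)
    {
        printf("mount:     %u s\n", pick_child_up_timeout(0, 0));  /* 10 */
        printf("rebalance: %u s\n", pick_child_up_timeout(1, 60)); /* 60 */
        return 0;
    }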
>>
>>
>>
>> We could have different settings for each environment. However, I
>> still think that letting a mount succeed when the volume is not fully
>> available is not a good solution.
>>
>> I think the mount should wait until the volume is as ready as
>> possible. We can define a timeout for this to avoid an indefinite
>> wait, but this timeout should be way longer than the current 10 seconds.
>>
>> On the other hand, when enough bricks are online, we don't need to
>> force the user to wait for a full timeout if a brick is really down.
>> In this case a smaller timeout of 5-10 seconds would be enough to
>> see if there are more bricks available before declaring the volume up.
>>
>>
>> Will an even bigger number of bricks exceed these limits again, so
>> that we have to adjust the numbers once more?
>>
>> It can happen. That's why I would make the second timeout configurable.
>>
>> For example, in a distributed-disperse volume, if there aren't enough
>> bricks online to bring up at least one ec subvolume, the mount will
>> have to wait for the first timeout. It could be 1 minute (or fixed to
>> 10 times the second timeout, for example). This is a big delay, but
>> we are dealing with a very rare scenario. Probably the mount hang
>> would be the least of the problems.
>>
>> If there are only a few bricks online, but enough to bring up one ec
>> subvolume, then that subvolume will answer reasonably fast: at most
>> the connection delay plus the second timeout value (this could be
>> 5-10 seconds by default, but configurable). DHT brings itself up as
>> soon as at least one of its subvolumes comes up. So we are ok here;
>> we don't need every single ec subvolume to report its state quickly.
>>
>> If all bricks are online, the mount will have to wait only the time
>> needed to connect to all bricks. No timeouts will be applied here.
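As a small illustration of the "enough bricks to bring up one ec subvolume"
condition: a (k+r) disperse set only needs k connected bricks to serve data,
so for 4+2 the second, shorter timeout could already start once 4 bricks are
up. The names below are chosen for illustration only, this is not the actual
ec code.

    /* Sketch: a disperse (k+r) subvolume can be considered viable as soon
     * as at least k of its k+r bricks are connected. */
    #include <stdbool.h>

    bool
    ec_subvol_can_come_up(unsigned int up_children,
                          unsigned int data_bricks /* k */)
    {
        return up_children >= data_bricks;
    }
    /* e.g. 4+2: ec_subvol_can_come_up(4, 4) is true, so the remaining two
     * bricks only get the short, configurable grace period to connect. */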
>>
>>
>> Looks good to me.
>>
>>
>> Xavi
>>
>> Xavi
>>
>>
>>
>> On Wed, Jan 25, 2017 at 3:55 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> Hi,
>>
>> currently we have a start timeout for ec and afr that works very
>> similarly, if not identically, in both. Basically, when the PARENT_UP
>> event is received, the timer is started. If we receive
>> CHILD_UP/CHILD_DOWN events from all children, the timer is cancelled
>> and the appropriate event is propagated. If not all bricks have
>> answered when the timeout expires, we propagate CHILD_UP or
>> CHILD_DOWN depending on how many up children we have.
>>
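A condensed model of the behaviour described above, in plain C with invented
names (this is a sketch of the logic, not the actual ec/afr code):

    /* One timer is armed on PARENT_UP; if every child answers, the timer is
     * cancelled and the definitive state is propagated. If the timer fires
     * first, the state is propagated from whatever has answered so far. */
    struct start_state {
        unsigned int total_children;
        unsigned int answered;     /* CHILD_UP + CHILD_DOWN received */
        unsigned int up_children;
        int          timer_armed;  /* armed on PARENT_UP, 10 seconds today */
    };

    /* stand-in for notifying the parent xlator with CHILD_UP/CHILD_DOWN */
    static void
    propagate(unsigned int up_children)
    {
        (void)up_children;
    }

    /* called for each CHILD_UP/CHILD_DOWN coming from protocol/client */
    static void
    on_child_event(struct start_state *s, int child_is_up)
    {
        s->answered++;
        if (child_is_up)
            s->up_children++;
        if (s->answered == s->total_children) {
            s->timer_armed = 0;        /* cancel the start timer */
            propagate(s->up_children); /* definitive UP/DOWN */
        }
    }

    /* called when the start timer expires before all children answered */
    static void
    on_start_timeout(struct start_state *s)
    {
        s->timer_armed = 0;
        propagate(s->up_children); /* may report DOWN while bricks are
                                      still connecting: bug 1416327 */
    }

The failure reported in this bug corresponds to the on_start_timeout() path
above: with many bricks per server, the 10 second timer fires before enough
children have answered.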
>> There's an issue when one server has a lot of bricks. In this case,
>> connecting to enough bricks to bring the volume up can take longer
>> than the hardcoded 10 seconds (I've filed a bug for this:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1416327)
>>
>>
>> For mounts this is not a problem. Even if not enough bricks have
>> answered in time, the volume will be mounted and eventually the
>> remaining bricks will be connected and accessible.
>>
>> However, when a rebalance process is started, it immediately tries to
>> do operations on the volume once all of DHT's subvolumes have
>> answered. If they answer with CHILD_DOWN, the rebalance fails without
>> waiting for the subvolumes to come online.
>>
>> To solve this I've been thinking about the following change for the
>> first start of a volume:
>>
>> 1. Start a timer when PARENT_UP is received
>>
>> This will be a worst-case timer. I would set it to at least a
>> minute. However, I'm not sure if this is really necessary. Maybe
>> protocol/client answers relatively fast even if no connection can be
>> established.
>>
>> 2. Start a second timer when the minimum number of bricks is up
>>
>> Once we know that the volume can really be started, we'll still wait
>> a little more to allow the remaining bricks to connect. I would make
>> this timeout configurable, with a default value of 5 seconds.
>>
>> I think there's no good reason to propagate a CHILD_DOWN event
>> before we really know if the volume will be down.
>>
>> This solves the rebalance problem and allows some margin of time
>> for bricks to come online before operating on the volume (this
>> avoids the situation where operations sent just after the CHILD_UP
>> event cause inconsistencies that self-heal will have to repair once
>> the remaining bricks come online).
>>
>> In this case a mount won't succeed until the main timeout (or the
>> protocol/client timeout) expires when not enough bricks are
>> available, but I think this is acceptable.
>>
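A minimal sketch of this two-timer proposal, again in plain C with invented
names and with the timeout values suggested above (a long worst-case timer
armed on PARENT_UP, and a shorter configurable grace timer armed once the
minimum number of bricks is up); it illustrates the idea and is not an
actual implementation:

    #define WORST_CASE_TIMEOUT 60 /* seconds, timer 1: armed on PARENT_UP   */
    #define GRACE_TIMEOUT       5 /* seconds, timer 2: configurable default */

    struct start_ctx {
        unsigned int total_children;
        unsigned int min_children;   /* e.g. k for a k+r disperse set */
        unsigned int answered;
        unsigned int up_children;
        int          grace_armed;
    };

    /* stand-ins for the timer API and for notifying the parent xlator */
    static void arm_timer(int seconds) { (void)seconds; }
    static void notify_parent(int volume_up) { (void)volume_up; }

    static void
    on_child_up(struct start_ctx *s)
    {
        s->answered++;
        s->up_children++;
        if (s->answered == s->total_children) {
            /* everyone answered: propagate immediately, no waiting */
            notify_parent(s->up_children >= s->min_children);
        } else if (!s->grace_armed && s->up_children >= s->min_children) {
            /* volume is viable: give the stragglers a short grace period */
            s->grace_armed = 1;
            arm_timer(GRACE_TIMEOUT);
        }
        /* a CHILD_DOWN handler would only increment s->answered */
    }

    /* timer 2: volume already viable, stop waiting for remaining bricks */
    static void
    on_grace_timeout(struct start_ctx *s)
    {
        notify_parent(s->up_children >= s->min_children); /* true here */
    }

    /* timer 1: only here is CHILD_DOWN ever propagated, so rebalance is
     * never started against a volume that is merely still connecting */
    static void
    on_worst_case_timeout(struct start_ctx *s)
    {
        notify_parent(s->up_children >= s->min_children);
    }

With this, a mount of a healthy volume comes up as fast as today (all
children answer and no timer is waited on), while a rebalance started while
bricks are still connecting only sees CHILD_DOWN after the worst-case timer.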
>> What do you think?
>>
>> --
>> Pranith
>>
>>
>