[Bugs] [Bug 1416327] Failure on remove-brick when each server has a lot of bricks
bugzilla at redhat.com
Mon Jan 30 07:33:06 UTC 2017
https://bugzilla.redhat.com/show_bug.cgi?id=1416327
--- Comment #1 from Xavier Hernandez <xhernandez at datalab.es> ---
The discussion I've had with Pranith and Ashish about this problem:
On 30/01/17 08:23, Xavier Hernandez wrote:
> Hi Ashish,
>
> On 30/01/17 08:04, Ashish Pandey wrote:
>>
>> Hi Xavi,
>>
>> Our QA team has also filed a bug similar to this. The only difference is
>> that the setup they used is only 3 * (4+2),
>> and they removed 6 bricks.
>> https://bugzilla.redhat.com/show_bug.cgi?id=1417535
>
> The messages look quite similar.
>
>>
>> Do you think it is only possible with lots of bricks, or could it also
>> happen with a smaller number of bricks?
>
> The issue happens when the system is unable to establish the connections
> in less time than the timeout (currently 10 seconds). Probably this can
> happen if there are a lot of connections to make or the system is very
> busy.
>
>> Also, could you please update the upstream bug with your solution and
>> this mail discussion?
>
> I'll do that.
>
> Xavi
>
>>
>> Ashish
>>
>> ------------------------------------------------------------------------
>> *From: *"Pranith Kumar Karampuri" <pkarampu at redhat.com>
>> *To: *"Xavier Hernandez" <xhernandez at datalab.es>
>> *Cc: *"Ashish Pandey" <aspandey at redhat.com>
>> *Sent: *Wednesday, January 25, 2017 6:52:57 PM
>> *Subject: *Re: Start timeout for ec/afr
>>
>>
>>
>> On Wed, Jan 25, 2017 at 5:17 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 25/01/17 12:28, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Wed, Jan 25, 2017 at 4:49 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> On 25/01/17 12:08, Pranith Kumar Karampuri wrote:
>>
>> Wow, scale problem :-).
>>
>> It can happen this way with mounts also, right? Why are we only
>> considering the rebalance process?
>>
>>
>> The problem can happen with mounts also, but it's less visible.
>> Waiting a little solves the problem. However, an automated task that
>> mounts the volume and does something immediately afterwards can
>> suffer the same problem.
>>
>> The reason we introduced this timeout is to
>> prevent users from getting frustrated at mount time while waiting
>> for the mount to complete. Rebalance can take that extra minute or
>> two, until it gets a ping timeout, before becoming operational. So if
>> this issue is only with rebalance we can do something different.
>>
>>
>> Is there a way to detect if we are running as a mount, self-heal,
>> rebalance, ... ?
>>
>>
>> Glusterd launches all these processes. So we can launch them with custom
>> options. Check glusterd_handle_defrag_start() for example.
>>
>>
>> That's an interesting option.
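A minimal sketch of that idea, in plain C with invented names (not the real
glusterd or ec code): glusterd would pass an extra option only to the
processes it launches itself (e.g. when building the rebalance command line
in glusterd_handle_defrag_start()), and ec/afr would fall back to the current
10 second default when the option is absent, as for a regular fuse mount.

    #include <stdio.h>

    #define DEFAULT_CHILD_UP_TIMEOUT 10 /* current hardcoded value, seconds */

    static unsigned int
    pick_child_up_timeout(int option_present, int option_value)
    {
        /* option_present/option_value stand in for a lookup on the
         * xlator's options; plain mounts keep the historical default. */
        if (option_present && option_value > 0)
            return (unsigned int)option_value;
        return DEFAULT_CHILD_UP_TIMEOUT;
    }

    int
    main(void)
    {
        printf("mount:     %u s\n", pick_child_up_timeout(0, 0));  /* 10 */
        printf("rebalance: %u s\n", pick_child_up_timeout(1, 60)); /* 60 */
        return 0;
    }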
>>
>>
>>
>> We could have different settings for each environment. However, I
>> still think that letting a mount succeed when the volume is not fully
>> available is not a good solution.
>>
>> I think the mount should wait until the volume is as ready as
>> possible. We can define a timeout for this to avoid an indefinite
>> wait, but this timeout should be way longer than the current 10 seconds.
>>
>> On the other hand, when enough bricks are online, we don't need to
>> force the user to wait for a full timeout if a brick is really down.
>> In this case a smaller timeout of 5-10 seconds would be enough to
>> see if there are more bricks available before declaring the volume up.
>>
>>
>> Will an even bigger number of bricks exceed these limits again, so
>> that we have to adjust the numbers once more?
>>
>> It can happen. That's why I would make the second timeout configurable.
>>
>> For example, in a distributed-disperse volume, if there aren't enough
>> bricks online to bring up at least one ec subvolume, the mount will
>> have to wait for the first timeout. It could be 1 minute (or fixed to
>> 10 times the second timeout, for example). This is a big delay, but
>> we are dealing with a very rare scenario. Probably the mount hang
>> would be the least of the problems.
>>
>> If there are only a few bricks online, but enough to bring up one ec
>> subvolume, then that subvolume will answer reasonably fast: at most
>> the connection delay plus the second timeout value (this could be
>> 5-10 seconds by default, but configurable). DHT brings itself up as
>> soon as at least one of its subvolumes comes up. So we are ok here;
>> we don't need every single ec subvolume to report its state quickly.
>>
>> If all bricks are online, the mount will have to wait only the time
>> needed to connect to all bricks. No timeouts will be applied here.
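As a small illustration of the "enough bricks to bring up one ec subvolume"
condition: a (k+r) disperse set only needs k connected bricks to serve data,
so for 4+2 the second, shorter timeout could already start once 4 bricks are
up. The names below are chosen for illustration only, this is not the actual
ec code.

    /* Sketch: a disperse (k+r) subvolume can be considered viable as soon
     * as at least k of its k+r bricks are connected. */
    #include <stdbool.h>

    bool
    ec_subvol_can_come_up(unsigned int up_children,
                          unsigned int data_bricks /* k */)
    {
        return up_children >= data_bricks;
    }
    /* e.g. 4+2: ec_subvol_can_come_up(4, 4) is true, so the remaining two
     * bricks only get the short, configurable grace period to connect. */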
>>
>>
>> Looks good to me.
>>
>>
>> Xavi
>>
>> Xavi
>>
>>
>>
>> On Wed, Jan 25, 2017 at 3:55 PM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>
>> Hi,
>>
>> currently we have a start timeout for ec and afr that works very
>> similarly, if not identically, in both. Basically, when the PARENT_UP
>> event is received, the timer is started. If we receive
>> CHILD_UP/CHILD_DOWN events from all children, the timer is cancelled
>> and the appropriate event is propagated. If not all bricks have
>> answered when the timeout expires, we propagate CHILD_UP or
>> CHILD_DOWN depending on how many up children we have.
>>
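A condensed model of the behaviour described above, in plain C with invented
names (this is a sketch of the logic, not the actual ec/afr code):

    /* One timer is armed on PARENT_UP; if every child answers, the timer is
     * cancelled and the definitive state is propagated. If the timer fires
     * first, the state is propagated from whatever has answered so far. */
    struct start_state {
        unsigned int total_children;
        unsigned int answered;     /* CHILD_UP + CHILD_DOWN received */
        unsigned int up_children;
        int          timer_armed;  /* armed on PARENT_UP, 10 seconds today */
    };

    /* stand-in for notifying the parent xlator with CHILD_UP/CHILD_DOWN */
    static void
    propagate(unsigned int up_children)
    {
        (void)up_children;
    }

    /* called for each CHILD_UP/CHILD_DOWN coming from protocol/client */
    static void
    on_child_event(struct start_state *s, int child_is_up)
    {
        s->answered++;
        if (child_is_up)
            s->up_children++;
        if (s->answered == s->total_children) {
            s->timer_armed = 0;        /* cancel the start timer */
            propagate(s->up_children); /* definitive UP/DOWN */
        }
    }

    /* called when the start timer expires before all children answered */
    static void
    on_start_timeout(struct start_state *s)
    {
        s->timer_armed = 0;
        propagate(s->up_children); /* may report DOWN while bricks are
                                      still connecting: bug 1416327 */
    }

The failure reported in this bug corresponds to the on_start_timeout() path
above: with many bricks per server, the 10 second timer fires before enough
children have answered.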
>> There's an issue when one server has a lot of bricks. In this case,
>> connecting to enough bricks to bring the volume up can take longer
>> than the hardcoded 10 seconds (I've filed a bug for this:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1416327)
>>
>>
>> For mounts this is not a problem. Even if not enough bricks have
>> answered in time, the volume will be mounted and eventually the
>> remaining bricks will be connected and accessible.
>>
>> However, when a rebalance process is started, it immediately tries to
>> do operations on the volume once all of DHT's subvolumes have
>> answered. If they answer with CHILD_DOWN, the rebalance fails without
>> waiting for the subvolumes to come online.
>>
>> To solve this I've been thinking about the following change for the
>> first start of a volume:
>>
>> 1. Start a timer when PARENT_UP is received
>>
>> This will be a worst-case timer. I would set it to at least a
>> minute. However, I'm not sure if this is really necessary. Maybe
>> protocol/client answers relatively fast even if no connection can be
>> established.
>>
>> 2. Start a second timer when the minimum number of bricks is up
>>
>> Once we know that the volume can really be started, we'll still wait
>> a little more to allow the remaining bricks to connect. I would make
>> this timeout configurable, with a default value of 5 seconds.
>>
>> I think there's no good reason to propagate a CHILD_DOWN event
>> before we really know if the volume will be down.
>>
>> This solves the rebalance problem and allows some margin of time
>> for bricks to come online before operating on the volume (this
>> avoids the situation where operations sent just after the CHILD_UP
>> event cause inconsistencies that self-heal will have to repair once
>> the remaining bricks come online).
>>
>> In this case a mount won't succeed until the main timeout (or the
>> protocol/client timeout) expires when not enough bricks are
>> available, but I think this is acceptable.
>>
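A minimal sketch of this two-timer proposal, again in plain C with invented
names and with the timeout values suggested above (a long worst-case timer
armed on PARENT_UP, and a shorter configurable grace timer armed once the
minimum number of bricks is up); it illustrates the idea and is not an
actual implementation:

    #define WORST_CASE_TIMEOUT 60 /* seconds, timer 1: armed on PARENT_UP   */
    #define GRACE_TIMEOUT       5 /* seconds, timer 2: configurable default */

    struct start_ctx {
        unsigned int total_children;
        unsigned int min_children;   /* e.g. k for a k+r disperse set */
        unsigned int answered;
        unsigned int up_children;
        int          grace_armed;
    };

    /* stand-ins for the timer API and for notifying the parent xlator */
    static void arm_timer(int seconds) { (void)seconds; }
    static void notify_parent(int volume_up) { (void)volume_up; }

    static void
    on_child_up(struct start_ctx *s)
    {
        s->answered++;
        s->up_children++;
        if (s->answered == s->total_children) {
            /* everyone answered: propagate immediately, no waiting */
            notify_parent(s->up_children >= s->min_children);
        } else if (!s->grace_armed && s->up_children >= s->min_children) {
            /* volume is viable: give the stragglers a short grace period */
            s->grace_armed = 1;
            arm_timer(GRACE_TIMEOUT);
        }
        /* a CHILD_DOWN handler would only increment s->answered */
    }

    /* timer 2: volume already viable, stop waiting for remaining bricks */
    static void
    on_grace_timeout(struct start_ctx *s)
    {
        notify_parent(s->up_children >= s->min_children); /* true here */
    }

    /* timer 1: only here is CHILD_DOWN ever propagated, so rebalance is
     * never started against a volume that is merely still connecting */
    static void
    on_worst_case_timeout(struct start_ctx *s)
    {
        notify_parent(s->up_children >= s->min_children);
    }

With this, a mount of a healthy volume comes up as fast as today (all
children answer and no timer is waited on), while a rebalance started while
bricks are still connecting only sees CHILD_DOWN after the worst-case timer.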
>> What do you think?
>>
>> --
>> Pranith
>>
>>
>