[Gluster-devel] Error coalesce for erasure code xlator
Xavier Hernandez
xhernandez at datalab.es
Tue Jul 1 13:20:42 UTC 2014
On Tuesday 01 July 2014 08:18:46 Jeff Darcy wrote:
> > Not having enough quorum means that more than R (redundancy) bricks have
> > failed simultaneously (or have failed while another brick was alive but
> > not yet recovered), which means that it's outside of the defined working
> > conditions. However, in some circumstances this could be improved.
>
> I think it could be worse than that. Consider the following series of
> operations:
>
> (1) R bricks are down - let's say A and B with R=2.
>
> (2) A modifying operation is done on file/directory X.
>
> (3) A and B come back up but on-demand recovery is not yet done for X.
>
> (4) A *different* R bricks go down - let's say C and D.
>
> (5) Somebody tries to read X.
>
> At this point the read fails. Even though we have quorum, we still don't
> have enough bricks to satisfy the request. One could argue that the
> failures on C and D are simultaneous with those on A and B, in the sense
> that the failure of A and B persists until recovery is complete, and thus
> it violates our operating assumption. I've tried to explain that to users
> many times on many projects, and they seem to have very little patience for
> that answer. I've come to believe that they're right, and that minimizing
> that recovery time is critical.
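A minimal sketch of the sequence above (only a toy model: it assumes a 4+2
dispersed layout, i.e. K=4 data fragments and R=2 redundancy, and a single
per-brick fragment version in place of ec's real versioning and self-heal
metadata):

    K, R = 4, 2                        # a read needs K current fragments
    bricks = ['A', 'B', 'C', 'D', 'E', 'F']

    up = {b: True for b in bricks}     # brick is reachable
    version = {b: 0 for b in bricks}   # fragment version of file X on brick

    def write_x():
        # A modifying operation only updates fragments on bricks that are up.
        for b in bricks:
            if up[b]:
                version[b] += 1

    def read_x():
        # A read needs K reachable fragments at the latest version.
        latest = max(version.values())
        good = [b for b in bricks if up[b] and version[b] == latest]
        return len(good) >= K

    up['A'] = up['B'] = False          # (1) R bricks down
    write_x()                          # (2) X modified on C, D, E, F only
    up['A'] = up['B'] = True           # (3) back up, X not yet healed
    up['C'] = up['D'] = False          # (4) a different R bricks go down
    print(read_x())                    # (5) False: read of X fails

At step (5) only 2 of the 6 bricks are down, so quorum looks fine, yet only
E and F hold fragments that are both reachable and current, and 2 < K.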
Yes, this case fails. I assumed this case belongs to the "outside of the
defined working conditions" category. Not much can be done here beyond
minimizing the recovery time, as you say (this is still pending for ec), but
then users will complain that the system is slow... this is a hard problem to
solve... maybe increasing the redundancy would be the only good solution, at
least from the technical point of view.
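To give a rough idea of what raising the redundancy costs (illustrative
arithmetic only, not a recommendation for any particular layout): a K+R
dispersed layout tolerates up to R bricks that are down or stale (not yet
healed), at a raw-storage overhead of R/K.

    # Illustrative arithmetic only: tolerance vs. storage overhead.
    for k, r in [(4, 2), (4, 3), (8, 4)]:
        print("%d+%d: tolerates %d bad bricks, overhead %.0f%%"
              % (k, r, r, 100.0 * r / k))
    # 4+2: tolerates 2 bad bricks, overhead 50%
    # 4+3: tolerates 3 bad bricks, overhead 75%
    # 8+4: tolerates 4 bad bricks, overhead 50%

That's the trade-off: more simultaneous (or not yet healed) failures absorbed,
at the price of more raw storage and more fragments written per operation.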