[Gluster-devel] Weird full heal on Distributed-Disperse volume with sharding

Wed Sep 30 07:56:05 UTC 2020

Hi Dmitry,

On Wed, Sep 30, 2020 at 9:21 AM Dmitry Antipov <dmantipov at yandex.ru> wrote:

> On 9/30/20 8:58 AM, Xavi Hernandez wrote:
>
> > This is normal. A dispersed volume writes encoded fragments of each
> > block in each brick. In this case it's a 2+1 configuration, so each block
> > is divided into 2 fragments. A third fragment is generated
> > for redundancy and stored on the third brick.
>
> OK. But for Distributed-Replicate 2 x 3 setup and 64K shards, 4M file
> should be split into (4096 / 64) * 3 = 192 shards, not 189. So why 189?
>

In fact, there aren't 189 shards. There are 63 shards replicated 3 times
each (63 * 3 = 189 files). Shard 0 is not inside the .shard directory. It's
placed in the directory where the file was created. So there are a total of
64 chunks of 64 KiB = 4 MiB.
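
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting, not anything taken from the shard translator; the function and key names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

def shard_layout(file_size, shard_size, replica_count):
    """Count the chunks of a sharded file (illustrative helper, not a Gluster API)."""
    # Ceiling division: a partial trailing chunk still needs its own shard.
    total_chunks = -(-file_size // shard_size)
    # Chunk 0 is the base file itself; only the remaining chunks live under .shard.
    in_shard_dir = max(total_chunks - 1, 0)
    return {
        "total_chunks": total_chunks,
        "in_shard_dir": in_shard_dir,
        # Each shard is stored once per brick of its replica set.
        "shard_files_on_bricks": in_shard_dir * replica_count,
    }

layout = shard_layout(4 * MIB, 64 * KIB, replica_count=3)
print(layout)
# {'total_chunks': 64, 'in_shard_dir': 63, 'shard_files_on_bricks': 189}
```

Note that 189 also matches the per-brick counts you reported: 3 * 24 + 3 * 39 = 189.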

> And if all bricks are considered equal and has enough amount of free
> space, shards distribution {24, 24, 24, 39, 39, 39} looks suboptimal.
>

Shards are distributed exactly like regular files. This means that they
are balanced based on a random distribution (with some correction when free
space is not equal, but this is irrelevant now). Random distributions tend
to balance the number of files very well, but only with a big number of
files. Statistics on a small number of files may be biased.

If you keep adding new files to the volume, the balance will improve.
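
To get a feeling for how unusual a 24/39 split is, here is a rough sketch. It assumes each shard lands on one of the two replica sets like an independent fair coin flip, which is a simplification of what the distribution layer really does, and the function name is only illustrative.

```python
from math import comb

def prob_at_least_this_uneven(n_files, smaller_count):
    """P(the less-loaded of two sets gets <= smaller_count of n_files files),
    assuming every file picks a set independently with probability 1/2."""
    if 2 * smaller_count >= n_files:
        raise ValueError("smaller_count must be below half of n_files")
    # Lower binomial tail; the upper tail is identical by symmetry.
    lower_tail = sum(comb(n_files, k) for k in range(smaller_count + 1))
    return 2 * lower_tail / 2 ** n_files

# The 63 shards from this thread, split 24 / 39.
print(prob_at_least_this_uneven(63, 24))
# The same relative imbalance with 100 times more files.
print(prob_at_least_this_uneven(6300, 2400))
```

Under that assumption the first probability is small but far from negligible, while the second is vanishingly small, which is the "balance improves with more files" effect.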

> Why not {31, 32, 31, 32, 31, 32}? Isn't it a bug?
>

This can't happen. When you create a 2 x 3 replicated volume, you are
creating 2 independent replica 3 subvolumes. The first replica set is
composed of the first 3 bricks, and the second of the last 3. The
distribution layer chooses on which replica set to put each file. All 3
bricks of a replica set hold the same files, so the counts can only differ
between sets, never between bricks of the same set.

It's not a bug. It's by design. Gluster can work with multiple clients
creating files simultaneously. To force a perfect distribution, all of them
would have to synchronize to decide where to create each file. This would
have a significant performance impact. Instead of that, distribution is
done randomly, which allows each client to work independently and it will
balance files pretty well in the long term.

> > This is not right. A disperse 2+1 configuration only supports a single
> > failure. Wiping 2 fragments from the same file makes the file
> > unrecoverable. Disperse works using the Reed-Solomon erasure code,
> > which requires at least 2 healthy fragments to recover the data (in a
> > 2+1 configuration).
>
> It seems that I missed the point that all bricks are considered equal,
> regardless of the physical host they're attached to.
>

All bricks are considered equal inside a single replica/disperse set. A 2 x
(2 + 1) configuration has 2 independent disperse sets, so only one brick
from each of them may fail without data loss. If you want to support any 2
brick failures, you need to use a 1 x (4 + 2) configuration. In this case
there's a single disperse set which tolerates up to 2 brick failures.
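
The difference between the two layouts can be seen by enumerating every possible pair of failed bricks. This is just a counting sketch: bricks are numbered 0..5 and the helper names are invented for the example.

```python
from itertools import combinations

def survives(failed, disperse_sets, redundancy):
    """True if no disperse set loses more bricks than its redundancy."""
    return all(len(failed & s) <= redundancy for s in disperse_sets)

def count_survivable(disperse_sets, redundancy, n_failures):
    """Return (survivable, total) over all combinations of n_failures bricks."""
    bricks = sorted(set().union(*disperse_sets))
    combos = list(combinations(bricks, n_failures))
    ok = sum(survives(set(c), disperse_sets, redundancy) for c in combos)
    return ok, len(combos)

two_by_2_plus_1 = [{0, 1, 2}, {3, 4, 5}]   # 2 x (2 + 1), redundancy 1 per set
one_by_4_plus_2 = [{0, 1, 2, 3, 4, 5}]     # 1 x (4 + 2), redundancy 2

print(count_survivable(two_by_2_plus_1, redundancy=1, n_failures=2))  # (9, 15)
print(count_survivable(one_by_4_plus_2, redundancy=2, n_failures=2))  # (15, 15)
```

In the 2 x (2 + 1) case the 6 fatal pairs are exactly those where both failed bricks belong to the same disperse set.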

>
> So, for the Distributed-Disperse 2 x (2 + 1) setup with 3 hosts, 2 bricks
> per each, and two files, A and B, it's possible to have
> the following layout:
>
> Host0:                  Host1:                  Host2:
> |- Brick0: A0 B0        |- Brick0: A1           |- Brick0: A2
> |- Brick1: B1           |- Brick1: B2           |- Brick1:
>

No, this won't happen. A single file will go either to brick0 of all hosts
or brick1 of all hosts. They won't be mixed.

> This setup can tolerate single brick failure but not single host failure
> because if Host0 is down, two fragments of B will be lost
> and so B becomes unrecoverable (but A is not).
>
> If this is so, is it possible/hard to enforce 'one fragment per *host*'
> behavior? If we can guarantee the following:
>
> Host0:                  Host1:                  Host2:
> |- Brick0: A0           |- Brick0: A1           |- Brick0: A2
> |- Brick1: B1           |- Brick1: B2           |- Brick1: B0
>

This is how it currently works. You only need to take care of creating the
volume with the bricks in the right order. In this case the order should be
H0/B0, H1/B0, H2/B0, H0/B1, H1/B1, H2/B1. Anyway, if you create the volume
using an incorrect order and two bricks of the same disperse set are placed
in the same host, the operation will complain about it. This will only be
accepted by gluster if you create the volume with the 'force' option.
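
The effect of brick order can be sketched like this. It only models the rule described above (consecutive bricks in the list form one disperse set) and is not the check gluster itself runs; host names and brick paths are hypothetical.

```python
from collections import Counter

def group_bricks(bricks, set_size):
    """Split the ordered brick list into consecutive groups of set_size bricks."""
    if len(bricks) % set_size != 0:
        raise ValueError("number of bricks must be a multiple of set_size")
    return [bricks[i:i + set_size] for i in range(0, len(bricks), set_size)]

def hosts_used_twice(bricks, set_size):
    """For each disperse set, list the hosts that contribute more than one brick."""
    result = []
    for group in group_bricks(bricks, set_size):
        # A brick is written as "host:/path"; the host is the part before ':'.
        per_host = Counter(brick.split(":", 1)[0] for brick in group)
        result.append(sorted(h for h, n in per_host.items() if n > 1))
    return result

good = ["H0:/b0", "H1:/b0", "H2:/b0", "H0:/b1", "H1:/b1", "H2:/b1"]
bad = ["H0:/b0", "H0:/b1", "H1:/b0", "H1:/b1", "H2:/b0", "H2:/b1"]

print(hosts_used_twice(good, set_size=3))  # [[], []]
print(hosts_used_twice(bad, set_size=3))   # [['H0'], ['H2']]
```

With the good order every disperse set has one brick per host, so a single host failure removes only one fragment from each set.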

Regards,

Xavi

> this setup can tolerate both single brick and single host failures.
>
> Dmitry
>
>