[Gluster-users] EC planning

Xavier Hernandez xhernandez at datalab.es
Thu Oct 15 06:41:55 UTC 2015


Hi Serkan,

On 14/10/15 15:13, Serkan Çoban wrote:
> Hi Xavier,
>
>> I'm not sure if I understand you. Are you saying you will create two separate gluster volumes or you will add both bricks to the same distributed-dispersed volume ?
>
> Is adding more than one brick from the same host to a disperse gluster
> volume recommended? I meant two different gluster volumes.
> If I add two bricks from the same server to the same dispersed volume and
> let's say it is an 8+1 configuration, then losing one host will bring down
> the volume, right?

If you add two bricks from the same server to the *same* disperse set,
then yes, a failure of the node will mean the failure of two bricks.
However, this is not what I'm saying. You can have more than one brick on
the same server but assign each one of them to a different disperse set
of the same gluster volume. This way, if a server fails, only one brick
of each disperse set is lost, causing no trouble.

For example, suppose you have 6 servers and you create 4 bricks in each 
server. You could create a volume like this:

   gluster volume create test disperse 6 redundancy 2 \
                          server{1..6}:/bricks/test_1 \
                          server{1..6}:/bricks/test_2 \
                          server{1..6}:/bricks/test_3 \
                          server{1..6}:/bricks/test_4

This way you will create a distributed-dispersed volume with 4 independent
disperse sets, each formed by one brick from each server.

   Bricks of disperse set 1:
     server1:/bricks/test_1
     server2:/bricks/test_1
     server3:/bricks/test_1
     server4:/bricks/test_1
     server5:/bricks/test_1
     server6:/bricks/test_1

In this case, if server1 fails for example, you will lose 
server1:/bricks/test_{1..4}, but all disperse sets will continue to work 
without trouble.
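
If you want to double-check how gluster grouped the bricks once the
volume is created, you can simply look at the volume info (using the
volume name 'test' from the example above):

   # The "Number of Bricks" line should show something like 4 x (4 + 2) = 24
   # for the example above, and each group of consecutive bricks in the
   # list is one disperse set.
   gluster volume info test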

>
>> One possibility is to get rid of the server RAID and use each disk as a single brick. This way you can create 26 bricks per server and assign each one to a different disperse set. A big distributed-dispersed volume balances I/O load between bricks better. Note that RAID configurations have a reduction in the available number of IOPS. For sequential writes, this is not so bad, but if you have many clients accessing the same bricks, you will see many random accesses even if clients are doing sequential writes. Caching can alleviate this, but if you want to sustain a throughput of 2-3 GB/s, caching effects are not so evident.
>
> I can create 26 JBOD disks and use them as bricks, but is this
> recommended? With 50 servers the brick count will be 1300; is this
> not a problem?

I cannot tell that for sure as I haven't tested gluster installations
with so many bricks. I have tested configurations with ~200 bricks and
they work well. Basically this is a scalability issue related to the
distribution part of gluster. Maybe someone from the DHT team could help
you more on this.

> Can you explain the configuration a bit more? For example by using
> 16+2, 26 bricks per server and 54 servers in total. In the end I only want
> one gluster volume and protection against 2 host failures.

You can use a command like this:

   gluster volume create test disperse 18 redundancy 2 \
                          server{1..54}:/bricks/test_1 \
                          server{1..54}:/bricks/test_2 \
                          ...
                          server{1..54}:/bricks/test_26

This way you get a single gluster volume that can tolerate up to two 
full node failures without losing data.
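
By the way, you don't need to type the 26 brick arguments by hand. A
small shell loop can build the list; this is just a rough sketch assuming
the same server names and brick paths as above, so adjust it to whatever
naming you end up using:

   # Keep the brick index in the outer loop so that each group of 18
   # consecutive bricks (one 16+2 disperse set) lands on 18 different servers.
   bricks=()
   for b in $(seq 1 26); do
       for s in $(seq 1 54); do
           bricks+=("server$s:/bricks/test_$b")
       done
   done
   gluster volume create test disperse 18 redundancy 2 "${bricks[@]}"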

> Also in this case disk failures will be handled by gluster; I hope this
> doesn't bring more problems. But I will also test this configuration
> when I get the servers.

Yes, in this case the recovery of a failed disk will be handled by 
gluster. It should work well.
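
Just to sketch the workflow (the server name and the new brick path below
are only placeholders): after swapping the disk you point gluster at the
new brick and let self-heal rebuild it, with something like:

   # Replace the dead brick with the freshly formatted one; self-heal will
   # then rebuild its fragments from the other bricks of the same disperse set.
   gluster volume replace-brick test server7:/bricks/test_3 \
                                      server7:/bricks/test_3_new commit force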

With RAID, the recovery of a disk is local to the server (no network 
communication) and thus reads/writes are faster. However, to do this on 
a 208TB RAID using 26 8TB disks, it will need to read 192TB and write 
8TB. Quite a lot. With gluster the recovery will use the network, but 
only the used data will be recovered (i.e. if the failed disk was only 
half filled, it will only recover 4TB of data). On a 16+2 configuration, 
this means that it will read less than 16*8=128 TB and write less than 8 TB.
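
While the heal runs you can follow its progress with the usual commands
(again using the volume name 'test' as an example):

   # Files that still need to be reconstructed on the replaced brick:
   gluster volume heal test info
   # Check that all brick processes are online:
   gluster volume status test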

A 10Gbit network is a must for these configurations.

Xavi

>
> Serkan
>
>
>
> On Wed, Oct 14, 2015 at 2:03 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>> Hi Serkan,
>>
>> On 13/10/15 15:53, Serkan Çoban wrote:
>>>
>>> Hi Xavier and thanks for your answers.
>>>
>>> Servers will have 26 * 8TB disks. I don't want to lose more than 2 disks
>>> to RAID, so my options are HW RAID6 24+2 or 2 * HW RAID5 12+1,
>>
>>
>> A RAID5 of more than 8-10 disks is normally considered unsafe because the
>> probability of a second drive failure while reconstructing another failed
>> drive is considerably high. The same happens with a RAID6 of more than 16-20
>> disks.
>>
>>> in both cases I can create 2 bricks per server using LVM and use one brick
>>> per server to create two distributed-disperse volumes. I will test those
>>> configurations when servers arrive.
>>
>>
>> I'm not sure if I understand you. Are you saying you will create two
>> separate gluster volumes or you will add both bricks to the same
>> distributed-dispersed volume ?
>>
>>>
>>> I can go with 8+1 or 16+2, I will make tests when the servers arrive. But
>>> 8+2 will be too much, I would lose nearly 25% of the space in this case.
>>>
>>> For the client count, this cluster will get backups from Hadoop nodes,
>>> so there will be at least 750-1000 clients sending data at the same
>>> time.
>>> Can 16+2 * 3 = 54 gluster nodes handle this or should I increase the
>>> node count?
>>
>>
>> In this case I think it would be better to increase the number of bricks,
>> otherwise you may have some performance hit to serve all these clients.
>>
>> One possibility is to get rid of the server RAID and use each disk as a
>> single brick. This way you can create 26 bricks per server and assign each
>> one to a different disperse set. A big distributed-dispersed volume balances
>> I/O load between bricks better. Note that RAID configurations have a
>> reduction in the available number of IOPS. For sequential writes, this is
>> not so bad, but if you have many clients accessing the same bricks, you will
>> see many random accesses even if clients are doing sequential writes.
>> Caching can alleviate this, but if you want to sustain a throughput of 2-3
>> GB/s, caching effects are not so evident.
>>
>> Without RAID you could use a 16+2 or even a 16+3 dispersed volume. This
>> gives you a good protection and increased storage.
>>
>> Xavi
>>
>>>
>>> I will check the parameters you mentioned.
>>>
>>> Serkan
>>>
>>> On Tue, Oct 13, 2015 at 1:43 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>
>>>      +gluster-users
>>>
>>>
>>>      On 13/10/15 12:34, Xavier Hernandez wrote:
>>>
>>>          Hi Serkan,
>>>
>>>          On 12/10/15 16:52, Serkan Çoban wrote:
>>>
>>>              Hi,
>>>
>>>              I am planning to use GlusterFS for backup purposes. I write
>>>              big files (>100MB) with a throughput of 2-3GB/s. In order to
>>>              save space we plan to use erasure coding. I have some
>>>              questions for EC and brick planning:
>>>              - I am planning to use a 200TB XFS/ZFS RAID6 volume to hold
>>>              one brick per server. Should I increase the brick count?
>>>              Does increasing the brick count also increase performance?
>>>
>>>
>>>          Using a distributed-dispersed volume increases performance. You
>>>          can split each RAID6 volume into multiple bricks to create such
>>>          a volume. This is because a single brick process cannot achieve
>>>          the maximum throughput of the disk, so creating multiple bricks
>>>          improves this. However, having too many bricks could be worse
>>>          because all requests will go to the same filesystem and will
>>>          compete with each other in your case.
>>>
>>>          Another thing to consider is the size of the RAID volume. A
>>>          200TB RAID will require *a lot* of time to reconstruct in case
>>>          of failure of any disk. Also, a 200 TB RAID means you need
>>>          almost 30 8TB disks. A RAID6 of 30 disks is quite fragile. Maybe
>>>          it would be better to create multiple RAID6 volumes, each with
>>>          18 disks at most (16+2 is a good and efficient configuration,
>>>          especially for XFS on non-hardware RAIDs). Even in this
>>>          configuration, you can create multiple bricks in each RAID6
>>>          volume.
>>>
>>>              - I plan to use 16+2 for EC. Is this a problem? Should I
>>>              decrease this to 12+2 or 10+2? Or is it completely safe to
>>>              use whatever we want?
>>>
>>>
>>>          16+2 is a very big configuration. It requires a lot of
>>>          computation power and forces you to grow (if you need to grow
>>>          the gluster volume at some point) in multiples of 18 bricks.
>>>
>>>          Considering that you are already using a RAID6 in your servers,
>>>          what you are really protecting with the disperse redundancy is
>>>          the failure of the servers themselves. Maybe an 8+1
>>>          configuration could be enough for your needs and requires less
>>>          computation. If you really need redundancy 2, 8+2 should be ok.
>>>
>>>          Using values that are not a power of 2 has a theoretical impact
>>>          on the performance of the disperse volume when applications
>>>          write blocks whose size is a multiple of a power of 2 (which is
>>>          the most common case). This means that it's possible that a
>>>          10+2 performs worse than an 8+2. However, this depends on many
>>>          other factors, some even internal to gluster, like caching,
>>>          meaning that the real impact could be almost negligible in some
>>>          cases. You should test it with your workload.
>>>
>>>              - I understand that the EC calculation is performed on the
>>>              client side. I want to know if there are any benchmarks on
>>>              how EC affects CPU usage. For example, might each 100MB/s of
>>>              traffic use 1 CPU core?
>>>
>>>
>>>          I don't have a detailed measurement of CPU usage related to
>>>          bandwidth; however, we have made some tests that seem to
>>>          indicate that the CPU overhead caused by disperse is quite small
>>>          for a 4+2 configuration. I don't have access to this data right
>>>          now. When I have it, I'll send it to you.
>>>
>>>          I will also try to do some tests with an 8+2 and a 16+2
>>>          configuration to see the difference.
>>>
>>>              - Does the number of clients affect cluster performance? Is
>>>              there any difference if I connect 100 clients each writing
>>>              at 20-30MB/s to the cluster vs 1000 clients each writing at
>>>              2-3MB/s?
>>>
>>>
>>>          Increasing the number of clients improves performance; however,
>>>          I wouldn't go over 100 clients as this could have a negative
>>>          impact on performance caused by the overhead of managing all of
>>>          them. In our tests, the maximum performance is obtained with ~8
>>>          parallel clients (if my memory doesn't fail me).
>>>
>>>          You will also probably want to tweak some volume parameters, like
>>>          server.event-threads, client.event-threads,
>>>          performance.client-io-threads and server.outstanding-rpc-limit to
>>>          increase performance.
>>>
>>>          Xavi
>>>
>>>
>>>              Thank you for your time,
>>>              Serkan
>>>
>>>
>>>              _______________________________________________
>>>              Gluster-users mailing list
>>>              Gluster-users at gluster.org
>>>              http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>

