[Gluster-devel] a union to two stripes to fourteen mirrors...

Onyx lists at bmail.be
Wed Nov 21 07:20:42 UTC 2007


Yes, you are correct... I don't know where my head was...

I was thinking that if you, for example, set up 3 equal bricks, each with
a 1TB "/export" partition, this setup could give 2TB of usable space
while any single brick could fail without data loss, like this:

unify(afr(server1:/export/1 + server2:/export/2) + afr(server2:/export/1 + server3:/export/2) + afr(server3:/export/1 + server1:/export/2))

Of course, no matter how complicated you make it, an AFR of 2 volumes 
writes every file twice, so it uses double the space and halves the 
capacity of the total volume... In the above example, any single brick can 
indeed fail, but the total usable capacity is only 1.5TB (half of the 3TB 
raw), not the 2TB I had in mind.
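
Just so it's clear what I meant, here is a rough client-side volume spec
for that layout. It is only a sketch and untested: it assumes each server
exports /export/1 and /export/2 as volumes named "e1" and "e2", that
server1 also exports a small namespace volume "ns" (which cluster/unify
needs, if I have the unify options right), and the hostnames server1-3
and the rr scheduler are just made-up choices.

# one protocol/client volume per exported directory
volume s1e1
  type protocol/client
  option transport-type tcp/client
  option remote-host server1
  option remote-subvolume e1
end-volume
volume s1e2
  type protocol/client
  option transport-type tcp/client
  option remote-host server1
  option remote-subvolume e2
end-volume
volume s2e1
  type protocol/client
  option transport-type tcp/client
  option remote-host server2
  option remote-subvolume e1
end-volume
volume s2e2
  type protocol/client
  option transport-type tcp/client
  option remote-host server2
  option remote-subvolume e2
end-volume
volume s3e1
  type protocol/client
  option transport-type tcp/client
  option remote-host server3
  option remote-subvolume e1
end-volume
volume s3e2
  type protocol/client
  option transport-type tcp/client
  option remote-host server3
  option remote-subvolume e2
end-volume
# the three mirror pairs, each shifted by one server
volume afr1
  type cluster/afr
  subvolumes s1e1 s2e2
end-volume
volume afr2
  type cluster/afr
  subvolumes s2e1 s3e2
end-volume
volume afr3
  type cluster/afr
  subvolumes s3e1 s1e2
end-volume
# namespace brick for unify (assumed export on server1)
volume ns
  type protocol/client
  option transport-type tcp/client
  option remote-host server1
  option remote-subvolume ns
end-volume
# unify the three mirror pairs: ~1.5TB usable (half of the 3TB raw),
# and any single server can fail without losing data
volume unify
  type cluster/unify
  option namespace ns
  option scheduler rr   # simple round-robin file placement
  subvolumes afr1 afr2 afr3
end-volume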


There goes my reputation.... Next time I'll try to write something 
really smart :-)






Kevan Benson wrote:
>
> I think it's more of a RAID 10 or 0+1.  Since AFR is being used, all 
> data is stored twice (at a minimum, for all redundant files).  Only 
> AFRing 10 of the 50 bricks would mean that 25% of the data bricks 
> (10 of 40) were redundant, and the data they contained would be 
> available on another brick, whether it was a stripe of a file or a 
> full file.  The other 75% of the data bricks would be single points 
> of failure.
>
> RAID 5 actually uses parity, and GlusterFS currently has no method for 
> doing parity that I know of (although that would be a nice translator, 
> or a nice addition to the stripe translator).
>
> Onyx wrote:
>> Wow, very interesting concept!
>> Never thought of it...
>> Kind of like a raid5 over a network, right?
>>
>> Just thinking out loud now, not sure if this is correct, but...
>> - In your setup, any single brick can fail like with raid5
>> - If you afr 2 times (3 copies), any 2 bricks can fail like with raid6
>> - If you afr n times, any n bricks can fail.
>>
>> So you could set up a cluster with 50 bricks, AFR 10 times, have a 
>> redundancy of 10 bricks, and a usable storage space of 40 bricks...
>> A complex but very interesting concept!
>> ....
>> ....AND... We could set up some detection system and other small 
>> intelligence in the cluster to start a spare brick with the 
>> configuration of the failed brick. BAM, hot-spare brick alive, and 
>> starting to auto-heal!
>>
>> Man, GlusterFS is flexible!
>>
>> Can someone confirm if my thinking is not way-off here?
>>
>>
>> This makes me think of another young cluster filesystem....
>>
>>
>> Jerker Nyberg wrote:
>>>
>>> Hi,
>>>
>>> I'm trying out different configurations of GlusterFS. I have 7 nodes, 
>>> each with two 320 GB disks, where 300 GB on each disk is for the 
>>> distributed file system.
>>>
>>> Each node is called N. Every file system is, on the server side, 
>>> mirrored to the other disk on the next node, wrapped around so that 
>>> the last node mirrors its disk to the first. The definitions below are 
>>> invented notation; the real config is included at the end of this mail.
>>>
>>> Pseudodefinitions:
>>>
>>> fs(1) = a file system on the first disk
>>> fs(2) = a file system on the second disk
>>> n(I, fs(J)) = the fs J on node I
>>> afr(N .. M) = mirror the volumes
>>> stripe(N .. M) = stripe the volumes
>>>
>>> Server:
>>>
>>> Forw(N) = afr(n(N, fs(1)), n(N+1, fs(2)))
>>> Back(N) = afr(n(N, fs(2)), n(N-1, fs(1)))
>>>
>>> Client:
>>>
>>> FStr(N .. M) = stripe(n(N, Forw(N)), n(N+1, Forw(N+1)), ..., n(M, Forw(M)))
>>> BStr(N .. M) = stripe(n(N, Back(N)), n(N+1, Back(N+1)), ..., n(M, Back(M)))
>>> mount /glusterfs = union(FStr(1 .. 7), BStr(1 .. 7))
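>>>
>>> To make the client side of that concrete, here is a scaled-down sketch
>>> with only two nodes instead of seven. It is untested, and it assumes
>>> each server also exports its Forw and Back volumes under the names
>>> "forw" and "back", plus a small namespace volume "ns" on the first
>>> node (export names assumed; my real config further down uses different
>>> names and has no Back or namespace volume):
>>>
>>> # the forward and backward halves from each node
>>> volume f2
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.2
>>>   option remote-subvolume forw
>>> end-volume
>>> volume f3
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.3
>>>   option remote-subvolume forw
>>> end-volume
>>> volume b2
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.2
>>>   option remote-subvolume back
>>> end-volume
>>> volume b3
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.3
>>>   option remote-subvolume back
>>> end-volume
>>> # FStr(2 .. 3) and BStr(2 .. 3)
>>> volume fstr
>>>   type cluster/stripe
>>>   option block-size *:32KB
>>>   subvolumes f2 f3
>>> end-volume
>>> volume bstr
>>>   type cluster/stripe
>>>   option block-size *:32KB
>>>   subvolumes b2 b3
>>> end-volume
>>> # namespace brick for the union (cluster/unify needs one)
>>> volume ns
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.2
>>>   option remote-subvolume ns
>>> end-volume
>>> # mount /glusterfs = union(FStr, BStr)
>>> volume union
>>>   type cluster/unify
>>>   option namespace ns
>>>   option scheduler rr   # simple round-robin file placement
>>>   subvolumes fstr bstr
>>> end-volume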
>>>
>>>
>>>
>>> The goal was to get good performance but also redundancy. But this 
>>> setup will not achieve that, will it? The stripes will not work when a 
>>> part of them is gone, and the union will not magically find the other 
>>> part of a file on the other stripe? And where should the union 
>>> namespace be placed for good performance?
>>>
>>> But my major question is this: I tried just a single stripe (not 
>>> using union on the client, only striping over the server volumes, 
>>> which in turn are mirrored on the server side). When rsync'ing data 
>>> onto it from a single server things worked fine, but when I put some 
>>> load on it from the other nodes (dd'ing some large files in and out), 
>>> the glusterfsd's on the first server died... Do you want me to look 
>>> into this further and try to reproduce and narrow down the problem, 
>>> or is this kind of setup in general not a good idea?
>>>
>>> Regards
>>> Jerker Nyberg.
>>>
>>> ### client config
>>>
>>> # remote slices
>>> volume brick2
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.2
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick3
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.3
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick4
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.4
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick5
>>>   type protocol/client
>>>   option transport-type tcp/client     # for TCP/IP transport
>>>   option remote-host 10.0.0.5
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick6
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.6
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick7
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.7
>>>   option remote-subvolume brick
>>> end-volume
>>> volume brick8
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.8
>>>   option remote-subvolume brick
>>> end-volume
>>> volume stripe
>>>   type cluster/stripe
>>>   subvolumes brick2 brick3 brick4 brick5 brick6 brick7 brick8
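>>>   # stripe every file (pattern "*") across the subvolumes in 32 KB blocks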
>>>   option block-size *:32KB
>>> end-volume
>>> ### Add iothreads
>>> volume iothreads
>>>    type performance/io-threads
>>>    option thread-count 32  # default is 1
>>>    option cache-size 64MB #64MB
>>>    subvolumes stripe
>>> end-volume
>>> ### Add readahead feature
>>> volume readahead
>>>   type performance/read-ahead
>>>   option page-size 256kB     # unit in bytes
>>> #  option page-count 20       # cache per file = (page-count x page-size)
>>>   option page-count 10       # cache per file = (page-count x page-size)
>>>   subvolumes iothreads
>>> end-volume
>>> ### Add IO-Cache feature
>>> volume iocache
>>>   type performance/io-cache
>>>   option page-size 256KB
>>> #  option page-size 100MB
>>>   option page-count 10
>>>   subvolumes readahead
>>> end-volume
>>> ### Add writeback feature
>>> volume writeback
>>>   type performance/write-behind
>>>   option aggregate-size 1MB
>>>   option flush-behind off
>>>   subvolumes iocache
>>> end-volume
>>>
>>> ### server config for the 10.0.0.2
>>>
>>> # posix
>>> volume ba
>>>   type storage/posix
>>>   option directory /hda/glusterfs-a
>>> end-volume
>>> volume bc
>>>   type storage/posix
>>>   option directory /hdc/glusterfs-c
>>> end-volume
>>> # remote mirror
>>> volume mc
>>>   type protocol/client
>>>   option transport-type tcp/client
>>>   option remote-host 10.0.0.3 # the next node
>>>   option remote-subvolume bc
>>> end-volume
>>> # join
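>>> # mirror the local first disk (ba) with the next node's second disk (mc),
>>> # i.e. the Forw volume from the definitions above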
>>> volume afr
>>>         type cluster/afr
>>>         subvolumes ba mc
>>> end-volume
>>> # lock
>>> volume pl
>>>   type features/posix-locks
>>>   subvolumes afr
>>> end-volume
>>> # threads
>>> volume brick
>>>    type performance/io-threads
>>>    option thread-count 16  # default is 1
>>>    option cache-size 128MB #64MB
>>>    subvolumes pl
>>> end-volume
>>> # export
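>>> # export both the mirrored volume (brick) for the clients and the raw bc,
>>> # presumably so the previous node in the ring can use bc as its own
>>> # remote mirror target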
>>> volume server
>>>   type protocol/server
>>>   option transport-type tcp/server
>>>   subvolumes brick, bc
>>>   option auth.ip.brick.allow *
>>>   option auth.ip.bc.allow *
>>> end-volume
>>>
>>>
>>>
>>> # glusterfs --version
>>> glusterfs 1.3.8 built on Nov 16 2007
>>> Copyright (c) 2006, 2007 Z RESEARCH Inc. <http://www.zresearch.com>
>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>> You may redistribute copies of GlusterFS under the terms of the GNU 
>>> General Public License.
>>>
>>>
>>>
>>>