[Gluster-users] disperse volume file to subvolume mapping

Serkan Çoban cobanserkan at gmail.com
Wed Apr 20 12:13:20 UTC 2016


Here are the steps I follow in detail and the relevant output from the bricks:

I am using the command below for volume creation:
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force
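
To double check the layout after creation, I run something like the following
(the exact output wording may differ between gluster versions); for a
78 x (16+4) volume it should report 1560 bricks:

gluster volume info v0 | grep -E 'Type|Number of Bricks|Status'
# expected to show something like:
#   Type: Distributed-Disperse
#   Number of Bricks: 78 x (16 + 4) = 1560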

Then I mount the volume on 50 clients:
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster

Then I create a directory from one of the clients and chmod it:
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
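
To be sure every client sees the directory with the right permissions before
the copy starts, a quick check like this can be run (client{01..50} are
placeholder names for my 50 client hosts):

for c in client{01..50}; do                  # placeholder hostnames
    ssh "$c" stat -c '%a %U:%G %n' /mnt/gluster/s1
done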

Then I start distcp on the clients. There are 1059 files of 8.8 GB each in one
folder, and they are copied to /mnt/gluster/s1 with 100 parallel maps, which
means 2 copy jobs per client at the same time:
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1

After the job finished, here is the status of the s1 directory on the bricks:
The s1 directory is present on all 1560 bricks.
The s1/teragen-10tb folder is present on all 1560 bricks.
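
One way to see from the client side which ec set a given file was hashed to is
the pathinfo virtual xattr (run as root on a client; the file name below is
only an example, substitute any real part-m-* file):

getfattr -n trusted.glusterfs.pathinfo /mnt/gluster/s1/teragen-10tb/part-m-00000
# the value lists the bricks of the single ec set that actually holds the file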

A full listing of the files on the bricks:
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0

You can ignore the .crc files in the brick output above; they are just
checksum files.

As you can see, the part-m-xxxx files are written only to some bricks, on nodes
0205..0224. All bricks have some of the files, but those are zero size.
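
To quantify that, a loop like the one below counts the non-empty part files per
node directly on the bricks (it assumes passwordless root ssh to the storage
nodes):

for h in 1.1.1.{185..244}; do
    n=$(ssh "$h" 'find /bricks/*/s1/teragen-10tb -type f -name "part-m-*" -size +0c 2>/dev/null | wc -l')
    echo "$h: $n non-empty part files"
done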

I increased the file descriptor limit to 65k, so that is not the issue.
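
For reference, this is roughly how I verified the limit on the client shells
and on the brick processes:

ulimit -n        # on a client: shell limit, should report the raised value
grep 'open files' /proc/$(pgrep -x glusterfsd | head -1)/limits   # on a storage node: limit of one brick process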





On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> Hi Serkan,
>
> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>
>>>>> I assume that gluster is used to store the intermediate files before
>>>>> the reduce phase
>>
>> Nope, gluster is the destination of the distcp command: hadoop distcp -m
>> 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>> This runs maps on the datanodes, which all have /mnt/gluster mounted.
>
>
> I don't know hadoop, so I'm of little help here. However it seems that -m 50
> means to execute 50 copies in parallel. This means that even if the
> distribution worked fine, at most 50 (most probably fewer) of the 78 ec sets
> would be used in parallel.
>
>>
>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>
>> Yes, but how can a client write 500 files to the gluster mount and have
>> those files written to only a subset of the subvolumes? I cannot use gluster
>> as a backup cluster if I cannot write to it with distcp.
>>
>
> All 500 files were created on only one of the 78 ec sets and the remaining
> 77 stayed empty?
>
>>>>> You should look at which files are created in each brick, and how many,
>>>>> while the process is running.
>>
>> Files were created only on nodes 185..204, or 205..224, or 225..244: only
>> 20 nodes in each test.
>
>
> How many files were there in each brick?
>
> Not sure if this is related, but standard Linux distributions have a
> default limit of 1024 open file descriptors. With such a big volume and
> such a massive copy, maybe this limit is affecting something?
>
> Are there any error or warning messages in the mount or brick logs?
>
>
> Xavi
>
>>
>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez <xhernandez at datalab.es>
>> wrote:
>>>
>>> Hi Serkan,
>>>
>>> moved to gluster-users since this doesn't belong to devel list.
>>>
>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>
>>>>
>>>> I am copying 10,000 files to the gluster volume using mapreduce on the
>>>> clients. Each map process takes one file at a time and copies it to the
>>>> gluster volume.
>>>
>>>
>>>
>>> I assume that gluster is used to store the intermediate files before the
>>> reduce phase.
>>>
>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So if I
>>>> copy more than 78 files in parallel, I expect each file to go to a
>>>> different subvolume, right?
>>>
>>>
>>>
>>> If you only copy 78 files, most probably you will get some subvolumes
>>> empty and some others with more than one or two files. It's not an exact
>>> distribution, it's a statistically balanced distribution: over time and
>>> with enough files, each brick will contain an amount of files of the same
>>> order of magnitude, but they won't have the *same* number of files.
>>>
>>>> In my tests with fio I can see every file goes to a different subvolume,
>>>> but when I start the mapreduce process from the clients only 78/3 = 26
>>>> subvolumes are used for writing files.
>>>
>>>
>>>
>>> This means that this is caused by some peculiarity of the mapreduce.
>>>
>>>> I can see that clearly from the network traffic. Mapreduce on the client
>>>> side can be run multi-threaded. I tested with 1, 5 and 10 threads on each
>>>> client, but every time only 26 subvolumes were used.
>>>> How can I debug the issue further?
>>>
>>>
>>>
>>> You should look at which files are created in each brick, and how many,
>>> while the process is running.
>>>
>>> Xavi
>>>
>>>
>>>>
>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi, I just did a fresh reinstall of 3.7.11 and I am seeing the same
>>>>>> behavior: 50 clients copying files named part-0-xxxx to gluster using
>>>>>> mapreduce, one thread per server, and they are using only 20 servers out
>>>>>> of 60. On the other hand, fio tests use all the servers. Is there
>>>>>> anything I can do to solve the issue?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Distribution of files to ec sets is done by DHT. In theory, if you
>>>>> create many files, each ec set will receive the same amount of files.
>>>>> However, when the number of files is small enough, statistics can fail.
>>>>>
>>>>> Not sure what you are doing exactly, but a mapreduce procedure generally
>>>>> only creates a single output. In that case it makes sense that only one
>>>>> ec set is used. If you want to use all ec sets for a single file, you
>>>>> should enable sharding (I haven't tested that) or split the result into
>>>>> multiple files.
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Serkan
>>>>>>
>>>>>>
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>
>>>>>>
>>>>>> Hi, I have a problem where clients are using only 1/3 of the nodes in a
>>>>>> disperse volume for writing.
>>>>>> I am testing from 50 clients using 1 to 10 threads with file names like
>>>>>> part-0-xxxx.
>>>>>> What I see is that the clients only use 20 nodes for writing. How is the
>>>>>> file-name-to-subvolume hashing done? Is it related to the file names
>>>>>> being similar?
>>>>>>
>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse
>>>>>> volume is 78 x (16+4). Only 26 out of 78 subvolumes are used during
>>>>>> writes.
>>>>>>
>>>>>
>>>
>

