[Gluster-users] disperse volume file to subvolume mapping

Xavier Hernandez xhernandez at datalab.es
Wed Apr 20 06:34:35 UTC 2016


Hi Serkan,

On 19/04/16 15:16, Serkan Çoban wrote:
>>>> I assume that gluster is used to store the intermediate files before the reduce phase
> Nope, gluster is the destination of the distcp command:
> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
> This runs maps on the datanodes, all of which have /mnt/gluster mounted.

I don't know hadoop, so I'm of little help here. However, it seems that 
-m 50 means executing 50 copies in parallel. This means that even if 
the distribution worked perfectly, at most 50 (probably fewer) of the 78 
ec sets would be used in parallel.
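
If the goal is simply to spread the copy over more ec sets, maybe a 
higher map count helps (just a sketch, since I'm not a hadoop expert; 
as far as I know -m only sets the maximum number of simultaneous maps, 
and I'm reusing the paths from your command):

    hadoop distcp -m 200 http://nn1:8020/path/to/folder file:///mnt/gluster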

>
>>>> This means that this is caused by some peculiarity of the mapreduce.
> Yes, but how can a client write 500 files to the gluster mount and have
> those files written only to a subset of subvolumes? I cannot use gluster
> as a backup cluster if I cannot write with distcp.
>

Were all 500 files created on only one of the 78 ec sets, while the 
remaining 77 stayed empty?

>>>> You should look which files are created in each brick and how many while the process is running.
> Files were only created on nodes 185..204, or 205..224, or 225..244:
> only on 20 nodes in each test.

How many files were there in each brick?
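
Something like this, run on each server, would give a quick per-brick 
count (just a sketch: I'm assuming the bricks are mounted under 
/bricks/*/brick, so adjust the pattern to your actual brick paths):

    # Count regular files per brick, skipping gluster's internal .glusterfs dir
    for b in /bricks/*/brick; do
        printf '%s: %s\n' "$b" \
            "$(find "$b" -path '*/.glusterfs' -prune -o -type f -print | wc -l)"
    done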

Not sure if this is related, but standard Linux distributions have a 
default limit of 1024 open file descriptors. With such a big volume and 
such a massive copy, maybe this limit is affecting something?
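
You can check the limit the brick processes actually got (again a 
sketch, assuming the brick daemons are the usual glusterfsd processes):

    # Limit of the current shell
    ulimit -n
    # Limit of each running brick process
    for p in $(pgrep glusterfsd); do
        echo "pid $p: $(grep 'Max open files' /proc/$p/limits)"
    done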

Are there any error or warning messages in the mount or brick logs?
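
On the client the mount log should be under /var/log/glusterfs/ (the 
file name is derived from the mount point, so something like 
mnt-gluster.log for /mnt/gluster), and the brick logs live under 
/var/log/glusterfs/bricks/ on the servers. A quick scan could be 
(sketch, adjust the paths if your logs are elsewhere):

    # Recent error (E) and warning (W) lines from the client mount log
    grep -E ' [EW] ' /var/log/glusterfs/mnt-gluster.log | tail -n 50
    # Same for the brick logs on each server
    grep -E ' [EW] ' /var/log/glusterfs/bricks/*.log | tail -n 50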

Xavi

>
> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>> Hi Serkan,
>>
>> Moved to gluster-users since this doesn't belong on the devel list.
>>
>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>
>>> I am copying 10,000 files to the gluster volume using mapreduce on the
>>> clients. Each map process takes one file at a time and copies it to the
>>> gluster volume.
>>
>>
>> I assume that gluster is used to store the intermediate files before the
>> reduce phase.
>>
>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So if I
>>> copy >78 files in parallel, I expect each file to go to a different
>>> subvolume, right?
>>
>>
>> If you only copy 78 files, most probably you will get some subvolumes
>> empty and some others with more than one or two files. It's not an exact
>> distribution, it's a statistically balanced distribution: over time and
>> with enough files, each brick will contain a number of files of the same
>> order of magnitude, but they won't have the *same* number of files.
>>
>>> In my tests with fio I can see every file going to a different
>>> subvolume, but when I start the mapreduce process from the clients,
>>> only 78/3=26 subvolumes are used for writing files.
>>
>>
>> This means that this is caused by some peculiarity of the mapreduce.
>>
>>> I can see that clearly from the network traffic. Mapreduce on the client
>>> side can run multi-threaded. I tested with 1, 5 and 10 threads on each
>>> client, but every time only 26 subvolumes were used.
>>> How can I debug the issue further?
>>
>>
>> You should look at which files are created in each brick, and how many,
>> while the process is running.
>>
>> Xavi
>>
>>
>>>
>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>> <xhernandez at datalab.es> wrote:
>>>>
>>>> Hi Serkan,
>>>>
>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>
>>>>>
>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same behavior.
>>>>> 50 clients are copying files named part-0-xxxx to gluster using mapreduce
>>>>> with one thread per server, and they are using only 20 servers out of
>>>>> 60. On the other hand, fio tests use all the servers. Is there anything
>>>>> I can do to solve the issue?
>>>>
>>>>
>>>>
>>>> Distribution of files to ec sets is done by dht. In theory, if you create
>>>> many files, each ec set will receive about the same number of files.
>>>> However, when the number of files is small enough, statistics can fail.
>>>>
>>>> Not sure what you are doing exactly, but a mapreduce procedure generally
>>>> creates only a single output. In that case it makes sense that only one
>>>> ec set is used. If you want to use all ec sets for a single file, you
>>>> should enable sharding (I haven't tested that) or split the result into
>>>> multiple files.
>>>>
>>>> Xavi
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Serkan
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>> Subject: disperse volume file to subvolume mapping
>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>
>>>>>
>>>>> Hi, I have a problem where clients are using only 1/3 of the nodes in a
>>>>> disperse volume for writing.
>>>>> I am testing from 50 clients using 1 to 10 threads, with file names like
>>>>> part-0-xxxx.
>>>>> What I see is that clients only use 20 nodes for writing. How is the file
>>>>> name to subvolume hashing done? Is this related to the file names being
>>>>> similar?
>>>>>
>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse
>>>>> volume is 78 x (16+4). Only 26 out of 78 subvolumes are used during
>>>>> writes.
>>>>>
>>>>
>>

