[Gluster-users] disperse volume file to subvolume mapping

Xavier Hernandez xhernandez at datalab.es
Tue Apr 19 10:05:16 UTC 2016


Hi Serkan,

Moved to gluster-users, since this doesn't belong on the devel list.

On 19/04/16 11:24, Serkan Çoban wrote:
> I am copying 10,000 files to the gluster volume using mapreduce on the
> clients. Each map process takes one file at a time and copies it to
> the gluster volume.

I assume that gluster is used to store the intermediate files before the 
reduce phase.

> My disperse volume consists of 78 subvolumes of 16+4 disks each. So if
> I copy more than 78 files in parallel, I expect each file to go to a
> different subvolume, right?

If you only copy 78 files, most probably some subvolumes will be empty
and others will have two or more files. It's not an exact distribution,
it's a statistically balanced one: over time and with enough files,
each brick will contain a number of files of the same order of
magnitude, but they won't have the *same* number of files.
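
As a rough illustration, here's a minimal sketch of the effect (this is
not gluster's real hash: DHT hashes the file name and looks it up in
per-subvolume hash ranges, while this just uses a generic uniform
hash):

#!/usr/bin/env python
# Spread 78 file names over 78 buckets with a uniform hash and see
# how uneven the result is. Illustrative only; gluster's DHT uses
# its own hash over the file name, not md5.
import hashlib
from collections import Counter

SUBVOLUMES = 78

def subvol_for(name):
    # Map a file name to a bucket index via a uniform hash.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % SUBVOLUMES

counts = Counter(subvol_for("part-0-%04d" % i) for i in range(SUBVOLUMES))
print("empty buckets: %d" % (SUBVOLUMES - len(counts)))
print("max files in one bucket: %d" % max(counts.values()))

With 78 files spread uniformly over 78 buckets you can expect about
78/e (roughly 29) buckets to stay empty, so "one file per subvolume"
almost never happens.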

> During my tests with fio I can see every file going to a different
> subvolume, but when I start the mapreduce process from the clients,
> only 78/3 = 26 subvolumes are used for writing files.

This suggests the problem is caused by some peculiarity of the
mapreduce job.

> I see that clearly from the network traffic. The mapreduce job on the
> client side can run multi-threaded. I tested with 1, 5 and 10 threads
> on each client, but every time only 26 subvolumes were used.
> How can I debug the issue further?

You should check which files are created in each brick, and how many,
while the process is running.
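
For example, a small script like this one, run on each server, prints
per-brick file counts (the brick path pattern is an assumption here;
adjust it to your actual layout):

#!/usr/bin/env python
# Count the files under each local brick while the copy is running.
# The brick root pattern below is an assumption; adjust as needed.
import glob
import os

for brick in sorted(glob.glob("/bricks/brick*/data")):
    nfiles = sum(len(files) for _, _, files in os.walk(brick))
    print("%s: %d files" % (brick, nfiles))

Running it a few times during the copy should show whether the same 26
subvolumes keep receiving all the new files.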

Xavi

>
> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>> Hi Serkan,
>>
>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>
>>> Hi, I just did a fresh reinstall of 3.7.11 and I am seeing the same
>>> behavior: 50 clients are copying files named part-0-xxxx to gluster
>>> using mapreduce, with one thread per server, and they are using only
>>> 20 servers out of 60. On the other hand, fio tests use all the
>>> servers. Is there anything I can do to solve the issue?
>>
>>
>> Distribution of files to ec sets is done by DHT. In theory, if you
>> create many files, each ec set will receive roughly the same number
>> of files. However, when the number of files is small, the
>> distribution can be quite uneven.
>>
>> I'm not sure what you are doing exactly, but a mapreduce procedure
>> generally creates only a single output. In that case it makes sense
>> that only one ec set is used. If you want to use all ec sets for a
>> single file, you should enable sharding (I haven't tested that) or
>> split the result into multiple files.
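>>
>> As a minimal sketch of the second option (the output name pattern and
>> chunk size here are arbitrary assumptions, not anything gluster
>> requires), splitting one result into many part files could look like
>> this:
>>
>> #!/usr/bin/env python
>> # Write one logical result as many part files so that DHT can hash
>> # each name to a different ec set. Names/chunk size are assumptions.
>> import sys
>>
>> CHUNK = 64 * 1024 * 1024  # 64 MiB per output file
>>
>> def split(src, prefix="out-"):
>>     with open(src, "rb") as f:
>>         idx = 0
>>         while True:
>>             data = f.read(CHUNK)
>>             if not data:
>>                 break
>>             with open("%s%04d" % (prefix, idx), "wb") as out:
>>                 out.write(data)
>>             idx += 1
>>
>> if __name__ == "__main__":
>>     split(sys.argv[1])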
>>
>> Xavi
>>
>>
>>>
>>> Thanks,
>>> Serkan
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>> Subject: disperse volume file to subvolume mapping
>>> To: Gluster Users <gluster-users at gluster.org>
>>>
>>>
>>> Hi, I have a problem where clients are using only 1/3 of the nodes
>>> in a disperse volume for writing.
>>> I am testing from 50 clients using 1 to 10 threads each, with file
>>> names like part-0-xxxx.
>>> What I see is that the clients only use 20 nodes for writing. How is
>>> the file-name-to-subvolume hashing done? Is this related to the file
>>> names being similar?
>>>
>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse
>>> volume is 78 x (16+4). Only 26 out of 78 subvolumes are used during
>>> writes.
>>>
>>

