[Gluster-users] disperse volume file to subvolume mapping

Xavier Hernandez xhernandez at datalab.es
Thu Apr 21 10:55:40 UTC 2016


Hi Serkan,

On 21/04/16 12:39, Serkan Çoban wrote:
> I started a gluster v rebalance v0 start command hoping that it will
> equally redistribute files across 60 nodes but it did not do that...
> why it did not redistribute files? any thoughts?

Has the rebalance operation finished successfully ? has it skipped any 
files ?

After a successful rebalance all files with attributes '---------T' 
should have disappeared.

>
> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
> <xhernandez at datalab.es> wrote:
>> Hi Serkan,
>>
>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>
>>>> I think the problem is in the temporary name that distcp gives to the
>>>> file while it's being copied before renaming it to the real name. Do you
>>>> know what is the structure of this name ?
>>>
>>> Distcp temporary file name format is:
>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
>>> temporary file name used by one map process. For example I see in the
>>> logs that one map copies files part-m-00031,part-m-00047,part-m-00063
>>> sequentially and they all use same temporary file name above. So no
>>> original file name appears in temporary file name.
>>
>>
>> This explains the problem. With the default options, DHT sends all files to
>> the subvolume that should store a file named 'distcp.tmp'.
>>
>> With this temporary name format, little can be done.
>>
>>>
>>> I will check if we can modify distcp behaviour, or we have to write
>>> our mapreduce procedures instead of using distcp.
>>>
>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>> your temporary file names and returns the same name that will finally have.
>>>> Depending on the differences between original and temporary file names, this
>>>> option could be useless.
>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>> name conversion, so the files will be evenly distributed. However this will
>>>> cause a lot of files placed in incorrect subvolumes, creating a lot of link
>>>> files until a rebalance is executed.
>>>
>>>
>>> How can I set these options?
>>
>>
>> You can set gluster options using:
>>
>> gluster volume set <volname> <option> <value>
>>
>> for example:
>>
>> gluster volume set v0 rsync-hash-regex none
>>
>> Xavi
>>
>>
>>>
>>>
>>>
>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>> <xhernandez at datalab.es> wrote:
>>>>
>>>> Hi Serkan,
>>>>
>>>> I think the problem is in the temporary name that distcp gives to the
>>>> file
>>>> while it's being copied before renaming it to the real name. Do you know
>>>> what is the structure of this name ?
>>>>
>>>> DHT selects the subvolume (in this case the ec set) on which the file
>>>> will
>>>> be stored based on the name of the file. This has a problem when a file
>>>> is
>>>> being renamed, because this could change the subvolume where the file
>>>> should
>>>> be found.
>>>>
>>>> DHT has a feature to avoid incorrect file placements when executing
>>>> renames
>>>> for the rsync case. What it does is to check if the file matches the
>>>> following regular expression:
>>>>
>>>>       ^\.(.+)\.[^.]+$
>>>>
>>>> If a match is found, it only considers the part between parenthesis to
>>>> calculate the destination subvolume.
>>>>
>>>> This is useful for rsync because temporary file names are constructed in
>>>> the
>>>> following way: suppose the original filename is 'test'. The temporary
>>>> filename while rsync is being executed is made by prepending a dot and
>>>> appending '.<random chars>': .test.712hd
>>>>
>>>> As you can see, the original name and the part of the name between
>>>> parenthesis that matches the regular expression are the same. This causes
>>>> that, after renaming the temporary file to its original filename, both
>>>> files
>>>> will be considered to belong to the same subvolume by DHT.
>>>>
>>>> In your case it's very probable that distcp uses a temporary name like
>>>> '.part.<number>'. In this case the portion of the name used to select the
>>>> subvolume is always 'part'. This would explain why all files go to the
>>>> same
>>>> subvolume. Once the file is renamed to another name, DHT realizes that it
>>>> should go to another subvolume. At this point it creates a link file
>>>> (those
>>>> files with access rights = '---------T') in the correct subvolume but it
>>>> doesn't move it. As you can see, this kind of files are better balanced.
>>>>
>>>> To solve this problem you have three options:
>>>>
>>>> 1. change the temporary filename used by distcp to correctly match the
>>>> regular expression. I'm not sure if this can be configured, but if this
>>>> is
>>>> possible, this is the best option.
>>>>
>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>> your
>>>> temporary file names and returns the same name that will finally have.
>>>> Depending on the differences between original and temporary file names,
>>>> this
>>>> option could be useless.
>>>>
>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>> name
>>>> conversion, so the files will be evenly distributed. However this will
>>>> cause
>>>> a lot of files placed in incorrect subvolumes, creating a lot of link
>>>> files
>>>> until a rebalance is executed.
>>>>
>>>> Xavi
>>>>
>>>>
>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>
>>>>>
>>>>> Here is the steps that I do in detail and relevant output from bricks:
>>>>>
>>>>> I am using below command for volume creation:
>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>
>>>>> then I mount volume on 50 clients:
>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>
>>>>> then I make a directory from one of the clients and chmod it.
>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>
>>>>> then I start distcp on clients, there are 1059X8.8GB files in one folder
>>>>> and
>>>>> they will be copied to /mnt/gluster/s1 with 100 parallel which means 2
>>>>> copy jobs per client at same time.
>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
>>>>> file:///mnt/gluster/s1
>>>>>
>>>>> After job finished here is the status of s1 directory from bricks:
>>>>> s1 directory is present in all 1560 brick.
>>>>> s1/teragen-10tb folder is present in all 1560 brick.
>>>>>
>>>>> full listing of files in bricks:
>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>
>>>>> You can ignore the .crc files in the brick output above, they are
>>>>> checksum files...
>>>>>
>>>>> As you can see part-m-xxxx files written only some bricks in nodes
>>>>> 0205..0224
>>>>> All bricks have some files but they have zero size.
>>>>>
>>>>> I increase file descriptors to 65k so it is not the issue...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>> <xhernandez at datalab.es>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Serkan,
>>>>>>
>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>> before
>>>>>>>>>> the reduce phase
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nope, gluster is the destination for distcp command. hadoop distcp -m
>>>>>>> 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>> This run maps on datanodes which have /mnt/gluster mounted on all of
>>>>>>> them.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't know hadoop, so I'm of little help here. However it seems that
>>>>>> -m
>>>>>> 50
>>>>>> means to execute 50 copies in parallel. This means that even if the
>>>>>> distribution worked fine, at most 50 (much probably less) of the 78 ec
>>>>>> sets
>>>>>> would be used in parallel.
>>>>>>
>>>>>>>
>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>> mapreduce.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yes but how a client write 500 files to gluster mount and those file
>>>>>>> just written only to subset of subvolumes? I cannot use gluster as a
>>>>>>> backup cluster if I cannot write with distcp.
>>>>>>>
>>>>>>
>>>>>> All 500 files were created only on one of the 78 ec sets and the
>>>>>> remaining
>>>>>> 77 got empty ?
>>>>>>
>>>>>>>>>> You should look which files are created in each brick and how many
>>>>>>>>>> while the process is running.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Files only created on nodes 185..204 or 205..224 or 225..244. Only on
>>>>>>> 20 nodes in each test.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> How many files there were in each brick ?
>>>>>>
>>>>>> Not sure if this can be related, but standard linux distributions have
>>>>>> a
>>>>>> default limit of 1024 open file descriptors. Having a so big volume and
>>>>>> doing a massive copy, maybe this limit is affecting something ?
>>>>>>
>>>>>> Are there any error or warning messages in the mount or bricks logs ?
>>>>>>
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>> <xhernandez at datalab.es>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Serkan,
>>>>>>>>
>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>
>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am copying 10.000 files to gluster volume using mapreduce on
>>>>>>>>> clients. Each map process took one file at a time and copy it to
>>>>>>>>> gluster volume.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I assume that gluster is used to store the intermediate files before
>>>>>>>> the
>>>>>>>> reduce phase.
>>>>>>>>
>>>>>>>>> My disperse volume consist of 78 subvolumes of 16+4 disk each. So If
>>>>>>>>> I
>>>>>>>>> copy >78 files parallel I expect each file goes to different
>>>>>>>>> subvolume
>>>>>>>>> right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> If you only copy 78 files, most probably you will get some subvolume
>>>>>>>> empty
>>>>>>>> and some other with more than one or two files. It's not an exact
>>>>>>>> distribution, it's a statistially balanced distribution: over time
>>>>>>>> and
>>>>>>>> with
>>>>>>>> enough files, each brick will contain an amount of files in the same
>>>>>>>> order
>>>>>>>> of magnitude, but they won't have the *same* number of files.
>>>>>>>>
>>>>>>>>> In my tests during tests with fio I can see every file goes to
>>>>>>>>> different subvolume, but when I start mapreduce process from clients
>>>>>>>>> only 78/3=26 subvolumes used for writing files.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>>>>
>>>>>>>>> I see that clearly from network traffic. Mapreduce on client side
>>>>>>>>> can
>>>>>>>>> be run multi thread. I tested with 1-5-10 threads on each client but
>>>>>>>>> every time only 26 subvolumes used.
>>>>>>>>> How can I debug the issue further?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You should look which files are created in each brick and how many
>>>>>>>> while
>>>>>>>> the
>>>>>>>> process is running.
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Serkan,
>>>>>>>>>>
>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi, I just reinstalled fresh 3.7.11 and I am seeing the same
>>>>>>>>>>> behavior.
>>>>>>>>>>> 50 clients copying part-0-xxxx named files using mapreduce to
>>>>>>>>>>> gluster
>>>>>>>>>>> using one thread per server and they are using only 20 servers out
>>>>>>>>>>> of
>>>>>>>>>>> 60. On the other hand fio tests use all the servers. Anything I
>>>>>>>>>>> can
>>>>>>>>>>> do
>>>>>>>>>>> to solve the issue?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if you
>>>>>>>>>> create
>>>>>>>>>> many files each ec set will receive the same amount of files.
>>>>>>>>>> However
>>>>>>>>>> when
>>>>>>>>>> the number of files is small enough, statistics can fail.
>>>>>>>>>>
>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>> generally
>>>>>>>>>> only creates a single output. In that case it makes sense that only
>>>>>>>>>> one
>>>>>>>>>> ec
>>>>>>>>>> set is used. If you want to use all ec sets for a single file, you
>>>>>>>>>> should
>>>>>>>>>> enable sharding (I haven't tested that) or split the result in
>>>>>>>>>> multiple
>>>>>>>>>> files.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Serkan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of nodes in
>>>>>>>>>>> disperse volume for writing.
>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file names
>>>>>>>>>>> part-0-xxxx.
>>>>>>>>>>> What I see is clients only use 20 nodes for writing. How is the
>>>>>>>>>>> file
>>>>>>>>>>> name to sub volume hashing is done? Is this related to file names
>>>>>>>>>>> are
>>>>>>>>>>> similar?
>>>>>>>>>>>
>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes each has 26 disks. Disperse
>>>>>>>>>>> volume
>>>>>>>>>>> is 78 x (16+4). Only 26 out of 78 sub volumes used during writes..
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>


More information about the Gluster-users mailing list