[Gluster-users] disperse volume file to subvolume mapping

Xavier Hernandez xhernandez at datalab.es
Fri Apr 22 06:43:37 UTC 2016


Is even the number of scanned files 0?

This looks like an issue with DHT, which is not my area of expertise. I'm
not sure whether the regular expression pattern that some files still
match could be interfering with the rebalance.

Anyway, if you have found a solution for your use case, that's fine with me.

Best regards,

Xavi

On 22/04/16 08:24, Serkan Çoban wrote:
> Not only the skipped column: all columns are 0 in the rebalance status
> output. It seems rebalance does not do anything. All the '---------T'
> files are still there. Anyway, we wrote our own custom mapreduce tool; it
> is copying files to gluster right now and is utilizing all 60 nodes as
> expected. I will delete the distcp folder and continue, unless you need
> any further log/debug files to examine the issue.
>
> Thanks for help,
> Serkan
>
> On Fri, Apr 22, 2016 at 9:15 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>> When you execute a rebalance 'force' the skipped column should be 0 for all
>> nodes and all '---------T' files must have disappeared. Otherwise something
>> failed. Is this true in your case ?
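
A rough way to check a brick for leftover link files is to look for regular
files whose permission bits are exactly the sticky bit ('---------T') and
whose size is zero. The Python sketch below is only a heuristic under those
assumptions: the function names are made up for illustration, the zero-size
test is an assumption, and it does not read the
trusted.glusterfs.dht.linkto extended attribute that DHT sets on link files.

```python
import os
import stat
import sys


def looks_like_dht_link(mode: int, size: int) -> bool:
    """Heuristic: regular file, permissions exactly '---------T', zero bytes."""
    return stat.S_ISREG(mode) and stat.S_IMODE(mode) == stat.S_ISVTX and size == 0


def find_link_candidates(brick_root: str):
    """Yield paths under brick_root that look like DHT link files."""
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Skip gluster's internal metadata directory on the brick.
        dirnames[:] = [d for d in dirnames if d != ".glusterfs"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if looks_like_dht_link(st.st_mode, st.st_size):
                yield path


if __name__ == "__main__":
    # Hypothetical usage: python find_links.py /bricks/02 /bricks/03
    for brick in sys.argv[1:]:
        for path in find_link_candidates(brick):
            print(path)
```

Run on each node against its brick directories; if it prints nothing after
a forced rebalance, no candidates were found on that brick.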
>>
>>
>> On 21/04/16 15:19, Serkan Çoban wrote:
>>>
>>> Same result. I also checked the rebalance.log file; it has no
>>> reference to the part files either...
>>>
>>> On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez <xhernandez at datalab.es>
>>> wrote:
>>>>
>>>> Can you try a 'gluster volume rebalance v0 start force' ?
>>>>
>>>>
>>>> On 21/04/16 14:23, Serkan Çoban wrote:
>>>>>>
>>>>>>
>>>>>> Has the rebalance operation finished successfully ? has it skipped any
>>>>>> files ?
>>>>>
>>>>>
>>>>> Yes according to gluster v rebalance status it is completed without any
>>>>> errors.
>>>>> rebalance status report is like:
>>>>> Node        Rebalanced files   Size     Scanned   Failures   Skipped
>>>>> 1.1.1.185   158                29GB     1720      0          314
>>>>> 1.1.1.205   93                 46.5GB   761       0          95
>>>>> 1.1.1.225   74                 37GB     779       0          94
>>>>>
>>>>>
>>>>> All other hosts have 0 values.
>>>>>
>>>>> I double-checked that files with '---------T' attributes are still
>>>>> there. Maybe some of them were deleted, but I still see them in the
>>>>> bricks...
>>>>> I am also wondering why the part files were not distributed to all 60
>>>>> nodes. Shouldn't rebalance do that?
>>>>>
>>>>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez
>>>>> <xhernandez at datalab.es>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Serkan,
>>>>>>
>>>>>> On 21/04/16 12:39, Serkan Çoban wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I started a 'gluster v rebalance v0 start' command hoping that it would
>>>>>>> redistribute the files equally across the 60 nodes, but it did not do
>>>>>>> that... Why did it not redistribute the files? Any thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Has the rebalance operation finished successfully ? has it skipped any
>>>>>> files ?
>>>>>>
>>>>>> After a successful rebalance all files with attributes '---------T'
>>>>>> should have disappeared.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Serkan,
>>>>>>>>
>>>>>>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think the problem is in the temporary name that distcp gives to
>>>>>>>>>> the file while it's being copied before renaming it to the real name.
>>>>>>>>>> Do you know what is the structure of this name ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The distcp temporary file name format is
>>>>>>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
>>>>>>>>> temporary file name is used by one map process for all of its files.
>>>>>>>>> For example, I see in the logs that one map copies the files
>>>>>>>>> part-m-00031, part-m-00047 and part-m-00063 sequentially, and they all
>>>>>>>>> use the temporary file name above. So the original file name does not
>>>>>>>>> appear in the temporary file name.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> This explains the problem. With the default options, DHT sends all
>>>>>>>> files to the subvolume that should store a file named 'distcp.tmp'.
>>>>>>>>
>>>>>>>> With this temporary name format, little can be done.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I will check if we can modify distcp behaviour, or we have to write
>>>>>>>>> our mapreduce procedures instead of using distcp.
>>>>>>>>>
>>>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>>>>>>> your temporary file names and returns the same name that the file will
>>>>>>>>>> finally have. Depending on the differences between original and
>>>>>>>>>> temporary file names, this option could be useless.
>>>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>>>>>>> name conversion, so the files will be evenly distributed. However this
>>>>>>>>>> will cause a lot of files to be placed in incorrect subvolumes,
>>>>>>>>>> creating a lot of link files until a rebalance is executed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How can I set these options?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You can set gluster options using:
>>>>>>>>
>>>>>>>> gluster volume set <volname> <option> <value>
>>>>>>>>
>>>>>>>> for example:
>>>>>>>>
>>>>>>>> gluster volume set v0 rsync-hash-regex none
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Serkan,
>>>>>>>>>>
>>>>>>>>>> I think the problem is in the temporary name that distcp gives to
>>>>>>>>>> the file while it's being copied before renaming it to the real name.
>>>>>>>>>> Do you know what is the structure of this name ?
>>>>>>>>>>
>>>>>>>>>> DHT selects the subvolume (in this case the ec set) on which the file
>>>>>>>>>> will be stored based on the name of the file. This has a problem when
>>>>>>>>>> a file is being renamed, because this could change the subvolume where
>>>>>>>>>> the file should be found.
>>>>>>>>>>
>>>>>>>>>> DHT has a feature to avoid incorrect file placements when executing
>>>>>>>>>> renames for the rsync case. What it does is to check if the file name
>>>>>>>>>> matches the following regular expression:
>>>>>>>>>>
>>>>>>>>>>          ^\.(.+)\.[^.]+$
>>>>>>>>>>
>>>>>>>>>> If a match is found, it only considers the part between parentheses
>>>>>>>>>> to calculate the destination subvolume.
>>>>>>>>>>
>>>>>>>>>> This is useful for rsync because temporary file names are constructed
>>>>>>>>>> in the following way: suppose the original filename is 'test'. The
>>>>>>>>>> temporary filename while rsync is being executed is made by prepending
>>>>>>>>>> a dot and appending '.<random chars>': .test.712hd
>>>>>>>>>>
>>>>>>>>>> As you can see, the original name and the part of the name between
>>>>>>>>>> parentheses that matches the regular expression are the same. As a
>>>>>>>>>> result, after renaming the temporary file to its original filename,
>>>>>>>>>> both names will be considered to belong to the same subvolume by DHT.
>>>>>>>>>>
>>>>>>>>>> In your case it's very probable that distcp uses a temporary name like
>>>>>>>>>> '.part.<number>'. In this case the portion of the name used to select
>>>>>>>>>> the subvolume is always 'part'. This would explain why all files go to
>>>>>>>>>> the same subvolume. Once the file is renamed to another name, DHT
>>>>>>>>>> realizes that it should go to another subvolume. At this point it
>>>>>>>>>> creates a link file (those files with access rights = '---------T') in
>>>>>>>>>> the correct subvolume, but it doesn't move the file itself. As you can
>>>>>>>>>> see, these link files are better balanced.
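
The effect of the regular expression on different file names can be tried
out with a short Python sketch. It only reproduces the name reduction
described above, not the hash that DHT computes afterwards; the function
name hashed_name is made up for illustration, and the distcp temporary name
is the one reported elsewhere in this thread.

```python
import re

# The rsync-style pattern quoted above.
RSYNC_HASH_REGEX = re.compile(r"^\.(.+)\.[^.]+$")


def hashed_name(filename: str) -> str:
    """Return the part of the name that would be used to pick the subvolume."""
    match = RSYNC_HASH_REGEX.match(filename)
    return match.group(1) if match else filename


names = [
    ".test.712hd",                                        # rsync temporary name
    ".part.12345",                                        # guessed distcp-like name
    ".distcp.tmp.attempt_1460381790773_0248_m_000001_0",  # reported distcp name
    "part-m-00031",                                       # final name, no leading dot
]
for name in names:
    print(f"{name!r} -> {hashed_name(name)!r}")
```

Every temporary name of the distcp form reduces to the same string, so
every copy starts on the same subvolume, while the final names do not match
the pattern and are hashed as they are.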
>>>>>>>>>>
>>>>>>>>>> To solve this problem you have three options:
>>>>>>>>>>
>>>>>>>>>> 1. change the temporary filename used by distcp to correctly match the
>>>>>>>>>> regular expression. I'm not sure if this can be configured, but if this
>>>>>>>>>> is possible, this is the best option.
>>>>>>>>>>
>>>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>>>>>>> your temporary file names and returns the same name that the file will
>>>>>>>>>> finally have. Depending on the differences between original and
>>>>>>>>>> temporary file names, this option could be useless.
>>>>>>>>>>
>>>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>>>>>>> name conversion, so the files will be evenly distributed. However this
>>>>>>>>>> will cause a lot of files to be placed in incorrect subvolumes,
>>>>>>>>>> creating a lot of link files until a rebalance is executed.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here are the steps that I follow, in detail, and the relevant output
>>>>>>>>>>> from the bricks:
>>>>>>>>>>>
>>>>>>>>>>> I am using the command below for volume creation:
>>>>>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>>>>>>> 1.1.1.{225..244}:/bricks/27 force
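
The layout produced by the create command above can be double-checked with
a short Python sketch. It assumes that the shell expands each {a..b} range
in order and that gluster groups every 20 consecutive bricks into one
disperse (ec) set; the variable names are only illustrative.

```python
from collections import Counter

# Assumption: bricks are listed in the same order as in the create command,
# and every DISPERSE_COUNT consecutive bricks form one ec set.
NODE_GROUPS = [range(185, 205), range(205, 225), range(225, 245)]
BRICK_DIRS = [f"/bricks/{n:02d}" for n in range(2, 28)]  # /bricks/02 .. /bricks/27
DISPERSE_COUNT = 20

bricks = [
    f"1.1.1.{host}:{brick_dir}"
    for brick_dir in BRICK_DIRS
    for group in NODE_GROUPS
    for host in group
]

ec_sets = [
    bricks[i:i + DISPERSE_COUNT] for i in range(0, len(bricks), DISPERSE_COUNT)
]

# Number of distinct nodes that each ec set spans.
nodes_per_set = {len({brick.split(":")[0] for brick in ec_set}) for ec_set in ec_sets}

# Number of ec sets that live on each group of 20 nodes (keyed by first node).
sets_per_group = Counter(ec_set[0].split(":")[0] for ec_set in ec_sets)

print(len(bricks), len(ec_sets), nodes_per_set, dict(sets_per_group))
```

Under these assumptions every ec set lives on a single group of 20 nodes,
so any workload that ends up on one subvolume only ever touches 20 of the
60 nodes.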
>>>>>>>>>>>
>>>>>>>>>>> then I mount volume on 50 clients:
>>>>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>>>>
>>>>>>>>>>> then I make a directory from one of the clients and chmod it.
>>>>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>>>>
>>>>>>>>>>> then I start distcp on the clients. There are 1059 files of 8.8GB
>>>>>>>>>>> each in one folder, and they are copied to /mnt/gluster/s1 with 100
>>>>>>>>>>> parallel copies, which means 2 copy jobs per client at the same time:
>>>>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb \
>>>>>>>>>>> file:///mnt/gluster/s1
>>>>>>>>>>>
>>>>>>>>>>> After the job finished, here is the status of the s1 directory on the
>>>>>>>>>>> bricks:
>>>>>>>>>>> s1 directory is present in all 1560 bricks.
>>>>>>>>>>> s1/teragen-10tb folder is present in all 1560 bricks.
>>>>>>>>>>>
>>>>>>>>>>> full listing of files in bricks:
>>>>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>>>>
>>>>>>>>>>> You can ignore the .crc files in the brick output above, they are
>>>>>>>>>>> checksum files...
>>>>>>>>>>>
>>>>>>>>>>> As you can see, the part-m-xxxx files are written only to some bricks
>>>>>>>>>>> on nodes 0205..0224.
>>>>>>>>>>> All bricks have some files but they have zero size.
>>>>>>>>>>>
>>>>>>>>>>> I increased the file descriptor limit to 65k, so that is not the
>>>>>>>>>>> issue...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>>>>>>> <xhernandez at datalab.es>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>>>>>>> before
>>>>>>>>>>>>>>>> the reduce phase
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nope, gluster is the destination for the distcp command:
>>>>>>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>>>>>>> This runs maps on the datanodes, all of which have /mnt/gluster
>>>>>>>>>>>>> mounted.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know hadoop, so I'm of little help here. However it seems that
>>>>>>>>>>>> -m 50 means to execute 50 copies in parallel. This means that even if
>>>>>>>>>>>> the distribution worked fine, at most 50 (most probably fewer) of the
>>>>>>>>>>>> 78 ec sets would be used in parallel.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>>>>>>>>>>> have those files written only to a subset of subvolumes? I cannot use
>>>>>>>>>>>>> gluster as a backup cluster if I cannot write with distcp.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> All 500 files were created only on one of the 78 ec sets and the
>>>>>>>>>>>> remaining 77 stayed empty ?
>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>> while the process is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Files are only created on nodes 185..204, or 205..224, or 225..244:
>>>>>>>>>>>>> only on 20 nodes in each test.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> How many files were there in each brick ?
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure if this can be related, but standard linux distributions have
>>>>>>>>>>>> a default limit of 1024 open file descriptors. With such a big volume
>>>>>>>>>>>> and a massive copy, maybe this limit is affecting something ?
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any error or warning messages in the mount or brick logs ?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>>>>>>> <xhernandez at datalab.es>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am copying 10,000 files to the gluster volume using mapreduce on
>>>>>>>>>>>>>>> the clients. Each map process takes one file at a time and copies it
>>>>>>>>>>>>>>> to the gluster volume.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files before
>>>>>>>>>>>>>> the reduce phase.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So
>>>>>>>>>>>>>>> if I copy >78 files in parallel, I expect each file to go to a
>>>>>>>>>>>>>>> different subvolume, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you only copy 78 files, most probably you will get some subvolumes
>>>>>>>>>>>>>> empty and some others with more than one or two files. It's not an
>>>>>>>>>>>>>> exact distribution, it's a statistically balanced distribution: over
>>>>>>>>>>>>>> time and with enough files, each brick will contain an amount of
>>>>>>>>>>>>>> files in the same order of magnitude, but they won't have the *same*
>>>>>>>>>>>>>> number of files.
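
To put a rough number on this, assume an idealized model in which each file
name independently picks one of the k subvolumes uniformly at random (real
DHT hash ranges only approximate this). The expected number of subvolumes
left empty after creating n files is then k * (1 - 1/k)^n. A small Python
sketch, with a made-up function name:

```python
def expected_empty(subvolumes: int, files: int) -> float:
    """Expected number of empty subvolumes under uniform, independent placement."""
    # Each subvolume is missed by one file with probability (1 - 1/k),
    # and by all n files with probability (1 - 1/k) ** n.
    return subvolumes * (1 - 1 / subvolumes) ** files


for files in (78, 500, 10_000):
    print(files, round(expected_empty(78, files), 3))
```

Under this model, copying 78 files to 78 subvolumes leaves more than a third
of them empty on average, while with thousands of files an empty subvolume
becomes very unlikely.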
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In my tests with fio I can see that every file goes to a different
>>>>>>>>>>>>>>> subvolume, but when I start the mapreduce process from the clients,
>>>>>>>>>>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that clearly from the network traffic. Mapreduce on the client
>>>>>>>>>>>>>>> side can be run multi-threaded. I tested with 1, 5 and 10 threads on
>>>>>>>>>>>>>>> each client, but every time only 26 subvolumes were used.
>>>>>>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You should look at which files are created in each brick, and how
>>>>>>>>>>>>>> many, while the process is running.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>>>>>>> behavior. 50 clients are copying part-0-xxxx named files to gluster
>>>>>>>>>>>>>>>>> using mapreduce, with one thread per server, and they are using only
>>>>>>>>>>>>>>>>> 20 servers out of 60. On the other hand, fio tests use all the
>>>>>>>>>>>>>>>>> servers. Is there anything I can do to solve the issue?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if you
>>>>>>>>>>>>>>>> create many files each ec set will receive the same amount of files.
>>>>>>>>>>>>>>>> However when the number of files is small enough, statistics can
>>>>>>>>>>>>>>>> fail.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>>>>>>> generally only creates a single output. In that case it makes sense
>>>>>>>>>>>>>>>> that only one ec set is used. If you want to use all ec sets for a
>>>>>>>>>>>>>>>> single file, you should enable sharding (I haven't tested that) or
>>>>>>>>>>>>>>>> split the result in multiple files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the nodes
>>>>>>>>>>>>>>>>> in a disperse volume for writing.
>>>>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file names
>>>>>>>>>>>>>>>>> part-0-xxxx.
>>>>>>>>>>>>>>>>> What I see is that clients only use 20 nodes for writing. How is the
>>>>>>>>>>>>>>>>> file name to subvolume hashing done? Is this related to the file
>>>>>>>>>>>>>>>>> names being similar?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse
>>>>>>>>>>>>>>>>> volume is 78 x (16+4). Only 26 out of 78 subvolumes are used during
>>>>>>>>>>>>>>>>> writes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>