[Gluster-users] Failures during rebalance on gluster distributed disperse volume

Sunil Kumar Heggodu Gopala Acharya sheggodu at redhat.com
Sat Sep 15 09:57:41 UTC 2018


Hi Mauro,

As Nithya highlighted, FALLOCATE support for EC volumes went in with 3.11 as
part of https://bugzilla.redhat.com/show_bug.cgi?id=1454686. Hence, upgrading
to 3.12 as suggested earlier would be the right move.
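For the archives: the failure is easy to reproduce outside of rebalance with the util-linux `fallocate` tool. The sketch below is mine, not from the thread (the helper name and probe path are illustrative); on a 3.10.x EC volume mount it should report "NOT supported", matching the "Operation not supported" errors in the rebalance log.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): try to preallocate 1 MiB
# at a given mount point and report whether fallocate(2) is supported
# there. Uses the util-linux `fallocate` tool.
probe_fallocate() {
    mount_point="$1"
    probe_file="$mount_point/.fallocate_probe.$$"
    if fallocate -l 1M "$probe_file" 2>/dev/null; then
        echo "fallocate supported on $mount_point"
    else
        echo "fallocate NOT supported on $mount_point"
    fi
    rm -f "$probe_file"
}

# Example: probe a local filesystem; replace /tmp with your Gluster
# mount point to test the volume itself.
probe_fallocate /tmp
```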

Here is the documentation for upgrading to 3.12:
https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_3.12/
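A rough sketch of the server-by-server (rolling) procedure that guide describes, for quick reference only; the package and repository names below are distro-dependent assumptions (CentOS-style), and the guide itself remains the authoritative source:

```shell
# Rolling upgrade sketch, one server at a time. Assumptions: CentOS/RHEL
# packaging, volume name "tier2" as in this thread. Not a substitute for
# the upgrade guide linked above.

# 0. Verify no self-heal is pending anywhere before starting:
gluster volume heal tier2 info

# 1. On the server being upgraded, stop all Gluster processes:
systemctl stop glusterd
pkill glusterfs; pkill glusterfsd

# 2. Install the 3.12 packages (repo/package names are assumptions):
yum install -y centos-release-gluster312
yum update -y glusterfs-server

# 3. Bring the server back and let self-heal finish before moving on
#    to the next server:
systemctl start glusterd
gluster volume heal tier2 info
```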

Regards,

Sunil kumar Acharya

Senior Software Engineer

Red Hat



On Sat, Sep 15, 2018 at 3:42 AM, Mauro Tridici <mauro.tridici at cmcc.it>
wrote:

>
> Hi Nithya,
>
> thank you very much for your answer.
> I will wait for @Sunil opinion too before starting the upgrade procedure.
>
> Since this will be the first upgrade of our Gluster cluster, I would like
> to know whether it is a potentially dangerous procedure and whether there
> is any risk of losing data :-)
> Unfortunately, I can’t make a preventive copy of the volume data to
> another location.
> If possible, could you please outline the steps needed to complete the
> upgrade from version 3.10.5 to 3.12?
>
> Thank you again, Nithya.
> Thank you to all of you for the help!
>
> Regards,
> Mauro
>
> Il giorno 14 set 2018, alle ore 16:59, Nithya Balachandran <
> nbalacha at redhat.com> ha scritto:
>
> Hi Mauro,
>
>
> The rebalance code started using fallocate in 3.10.5 (
> https://bugzilla.redhat.com/show_bug.cgi?id=1473132) which works fine on
> replicated volumes. However, we neglected to test this with EC volumes on
> 3.10. Once we discovered the issue, the EC fallocate implementation was
> made available in 3.11.
>
> At this point, I'm afraid the only option I see is to upgrade to at least
> 3.12.
>
> @Sunil, do you have anything to add?
>
> Regards,
> Nithya
>
> On 13 September 2018 at 18:34, Mauro Tridici <mauro.tridici at cmcc.it>
> wrote:
>
>>
>> Hi Nithya,
>>
>> thank you for involving EC group.
>> I will wait for your suggestions.
>>
>> Regards,
>> Mauro
>>
>> Il giorno 13 set 2018, alle ore 13:38, Nithya Balachandran <
>> nbalacha at redhat.com> ha scritto:
>>
>> This looks like an issue caused by rebalance switching to fallocate,
>> which EC had not yet implemented at that point.
>>
>> @Pranith, @Ashish, which version of gluster had support for fallocate in
>> EC?
>>
>>
>> Regards,
>> Nithya
>>
>> On 12 September 2018 at 19:24, Mauro Tridici <mauro.tridici at cmcc.it>
>> wrote:
>>
>>> Dear All,
>>>
>>> I recently added 3 servers (each one with 12 bricks) to an existing
>>> Gluster Distributed Disperse Volume.
>>> Volume extension has been completed without error and I already executed
>>> the rebalance procedure with fix-layout option with no problem.
>>> I just launched the rebalance procedure without fix-layout option, but,
>>> as you can see in the output below, I noticed that some failures have been
>>> detected.
>>>
>>> [root at s01 glusterfs]# gluster v rebalance tier2 status
>>>       Node  Rebalanced-files    size  scanned  failures  skipped       status  run time in h:m:s
>>>  ---------  ----------------  ------  -------  --------  -------  -----------  -----------------
>>>  localhost             71176   3.2MB  2137557   1530391     8128  in progress           13:59:05
>>>    s02-stg                 0  0Bytes        0         0        0    completed           11:53:28
>>>    s03-stg                 0  0Bytes        0         0        0    completed           11:53:32
>>>    s04-stg                 0  0Bytes        0         0        0    completed            0:00:06
>>>    s05-stg                15  0Bytes    17055         0       18    completed           10:48:01
>>>    s06-stg                 0  0Bytes        0         0        0    completed            0:00:06
>>> Estimated time left for rebalance to complete :        0:46:53
>>> volume rebalance: tier2: success
>>>
>>> In the volume rebalance log file, I detected a lot of error messages
>>> similar to the following ones:
>>>
>>> [2018-09-12 13:15:50.756703] E [MSGID: 0] [dht-rebalance.c:1696:dht_migrate_file] 0-tier2-dht: Create dst failed on - tier2-disperse-6 for file - /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2005-12_grid.nc
>>> [2018-09-12 13:15:50.757025] E [MSGID: 109023] [dht-rebalance.c:2733:gf_defrag_migrate_single_file] 0-tier2-dht: migrate-data failed for /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2005-12_grid.nc
>>> [2018-09-12 13:15:50.759183] E [MSGID: 109023] [dht-rebalance.c:844:__dht_rebalance_create_dst_file] 0-tier2-dht: fallocate failed for /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2005-09_grid.nc on tier2-disperse-9 (Operation not supported)
>>> [2018-09-12 13:15:50.759206] E [MSGID: 0] [dht-rebalance.c:1696:dht_migrate_file] 0-tier2-dht: Create dst failed on - tier2-disperse-9 for file - /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2005-09_grid.nc
>>> [2018-09-12 13:15:50.759536] E [MSGID: 109023] [dht-rebalance.c:2733:gf_defrag_migrate_single_file] 0-tier2-dht: migrate-data failed for /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2005-09_grid.nc
>>> [2018-09-12 13:15:50.777219] E [MSGID: 109023] [dht-rebalance.c:844:__dht_rebalance_create_dst_file] 0-tier2-dht: fallocate failed for /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2006-01_grid.nc on tier2-disperse-10 (Operation not supported)
>>> [2018-09-12 13:15:50.777241] E [MSGID: 0] [dht-rebalance.c:1696:dht_migrate_file] 0-tier2-dht: Create dst failed on - tier2-disperse-10 for file - /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2006-01_grid.nc
>>> [2018-09-12 13:15:50.777676] E [MSGID: 109023] [dht-rebalance.c:2733:gf_defrag_migrate_single_file] 0-tier2-dht: migrate-data failed for /CSP/sp1/CESM/archive/sps_200508_003/atm/hist/postproc/sps_200508_003.cam.h0.2006-01_grid.nc
>>>
>>> Could you please help me to understand what is happening and how to
>>> solve it?
>>>
>>> Our installation is based on Gluster v3.10.5.
>>>
>>> Thank you in advance,
>>> Mauro
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>>
>>
>> -------------------------
>> Mauro Tridici
>>
>> Fondazione CMCC
>> CMCC Supercomputing Center
>> presso Complesso Ecotekne - Università del Salento -
>> Strada Prov.le Lecce - Monteroni sn
>> 73100 Lecce  IT
>> http://www.cmcc.it
>>
>> mobile: (+39) 327 5630841
>> email: mauro.tridici at cmcc.it
>>
>>