[Gluster-devel] Skipped files during rebalance

Wed Aug 19 10:12:46 UTC 2015

Dear Susant,

Apparently the glusterd process was stuck in a strange state, so we restarted it on stor106. This allowed us to stop the volume and reboot.

I will start a new rebalance now, and will get the information you asked for during the rebalance operation.
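
A rough sketch of how the per-process memory figures could be collected while the rebalance runs. The function name `sum_gluster_rss` is made up for illustration, and matching on command names that start with "gluster" is an assumption. It reads `rss command` pairs on stdin so it can be tried without a live cluster:

```shell
# Sum resident memory (RSS, in KiB as reported by ps) per gluster command name.
# Input: one "rss command-name" pair per line on stdin.
# Hypothetical helper; the "gluster" prefix match is an assumption.
sum_gluster_rss() {
    awk '$2 ~ /^gluster/ { rss[$2] += $1 }
         END { for (c in rss) print c, rss[c] }' | LC_ALL=C sort
}

# Possible use on each server (rss= and comm= suppress the header line):
#   ps -eo rss=,comm= | sum_gluster_rss
```

Note that the rebalance process and the mount processes both show up as `glusterfs` here, so telling them apart still needs the full command line, for example via the `ps aux | grep "rebalance"` check requested further down.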

I think it makes more sense to post the logs of this new rebalance operation.
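
Before posting, the rebalance log could be condensed into a count of error lines per message ID, which makes it easier to see whether lookup failures (109023) or transport errors (114031) dominate. This is only a sketch: the function name `count_errors_by_msgid` is invented here, and the log path in the usage comment is an assumption about the default location.

```shell
# Count error-level ("E") log lines per MSGID.
# Input: glusterfs log lines on stdin, one entry per line.
# Output: "MSGID count" pairs, sorted by MSGID.
count_errors_by_msgid() {
    grep -E '^\[[0-9-]+ [0-9:.]+\] E \[MSGID: [0-9]+\]' |
        sed -E 's/^.*\[MSGID: ([0-9]+)\].*$/\1/' |
        LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible use (path is an assumption, adjust to the actual log file):
#   count_errors_by_msgid < /var/log/glusterfs/live-rebalance.log
```

Lines without a MSGID, such as the critical ping-timeout message or the `saved_frames_unwind` backtraces, are deliberately ignored by this filter.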

Kind regards,

—
Christophe

> On 19 Aug 2015, at 08:49, Susant Palai <spalai at redhat.com> wrote:
> 
> Hi Christophe,
>   Forgot to ask you to post the rebalance and glusterd logs.
> 
> Regards,
> Susant
> 
> 
> ----- Original Message -----
>> From: "Susant Palai" <spalai at redhat.com>
>> To: "Christophe TREFOIS" <christophe.trefois at uni.lu>
>> Cc: "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Wednesday, August 19, 2015 11:44:35 AM
>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>> 
>> Comments inline.
>> 
>> ----- Original Message -----
>>> From: "Christophe TREFOIS" <christophe.trefois at uni.lu>
>>> To: "Susant Palai" <spalai at redhat.com>
>>> Cc: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Nithya Balachandran"
>>> <nbalacha at redhat.com>, "Shyamsundar
>>> Ranganathan" <srangana at redhat.com>, "Mohammed Rafi K C"
>>> <rkavunga at redhat.com>, "Gluster Devel"
>>> <gluster-devel at gluster.org>
>>> Sent: Tuesday, August 18, 2015 8:08:41 PM
>>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>>> 
>>> Hi Susant,
>>> 
>>> Thank you for the response.
>>> 
>>>> On 18 Aug 2015, at 10:45, Susant Palai <spalai at redhat.com> wrote:
>>>> 
>>>> Hi Christophe,
>>>> 
>>>>  Need some info regarding the high mem-usage.
>>>> 
>>>> 1. Top output: To see whether any other process is eating up memory.
>> 
>> I would be interested to know the memory usage of all the gluster processes
>> with regard to the high mem-usage. These processes include glusterfsd,
>> glusterd, gluster, any mount process (glusterfs), and rebalance (glusterfs).
>> 
>> 
>>>> 2. Gluster volume info
>>> 
>>> [root at highlander ~]# gluster volume info
>>> 
>>> Volume Name: live
>>> Type: Distribute
>>> Volume ID: 1328637d-7730-4627-8945-bbe43626d527
>>> Status: Started
>>> Number of Bricks: 9
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: stor104:/zfs/brick0/brick
>>> Brick2: stor104:/zfs/brick1/brick
>>> Brick3: stor104:/zfs/brick2/brick
>>> Brick4: stor106:/zfs/brick0/brick
>>> Brick5: stor106:/zfs/brick1/brick
>>> Brick6: stor106:/zfs/brick2/brick
>>> Brick7: stor105:/zfs/brick0/brick
>>> Brick8: stor105:/zfs/brick1/brick
>>> Brick9: stor105:/zfs/brick2/brick
>>> Options Reconfigured:
>>> diagnostics.count-fop-hits: on
>>> diagnostics.latency-measurement: on
>>> server.allow-insecure: on
>>> cluster.min-free-disk: 1%
>>> diagnostics.brick-log-level: ERROR
>>> diagnostics.client-log-level: ERROR
>>> cluster.data-self-heal-algorithm: full
>>> performance.cache-max-file-size: 4MB
>>> performance.cache-refresh-timeout: 60
>>> performance.cache-size: 1GB
>>> performance.client-io-threads: on
>>> performance.io-thread-count: 32
>>> performance.write-behind-window-size: 4MB
>>> 
>>>> 3. Is rebalance process still running? If yes can you point to specific
>>>> mem usage by rebalance process? The high mem-usage was seen during
>>>> rebalance or even post rebalance?
>>> 
>>> I would like to restart the rebalance process since it failed… But I can’t,
>>> as the volume cannot be stopped (I wanted to reboot the servers to have a
>>> clean testing ground).
>>> 
>>> Here are the logs from the three nodes:
>>> http://paste.fedoraproject.org/256183/43989079
>>> 
>>> Maybe you could help me figure out how to stop the volume?
>>> 
>>> This is what happens
>>> 
>>> [root at highlander ~]# gluster volume rebalance live stop
>>> volume rebalance: live: failed: Rebalance not started.
>> 
>> Requesting glusterd team to give input.
>>> 
>>> [root at highlander ~]# ssh stor105 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root at highlander ~]# ssh stor104 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root at highlander ~]# ssh stor106 "gluster volume rebalance live stop"
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root at highlander ~]# gluster volume rebalance live stop
>>> volume rebalance: live: failed: Rebalance not started.
>>> 
>>> [root at highlander ~]# gluster volume stop live
>>> Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
>>> volume stop: live: failed: Staging failed on stor106. Error: rebalance session is in progress for the volume 'live'
>>> Staging failed on stor104. Error: rebalance session is in progress for the volume 'live'
>> Can you run [ps aux |  grep "rebalance"] on all the servers and post here?
>> Just want to check whether rebalance is really running or not. Again
>> requesting glusterd team to give inputs.
>> 
>>> 
>>> 
>>>> 4. Gluster version
>>> 
>>> [root at highlander ~]# pdsh -g live 'rpm -qa | grep gluster'
>>> stor104: glusterfs-api-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-server-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-libs-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-fuse-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>> stor104: glusterfs-cli-3.7.3-1.el7.x86_64
>>> 
>>> stor105: glusterfs-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-api-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-cli-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-server-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-libs-3.7.3-1.el7.x86_64
>>> stor105: glusterfs-fuse-3.7.3-1.el7.x86_64
>>> 
>>> stor106: glusterfs-libs-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-fuse-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-client-xlators-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-api-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-cli-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-server-3.7.3-1.el7.x86_64
>>> stor106: glusterfs-3.7.3-1.el7.x86_64
>>> 
>>>> 
>>>> Will ask for more information in case needed.
>>>> 
>>>> Regards,
>>>> Susant
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Christophe TREFOIS" <christophe.trefois at uni.lu>
>>>>> To: "Raghavendra Gowdappa" <rgowdapp at redhat.com>, "Nithya Balachandran"
>>>>> <nbalacha at redhat.com>, "Susant Palai"
>>>>> <spalai at redhat.com>, "Shyamsundar Ranganathan" <srangana at redhat.com>
>>>>> Cc: "Mohammed Rafi K C" <rkavunga at redhat.com>
>>>>> Sent: Monday, 17 August, 2015 7:03:20 PM
>>>>> Subject: Fwd: [Gluster-devel] Skipped files during rebalance
>>>>> 
>>>>> Hi DHT team,
>>>>> 
>>>>> This email somehow didn’t get forwarded to you.
>>>>> 
>>>>> In addition to my problem described below, here is one example of free
>>>>> memory after everything failed
>>>>> 
>>>>> [root at highlander ~]# pdsh -g live 'free -m'
>>>>> stor106:               total        used        free      shared  buff/cache   available
>>>>> stor106: Mem:         193249      124784        1347           9       67118       12769
>>>>> stor106: Swap:             0           0           0
>>>>> stor104:               total        used        free      shared  buff/cache   available
>>>>> stor104: Mem:         193249      107617       31323           9       54308       42752
>>>>> stor104: Swap:             0           0           0
>>>>> stor105:               total        used        free      shared  buff/cache   available
>>>>> stor105: Mem:         193248      141804        6736           9       44707        9713
>>>>> stor105: Swap:             0           0           0
>>>>> 
>>>>> So after the failed operation, there’s almost no memory free, and it is
>>>>> also not freed up.
>>>>> 
>>>>> Thank you for pointing me in the right direction,
>>>>> 
>>>>> Kind regards,
>>>>> 
>>>>> —
>>>>> Christophe
>>>>> 
>>>>> 
>>>>> Begin forwarded message:
>>>>> 
>>>>> From: Christophe TREFOIS <christophe.trefois at uni.lu>
>>>>> Subject: Re: [Gluster-devel] Skipped files during rebalance
>>>>> Date: 17 Aug 2015 11:54:32 CEST
>>>>> To: Mohammed Rafi K C <rkavunga at redhat.com>
>>>>> Cc: "gluster-devel at gluster.org" <gluster-devel at gluster.org>
>>>>> 
>>>>> Dear Rafi,
>>>>> 
>>>>> Thanks for submitting a patch.
>>>>> 
>>>>> @DHT, I have two additional questions / problems.
>>>>> 
>>>>> 1. When doing a rebalance (with data), RAM consumption on the nodes goes
>>>>> dramatically high, e.g. out of 196 GB available per node, RAM usage would
>>>>> fill up to 195.6 GB. This seems quite excessive and strange to me.
>>>>> 
>>>>> 2. As you can see, the rebalance (with data) failed as one endpoint
>>>>> becomes unconnected (even though it still is connected). I’m thinking this
>>>>> could be due to the high RAM usage?
>>>>> 
>>>>> Thank you for your help,
>>>>> 
>>>>> —
>>>>> Christophe
>>>>> 
>>>>> Dr Christophe Trefois, Dipl.-Ing.
>>>>> Technical Specialist / Post-Doc
>>>>> 
>>>>> UNIVERSITÉ DU LUXEMBOURG
>>>>> 
>>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>>> Campus Belval | House of Biomedicine
>>>>> 6, avenue du Swing
>>>>> L-4367 Belvaux
>>>>> T: +352 46 66 44 6124
>>>>> F: +352 46 66 44 6949
>>>>> http://www.uni.lu/lcsb
>>>>> 
>>>>> 
>>>>> 
>>>>> ----
>>>>> This message is confidential and may contain privileged information.
>>>>> It is intended for the named recipient only.
>>>>> If you receive it in error please notify me and permanently delete the
>>>>> original message and any copies.
>>>>> ----
>>>>> 
>>>>> 
>>>>> 
>>>>> On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavunga at redhat.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>>>>> Dear all,
>>>>> 
>>>>> I have successfully added a new node to our setup, and finally managed to
>>>>> get a successful fix-layout run as well with no errors.
>>>>> 
>>>>> Now, as per the documentation, I started a gluster volume rebalance live
>>>>> start task and I see many skipped files.
>>>>> The error log then contains entries like the following for each skipped file.
>>>>> 
>>>>> [2015-08-16 20:23:30.591161] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:30.768391] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:30.804811] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:30.805201] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:30.880037] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:31.038236] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:31.259762] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:31.333764] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:23:31.340190] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex
>>>>> lookup failed
>>>>> 
>>>>> Update: one of the rebalance tasks now failed.
>>>>> 
>>>>> @Rafi, I got the same error as Friday except this time with data.
>>>>> 
>>>>> Packets carrying the ping request could be waiting in the queue for the
>>>>> whole time-out period because of the heavy traffic on the network. I
>>>>> have sent a patch for this. You can track the status here:
>>>>> http://review.gluster.org/11935
>>>>> 
>>>>> 
>>>>> 
>>>>> [2015-08-16 20:24:34.533167] C
>>>>> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0:
>>>>> server
>>>>> 192.168.123.104:49164 has not responded in the last 42 seconds,
>>>>> disconnecting.
>>>>> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(INODELK(29))
>>>>> called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>>>>> [2015-08-16 20:24:34.533672] E [MSGID: 114031]
>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
>>>>> operation failed [Transport endpoint is not connected]
>>>>> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>> called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
>>>>> [2015-08-16 20:24:34.534347] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex:
>>>>> failed to migrate data
>>>>> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>> called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
>>>>> [2015-08-16 20:24:34.534579] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex:
>>>>> failed to migrate data
>>>>> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>> called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
>>>>> [2015-08-16 20:24:34.534745] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex:
>>>>> failed to migrate data
>>>>> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>> called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
>>>>> [2015-08-16 20:24:34.535232] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex:
>>>>> failed to migrate data
>>>>> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12))
>>>>> called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
>>>>> [2015-08-16 20:24:34.536069] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed:
>>>>> /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex:
>>>>> failed to migrate data
>>>>> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(LOOKUP(27))
>>>>> called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
>>>>> [2015-08-16 20:24:34.536339] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(LOOKUP(27))
>>>>> called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
>>>>> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(LOOKUP(27))
>>>>> called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
>>>>> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(LOOKUP(27))
>>>>> called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
>>>>> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
>>>>> op(LOOKUP(27))
>>>>> called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
>>>>> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (-->
>>>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (-->
>>>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
>>>>> (-->
>>>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
>>>>> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called
>>>>> at
>>>>> 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
>>>>> [2015-08-16 20:24:34.538475] E [MSGID: 114031]
>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
>>>>> operation failed [Transport endpoint is not connected]
>>>>> The message "E [MSGID: 114031]
>>>>> [client-rpc-fops.c:1621:client3_3_inodelk_cbk]
>>>>> 0-live-client-0: remote operation failed [Transport endpoint is not
>>>>> connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and
>>>>> [2015-08-16 20:24:34.538535]
>>>>> [2015-08-16 20:24:34.538584] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>>> file failed: 002004003.flex lookup failed
>>>>> [2015-08-16 20:24:34.538904] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate
>>>>> file failed: 003009008.flex lookup failed
>>>>> [2015-08-16 20:24:34.539724] E [MSGID: 109023]
>>>>> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
>>>>> failed:/hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex
>>>>> lookup failed
>>>>> [2015-08-16 20:24:34.539820] E [MSGID: 109016]
>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
>>>>> failed
>>>>> for /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
>>>>> [2015-08-16 20:24:34.540031] E [MSGID: 109016]
>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
>>>>> failed
>>>>> for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
>>>>> [2015-08-16 20:24:34.540691] E [MSGID: 114031]
>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex
>>>>> [Transport endpoint is not connected]
>>>>> [2015-08-16 20:24:34.541152] E [MSGID: 114031]
>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex
>>>>> [Transport endpoint is not connected]
>>>>> [2015-08-16 20:24:34.541331] E [MSGID: 114031]
>>>>> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
>>>>> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
>>>>> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex
>>>>> [Transport endpoint is not connected]
>>>>> [2015-08-16 20:24:34.541486] E [MSGID: 109016]
>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
>>>>> failed
>>>>> for /hcs/hcs/OperaArchiveCol
>>>>> [2015-08-16 20:24:34.541572] E [MSGID: 109016]
>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
>>>>> failed
>>>>> for /hcs/hcs
>>>>> [2015-08-16 20:24:34.541639] E [MSGID: 109016]
>>>>> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
>>>>> failed
>>>>> for /hcs
>>>>> 
>>>>> Any help would be greatly appreciated.
>>>>> CCing the DHT team to give you a better idea of why the rebalance failed
>>>>> and about the huge memory consumption by the rebalance process (200 GB
>>>>> RAM).
>>>>> 
>>>>> Regards
>>>>> Rafi KC
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> --
>>>>> Christophe
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>> 
>>> 