[Gluster-devel] Skipped files during rebalance

Mohammed Rafi K C rkavunga at redhat.com
Mon Aug 17 09:27:07 UTC 2015



On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
>
> Dear all,
>
>  
>
> I have successfully added a new node to our setup, and finally managed
> to get a successful fix-layout run as well with no errors.
>
>  
>
> Now, as per the documentation, I started a gluster volume rebalance
> live start task and I see many skipped files. 
>
> The error log then contains entries like the following for each skipped file.
>
>  
>
> [2015-08-16 20:23:30.591161] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex
> lookup failed
>
> [2015-08-16 20:23:30.768391] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex
> lookup failed
>
> [2015-08-16 20:23:30.804811] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex
> lookup failed
>
> [2015-08-16 20:23:30.805201] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex
> lookup failed
>
> [2015-08-16 20:23:30.880037] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex
> lookup failed
>
> [2015-08-16 20:23:31.038236] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex
> lookup failed
>
> [2015-08-16 20:23:31.259762] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex
> lookup failed
>
> [2015-08-16 20:23:31.333764] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex
> lookup failed
>
> [2015-08-16 20:23:31.340190] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex
> lookup failed
>
>  
>
> Update: one of the rebalance tasks now failed.
>
>  
>
> @Rafi, I got the same error as Friday except this time with data.
>

Packets carrying the ping request could have been waiting in the queue
for the whole timeout period because of the heavy traffic on the
network. I have sent a patch for this. You can track its status here:
http://review.gluster.org/11935
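
As a rough illustration of that delay (a sketch only, not part of the
patch): the Python snippet below assumes the rebalance log has been saved
as plain text with the mail quoting removed. It compares the timestamp of
each saved_frames_unwind entry with the "called at" time of the frame that
was unwound. The function name frame_wait_times, the regular expression and
the shortened SAMPLE entry are made up for this example.

```python
import re
from datetime import datetime

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}"
FMT = "%Y-%m-%d %H:%M:%S.%f"

# One saved_frames_unwind entry: the time it was logged, the operation that
# was pending, and the time that operation was originally sent ("called at").
ENTRY_RE = re.compile(
    r"\[(?P<logged>" + TIMESTAMP + r")\]\s+E\s+"
    r"\[rpc-clnt\.c:\d+:saved_frames_unwind\]"
    r".*?op\((?P<op>[\w-]+)\(\d+\)\)\s+called\s+at\s+"
    r"(?P<called>" + TIMESTAMP + r")",
    re.DOTALL,  # entries may be wrapped over several lines
)


def frame_wait_times(log_text):
    """Return a list of (operation, seconds waited) for unwound frames."""
    waits = []
    for match in ENTRY_RE.finditer(log_text):
        logged = datetime.strptime(match.group("logged"), FMT)
        called = datetime.strptime(match.group("called"), FMT)
        waits.append((match.group("op"), (logged - called).total_seconds()))
    return waits


# Shortened copy of one entry from the log (backtrace trimmed for brevity).
SAMPLE = (
    "[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] "
    "(--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) "
    "0-live-client-0: forced unwinding frame type(GlusterFS 3.3) "
    "op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)\n"
)

if __name__ == "__main__":
    for op, seconds in frame_wait_times(SAMPLE):
        print(f"{op}: pending for {seconds:.1f} s before being unwound")
```

For the INODELK frame in Christophe's log the gap is about 43 seconds, just
past the 42-second limit reported by rpc_clnt_ping_timer_expired.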


>  
>
> [2015-08-16 20:24:34.533167] C
> [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0:
> server 192.168.123.104:49164 has not responded in the last 42 seconds,
> disconnecting.
>
> [2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
>
> [2015-08-16 20:24:34.533672] E [MSGID: 114031]
> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
> operation failed [Transport endpoint is not connected]
>
> [2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
>
> [2015-08-16 20:24:34.534347] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex:
> failed to migrate data
>
> [2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
>
> [2015-08-16 20:24:34.534579] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex:
> failed to migrate data
>
> [2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
>
> [2015-08-16 20:24:34.534745] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex:
> failed to migrate data
>
> [2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
>
> [2015-08-16 20:24:34.535232] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex:
> failed to migrate data
>
> [2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(READ(12)) called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
>
> [2015-08-16 20:24:34.536069] E [MSGID: 109023]
> [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file
> failed: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex:
> failed to migrate data
>
> [2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
>
> [2015-08-16 20:24:34.536339] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex
> lookup failed
>
> [2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
>
> [2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
>
> [2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
>
> [2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
>
> [2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] )))))
> 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2))
> called at 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
>
> [2015-08-16 20:24:34.538475] E [MSGID: 114031]
> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
> operation failed [Transport endpoint is not connected]
>
> The message "E [MSGID: 114031]
> [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote
> operation failed [Transport endpoint is not connected]" repeated 4
> times between [2015-08-16 20:24:34.538475] and [2015-08-16
> 20:24:34.538535]
>
> [2015-08-16 20:24:34.538584] E [MSGID: 109023]
> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht:
> Migrate file failed: 002004003.flex lookup failed
>
> [2015-08-16 20:24:34.538904] E [MSGID: 109023]
> [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht:
> Migrate file failed: 003009008.flex lookup failed
>
> [2015-08-16 20:24:34.539724] E [MSGID: 109023]
> [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file
> failed:/hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex
> lookup failed
>
> [2015-08-16 20:24:34.539820] E [MSGID: 109016]
> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
> failed for /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
>
> [2015-08-16 20:24:34.540031] E [MSGID: 109016]
> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
> failed for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
>
> [2015-08-16 20:24:34.540691] E [MSGID: 114031]
> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex
> [Transport endpoint is not connected]
>
> [2015-08-16 20:24:34.541152] E [MSGID: 114031]
> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex
> [Transport endpoint is not connected]
>
> [2015-08-16 20:24:34.541331] E [MSGID: 114031]
> [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote
> operation failed. Path: /hcs/hcs/OperaArchiveCol/SK
> 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex
> [Transport endpoint is not connected]
>
> [2015-08-16 20:24:34.541486] E [MSGID: 109016]
> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
> failed for /hcs/hcs/OperaArchiveCol
>
> [2015-08-16 20:24:34.541572] E [MSGID: 109016]
> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
> failed for /hcs/hcs
>
> [2015-08-16 20:24:34.541639] E [MSGID: 109016]
> [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout
> failed for /hcs
>
>  
>
> Any help would be greatly appreciated.
>
CCing the DHT team to give you a better idea of why the rebalance failed,
and of the huge memory consumption by the rebalance process (200 GB RAM).

Regards
Rafi KC



>  
>
> Thanks,
>
>  
>
> --
>
> Christophe
>
> *Dr Christophe Trefois, Dipl.-Ing.*  
> Technical Specialist / Post-Doc
>
> *UNIVERSITÉ DU LUXEMBOURG*
>
> *LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE*
> Campus Belval | House of Biomedicine  
> 6, avenue du Swing 
> L-4367 Belvaux  
> T: +352 46 66 44 6124 
> F: +352 46 66 44 6949  
> http://www.uni.lu/lcsb
>
> ----
> This message is confidential and may contain privileged information. 
> It is intended for the named recipient only. 
> If you receive it in error please notify me and permanently delete the
> original message and any copies. 
> ----
>
>   
>
>  
>
