[Gluster-devel] Skipped files during rebalance

Christophe TREFOIS christophe.trefois at uni.lu
Mon Aug 17 09:54:32 UTC 2015


Dear Rafi,

Thanks for submitting a patch.

@DHT, I have two additional questions / problems.

1. When doing a rebalance (with data), RAM consumption on the nodes becomes extremely high: out of 196 GB available per node, RAM usage fills up to 195.6 GB. This seems excessive and strange to me.

2. As you can see, the rebalance (with data) failed because one endpoint was reported as not connected (even though it is still connected). I'm thinking this could be due to the high RAM usage?
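
One way to check whether the 195.6 GB is really held by the rebalance process, rather than by the kernel page cache, is to sample the resident set size of the process itself. The following is a minimal Python sketch under stated assumptions: it runs on Linux, it reads the VmRSS field from /proc/<pid>/status, and the PID given on the command line is that of the rebalance process (finding that PID is left out here). The helper names are illustrative and are not part of GlusterFS.

```python
import re
import sys
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process in KiB (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())


if __name__ == "__main__":
    # Assumption: the first argument is the PID of the rebalance process.
    pid = int(sys.argv[1])
    for _ in range(5):
        rss = read_rss_kib(pid)
        if rss is None:
            print("VmRSS not reported for this process")
        else:
            print(f"pid {pid}: RSS = {rss / (1024 * 1024):.2f} GiB")
        time.sleep(10)
```

If the reported RSS stays small while overall usage on the node is near 196 GB, the memory is more likely page cache than a leak in the process; if the RSS itself grows towards the node's capacity, that points at the process.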

Thank you for your help,

—
Christophe

Dr Christophe Trefois, Dipl.-Ing.
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine
6, avenue du Swing
L-4367 Belvaux
T: +352 46 66 44 6124
F: +352 46 66 44 6949
http://www.uni.lu/lcsb

----
This message is confidential and may contain privileged information.
It is intended for the named recipient only.
If you receive it in error please notify me and permanently delete the original message and any copies.
----



On 17 Aug 2015, at 11:27, Mohammed Rafi K C <rkavunga at redhat.com> wrote:



On 08/17/2015 01:58 AM, Christophe TREFOIS wrote:
Dear all,

I have successfully added a new node to our setup, and finally managed to get a successful fix-layout run as well with no errors.

Now, as per the documentation, I started a "gluster volume rebalance live start" task, and I see many skipped files.
The error log then contains entries like the following for each skipped file.

[2015-08-16 20:23:30.591161] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004010008.flex lookup failed
[2015-08-16 20:23:30.768391] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007005003.flex lookup failed
[2015-08-16 20:23:30.804811] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006005009.flex lookup failed
[2015-08-16 20:23:30.805201] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005006011.flex lookup failed
[2015-08-16 20:23:30.880037] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/005009012.flex lookup failed
[2015-08-16 20:23:31.038236] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/003008007.flex lookup failed
[2015-08-16 20:23:31.259762] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/004008006.flex lookup failed
[2015-08-16 20:23:31.333764] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/007008001.flex lookup failed
[2015-08-16 20:23:31.340190] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_05(2013-10-11_17-12-02)/006007004.flex lookup failed
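
With this many entries, it can help to tally the "Migrate file failed" messages by reporting function, failure reason and directory rather than reading them one by one. The following is a minimal Python sketch, not a GlusterFS tool. It assumes each log entry sits on a single line (as in the log file itself, not a wrapped copy) and that the only failure reasons are the two visible in this thread, "lookup failed" and "failed to migrate data". The function name and the log path taken from the command line are hypothetical.

```python
import posixpath
import re
import sys
from collections import Counter

# Matches MSGID-tagged "Migrate file failed" entries as they appear in this thread.
MIGRATE_FAILED = re.compile(
    r"\[MSGID: (?P<msgid>\d+)\] "
    r"\[[^:\]]+:\d+:(?P<func>\w+)\] \S+: "
    r"Migrate file failed:\s*(?P<path>.+?):? "
    r"(?P<reason>lookup failed|failed to migrate data)$"
)


def summarize(lines):
    """Count migration failures by (function, reason) and by parent directory."""
    by_reason = Counter()
    by_directory = Counter()
    for line in lines:
        match = MIGRATE_FAILED.search(line.rstrip("\n"))
        if match is None:
            continue  # not a migration failure entry
        by_reason[(match.group("func"), match.group("reason"))] += 1
        by_directory[posixpath.dirname(match.group("path"))] += 1
    return by_reason, by_directory


if __name__ == "__main__":
    # Assumption: the first argument is the path to the rebalance log file.
    with open(sys.argv[1], errors="replace") as handle:
        reasons, directories = summarize(handle)
    for (func, reason), count in reasons.most_common():
        print(f"{count:8d}  {func}: {reason}")
    for directory, count in directories.most_common(10):
        print(f"{count:8d}  {directory or '(no directory in message)'}")
```

Entries that carry only a file name, without a directory, are grouped under an empty directory key.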

Update: one of the rebalance tasks now failed.

@Rafi, I got the same error as Friday except this time with data.

Packets carrying the ping request could be waiting in the queue for the whole time-out period because of heavy traffic on the network. I have sent a patch for this. You can track its status here: http://review.gluster.org/11935



[2015-08-16 20:24:34.533167] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-0: server 192.168.123.104:49164 has not responded in the last 42 seconds, disconnecting.
[2015-08-16 20:24:34.533614] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2015-08-16 20:23:51.305640 (xid=0x5dd4da)
[2015-08-16 20:24:34.533672] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
[2015-08-16 20:24:34.534201] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303938 (xid=0x5dd4d7)
[2015-08-16 20:24:34.534347] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007008007.flex: failed to migrate data
[2015-08-16 20:24:34.534413] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.303969 (xid=0x5dd4d8)
[2015-08-16 20:24:34.534579] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007009012.flex: failed to migrate data
[2015-08-16 20:24:34.534676] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.313548 (xid=0x5dd4db)
[2015-08-16 20:24:34.534745] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/006008011.flex: failed to migrate data
[2015-08-16 20:24:34.535199] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326369 (xid=0x5dd4dc)
[2015-08-16 20:24:34.535232] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005003001.flex: failed to migrate data
[2015-08-16 20:24:34.535984] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2015-08-16 20:23:51.326437 (xid=0x5dd4dd)
[2015-08-16 20:24:34.536069] E [MSGID: 109023] [dht-rebalance.c:1124:dht_migrate_file] 0-live-dht: Migrate file failed: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007010012.flex: failed to migrate data
[2015-08-16 20:24:34.536267] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.337240 (xid=0x5dd4de)
[2015-08-16 20:24:34.536339] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/002005012.flex lookup failed
[2015-08-16 20:24:34.536487] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.425254 (xid=0x5dd4df)
[2015-08-16 20:24:34.536685] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.738907 (xid=0x5dd4e0)
[2015-08-16 20:24:34.536891] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805096 (xid=0x5dd4e1)
[2015-08-16 20:24:34.537316] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-08-16 20:23:51.805977 (xid=0x5dd4e2)
[2015-08-16 20:24:34.537735] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fa454de59e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fa454bb09be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fa454bb0ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fa454bb247c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fa454bb2c38] ))))) 0-live-client-0: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-16 20:23:52.530107 (xid=0x5dd4e3)
[2015-08-16 20:24:34.538475] E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]
The message "E [MSGID: 114031] [client-rpc-fops.c:1621:client3_3_inodelk_cbk] 0-live-client-0: remote operation failed [Transport endpoint is not connected]" repeated 4 times between [2015-08-16 20:24:34.538475] and [2015-08-16 20:24:34.538535]
[2015-08-16 20:24:34.538584] E [MSGID: 109023] [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate file failed: 002004003.flex lookup failed
[2015-08-16 20:24:34.538904] E [MSGID: 109023] [dht-rebalance.c:1617:gf_defrag_migrate_single_file] 0-live-dht: Migrate file failed: 003009008.flex lookup failed
[2015-08-16 20:24:34.539724] E [MSGID: 109023] [dht-rebalance.c:1965:gf_defrag_get_entry] 0-live-dht: Migrate file failed:/hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)/005009006.flex lookup failed
[2015-08-16 20:24:34.539820] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_08(2013-10-11_20-12-25)
[2015-08-16 20:24:34.540031] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1
[2015-08-16 20:24:34.540691] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/002005008.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541152] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/005004009.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541331] E [MSGID: 114031] [client-rpc-fops.c:251:client3_3_mknod_cbk] 0-live-client-0: remote operation failed. Path: /hcs/hcs/OperaArchiveCol/SK 20131011_Oligo_Rot_lowConc_P1/Meas_12(2013-10-12_00-12-55)/007005011.flex [Transport endpoint is not connected]
[2015-08-16 20:24:34.541486] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol
[2015-08-16 20:24:34.541572] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs
[2015-08-16 20:24:34.541639] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs
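
The timestamps in these entries can be checked against the explanation that requests sat in the queue for the whole time-out period. The following is a small Python sketch of that arithmetic, using values copied from the log above. It assumes that the GF-DUMP op(NULL(2)) frame is the ping request, which the log does not state explicitly; the constant and function names are illustrative.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the log entries above.
PING_EXPIRED_AT = "2015-08-16 20:24:34.533167"   # rpc_clnt_ping_timer_expired
GF_DUMP_NULL_SENT = "2015-08-16 20:23:52.530107"  # op(NULL(2)), assumed to be the ping
INODELK_SENT = "2015-08-16 20:23:51.305640"       # op(INODELK(29)), xid=0x5dd4da
INODELK_UNWOUND_AT = "2015-08-16 20:24:34.533614"


def seconds_between(start, end):
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


if __name__ == "__main__":
    print("GF-DUMP NULL frame to ping expiry:",
          seconds_between(GF_DUMP_NULL_SENT, PING_EXPIRED_AT), "s")
    print("INODELK call to forced unwind:",
          seconds_between(INODELK_SENT, INODELK_UNWOUND_AT), "s")
```

The first interval comes to roughly 42.0 seconds, matching the "has not responded in the last 42 seconds" message, and the INODELK frame waited roughly 43.2 seconds before it was forcibly unwound. Both are consistent with requests going unanswered until the ping timer expired.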

Any help would be greatly appreciated.
CCing the DHT team to give you a better idea of why the rebalance failed, and of the huge memory consumption by the rebalance process (200 GB RAM).

Regards
Rafi KC




Thanks,

--
Christophe
