[Gluster-users] Destroyed Connection on gluster 3.2.5

Heiko Schröter schroete at iup.physik.uni-bremen.de
Wed Nov 28 10:31:30 UTC 2012


Hello,

we are seeing the gluster mount break down on a single client running 
parallel wget downloads (70 threads, ~20MB/s in total, on a 1Gb network).
The other 23 clients did not drop the gluster mount.
In the logs we see "destroyed connection" messages.
After a forced unmount and remount, everything works fine again.
We had a gluster rebalance running before the data download and stopped 
it manually so that it would not interfere with the download.


I am asking for your experience: could this be related to network 
outages or overload?
Or is something broken by the stopped rebalance operation in 3.2.5?
Would an upgrade to 3.3.1 improve the situation, i.e. a reconnect instead 
of "destroyed connection" and dangling mount points on the client?

Heiko


Below please find the volume info, the client log, and the logs of two affected bricks:
gluster 3.2.5
rd28 ~ # gluster volume info all
Volume Name: data
Type: Distribute
Status: Started
Number of Bricks: 16
Transport-type: tcp
Bricks:
Brick1: rd29:/data
Brick2: rd34:/data
Brick3: rd28:/data
Brick4: rd24:/data
Brick5: rd26:/data
Brick6: rd27:/data
Brick7: rd21:/data
Brick8: rd20:/data
Brick9: rd22:/data
Brick10: rd23:/data
Brick11: rd30:/data
Brick12: rd31:/data
Brick13: rd32:/data
Brick14: rd33:/data
Brick15: rd25:/data
Brick16: rd35:/data
Options Reconfigured:
nfs.port: 2049
cluster.min-free-disk: 5%
network.ping-timeout: 24
nfs.export-volumes: on
nfs.export-dir: /data
nfs.disable: off
performance.stat-prefetch: off


###### CLIENT (hc10):
[2012-11-26 22:00:10.489854] C 
[client-handshake.c:121:rpc_client_ping_timer_expired] 0-data-client-2: 
server 192.168.16.138:24009 has not responded in the last 24 seconds, 
disconnecting.
[2012-11-26 22:00:10.656750] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d) [0x7f0a06a8e50d] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) 
[0x7f0a06a8e1dd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f0a06a8e13e]))) 0-data-client-2: forced unwinding frame 
type(GlusterFS 3.1) op(RELEASE(41)) called at 2012-11-26 21:58:47.398947
[2012-11-26 22:00:10.656828] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d) [0x7f0a06a8e50d] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) 
[0x7f0a06a8e1dd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f0a06a8e13e]))) 0-data-client-2: forced unwinding frame 
type(GlusterFS 3.1) op(WRITE(13)) called at 2012-11-26 21:58:47.399011
[2012-11-26 22:00:10.656844] I 
[client3_1-fops.c:683:client3_1_writev_cbk] 0-data-client-2: remote 
operation failed: Transport endpoint is not connected
[2012-11-26 22:00:10.657005] W [fuse-bridge.c:1828:fuse_writev_cbk] 
0-glusterfs-fuse: 1634988442: WRITE => -1 (Transport endpoint is not 
connected)
[2012-11-26 22:00:10.710760] I [socket.c:2275:socket_submit_request] 
0-data-client-2: not connected (priv->connected = 0)
[2012-11-26 22:00:10.710800] W [rpc-clnt.c:1417:rpc_clnt_submit] 
0-data-client-2: failed to submit rpc-request (XID: 0x21514768x Program: 
GlusterFS 3.1, ProgVers: 310, Proc: 13) to rpc-transport (data-client-2)
[2012-11-26 22:00:10.727969] I 
[client3_1-fops.c:683:client3_1_writev_cbk] 0-data-client-2: remote 
operation failed: Transport endpoint is not connected
[2012-11-26 22:00:10.727991] W [client3_1-fops.c:3622:client3_1_writev] 
0-data-client-2: failed to send the fop: Stale NFS file handle
pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
<snip><snap>
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-11-26 22:00:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.5
/lib64/libc.so.6(+0x35b80)[0x7f0a060f0b80]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync_cbk+0x30)[0x7f0a02fa3550]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev_cbk+0xd3)[0x7f0a031bbd63]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_writev+0x12a)[0x7f0a034054ca]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client_writev+0xa1)[0x7f0a033e9921]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev+0x162)[0x7f0a031c0c42]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync+0x569)[0x7f0a02f9c929]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_do_ops+0x53)[0x7f0a02fa0953]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_process_queue+0xf2)[0x7f0a02f9dcc2]
/usr/lib/glusterfs/3.2.5/xlator/performance/write-behind.so(wb_sync_cbk+0xf7)[0x7f0a02fa3617]
/usr/lib/glusterfs/3.2.5/xlator/cluster/distribute.so(dht_writev_cbk+0xd3)[0x7f0a031bbd63]
/usr/lib/glusterfs/3.2.5/xlator/protocol/client.so(client3_1_writev_cbk+0x507)[0x7f0a03401797]
/usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1ca)[0x7f0a06a8e0ba]
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f0a06a8e13e]
/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d)[0x7f0a06a8e1dd]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d)[0x7f0a06a8e50d]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0a06a8aaa8]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_poll_err+0x54)[0x7f0a04447004]
/usr/lib/glusterfs/3.2.5/rpc-transport/socket.so(socket_event_handler+0x138)[0x7f0a0444ca38]
/usr/lib64/libglusterfs.so.0(+0x3ed4e)[0x7f0a06cd4d4e]
/usr/sbin/glusterfs(main+0x2a9)[0x406689]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0a060dd09d]
/usr/sbin/glusterfs[0x403b89]
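
To put numbers on the client log above, here is a minimal Python sketch that re-joins the wrapped log records, picks out the ping-timer expiry, and computes how long each force-unwound frame had been pending. The function names (records, summarize, parse_ts) and the regular expressions are my own and are derived only from the lines shown in this post; other gluster versions may log differently. It also assumes the fractional part of a timestamp is a microsecond count.

```python
import re
from datetime import datetime, timedelta

# A record starts with a bracketed timestamp; wrapped mail lines continue it.
RECORD_START = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]")
# Patterns below are inferred from the 3.2.5 log lines in this post only.
PING_EXPIRED = re.compile(
    r"rpc_client_ping_timer_expired\] (\S+): server (\S+) has not responded "
    r"in the last (\d+) seconds"
)
UNWOUND = re.compile(
    r"op\((\w+)\(\d+\)\) called at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def parse_ts(text):
    # Assumption: the fraction is a microsecond count (possibly unpadded).
    whole, frac = text.split(".")
    base = datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")
    return base + timedelta(microseconds=int(frac))


def records(lines):
    """Yield one string per log record, re-joining wrapped lines."""
    current = None
    for line in lines:
        line = line.strip()
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None and line:
            current += " " + line
    if current is not None:
        yield current


def summarize(lines):
    """Return (ping timeouts, ages in seconds of force-unwound frames)."""
    timeouts, stalled = [], []
    for rec in records(lines):
        when = parse_ts(RECORD_START.match(rec).group(1))
        m = PING_EXPIRED.search(rec)
        if m:
            timeouts.append((when, m.group(1), m.group(2), int(m.group(3))))
        m = UNWOUND.search(rec)
        if m:
            age = (when - parse_ts(m.group(2))).total_seconds()
            stalled.append((m.group(1), round(age, 3)))
    return timeouts, stalled


# Two records copied from the client log above, wrapped as in the mail.
SAMPLE = """
[2012-11-26 22:00:10.489854] C 
[client-handshake.c:121:rpc_client_ping_timer_expired] 0-data-client-2: 
server 192.168.16.138:24009 has not responded in the last 24 seconds, 
disconnecting.
[2012-11-26 22:00:10.656750] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x6d) [0x7f0a06a8e50d] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7d) 
[0x7f0a06a8e1dd] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f0a06a8e13e]))) 0-data-client-2: forced unwinding frame 
type(GlusterFS 3.1) op(RELEASE(41)) called at 2012-11-26 21:58:47.398947
"""

if __name__ == "__main__":
    timeouts, stalled = summarize(SAMPLE.splitlines())
    for when, xlator, server, secs in timeouts:
        print(f"{when} {xlator}: {server} silent for {secs}s")
    for op, age in stalled:
        print(f"{op} frame pending for {age}s before forced unwind")
```

For the two sample records, the RELEASE frame was issued at 21:58:47 and unwound at 22:00:10, so it had been pending for roughly 83 seconds, well beyond the configured network.ping-timeout of 24.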

##### BRICK (data-client-2):
[2012-11-26 22:00:17.19582] W 
[socket.c:1494:__socket_proto_state_machine] 0-tcp.data-server: reading 
from socket failed. Error (Transport endpoint is not connected), peer 
(192.168.16.167:1022)
[2012-11-26 22:00:17.60962] I [server.c:438:server_rpc_notify] 
0-data-server: disconnected connection from 192.168.16.167:1022
[2012-11-26 22:00:32.950134] W [socket.c:204:__socket_rwv] 
0-tcp.data-server: readv failed (Connection reset by peer)
[2012-11-26 22:00:32.950178] W [socket.c:775:__socket_read_simple_msg] 
0-tcp.data-server: reading from socket failed. Error (Connection reset 
by peer), peer (192.168.16.167:999)
[2012-11-26 22:00:32.962622] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120910/TREW/TREW-Band-15-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00002.tar
[2012-11-26 22:00:32.962642] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120911_Band_13.log
[2012-11-26 22:00:32.962792] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120910/TREW/TREW-Band-05-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00002.tar
[2012-11-26 22:00:32.962809] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120909/TREW/TREW-Band-09-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120909_00004.tar
[2012-11-26 22:00:32.962893] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120908_Band_07.log
[2012-11-26 22:00:32.962907] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120908_Band_02.log
[2012-11-26 22:00:32.962945] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on /HGK/L2/inputdata/sacspecTOTL08_758.dat
[2012-11-26 22:00:32.962970] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120908/TREW/TREW-Band-04-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120908_00002.tar
[2012-11-26 22:00:32.962985] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120908/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120908_00004.tar
[2012-11-26 22:00:32.963011] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120907/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120907_00004.tar
[2012-11-26 22:00:32.971352] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on /HGK/L2/inputdata/sacspecTOTL08_758.dat
[2012-11-26 22:00:32.971367] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120909_Band_07.log
[2012-11-26 22:00:32.971393] I [server.c:438:server_rpc_notify] 
0-data-server: disconnected connection from 192.168.16.167:999
[2012-11-26 22:00:32.971418] I 
[server-helpers.c:783:server_connection_destroy] 0-data-server: 
destroyed connection of hc10-17275-2012/11/24-20:57:47:169834-data-client-2


##### BRICK (data-client-11):
[2012-11-26 22:00:17.22727] W [socket.c:204:__socket_rwv] 
0-tcp.data-server: readv failed (Connection reset by peer)
[2012-11-26 22:00:17.56177] W 
[socket.c:1494:__socket_proto_state_machine] 0-tcp.data-server: reading 
from socket failed. Error (Connection reset by peer), peer 
(192.168.16.167:1000)
[2012-11-26 22:00:17.66354] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120911_Band_06.log
[2012-11-26 22:00:17.66377] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120909/TREW/TREW-Band-04-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120909_00002.tar
[2012-11-26 22:00:17.66496] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120911/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120911_00004.tar
[2012-11-26 22:00:17.66528] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/20120910/TREW/TREW-Band-06-SDR/TREW-SDR-Geo/TREW_TREW-SDR-Geo_20120910_00004.tar
[2012-11-26 22:00:17.66542] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120910_Band_14.log
[2012-11-26 22:00:17.66558] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120908_Band_10.log
[2012-11-26 22:00:17.66574] I [server.c:438:server_rpc_notify] 
0-data-server: disconnected connection from 192.168.16.167:1000
[2012-11-26 22:00:17.66973] I 
[server-helpers.c:783:server_connection_destroy] 0-data-server: 
destroyed connection of hc10-17275-2012/11/24-20:57:47:169834-data-client-11
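
For the brick side, a similar sketch can group the "fd cleanup" entries by the connection that was destroyed afterwards, which shows how many open files each teardown affected. The helper names (records, fds_per_connection) are hypothetical, the patterns come only from the brick log lines in this post, and the grouping assumes that every fd cleanup logged before a "destroyed connection" line belongs to that connection.

```python
import re

# A record starts with a bracketed timestamp; wrapped mail lines continue it.
RECORD_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\]")
# Patterns inferred from the 3.2.5 brick log lines in this post only.
FD_CLEANUP = re.compile(r"do_fd_cleanup\] \S+ fd cleanup on (\S+)")
DESTROYED = re.compile(
    r"server_connection_destroy\] \S+ destroyed connection of (\S+)"
)


def records(lines):
    """Yield one string per log record, re-joining wrapped lines."""
    current = None
    for line in lines:
        line = line.strip()
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None and line:
            current += " " + line
    if current is not None:
        yield current


def fds_per_connection(lines):
    """Map each destroyed connection id to the paths cleaned up before it.

    Assumption: cleanups logged before a destroy line belong to it.
    """
    result, pending = {}, []
    for rec in records(lines):
        m = FD_CLEANUP.search(rec)
        if m:
            pending.append(m.group(1))
            continue
        m = DESTROYED.search(rec)
        if m:
            result.setdefault(m.group(1), []).extend(pending)
            pending = []
    return result


# Records copied from the data-client-11 brick log above.
SAMPLE = """
[2012-11-26 22:00:17.66542] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120910_Band_14.log
[2012-11-26 22:00:17.66558] I [server-helpers.c:485:do_fd_cleanup] 
0-data-server: fd cleanup on 
/TREW/download_logs/wget_Date_20120908_Band_10.log
[2012-11-26 22:00:17.66574] I [server.c:438:server_rpc_notify] 
0-data-server: disconnected connection from 192.168.16.167:1000
[2012-11-26 22:00:17.66973] I 
[server-helpers.c:783:server_connection_destroy] 0-data-server: 
destroyed connection of hc10-17275-2012/11/24-20:57:47:169834-data-client-11
"""

if __name__ == "__main__":
    for conn, paths in fds_per_connection(SAMPLE.splitlines()).items():
        print(f"{conn}: {len(paths)} open fds cleaned up")
```

Run against the full brick logs, this would give one entry per destroyed connection, which makes it easy to see which downloads were in flight when the client went away.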




