[Gluster-devel] Gluster 3.x hangs
nicolas prochazka
prochazka.nicolas at gmail.com
Mon Mar 29 10:33:24 UTC 2010
Hello,
Some weeks ago I sent a report saying that glusterfs 3.x reboots our
systems when we test HA by deactivating a network interface
(ifconfig eth0 down).
You could not reproduce it on your systems.
The reboot is caused by hung_task_panic and hung_task_timeout_secs:
when a task stays blocked for 120 s, the Linux kernel panics.
To see the hang without the reboot, set hung_task_panic to 0, or set
hung_task_timeout_secs to more than 600 to allow some time.
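
A minimal sketch for checking these two settings before running the test.
It assumes the usual procfs location /proc/sys/kernel; the function names
are hypothetical, and treating any non-zero hung_task_panic as "panic" is
an assumption made for illustration.

```python
from pathlib import Path


def read_hung_task_settings(sysctl_dir="/proc/sys/kernel"):
    """Return the hung-task sysctls as ints, or None where unreadable."""
    settings = {}
    for name in ("hung_task_panic", "hung_task_timeout_secs"):
        try:
            settings[name] = int((Path(sysctl_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            settings[name] = None
    return settings


def panics_on_hang(settings):
    """True when a task blocked past the timeout would panic the kernel."""
    panic = settings.get("hung_task_panic")
    timeout = settings.get("hung_task_timeout_secs")
    # A timeout of 0 disables the hung-task check entirely.
    return panic not in (None, 0) and timeout not in (None, 0)


if __name__ == "__main__":
    current = read_hung_task_settings()
    print(current, "panic on hang:", panics_on_hang(current))
```

If panics_on_hang() reports True, the machine is likely to reboot during
step 6 below instead of showing the hang.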
To reproduce:
1 - two machines, each running server and client, in replicate mode
2 - the first server, 10.98.98.1, is the configuration server
3 - run gluster on both servers as:
    /usr/local/sbin/glusterfsd --log-level=DEBUG \
        --log-file=/tmpsafe/server.log \
        -N -f /etc/glusterfs/glusterfs-server.vol
    /usr/local/sbin/glusterfs --log-level=DEBUG \
        --log-file=/tmpsafe/client.log \
        -N -s 10.98.98.1 /mnt/vdisk/
4 - now on 10.98.98.1, run ifconfig eth0 down
5 - on 10.98.98.10, after a short timeout, ls /mnt/vdisk responds again
    (using 10.98.98.10 as server)
6 - on 10.98.98.1, ls /mnt/vdisk hangs forever
7 - on 10.98.98.1, kill the glusterfs client and rerun glusterfs; ls
    /mnt/vdisk then works again (using 10.98.98.1 as server)
During step 6, nothing is logged by the server or the client on
10.98.98.1. See the logs below.
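
The difference between steps 5 and 6 (a slow answer versus a hang) can be
checked with a small probe. This is only a sketch: probe_command is a
hypothetical helper, and the 60 second limit is an assumed value chosen to
be longer than the 42 second ping timeout seen in the logs.

```python
import subprocess
import time


def probe_command(cmd, timeout_s):
    """Run cmd and report 'ok', 'error', or 'hang' after timeout_s seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(0.05)
    # Best effort only: a process blocked on a dead mount may not exit
    # until the mount answers, so we deliberately do not wait for it.
    proc.kill()
    return "hang"


if __name__ == "__main__":
    print(probe_command(["ls", "/mnt/vdisk"], 60))
```

Run on both machines after step 4, the expectation from the report is
that 10.98.98.10 eventually returns "ok" while 10.98.98.1 returns "hang".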
Regards,
Nicolas Prochazka.
-----------------------------------------------
#This file is auto generated, not edit ( Nicolas Prochazka Sep 2009)
# ------------- Create Brick blade definition
volume 10.98.98.1
type protocol/client
option transport-type tcp/client
option remote-host 10.98.98.1
option transport.socket.nodelay on
option remote-subvolume brick
end-volume
volume 10.98.98.10
type protocol/client
option transport-type tcp/client
option remote-host 10.98.98.10
option transport.socket.nodelay on
option remote-subvolume brick
end-volume
# ------------- Create Brick Replicate definition
# ------------- Create Distribute definition
volume last
type cluster/distribute
subvolumes 10.98.98.1 10.98.98.10
end-volume
volume iothreads
type performance/io-threads
option thread-count 8
subvolumes last
end-volume
volume io-cache
type performance/io-cache
option cache-size 2GB # default is 32MB
option cache-timeout 5 # default is 1
subvolumes iothreads
end-volume
volume writebehind
type performance/write-behind
option cache-size 4MB
subvolumes io-cache
end-volume
DEV-10.98.98.1:~# cat /etc/glusterfs/glusterfs-server.vol
volume brickless
type storage/posix
option directory /mnt/disks/export
end-volume
volume brickthread
type features/locks
subvolumes brickless
end-volume
volume brickcache
type performance/io-cache
option cache-size 2GB # default is 32MB
option cache-timeout 2 # default is 1
subvolumes brickthread
end-volume
volume brick
type performance/io-threads
option thread-count 8
subvolumes brickcache
end-volume
volume server
type protocol/server
subvolumes brick
option client-volume-filename /etc/glusterfs/Gglusterfs-client.vol
option transport-type tcp
option transport.socket.nodelay on
option verify-volfile-checksum no
option auth.addr.brick.allow 10.98.98.*
end-volume
Log of the client on 10.98.98.10; everything seems to be OK:
[2010-03-29 12:48:04] E [client-protocol.c:415:client_ping_timer_expired]
10.98.98.1: Server 10.98.98.1:6996 has not responded in the last 42 seconds,
disconnecting.
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 12:48:04] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.1:
forced unwinding frame type(2) op(PING)
[2010-03-29 12:48:04] D [client-protocol.c:537:client_ping_cbk] 10.98.98.1:
timer must have expired
[2010-03-29 12:48:04] N [client-protocol.c:6994:notify] 10.98.98.1:
disconnected
[2010-03-29 12:48:06] E [socket.c:762:socket_connect_finish] 10.98.98.1:
connection to 10.98.98.1:6996 failed (No route to host)
[2010-03-29 12:48:09] E [socket.c:762:socket_connect_finish] 10.98.98.1:
connection to 10.98.98.1:6996 failed (No route to host)
Log on 10.98.98.1:
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on
subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is:
15069396992
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk]
10.98.98.1: Connected to 10.98.98.1:6996, attached to remote volume 'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk]
10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume
'brick'.
[2010-03-29 16:30:17] N [client-protocol.c:6246:client_setvolume_cbk]
10.98.98.10: Connected to 10.98.98.10:6996, attached to remote volume
'brick'.
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on
subvolume '10.98.98.1': avail_percent is: 99.00 and avail_space is:
15069396992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on
subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is:
88316628992
[2010-03-29 16:30:17] D [dht-diskusage.c:71:dht_du_info_cbk] last: on
subvolume '10.98.98.10': avail_percent is: 99.00 and avail_space is:
88316628992
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found
anomalies in /iso. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing
assignment on /iso
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found
anomalies in /ha. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing
assignment on /ha
[2010-03-29 16:30:21] D [dht-layout.c:576:dht_layout_normalize] last: found
anomalies in /monitoring. holes=1 overlaps=0
[2010-03-29 16:30:21] D [dht-common.c:164:dht_lookup_dir_cbk] last: fixing
assignment on /monitoring
Nothing is logged during the hang.
At restart:
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(LOOKUP)
[2010-03-29 16:58:26] D [socket.c:1326:socket_submit] 10.98.98.10: not
connected (priv->connected = 255)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume
10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] D [dht-common.c:1590:dht_fd_cbk] last: subvolume
10.98.98.10 returned -1 (Transport endpoint is not connected)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(1) op(STATFS)
[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
forced unwinding frame type(2) op(PING)
[2010-03-29 16:58:26] D [client-protocol.c:537:client_ping_cbk] 10.98.98.10:
timer must have expired
[2010-03-29 16:58:29] E [socket.c:762:socket_connect_finish] 10.98.98.10:
connection to 10.98.98.10:6996 failed (No route to host)
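
To summarise a long run of "forced unwinding" entries like the one above,
the log can be counted per subvolume and operation. A minimal sketch,
assuming the log format shown in this message (including entries wrapped
over two lines); count_unwound_frames is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g.
#   ... E [saved-frames.c:165:saved_frames_unwind] 10.98.98.10:
#   forced unwinding frame type(1) op(STATFS)
# The \s+ parts tolerate the line wrap between subvolume and message.
UNWIND_RE = re.compile(
    r"saved_frames_unwind\]\s+(\S+?):\s+"
    r"forced unwinding frame type\((\d+)\)\s+op\((\w+)\)"
)


def count_unwound_frames(log_text):
    """Count forced-unwound frames keyed by (subvolume, operation)."""
    return Counter(
        (m.group(1), m.group(3)) for m in UNWIND_RE.finditer(log_text)
    )


if __name__ == "__main__":
    sample = (
        "[2010-03-29 16:58:26] E [saved-frames.c:165:saved_frames_unwind] "
        "10.98.98.10:\nforced unwinding frame type(1) op(LOOKUP)\n"
    )
    for (subvol, op), n in sorted(count_unwound_frames(sample).items()):
        print(subvol, op, n)
```

Feeding it the whole client log gives one line per subvolume and
operation instead of dozens of repeated entries.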