[Gluster-devel] 2 out of 4 bonnies failed :-((
Sascha Ottolski
ottolski at web.de
Mon Jan 7 08:57:29 UTC 2008
Hi,
I found a somewhat frustrating test result after the weekend. I started a
bonnie on four different clients (so a total of four bonnies in parallel). I
have two servers, each with two partitions, which are unified and AFR'd "over
cross", so each server holds one brick plus a mirror of the other server's
brick, using tla patch-628.
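To make the "over cross" layout clearer, the client-side volume spec looks roughly like this (not my exact volfile; the names fsc1/afr1/afrns are taken from the log below, the host and brick names are only illustrative):

```
volume fsc1
  type protocol/client
  option transport-type tcp/client
  option remote-host server1
  option remote-subvolume brick1
end-volume

volume fsc2
  type protocol/client
  option transport-type tcp/client
  option remote-host server2
  option remote-subvolume brick1-mirror
end-volume

volume afr1
  type cluster/afr
  subvolumes fsc1 fsc2
end-volume

# afr2 is defined analogously, mirrored the other way around,
# and afrns is an AFR pair over the namespace bricks (ns1/ns2)

volume unify0
  type cluster/unify
  option namespace afrns
  option scheduler rr
  subvolumes afr1 afr2
end-volume
```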
For one, the results seem to be not too promising, as it took more than 48
hours to complete. Running a bonnie on only one client took "only" about 12
hours (unfortunately, I don't have exact numbers for the runtime).
But even worse, two of the bonnies didn't finish at all. The first client
dropped out after approx. 8 hours, claiming "Can't open
file ./Bonnie.17791.001". However, the file is (partly) there, also on the
AFR mirror, but with different sizes. The log suggests that it was a timeout
problem (if I interpret it correctly):
2008-01-06 03:48:10 E [afr.c:3364:afr_close_setxattr_cbk] afr1:
(path=/Bonnie.17791.027 child=fsc1) op_ret=-1 op_errno=28
2008-01-06 03:50:34 W [client-protocol.c:209:call_bail] ns1: activating
bail-out. pending frames = 1. last sent = 2008-01-06 03:48:17
. last received = 2008-01-06 03:48:17 transport-timeout = 108
2008-01-06 03:50:34 C [client-protocol.c:217:call_bail] ns1: bailing transport
2008-01-06 03:50:34 W [client-protocol.c:4490:client_protocol_cleanup] ns1:
cleaning up state in transport object 0x522e40
2008-01-06 03:50:34 E [client-protocol.c:4542:client_protocol_cleanup] ns1:
forced unwinding frame type(1) op(5) reply=@0x2aaaab407a00
2008-01-06 03:50:34 E [afr.c:2573:afr_selfheal_lock_cbk] afrns:
(path=/Bonnie.17791.001 child=ns1) op_ret=-1 op_errno=107
2008-01-06 03:50:34 E [afr.c:2744:afr_open] afrns: self heal failed, returning
EIO
2008-01-06 03:50:34 C [tcp.c:81:tcp_disconnect] ns1: connection disconnected
2008-01-06 03:51:00 E [afr.c:1907:afr_selfheal_sync_file_writev_cbk] afr1:
(path=/Bonnie.17791.001 child=fsc1) op_ret=-1 op_errno=28
2008-01-06 03:51:00 E [afr.c:1693:afr_error_during_sync] afr1: error during
self-heal
2008-01-06 03:51:03 E [afr.c:2744:afr_open] afr1: self heal failed, returning
EIO
2008-01-06 03:51:03 E [fuse-bridge.c:670:fuse_fd_cbk] glusterfs-fuse:
12276158: /Bonnie.17791.001 => -1 (5)
2008-01-07 04:40:17 E [fuse-bridge.c:431:fuse_entry_cbk] glusterfs-fuse:
15841600: /Bonnie.26672.026 => -1 (2)
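For reference, the op_errno and FUSE return codes above can be decoded with Python's errno module (assuming Linux errno numbering, which is what my servers use): 28 is ENOSPC, 107 is ENOTCONN, and the FUSE replies -1 (5) and -1 (2) are EIO and ENOENT.

```python
import errno
import os

# errno values that appear in the glusterfs log above
# (assumption: Linux numbering, where 107 = ENOTCONN)
for code in (28, 107, 5, 2):
    name = errno.errorcode.get(code, "?")
    print(code, name, os.strerror(code))
```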
The second had a problem creating / removing a directory:
Create files in sequential order...Can't make directory ./Bonnie.26672
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): No such file or directory
On this client, nothing was found in the logs. For both cases, there is nothing
in the server logs either (neither servers nor clients had a special debug
level enabled).
Now, the million dollar question is: how would I debug this situation,
preferably a bit quicker than in 48 hours...
Thanks,
Sascha