[Gluster-users] Error report: glusterfs2.0rc4 abend -- "readv failed (Bad address) "

Thu Mar 19 08:35:58 UTC 2009

Hi Anand,

I found two core dumps in / and ran a backtrace on each.  In both cases, the
crash site is the abort() call at fuse-bridge.c:2583:

2583                                    ERR_ABORT (buf->data);

# gdb /usr/sbin/glusterfs
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) core core.28543
warning: exec file is newer than core file.
warning: Can't read pathname for load map: Input/output error.
//snip

Loaded symbols for /lib/libgcc_s.so.1
Core was generated by 
`/usr/sbin/glusterfs --log-level=WARNING --volfile=/etc/glusterfs/glusterfs.vol'.
Program terminated with signal 6, Aborted.
#0  0x003ef402 in __kernel_vsyscall ()
(gdb) bt
#0  0x003ef402 in __kernel_vsyscall ()
#1  0x00b84c00 in raise () from /lib/libc.so.6
#2  0x00b86451 in abort () from /lib/libc.so.6
#3  0x001189a8 in fuse_thread_proc (data=0x87815e8) at fuse-bridge.c:2583
#4  0x00d302db in start_thread () from /lib/libpthread.so.0
#5  0x00c2912e in clone () from /lib/libc.so.6
(gdb) up
#1  0x00b84c00 in raise () from /lib/libc.so.6
(gdb) up
#2  0x00b86451 in abort () from /lib/libc.so.6
(gdb) up
#3  0x001189a8 in fuse_thread_proc (data=0x87815e8) at fuse-bridge.c:2583
2583                                    ERR_ABORT (buf->data);
(gdb) list
2578                                    if (buf->data) {
2579                                            FREE (buf->data);
2580                                            buf->data = NULL;
2581                                    }
2582                                    buf->data = CALLOC (1, res);
2583                                    ERR_ABORT (buf->data);
2584                                    buf->len = res;
2585                            }
2586                            memcpy (buf->data, recvbuf, res); // evil evil
2587
(gdb) list 2560
2555
2556                    res = fuse_chan_receive (priv->ch,
2557                                             recvbuf,
2558                                             chan_size);
2559
2560                    if (priv->first_call) {
2561                            fuse_root_lookup (this);
2562                    }
2563
2564                    if (res == -1) {
(gdb)
2565                            if (errno != EINTR) {
2566                                    gf_log ("glusterfs-fuse", GF_LOG_ERROR,
2567                                            "fuse_chan_receive() returned -1 (%d)", errno);
2568                            }
2569                            if (errno == ENODEV)
2570                                    break;
2571                            continue;
2572                    }
2573
2574                    buf = priv->buf;
(gdb)
2575
2576                    if (res && res != -1) {
2577                            if (buf->len < (res)) {
2578                                    if (buf->data) {
2579                                            FREE (buf->data);
2580                                            buf->data = NULL;
2581                                    }
2582                                    buf->data = CALLOC (1, res);
2583                                    ERR_ABORT (buf->data);
2584                                    buf->len = res;
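
The listing above treats only res == -1 as a failure.  As a minimal sketch
of why that matters, assume (this thread does not establish it) that
fuse_chan_receive() can return some other negative value, and that buf->len
is an unsigned size.  Such a value would pass the "res && res != -1" test,
convert to an enormous allocation size, and make CALLOC return NULL, which
is the condition ERR_ABORT reacts to.  The struct and function names below
are hypothetical stand-ins, not GlusterFS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in for the buffer used in fuse_thread_proc().
 * The field types are assumptions, not taken from the GlusterFS source. */
struct recv_buf {
        char   *data;
        size_t  len;   /* assumed to be unsigned */
};

/* Hedged guard: treat every negative return value as a failure,
 * not only -1. */
static int
receive_failed (ssize_t res)
{
        return res < 0;
}

/* Mirrors the logic listed at fuse-bridge.c:2576-2584, but returns -1
 * where the listing calls ERR_ABORT, so the failure can be observed. */
static int
grow_like_listing (struct recv_buf *buf, ssize_t res)
{
        assert (buf != NULL);

        if (res && res != -1) {
                /* A negative res converts to a huge size_t here. */
                if (buf->len < (size_t) res) {
                        free (buf->data);
                        buf->data = calloc (1, (size_t) res);
                        if (buf->data == NULL) {
                                buf->len = 0;
                                return -1; /* the listing aborts here */
                        }
                        buf->len = (size_t) res;
                }
        }
        return 0;
}
```

With these definitions, grow_like_listing (&buf, 4096) succeeds, while
grow_like_listing (&buf, -14) fails at the allocation.  The value -14 is
only an illustration (14 is EFAULT, "Bad address", on Linux).  Checking
receive_failed (res) before touching the buffer would skip that path.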

On Wednesday 18 March 2009 16:18:29 Anand Avati wrote:
> do you have a core dump which can be inspected? do you still see the
> error after taking the 1.3 client off?
>
> Avati
>
> On Sat, Mar 14, 2009 at 10:42 PM, Andrew McGill <list2008 at lunch.za.net> wrote:
> > Hello,
> >
> > I upgraded from glusterfs-1.3.12.tar.gz to glusterfs-2.0.0rc4.tar.gz
> > because I could not complete an rdiff-backup without inexplicable errors.
> > My efforts have been rewarded with a crash, for which some logs are
> > displayed below.
> >
> > The backend is unify, with multiple afr subvolumes of two ext3 volumes
> > each.
> >
> > Here is how the client side of glusterfs died:
> >
> > 2009-03-14 13:32:15 W [afr-self-heal-data.c:798:afr_sh_data_fix] afr4:
> > Picking favorite child u100-rs1 as authentic source to resolve
> > conflicting data of
> > /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz
> > 2009-03-14 13:32:15 W [afr-self-heal-data.c:646:afr_sh_data_open_cbk]
> > afr4: sourcing file
> > /backup5/robbie.foo.co.za/rdiff-backup-data/mirror_metadata.2009-03-14T08:00:17+02:00.snapshot.gz
> > from u100-rs1 to other sinks
> > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dcc1: readv failed
> > (Bad address)
> > 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine]
> > u50-dcc1: read (Bad address) in state 3 (192.168.227.65:6996)
> > 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dcc1:
> > forced unwinding frame type(1) op(READ)
> > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] u50-dr1: readv failed
> > (Bad address)
> > 2009-03-14 13:32:22 E [socket.c:634:__socket_proto_state_machine]
> > u50-dr1: read (Bad address) in state 3 (192.168.227.31:6996)
> > 2009-03-14 13:32:22 E [saved-frames.c:169:saved_frames_unwind] u50-dr1:
> > forced unwinding frame type(1) op(READ)
> > 2009-03-14 13:32:22 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> > 5998294: READ => -1 (Transport endpoint is not connected)
> > 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-dr2: readv failed
> > (Bad address)
> > 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine]
> > u50-dr2: read (Bad address) in state 3 (192.168.227.32:6996)
> > 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-dr2:
> > forced unwinding frame type(1) op(READ)
> > 2009-03-14 13:33:03 E [socket.c:102:__socket_rwv] u50-rs3: readv failed
> > (Bad address)
> > 2009-03-14 13:33:03 E [socket.c:634:__socket_proto_state_machine]
> > u50-rs3: read (Bad address) in state 3 (192.168.227.59:6996)
> > 2009-03-14 13:33:03 E [saved-frames.c:169:saved_frames_unwind] u50-rs3:
> > forced unwinding frame type(1) op(READ)
> > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> > 6006118: READ => -1 (Transport endpoint is not connected)
> > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> > 6006119: READ => -1 (Transport endpoint is not connected)
> > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> > 6006120: READ => -1 (Transport endpoint is not connected)
> > 2009-03-14 13:33:03 E [fuse-bridge.c:1548:fuse_readv_cbk] glusterfs-fuse:
> > 6006121: READ => -1 (Transport endpoint is not connected)
> > pending frames:
> > patchset: cb602a1d7d41587c24379cb2636961ab91446f86 +
> > signal received: 6
> > configuration details:argp 1
> > backtrace 1
> > dlfcn 1
> > fdatasync 1
> > libpthread 1
> > llistxattr 1
> > setfsid 1
> > spinlock 1
> > epoll.h 1
> > xattr.h 1
> > st_atim.tv_nsec 1
> > package-string: glusterfs 2.0.0rc4
> > [0x381420]
> > /lib/libc.so.6(abort+0x101)[0xb86451]
> > /usr/lib/glusterfs/2.0.0rc4/xlator/mount/fuse.so[0x54b9a8]
> > /lib/libpthread.so.0[0xd302db]
> > /lib/libc.so.6(clone+0x5e)[0xc2912e]
> > ---------
> >
> >
> > On the server side, the following messages don't enlighten me, but do
> > remind me that there was another client, still running version 1.3,
> > connecting.  It looks like the server just noticed that the client died.
> >
> > 2009-03-14 13:30:03 E [socket.c:583:__socket_proto_state_machine] server:
> > socket header validate failed (192.168.227.167:1023). possible mismatch
> > of transport-type between server and client volumes, or version mismatch
> > 2009-03-14 13:30:03 N [server-protocol.c:8048:notify] server:
> > 192.168.227.167:1023 disconnected
> > 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header]
> > server: socket header signature does not match :O (42.6c.6f)
> > 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server:
> > socket header validate failed (192.168.227.167:1023). possible mismatch
> > of transport-type between server and client volumes, or version mismatch
> > 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server:
> > 192.168.227.167:1023 disconnected
> > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed
> > (Connection reset by peer)
> > 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server:
> > read (Connection reset by peer) in state 1 (192.168.227.5:1020)
> > 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server:
> > 192.168.227.5:1020 disconnected
> > 2009-03-14 13:32:22 N [server-protocol.c:7295:mop_setvolume] server:
> > accepted client from 192.168.227.5:1020
> > 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server:
> > 192.168.227.5:1017 disconnected
> > 2009-03-14 13:35:48 N [server-protocol.c:8048:notify] server:
> > 192.168.227.5:1020 disconnected
> > 2009-03-14 13:35:48 N [server-helpers.c:515:server_connection_destroy]
> > server: destroyed connection of
> > backup5.foo.com-23205-2009/03/14-07:10:52:777008-u50-dcc1
> >
> >
> > On another server brick, the 25GB volume u50-dr1-raw was full (it should
> > have been 50GB like its peer).  As I recall, the free space of the second
> > volume in an AFR pair does not get checked (a bug, IMHO).
> >
> > That brick logged the following, which could have led to the client-side
> > failure a few minutes later (the clocks are in sync):
> >
> > 2009-03-14 13:30:23 W [posix.c:773:posix_mkdir] u50-dr1-raw: mkdir
> > of /backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1: No
> > space left on device
> > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> > 1109657: INODELK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> > 1109658: INODELK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184942: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 for key <nul> returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184943: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 for key <nul> returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184947: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 for key hl returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184949: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 for key hl returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> > 1109660: INODELK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3478:server_stub_resume] server:
> > 1109661: INODELK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184952: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl)
> > on u50-dr1 for key <nul> returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:3448:server_stub_resume] server:
> > 3184953: ENTRYLK
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1/hl)
> > on u50-dr1 for key <nul> returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:2774:server_stub_resume] server:
> > 1109663: XATTROP
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:30:23 E [server-protocol.c:2868:server_stub_resume] server:
> > 1109665: RMDIR
> > (/backup5/bumblebee.foo.co.za/rdiff-backup-data/rdiff-backup.tmp.1)
> > on u50-dr1 returning error: -1 (2)
> > 2009-03-14 13:31:45 E [socket.c:463:__socket_proto_validate_header]
> > server: socket header signature does not match :O (42.6c.6f)
> > 2009-03-14 13:31:45 E [socket.c:583:__socket_proto_state_machine] server:
> > socket header validate failed (192.168.227.167:1022). possible mismatch
> > of transport-type between server and client volumes, or version mismatch
> > 2009-03-14 13:31:45 N [server-protocol.c:8048:notify] server:
> > 192.168.227.167:1022 disconnected
> > 2009-03-14 13:32:22 E [socket.c:102:__socket_rwv] server: readv failed
> > (Connection reset by peer)
> > 2009-03-14 13:32:22 E [socket.c:561:__socket_proto_state_machine] server:
> > read (Connection reset by peer) in state 1 (192.168.227.5:1016)
> > 2009-03-14 13:32:22 N [server-protocol.c:8048:notify] server:
> > 192.168.227.5:1016 disconnected
> > 2009-03-14 13:32:23 N [server-protocol.c:7295:mop_setvolume] server:
> > accepted client from 192.168.227.5:1016
> >
> >
> > I may have to move the backup in question off glusterfs (if I can just
> > find the space somewhere), since it has taken 4 days to realise that the
> > backing up is not just slow, but faulty.  (Of course, if I can't fix it,
> > I'll win a trip to the data center to install a new machine to replace
> > the system.)