[Gluster-devel] brick stop responding (3.4.0qa8)

Anand Avati anand.avati at gmail.com
Mon Feb 4 16:28:33 UTC 2013


Now you are showing a segfault backtrace from glusterd, while the previous
one (the "single thread" backtrace) was from glusterfsd. Are both having
problems?

Avati

On Mon, Feb 4, 2013 at 1:34 AM, Emmanuel Dreyfus <manu at netbsd.org> wrote:

> On Sun, Feb 03, 2013 at 02:13:40PM -0800, Anand Avati wrote:
> > Yeah, a lot of threads are "missing"! Do the logs have anything unusual?
>
> I understand the threads are there, but the process is corrupted enough
> that gdb cannot figure them out. A kernel trace shows 15 threads running,
> and they do not terminate before the process stops responding.
>
> Running with Electric Fence leads to an early crash. I am not sure
> it is related, but it is probably worth a fix:
>
> Program terminated with signal 11, Segmentation fault.
> #0  slotForUserAddress (address=0x7f7ff2bafff4) at efence.c:648
> 648     efence.c: No such file or directory.
>         in efence.c
> (gdb) bt
> #0  slotForUserAddress (address=0x7f7ff2bafff4) at efence.c:648
> #1  free (address=0x7f7ff2bafff4) at efence.c:713
> #2  0x00007f7ff744a31a in runner_end (runner=0x7f7ff1bfeef0) at run.c:370
> #3  0x00007f7ff744ac9e in runner_run_generic (runner=0x7f7ff1bfeef0,
>     rfin=0x7f7ff744a2f6 <runner_end>) at run.c:386
> #4  0x00007f7ff343ddb9 in glusterd_volume_start_glusterfs (
>     volinfo=0x7f7ff370fae0, brickinfo=0x7f7ff37407a8, wait=_gf_true)
>     at glusterd-utils.c:1337
> #5  0x00007f7ff344366b in glusterd_brick_start (volinfo=0x7f7ff370fae0,
>     brickinfo=0x7f7ff37407a8, wait=_gf_true) at glusterd-utils.c:3961
> #6  0x00007f7ff3447328 in glusterd_restart_bricks (conf=0x7f7ff73f2a98)
>     at glusterd-utils.c:3991
> #7  0x00007f7ff743c9c4 in synctask_wrap (old_task=<optimized out>)
>     at syncop.c:129
> #8  0x00007f7ff5e580a0 in swapcontext () from /usr/lib/libc.so.12
>
> In runner_end():
> 368             if (runner->argv) {
> 369                     for (p = runner->argv; *p; p++)
> 370                             GF_FREE (*p);
>
> Inspection of runner->argv shows it is not NULL-terminated. Electric Fence
> with EF_PROTECT_BELOW causes us to crash here:
> 0x7f7ff2ba7e00: 0xf2babfe4      0x00007f7f      0xf2badffc      0x00007f7f
> 0x7f7ff2ba7e10: 0xf2bafff4      0x00007f7f      0xf2bb1ff0      0x00007f7f
> 0x7f7ff2ba7e20: 0xf2bb3fe4      0x00007f7f      0xf2bb5ffc      0x00007f7f
> 0x7f7ff2ba7e30: 0xf2bb7fc4      0x00007f7f      0xf2bb9ffc      0x00007f7f
> 0x7f7ff2ba7e40: 0xf2bbbfcc      0x00007f7f      0xf2bbdff0      0x00007f7f
> 0x7f7ff2ba7e50: 0xf2bbfff0      0x00007f7f      0xf2bc1ffc      0x00007f7f
> 0x7f7ff2ba7e60: 0xf2bc3fcc      0x00007f7f      0xf2bc5ff0      0x00007f7f
> 0x7f7ff2ba7e70: 0xf2bc7fc4      0x00007f7f      0xf2bc9ff0      0x00007f7f
> 0x7f7ff2ba7e80: 0xf2bcbff8      0x00007f7f      0xf2bcdff0      0x00007f7f
> 0x7f7ff2ba7e90: 0xf2bcffe0      0x00007f7f      Cannot access memory at
> address 0x7f7ff2ba7e98
>
> In case it helps, the last element:
> (gdb) x/1s 0x00007f7ff2bcffe0
> 0x7f7ff2bcffe0:  "gfs33-server.listen-port=49152"
>
> It is indeed the last one reported by ps:
> PID TTY   STAT    TIME COMMAND
> 626 ?     Ssl  0:00.34 /usr/local/sbin/glusterfsd -s localhost
> --volfile-id gfs33.hotstuff.export-wd1a -p
> /var/lib/glusterd/vols/gfs33/run/hotstuff-export-wd1a.pid -S
> /var/run/66fedc0377e53ff9d6523d0802a230d1.socket --brick-name /export/wd1a
> -l /usr/local/var/log/glusterfs/bricks/export-wd1a.log --xlator-option
> *-posix.glusterd-uuid=2dbb8fc1-c2ab-4992-b080-fdc8556d1e34 --brick-port
> 49152 --xlator-option gfs33-server.listen-port=49152
>
>
> --
> Emmanuel Dreyfus
> manu at netbsd.org
>

