[Bugs] [Bug 1259511] Rebalance crashes

Thu Oct 8 13:19:27 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1259511

Carlos O'Donell <codonell at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |codonell at redhat.com

--- Comment #8 from Carlos O'Donell <codonell at redhat.com> ---
(In reply to Susant Kumar Palai from comment #7)
> Hi Vitaliy,
>   Sorry for the delayed response. This looks like a crash in libc. I will
> get back on this after consulting someone from the libc team.

(In reply to Vitaliy Margolen from comment #6)
> Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id
> rebalance/gv1 --xlator-option *dh'.
> Program terminated with signal SIGILL, Illegal instruction.
> (gdb) bt
> #0  0x00007fd2396d012b in __lll_lock_elision () from /lib64/libpthread.so.0
> #1  0x00007fd23a881eb9 in inode_ref (inode=0x7fd198ff9700) at inode.c:545
> #2  0x00007fd23a85fb0a in loc_copy (dst=dst@entry=0x7fd23405e074,
> src=src@entry=0x7fd22d69bec0) at xlator.c:854
> #3  0x00007fd23478e11b in dht_local_init (frame=frame@entry=0x7fd238348270,
> loc=loc@entry=0x7fd22d69bec0, fd=fd@entry=0x0, fop=fop@entry=GF_FOP_LOOKUP)
> at dht-helper.c:484
> #4  0x00007fd2347c0f81 in dht_lookup (frame=0x7fd238348270,
> this=0x7fd23000dd20, loc=0x7fd22d69bec0, xattr_req=0x0) at dht-common.c:2146
> #5  0x00007fd23a8a6042 in syncop_lookup (subvol=subvol@entry=0x7fd23000dd20,
> loc=loc@entry=0x7fd22d69bec0, iatt=iatt@entry=0x7fd22d69bbd0,
> parent=parent@entry=0x0, xdata_in=xdata_in@entry=0x0,
>     xdata_out=xdata_out@entry=0x0) at syncop.c:1227
> #6  0x00007fd234797657 in gf_defrag_fix_layout
> (this=this@entry=0x7fd23000dd20, defrag=defrag@entry=0x7fd230034ae0,
> loc=loc@entry=0x7fd22d69bec0, fix_layout=fix_layout@entry=0x7fd23ab2abe8,
>     migrate_data=migrate_data@entry=0x7fd23ab2aad0) at dht-rebalance.c:2427
> #7  0x00007fd234798953 in gf_defrag_start_crawl (data=0x7fd23000dd20) at
> dht-rebalance.c:2784
> #8  0x00007fd23a8a28b2 in synctask_wrap (old_task=<optimized out>) at
> syncop.c:380
> #9  0x00007fd238f3afd0 in ?? () from /lib64/libc.so.6
> #10 0x0000000000000000 in ?? ()

The crash is in __lll_lock_elision (frame #0), so this is either a bug in the
glusterfs locking code (undefined behaviour that happened to work before but
fails once elision is active) or a defect in the openSUSE glibc POSIX Lock
Elision feature (which is built on Intel TSX).
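
Since elision depends on TSX, it can be worth confirming what the affected
machine advertises. The following is only a rough sketch: it assumes a Linux
host where /proc/cpuinfo has a "flags" line and where TSX shows up as the
"rtm" and "hle" flags. The function name tsx_flags is made up for this
example. It reports what the kernel advertises, not whether glibc actually
elides locks.

```python
def tsx_flags(cpuinfo_text):
    """Return the TSX-related CPU flags ('rtm', 'hle') found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at "flags" lines; other fields may contain similar words.
        if sep and key.strip() == "flags":
            found |= {"rtm", "hle"} & set(value.split())
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print("TSX flags advertised:", sorted(tsx_flags(text)) or "none")
```

If neither flag is listed, elision is unlikely to be what the process is
exercising, and the glusterfs side deserves a closer look.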

You have three options and one potential workaround:

Options:

(1) Open a bug against the openSUSE 13.2 glibc and have the maintainers look at
it. You'll need to provide them with a core dump and, ideally, a reproducer so
they can investigate the issue.

(2) Reproduce the problem under Fedora or RHEL so that I can look at the issue
more closely. I expect you won't be able to reproduce it on stable Fedora or
RHEL, because we don't enable elision there; we consider the feature
experimental and unstable.

(3) Reproduce the issue under upstream glibc and file an upstream bug for Intel
(Andi Kleen) and others like Red Hat (myself) to look at. This is not a
recommended course of action because it really requires an expert to set up
such a reproducer. Normally to test this out we'd use Fedora Rawhide (which
tracks glibc upstream).

Workaround:

The openSUSE glibc package already includes a no-elision build of glibc. You
should be able to use it like this:

LD_PRELOAD=/lib64/noelision/libpthread.so.0 ./myapplication

This forces the application to start with a libpthread that has elision
disabled. Be warned, though, that helper processes might not inherit that
environment variable, and any such process would again be running with elision
enabled.
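
If the process is started from a wrapper, the same workaround can be applied
by passing the environment explicitly, and the inheritance concern can be
checked afterwards. This is a sketch under assumptions, not a tested recipe
for glusterfs: the helper names (noelision_env, run_without_elision,
preload_of) are made up for illustration, and the check assumes a Linux /proc
filesystem and enough privilege to read another process's environ file.

```python
import os
import subprocess

# Path taken from the workaround above; verify that it exists on your system.
NOELISION_LIB = "/lib64/noelision/libpthread.so.0"


def noelision_env(base_env=None, lib=NOELISION_LIB):
    """Return a copy of base_env with lib placed first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD entries may be separated by spaces or colons.
    entries = [p for p in existing.replace(":", " ").split() if p != lib]
    env["LD_PRELOAD"] = ":".join([lib] + entries)
    return env


def run_without_elision(argv):
    """Run argv with the no-elision libpthread preloaded; return its exit code."""
    if not os.path.exists(NOELISION_LIB):
        raise FileNotFoundError(NOELISION_LIB)
    return subprocess.call(argv, env=noelision_env())


def preload_of(pid):
    """Return the LD_PRELOAD value a process was started with, or None."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        for item in f.read().split(b"\0"):
            if item.startswith(b"LD_PRELOAD="):
                return item[len(b"LD_PRELOAD="):].decode()
    return None
```

Calling preload_of() on the pid of each helper process shows whether the
variable survived; a result of None means that process was started without it
and is presumably running with elision enabled again.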

I hope that helps.
