[Bugs] [Bug 1690769] GlusterFS 5.5 crashes in 1x4 replicate setup.

bugzilla at redhat.com
Fri May 3 22:30:15 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1690769



--- Comment #4 from Artem Russakovskii <archon810 at gmail.com> ---
Phew, this was a fun one!

Long story short - after weeks of debugging with the amazing Gluster team
(thanks, Amar and Xavi!), we have found the root of the problem and a solution.

The crash happens on CPUs with the 'rtm' flag (Intel TSX Restricted
Transactional Memory, which glibc uses for lock elision), in combination with
slightly older versions of glibc, specifically 2.26. The bug is fixed in
glibc 2.29.
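
As a rough way to check the glibc side of this on a given box, here is a
minimal sketch. The helper name glibc_older_than is made up for illustration,
and treating every release older than 2.29 as suspect is an assumption: only
2.26 was actually observed crashing here.

```shell
#!/bin/sh
# glibc_older_than MAJOR MINOR [VERSION]
# Succeeds if VERSION is older than MAJOR.MINOR. When VERSION is omitted it is
# taken from `getconf GNU_LIBC_VERSION`, which prints e.g. "glibc 2.26".
# (Hypothetical helper, not part of glibc or GlusterFS.)
glibc_older_than() {
    ver=${3:-$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')}
    [ -n "$ver" ] || return 2          # not a glibc system, or getconf missing
    major=${ver%%.*}                   # "2.26" -> "2"
    rest=${ver#*.}                     # "2.26" -> "26"
    minor=${rest%%.*}                  # drops any patch level
    [ "$major" -lt "$1" ] || { [ "$major" -eq "$1" ] && [ "$minor" -lt "$2" ]; }
}

if glibc_older_than 2 29; then
    echo "glibc predates 2.29 (the release reported to contain the fix)"
fi
```

An explicit version can be passed as the third argument, for example
glibc_older_than 2 29 2.26, to evaluate a machine you are not logged in to.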

For example, 3 of our machines had these CPUs (run lscpu to find out):
Model name:          Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni
pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes
xsave avx f16c rdrand hypervisor lahf_lm pti fsgsbase tsc_adjust smep erms
xsaveopt arat

And the one that was crashing had this CPU:
Model name:          Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch cpuid_fault invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2
smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
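
To check for the flag without reading the whole list by eye, something like the
following should work on Linux. This is only a sketch: has_cpu_flag is a
hypothetical helper, and it reads the "flags" line of /proc/cpuinfo instead of
parsing lscpu output.

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]
# Succeeds if FLAG appears as a whole word on the first "flags" line of FILE
# (default: /proc/cpuinfo). Hypothetical helper for illustration.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Split the flags line into one word per line, then match FLAG exactly,
    # so that "rtm" does not match a longer name that merely contains it.
    grep -m1 '^flags' "$file" 2>/dev/null | tr -s ' \t' '\n\n' | grep -qx -- "$flag"
}

if has_cpu_flag rtm; then
    echo "rtm flag present"
else
    echo "rtm flag not present"
fi
```

The optional second argument makes it possible to run the check against a
saved copy of /proc/cpuinfo from another machine.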

Since openSUSE 15.0 currently ships glibc 2.26, the easiest solution was to
migrate the box to a CPU without the rtm feature, which we've now done and
confirmed the crash is gone.


Before the migration, Xavi did find a workaround:
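1. In the shell that will mount the volume, run:
   export GLIBC_TUNABLES=glibc.tune.hwcaps=-RTM
2. Unmount and remount, so that the glusterfs processes start with that
   variable in their environment.
3. Confirm the above worked (run as one line):
   for i in $(pgrep glusterfs); do ps h -o cmd -p $i; cat /proc/$i/environ | xargs -0 -n 1 | grep "GLIBC_TUNABLES"; done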
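
A slightly more explicit variant of the check in step 3 is sketched below. It
also reports processes that do not have the variable at all, which the grep
version silently skips. print_env_var is a hypothetical helper, and reading
/proc/PID/environ for another user's process typically requires root.

```shell
#!/bin/sh
# print_env_var NAME FILE
# Prints the value of NAME from a NUL-separated environment file such as
# /proc/PID/environ; prints nothing if NAME is absent.
# Assumes the value itself contains no newline. Hypothetical helper.
print_env_var() {
    tr '\0' '\n' < "$2" | sed -n "s/^$1=//p"
}

# Report GLIBC_TUNABLES for every running glusterfs process.
for pid in $(pgrep glusterfs 2>/dev/null); do
    val=$(print_env_var GLIBC_TUNABLES "/proc/$pid/environ")
    printf '%s GLIBC_TUNABLES=%s\n' "$pid" "${val:-<not set>}"
done
```

After the workaround has been applied, every line is expected to end in
glibc.tune.hwcaps=-RTM; a process showing <not set> was started before the
export or from a different environment.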


More info about this lock elision feature, as well as a quick test program, can
be found here: https://sourceware.org/bugzilla/show_bug.cgi?id=23275.

Here are sample runs on hardware with the 'rtm' feature (crash observed) and
without it (no crash):

gcc -pthread test.c -o test

archon810 at citadel:/tmp> ./test 
Please add a check if lock-elision is available on your architecture. The check
in check_if_lock_elision_is_available () assumes, that lock-elision is enabled!

main: start 3 threads to run 2000000 iterations.
#0: started
#1: started
#2: started
.#0: pthread_mutex_destroy: failed with 16; in round=2295;
Aborted


archon810 at hive:/tmp> ./test 
Please add a check if lock-elision is available on your architecture. The check
in check_if_lock_elision_is_available () assumes, that lock-elision is enabled!

main: start 3 threads to run 2000000 iterations.
#0: started
#2: started
#1: started
........................................................................................................................................................................................................main:
end.


Not sure how the maintainers will choose to close this issue, but I hope it'll
help someone in the future, especially since we spent countless hours analyzing
and debugging (hopefully, not all in vain!).
