[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Erik Jacobson erik.jacobson at hpe.com
Thu Apr 16 17:03:33 UTC 2020


In my test runs since making that change, we now see a different odd
behavior. As you recall, this is with your patch (still no split-brain
errors), and now with performance.parallel-readdir off.

The NFS server grinds to a halt after a few test runs. It does not core
dump.

All that shows up in the log is:

"pending frames:" with nothing after it and no date stamp.

I will start looking for interesting breakpoints, I guess.


The glusterfs for nfs is still alive:

root     30541     1 42 09:57 ?        00:51:06 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/9ddb5561058ff543.socket



[root@leader3 ~]# strace -f -p 30541
strace: Process 30541 attached with 40 threads
[pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30578] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30577] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30576] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30575] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30574] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30573] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30572] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30571] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30570] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30569] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30568] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30567] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30566] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30565] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30564] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30563] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30562] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30561] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30560] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30559] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30558] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30557] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30556] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30555] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30554] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30553] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30552] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30551] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30550] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 30549] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30548] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} <unfinished ...>
[pid 30546] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 30545] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 30544] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30543] rt_sigtimedwait([HUP INT USR1 USR2 TERM],  <unfinished ...>
[pid 30542] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL <unfinished ...>
[pid 30547] <... select resumed> )      = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}^Cstrace: Process 30541 detached
strace: Process 30542 detached
strace: Process 30543 detached
strace: Process 30544 detached
strace: Process 30545 detached
strace: Process 30546 detached
strace: Process 30547 detached
 <detached ...>
strace: Process 30548 detached
strace: Process 30549 detached
strace: Process 30550 detached
strace: Process 30551 detached
strace: Process 30552 detached
strace: Process 30553 detached
strace: Process 30554 detached
strace: Process 30555 detached
strace: Process 30556 detached
strace: Process 30557 detached
strace: Process 30558 detached
strace: Process 30559 detached
strace: Process 30560 detached
strace: Process 30561 detached
strace: Process 30562 detached
strace: Process 30563 detached
strace: Process 30564 detached
strace: Process 30565 detached
strace: Process 30566 detached
strace: Process 30567 detached
strace: Process 30568 detached
strace: Process 30569 detached
strace: Process 30570 detached
strace: Process 30571 detached
strace: Process 30572 detached
strace: Process 30573 detached
strace: Process 30574 detached
strace: Process 30575 detached
strace: Process 30576 detached
strace: Process 30577 detached
strace: Process 30578 detached
strace: Process 30579 detached
strace: Process 30580 detached
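
A trace like the one above is easier to read once the futex waiters are
tallied by address. The following is only a minimal sketch, assuming the
strace output has been saved as text lines in the format shown; the
function name count_futex_waiters and the sample list are made up for
illustration and are not part of gluster or strace.

```python
import re
from collections import Counter

# Matches strace lines such as:
# [pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
FUTEX_RE = re.compile(r"^\[pid\s+(\d+)\]\s+futex\((0x[0-9a-f]+),\s*(FUTEX_\w+)")


def count_futex_waiters(lines):
    """Return a Counter mapping futex address -> number of waiting threads.

    Hypothetical helper: only futex calls whose operation starts with
    FUTEX_WAIT are counted; every other syscall line is ignored.
    """
    waiters = Counter()
    for line in lines:
        match = FUTEX_RE.match(line.strip())
        if match and match.group(3).startswith("FUTEX_WAIT"):
            waiters[match.group(2)] += 1
    return waiters


# A few lines copied from the trace above, used as sample input.
sample = [
    "[pid 30580] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>",
    "[pid 30579] futex(0x7f8904035f60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>",
    "[pid 30566] futex(0x7f88b8000020, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>",
    "[pid 30547] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=243775} <unfinished ...>",
    "[pid 30541] futex(0x7f890c3a39d0, FUTEX_WAIT, 30548, NULL <unfinished ...>",
]

# Print addresses from most to least contended.
for address, count in count_futex_waiters(sample).most_common():
    print(f"{address}  {count} thread(s) waiting")
```

In the full trace above, 30 of the 40 threads are waiting on
0x7f8904035f60 and four on 0x7f88b8000020. That hints at contention on
one or two locks but does not show who holds them; a backtrace of all
threads would be needed for that.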




> On 16/04/20 8:04 pm, Erik Jacobson wrote:
> > Quick update just on how this got set.
> > 
> > gluster volume set cm_shared performance.parallel-readdir on
> > 
> > Is something we did turn on, thinking it might make our NFS services
> > faster and not knowing about it possibly being negative.
> > 
> > Below is a diff of the nfs volume file ON vs OFF. So I will simply turn
> > this OFF and do a test run.
> Yes, that should do it. I am not sure if performance.parallel-readdir was
> intentionally made to have an effect on gnfs volfiles. Usually, for other
> performance xlators, `gluster volume set` only changes the fuse volfile.

