[Bugs] [Bug 1810934] New: Segfault in FUSE process, potential use after free

bugzilla at redhat.com bugzilla at redhat.com
Fri Mar 6 09:16:57 UTC 2020


https://bugzilla.redhat.com/show_bug.cgi?id=1810934

            Bug ID: 1810934
           Summary: Segfault in FUSE process, potential use after free
           Product: GlusterFS
           Version: mainline
          Hardware: x86_64
                OS: Linux
            Status: NEW
         Component: open-behind
          Severity: urgent
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: jahernan at redhat.com
                CC: bugs at gluster.org, jahernan at redhat.com,
                    manschwetus at cs-software-gmbh.de
        Depends On: 1697971
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1697971 +++

Description of problem:
We see regular segfaults of the FUSE process when running the GitLab Docker
container with its data placed on a 1x3 replicated GlusterFS volume; all nodes
running the container are also Gluster servers.

Version-Release number of selected component (if applicable):
We use the latest packages available for Ubuntu LTS:

# cat /etc/apt/sources.list.d/gluster-ubuntu-glusterfs-6-bionic.list
deb http://ppa.launchpad.net/gluster/glusterfs-6/ubuntu bionic main


How reproducible:
It happens regularly, at least daily, but we aren't aware of a specific
action/activity.

Steps to Reproduce:
1. setup docker swarm backed with glusterfs and run gitlab on it
2. ???
3. ???

Actual results:
Regular crashes of the FUSE process, cutting off the persistent storage used by
the dockerized GitLab.

Expected results:
Stable availability of glusterfs

Additional info:
We preserve the produced core files; I'll attach the backtrace and thread list
of the one from today.

--- Additional comment from  on 2019-04-09 13:31:55 CEST ---

Volume info:
$ sudo gluster  volume info

Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.write-behind: off
performance.cache-max-file-size: 1GB
performance.cache-size: 2GB
nfs.disable: on
transport.address-family: inet
auth.allow: 127.0.0.1
$ sudo gluster  volume status
Status of volume: swarm-vols
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.1.81:/gluster/data            49153     0          Y       188805
Brick 192.168.1.86:/gluster/data            49153     0          Y       5543 
Brick 192.168.1.85:/gluster/data            49152     0          Y       2069 
Self-heal Daemon on localhost               N/A       N/A        Y       2080 
Self-heal Daemon on 192.168.1.81            N/A       N/A        Y       144760
Self-heal Daemon on 192.168.1.86            N/A       N/A        Y       63715

Task Status of Volume swarm-vols
------------------------------------------------------------------------------
There are no active volume tasks
$ mount |  grep swarm
localhost:/swarm-vols on /swarm/volumes type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072,_netdev)

--- Additional comment from  on 2019-04-12 15:41:59 CEST ---

Today I encountered another segfault, same stack.

--- Additional comment from  on 2019-04-15 08:48:14 CEST ---

I just checked another core file from a second system, same stack; this issue
is getting quite annoying.
Could someone please look into it?

--- Additional comment from  on 2019-04-23 10:03:21 CEST ---

The problem persists with 6.1; the backtrace has changed a bit:

Crashdump1:
#0  0x00007fc3dcaa27f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fc3dcaa2874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fc3ddb5cdcc in gf_uuid_unparse (out=0x7fc3c8005580
"c27a90a6-e68b-4b0b-af56-002ea7bf1fb4", uuid=0x8 <error: Cannot access memory
at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid at entry=0x8 <error: Cannot access memory at address
0x8>) at common-utils.c:2777
#4  0x00007fc3d688c529 in ioc_open_cbk (frame=0x7fc3a8b56208, cookie=<optimized
out>, this=0x7fc3d001eb80, op_ret=0, op_errno=117, fd=0x7fc3c7678418,
xdata=0x0) at io-cache.c:646
#5  0x00007fc3d6cb09b1 in ra_open_cbk (frame=0x7fc3a8b5b698, cookie=<optimized
out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>,
fd=0x7fc3c7678418, xdata=0x0) at read-ahead.c:99
#6  0x00007fc3d71b10b3 in afr_open_cbk (frame=0x7fc3a8b67d48, cookie=0x0,
this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fc3c7678418, xdata=0x0) at
afr-open.c:97
#7  0x00007fc3d747c5f8 in client4_0_open_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7fc3a8b58d18) at
client-rpc-fops_v2.c:284
#8  0x00007fc3dd9013d1 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7fc3d0057dd0, pollin=pollin at entry=0x7fc386e7e2b0) at
rpc-clnt.c:755
#9  0x00007fc3dd901773 in rpc_clnt_notify (trans=0x7fc3d0058090,
mydata=0x7fc3d0057e00, event=<optimized out>, data=0x7fc386e7e2b0) at
rpc-clnt.c:922
#10 0x00007fc3dd8fe273 in rpc_transport_notify (this=this at entry=0x7fc3d0058090,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at
rpc-transport.c:542
#11 0x00007fc3d8333474 in socket_event_poll_in (notify_handled=true,
this=0x7fc3d0058090) at socket.c:2522
#12 socket_event_handler (fd=fd at entry=11, idx=idx at entry=2, gen=gen at entry=4,
data=data at entry=0x7fc3d0058090, poll_in=<optimized out>, poll_out=<optimized
out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fc3ddbb2863 in event_dispatch_epoll_handler (event=0x7fc3cfffee54,
event_pool=0x5570195807b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5570195c5a80) at event-epoll.c:761
#15 0x00007fc3dd2bc6db in start_thread (arg=0x7fc3cffff700) at
pthread_create.c:463
#16 0x00007fc3dcfe588f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Crashdump2:
#0  0x00007fdd13fd97f0 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#1  0x00007fdd13fd9874 in ?? () from /lib/x86_64-linux-gnu/libuuid.so.1
#2  0x00007fdd15093dcc in gf_uuid_unparse (out=0x7fdd0805a2d0
"1f739bdc-f7c0-4133-84cc-554eb594ae81", uuid=0x8 <error: Cannot access memory
at address 0x8>) at ./glusterfs/compat-uuid.h:55
#3  uuid_utoa (uuid=uuid at entry=0x8 <error: Cannot access memory at address
0x8>) at common-utils.c:2777
#4  0x00007fdd0d5c2529 in ioc_open_cbk (frame=0x7fdce44a9f88, cookie=<optimized
out>, this=0x7fdd0801eb80, op_ret=0, op_errno=117, fd=0x7fdcf1ad9b78,
xdata=0x0) at io-cache.c:646
#5  0x00007fdd0d9e69b1 in ra_open_cbk (frame=0x7fdce44d2a78, cookie=<optimized
out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>,
fd=0x7fdcf1ad9b78, xdata=0x0) at read-ahead.c:99
#6  0x00007fdd0dee70b3 in afr_open_cbk (frame=0x7fdce44a80a8, cookie=0x1,
this=<optimized out>, op_ret=0, op_errno=0, fd=0x7fdcf1ad9b78, xdata=0x0) at
afr-open.c:97
#7  0x00007fdd0e1b25f8 in client4_0_open_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7fdce4462528) at
client-rpc-fops_v2.c:284
#8  0x00007fdd14e383d1 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7fdd08054490, pollin=pollin at entry=0x7fdc942904d0) at
rpc-clnt.c:755
#9  0x00007fdd14e38773 in rpc_clnt_notify (trans=0x7fdd08054750,
mydata=0x7fdd080544c0, event=<optimized out>, data=0x7fdc942904d0) at
rpc-clnt.c:922
#10 0x00007fdd14e35273 in rpc_transport_notify (this=this at entry=0x7fdd08054750,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at
rpc-transport.c:542
#11 0x00007fdd0f86a474 in socket_event_poll_in (notify_handled=true,
this=0x7fdd08054750) at socket.c:2522
#12 socket_event_handler (fd=fd at entry=10, idx=idx at entry=4, gen=gen at entry=4,
data=data at entry=0x7fdd08054750, poll_in=<optimized out>, poll_out=<optimized
out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#13 0x00007fdd150e9863 in event_dispatch_epoll_handler (event=0x7fdd0f3e0e54,
event_pool=0x55cf9c9277b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x55cf9c96ca20) at event-epoll.c:761
#15 0x00007fdd147f36db in start_thread (arg=0x7fdd0f3e1700) at
pthread_create.c:463
#16 0x00007fdd1451c88f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Please tell me if you need further information.

--- Additional comment from  on 2019-04-24 12:27:43 CEST ---

Got another set of cores overnight, one for each system in my Gluster setup:

Core was generated by `/usr/sbin/glusterfs --process-name fuse
--volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7faee404c700 (LWP 28792))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007faee544e4b5 in ob_fd_free (ob_fd=0x7faebc054df0) at
open-behind.c:198
#2  0x00007faee544edd6 in ob_inode_wake (this=this at entry=0x7faed8020d20,
ob_fds=ob_fds at entry=0x7faee404bc90) at open-behind.c:355
#3  0x00007faee544f062 in open_all_pending_fds_and_resume
(this=this at entry=0x7faed8020d20, inode=0x7faed037cf08, stub=0x7faebc008858) at
open-behind.c:442
#4  0x00007faee544f4ff in ob_rename (frame=frame at entry=0x7faebc1ceae8,
this=this at entry=0x7faed8020d20, src=src at entry=0x7faed03d9a70,
dst=dst at entry=0x7faed03d9ab0, xdata=xdata at entry=0x0) at open-behind.c:1035
#5  0x00007faeed1b0ad0 in default_rename (frame=frame at entry=0x7faebc1ceae8,
this=<optimized out>, oldloc=oldloc at entry=0x7faed03d9a70,
newloc=newloc at entry=0x7faed03d9ab0, xdata=xdata at entry=0x0) at defaults.c:2631
#6  0x00007faee501f798 in mdc_rename (frame=frame at entry=0x7faebc210468,
this=0x7faed80247d0, oldloc=oldloc at entry=0x7faed03d9a70,
newloc=newloc at entry=0x7faed03d9ab0, xdata=xdata at entry=0x0) at md-cache.c:1852
#7  0x00007faeed1c6936 in default_rename_resume (frame=0x7faed02d2318,
this=0x7faed8026430, oldloc=0x7faed03d9a70, newloc=0x7faed03d9ab0, xdata=0x0)
at defaults.c:1897
#8  0x00007faeed14cc45 in call_resume (stub=0x7faed03d9a28) at call-stub.c:2555
#9  0x00007faee4e10cd8 in iot_worker (data=0x7faed8034780) at io-threads.c:232
#10 0x00007faeec88f6db in start_thread (arg=0x7faee404c700) at
pthread_create.c:463
#11 0x00007faeec5b888f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse
--volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7944069700 (LWP 24067))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f794682c4b5 in ob_fd_free (ob_fd=0x7f79283e94e0) at
open-behind.c:198
#2  0x00007f794682cdd6 in ob_inode_wake (this=this at entry=0x7f793801eee0,
ob_fds=ob_fds at entry=0x7f7944068c90) at open-behind.c:355
#3  0x00007f794682d062 in open_all_pending_fds_and_resume
(this=this at entry=0x7f793801eee0, inode=0x7f79301de788, stub=0x7f7928004578) at
open-behind.c:442
#4  0x00007f794682d4ff in ob_rename (frame=frame at entry=0x7f79280ab2b8,
this=this at entry=0x7f793801eee0, src=src at entry=0x7f7930558ea0,
dst=dst at entry=0x7f7930558ee0, xdata=xdata at entry=0x0) at open-behind.c:1035
#5  0x00007f794e729ad0 in default_rename (frame=frame at entry=0x7f79280ab2b8,
this=<optimized out>, oldloc=oldloc at entry=0x7f7930558ea0,
newloc=newloc at entry=0x7f7930558ee0, xdata=xdata at entry=0x0) at defaults.c:2631
#6  0x00007f79463fd798 in mdc_rename (frame=frame at entry=0x7f7928363ae8,
this=0x7f7938022990, oldloc=oldloc at entry=0x7f7930558ea0,
newloc=newloc at entry=0x7f7930558ee0, xdata=xdata at entry=0x0) at md-cache.c:1852
#7  0x00007f794e73f936 in default_rename_resume (frame=0x7f7930298c28,
this=0x7f79380245f0, oldloc=0x7f7930558ea0, newloc=0x7f7930558ee0, xdata=0x0)
at defaults.c:1897
#8  0x00007f794e6c5c45 in call_resume (stub=0x7f7930558e58) at call-stub.c:2555
#9  0x00007f79461eecd8 in iot_worker (data=0x7f7938032940) at io-threads.c:232
#10 0x00007f794de086db in start_thread (arg=0x7f7944069700) at
pthread_create.c:463
#11 0x00007f794db3188f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95



Core was generated by `/usr/sbin/glusterfs --process-name fuse
--volfile-server=localhost --volfile-id'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
65      ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x7f736ce78700 (LWP 42526))]
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:65
#1  0x00007f73733499f7 in fd_unref (fd=0x7f73544165a8) at fd.c:515
#2  0x00007f736c3fb618 in client_local_wipe (local=local at entry=0x7f73441ff8c8)
at client-helpers.c:124
#3  0x00007f736c44a60a in client4_0_open_cbk (req=<optimized out>,
iov=<optimized out>, count=<optimized out>, myframe=0x7f7344038b48) at
client-rpc-fops_v2.c:284
#4  0x00007f73730d03d1 in rpc_clnt_handle_reply
(clnt=clnt at entry=0x7f7368054490, pollin=pollin at entry=0x7f73601b2730) at
rpc-clnt.c:755
#5  0x00007f73730d0773 in rpc_clnt_notify (trans=0x7f7368054750,
mydata=0x7f73680544c0, event=<optimized out>, data=0x7f73601b2730) at
rpc-clnt.c:922
#6  0x00007f73730cd273 in rpc_transport_notify (this=this at entry=0x7f7368054750,
event=event at entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at
rpc-transport.c:542
#7  0x00007f736db02474 in socket_event_poll_in (notify_handled=true,
this=0x7f7368054750) at socket.c:2522
#8  socket_event_handler (fd=fd at entry=10, idx=idx at entry=4, gen=gen at entry=1,
data=data at entry=0x7f7368054750, poll_in=<optimized out>, poll_out=<optimized
out>, poll_err=<optimized out>, event_thread_died=0 '\000') at socket.c:2924
#9  0x00007f7373381863 in event_dispatch_epoll_handler (event=0x7f736ce77e54,
event_pool=0x55621c2917b0) at event-epoll.c:648
#10 event_dispatch_epoll_worker (data=0x55621c2d6a80) at event-epoll.c:761
#11 0x00007f7372a8b6db in start_thread (arg=0x7f736ce78700) at
pthread_create.c:463
#12 0x00007f73727b488f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

--- Additional comment from Xavi Hernandez on 2019-04-24 12:38:59 CEST ---

Can you share these core dumps, and indicate the exact version of Gluster and
Ubuntu used for each one?

As per your comments, I understand that you are not doing anything special when
this happens, right? No volume reconfiguration or any other management task.

Mount logs would also be helpful.

--- Additional comment from  on 2019-04-24 12:46:40 CEST ---

Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

administrator at ubuntu-docker:~$ apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]

Mount logs? I'm not sure what you are referring to.

--- Additional comment from Xavi Hernandez on 2019-04-24 12:50:13 CEST ---

(In reply to manschwetus from comment #7)
> Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)
> 
> administrator at ubuntu-docker:~$ apt list glusterfs-*
> Listing... Done
> glusterfs-client/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-common/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed,automatic]
> glusterfs-dbg/bionic 6.1-ubuntu1~bionic1 amd64
> glusterfs-server/bionic,now 6.1-ubuntu1~bionic1 amd64 [installed]
> 
> mount logs, not sure what you refer?

On the server where Gluster crashed, you should find a file like
/var/log/glusterfs/swarm-volumes.log. I would also need the coredump files and
the Ubuntu version.

--- Additional comment from  on 2019-04-24 12:50:43 CEST ---



--- Additional comment from  on 2019-04-24 12:56:25 CEST ---



--- Additional comment from  on 2019-04-24 12:57:10 CEST ---



--- Additional comment from  on 2019-04-24 12:58:02 CEST ---



--- Additional comment from  on 2019-04-24 13:00:05 CEST ---

administrator at ubuntu-docker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic


Same for the others

--- Additional comment from  on 2019-04-24 13:02:44 CEST ---



--- Additional comment from Xavi Hernandez on 2019-04-24 15:22:50 CEST ---

I'm still analyzing the core dumps, but as a first test, could you disable the
open-behind feature using the following command?

    # gluster volume set <volname> open-behind off

--- Additional comment from  on 2019-04-24 15:43:34 CEST ---

OK, I modified it; let's see if it helps. Hopefully it won't kill the
performance.

sudo gluster volume info swarm-vols

Volume Name: swarm-vols
Type: Replicate
Volume ID: a103c1da-d651-4d65-8f86-a8731e2a670c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.1.81:/gluster/data
Brick2: 192.168.1.86:/gluster/data
Brick3: 192.168.1.85:/gluster/data
Options Reconfigured:
performance.open-behind: off
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
performance.write-behind: off
auth.allow: 127.0.0.1
transport.address-family: inet
nfs.disable: on
performance.cache-max-file-size: 1GB

--- Additional comment from Xavi Hernandez on 2019-04-25 12:33:31 CEST ---

Right now the issue seems related to open-behind, so changing the component
accordingly.

--- Additional comment from  on 2019-04-30 09:42:31 CEST ---

OK, as we haven't had a crash since we disabled open-behind, it seems valid to
say that disabling open-behind bypasses the issue.
As I disabled write-behind due to another defect report, is it worth
re-enabling it, or would you expect problems with it?

--- Additional comment from Amar Tumballi on 2019-05-01 14:56:35 CEST ---

Yes please, you can re-enable write-behind if you have upgraded to
glusterfs-5.5 or glusterfs-6.x releases.

--- Additional comment from Xavi Hernandez on 2019-05-02 19:10:34 CEST ---

From my debugging, I think the issue is related to a missing fd_ref() when
ob_open_behind() is used. This could potentially cause a race when the same fd
is being unref'ed (refcount becoming 0) and ref'ed at the same time to handle
some open_and_resume() requests. I have not yet identified the exact sequence
of operations that causes the problem, though. Once I've confirmed that the
problem really comes from here, I'll investigate further.
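
To make the race concrete, here is a minimal, hypothetical sketch in plain C
(illustration only, not GlusterFS code; object_t, obj_ref and obj_unref are
made-up names standing in for an fd and its ref/unref helpers) showing why
taking a new reference at the same moment the last one is dropped ends in a
use-after-free:

    /* Minimal illustration (hypothetical, not GlusterFS code) of the race
     * described above: thread A drops what it believes is the last
     * reference while thread B, which still holds a raw pointer but no
     * reference of its own, tries to take a new one. */
    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct object {
        atomic_int refcount;
        /* ... payload, e.g. what an fd carries: inode, flags, contexts ... */
    } object_t;

    object_t *obj_ref(object_t *o)
    {
        /* The "ref'ed at the same time" side: undefined behaviour if the
         * count already reached 0 and the object has been freed. */
        atomic_fetch_add(&o->refcount, 1);
        return o;
    }

    void obj_unref(object_t *o)
    {
        /* The "unref'ed (refcount becoming 0)" side: the last unref frees. */
        if (atomic_fetch_sub(&o->refcount, 1) == 1)
            free(o);
    }

    /* Thread A: obj_unref(fd);  count 1 -> 0, object freed                */
    /* Thread B: obj_ref(fd);    increments freed memory: use after free   */

    int main(void)
    {
        object_t *o = calloc(1, sizeof(*o));
        atomic_init(&o->refcount, 1);  /* creator owns one reference       */
        obj_ref(o);                    /* taken BEFORE sharing: safe       */
        obj_unref(o);                  /* the sharing code path is done    */
        obj_unref(o);                  /* creator done, object freed       */
        return 0;
    }

The safe pattern is for whoever hands the pointer to another code path to take
the extra reference before doing so, so the count cannot reach zero while the
pointer is still reachable; that is what the missing fd_ref() would presumably
provide.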

--- Additional comment from Amar Tumballi on 2019-05-27 17:37:53 CEST ---

Marking it back to NEW, as the assignee is still bugs at gluster.org

--- Additional comment from Xavi Hernandez on 2020-01-09 17:53:15 CET ---

Is it possible to check whether the problem still persists with the latest 6.x
or 7.x?

--- Additional comment from  on 2020-01-10 14:42:01 CET ---

Was there a potential fix?
If yes, I'll have to update all systems and remove the workarounds, so it may
take a week.

--- Additional comment from Xavi Hernandez on 2020-01-10 15:33:33 CET ---

I don't see a clear candidate to fix this, but I haven't seen this problem in
other tests either.

Is it possible to try it in a test environment so as not to touch production? I
can't guarantee that it will work.

--- Additional comment from  on 2020-01-29 08:19:14 CET ---

No, that is the problem, but I'll try. Would it be sufficient to just reset the
options, or would the filesystem need to be unmounted and remounted?

--- Additional comment from  on 2020-03-03 16:30:45 CET ---

I updated today to

 apt list glusterfs-*
Listing... Done
glusterfs-client/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-common/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed,automatic]
glusterfs-dbg/bionic 6.7-ubuntu1~bionic1 amd64
glusterfs-server/bionic,now 6.7-ubuntu1~bionic1 amd64 [installed]


Then I enabled open-behind, and it crashed during the afternoon, so the problem
persists. I disabled open-behind again to prevent further instability.

--- Additional comment from Csaba Henk on 2020-03-04 17:53:51 CET ---

Can you please share all the fresh data? Backtraces, coredumps, gluster vol
info/status, swarm-volumes.log. Can you please also succinctly describe the
workload you put on this setup?

Note: I can't find 6.7-ubuntu1~bionic being available; indeed, what I see is
6.8-ubuntu1~bionic1 at
https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-6/+packages.

--- Additional comment from Xavi Hernandez on 2020-03-05 12:05:28 CET ---

I've been analyzing the core dumps again and I think I've identified a possible
cause. I'm trying to reproduce it but apparently it's hard to hit it with a
simple test.

I'll update on my findings soon.

--- Additional comment from Xavi Hernandez on 2020-03-05 16:53:32 CET ---

I've finally been able to reproduce the same issue with the master branch. I
needed to add artificial delays and use gfapi to trigger the bug, but the race
is there and, under certain workloads, it could manifest.

The backtrace I've got is the same:

(gdb) bt
#0  0x00007ffff7f77e74 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ffff0fb5145 in ob_fd_free (ob_fd=0x7fffd40063d0) at
open-behind.c:214
#2  0x00007ffff0fb5b18 in ob_inode_wake (this=this at entry=0x7fffe4018fb0,
ob_fds=ob_fds at entry=0x7ffff062ded0) at open-behind.c:361
#3  0x00007ffff0fb5ed1 in open_all_pending_fds_and_resume
(this=this at entry=0x7fffe4018fb0, inode=0x4b12b8, stub=0x7fffd400f138) at
open-behind.c:453
#4  0x00007ffff0fb61c3 in ob_rename (frame=frame at entry=0x7fffd4007c28,
this=0x7fffe4018fb0, src=src at entry=0x7fffcc0025f0,
dst=dst at entry=0x7fffcc002630, xdata=xdata at entry=0x0) at open-behind.c:1057
#5  0x00007ffff0f927a3 in mdc_rename (frame=frame at entry=0x7fffd4007da8,
this=0x7fffe401abc0, oldloc=oldloc at entry=0x7fffcc0025f0,
newloc=newloc at entry=0x7fffcc002630, xdata=xdata at entry=0x0) at md-cache.c:1848
#6  0x00007ffff7cf5c7c in default_rename_resume (frame=0x7fffcc001bf8,
this=0x7fffe401c7d0, oldloc=0x7fffcc0025f0, newloc=0x7fffcc002630, xdata=0x0)
at defaults.c:1897
#7  0x00007ffff7c77785 in call_resume (stub=0x7fffcc0025a8) at call-stub.c:2392
#8  0x00007ffff0f82700 in iot_worker (data=0x7fffe402e970) at io-threads.c:232
#9  0x00007ffff7f754e2 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff7ea46d3 in clone () from /lib64/libc.so.6

The issue is caused by a missing reference on an fd. I'll fix it.
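
For illustration, a hedged sketch of the general shape such a fix tends to take
(not the actual GlusterFS patch; it reuses the hypothetical object_t, obj_ref
and obj_unref helpers from the sketch in the earlier comment): the code path
that stores an fd pointer where another thread can later pick it up takes its
own reference at store time, and drops it only when the stored pointer is
removed.

    /* Hypothetical ownership rule (illustration only, not the actual
     * GlusterFS patch). obj_ref()/obj_unref() are the refcount helpers
     * from the sketch in the earlier comment. */
    typedef struct object object_t;      /* defined in that sketch        */
    object_t *obj_ref(object_t *o);
    void obj_unref(object_t *o);

    /* A pointer stored where another thread can later find it must own
     * its own reference, taken while the caller's reference is still
     * valid; it is released only when the stored pointer is removed. */
    void slot_store(object_t **slot, object_t *o)
    {
        *slot = obj_ref(o);
    }

    void slot_consume(object_t **slot)
    {
        object_t *o = *slot;
        *slot = NULL;
        /* ... wake/resume the pending work that uses o ... */
        obj_unref(o);
    }

Under that rule an object can never reach refcount 0 while a stored pointer to
it still exists, which is the situation the backtraces above suggest.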


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1697971
[Bug 1697971] Segfault in FUSE process, potential use after free