[Bugs] [Bug 1626085] "glusterfs --process-name fuse" crashes and leads to "Transport endpoint is not connected"

bugzilla at redhat.com bugzilla at redhat.com
Mon Feb 4 15:14:57 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1626085

GCth <rhb1 at gcth.net> changed:

           What    |Removed                              |Added
----------------------------------------------------------------------------
              Flags|needinfo?(ravishankar at redhat.com)  |
                   |needinfo?(rhb1 at gcth.net)            |



--- Comment #11 from GCth <rhb1 at gcth.net> ---
Up until frame #17 the backtraces are identical; here's another example:

Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=xxxx --'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
330     mem-pool.c: No such file or directory.
[Current thread is 1 (Thread 0x7f8d521bf700 (LWP 2217))]
(gdb) bt
#0  0x00007f8d5e877560 in __gf_free (free_ptr=0x7f8d49a25378) at mem-pool.c:330
#1  0x00007f8d5e842e1e in dict_destroy (this=0x7f8d4994f708) at dict.c:701
#2  0x00007f8d5e842f25 in dict_unref (this=<optimized out>) at dict.c:753
#3  0x00007f8d584330d4 in afr_local_cleanup (local=0x7f8d49a56cc8, this=<optimized out>) at afr-common.c:2091
#4  0x00007f8d5840d584 in afr_transaction_done (frame=<optimized out>, this=<optimized out>) at afr-transaction.c:369
#5  0x00007f8d5841483a in afr_unlock (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-lk-common.c:1085
#6  0x00007f8d5840aeca in afr_changelog_post_op_done (frame=frame@entry=0x7f8d4995ec08, this=this@entry=0x7f8d54019d40) at afr-transaction.c:778
#7  0x00007f8d5840e105 in afr_changelog_post_op_do (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1442
#8  0x00007f8d5840edcf in afr_changelog_post_op_now (frame=0x7f8d4995ec08, this=0x7f8d54019d40) at afr-transaction.c:1512
#9  0x00007f8d5840ef4c in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2444
#10 0x00007f8d58410866 in afr_transaction_start (local=local@entry=0x7f8d4cd6ed18, this=this@entry=0x7f8d54019d40) at afr-transaction.c:2847
#11 0x00007f8d58410c89 in afr_transaction (frame=frame@entry=0x7f8d4e643068, this=this@entry=0x7f8d54019d40, type=type@entry=AFR_DATA_TRANSACTION) at afr-transaction.c:2918
#12 0x00007f8d583fcb70 in afr_do_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:477
#13 0x00007f8d583fd81d in afr_writev (frame=frame@entry=0x7f8d4e245608, this=this@entry=0x7f8d54019d40, fd=fd@entry=0x7f8d499f3758, vector=0x7f8d4e932b40, count=1, offset=1024, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at afr-inode-write.c:555
#14 0x00007f8d5818cbef in dht_writev (frame=frame@entry=0x7f8d4e29c598, this=<optimized out>, fd=0x7f8d499f3758, vector=vector@entry=0x7f8d521be5c0, count=count@entry=1, off=<optimized out>, flags=32769, iobref=0x7f8d488cb3b0, xdata=0x0) at dht-inode-write.c:223
#15 0x00007f8d53df0b77 in wb_fulfill_head (wb_inode=wb_inode@entry=0x7f8d49a25310, head=0x7f8d49bbcb40) at write-behind.c:1156
#16 0x00007f8d53df0dfb in wb_fulfill (wb_inode=wb_inode@entry=0x7f8d49a25310, liabilities=liabilities@entry=0x7f8d521be720) at write-behind.c:1233
#17 0x00007f8d53df21b6 in wb_process_queue (wb_inode=wb_inode@entry=0x7f8d49a25310) at write-behind.c:1784
#18 0x00007f8d53df233f in wb_fulfill_cbk (frame=frame@entry=0x7f8d49cc15a8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=1024, op_errno=op_errno@entry=0, prebuf=prebuf@entry=0x7f8d49c7f8c0, postbuf=<optimized out>, xdata=<optimized out>) at write-behind.c:1105
#19 0x00007f8d5818b31e in dht_writev_cbk (frame=0x7f8d498dfa48, cookie=<optimized out>, this=<optimized out>, op_ret=1024, op_errno=0, prebuf=0x7f8d49c7f8c0, postbuf=0x7f8d49c7f958, xdata=0x7f8d4e65a7e8) at dht-inode-write.c:140
#20 0x00007f8d583fc2b7 in afr_writev_unwind (frame=frame@entry=0x7f8d48374db8, this=this@entry=0x7f8d54019d40) at afr-inode-write.c:234
#21 0x00007f8d583fc83e in afr_writev_wind_cbk (frame=0x7f8d4995ec08, cookie=<optimized out>, this=0x7f8d54019d40, op_ret=<optimized out>, op_errno=<optimized out>, prebuf=<optimized out>, postbuf=0x7f8d521be9d0, xdata=0x7f8d4e6e30d8) at afr-inode-write.c:388
#22 0x00007f8d586c4865 in client4_0_writev_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f8d4e621578) at client-rpc-fops_v2.c:685
#23 0x00007f8d5e61c130 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f8d54085540, pollin=pollin@entry=0x7f8d4ea47850) at rpc-clnt.c:755
#24 0x00007f8d5e61c48f in rpc_clnt_notify (trans=0x7f8d54085800, mydata=0x7f8d54085570, event=<optimized out>, data=0x7f8d4ea47850) at rpc-clnt.c:923
#25 0x00007f8d5e618893 in rpc_transport_notify (this=this@entry=0x7f8d54085800, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f8d4ea47850) at rpc-transport.c:525
#26 0x00007f8d59401671 in socket_event_poll_in (notify_handled=true, this=0x7f8d54085800) at socket.c:2504
#27 socket_event_handler (fd=<optimized out>, idx=idx@entry=2, gen=4, data=data@entry=0x7f8d54085800, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=<optimized out>) at socket.c:2905
#28 0x00007f8d5e8ab945 in event_dispatch_epoll_handler (event=0x7f8d521bee8c, event_pool=0x56110317e0b0) at event-epoll.c:591
#29 event_dispatch_epoll_worker (data=0x7f8d5406f7e0) at event-epoll.c:668
#30 0x00007f8d5dacb494 in start_thread (arg=0x7f8d521bf700) at pthread_create.c:333
#31 0x00007f8d5d374acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
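
For what it's worth, frames #0-#3 (__gf_free() reached through dict_unref() and dict_destroy() inside afr_local_cleanup()) look like a refcount double-unref or use-after-free on the dict. A minimal illustrative C sketch of that failure mode, with hypothetical types and names rather than actual GlusterFS source:

/* Illustrative stand-ins only; not the real GlusterFS types. */
#include <stdlib.h>

typedef struct {
        int   refcount;
        char *pairs;            /* stands in for the dict's key/value pairs */
} fake_dict;

static void fake_dict_destroy(fake_dict *d)
{
        free(d->pairs);         /* analogous to __gf_free() in frame #0 */
        free(d);
}

static void fake_dict_unref(fake_dict *d)
{
        if (--d->refcount == 0) /* the real code decrements under a lock */
                fake_dict_destroy(d);
}

int main(void)
{
        fake_dict *d = calloc(1, sizeof(*d));
        d->refcount = 1;
        d->pairs    = malloc(16);

        fake_dict_unref(d);     /* legitimate final unref: dict destroyed */
        fake_dict_unref(d);     /* stale second unref: use-after-free, then
                                   free() of already-freed memory, which is
                                   where the SIGSEGV/abort surfaces */
        return 0;
}

If two code paths, say the transaction cleanup and a late callback, each believe they hold the last reference, the second unref walks already-freed memory, which is exactly the kind of place a crash inside the free path shows up.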

All the gluster instances look similar to the following setup:

Type: Distributed-Replicate
Volume ID: e9dd963c...
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.11.1:/export/data1
Brick2: 10.10.11.2:/export/data1
Brick3: 10.10.11.3:/export/data1
Brick4: 10.10.11.4:/export/data1
Options Reconfigured:
cluster.favorite-child-policy: mtime
cluster.self-heal-daemon: enable
performance.cache-size: 1GB
performance.quick-read: on
performance.stat-prefetch: on
performance.read-ahead: on
performance.readdir-ahead: on
auth.allow: 10.*.*.*
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000

I do not have a reproducer. Each gluster instance holds 2-5 TB of files,
mostly small ones, spread across many directories.
They reach up to 10M inodes in use as reported by df -hi, and brick storage
is on XFS, as recommended.
A crash of an individual glusterfs process happens once every several days.
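
A side note on the trace: the "330     mem-pool.c: No such file or directory." line only means gdb could not find the glusterfs sources/debuginfo on the machine where the core was analysed. With matching debuginfo installed, the next core could be read with locals included; a sketch of such a session (the core file path is illustrative):

$ gdb /usr/sbin/glusterfs /path/to/core
(gdb) bt full
(gdb) thread apply all bt

bt full adds the local variables of every frame, and thread apply all bt dumps all threads' stacks, in case a racing thread is visible at crash time.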

-- 
You are receiving this mail because:
You are on the CC list for the bug.

