[Bugs] [Bug 1381970] New: GlusterFS Daemon stops working after a longer runtime and higher file workload due to design flaws ?

bugzilla at redhat.com bugzilla at redhat.com
Wed Oct 5 13:01:26 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1381970

            Bug ID: 1381970
           Summary: GlusterFS Daemon stops working after a longer runtime
                    and higher file workload due to design flaws?
           Product: GlusterFS
           Version: 3.7.15
         Component: rpc
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: jules at ispire.me
                CC: bugs at gluster.org



Description of problem:

Since the first release of GlusterFS 3.7, glusterfsd seems to stop working or
crash after running for some time, and it freezes all nodes that are members
of the GlusterFS cluster.
In that state you can no longer log in to the machines and have to force-reboot
them, because the mounted volumes become inaccessible and lock up the whole
system.

Here is the nfs.log output from before and after the crash:

First Node:

[2016-10-05 09:30:13.410418] W [socket.c:596:__socket_rwv] 0-NLM-client: readv
on 10.1.20.32:23198 failed (No data available)
[2016-10-05 09:32:31.243012] E [socket.c:2292:socket_connect_finish]
0-NLM-client: connection to 10.1.20.32:23198 failed (Connection timed out)
pending frames:

frame : type(0) op(0)  <--- this line repeats thousands of times

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-10-05 09:32:33
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.15
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x7e)[0x7f28321a3fbe]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f28321c670d]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f2830c450e0]
/lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f2830ca1d8a]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/nfs/server.so(nlm_set_rpc_clnt+0x62)[0x7f2827714c32]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/nfs/server.so(nlm_rpcclnt_notify+0x35)[0x7f2827717395]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f2831f7147a]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f2831f6d733]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/rpc-transport/socket.so(+0x4a73)[0x7f282d0d8a73]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/rpc-transport/socket.so(+0x8e1f)[0x7f282d0dce1f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8e722)[0x7f283220c722]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f28314270a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f2830cf862d]


Second Node:

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-10-05 09:35:44
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.15
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x7e)[0x7fa8251f0fbe]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x31d)[0x7fa82521370d]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7fa823c920e0]
/lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7fa823ceed8a]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/nfs/server.so(nlm_set_rpc_clnt+0x62)[0x7fa81e907c32]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/nfs/server.so(nlm_rpcclnt_notify+0x35)[0x7fa81e90a395]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7fa824fbe47a]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa824fba733]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/rpc-transport/socket.so(+0x4a73)[0x7fa820125a73]
/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/rpc-transport/socket.so(+0x8e1f)[0x7fa820129e1f]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x8e722)[0x7fa825259722]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7fa8244740a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fa823d4562d]
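
Both backtraces above appear to pass through nlm_rpcclnt_notify and
nlm_set_rpc_clnt in the NFS server xlator before faulting inside libc. A quick
way to check that two pasted traces follow the same path is to strip the
absolute addresses and compare only library, symbol and offset. The following
is a minimal sketch, not part of GlusterFS; the helper names parse_frames and
same_crash_path are made up for illustration, and the regular expression only
assumes the "path(symbol+offset)[address]" frame layout seen in the traces.

```python
import re

# Matches backtrace frames of the form
#   /path/to/lib.so(symbol+0x62)[0x7f2827714c32]
# The symbol may be empty when only an offset is known:
#   /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f2830c450e0]
FRAME_RE = re.compile(
    r"^(?P<lib>/\S+?)\((?P<sym>[^+()]*)\+(?P<off>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return (library file name, symbol or None, offset) for each frame line.

    Lines that are not backtrace frames are ignored. The absolute address is
    dropped because it differs between processes and hosts.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            lib = match.group("lib").rsplit("/", 1)[-1]
            frames.append((lib, match.group("sym") or None, match.group("off")))
    return frames


def same_crash_path(trace_a, trace_b):
    """True if both traces contain the same frames in the same order."""
    return parse_frames(trace_a) == parse_frames(trace_b)
```

Feeding the "First Node" and "Second Node" sections to same_crash_path would
show whether the two crashes share one code path, which helps when deciding
whether they belong in a single bug.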


And here are the most common lines from /var/log/glusterfs/bricks/*.log:

[2016-10-04 11:18:34.009309] E [MSGID: 113027] [posix.c:1538:posix_mkdir]
0-netshare-posix: mkdir of /xxxxxxx/shared/web failed [File exists]
[2016-10-04 12:06:31.511435] E [MSGID: 115050]
[server-rpc-fops.c:179:server_lookup_cbk] 0-netshare-server: 1147010245: LOOKUP
/www/schausteller.de/releases/20161004111428/web/anzeigenmarkt
(bed02db1-dd66-4244-9adb-5e8b22513d62/anzeigenmarkt) ==> (Permission denied)
[Permission denied]
[2016-10-04 15:44:03.460733] E [MSGID: 113001]
[posix.c:5415:_posix_handle_xattr_keyvalue_pair] 0-netshare-posix: getxattr
failed on
/storage/gfs/netshare/.glusterfs/42/ef/42ef035a-7688-45e2-ae2e-1306e49fcc9f
while doing xattrop: Key:trusted.afr.dirty  [No such file or directory]
[2016-10-04 22:24:13.715602] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
0-rpcsvc: rpc actor failed to complete successfully
[2016-10-04 22:24:13.715623] E [server-helpers.c:390:server_alloc_frame]
(-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)
[0x7f1b892f45e5]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server3_3_lookup+0x8b)
[0x7f1b80b3649b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(get_frame_from_request+0x302)
[0x7f1b80b1dc82] ) 0-server: invalid argument: client [Invalid argument]
[2016-10-04 22:24:13.715631] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
0-rpcsvc: rpc actor failed to complete successfully
[2016-10-04 22:24:13.715654] E [server-helpers.c:390:server_alloc_frame]
(-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)
[0x7f1b892f45e5]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server3_3_lookup+0x8b)
[0x7f1b80b3649b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(get_frame_from_request+0x302)
[0x7f1b80b1dc82] ) 0-server: invalid argument: client [Invalid argument]
[2016-10-04 22:28:08.361526] W [socket.c:596:__socket_rwv]
0-tcp.netshare-server: writev on 10.1.20.1:49124 failed (Broken pipe)
[2016-10-04 22:28:08.361996] E [rpcsvc.c:1314:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x6a4f5f, Program: GlusterFS 3.3,
ProgVers: 330, Proc: 27) to rpc-transport (tcp.netshare-server)
[2016-10-04 22:28:08.362163] W [entrylk.c:757:pl_entrylk_log_cleanup]
0-netshare-server: releasing lock on 67caf764-b306-4e59-a8d2-891287ba33b0 held
by {client=0x7f552400ff80, pid=-6 lk-owner=78bce576c67f0000}
[2016-10-04 22:28:08.362250] W [entrylk.c:757:pl_entrylk_log_cleanup]
0-netshare-server: releasing lock on 67caf764-b306-4e59-a8d2-891287ba33b0 held
by {client=0x7f552400ff80, pid=-6 lk-owner=78bce576c67f0000}
[2016-10-04 22:28:08.362402] E [MSGID: 113040]
[posix-helpers.c:1640:__posix_fd_ctx_get] 0-netshare-posix: Failed to get fd
context for a non-anonymous fd, gfid: 4dd09768-deb2-493c-a7f3-aa70c79c21e5
[2016-10-04 22:28:08.362431] W [MSGID: 113006] [posix.c:3444:posix_flush]
0-netshare-posix: pfd is NULL on fd=0x7f552c0ce8a4 [Invalid argument]
[2016-10-04 22:28:08.363021] W [inodelk.c:404:pl_inodelk_log_cleanup]
0-netshare-server: releasing lock on 1051d710-8d00-48f7-85ae-eea721b3b86e held
by {client=0x7f55240834b0, pid=-6 lk-owner=90d525f2a27f0000}
[2016-10-04 22:28:08.363027] E [rpcsvc.c:1314:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x472ca5d9, Program: GlusterFS
3.3, ProgVers: 330, Proc: 22) to rpc-transport (tcp.netshare-server)
[2016-10-04 22:28:08.367940] E [server-helpers.c:390:server_alloc_frame]
(-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325)
[0x7f5538dd15e5]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server3_3_lookup+0x8b)
[0x7f55306ac49b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(get_frame_from_request+0x302)
[0x7f5530693c82] ) 0-server: invalid argument: client [Invalid argument]
[2016-10-04 22:28:08.367953] E [server.c:205:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/debug/io-stats.so(io_stats_lookup_cbk+0x117)
[0x7f55308d3017]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_lookup_cbk+0x50e)
[0x7f55306abe1e]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_submit_reply+0x2ed)
[0x7f553068ffed] ) 0-: Reply submission failed
[2016-10-04 22:28:08.586426] E [MSGID: 113001]
[posix.c:5415:_posix_handle_xattr_keyvalue_pair] 0-netshare-posix: getxattr
failed on
/storage/gfs/netshare/.glusterfs/9a/7d/9a7d25ee-a7c4-45ed-9e62-2c780e11f700
while doing xattrop: Key:trusted.afr.dirty  [No such file or directory]
[2016-10-04 23:43:03.514841] E [rpcsvc.c:1314:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x28bc4, Program: GlusterFS 3.3,
ProgVers: 330, Proc: 40) to rpc-transport (tcp.netshare-server)
[2016-10-04 23:43:03.514923] E [server.c:205:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/debug/io-stats.so(io_stats_readdirp_cbk+0x18f)
[0x7f55308d1f8f]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_readdirp_cbk+0xe3)
[0x7f553069a653]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_submit_reply+0x2ed)
[0x7f553068ffed] ) 0-: Reply submission failed
[2016-10-04 23:43:03.515148] E [rpcsvc.c:1314:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x28bc3, Program: GlusterFS 3.3,
ProgVers: 330, Proc: 27) to rpc-transport (tcp.netshare-server)
[2016-10-04 23:43:03.515192] E [server.c:205:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/debug/io-stats.so(io_stats_lookup_cbk+0x117)
[0x7f55308d3017]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_lookup_cbk+0x50e)
[0x7f55306abe1e]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.15/xlator/protocol/server.so(server_submit_reply+0x2ed)
[0x7f553068ffed] ) 0-: Reply submission failed
[2016-10-04 23:43:04.518386] W [entrylk.c:757:pl_entrylk_log_cleanup]
0-netshare-server: releasing lock on 57869c3e-ab7d-42e1-8460-249c8ec1cc8d held
by {client=0x7f5524481400, pid=-6 lk-owner=c46de6952c7f0000}
[2016-10-04 23:43:04.518429] W [entrylk.c:757:pl_entrylk_log_cleanup]
0-netshare-server: releasing lock on 57869c3e-ab7d-42e1-8460-249c8ec1cc8d held
by {client=0x7f5524481400, pid=-6 lk-owner=c46de6952c7f0000}
[2016-10-04 23:43:04.518487] E [MSGID: 113040]
[posix-helpers.c:1640:__posix_fd_ctx_get] 0-netshare-posix: Failed to get fd
context for a non-anonymous fd, gfid: 4dd09768-deb2-493c-a7f3-aa70c79c21e5
[2016-10-04 23:43:04.518512] W [MSGID: 113006] [posix.c:3444:posix_flush]
0-netshare-posix: pfd is NULL on fd=0x7f552c0cfc78 [Invalid argument]
[2016-10-04 23:43:04.532080] E [rpcsvc.c:1314:rpcsvc_submit_generic]
0-rpc-service: failed to submit message (XID: 0x1ad80, Program: GlusterFS 3.3,
ProgVers: 330, Proc: 27) to rpc-transport (tcp.netshare-server)
[2016-10-04 23:43:06.531762] W [entrylk.c:757:pl_entrylk_log_cleanup]
0-netshare-server: releasing lock on 67caf764-b306-4e59-a8d2-891287ba33b0 held
by {client=0x7f552c0f6ae0, pid=-6 lk-owner=c8218da72e7f0000}
[2016-10-04 23:43:06.531816] E [MSGID: 113040]
[posix-helpers.c:1640:__posix_fd_ctx_get] 0-netshare-posix: Failed to get fd
context for a non-anonymous fd, gfid: cb026705-8b5c-45db-9333-c1ca1fa4c025
[2016-10-04 23:44:04.530858] E [MSGID: 113040]
[posix-helpers.c:1640:__posix_fd_ctx_get] 0-netshare-posix: Failed to get fd
context for a non-anonymous fd, gfid: 4dd09768-deb2-493c-a7f3-aa70c79c21e5
[2016-10-04 23:44:04.530880] W [MSGID: 113006] [posix.c:3444:posix_flush]
0-netshare-posix: pfd is NULL on fd=0x7f552c0d0260 [Invalid argument]
[2016-10-05 10:04:50.164154] W [MSGID: 115009]
[server-resolve.c:574:server_resolve] 0-netshare-server: no resolution type for
(null) (LOOKUP)
[2016-10-05 10:04:50.164281] E [MSGID: 115050]
[server-rpc-fops.c:179:server_lookup_cbk] 0-netshare-server: 782572: LOOKUP
(null) (00000000-0000-0000-0000-000000000000/HistoryEntry.orm.xml) ==> (Invalid
argument) [Invalid argument]
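
To back up "most common" with numbers, the brick log entries can be counted per
severity and reporting function. The sketch below is only an illustration under
a few assumptions: every entry starts with a bracketed timestamp, the header
has the "[timestamp] LEVEL [MSGID: n] [file.c:line:function]" layout seen in
the excerpt above (with the MSGID part optional), and wrapped continuation
lines belong to the preceding entry. The names join_entries and tally are
hypothetical helpers, not GlusterFS tooling.

```python
import re
from collections import Counter

# An entry begins with a timestamp such as [2016-10-04 22:28:08.361996]
ENTRY_START = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\]")

# Header layout assumed from the excerpt:
#   [timestamp] E [MSGID: 113040] [posix-helpers.c:1640:__posix_fd_ctx_get] ...
HEADER = re.compile(
    r"^\[[^\]]+\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID: (?P<msgid>\d+)\]\s+)?"
    r"\[(?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\]"
)


def join_entries(text):
    """Merge wrapped continuation lines into one string per log entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def tally(text):
    """Count entries by (severity letter, reporting function)."""
    counts = Counter()
    for entry in join_entries(text):
        match = HEADER.match(entry)
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts
```

Calling tally(log_text).most_common(10) on the contents of a brick log would
list the ten most frequent (severity, function) pairs.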



Version-Release number of selected component (if applicable): 3.7.15


How reproducible:
Serve mixed content/data with daily read/write access and wait a few weeks
for it to crash.

Steps to Reproduce:
1. Only reproducible on GlusterFS clusters under higher workload.
2. Generate daily mixed read/write access to NFS-mounted volumes.
3. Wait until the gluster daemon crashes.
4. Have fun getting it under control.
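
Step 2 could be approximated with a small load script run against a directory
on the NFS-mounted volume. This is only a rough sketch: the file count, file
sizes and the function name mixed_workload are arbitrary assumptions, and
nothing here is confirmed to trigger the crash described in this report.

```python
import os
import random


def mixed_workload(root, rounds=100, seed=0):
    """Randomly write, read back and delete small files under `root`.

    `root` is assumed to be a directory on the NFS-mounted volume. Returns a
    dict counting how many operations of each kind were actually performed.
    """
    rng = random.Random(seed)
    os.makedirs(root, exist_ok=True)
    stats = {"writes": 0, "reads": 0, "deletes": 0}
    for _ in range(rounds):
        # Reuse a small pool of names so files get overwritten and removed.
        path = os.path.join(root, f"file-{rng.randrange(20)}.dat")
        action = rng.choice(["write", "read", "delete"])
        if action == "write":
            with open(path, "wb") as handle:
                handle.write(os.urandom(rng.randrange(1, 4096)))
            stats["writes"] += 1
        elif action == "read" and os.path.exists(path):
            with open(path, "rb") as handle:
                handle.read()
            stats["reads"] += 1
        elif action == "delete" and os.path.exists(path):
            os.remove(path)
            stats["deletes"] += 1
    return stats
```

Running it repeatedly (for example from cron) against a test volume would give
a repeatable mixed load while the nfs.log and brick logs are being watched.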

Actual results:
No release has been production-ready since the first GlusterFS 3.7 release.

Expected results:
Stability!

Additional info:
I have been facing this issue since the first release of GlusterFS 3.7: sooner
or later GlusterFS becomes unstable and overloads the servers.
In that state you can do nothing except press the reset button or force a
reboot.


