[Gluster-users] pausing scrub crashed scrub daemon on nodes

Amudhan P amudhan83 at gmail.com
Fri Sep 8 10:37:42 UTC 2017


Hi,

I am using glusterfs 3.10.1 in a single cluster of 30 nodes with 36 bricks
each plus 10 nodes with 16 bricks each.

By default I keep the scrub process paused so that it can be run manually.
The first time I tried running scrub-on-demand it was working fine, but
after some time I decided to pause the scrub because of high CPU usage and
users reporting that folder listing was taking a long time.
Pausing the scrub produced the messages below on some of the nodes.
Also, the scrub daemon is no longer shown in the volume status output for
some nodes.
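
For reference, the scrub was driven through the standard bitrot CLI; the
invocations were along these lines (volume name as it appears in the logs):

  # kick off an on-demand scrub
  gluster volume bitrot glustervol scrub ondemand

  # pause the running scrub (the step that produced the errors below)
  gluster volume bitrot glustervol scrub pause

  # per-node daemons, including the scrubber, are listed here
  gluster volume status glustervol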

Error msg type 1
--

[2017-09-01 10:04:45.840248] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:05:05.094948] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:06.401792] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:07.544524] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
[2017-09-01 10:05:07.552893] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
[2017-09-01 10:05:07.552942] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:05:07
[2017-09-01 10:05:07.553457] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-01 10:05:20.953815] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:05:20.953845] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:05:21
[2017-09-01 10:05:22.216937] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:05:22
[2017-09-01 10:05:22.306307] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:24.684900] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-06 08:37:26.422267] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:37:28.351821] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:37:30.350786] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2017-09-06 08:37:30
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10.1
/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7fda0ab0b4f8]
/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7fda0ab14914]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7fda09ef9d40]
/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7fda0ab429e7]
/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7fda04986b4b]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fda0a8d5490]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7fda0a8d5777]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fda0a8d17d3]
/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x7194)[0x7fda05826194]
/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x9635)[0x7fda05828635]
/usr/lib/libglusterfs.so.0(+0x83db0)[0x7fda0ab64db0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fda0a290182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fda09fbd47d]
--------------

Error msg type 2

[2017-09-01 10:01:20.387248] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
[2017-09-01 10:01:20.392544] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
[2017-09-01 10:01:20.392571] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:01:20
[2017-09-01 10:01:20.392727] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-01 10:01:35.078694] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:01:35.078735] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:01:36
[2017-09-01 10:01:36.355827] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:01:37.018622] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:01:37
[2017-09-01 10:01:37.601774] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-06 08:33:37.738627] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:33:39.812894] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:33:41.828432] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
[2017-09-06 08:33:41.884031] I [MSGID: 118051] [bit-rot-ssm.c:80:br_scrub_ssm_state_stall] 0-glustervol-bit-rot-0: Volume is under active scrubbing. Pausing scrub..
[2017-09-06 08:34:26.477106] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-970: server 192.168.0.21:49177 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:29.477438] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-980: server 192.168.0.21:49178 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:37.478198] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1040: server 192.168.0.21:49184 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:40.478550] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1070: server 192.168.0.21:49187 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:56.480200] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-990: server 192.168.0.21:49179 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:59.480520] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-760: server 192.168.0.21:49156 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:35:01.480751] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1020: server 192.168.0.21:49182 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:35:05.481223] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-790: server 192.168.0.21:49159 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 09:03:43.637208] E [rpc-clnt.c:200:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x8 sent = 2017-09-06 08:33:39.813002. timeout = 1800 for 127.0.0.1:24007
[2017-09-06 09:03:44.637338] E [rpc-clnt.c:200:call_bail] 0-glustervol-client-760: bailing out frame type(GlusterFS 3.3) op(READ(12)) xid = 0x160f941 sent = 2017-09-06 08:33:41.843336. timeout = 1800 for 192.168.0.21:49156
[2017-09-06 09:03:44.637726] W [MSGID: 114031] [client-rpc-fops.c:2992:client3_3_readv_cbk] 0-glustervol-client-760: remote operation failed [Transport endpoint is not connected]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2017-09-06 09:03:44
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10.1
/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7f26721934f8]
/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7f267219c914]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f2671581d40]
/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7f26721ca9e7]
/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7f2667dd3b4b]
/usr/lib/libgfrpc.so.0(+0xf92c)[0x7f2671f5c92c]
/usr/lib/libglusterfs.so.0(+0x36eb2)[0x7f267219feb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f2671918182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f267164547d]
-------


My queries are below:

1. To resume the scrub process, should I restart the glusterd service on the
nodes where the scrub daemon is not running, or do a volume force start?
(Both options are sketched after this list.)

2. If resumed, will the scrub continue from where it stopped?

3. I am assuming that scrub by default assigns one thread per brick on the
node (the logs above show scrubbers scaling up to 36, matching the 36 bricks
per node). An option to change this in the gluster volume command is needed:
in my case each node has 12 CPUs (Intel Xeon, 6 cores + HT), and while the
scrub was running it consumed 99% of all of them. Alternatively, the
scrubber should be intelligent enough to scale down based on the CPUs
available on the node. (The one related knob I know of is sketched after
this list.)

4. What caused this crash?
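
For query 1, these are the two options I am weighing (a sketch only; I have
not run either yet):

  # option A: restart the management daemon on the affected nodes
  systemctl restart glusterd    # or: service glusterd restart

  # option B: force-start the volume to respawn any missing daemons
  gluster volume start glustervol force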
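
For query 3, the only related knob I am aware of is the scrub throttle, and
the logs show it was already on the lowest setting when the CPUs saturated
(see the "Throttle: lazy" lines above):

  # accepted values: lazy | normal | aggressive
  gluster volume bitrot glustervol scrub-throttle lazy

So a thread-count option, or automatic scaling to the available CPUs, still
seems necessary.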


Regards,
Amudhan P