<div dir="ltr"><div>Hi,</div><div><br></div><div>I am using glusterfs 3.10.1 in a single cluster of 30 nodes with 36 bricks each and 10 nodes with 16 bricks each.</div><div><br></div><div>By default I keep the scrub process paused so that I can run it manually. I ran scrub-on-demand for the first time and it was running fine,</div><div>but after some time I decided to pause the scrub process because of high CPU usage and users reporting that folder listings were slow.</div><div>However, pausing the scrub produced the messages below on some of the nodes.</div><div>Also, the scrub daemon no longer shows up in volume status for some nodes.</div><div><br></div><div>Error msg type 1</div><div>--</div><div><br></div><div>[2017-09-01 10:04:45.840248] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called</div><div>[2017-09-01 10:05:05.094948] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-01 10:05:06.401792] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-01 10:05:07.544524] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbe</div><div>rs [0 =&gt; 36]</div><div>[2017-09-01 10:05:07.552893] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES::</div><div> [Frequency: biweekly, Throttle: lazy]</div><div>[2017-09-01 10:05:07.552942] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is schedule</div><div>d to run at 2017-09-15 10:05:07</div><div>[2017-09-01 10:05:07.553457] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing</div><div>[2017-09-01 10:05:20.953815] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called</div><div>[2017-09-01 10:05:20.953845] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing s</div><div>cheduled to run 
at 2017-09-01 10:05:21</div><div>[2017-09-01 10:05:22.216937] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started a</div><div>t 2017-09-01 10:05:22</div><div>[2017-09-01 10:05:22.306307] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-01 10:05:24.684900] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing</div><div>[2017-09-06 08:37:26.422267] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-06 08:37:28.351821] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-06 08:37:30.350786] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scr</div><div>ubbers [36 =&gt; 0]</div><div>pending frames:</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>patchset: git://<a href="http://git.gluster.org/glusterfs.git">git.gluster.org/glusterfs.git</a></div><div>signal received: 11</div><div>time of crash:</div><div>2017-09-06 08:37:30</div><div>configuration details:</div><div>argp 1</div><div>backtrace 1</div><div>dlfcn 
1</div><div>libpthread 1</div><div>llistxattr 1</div><div>setfsid 1</div><div>spinlock 1</div><div>epoll.h 1</div><div>xattr.h 1</div><div>st_atim.tv_nsec 1</div><div>package-string: glusterfs 3.10.1</div><div>/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7fda0ab0b4f8]</div><div>/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7fda0ab14914]</div><div>/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7fda09ef9d40]</div><div>/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7fda0ab429e7]</div><div>/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7fda04986b4b]</div><div>/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fda0a8d5490]</div><div>/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7fda0a8d5777]</div><div>/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fda0a8d17d3]</div><div>/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x7194)[0x7fda05826194]</div><div>/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x9635)[0x7fda05828635]</div><div>/usr/lib/libglusterfs.so.0(+0x83db0)[0x7fda0ab64db0]</div><div>/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fda0a290182]</div><div>/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fda09fbd47d]</div><div>--------------</div><div><br></div><div>Error msg type 2</div><div><br></div><div>[2017-09-01 10:01:20.387248] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbe</div><div>rs [0 =&gt; 36]</div><div>[2017-09-01 10:01:20.392544] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES::</div><div> [Frequency: biweekly, Throttle: lazy]</div><div>[2017-09-01 10:01:20.392571] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is schedule</div><div>d to run at 2017-09-15 10:01:20</div><div>[2017-09-01 10:01:20.392727] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing</div><div>[2017-09-01 10:01:35.078694] I 
[bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called</div><div>[2017-09-01 10:01:35.078735] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing s</div><div>cheduled to run at 2017-09-01 10:01:36</div><div>[2017-09-01 10:01:36.355827] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-01 10:01:37.018622] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started a</div><div>t 2017-09-01 10:01:37</div><div>[2017-09-01 10:01:37.601774] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing</div><div>[2017-09-06 08:33:37.738627] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-06 08:33:39.812894] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed</div><div>[2017-09-06 08:33:41.828432] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scr</div><div>ubbers [36 =&gt; 0]</div><div>[2017-09-06 08:33:41.884031] I [MSGID: 118051] [bit-rot-ssm.c:80:br_scrub_ssm_state_stall] 0-glustervol-bit-rot-0: Volume is under ac</div><div>tive scrubbing. 
Pausing scrub..</div><div>[2017-09-06 08:34:26.477106] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-970: server <a href="http://192.168.0.21:49177">192.168.0.21:49177</a> has</div><div>not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:34:29.477438] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-980: server <a href="http://192.168.0.21:49178">192.168.0.21:49178</a> has</div><div>not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:34:37.478198] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1040: server <a href="http://192.168.0.21:49184">192.168.0.21:49184</a> has</div><div> not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:34:40.478550] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1070: server <a href="http://192.168.0.21:49187">192.168.0.21:49187</a> has</div><div> not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:34:56.480200] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-990: server <a href="http://192.168.0.21:49179">192.168.0.21:49179</a> has</div><div>not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:34:59.480520] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-760: server <a href="http://192.168.0.21:49156">192.168.0.21:49156</a> has</div><div>not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:35:01.480751] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1020: server <a href="http://192.168.0.21:49182">192.168.0.21:49182</a> has</div><div> not responded in the last 42 seconds, disconnecting.</div><div>[2017-09-06 08:35:05.481223] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-790: server <a href="http://192.168.0.21:49159">192.168.0.21:49159</a> has not responded in the last 42 seconds, 
disconnecting.</div><div>[2017-09-06 09:03:43.637208] E [rpc-clnt.c:200:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x8 sent = 2017-09-06 08:33:39.813002. timeout = 1800 for <a href="http://127.0.0.1:24007">127.0.0.1:24007</a></div><div>[2017-09-06 09:03:44.637338] E [rpc-clnt.c:200:call_bail] 0-glustervol-client-760: bailing out frame type(GlusterFS 3.3) op(READ(12)) xid = 0x160f941 sent = 2017-09-06 08:33:41.843336. timeout = 1800 for <a href="http://192.168.0.21:49156">192.168.0.21:49156</a></div><div>[2017-09-06 09:03:44.637726] W [MSGID: 114031] [client-rpc-fops.c:2992:client3_3_readv_cbk] 0-glustervol-client-760: remote operation failed [Transport endpoint is not connected]</div><div>pending frames:</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : 
type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>frame : type(0) op(0)</div><div>patchset: git://<a href="http://git.gluster.org/glusterfs.git">git.gluster.org/glusterfs.git</a></div><div>signal received: 11</div><div>time of crash:</div><div>2017-09-06 09:03:44</div><div>configuration details:</div><div>argp 1</div><div>backtrace 1</div><div>dlfcn 1</div><div>libpthread 1</div><div>llistxattr 1</div><div>setfsid 1</div><div>spinlock 1</div><div>epoll.h 1</div><div>xattr.h 1</div><div>st_atim.tv_nsec 1</div><div>package-string: glusterfs 3.10.1</div><div>/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7f26721934f8]</div><div>/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7f267219c914]</div><div>/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f2671581d40]</div><div>/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7f26721ca9e7]</div><div>/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7f2667dd3b4b]</div><div>/usr/lib/libgfrpc.so.0(+0xf92c)[0x7f2671f5c92c]</div><div>/usr/lib/libglusterfs.so.0(+0x36eb2)[0x7f267219feb2]</div><div>/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f2671918182]</div><div>/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f267164547d]</div><div>-------</div><div><br></div><div><br></div><div>My questions are:</div><div><br></div><div>1. To resume the scrub process, should I restart the glusterd service on the nodes where the scrub daemon is not running, or do a volume force start?</div><div><br></div><div>2. If resumed, will the scrub continue from where it stopped?</div><div><br></div><div>3. I assume that scrub by default sets its number of threads from the number of bricks on the node.
There should be an option in the gluster volume command to change this,</div><div>   because each of my nodes has 12 CPUs (Intel Xeon, 6 cores + HT) and the scrub drove CPU usage to 99% while it was running.</div><div>   Alternatively, it should be intelligent enough to scale down depending on the CPUs available on the node.</div><div><br></div><div>4. What caused this crash?</div><div><br></div><div><br></div><div>Regards,</div><div>Amudhan P</div></div>