[Gluster-users] pausing scrub crashed scrub daemon on nodes

FNU Raghavendra Manjunath rabhat at redhat.com
Wed Sep 13 16:20:58 UTC 2017


Hi Amudhan,

Replies inline.

On Fri, Sep 8, 2017 at 6:37 AM, Amudhan P <amudhan83 at gmail.com> wrote:

> Hi,
>
> I am using glusterfs 3.10.1 with 30 nodes each with 36 bricks and 10 nodes
> each with 16 bricks in a single cluster.
>
> By default I have paused the scrub process so that I can run it manually. The
> first time I tried scrub-on-demand it was running fine, but after some time I
> decided to pause the scrub process because of high CPU usage and users
> reporting that folder listings were taking a long time.
> However, pausing the scrub resulted in the messages below on some of the nodes.
> Also, I can see that the scrub daemon is no longer shown in the volume status
> for some nodes.
>
> Error msg type 1
> --
>
> [2017-09-01 10:04:45.840248] I [bit-rot.c:1683:notify]
> 0-glustervol-bit-rot-0: BitRot scrub ondemand called
> [2017-09-01 10:05:05.094948] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-01 10:05:06.401792] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-01 10:05:07.544524] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up]
> 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
> [2017-09-01 10:05:07.552893] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option]
> 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
> [2017-09-01 10:05:07.552942] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule]
> 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:05:07
> [2017-09-01 10:05:07.553457] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2017-09-01 10:05:20.953815] I [bit-rot.c:1683:notify]
> 0-glustervol-bit-rot-0: BitRot scrub ondemand called
> [2017-09-01 10:05:20.953845] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand]
> 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:05:21
> [2017-09-01 10:05:22.216937] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time]
> 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:05:22
> [2017-09-01 10:05:22.306307] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-01 10:05:24.684900] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2017-09-06 08:37:26.422267] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-06 08:37:28.351821] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-06 08:37:30.350786] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down]
> 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
> pending frames:
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> patchset: git://git.gluster.org/glusterfs.git
> signal received: 11
> time of crash:
> 2017-09-06 08:37:30
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.10.1
> /usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7fda0ab0b4f8]
> /usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7fda0ab14914]
> /lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7fda09ef9d40]
> /usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7fda0ab429e7]
> /usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7fda04986b4b]
> /usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fda0a8d5490]
> /usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7fda0a8d5777]
> /usr/lib/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fda0a8d17d3]
> /usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x7194)[0x7fda05826194]
> /usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x9635)[0x7fda05828635]
> /usr/lib/libglusterfs.so.0(+0x83db0)[0x7fda0ab64db0]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fda0a290182]
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fda09fbd47d]
> --------------
>
> Error msg type 2
>
> [2017-09-01 10:01:20.387248] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up]
> 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
> [2017-09-01 10:01:20.392544] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option]
> 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
> [2017-09-01 10:01:20.392571] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule]
> 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:01:20
> [2017-09-01 10:01:20.392727] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2017-09-01 10:01:35.078694] I [bit-rot.c:1683:notify]
> 0-glustervol-bit-rot-0: BitRot scrub ondemand called
> [2017-09-01 10:01:35.078735] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand]
> 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:01:36
> [2017-09-01 10:01:36.355827] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-01 10:01:37.018622] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time]
> 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:01:37
> [2017-09-01 10:01:37.601774] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2017-09-06 08:33:37.738627] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-06 08:33:39.812894] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec]
> 0-mgmt: Volume file changed
> [2017-09-06 08:33:41.828432] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down]
> 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
> [2017-09-06 08:33:41.884031] I [MSGID: 118051] [bit-rot-ssm.c:80:br_scrub_ssm_state_stall]
> 0-glustervol-bit-rot-0: Volume is under active scrubbing. Pausing scrub..
> [2017-09-06 08:34:26.477106] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-970: server 192.168.0.21:49177 has
> not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:34:29.477438] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-980: server 192.168.0.21:49178 has
> not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:34:37.478198] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-1040: server 192.168.0.21:49184 has
>  not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:34:40.478550] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-1070: server 192.168.0.21:49187 has
>  not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:34:56.480200] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-990: server 192.168.0.21:49179 has
> not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:34:59.480520] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-760: server 192.168.0.21:49156 has
> not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:35:01.480751] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-1020: server 192.168.0.21:49182 has
>  not responded in the last 42 seconds, disconnecting.
> [2017-09-06 08:35:05.481223] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired]
> 0-glustervol-client-790: server 192.168.0.21:49159 has not responded in
> the last 42 seconds, disconnecting.
> [2017-09-06 09:03:43.637208] E [rpc-clnt.c:200:call_bail] 0-glusterfs:
> bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x8 sent =
> 2017-09-06 08:33:39.813002. timeout = 1800 for 127.0.0.1:24007
> [2017-09-06 09:03:44.637338] E [rpc-clnt.c:200:call_bail]
> 0-glustervol-client-760: bailing out frame type(GlusterFS 3.3) op(READ(12))
> xid = 0x160f941 sent = 2017-09-06 08:33:41.843336. timeout = 1800 for
> 192.168.0.21:49156
> [2017-09-06 09:03:44.637726] W [MSGID: 114031] [client-rpc-fops.c:2992:client3_3_readv_cbk]
> 0-glustervol-client-760: remote operation failed [Transport endpoint is not
> connected]
> pending frames:
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> patchset: git://git.gluster.org/glusterfs.git
> signal received: 11
> time of crash:
> 2017-09-06 09:03:44
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.10.1
> /usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7f26721934f8]
> /usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7f267219c914]
> /lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f2671581d40]
> /usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7f26721ca9e7]
> /usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7f2667dd3b4b]
> /usr/lib/libgfrpc.so.0(+0xf92c)[0x7f2671f5c92c]
> /usr/lib/libglusterfs.so.0(+0x36eb2)[0x7f267219feb2]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f2671918182]
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f267164547d]
> -------
>
>
> My queries are below:
>
> 1. To resume the scrub process, should I restart the glusterd service on the
> nodes where the scrub daemon is not running, or do a volume force start?
>

To resume the scrub process, you can do a volume start force (gluster
volume start <volume name> force).

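As a concrete sketch (the volume name "glustervol" is assumed here from
the 0-glustervol-* log prefixes above; substitute your actual volume name):

```shell
# Respawn any daemons that are not running (including the scrub daemon)
# without restarting bricks that are already up.
gluster volume start glustervol force

# Confirm the scrub daemon row reappears for every node.
gluster volume status glustervol
```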

>
> 2. If resumed, will it start from where it stopped?
>

If resumed, it will start from the beginning. Resuming from where it left
off only happens when the scrubber continues after a pause; in this case the
scrubber process itself is being restarted, so it will scan the volume from
the beginning.
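After the daemon is back up, the scan can be re-triggered and watched from
the CLI; a sketch (volume name assumed, commands as in the 3.10 bitrot CLI):

```shell
# Kick off a fresh on-demand scrub once the scrub daemon is running again.
gluster volume bitrot glustervol scrub ondemand

# Check per-node progress (files scrubbed/skipped, last scrub time),
# which also confirms whether the scan restarted from the beginning.
gluster volume bitrot glustervol scrub status
```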


>
> 3. I am assuming scrub by default assigns threads based on the number of
> bricks in the node. An option to change this in the gluster volume command
> would be useful, because my nodes have 12 CPUs (Intel Xeon, 6 cores + HT)
> and while scrub was running it consumed 99% of all CPUs.
>    Alternatively, it should be intelligent enough to scale down depending
> on the CPUs available on the node.
>
>
The scrubber's threads are scaled up and down mainly based on the throttling
mode in use. By default the scrubber uses LAZY throttling. Have you changed
the throttling to a higher value (such as NORMAL or AGGRESSIVE)? Also, what
is the scrub frequency?
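For reference, both tunables can be set from the CLI. The values shown below
are the ones your logs report (LAZY, biweekly); the volume name is assumed:

```shell
# Throttle controls how aggressively scrubber threads are scheduled.
# Valid values: lazy | normal | aggressive
gluster volume bitrot glustervol scrub-throttle lazy

# Frequency controls how often a scheduled scrub runs.
# Valid values: hourly | daily | weekly | biweekly | monthly
gluster volume bitrot glustervol scrub-frequency biweekly
```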



> 4. Why did this crash occur?
>
>
>
Can you please provide the core file associated with the crash? It will
help us understand why the scrubber crashed.

Can you please provide the following information to analyze the issue
further?

A) Core file for the crash
B) Output of the following commands:
    "gluster volume info"
    "gluster volume status"
C) gluster logs from the node where the scrubber crashed (present in
/var/log/glusterfs)
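One way to collect all of this in one go (paths below are the usual defaults
and may differ on your systems):

```shell
# Capture the requested command outputs.
gluster volume info   > /tmp/gluster-volume-info.txt
gluster volume status > /tmp/gluster-volume-status.txt

# Core files: the location depends on kernel.core_pattern; check the
# common spots on the node where the scrubber crashed.
ls -l /core* /var/crash 2>/dev/null

# Bundle the gluster logs from that node.
tar -czf /tmp/gluster-logs.tar.gz /var/log/glusterfs
```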


Regards,
Raghavendra

> regards
> Amudhan P
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users
>

