[Gluster-devel] [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Thu Mar 21 10:48:28 UTC 2019

On Thu, Mar 21, 2019 at 4:10 PM Mauro Tridici <mauro.tridici at cmcc.it> wrote:

> Hi Raghavendra,
>
> the number of errors reduced, but during last days I received some error
> notifications from Nagios server similar to the following one:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> ****** Nagios *****Notification Type: PROBLEMService: Brick -
> /gluster/mnt5/brickHost: s04Address: s04-stgState: CRITICALDate/Time: Mon
> Mar 18 19:56:36 CET 2019Additional Info:CHECK_NRPE STATE CRITICAL: Socket
> timeout after 10 seconds.*
>
> The error was related only to s04 gluster server.
>
> So, following your suggestions,  I executed, on s04 node, the top command.
> In attachment, you can find the related output.
>

top output doesn't contain cmd/thread names. Was there anything wrong.

> Thank you very much for your help.
> Regards,
> Mauro
>
>
>
> On 14 Mar 2019, at 13:31, Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
> Thanks Mauro.
>
> On Thu, Mar 14, 2019 at 3:38 PM Mauro Tridici <mauro.tridici at cmcc.it>
> wrote:
>
>> Hi Raghavendra,
>>
>> I just changed the client option value to 8.
>> I will check the volume behaviour during the next hours.
>>
>> The GlusterFS version is 3.12.14.
>>
>> I will provide you the logs as soon as the activity load will be high.
>> Thank you,
>> Mauro
>>
>> On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <rgowdapp at redhat.com>
>> wrote:
>>
>>
>>
>> On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <mauro.tridici at cmcc.it>
>> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> Yes, server.event-thread has been changed from 4 to 8.
>>>
>>
>> Was client.event-thread value too changed to 8? If not, I would like to
>> know the results of including this tuning too. Also, if possible, can you
>> get the output of following command from problematic clients and bricks
>> (during the duration when load tends to be high and ping-timer-expiry is
>> seen)?
>>
>> # top -bHd 3
>>
>> This will help us to know  CPU utilization of event-threads.
>>
>> And I forgot to ask, what version of Glusterfs are you using?
>>
>> During last days, I noticed that the error events are still here although
>>> they have been considerably reduced.
>>>
>>> So, I used grep command against the log files in order to provide you a
>>> global vision about the warning, error and critical events appeared today
>>> at 06:xx (may be useful I hope).
>>> I collected the info from s06 gluster server, but the behaviour is the
>>> the almost the same on the other gluster servers.
>>>
>>> *ERRORS:  *
>>> *CWD: /var/log/glusterfs *
>>> *COMMAND: grep " E " *.log |grep "2019-03-13 06:"*
>>>
>>> (I can see a lot of this kind of message in the same period but I'm
>>> notifying you only one record for each type of error)
>>>
>>> glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042]
>>> [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of
>>> /var/run/gluster/tier2_quota_list/
>>>
>>> glustershd.log:[2019-03-13 06:14:28.666562] E
>>> [rpc-clnt.c:350:saved_frames_unwind] (-->
>>> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (-->
>>> /lib64/libgfr
>>> pc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (-->
>>> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (-->
>>> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup
>>> +0x90)[0x7f4a71ba3640] (-->
>>> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] )))))
>>> 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3)
>>> op(INODELK(29))
>>> called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50)
>>>
>>> glustershd.log:[2019-03-13 06:17:48.883825] E
>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to
>>> 192.168.0.55:49158 failed (Connection timed out); disco
>>> nnecting socket
>>> glustershd.log:[2019-03-13 06:19:58.931798] E
>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to
>>> 192.168.0.55:49158 failed (Connection timed out); disco
>>> nnecting socket
>>> glustershd.log:[2019-03-13 06:22:08.979829] E
>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to
>>> 192.168.0.55:49158 failed (Connection timed out); disco
>>> nnecting socket
>>> glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031]
>>> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote
>>> operation failed [Transport endpoint
>>> is not connected]
>>> glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031]
>>> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote
>>> operation failed [Transport endpoint
>>> is not connected]
>>> glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031]
>>> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote
>>> operation failed [Transport endpoint
>>> is not connected]
>>>
>>> *WARNINGS:*
>>> *CWD: /var/log/glusterfs *
>>> *COMMAND: grep " W " *.log |grep "2019-03-13 06:"*
>>>
>>> (I can see a lot of this kind of message in the same period but I'm
>>> notifying you only one record for each type of warnings)
>>>
>>> glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031]
>>> [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote
>>> operation failed. Path: <gfid:b6b35d
>>> 0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e).
>>> Key: (null) [Transport endpoint is not connected]
>>>
>>> glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035]
>>> [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation
>>> with some subvolumes unavailable (2)
>>>
>>> glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032]
>>> [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get
>>> index-dir on tier2-client-55 [Operation
>>> now in progress]
>>>
>>> quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002]
>>> [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is
>>> deprecated, preferred is 'trans
>>> port.address-family', continuing with correction
>>> quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174]
>>> [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option
>>> 'parallel-readdir' is not recognized
>>> quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W
>>> [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25)
>>> [0x7f340892be25] -->/usr/sbin/glusterfs(gluste
>>> rfs_sigwaiter+0xe5) [0x55ef010164b5]
>>> -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-:
>>> received signum (15), shutting down
>>>
>>> *CRITICALS:*
>>> *CWD: /var/log/glusterfs *
>>> *COMMAND: grep " C " *.log |grep "2019-03-13 06:"*
>>>
>>> no critical errors at 06:xx
>>> only one critical error during the day
>>>
>>> *[root at s06 glusterfs]# grep " C " *.log |grep "2019-03-13"*
>>> glustershd.log:[2019-03-13 02:21:29.126279] C
>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server
>>> 192.168.0.55:49158 has not responded in the last 42 seconds,
>>> disconnecting.
>>>
>>>
>>> Thank you very much for your help.
>>> Regards,
>>> Mauro
>>>
>>> On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <rgowdapp at redhat.com>
>>> wrote:
>>>
>>> Was the suggestion to increase server.event-thread values tried? If yes,
>>> what were the results?
>>>
>>> On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <mauro.tridici at cmcc.it>
>>> wrote:
>>>
>>>> Dear All,
>>>>
>>>> do you have any suggestions about the right way to "debug" this issue?
>>>> In attachment, the updated logs of “s06" gluster server.
>>>>
>>>> I noticed a lot of intermittent warning and error messages.
>>>>
>>>> Thank you in advance,
>>>> Mauro
>>>>
>>>>
>>>>
>>>> On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <rgowdapp at redhat.com>
>>>> wrote:
>>>>
>>>>
>>>> +Gluster Devel <gluster-devel at gluster.org>, +Gluster-users
>>>> <gluster-users at gluster.org>
>>>>
>>>> I would like to point out another issue. Even if what I suggested
>>>> prevents disconnects, part of the solution would be only symptomatic
>>>> treatment and doesn't address the root cause of the problem. In most of the
>>>> ping-timer-expiry issues, the root cause is the increased load on bricks
>>>> and the inability of bricks to be responsive under high load. So, the
>>>> actual solution would be doing any or both of the following:
>>>> * identify the source of increased load and if possible throttle it.
>>>> Internal heal processes like self-heal, rebalance, quota heal are known to
>>>> pump traffic into bricks without much throttling (io-threads _might_ do
>>>> some throttling, but my understanding is its not sufficient).
>>>> * identify the reason for bricks to become unresponsive during load.
>>>> This may be fixable issues like not enough event-threads to read from
>>>> network or difficult to fix issues like fsync on backend fs freezing the
>>>> process or semi fixable issues (in code) like lock contention.
>>>>
>>>> So any genuine effort to fix ping-timer-issues (to be honest most of
>>>> the times they are not issues related to rpc/network) would involve
>>>> performance characterization of various subsystems on bricks and clients.
>>>> Various subsystems can include (but not necessarily limited to), underlying
>>>> OS/filesystem, glusterfs processes, CPU consumption etc
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <mauro.tridici at cmcc.it>
>>>> wrote:
>>>>
>>>>> Thank you, let’s try!
>>>>> I will inform you about the effects of the change.
>>>>>
>>>>> Regards,
>>>>> Mauro
>>>>>
>>>>> On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <rgowdapp at redhat.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <mauro.tridici at cmcc.it>
>>>>> wrote:
>>>>>
>>>>>> Hi Raghavendra,
>>>>>>
>>>>>> thank you for your reply.
>>>>>> Yes, you are right. It is a problem that seems to happen randomly.
>>>>>> At this moment, server.event-threads value is 4. I will try to
>>>>>> increase this value to 8. Do you think that it could be a valid value ?
>>>>>>
>>>>>
>>>>> Yes. We can try with that. You should see at least frequency of
>>>>> ping-timer related disconnects  reduce with this value (even if it doesn't
>>>>> eliminate the problem completely).
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Mauro
>>>>>>
>>>>>>
>>>>>> On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <rgowdapp at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <
>>>>>> nbalacha at redhat.com> wrote:
>>>>>>
>>>>>>> Hi Mauro,
>>>>>>>
>>>>>>> It looks like some problem on s06. Are all your other nodes ok? Can
>>>>>>> you send us the gluster logs from this node?
>>>>>>>
>>>>>>> @Raghavendra G <raghavendra at gluster.com> , do you have any idea as
>>>>>>> to how this can be debugged? Maybe running top ? Or debug brick logs?
>>>>>>>
>>>>>>
>>>>>> If we can reproduce the problem, collecting tcpdump on both ends of
>>>>>> connection will help. But, one common problem is these bugs are
>>>>>> inconsistently reproducible and hence we may not be able to capture tcpdump
>>>>>> at correct intervals. Other than that, we can try to collect some evidence
>>>>>> that poller threads were busy (waiting on locks). But, not sure what debug
>>>>>> data provides that information.
>>>>>>
>>>>>> From what I know, its difficult to collect evidence for this issue
>>>>>> and we could only reason about it.
>>>>>>
>>>>>> We can try a workaround though - try increasing server.event-threads
>>>>>> and see whether ping-timer expiry issues go away with an optimal value. If
>>>>>> that's the case, it kind of provides proof for our hypothesis.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Nithya
>>>>>>>
>>>>>>> On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <mauro.tridici at cmcc.it>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> some minutes ago I received this message from NAGIOS server
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ****** Nagios *****Notification Type: PROBLEMService: Brick -
>>>>>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState: CRITICALDate/Time: Mon
>>>>>>>> Mar 4 10:25:33 CET 2019Additional Info:CHECK_NRPE STATE CRITICAL: Socket
>>>>>>>> timeout after 10 seconds.*
>>>>>>>>
>>>>>>>> I checked the network, RAM and CPUs usage on s06 node and
>>>>>>>> everything seems to be ok.
>>>>>>>> No bricks are in error state. In /var/log/messages, I detected
>>>>>>>> again a crash of “check_vol_utili” that I think it is a module used by NRPE
>>>>>>>> executable (that is the NAGIOS client).
>>>>>>>>
>>>>>>>> Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general
>>>>>>>> protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in
>>>>>>>> libglusterfs.so.0.0.1[7facff9b7000+f7000]
>>>>>>>> Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of
>>>>>>>> user 0 killed by SIGSEGV - dumping core
>>>>>>>> Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar  4 10:15:29 s06 abrt-server: Error: Unable to open
>>>>>>>> './coredump': No such file or directory
>>>>>>>> Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
>>>>>>>> Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
>>>>>>>> Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR:
>>>>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: Deleting problem directory
>>>>>>>> ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: Error: Unable to open
>>>>>>>> './coredump': No such file or directory
>>>>>>>> Mar  4 10:16:24 s06 abrt-server: Cannot notify
>>>>>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event
>>>>>>>> 'report_uReport' exited with 1
>>>>>>>> Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of
>>>>>>>> user 0 killed by SIGABRT - dumping core
>>>>>>>> Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar  4 10:16:25 s06 abrt-server: Error: Unable to open
>>>>>>>> './coredump': No such file or directory
>>>>>>>> Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.
>>>>>>>>
>>>>>>>> Also, I noticed the following errors that I think are very critical:
>>>>>>>>
>>>>>>>> Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55:
>>>>>>>> server 192.168.0.55:49158 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60:
>>>>>>>> server 192.168.0.54:49165 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
>>>>>>>> Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
>>>>>>>> Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16:
>>>>>>>> server 192.168.0.52:49158 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C
>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server
>>>>>>>> 192.168.0.51:49153 has not responded in the last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C
>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server
>>>>>>>> 192.168.0.52:49156 has not responded in the last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C
>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server
>>>>>>>> 192.168.0.51:49159 has not responded in the last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C
>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server
>>>>>>>> 192.168.0.52:49161 has not responded in the last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C
>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server
>>>>>>>> 192.168.0.54:49165 has not responded in the last 42 seconds,
>>>>>>>> disconnecting.
>>>>>>>> Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45:
>>>>>>>> server 192.168.0.54:49155 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25:
>>>>>>>> server 192.168.0.52:49161 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20:
>>>>>>>> server 192.168.0.53:49159 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
>>>>>>>> Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
>>>>>>>> Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
>>>>>>>> Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
>>>>>>>> Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
>>>>>>>> Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL
>>>>>>>> handshake with 192.168.1.56: 5
>>>>>>>> Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
>>>>>>>> Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
>>>>>>>> Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
>>>>>>>> Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
>>>>>>>> Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
>>>>>>>> Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.
>>>>>>>>
>>>>>>>> But, unfortunately, I don’t understand why it is happening.
>>>>>>>> Now, NAGIOS server shows that s06 status is ok:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ****** Nagios *****Notification Type: RECOVERYService: Brick -
>>>>>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState: OKDate/Time: Mon Mar 4
>>>>>>>> 10:35:23 CET 2019Additional Info:OK: Brick /gluster/mnt2/brick is up*
>>>>>>>>
>>>>>>>> Nothing is changed from RAM, CPUs, and NETWORK point of view.
>>>>>>>> /var/log/message file has been updated:
>>>>>>>>
>>>>>>>> Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
>>>>>>>> Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10:
>>>>>>>> server 192.168.0.52:49156 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66:
>>>>>>>> server 192.168.0.54:49167 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
>>>>>>>> Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
>>>>>>>> Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
>>>>>>>> Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
>>>>>>>> Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from
>>>>>>>> client 192.168.1.56, bailing out...
>>>>>>>> Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
>>>>>>>> Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
>>>>>>>> Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.
>>>>>>>>
>>>>>>>> Could you please help me to understand what it’s happening ?
>>>>>>>> Thank you in advance.
>>>>>>>>
>>>>>>>> Rergards,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 12:17, Mauro Tridici <mauro.tridici at cmcc.it>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you, Milind.
>>>>>>>> I executed the instructions you suggested:
>>>>>>>>
>>>>>>>> - grep “blocked for” /var/log/messages on s06 returns no output (no
>>>>>>>> “blocked” word is detected in messages file);
>>>>>>>> - in /var/log/messages file I can see this kind of error repeated
>>>>>>>> for a lot of times:
>>>>>>>>
>>>>>>>> Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
>>>>>>>> Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general
>>>>>>>> protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in
>>>>>>>> libglusterfs.so.0.0.1[7f88e769b000+f7000]
>>>>>>>> Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of
>>>>>>>> user 0 killed by SIGSEGV - dumping core
>>>>>>>> Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar  1 08:43:02 s06 abrt-server: Error: Unable to open
>>>>>>>> './coredump': No such file or directory
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR:
>>>>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: Deleting problem directory
>>>>>>>> ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service
>>>>>>>> name='org.freedesktop.problems' (using servicehelper)
>>>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated
>>>>>>>> service 'org.freedesktop.problems'
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: Error: Unable to open
>>>>>>>> './coredump': No such file or directory
>>>>>>>> Mar  1 08:43:58 s06 abrt-server: Cannot notify
>>>>>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event
>>>>>>>> 'report_uReport' exited with 1
>>>>>>>> Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
>>>>>>>> Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
>>>>>>>> Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>>
>>>>>>>> - in /var/log/messages file I can see also 4 errors related to
>>>>>>>> other cluster servers:
>>>>>>>>
>>>>>>>> Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
>>>>>>>> Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
>>>>>>>> Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33:
>>>>>>>> server 192.168.0.51:49163 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1:
>>>>>>>> server 192.168.0.52:49153 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7:
>>>>>>>> server 192.168.0.52:49155 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259]
>>>>>>>> C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0:
>>>>>>>> server 192.168.0.51:49152 has not responded in the last 42
>>>>>>>> seconds, disconnecting.
>>>>>>>> Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.
>>>>>>>>
>>>>>>>> No “blocked” word is in /var/log/messages files on other cluster
>>>>>>>> servers.
>>>>>>>> In attachment, the /var/log/messages file from s06 server.
>>>>>>>>
>>>>>>>> Thank you in advance,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>> <messages.zip>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 11:47, Milind Changire <mchangir at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The traces of very high disk activity on the servers are often
>>>>>>>> found in /var/log/messages
>>>>>>>> You might want to grep for "blocked for" in /var/log/messages on
>>>>>>>> s06 and correlate the timestamps to confirm the unresponsiveness as
>>>>>>>> reported in gluster client logs.
>>>>>>>> In cases of high disk activity, although the operating system
>>>>>>>> continues to respond to ICMP pings, the processes writing to disks often
>>>>>>>> get blocked to a large flush to the disk which could span beyond 42 seconds
>>>>>>>> and hence result in ping-timer-expiry logs.
>>>>>>>>
>>>>>>>> As a side note:
>>>>>>>> If you indeed find gluster processes being blocked in
>>>>>>>> /var/log/messages, you might want to tweak sysctl tunables called
>>>>>>>> vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value
>>>>>>>> than the existing. Please read up more on those tunables before touching
>>>>>>>> the settings.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <mauro.tridici at cmcc.it>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> in attachment the client log captured after changing
>>>>>>>>> network.ping-timeout option.
>>>>>>>>> I noticed this error involving server 192.168.0.56 (s06)
>>>>>>>>>
>>>>>>>>> [2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig]
>>>>>>>>> 0-tier2-client-71: changing ping timeout to 42 (from 0)
>>>>>>>>> [2019-03-01 09:23:36.078213] I
>>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>>>> volfile,continuing
>>>>>>>>> [2019-03-01 09:23:36.078432] I
>>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>>>> volfile,continuing
>>>>>>>>> [2019-03-01 09:23:36.092357] I
>>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>>>> volfile,continuing
>>>>>>>>> [2019-03-01 09:23:36.094146] I
>>>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>>>> volfile,continuing
>>>>>>>>> [2019-03-01 10:06:24.708082] C
>>>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server
>>>>>>>>> 192.168.0.56:49156 has not responded in the last 42 seconds,
>>>>>>>>> disconnecting.
>>>>>>>>>
>>>>>>>>> I don’t know why it happens, s06 server seems to be reachable.
>>>>>>>>>
>>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>>>> Trying 192.168.0.56...
>>>>>>>>> Connected to 192.168.0.56.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>> ^CConnection closed by foreign host.
>>>>>>>>> [athena_login2][/users/home/sysm02/]> ping 192.168.0.56
>>>>>>>>> PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
>>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
>>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms
>>>>>>>>>
>>>>>>>>> --- 192.168.0.56 ping statistics ---
>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 1528ms
>>>>>>>>> rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms
>>>>>>>>>
>>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>>>> Trying 192.168.0.56...
>>>>>>>>> Connected to 192.168.0.56.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>>
>>>>>>>>> Thank you for your help,
>>>>>>>>> Mauro
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1 Mar 2019, at 10:29, Mauro Tridici <mauro.tridici at cmcc.it>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> thank you for the explanation.
>>>>>>>>> I just changed network.ping-timeout option to default value
>>>>>>>>> (network.ping-timeout=42).
>>>>>>>>>
>>>>>>>>> I will check the logs to see if the errors will appear again.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mauro
>>>>>>>>>
>>>>>>>>> On 1 Mar 2019, at 04:43, Milind Changire <mchangir at redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> network.ping-timeout should not be set to zero for non-glusterd
>>>>>>>>> clients.
>>>>>>>>> glusterd is a special case for which ping-timeout is set to zero
>>>>>>>>> via /etc/glusterfs/glusterd.vol
>>>>>>>>>
>>>>>>>>> Setting network.ping-timeout to zero disables arming of the ping
>>>>>>>>> timer for connections. This disables testing the connection for
>>>>>>>>> responsiveness and hence avoids proactive fail-over.
>>>>>>>>>
>>>>>>>>> Please reset network.ping-timeout to a non-zero positive value,
>>>>>>>>> eg. 42
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <
>>>>>>>>> nbalacha at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> Adding Raghavendra and Milind to comment on this.
>>>>>>>>>>
>>>>>>>>>> What is the effect of setting network.ping-timeout to 0 and
>>>>>>>>>> should it be set back to 42?
>>>>>>>>>> Regards,
>>>>>>>>>> Nithya
>>>>>>>>>>
>>>>>>>>>> On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <
>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>>
>>>>>>>>>>> sorry for the late.
>>>>>>>>>>> network.ping-timeout has been set to 0 in order to try to solve
>>>>>>>>>>> some timeout problems, but it didn’t help.
>>>>>>>>>>> I can set it to the default value.
>>>>>>>>>>>
>>>>>>>>>>> Can I proceed with the change?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Mauro
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 28 Feb 2019, at 04:41, Nithya Balachandran <
>>>>>>>>>>> nbalacha at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>>
>>>>>>>>>>> Is network.ping-timeout still set to 0. The default value is 42.
>>>>>>>>>>> Is there a particular reason why this was changed?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Nithya
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <
>>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Xavi,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for the detailed explanation and suggestions.
>>>>>>>>>>>> Yes, transport.listen-backlog option is still set to 1024.
>>>>>>>>>>>>
>>>>>>>>>>>> I will check the network and connectivity status using “ping”
>>>>>>>>>>>> and “telnet” as soon as the errors will come back again.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Mauro
>>>>>>>>>>>>
>>>>>>>>>>>> Il giorno 27 feb 2019, alle ore 16:42, Xavi Hernandez <
>>>>>>>>>>>> jahernan at redhat.com> ha scritto:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>>>
>>>>>>>>>>>> those errors say that the mount point is not connected to some
>>>>>>>>>>>> of the bricks while executing operations. I see references to 3rd and 6th
>>>>>>>>>>>> bricks of several disperse sets, which seem to map to server s06. For some
>>>>>>>>>>>> reason, gluster is having troubles connecting from the client machine to
>>>>>>>>>>>> that particular server. At the end of the log I see that after long time a
>>>>>>>>>>>> reconnect is done to both of them. However little after, other bricks from
>>>>>>>>>>>> the s05 get disconnected and a reconnect times out.
>>>>>>>>>>>>
>>>>>>>>>>>> That's really odd. It seems like if server/communication is cut
>>>>>>>>>>>> to s06 for some time, then restored, and then the same happens to the next
>>>>>>>>>>>> server.
>>>>>>>>>>>>
>>>>>>>>>>>> If the servers are really online and it's only a communication
>>>>>>>>>>>> issue, it explains why server memory and network has increased: if the
>>>>>>>>>>>> problem only exists between the client and servers, any write made by the
>>>>>>>>>>>> client will automatically mark the file as damaged, since some of the
>>>>>>>>>>>> servers have not been updated. Since self-heal runs from the server nodes,
>>>>>>>>>>>> they will probably be correctly connected to all bricks, which allows them
>>>>>>>>>>>> to heal the just damaged file, which increases memory and network usage.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess you still have transport.listen-backlog set to 1024,
>>>>>>>>>>>> right ?
>>>>>>>>>>>>
>>>>>>>>>>>> Just to try to identify if the problem really comes from
>>>>>>>>>>>> network, can you check if you lose some pings from the client to all of the
>>>>>>>>>>>> servers while you are seeing those errors in the log file ?
>>>>>>>>>>>>
>>>>>>>>>>>> You can also check if during those errors, you can telnet to
>>>>>>>>>>>> the port of the brick from the client.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <
>>>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> “df -h” operation is not still slow, but no users are using
>>>>>>>>>>>>> the volume, RAM and NETWORK usage is ok on the client node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was worried about this kind of warnings/errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-25 10:59:00.664323] W [MSGID: 122035]
>>>>>>>>>>>>> [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation
>>>>>>>>>>>>> with some subvolumes unavailable (20)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-26 03:11:35.212603] E
>>>>>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>>> 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>>> called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-26 03:13:03.313831] E
>>>>>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection to
>>>>>>>>>>>>> 192.168.0.56:49156 failed (Timeout della connessione);
>>>>>>>>>>>>> disconnecting socket
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems that some subvolumes are not available and
>>>>>>>>>>>>> 192.168.0.56 server (s06) is not reachable.
>>>>>>>>>>>>> But gluster servers are up&running and bricks are ok.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In attachment the updated tier2.log file.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Mauro
>>>>>>>>>>>>>
>>>>>>>>>>>>> Il giorno 26 feb 2019, alle ore 04:03, Nithya Balachandran <
>>>>>>>>>>>>> nbalacha at redhat.com> ha scritto:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a lot of EC messages in the log but they don't seem very
>>>>>>>>>>>>> serious. Xavi, can you take a look?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only errors I see are:
>>>>>>>>>>>>> [2019-02-25 10:58:45.519871] E
>>>>>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>>> 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>>> called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
>>>>>>>>>>>>> [2019-02-25 10:58:51.461493] E
>>>>>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (-->
>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] )))))
>>>>>>>>>>>>> 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13))
>>>>>>>>>>>>> called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
>>>>>>>>>>>>> [2019-02-25 11:07:57.152874] E
>>>>>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to
>>>>>>>>>>>>> 192.168.0.55:49163 failed (Timeout della connessione);
>>>>>>>>>>>>> disconnecting socket
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the df -h operation still slow? If yes, can you take a
>>>>>>>>>>>>> tcpdump of the client while running df -h and send that across?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Nithya
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <
>>>>>>>>>>>>> mauro.tridici at cmcc.it> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry, some minutes after my last mail message, I noticed
>>>>>>>>>>>>>> that “df -h” command hanged for a while before returns the prompt.
>>>>>>>>>>>>>> Yesterday, everything was ok in the gluster client log, but,
>>>>>>>>>>>>>> today, I see a lot of errors (please, take a look to the attached file).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the client node, I detected an important RAM e NETWORK
>>>>>>>>>>>>>> usage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you think that the errors have been caused by the client
>>>>>>>>>>>>>> resources usage?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>> Mauro
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-devel/attachments/20190321/2c44acfc/attachment-0001.html>