[Bugs] [Bug 1417050] [Stress] : SHD Logs flooded with "Heal Failed" messages, filling up "/" quickly

bugzilla at redhat.com bugzilla at redhat.com
Fri Jan 27 05:56:14 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1417050



--- Comment #1 from Ashish Pandey <aspandey at redhat.com> ---
Description of problem:
-----------------------

4-node cluster, 1 EC volume exported via Ganesha.


Replaced 4 bricks in different subvolumes, which triggered a heal.

Data was being populated from multiple Ganesha mounts while heal happened.


The SHD and ganesha-gfapi logs were flooded with "Heal failed" warnings:

The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:51.836831] and [2017-01-22 09:58:52.109527]
[2017-01-22 09:58:52.225738] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
[2017-01-22 09:58:52.228903] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 8 times between [2017-01-22 09:58:52.066587] and [2017-01-22 09:58:52.328220]
[2017-01-22 09:58:52.340050] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.379059] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
The message "W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]" repeated 4 times between [2017-01-22 09:58:52.340050] and [2017-01-22 09:58:52.453352]
[2017-01-22 09:58:52.461780] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-6: Heal failed [Invalid argument]
[2017-01-22 09:58:52.495767] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.544095] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]
[2017-01-22 09:58:52.584513] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-0: Heal failed [Invalid argument]


The message was logged more than 1,000,000 times per node in the last 24 hours:

[root@gqas015 ~]# cat /var/log/ganesha-gfapi.log |grep -i "Heal fail"|wc -l
1433241

[root@gqas009 ~]# cat /var/log/glusterfs/glustershd.log|grep -i "Heal fail"|wc -l
1075672
[root@gqas009 ~]# 
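The excerpt above shows failures on several disperse subvolumes (disperse-0, disperse-4 and disperse-6). A minimal sketch for breaking the totals down per subvolume, assuming single-line log entries shaped like the excerpt; `heal_fail_by_subvol` is a hypothetical helper, not a GlusterFS tool. Like the `wc -l` counts above, it counts matching lines, so each "repeated N times" summary line is counted once.

```shell
#!/bin/sh
# heal_fail_by_subvol: hypothetical helper, not shipped with GlusterFS.
# Counts "Heal failed" log lines per disperse subvolume, assuming the
# single-line log format shown in the excerpt, e.g.
#   [...] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-testvol-disperse-4: Heal failed [Invalid argument]
heal_fail_by_subvol() {
  # -o prints only the matched text, e.g. "0-testvol-disperse-4: Heal failed".
  # A summary line ("The message ... repeated N times ...") matches once.
  grep -o '[^ ]*-disperse-[0-9]*: Heal failed' "$1" \
    | cut -d: -f1 \
    | sort | uniq -c | sort -rn
  # Output: one "<count> <subvolume>" line per subvolume, highest count first.
}
```

For example, after sourcing the function, `heal_fail_by_subvol /var/log/glusterfs/glustershd.log` would list the affected subvolumes by line count; comparing that list against the subvolumes whose bricks were replaced may help narrow down where the failures originate.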


The logs are filling up / quickly:


[root@gqas015 ~]# ll -h /var/log/glusterfs/glustershd.log
-rw------- 1 root root 403M Jan 23 01:57 /var/log/glusterfs/glustershd.log
[root@gqas015 ~]# 
[root@gqas015 ~]# ll -h /var/log/ganesha-gfapi.log 
-rw------- 1 root root 538M Jan 23 00:59 /var/log/ganesha-gfapi.log
[root@gqas015 ~]# 



*********************************
EXACT WORKLOAD on Ganesha mounts 
*********************************

Smallfile appends, ll -R, tarball untar



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------

Reporting the first occurrence.


Actual results:
---------------

Logs spammed with "Heal failed" warnings.

Expected results:
-----------------

No log spamming.
