[Gluster-users] Gluster 3.4.2 on Redhat 6.5

Mon Mar 24 11:54:41 UTC 2014

Hi Carlos,

Thanks for coming back to me... in response to your queries:

The PIDs are low: 1153 for glusterd, 1168 for glusterfsd, and 1318 and 1319 for the two glusterfs processes. So I'd agree, it doesn't seem that glusterd is crashing and being restarted.

As of this morning (Monday), top is reporting 1398 glusterd zombie processes.
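
A zombie is a child that has exited but whose parent has not yet collected its exit status, so grouping the zombies by parent PID shows which process is failing to reap them. A minimal sketch, assuming procps-style ps output with the columns pid, ppid, stat and comm; the function name zombies_by_parent is made up for illustration:

```shell
# Summarise zombie processes by parent.
# Reads rows of "pid ppid stat comm" on stdin and prints
# "count parent_pid command", highest count first.
zombies_by_parent() {
    awk '$3 ~ /^Z/ { count[$2 " " $4]++ }
         END { for (k in count) print count[k], k }' | sort -rn
}
```

It could be fed with "ps -e -o pid= -o ppid= -o stat= -o comm= | zombies_by_parent". If the parent PID shown is that of glusterd, the unreaped children belong to the daemon rather than to the check script's own shell.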

I have this problem on all four of my gluster nodes, and all four are monitored by the attached nagios plugin.

In terms of testing, I've prevented nagios from running the attached check script and restarted the glusterd process using "service glusterd restart". I've let it run for a few hours and haven't yet seen any zombie processes created. I think this is good as, for whatever reason, it appears to point at the nagios check script being the problem.

My next check was to run the nagios check once to see if it created a zombie process, and it did. So I started looking at the script. I forced the script to exit after the first command, "gluster volume heal audio info", and no zombie process was created. This pointed me to the second command, which takes the form below. I'm no expert on HERE documents in shell, but I think that may be causing the issue:
while read -r line; do
     field=($(echo $line))
     case ${field[0]} in
     Brick)
           brick=${field[@]:2}
           ;;
     Disk)
           key=${field[@]:0:3}
           if [ "${key}" = "Disk Space Free" ]; then
                freeunit=${field[@]:4}
                unit=${freeunit: -2}
                free=${freeunit%$unit}
                if [ "$unit" != "GB" ]; then
                     Exit UNKNOWN "Unknown disk space size $freeunit\n"
                fi
                if (( $(bc <<< "${free} < ${freegb}") == 1 )); then
                     freegb=$free
                fi
           fi
           ;;
     Online)
           online=${field[@]:2}
           if [ "${online}" = "Y" ]; then
                bricksfound=$((bricksfound + 1))
           else
                errors=("${errors[@]}" "$brick offline")
           fi
           ;;
     esac
done < <( sudo gluster volume status "${VOLUME}" detail)

Can anyone spot why this would be an issue?
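
For testing, the parsing can be separated from the gluster call. The loop above is fed by process substitution ("< <(...)") and does its comparison with a here-string into bc. The sketch below instead reads the status text on stdin, so it can be run against captured output. It swaps the bash arrays and bc for awk, the function name summarise_status is hypothetical, and the "Key : value" layout is assumed from the fields the loop reads. It is not known to change the zombie behaviour; it only makes the two steps testable on their own:

```shell
# Summarise `gluster volume status VOLUME detail` text read from stdin.
# Prints: online=<count> min_free_gb=<smallest GB value> offline=<bricks> unknown_unit=<0|1>
summarise_status() {
    awk '
        {
            # Split each line at the first colon into key and value.
            key = $0;   sub(/ *:.*$/, "", key);     sub(/^[ \t]+/, "", key)
            value = $0; sub(/^[^:]*: */, "", value); sub(/[ \t]+$/, "", value)
        }
        key == "Brick"  { brick = value }
        key == "Online" {
            if (value == "Y") online++
            else offline = offline "[" brick "]"
        }
        key == "Disk Space Free" {
            unit = substr(value, length(value) - 1)
            if (unit != "GB") { unknown = 1; next }   # same restriction as the loop above
            free = substr(value, 1, length(value) - 2) + 0
            if (min == "" || free < min) min = free
        }
        END {
            printf "online=%d min_free_gb=%s offline=%s unknown_unit=%d\n",
                   online, min, offline, unknown
        }
    '
}
```

One way to use it is to capture first and parse afterwards: status=$(sudo gluster volume status "${VOLUME}" detail), then printf '%s\n' "$status" | summarise_status. The command substitution returns only once the gluster CLI has exited, so a zombie that still appears cannot be blamed on the parsing.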

Thanks,
Steve

From: Carlos Capriotti [mailto:capriotti.carlos at gmail.com]
Sent: 22 March 2014 11:51
To: Steve Thomas
Cc: gluster-users at gluster.org
Subject: Re: [Gluster-users] Gluster 3.4.2 on Redhat 6.5

ok, let's see if we can gather more info.

I am not a specialist, but you know... another pair of eyes.

My system has a single glusterd process and it has a pretty low PID, meaning it has not crashed.

What is the PID of your glusterd? How many zombie processes does top report?

I've been running my preliminary tests with gluster for a little over a month now and have never seen this. My platform is CentOS 6.5, so, I'd say it is pretty similar.

From my perspective, even when making gluster sweat by running some intense rsync jobs in parallel, and seeing glusterd AND glusterfs take 120% of processing time in top (each on one core), they never crashed.

My zombie count, from top, is zero.

On the other hand, the other day one of my nodes was crashing a process every time I started a highly demanding task. It turned out I had (and still have) a hardware problem with one of the processors (or the main board; still undiagnosed).

Do you have this problem on one node only?

Is there any chance you have something special compiled into your kernel?

Any particularly memory-hungry tweaks in your sysctl settings?

Sounds like the system, not gluster.

KR,

Carlos

On Fri, Mar 21, 2014 at 10:29 PM, Steve Thomas <sthomas at rpstechnologysolutions.co.uk> wrote:
Hi all...

Further investigation shows in excess of 500 glusterd zombie processes on the box, and the count is continuing to climb.

Any suggestions? Am happy to provide logs etc to get to the bottom of this....

_____________________________________________
From: Steve Thomas
Sent: 21 March 2014 13:21
To: 'gluster-users at gluster.org'
Subject: Gluster 3.4.2 on Redhat 6.5

Hi,

I'm running Gluster 3.4.2 on Redhat 6.5 with 4 servers with a brick on each. This brick is mounted locally and used by apache to serve audio files for an IVR system. Each of these audio files is typically around 80-100Kb.

System appears to be working ok in terms of health and status via gluster CLI.

The system is monitored by nagios and there's a check for zombie processes and the gluster status. It appears that over a 24 hour period the number of zombie processes on the box has increased, and it is still increasing. On investigation, these are "glusterd" processes.

I'm making an assumption, but I suspect that the regular nagios checks are causing the increase in zombie processes, as they query the glusterd process. The commands that the nagios plugin runs are:

#Check heal status
gluster volume heal audio info

#Check volume status
gluster volume status audio detail
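
One way to test that assumption is to run each command on its own and compare the zombie count before and after. A rough sketch, assuming a procps-style ps; the names count_zombies, zombie_delta, ZOMBIE_COUNTER and SETTLE_SECONDS are all hypothetical:

```shell
# Count processes whose state starts with Z (zombie/defunct).
count_zombies() {
    ps -e -o stat= | awk '/^Z/ { n++ } END { print n + 0 }'
}

# Run a command and print how much the zombie count changed across it.
# ZOMBIE_COUNTER may name a different counting function (a hook for dry runs);
# SETTLE_SECONDS is the pause before the second count.
zombie_delta() {
    counter=${ZOMBIE_COUNTER:-count_zombies}
    before=$($counter)
    "$@" >/dev/null 2>&1
    sleep "${SETTLE_SECONDS:-1}"
    after=$($counter)
    echo "$((after - before))"
}
```

For example, "zombie_delta gluster volume heal audio info" followed by "zombie_delta gluster volume status audio detail". Other processes on the box can create or reap zombies during the run, so a single non-zero result is only a hint; repeating each command a few times gives a clearer picture.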

Does anyone have any suggestions as to why glusterd is producing these zombie processes?

Thanks for help in advance,

Steve

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

-------------- next part --------------
A non-text attachment was scrubbed...
Name: check_glusterfs.sh
Type: application/octet-stream
Size: 3731 bytes
Desc: check_glusterfs.sh
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20140324/19e1b83b/attachment.obj>