[Gluster-devel] Re: cp taking 100% cpu and never terminating

Mickey Mazarick mic at digitaltadpole.com
Sat Jun 14 20:12:56 UTC 2008

I'm still seeing the problem described below. It happens mostly over the 
ibverbs transport and only very infrequently over tcp. It is 
intermittent, but quite frequent over ibverbs. The hung process uses all 
the processing power of a single core on the client machine. I can 
repeat the command, but eventually the machine locks up with every 
processor stuck in a cp or tar command. We see it on both kernel 2.6.18 
and 2.6.24. Has anyone there been able to replicate it?

-Mickey Mazarick

Mickey Mazarick wrote:
> Something odd is happening when I run a shell script with cp commands 
> in it. This happens infrequently, but I have to reboot the system to 
> get my processor back. I'm never tarring or copying more than 50 megs 
> of data.
> It hangs either on a command like:
> cp --reply=yes /usr/src/linux-${kernver}/.config /tftpboot/node_root/boot/config-${kernver}
> or
> tar cf - etc | gzip > /tftpboot/node_root/drbl_ssi/template_etc.tgz
> When I run top I see:
> 1603 root      20   0 54160 1616  508 R  100  0.0  33:02.72 cp
> (100% cpu time)
> I'm unable to kill that process in any way, but I can kill the shell 
> script that spawned it. The cp command keeps running afterwards.
> I see the errors below on the client:
> 2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: : returning EBADFD
> 2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1: (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
> 2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no valid fd found, returning
> 2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no valid fd found, returning
> My client and server specs are identical to:
> http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3 
> This happens equally over ib-verbs and tcp transports.
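
A note on the quoted log: the op_errno=77 reported by afr_flush_cbk looks 
like the same error as the EBADFD returned by client_flush, since 77 is 
EBADFD on Linux. Below is a minimal Python sketch for checking that 
mapping on a client machine. The helper name describe_errno is made up 
for illustration and is not part of GlusterFS, and the numeric mapping is 
platform-specific, so the result only applies to the machine it is run on.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'EBADFD'.
    # The table differs between platforms, so fall back to a placeholder
    # when the value has no name here.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 77 is the op_errno value logged by afr_flush_cbk in the quoted message.
    name, message = describe_errno(77)
    print(f"op_errno=77 -> {name}: {message}")
```

If this reports EBADFD on the client, both log lines describe the same 
failure: the flush was attempted on a file descriptor the client 
translator no longer considers valid.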


