[Gluster-devel] Re: cp taking 100% cpu and never terminating

Mon Jun 16 11:50:17 UTC 2008

--reply=yes just answers "yes" to the "would you like to overwrite 
file XXX" prompts.
I'm most afraid this is some fuse bug but surely someone would have 
reported it elsewhere. The cp is not in state R+ which is weird but even 
a kill -9 PID won't get rid of it.
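
For anyone trying to confirm the same symptom, here is a minimal sketch for reading the state of the stuck process straight from /proc. It assumes a Linux /proc layout; the function name proc_state is made up for illustration. A SIGKILL is only acted on when the task returns from kernel code, so a cp that stays in state R and ignores kill -9 is probably spinning inside the kernel (possibly in fuse) rather than in cp itself.

```shell
#!/bin/sh
# Usage: proc_state.sh <pid>
# Prints the scheduler state and kernel wait channel of a process.
# proc_state is a hypothetical helper name, not part of any tool.

proc_state() {
    pid=$1
    # "State:" line of /proc/<pid>/status, e.g. "R (running)" or "D (disk sleep)"
    state=$(sed -n 's/^State:[[:space:]]*//p' "/proc/$pid/status")
    # Kernel symbol the task is blocked in; usually "0" while it is running
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    echo "pid=$pid state=$state wchan=${wchan:-unknown}"
}

# Only run when a pid was given on the command line
if [ "$#" -ge 1 ]; then
    proc_state "$1"
fi
```

Running it a few times against the cp PID shows whether the state ever changes.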
I'll get you a gdb trace asap :-)
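
A rough sketch of how the backtraces could be captured non-interactively, so they can be pasted into a reply. It uses gdb's -p, -batch and -ex options; the output file name bt-<pid>.txt and the function name collect_bt are my own choices. One caveat: if cp really is stuck in kernel space, gdb may not be able to attach to it and could hang as well, so the glusterfs client process is the safer one to start with.

```shell
#!/bin/sh
# Usage: collect_bt.sh <pid> [<pid> ...]
# Writes a backtrace of every thread of each pid to bt-<pid>.txt.
# collect_bt and the file naming are illustrative, not a gluster tool.

collect_bt() {
    pid=$1
    out="bt-$pid.txt"
    # -batch makes gdb exit after running the -ex command;
    # detaching on exit lets the traced process continue.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
    echo "$out"
}

for pid in "$@"; do
    collect_bt "$pid"
done
```

For example, ./collect_bt.sh <glusterfs-process-id> <cp-process-id> would leave one file per process in the current directory.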

Thanks!
-Mic

Raghavendra G wrote:
> Hi Mickey,
> Is it possible to attach to glusterfs process using gdb, while cp is 
> hung and get a backtrace?
> # ps aux | grep -i glusterfs
> # gdb -p <glusterfs-process-id>
> and in gdb,
> (gdb) bt
>
> Also, it would be helpful if you could get a backtrace of cp as well.
> # ps aux | grep -i cp
> # gdb -p <cp-process-id>
> (gdb) bt
>
> Also, I am curious to know what the --reply option to cp does.
>
> regards,
>
> On Sun, Jun 15, 2008 at 12:12 AM, Mickey Mazarick 
> <mic at digitaltadpole.com> wrote:
>
>     I'm still seeing the problem described below. It only happens over
>     the ibverbs transport and very infrequently over tcp. This is an
>     intermittent problem, but happens quite frequently over ibverbs.
>     It will use all the processing power on a single core of the
>     client machine. I can repeat the command but eventually the
>     machine will lock with all processors doing a cp or a tar command.
>     We see it on both kernel 2.6.18 and 2.6.24. Has
>     anyone there been able to replicate it?
>
>     Thanks!
>     -Mickey Mazarick
>
>
>
>     Mickey Mazarick wrote:
>
>         Something odd is happening when I run a shell script with cp
>         commands in it. This happens infrequently but I have to reboot
>         the system to get my processor back. I'm never tarring or
>         copying more than 50 megs of data.
>
>         It either hangs on a command like:
>         cp --reply=yes /usr/src/linux-${kernver}/.config
>         /tftpboot/node_root/boot/config-${kernver}
>         or
>         tar cf - etc | gzip >
>         /tftpboot/node_root/drbl_ssi/template_etc.tgz
>
>         when I do a top I see:
>          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>         1603 root      20   0 54160 1616  508 R  100  0.0  33:02.72 cp
>         (100% cpu time)
>
>         I'm unable to kill that process in any way, but I can kill the
>         shell script that spawned it. The cp command is still running.
>
>         I see the below errors on the client:
>         2008-05-11 17:02:32 E [client-protocol.c:1238:client_flush] system1: : returning EBADFD
>         2008-05-11 17:02:32 E [afr.c:2623:afr_flush_cbk] afr1: (path=/scripts/gluster/afrheal.sh child=system1) op_ret=-1 op_errno=77
>         2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system1: no valid fd found, returning
>         2008-05-11 17:02:32 W [client-protocol.c:1296:client_close] system-ns1: no valid fd found, returning
>
>         My client and server specs are identical to:
>         http://www.gluster.org/docs/index.php/Simple_High_Availability_Storage_with_GlusterFS_1.3
>
>
>         This happens equally over ib-verbs and tcp transports.
>
>
>
>     -- 
>
>
>     _______________________________________________
>     Gluster-devel mailing list
>     Gluster-devel at nongnu.org
>     http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>
> -- 
> Raghavendra G
>
> A centipede was happy quite, until a toad in fun,
> Said, "Pray, which leg comes after which?",
> This raised his doubts to such a pitch,
> He fell flat into the ditch,
> Not knowing how to run.
> -Anonymous 

