[Gluster-users] Unreasonably high load after rebooting a brick

Mon Jul 13 19:03:22 UTC 2015

Hello,

I'm testing an environment on AWS right now and running into a strange
issue.  In summary, my setup is like this:

2 x c4.large (2 visible CPUs, 4 GB RAM) running a 500 GB (magnetic
backend) replicated Gluster volume, so each instance has 100% of the
data.  Running 3.7.2 on servers and multiple clients.  Bricks are XFS
with the noatime mount flag.  Server and client threads are set to 4
right now (was 2 before, same issue).

Currently there are ~800,000 small files (JPEG, 300 KB to 3 MB) on
the volume, and one of the clients is writing new files to the volume
constantly, on average about 2-3 per second.
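
For a sense of scale, here is a rough back-of-the-envelope sketch in
Python of what those figures imply.  The function names are made up for
illustration, decimal units are assumed, and the 18-minute window is
just the stall I observed, used as an example input.  It is an estimate,
not a measurement.

```python
# Rough ingest estimate from the figures quoted in this post.
# Assumption: decimal units, i.e. "300k" = 0.3 MB and "3 meg" = 3.0 MB.

def ingest_range_mb_per_s(files_per_s, size_mb):
    """Return (low, high) write rate in MB/s for a range of file rates and sizes."""
    lo_rate, hi_rate = files_per_s
    lo_size, hi_size = size_mb
    return lo_rate * lo_size, hi_rate * hi_size


def files_written(files_per_s, seconds):
    """Return (low, high) number of files a client would write in the window."""
    lo_rate, hi_rate = files_per_s
    return lo_rate * seconds, hi_rate * seconds


RATE = (2, 3)          # files per second
SIZE_MB = (0.3, 3.0)   # per-file size range in MB

if __name__ == "__main__":
    low, high = ingest_range_mb_per_s(RATE, SIZE_MB)
    print(f"ingest: {low:.1f} to {high:.1f} MB/s")
    lo_n, hi_n = files_written(RATE, 18 * 60)
    print(f"files attempted in 18 minutes: {lo_n} to {hi_n}")
```

So the steady-state write rate is somewhere under 10 MB/s, and an
18-minute stall corresponds to a few thousand attempted writes.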

Normally, there is practically no load.  I could run these on
micro instances.  But if I happen to reboot one of them, I run into
some serious trouble.  Both boxes max out on CPU, load average goes
into the 4-6 range, and my client can no longer write to the volume.

About 18 minutes later, there are finally log entries added to the
glustershd.log file and it begins a self heal on added files.  The
load calms down about 5-10 minutes after that, and other clients can
do reads and writes again.  However, my original client that was
trying to write the small files stays stuck: I can't even do an ls
on a folder without it taking 30+ seconds.  If I kill off everything
that was trying to write, then unmount and remount the volume, I can
get it functional again.
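
To put a number on the slow listing, a minimal timing sketch along these
lines could be run against a folder on the mounted volume.  The helper
name time_listing is hypothetical, and the path is whatever directory
you pass in.  Note that it only enumerates entries, so it may come out
faster than an ls that also stats every file.

```python
import os
import sys
import time


def time_listing(path):
    """Return (entry_count, elapsed_seconds) for one full listing of path."""
    start = time.perf_counter()
    # os.scandir only reads directory entries; it does not stat each file.
    with os.scandir(path) as entries:
        count = sum(1 for _ in entries)
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Pass a directory on the mounted volume; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, elapsed = time_listing(target)
    print(f"{count} entries in {elapsed:.2f} s")
```

Running it before and after a remount on the same folder would give a
like-for-like comparison.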

Do I just have too many small files?  Would this not happen with gp2
(SSD) bricks?  Is there a way to throttle whatever ate up all the CPU
so that services can continue with the fully functional brick?

I appreciate any insight.  Thank you for your bandwidth.

Ray