<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi everyone. <br>
</p>
<p>For some time now, I've been plagued with a slow Gluster array.
Metrics like disk utilization, disk latency, CPU usage, load
average, and eth1 traffic (the network device handling Gluster
synchronization), were all quite high. One of the servers in the
array had disk utilization pegged at 100% most of the time. This
caused quite a lot of slowness in our e-mail, especially webmail,
which keeps its data on the Gluster array.<br>
</p>
<p>Then a funny thing happened. I needed to reboot two of the
servers in the array (NFS2 and NFS3, referenced later) to add more
disks in the RAID array to create a new Gluster brick. Rebooting
NFS2 had no trouble, and the situation did not change. Doing the
same to NFS3 on the other hand, performed three days later, was
different. First, before the reboot, I saw many kernel messages
like this in the console:<br>
<br>
XFS: possible memory allocation deadlock in kmem_alloc
(mode:0x250)<br>
</p>
<p>After googling the error, it turns out it's indicating that the
XFS filesystem needs to be defragged. I take note of the error and
then reboot this second server.</p>
<p>When it came back up, disk utilization, disk latency, CPU usage,
load average and eth1 traffic all fell off a cliff and stayed that
way. Webmail was running faster than ever the next morning. I do
some more tests and research on the issue.</p>
<p>XFS reported that file fragmentation was fairly high:<br>
<br>
root@nfs3:/home/ernied# xfs_db -r /dev/sda5<br>
xfs_db> frag -f<br>
actual 7783555, ideal 6644098, fragmentation factor 14.64%</p>
<p>I defragged NFS2 last night, it took about an hour, and now it
reports that fragmentation is only about 1.5%. <br>
</p>
<p>But other people, especially in the Gluster mailing list, say
that XFS defragging isn't necessary unless the disk is nearly full
anyway. <br>
</p>
<p>And what's up with Gluster behaving properly after this reboot?
Why did rebooting NFS2 (which was actually the busier one, with
the highest disk utilization) do nothing, but doing the same to
NFS3 apparently fixed everything? Or was it defragging the disk?</p>
<p>While it's nice that things are back to optimal operation, I need
to know why this is happening. <br>
<br>
Attached are some images from Munin showing the before and after
the reboot, which happened at 10pm last night. Defragging went
from 12am to 1am.<br>
</p>
</body>
</html>