[Gluster-users] Shard Production Report – Week Two :)
Lindsay Mathieson
lindsay.mathieson at gmail.com
Sun May 1 06:31:25 UTC 2016
Hi y’all, I meant to do a Week One report but had various dramas with
failing hard disks – bad timing really, but good for testing.
The first week went pretty well – I had 12 VMs on 3 nodes running off
a replica 3 sharded (64 MB shard size) Gluster volume. It coped well: good
performance, and when I rebooted nodes and/or killed gluster processes, I/O
continued without a hitch, no users noticed, and heal times were more than
satisfactory.
There was a drama mid-week when one gluster brick froze. Bad feeling,
until I realised the Ceph OSDs on the same node had also frozen. It was
a hard disk failure that locked up the underlying ZFS pool (eventually
two disks in the same mirror, dammit). The system kept going until the next
day, when I could replace the disks and reboot the node.
Over the weekend I downed the volume and ran md5sum across the shards on
all three nodes (a rough sketch of the check follows the list below).
Unfortunately I found 8 mismatching shards, which was concerning. However:
- All the mismatching shards were from the same VM image file, which was
running on the node with the disk failures.
- Of the three copies of each shard, two always matched, so I could
easily select a good one for healing.
- 7 of the mismatching shards came from the bad disk. I put this down to
failed ZFS recovery after the fact.
- Repair was easy – just delete the mismatched shard copy and issue a
full heal.
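For anyone curious, here is roughly what the per-node check looked like – a
minimal sketch, assuming sharded files sit under the brick’s .shard
directory; the brick path and volume name below are placeholders, not
details from my setup.

#!/usr/bin/env python3
# Sketch of a per-brick shard checksum pass (assumptions noted above).
# Run on each node, then diff the resulting manifests across the nodes.
import hashlib
import os

BRICK_ROOT = "/path/to/brick"            # hypothetical brick path
SHARD_DIR = os.path.join(BRICK_ROOT, ".shard")

def md5_of(path, chunk=1 << 20):
    """Stream the file through md5 so large shards don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

with open("shard-manifest.txt", "w") as out:
    for name in sorted(os.listdir(SHARD_DIR)):
        path = os.path.join(SHARD_DIR, name)
        if os.path.isfile(path):
            out.write(f"{md5_of(path)}  {name}\n")

# After diffing the manifests: remove the odd-one-out copy on its brick,
# then trigger a heal, e.g. "gluster volume heal <volname> full".

Diffing the three shard-manifest.txt files shows which shard names disagree;
since two of the three copies always matched, the odd one out is the copy to
delete before issuing the full heal.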
Second week – I gradually migrated the remaining VMs to gluster and
retired the Ceph pools. So far no one has noticed :) Performance has
greatly improved and iowait has dropped considerably. Overall the VMs seem
much less vulnerable to peaks in I/O, with a smoother experience
overall. A rolling upgrade and reboot of the servers went very smoothly;
I just had to wait about 15 minutes between reboots for heals to finish.
On Friday night I downed the volume again and re-checksummed all the shards
(2 TB of data per brick) – everything matched down to the last byte.
It was instructive bringing it back up – just for laughs I started all 30
Windows VMs simultaneously, and it actually coped. iowait went
through the roof (50% on one node), but they all started without a hitch
and were accessible within a few minutes. After about an hour the cluster
had settled down to under 5% iowait.
I know in the overall scheme of things our setup is pretty small – 30+
VMs amongst 3 nodes – but it’s pretty important for our small business.
Very pleased with the outcome so far, and we will be continuing. All the
issues that bothered us when we first looked at Gluster (3.6) have been
addressed.
Cheers all, and a big thanks to the devs, testers and documentation writers.
--
Lindsay Mathieson