[Gluster-users] gluster rebuild time
Liam Slusser
lslusser at gmail.com
Fri Feb 5 01:25:26 UTC 2010
All,
I've been asked to share some rebuild times on my large gluster
cluster. I recently added more storage (bricks) and did a
full ls -alR on the whole system. I estimate we have around 50
million files and directories.
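For the record, the crawl itself was nothing fancy; it was basically the
following, run from one of the clients (the mount point here is just an
example, yours will differ):

  # walk every file and directory through the Gluster mount so everything
  # gets stat'd; throw away the listing itself
  time ls -alR /mnt/glusterfs > /dev/null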
Gluster Server Hardware:
2 x Supermicro 4U chassis, each with 24 1.5TB SATA drives internal and
another 24 1.5TB SATA drives in an external drive array via SAS (96
drives altogether), 8-core 2.5GHz Xeon, 8GB RAM
3ware RAID controllers, 24 drives per RAID6 array, 4 arrays total, 2
arrays per server
CentOS 5.3 64-bit
XFS with the inode64 mount option (example mount line below)
Gluster 2.0.9
Bonded gigabit Ethernet
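In case anyone wants the specifics, the brick filesystems are mounted
with something along these lines - the device name and brick path here
are just placeholders, not our exact layout:

  # /etc/fstab entry for one brick filesystem; inode64 lets XFS
  # allocate inodes anywhere on these large (>1TB) filesystems
  /dev/sdb1   /export/brick1   xfs   defaults,inode64   0 0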
Clients:
20 or so Dell 1950 clients
Mixture of RedHat ES4 and CentOS 5 clients, plus 20 Windows XP clients
via Samba (these are VMs that do the "have to run on Windows" jobs;
smb.conf sketch below)
All clients on gigabit Ethernet
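The Samba piece is nothing special either - roughly a share like this
sitting on top of the Gluster client mount (share name and path are just
placeholders):

  # /etc/samba/smb.conf - re-export the Gluster FUSE mount to the XP VMs
  [gluster]
      path = /mnt/glusterfs
      read only = no
      browseable = yes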
I must say that the load on our Gluster servers is normally very high;
"load average" on each box is anywhere from 7-10 at peak (though service
times stay decent), so I'm sure the rebuild time would have been quicker
on a more idle system. The system is at its highest load when it is
writing a large amount of data during the peak of the day, so I try to
schedule jobs around our peak times.
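"Scheduling" here just means cron - something along these lines keeps
the heavy write jobs out of the daytime peak (the time and script name
are made up for the example):

  # kick off the big write job at 1am, well outside our peak window
  0 1 * * *   /usr/local/bin/nightly_ingest.sh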
Anyhow...
I started the job sometime January 16th and it JUST finished...18 days later.
real 27229m56.894s
user 13m19.833s
sys 56m51.277s
Finish date was Wed Feb 3 23:33:12 PST 2010
Now, I know some people have mentioned that Gluster is happier with many
small bricks instead of the larger RAID arrays I use; however, either
way I'd be stuck doing an ls -aglR, which takes forever. So I'd rather
add a huge amount of space at once and keep the system setup similar,
and let my 3ware controllers deal with drive failures instead of having
to do an ls -aglR each time I lose a drive. Replacing a drive with the
3ware controller takes 7 to 8 days in a 24-drive RAID6 array, but that's
better than 18 days for Gluster to do an ls -aglR.
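For what it's worth, keeping an eye on a 3ware rebuild is just a matter
of polling the CLI - something like this (the controller and unit
numbers are only examples):

  # show the units and drives on controller 0; a rebuilding unit is
  # listed as REBUILDING along with a percent-complete column
  tw_cli /c0 show

  # or just the one unit
  tw_cli /c0/u0 show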
By comparison, our old 14-node Isilon 6000 cluster (6TB per node) did a
node rebuild/resync in about a day or two - there's a big difference
between block-level and filesystem-level replication!
We're still running Gluster 2.0.9, but I am looking to upgrade to 3.0
once a few more releases are out, and am hoping that the new
checksum-based checks will speed up this whole process. Once I have some
numbers on 3.0 I'll be sure to share.
thanks,
liam