[Gluster-users] XEN VPS unresponsive because of selfhealing

Tomas Corej tomas.corej@websupport.sk
Mon Apr 18 10:47:52 UTC 2011


Hello,

I've been actively watching this project since its early 2.0 releases
and think it has made great progress. Personally, I find the problems
it is solving, and the way it solves them, very interesting.

We are a web hosting company and serve some of our hosted sites from
GlusterFS because of their size.

We also serve XEN domUs from GlusterFS, and yesterday we tried to
upgrade from GlusterFS 3.1.2 to the latest version, 3.1.4. Our setup
is pretty much the standard distribute-replicate:

Volume Name: images
Type: Distributed-Replicate
Status: Started
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gnode002.local:/data1/images
Brick2: gnode004.local:/data1/images
Brick3: gnode002.local:/data2/images
Brick4: gnode004.local:/data2/images
Brick5: gnode002.local:/data3/images
Brick6: gnode004.local:/data3/images
Brick7: gnode002.local:/data4/images
Brick8: gnode004.local:/data4/images
Brick9: gnode006.local:/data1/images
Brick10: gnode008.local:/data1/images
Brick11: gnode006.local:/data2/images
Brick12: gnode008.local:/data2/images
Brick13: gnode006.local:/data3/images
Brick14: gnode008.local:/data3/images
Brick15: gnode006.local:/data4/images
Brick16: gnode008.local:/data4/images
Brick17: gnode010.local:/data1/images
Brick18: gnode012.local:/data1/images
Brick19: gnode010.local:/data2/images
Brick20: gnode012.local:/data2/images
Brick21: gnode010.local:/data3/images
Brick22: gnode012.local:/data3/images
Brick23: gnode010.local:/data4/images
Brick24: gnode012.local:/data4/images
Options Reconfigured:
performance.quick-read: off
network.ping-timeout: 30
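
For reference, those two options were reconfigured with the ordinary
volume-set command, i.e. something along the lines of:

  gluster volume set images performance.quick-read off
  gluster volume set images network.ping-timeout 30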

The XEN servers mount the images through the GlusterFS native client
and serve them to the domUs using the tap:aio driver.
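
For the record, the mount and the domU disk line look roughly like
this (hostname, mountpoint and image name are only illustrative here):

  mount -t glusterfs gnode002.local:/images /mnt/gluster/images

and in the domU config:

  disk = [ 'tap:aio:/mnt/gluster/images/vps01.img,xvda,w' ]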

We wanted to upgrade gluster on each node, one at a time (so far we
have only done gnode002). So we ran:

root@gnode002.local: /etc/init.d/glusterd stop && killall glusterfsd && /etc/init.d/glusterd start

We had to kill the processes because glusterd didn't shut down properly.
The problem was that, right after the restart, self-healing immediately
started checking consistency. The glusterfsd processes could only have
been down for 5-6 seconds, so we expected self-healing not to kick in,
but it did. That alone would not have been a problem, except that the
self-heal made our VPS totally unresponsive for 90 minutes, until it
finished, because gluster had locked the image (or access to the image
was simply that slow?).
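
In case it is useful for diagnosis: as far as I understand it, one can
check whether a heal is really pending on an image by looking at the
AFR changelog attributes directly on the bricks (as root; the path
below is just an example):

  getfattr -d -m trusted.afr -e hex /data1/images/vps01.img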

So the question is: is there a way to avoid this, or at least to
minimize the effects? Has anyone had the same experience with
self-healing in a GlusterFS+XEN environment?
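
For example, would switching the data self-heal algorithm to "diff"
(if that option is available at all in 3.1.x - I am not sure) reduce
the impact on such large images? Something like:

  gluster volume set images cluster.data-self-heal-algorithm diff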

Regards,
Tomas Corej

-- 

Tomáš Čorej | admin section

+421 (0)2 20 60 80 89
+421 (0)2 20 60 80 80

http://WebSupport.sk