[Gluster-users] Fwd: Live rolling upgrade

Per Hallsmark wr.phallsma at gmail.com
Wed Apr 23 14:48:51 UTC 2014


Hello,

I'm looking into a use case, rolling upgrades of gluster cluster nodes, and
I'm having issues where a node I'm about to take out of the cluster may
still hold data it hasn't pushed to the rest of the cluster, even though I'm
doing a clean shutdown like "sync; service glusterd stop".

The reason for taking a node out might be to install more memory, disks and
the like, so it is quite normal maintenance all in all.

During the rolling upgrade the gluster volume the nodes serve is still in
use; data keeps coming in and going out.

I've put together an easy test case using two VMs that each act as a
gluster cluster node.

I create the volume as:

gluster volume create test-volume replica 2 transport tcp
192.168.0.1:/export/brick
192.168.0.9:/export/brick

From guides etc. it seems like a common basic setup.
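
For reference, the rest of the setup is just the usual start and mount (a
sketch; the mount is the same line used on node2 in the test below, and
node1 mounts /import the same way against its own address):

gluster volume start test-volume
gluster volume info test-volume
mount -t glusterfs 192.168.0.9:test-volume /import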

After making sure everything works OK in normal mode with both servers
alive, I shut down one node. Everything is of course still fine: the
remaining node takes care of serving the volume and the clients keep
pushing data to it. After we have maintained the first node, we bring it up
again and it rejoins the cluster OK (replication/self-heal starts). So now
we want to maintain the second node and bring it down. Unfortunately this
means that data it holds for the volume might not have made it over to the
first node before we stop it. I can see this because I compare the md5sum
of a data file just written to the volume from the node I am about to shut
down with the md5sum of the same file as seen from the node that was just
maintained.

Here is how I do this. I start with node1 up and node2 down (simulating it
being down for maintenance).

# node1
dd if=/dev/urandom of=/import/datafil bs=65535 count=4096; md5sum
/import/datafil; ls -l /import/datafil; sync; umount /import; sync; service
glusterd stop

# the above takes about 35 sec to finish, so after roughly 10 sec I start
glusterd on the second node, simulating node2 coming back from maintenance.

# node2
service glusterd start; sleep 3; mount -t glusterfs 192.168.0.9:test-volume
/import; while true; do md5sum /import/datafil; ls -l /import/datafil;
sleep 1; done

The output then on node1 looks like:

root at p1-sr0-sl1:/var/log/glusterfs# dd if=/dev/urandom of=/import/datafil
bs=65535 count=4096; md5sum /import/datafil; ls -l /import/datafil;
/root/filesync /import/datafil; umount /import; service glusterd stop
4096+0 records in
4096+0 records out
268431360 bytes (268 MB) copied, 35.6098 s, 7.5 MB/s
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
fsync of /import/datafil ... done
Stopping glusterd:
root at p1-sr0-sl1:/var/log/glusterfs#

and on node2:

root at p1-sr0-sl9:~# service glusterd start; sleep 2; mount -t glusterfs
192.168.0.9:test-volume /import; while true; do md5sum /import/datafil; ls
-l /import/datafil; sleep 1; done
Starting glusterd:
d05c8177b981b921b0c56980eaf3e33e  /import/datafil
-rw-r--r-- 1 root root 172750260 Apr 23 11:41 /import/datafil
1d0cf10228cb341290fa43094cc67edf  /import/datafil
-rw-r--r-- 1 root root 207221670 Apr 23 11:41 /import/datafil
f9a0f254c3239c6d8ebad1be05b27bf7  /import/datafil
-rw-r--r-- 1 root root 242413965 Apr 23 11:41 /import/datafil
md5sum: /import/datafil: Transport endpoint is not connected
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
....

So the sums do not match. It is quite obvious that node2 kept receiving
more and more data from node1 until node1 went down for maintenance (no
more data is added to the file after "transport endpoint is not connected"
and the md5sum stays the same).

Now if I start glusterd on node1 again:

root at p1-sr0-sl1:/var/log/glusterfs# service glusterd start
Starting glusterd:
root at p1-sr0-sl1:/var/log/glusterfs#

After a while it syncs up OK on node2:

...
e0d7bd9fa1fce24d65ccf89b8217231f  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
e0d7bd9fa1fce24d65ccf89b8217231f  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
99e935937799cba1edaab3aed622798a  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
2d6a9afbd3f8517baab5622d1337826f  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
badce4130e98cbe6675793680c6bf3d7  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:42 /import/datafil
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
6f7e441ccd11f8679ec824aafda56abc  /import/datafil
-rw-r--r-- 1 root root 268431360 Apr 23 11:41 /import/datafil
^C
root at p1-sr0-sl9:~#

The metadata seems to sync (the file has the correct length and date), but
the file content is just zeros from the point where the data no longer made
it from node1 to node2. It surprises me a bit that a glusterd node is
allowed to leave the cluster without first having written all of its
locally unique data to the remaining nodes (in this case just one). Think
of the same scenario with an NFS server: if a client has mounted the
filesystem and pushed some data to it, we cannot unmount until everything
has been written cleanly to the NFS server.
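
One way to see exactly where node2's copy diverges is to compare the brick
copies byte by byte, bypassing gluster entirely (a sketch; it assumes
node1's brick still holds the complete copy and that it is copied over
first):

# node2
scp 192.168.0.1:/export/brick/datafil /tmp/datafil.node1
cmp /tmp/datafil.node1 /export/brick/datafil
# cmp prints the offset of the first differing byte; the bytes around that
# offset can then be inspected with e.g.
#   od -A d -t x1 -j OFFSET -N 64 /export/brick/datafil
# where OFFSET is the offset cmp reported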

It seems like sync is effectively a no-op on glusterfs volumes, which is
probably by design, as can be understood from this (admittedly a bit old)
thread:

http://sourceforge.net/p/fuse/mailman/fuse-devel/thread/87ljeziig8.fsf@frosties.localdomain

So sync(2) is not supposed to "spread into" FUSE userspace filesystems (is
that still true?), and thus we never get a full sync done.
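
For what it's worth, a per-file fsync can also be forced from the shell
without a helper binary by letting dd do it as part of the write (a sketch
of what my /root/filesync helper accomplishes; note that fsync only
guarantees the data has reached the bricks that are currently connected,
so it does not by itself solve the problem above):

# conv=fsync makes dd fsync() /import/datafil before it exits
dd if=/dev/urandom of=/import/datafil bs=65535 count=4096 conv=fsync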

So, the question is: how do I know when a glusterd node is ready (i.e. no
longer holds any locally unique data) to be shut down?
Is anyone else concerned about this? What are your recipes for managing a
nice, clean glusterfs node shutdown, while still being sure the data is
correct at all times?
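
My current best guess is to poll the self-heal status and only stop
glusterd once nothing is pending, roughly like this (a sketch; I have not
verified that an empty heal-info output really guarantees there is no
locally unique data left, and the output format may vary between versions):

# on the node about to go down, before "service glusterd stop"
# loop while any brick still reports a non-zero number of entries to heal
while gluster volume heal test-volume info | grep 'Number of entries:' \
        | grep -qv ': 0$'; do
    echo "self-heal still pending, waiting..."
    sleep 5
done
service glusterd stop

Is that the intended way to do it, or is there something better?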

We are running glusterfs 3.4.2 on a 3.4-ish Linux kernel. I've ported the
FUSE kernel/userland parts from the thread mentioned above so that syncfs()
reaches glusterfs, and I've also started tweaking glusterfs in an attempt
to have it flush its locally unique data when this syncfs() arrives.
I wonder if anyone else is looking into this area?

Best regards,
Per