[Gluster-users] How do different self-healing scenarios work?

Fri Dec 16 11:50:37 UTC 2016

Hello,

I am testing different use cases and I am not sure I fully understand how Gluster (3.9 here) self-healing works. The context is a dispersed 4+2 volume “vol1” on 6 nodes gl[1..6], one brick per node.
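
For reference, the geometry of such a volume can be worked out in a few lines of Python. This is only a rough sketch: the function name and the returned fields are made up for illustration, it assumes each brick stores about 1/4 of the file (as observed on gl6), it takes 1 GB as 10^9 bytes, and it ignores any padding or metadata Gluster adds on the bricks.

```python
def disperse_layout(file_size_bytes, data_bricks=4, redundancy_bricks=2):
    """Rough per-brick footprint of one file on a dispersed volume.

    Hypothetical helper for illustration only, not part of any Gluster API.
    """
    total_bricks = data_bricks + redundancy_bricks
    # Assumption: each brick holds one fragment of about size / data_bricks.
    fragment_bytes = -(-file_size_bytes // data_bricks)  # ceiling division
    return {
        "total_bricks": total_bricks,
        "fragment_bytes": fragment_bytes,
        "raw_bytes": fragment_bytes * total_bricks,
        "tolerated_brick_failures": redundancy_bricks,
    }


GB = 10**9  # assumption: decimal gigabytes
layout = disperse_layout(5 * GB)
print(layout)
```

With these numbers, losing the fragment on gl6 means roughly 1.25 GB has to be rebuilt for each 5 GB file, using the fragments from any 4 of the remaining bricks.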

1) While a client is reading a 5 GB file F on vol1, the file on gl6 (actually a 1/4 fragment of it) is emptied with echo > F. At this point I can see the network flow reverse, from gl6 => client to client => gl6, and that fragment of file F starts healing. I assume that the file is recovered from the client to gl6.

Q1.1: I would have expected gl6 to detect the file corruption and recreate the fragment from the existing fragments on gl[1..5], but the heal is driven from the client that is reading the file. That makes sense in that the file is already being accessed by another node, but in terms of performance, could the client become a bottleneck if the files were several TB in size?
Q1.2: Why and under which conditions does this happen? Does the same thing happen when the file is being written?

2) vol1 is mounted by the client but no files are accessed, then all file fragments (20 files of 5 GB each) on gl6 are removed with rm *. If "gluster v heal vol1 full" is issued on gl6 or the glusterfs-server process is restarted, all files reappear instantly, which is nice! The file system on the nodes is XFS.

Q2.1: I assume that the file metadata are recovered from the other nodes and the data are re-linked from blocks still existing on the file system? Is this specific to XFS, or would it be the same with other file systems such as ext or ZFS? (The future plan is to use ZFS as the underlying file system.)

3) vol1 is mounted by the client but no files are accessed, the files (the same 20 files of 5 GB) are deliberately corrupted with echo 0 >> F, then "gluster v heal vol1 full" is issued on gl6. The files are recovered one by one, from gl3 in this case.

Q3.1: Why didn’t Gluster try to recover multiple files from different nodes at the same time?
Q3.2: I have already seen Gluster recovering files from multiple nodes at the same time during heavy workloads. Under which circumstances does this happen?
Q3.3: After how much time would Gluster have detected the corruption and started the self-healing process?

I actually didn’t find any explanation on the web of how the self-healing process works and what the different use cases / scenarios are. Any pointers?

Thanks