[Gluster-devel] Self-heal with partial files

Kevan Benson kbenson at a-1networks.com
Thu Oct 4 17:32:25 UTC 2007


Krishna Srinivas wrote:
> On 10/4/07, Kevan Benson <kbenson at a-1networks.com> wrote:
>   
>> Is self heal supposed to work with partial files?  I have an issue where
>> self-heal isn't happening on some servers with AFR and unify in a HA
>> setup I developed.  Two servers, two clients, all AFR and unify done on
>> client side.
>>
>> If I kill a connection while a large file is being written, the
>> glusterfs mount waits the appropriate timeout period (10 seconds in my
>> case) and then finishes writing the file to the still active server.
>> This results in a full file on one server and a partial file on the
>> other (the one I stopped traffic to temporarily to simulate a
>> crash/network problem).  If I then enable the disabled server and read
>> data from the problematic file, self-heal doesn't kick in and copy
>> the full file to the server with the partial file.
>>
>> Anything written entirely while a server is offline (i.e. the offline
>> server has no knowledge of it) is correctly created on read from the
>> file, so the problem seems to be related to files that are partially
>> written to one server.
>>
>> Can someone comment on the particular conditions that cause a self
>> heal?  Is there something I can do to force it to self-heal at this
>> point (I repeat that reading data from the file does not work)?  I know
>> I can use rsync and some foo to fix this, but that becomes less and less
>> feasible as the mount size grows and the time for rsync to compare sides
>> lengthens.
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at nongnu.org
>> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>>     
>
> Hi Kevan,
>
> It should have worked fine in your case. What version of glusterfs are
> you using? Just before you do the second read (or rather, the open) that
> should have triggered self-heal, can you run getfattr -n trusted.afr.version <>
> on both the partial file and the full file in the backend and send the output?
>
> Thanks
> Krishna
>
>   

Glusterfs TLA 504, fuse-2.7.0-gfs4.

The trusted.afr.version attribute doesn't exist on the partial file, but
it does exist on the complete file (with value "1").  From what I just
tested, it isn't set until the file operation completes (it doesn't
exist while the file is being written).  Are files without this
attribute assumed to have a value of "0" or something, to ensure that
they participate in self-heal correctly?
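
For anyone who wants to repeat the check across many backend files, here is
a minimal sketch of the comparison in Python.  It assumes Python 3 on Linux,
run as root directly against the backend directories (trusted.* attributes
are only visible to privileged processes).  Decoding the value as an ASCII
integer is an assumption based on the "1" that getfattr printed here, and
the function names are purely illustrative.  A missing attribute is
reported as None rather than 0, since whether it counts as 0 is exactly the
open question.

```python
import errno
import os

ATTR = "trusted.afr.version"

def read_afr_version(path):
    """Return the attribute as an int, or None if the file lacks it."""
    try:
        raw = os.getxattr(path, ATTR)  # Linux-only call
    except OSError as exc:
        if exc.errno == errno.ENODATA:  # attribute not set on this file
            return None
        raise
    # Assumption: the value is stored as ASCII digits, possibly NUL-padded.
    return int(raw.decode("ascii").rstrip("\x00"))

def compare(versions):
    """versions maps a label (e.g. a server name) to an int or None.

    Returns (labels with no attribute, labels behind the highest version).
    """
    missing = sorted(n for n, v in versions.items() if v is None)
    known = {n: v for n, v in versions.items() if v is not None}
    newest = max(known.values(), default=None)
    stale = sorted(n for n, v in known.items() if v < newest)
    return missing, stale

# Hypothetical usage, with made-up backend paths:
#   compare({"server1": read_afr_version("/export/brick/bigfile"),
#            "server2": read_afr_version("/mnt/server2-brick/bigfile")})
```

With the state described above (complete copy at 1, partial copy without
the attribute), compare() lists the partial copy as missing rather than
stale.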

It doesn't look like files without the attribute are handled that way:
if I append data to the file, the partial version gets assigned
trusted.afr.version=1, while the complete file's trusted.afr.version is
incremented to 2.  Self-heal now works for that file, and on a read of
file data the partial file is updated with all data and its
trusted.afr.version is set to 2.
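
To make the sequence concrete, here is a toy model of what I observed,
written in Python.  It is only a hypothesis that fits these two
observations, not AFR's actual logic: the rule in heal_expected() (heal
only when both copies carry a version and the versions differ) is my guess,
and both function names are made up.

```python
def bump_on_write(version):
    """Observed: a completed write sets a missing version to 1, else adds 1."""
    return 1 if version is None else version + 1

def heal_expected(a, b):
    """Hypothesis only: heal fires when both versions exist and differ."""
    return a is not None and b is not None and a != b

# State after the interrupted write: complete copy at 1, partial copy unset.
complete, partial = 1, None
before = heal_expected(complete, partial)  # False, matching the missing heal

# Appending through the mount completes a write on both copies.
complete, partial = bump_on_write(complete), bump_on_write(partial)
after = heal_expected(complete, partial)   # True (2 vs 1), matching the heal
```

If a missing attribute were instead compared as 0, the first check would
see 1 vs 0 and a heal would have been expected on the first read, which is
not what happened.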




