[Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR Translator have problem )

Thu Jan 17 09:44:23 UTC 2008

Hi,

Yes, I would agree these changes would improve the current implementation.

However, a "better" way would be for the client, on failing to write to ONE of the AFR volumes, to write the change to a logfile on the remaining volumes .. then for the recovering server to play back the logfile when it comes back up, or to recopy the file if there are insufficient logs or if the file has been erased.

This would "seem" to be a very simple implementation .. 

Client;

Write to AFR
If Fail then
  if log file does not exist create log file
  Record file, version, offset, size, data in logfile

On server;

When recovering;

  for each entry in logfile
     if file age > most recent transaction
        re-copy whole file
     else
        replay transaction

  if all volumes "UP", remove logfile 

?????
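
A minimal sketch of the recovery side of the outline above, in Python rather than as real translator code. It reads "insufficient logs" as "the logged versions do not follow on directly from the version the recovering server holds"; that reading, and the names LogEntry, plan_heal and replay, are assumptions made for illustration and are not part of AFR.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    path: str      # file the failed write was aimed at
    version: int   # file version produced by this write
    offset: int    # byte offset of the write
    data: bytes    # payload; the size is len(data)


def plan_heal(entries, local_versions):
    """Decide per file whether to replay logged writes or recopy the file.

    local_versions maps path -> version held by the recovering server;
    a missing path means the file was erased there.
    Returns {path: "recopy"} or {path: [entries to replay, in order]}.
    """
    by_path = {}
    for entry in entries:
        by_path.setdefault(entry.path, []).append(entry)

    plan = {}
    for path, log in by_path.items():
        log.sort(key=lambda e: e.version)
        have = local_versions.get(path)
        if have is None:
            plan[path] = "recopy"          # file erased: nothing to patch
            continue
        pending = [e for e in log if e.version > have]
        expected = list(range(have + 1, have + 1 + len(pending)))
        if [e.version for e in pending] != expected:
            plan[path] = "recopy"          # gap in the log: insufficient logs
        else:
            plan[path] = pending           # log is complete: replay it
    return plan


def replay(buf, pending):
    """Apply logged writes in order to an in-memory copy of the file."""
    for e in pending:
        end = e.offset + len(e.data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))   # write past EOF grows the file
        buf[e.offset:end] = e.data
    return buf
```

The planning step never touches file contents, so in this sketch the good copy can stay in use while the stale one is being patched.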

One of the REAL benefits of this is that the file is still available DURING a heal operation.
At the moment a HEAL only takes place when a file is being opened, and while the copy is taking place the file blocks ...

Gareth.

----- Original Message -----
From: "Angel" <clist at uah.es>
To: "Gareth Bult" <gareth at encryptec.net>
Cc: gluster-devel at nongnu.org
Sent: 17 January 2008 08:47:06 o'clock (GMT) Europe/London
Subject: New IDEA: The Checksumming xlator ( AFR Translator have problem )

Hi Gareth

You said it!!, gluster is revolutionary!!

AFR does a good job, we only have to help AFR be a better guy!!

What we need is a checksumming translator!!

Suppose you have your posix volumes A and B on different servers.

So you are using AFR(A,B) on the client.

One of your AFRed nodes (A) fails and some time later it comes back to life, but its backend filesystem
got trashed and fsck'ed, and now there may be subtle differences in the files inside.

Your beloved 100GB XEN files now don't match between your "faulty" A node and your fresh B node!!

AFR would notice this by means (I think) of an xattr on both files, that is, VERSION(FILE on node A) != VERSION(FILE on node B) or something like that.

But the real problem, as you pointed out, is that AFR only knows the files don't match, so it has to copy every byte of your 100GB image from B to A (automatically on self-heal or on file access).

That's many GBs (maybe PBs) going back and forth over the net. THIS IS VERY EXPENSIVE, we all know that.

Enter the Checksumming xlator (SHA1 or MD5, maybe MD4, as rsync seems to use that without any problem).

The checksumming xlator sits atop your posix modules on every node. Whenever you request the xattr SHA1[block_number] on a file, the checksumming xlator intercepts the call,
reads block number "block_number" from the file, calculates its SHA1 and returns it as an xattr key:value pair.
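
As a rough illustration of what such a lookup could compute, here is a minimal Python sketch (a real xlator would be written in C against the translator API, which is not shown here). The key syntax "SHA1[n]", the 1 MiB default block size, and the function names block_sha1 and get_checksum_xattr are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 1 << 20  # assumed block size (1 MiB); the post does not fix one


def block_sha1(path, block_number, block_size=BLOCK_SIZE):
    """Return the hex SHA1 of one block of a file, or None past end of file."""
    with open(path, "rb") as f:
        f.seek(block_number * block_size)
        data = f.read(block_size)
    if not data:
        return None
    return hashlib.sha1(data).hexdigest()


def get_checksum_xattr(path, key, block_size=BLOCK_SIZE):
    """Answer a hypothetical xattr key such as 'SHA1[3]' with a (key, value) pair."""
    if not (key.startswith("SHA1[") and key.endswith("]")):
        raise KeyError(key)  # not a checksum key: a real xlator would pass it down
    block_number = int(key[len("SHA1["):-1])
    return key, block_sha1(path, block_number, block_size)
```

Only the requested block is read from disk, so a single lookup costs one block of local I/O and returns a 40-character value.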

Now AFR can request the SHA1 blockwise from both servers and update only those blocks whose SHA1s don't match.

With a decent block size we can save a lot of data transfer on every heal.

-- In case your faulty node lost its contents, you have to copy the whole 100GB XEN files again.
-- In case of a SHA1 mismatch, AFR can update only the differences, saving a lot of resources, like rsync does.
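
The blockwise heal described above can be sketched as follows. This is a minimal in-memory Python illustration, not AFR code: the two copies are plain byte buffers standing in for the files on B (good) and A (stale), and block_digests and heal_blocks are hypothetical names.

```python
import hashlib


def block_digests(data, block_size):
    """SHA1 digest of each fixed-size block (the last block may be short)."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def heal_blocks(good, stale, block_size):
    """Make `stale` equal to `good`, copying only blocks whose SHA1 differs.

    good is bytes, stale is a bytearray modified in place.
    Returns the list of block numbers that had to be copied.
    """
    good_sums = block_digests(good, block_size)
    stale_sums = block_digests(bytes(stale), block_size)
    copied = []
    for n, want in enumerate(good_sums):
        if n >= len(stale_sums) or stale_sums[n] != want:
            start = n * block_size
            stale[start:start + block_size] = good[start:start + block_size]
            copied.append(n)
    del stale[len(good):]  # drop anything beyond the end of the good copy
    return copied
```

For scale: with 1 MiB blocks a 100 GiB image has 102,400 blocks, and at 20 bytes per SHA1 digest that is about 2 MB of checksums per server on the wire, plus whichever blocks actually differ. Each server still has to read its whole local copy to compute the digests.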

One more advanced feature would be to incorporate the xdelta library functions, making it possible to generate binary patches against files...

Now we only need someone to implement this xlator :-)

Regards

On Thursday, 17 January 2008 01:49, wrote:
> Mmm...
> 
> There are a couple of real issues with self heal at the moment that make it a minefield for the inexperienced.
> 
> Firstly there's the mount bug .. if you have two servers and two clients, and one AFR, there's a temptation to mount each client against a different server. Which initially works fine .. right up until one of the glusterfsd's ends .. when it still works fine. However, when you restart the failed glusterfsd, one client will erroneously connect to it (or this is my interpretation of the net effect), regardless of the fact that self-heal has not taken place .. and because it's out of sync, doing a "head -c1" on a file you know has changed gets you nowhere. So essentially you need to remount clients against non-crashed servers before starting a crashed server .. which is not nice. (this is a filed bug)
> 
> Then we have us poor XEN users who store 100GB's worth of XEN images on a gluster mount .. which means we can live migrate XEN instances between servers .. which is fantastic. However, after a server config change or a server crash, it means we need to copy 100GB between the servers .. which wouldn't be so bad if we didn't have to stop and start each XEN instance in order for self heal to register the file as changed .. and while self-heal is re-copying the images, they can't be used, so you're looking at 3-4 mins of downtime per instance.
> 
> Apart from that (!) I think gluster is a revolutionary filesystem and will go a long way .. especially if the bug list shrinks .. ;-)
> 
> Keep up the good work :)
> 
> [incidentally, I now have 3 separate XEN/gluster server stacks, all running live-migrate - it works!]
> 
> Regards,
> Gareth.
>

-- 
----------------------------
Clister UAH
----------------------------