[Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR Translator have problem )

Anand Avati avati at zresearch.com
Thu Jan 17 16:11:51 UTC 2008


Gareth,
 Yes, keeping the log on the other servers is an option we also have in mind,
and we think it is one of the best. This is what we might do in the
'proactive self heal' feature of 1.4.

avati

2008/1/17, Gareth Bult <gareth at encryptec.net>:
>
> Mmm, my opinion is that it's relatively easy, but needs to be in AFR ...
>
>
> ----- Original Message -----
> From: "Angel" <clist at uah.es>
> To: gluster-devel at nongnu.org
> Sent: 17 January 2008 12:17:23 o'clock (GMT) Europe/London
> Subject: [Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR
> Translator have problem )
>
> hi
>
> Managing log files seems pretty hard to me at the moment. Are you
> confident it is feasible?
>
> On the other hand, checksumming also seems very interesting to me as a
> usable userspace feature (offloading checksums from client apps to the server).
>
> Checksumming is definitely on my TODO list.
>
> I'm very busy right now and still have my pet QUOTA xlator in progress.
>
> Anyway, it is hard to start building a logfile-based AFR without disturbing
> the current AFR developers.
>
> I'm sure they have their own ideas about what to do on this subject.
>
>
> Regards,
>
> Life's hard but root password helps!
>
> On Thursday, 17 January 2008 11:11, Gareth Bult wrote:
> > Erm, I said;
> >
> > >to write the change to a logfile on the remaining volumes
> >
> > By which I meant that the log file would be written on the remaining
> > available server volumes ... (!)
> >
> > Regards,
> > Gareth.
> >
> > ----- Original Message -----
> > From: "Angel" <clist at uah.es>
> > To: gluster-devel at nongnu.org
> > Sent: 17 January 2008 10:07:12 o'clock (GMT) Europe/London
> > Subject: Re: [Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR
> > Translator have problem )
> >
> > The problem is:
> >
> > If you place AFR on the client, how do the servers get the log file during
> > recovery operations?
> >
> > Regards, Angel
> >
> >
> > On Thursday, 17 January 2008 10:44, Gareth Bult wrote:
> > > Hi,
> > >
> > > Yes, I would agree these changes would improve the current
> > > implementation.
> > >
> > > However, a "better" way would be for the client, on failing to write
> > > to ONE of the AFR volumes, to write the change to a logfile on the remaining
> > > volumes .. then for the recovering server to playback the logfile when it
> > > comes back up, or to recopy the file if there are insufficient logs or if
> > > the file has been erased.
> > >
> > > This would "seem" to be a very simple implementation ..
> > >
> > > Client;
> > >
> > > Write to AFR
> > > If Fail then
> > >   if log file does not exist create log file
> > >   Record, file, version, offset, size, data in logfile
> > >
> > > On server;
> > >
> > > When recovering;
> > >
> > >   for each entry in logfile
> > >      if file age > most recent transaction
> > >         re-copy whole file
> > >      else
> > >         replay transaction
> > >
> > >   if all volumes "UP", remove logfile
> > >
> > > ?????
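
A minimal sketch of the logfile scheme outlined above, written as an in-memory Python model rather than against any real GlusterFS interface. The names `Volume`, `LogEntry`, `afr_write` and `recover` are hypothetical, and so is the use of a per-file version counter to decide between replaying a logged write and re-copying the whole file. It also assumes a two-volume AFR, so the log can be dropped as soon as the one stale volume has caught up.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    path: str
    version: int   # file version on the healthy volume after this write
    offset: int
    data: bytes


@dataclass
class Volume:
    """In-memory stand-in for one AFR subvolume (hypothetical)."""
    up: bool = True
    files: dict = field(default_factory=dict)     # path -> bytearray
    versions: dict = field(default_factory=dict)  # path -> int
    log: list = field(default_factory=list)       # writes a down peer missed

    def write(self, path, offset, data):
        buf = self.files.setdefault(path, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # pad a sparse gap
        buf[offset:offset + len(data)] = data
        self.versions[path] = self.versions.get(path, 0) + 1


def afr_write(volumes, path, offset, data):
    """Client side: write to every live volume, log if any volume is down."""
    live = [v for v in volumes if v.up]
    for v in live:
        v.write(path, offset, data)
    if len(live) < len(volumes):
        for v in live:
            v.log.append(LogEntry(path, v.versions[path], offset, data))


def recover(stale, healthy):
    """Server side: replay the log, or re-copy when replay is not possible."""
    for e in healthy.log:
        have = stale.versions.get(e.path, 0)
        if e.path in stale.files and have == e.version - 1:
            stale.write(e.path, e.offset, e.data)      # replay transaction
        elif have < e.version:
            # File erased or log has a gap: re-copy the whole file.
            stale.files[e.path] = bytearray(healthy.files[e.path])
            stale.versions[e.path] = healthy.versions[e.path]
    stale.up = True
    healthy.log.clear()  # all volumes are "UP" again, remove the logfile
```

In this model the file on the healthy volume stays readable and writable throughout `recover`, which is the benefit claimed for the approach; a real implementation would also have to handle writes that arrive while the replay is still running.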
> > >
> > > One of the REAL benefits of this is that the file is still available
> > > DURING a heal operation.
> > > At the moment a HEAL only takes place when a file is being opened, and
> > > while the copy is taking place the file blocks ...
> > >
> > > Gareth.
> > >
> > > ----- Original Message -----
> > > From: "Angel" <clist at uah.es>
> > > To: "Gareth Bult" <gareth at encryptec.net>
> > > Cc: gluster-devel at nongnu.org
> > > Sent: 17 January 2008 08:47:06 o'clock (GMT) Europe/London
> > > Subject: New IDEA: The Checksumming xlator ( AFR Translator have
> > > problem )
> > >
> > > Hi Gareth
> > >
> > > You said it!!, gluster is revolutionary!!
> > >
> > > AFR does a good job, we only have to help AFR be a better guy!!
> > >
> > > What we need is a checksumming translator!!
> > >
> > > Suppose you have your posix volumes A and B on different servers.
> > >
> > > So you are using AFR(A,B) on the client.
> > >
> > > One of your AFR'ed nodes (A) fails and some time later comes back to
> > > life, but its backend filesystem got trashed and fsck'ed, and now there
> > > may be subtle differences in the files inside.
> > >
> > > Your beloved 100GB XEN files now don't match between your "faulty" A node
> > > and your fresh B node!!
> > >
> > > AFR would notice this by means (I think) of xattrs on both files,
> > > that is, VERSION(FILE on node A) != VERSION(FILE on node B) or something
> > > like that.
> > >
> > > But the real problem, as you pointed out, is that AFR only knows that the
> > > files don't match, so it has to copy every byte of your 100GB image from
> > > B to A (automatically on self-heal or on file access).
> > >
> > > That's many GBs (maybe PBs) going back and forth over the net. THIS
> > > IS VERY EXPENSIVE, as we all know.
> > >
> > > Enter the Checksumming xlator (SHA1 or MD5, or maybe MD4, as rsync seems
> > > to use that without any problem).
> > >
> > > The checksumming xlator sits atop your posix modules on every node.
> > > Whenever you request the xattr SHA1[block_number] on a file, the
> > > checksumming xlator intercepts the call, reads block number
> > > "block_number" from the file, calculates its SHA1 and returns it as an
> > > xattr key:value pair.
> > >
> > > Now AFR can request SHA1 blockwise from both servers and update only
> > > those blocks whose SHA1 doesn't match.
> > >
> > > With a decent block size we can save a lot of traffic on every
> > > transaction.
> > >
> > > -- If your faulty node lost its contents, you have to copy the
> > > whole 100GB XEN files again.
> > > -- If only some SHA1s mismatch, AFR can update just the differences,
> > > saving a lot of resources, like rsync does.
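
The block-wise comparison described above can be sketched roughly as follows. This is only an illustration on in-memory byte buffers, not a translator: `block_sha1` stands in for the proposed SHA1[block_number] xattr lookup, `heal_blockwise` stands in for what AFR would do with the answers, and both names as well as the tiny `BLOCK_SIZE` are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny for the demo; a real block would be far larger


def block_sha1(data, block_number, block_size=BLOCK_SIZE):
    """Hash one block of a file, as the xlator would on an xattr request."""
    start = block_number * block_size
    return hashlib.sha1(data[start:start + block_size]).hexdigest()


def heal_blockwise(good, stale, block_size=BLOCK_SIZE):
    """Copy only mismatching blocks from `good` into the bytearray `stale`.

    Returns the number of bytes that had to be transferred.
    """
    copied = 0
    nblocks = (len(good) + block_size - 1) // block_size
    for n in range(nblocks):
        if block_sha1(good, n, block_size) != block_sha1(stale, n, block_size):
            start = n * block_size
            block = good[start:start + block_size]
            stale[start:start + len(block)] = block
            copied += len(block)
    del stale[len(good):]  # drop any trailing bytes the good copy lacks
    return copied
```

Only the digests and the mismatching blocks would cross the network; if the stale copy is empty, every block mismatches and the whole file is copied, which matches the first case listed above.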
> > >
> > > A more advanced feature would be to incorporate the xdelta library
> > > functions, making it possible to generate binary patches against files...
> > >
> > > Now we only need someone to implement this xlator :-)
> > >
> > > Regards
> > >
> > > On Thursday, 17 January 2008 01:49, wrote:
> > > > Mmm...
> > > >
> > > > There are a couple of real issues with self heal at the moment that
> > > > make it a minefield for the inexperienced.
> > > >
> > > > Firstly there's the mount bug .. if you have two servers and two
> > > > clients, and one AFR, there's a temptation to mount each client against a
> > > > different server. Which initially works fine .. right up until one of the
> > > > glusterfsd's ends .. when it still works fine. However, when you restart the
> > > > failed glusterfsd, one client will erroneously connect to it (or this is my
> > > > interpretation of the net effect), regardless of the fact that self-heal has
> > > > not taken place .. and because it's out of sync, doing a "head -c1" on a
> > > > file you know has changed gets you nowhere. So essentially you need to
> > > > remount clients against non-crashed servers before starting a crashed server
> > > > .. which is not nice. (this is a filed bug)
> > > >
> > > > Then we have us poor XEN users who store 100Gb's worth of XEN images
> > > > on a gluster mount .. which means we can live migrate XEN instances between
> > > > servers .. which is fantastic. However, after a server config change or a
> > > > server crash, it means we need to copy 100Gb between the servers .. which
> > > > wouldn't be so bad if we didn't have to stop and start each XEN instance in
> > > > order for self heal to register the file as changed .. and while self-heal
> > > > is re-copying the images, they can't be used, so you're looking at 3-4 mins
> > > > of downtime per instance.
> > > >
> > > > Apart from that (!) I think gluster is a revolutionary filesystem
> > > > and will go a long way .. especially if the bug list shrinks .. ;-)
> > > >
> > > > Keep up the good work :)
> > > >
> > > > [incidentally, I now have 3 separate XEN/gluster server stacks, all
> > > > running live-migrate - it works!]
> > > >
> > > > Regards,
> > > > Gareth.
> > > >
> > >
> >
>
> --
> Don't be shive by the tone of my voice. Just got my new weapon, weapon of
> choice...
> ->>--------------------------------------------------
>
> Angel J. Alvarez Miguel, Sección de Sistemas
> Area de Explotación y Seguridad Informática
> Servicios Informaticos, Universidad de Alcalá (UAH)
> Alcalá de Henares 28871, Madrid  ** ESPAÑA **
> Tfno: +34 91 885 46 32 Fax: 91 885 51 12
>
> ------------------------------------[www.uah.es]-<<--
> "No va mas señores..."
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>



-- 
If I traveled to the end of the rainbow
As Dame Fortune did intend,
Murphy would be there to tell me
The pot's at the other end.


