[Gluster-users] How reliable is XFS under Gluster?

Kal Black kaloblak at gmail.com
Mon Dec 9 14:51:41 UTC 2013


Thank you all for the wonderful input,
I haven't used XFS extensively so far, and my concerns primarily came from
reading an article (mostly the discussion after it) by Jonathan Corbet
on LWN (http://lwn.net/Articles/476263/) and another one,
http://toruonu.blogspot.ca/2012/12/xfs-vs-ext4.html. Both are
relatively recent, and I was under the impression that XFS still has
problems, in certain cases of power loss, where the metadata and the actual
data end up out of sync, which might lead to existing data being corrupted.
But again, as Paul Robert Marino pointed out, choosing the right IO
scheduler can greatly reduce the risk of this happening.


On Sun, Dec 8, 2013 at 11:04 AM, Paul Robert Marino <prmarino1 at gmail.com> wrote:

> XFS is fine. I've been using it on various distros in production for
> over a decade now, and I've rarely had any problems with it; when I
> have, they have been trivial to fix, which is something I honestly can't
> say about ext3 or ext4.
>
> Usually, when there is a power failure during a write and the
> transaction wasn't completely committed to the disk, it is rolled back
> via the journal. The one exception to this is when you have a
> battery-backed cache whose battery discharges before power is restored,
> or a very cheap consumer-grade disk that uses its cache for writes and
> lies about the sync state.
> In either of these scenarios, any file system will have problems.
>
> Of all the filesystems I've worked with, XFS generally handles
> the battery-discharge scenario the most cleanly and is the easiest to
> recover.
> If you hit the second scenario, with cheap disks whose cache lies,
> nothing will help you, not even an fsync, because the hardware lies.
> Also, the subject of fsync is a little more complicated than most
> people think: there are several kinds of fsync, and each behaves
> differently on different filesystems. PostgreSQL has documentation
> about it here:
> http://www.postgresql.org/docs/9.1/static/runtime-config-wal.html
> Look at wal_sync_method if you would like a better idea of how
> fsync works without getting too deep into the subject.
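>
> As a rough illustration (a sketch for this thread, not code taken from
> the PostgreSQL docs; "demo.dat" is just a placeholder path), the
> wal_sync_method choices boil down to a handful of Linux primitives:
>
>     /* Three common ways to force a write to stable storage on Linux,
>      * roughly matching PostgreSQL's fdatasync, fsync and open_datasync
>      * settings. */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         const char rec[] = "important record\n";
>
>         /* 1) buffered write + fdatasync: flushes the data plus only the
>          *    metadata needed to read it back. */
>         int fd = open("demo.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
>         if (fd < 0 || write(fd, rec, sizeof rec - 1) < 0 || fdatasync(fd) < 0)
>             perror("fdatasync path");
>         close(fd);
>
>         /* 2) buffered write + fsync: also flushes inode metadata such as
>          *    timestamps. */
>         fd = open("demo.dat", O_WRONLY | O_APPEND);
>         if (fd < 0 || write(fd, rec, sizeof rec - 1) < 0 || fsync(fd) < 0)
>             perror("fsync path");
>         close(fd);
>
>         /* 3) O_DSYNC: every write() returns only once the data is on
>          *    stable storage (the open_datasync method). */
>         fd = open("demo.dat", O_WRONLY | O_APPEND | O_DSYNC);
>         if (fd < 0 || write(fd, rec, sizeof rec - 1) < 0)
>             perror("O_DSYNC path");
>         close(fd);
>         return 0;
>     }
>
> Which of these is cheapest, and whether the disk actually honours the
> flush, depends on the filesystem and the cache setup, which is exactly
> why PostgreSQL makes it configurable.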
>
> By the way, most apps don't need to do fsyncs, and it would bring your
> system to a crawl if they all did, so take claims that
> all programs should fsync with a grain of salt.
>
> In most cases when these problems come up, the real issue is that the
> right IO scheduler wasn't set for what the server does. For example, CFQ,
> which is the EL default, can leave your write in RAM cache for quite a
> while before sending it to disk in an attempt to optimize your IO;
> the deadline scheduler will also attempt to optimize your IO, but it
> will predictably sync it to disk after a period of time, regardless of
> whether it was able to fully optimize it or not. There is also noop,
> which does no optimization at all and leaves everything to the
> hardware; this is common and recommended for VMs, and there is some
> argument for using it with high-end RAID controllers for things like
> financial data, where you need to absolutely ensure the writes happen
> ASAP because there may be fines or other large penalties if you lose
> any data.
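>
> For what it's worth, the per-device scheduler is just a sysfs knob.
> Something like the following sketch (assuming a device named sda, and
> root privileges for the write) lists the available schedulers and
> switches that one device to deadline:
>
>     /* Read and set the I/O scheduler for one block device via sysfs.
>      * "sda" is a placeholder; writing requires root. */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         const char *path = "/sys/block/sda/queue/scheduler";
>         char line[256];
>
>         /* Prints something like "noop deadline [cfq]"; the bracketed
>          * entry is the scheduler currently in use. */
>         FILE *f = fopen(path, "r");
>         if (!f || !fgets(line, sizeof line, f)) {
>             perror(path);
>             return 1;
>         }
>         fclose(f);
>         printf("schedulers: %s", line);
>
>         /* Writing a name selects that scheduler for this device only;
>          * to make it persistent people typically use the elevator=
>          * kernel parameter or a tuned/udev rule. */
>         f = fopen(path, "w");
>         if (!f || fputs("deadline\n", f) == EOF) {
>             perror("set scheduler");
>             return 1;
>         }
>         fclose(f);
>         return 0;
>     }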
>
>
>
> On Sat, Dec 7, 2013 at 3:04 AM, Franco Broi <Franco.Broi at iongeo.com>
> wrote:
> > Been using ZFS for about 9 months and am about to add another 400TB, no
> > issues so far.
> >
> > On 7 Dec 2013 04:23, Brian Foster <bfoster at redhat.com> wrote:
> > On 12/06/2013 01:57 PM, Kal Black wrote:
> >> Hello,
> >> I am at the point of picking a FS for new brick nodes. I had been
> >> happy using ext4 until now, but I recently read about an issue,
> >> introduced by a patch in ext4, that breaks the distributed translator.
> >> At the same time, it looks like the recommended FS for a brick is no
> >> longer ext4 but XFS, which apparently will also be the default FS in
> >> the upcoming Red Hat 7. On the other hand, XFS is known as a file
> >> system that can be easily corrupted (zeroing files) in case of a power
> >> failure. Supporters of the file system claim that this should never
> >> happen if an application has been properly coded (properly
> >> committing/fsync-ing data to storage) and the storage itself has been
> >> properly configured (disk cache disabled on individual disks and
> >> battery-backed cache used on the controllers). My question is, should
> >> I be worried about losing data in a power failure or similar scenarios
> >> (or any) using GlusterFS and XFS? Are there best practices for setting
> >> up a Gluster brick + XFS? Has the ext4 issue been reliably fixed? (My
> >> understanding is that this is impossible unless ext4 itself is
> >> modified to work properly with Gluster.)
> >>
> >
> > Hi Kal,
> >
> > You are correct in that Red Hat recommends using XFS for gluster bricks.
> > I'm sure there are plenty of ext4 (and other fs) users as well, so other
> > users should chime in as far as real experiences with various brick
> > filesystems go. Also, I believe the dht/ext issue has been resolved
> > for some time now.
> >
> > With regard to "XFS zeroing files on power failure," I'd suggest you
> > check out the following blog post:
> >
> >
> > http://sandeen.net/wordpress/computers/xfs-does-not-null-files-and-requires-no-flux/
> >
> > My cursory understanding is that there were apparently situations where
> > the inode size of a recently extended file would be written to the log
> > before the actual extending data is written to disk, thus creating a
> > crash window where the updated size would be seen, but not the actual
> > data. In other words, this isn't a "zeroing files" behavior in as much
> > as it is an ordering issue with logging the inode size. This is probably
> > why you've encountered references to fsync(), because with the fix your
> > data is still likely lost (unless/until you've run an fsync to flush to
> > disk), you just shouldn't see the extended inode size unless the actual
> > data made it to disk.
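> >
> > To put that in application terms (an illustrative sketch, not anything
> > from the XFS code; "logfile" is a placeholder name), the only way to
> > guarantee that both the new data and the new size of an extended file
> > survive a crash is to fsync after the write; the fix only keeps the
> > on-disk size from getting ahead of the data:
> >
> >     /* Appending extends the file. Without the fsync the new data may
> >      * still be lost in a crash; you just shouldn't see a stale larger
> >      * size with zeroes behind it. */
> >     #include <fcntl.h>
> >     #include <stdio.h>
> >     #include <unistd.h>
> >
> >     int main(void)
> >     {
> >         int fd = open("logfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
> >         if (fd < 0) { perror("open"); return 1; }
> >
> >         const char rec[] = "appended record\n";
> >         if (write(fd, rec, sizeof rec - 1) < 0) { perror("write"); return 1; }
> >
> >         /* Make the data and the inode size durable together. */
> >         if (fsync(fd) < 0) { perror("fsync"); return 1; }
> >
> >         close(fd);
> >         return 0;
> >     }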
> >
> > Also note that this was fixed in 2007. ;)
> >
> > Brian
> >
> >> Best regards
> >>
> >>
> >>