[Gluster-devel] Performance Translators' Stability and Usefulness

Alpha Electronics myitouchs at gmail.com
Mon Jul 6 22:26:47 UTC 2009


I recommended GlusterFS to my client without reservation, but got pissed off
because bugs were found from time to time and too much time was wasted tracing
the source of the problems - and the Gluster team is also hiding the problems.

For an actual example, check this link:
http://www.gluster.org/docs/index.php?title=Translators_options&diff=4891&oldid=4799
GlusterFS has autoscaling issues, but the Gluster team never made them public.
They just hid them quietly by removing the autoscaling section from the wiki.
Those of us following the previous version of that wiki page spent a lot of
time and energy and learned about the problem the hard way.

John

On Mon, Jul 6, 2009 at 10:20 AM, Geoff Kassel <gkassel at users.sourceforge.net> wrote:

> Hi Anand,
>   Thank you for your explanation. I appreciate the circumstances you're in
> -
> I'm in a not-too-dissimilar environment myself.
>
>   If you don't mind taking some more advice - would you consider taking down
> your current QA process document? It does not seem to be an accurate
> representation of your QA process at all.
>
>   Alternatively, you could document what you really do, and then try to
> improve on it - a technique common to many quality management
> methodologies.
> If that doesn't look so good at first - well, you don't have to publish it
> openly. You're running an open source project, people are prepared for
> things
> to be a bit rough and ready. Just don't make representations that it's
> otherwise. (Major version numbers and marketing spiel are what I'm talking
> about here.)
>
>   Misleading people - intentionally or otherwise - kills community support
> and commercial trust in your product fast. Open source projects in
> particular
> need to be more open than purely commercial efforts, because not only do
> you
> lose users, you lose current and potential developers when this happens.
>
>   On the code front - can you please start using code comments? It's really
> hard to follow the purpose of some parts of the code otherwise, and that
> makes it difficult for those in the community to help you fix problems or
> provide new functionality. After all, isn't getting the community to help
> write and debug the software part of the cost effectiveness of the open
> source development technique?
>
>   (I understand that there may be language issues at stake here. But this
> is
> the era of automatic translation, after all - hackers like me will get
> along
> okay so long as we can get the gist :)
>
>   Please don't be afraid to use code quality analysis tools, even if they
> do
> insert some less-than-attractive comments. Tools like RATS and FlawFinder
> are
> free, they catch a lot of potential and actual stability and security
> issues,
> and can be partially automated as part of wider testing frameworks.
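>
>   (As a sketch only - a contrived C fragment with a hypothetical helper,
> not taken from the GlusterFS tree - showing the kind of unbounded string
> call those tools flag, and the bounded form they generally accept:)
>
>     #include <stdio.h>
>     #include <limits.h>
>
>     /* Hypothetical helper: build an export path from an untrusted name.
>      * flawfinder/RATS would flag an unbounded strcpy()/sprintf() here;
>      * the bounded snprintf() below is the usual fix they suggest. */
>     static void
>     build_export_path (char *out, size_t outlen, const char *volname)
>     {
>             snprintf (out, outlen, "/export/%s", volname);
>     }
>
>     int
>     main (void)
>     {
>             char path[PATH_MAX];
>
>             build_export_path (path, sizeof (path), "test-volume");
>             printf ("%s\n", path);
>             return 0;
>     }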
>
>   GlusterFS should be eligible to sign up to use Coverity's scan for free.
> It's a highly recommended static analysis tool, and if you make use of the
> results, there are some quite dramatic gains in stability and reliability
> to
> be made.
>
>   Also having a general look over the code every now and then will do
> wonders
> for these aspects as well - look at the security record of OpenBSD to see
> how
> effective code audits can be.
>
>   On the testing framework front - I know how hard it is to start writing
> unit and regression tests for a project already under way. The answer I've
> found to this is to get developers writing tests for the new functionality
> they write, as they write it. (Leaving it to later - say, for the QA team
> to
> do - makes this process a lot more difficult, as I've found.) This
> documents
> in live code how the system should work, and if run whenever changes to
> that
> functionality are made, detects breakages fast.
>
>   When the QA team or the community uncovers a bug, get the QA team to
> write
> a test case covering that issue, documenting (again in live code) what the
> correct behaviour should be. Between these two activities, the coverage of
> the testing framework will improve in leaps and bounds.
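>
>   (Again just a sketch, in C with hypothetical names - not GlusterFS code -
> of what such a live-code test case might look like: a plain assert-based
> program that pins down the expected behaviour and the reported failure
> mode, and can be run on every build:)
>
>     #include <assert.h>
>     #include <stdio.h>
>     #include <string.h>
>
>     /* Hypothetical unit under test - stands in for whatever function the
>      * new feature or the bug report concerns. */
>     static int
>     join_path (char *out, size_t outlen, const char *dir, const char *name)
>     {
>             int n = snprintf (out, outlen, "%s/%s", dir, name);
>             return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
>     }
>
>     int
>     main (void)
>     {
>             char buf[16];
>
>             /* expected behaviour, documented in live code */
>             assert (join_path (buf, sizeof (buf), "/mnt", "file") == 0);
>             assert (strcmp (buf, "/mnt/file") == 0);
>
>             /* the reported bug: an over-long name must fail cleanly,
>              * not overflow the buffer */
>             assert (join_path (buf, sizeof (buf), "/mnt", "over-long-name") == -1);
>
>             printf ("all tests passed\n");
>             return 0;
>     }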
>
>   Over time, you'll develop a full regression testing suite, which, if run
> before major releases (if not before each repository commit), will save a
> lot
> of time and embarrassment when the occasional bug pops up to affect older
> features negatively or cause known bugs to resurface.
>
>   Thank you for listening to me, and I hope this advice proves useful to
> you.
>
> Geoff.
>
> On Mon, 6 Jul 2009, Anand Babu Periasamy wrote:
> > Gordon, Geoff, Fillipe,
> >
> > We are sorry! We admit we had a rough and difficult past.
> >
> > Here are the reasons, why it was difficult for us:
> > * Limited staff and QA environment.
> > * GlusterFS is a programmable file system. It supports many OS distros,
> > applications, and hardware and storage architectures. It was impossible to
> > QA all possible combinations. What we declared stable was just one of many
> > such use-cases.
> > * Poor documentation.
> >
> > We are now VC funded. We have increased the size of our team and hardware
> > lab significantly. 2.0 is an outcome of this investment. 2.0.3, scheduled
> > for this week, will be a lot more stable. A dedicated technical writer is
> > now working on an improved version of our installation guide. We are going
> > to templatize stable GlusterFS configurations through a tool for generating
> > and managing volume spec files. GlusterSP (storage platform) will
> > completely automate the installation and management of a ruggedized
> > release of GlusterFS in an embedded OS form. The first GlusterSP 2010 beta
> > will be out in 2 months. With its web-based UI and pre-configured system
> > image, a number of error factors will be reduced.
> >
> > We are constantly learning and improving. You are making a valuable
> > contribution by constructively criticizing us with details and proposals.
> > We take them seriously and positively.
> >
> > Happy Hacking,
> > --
> > Anand Babu Periasamy
> > GPG Key ID: 0x62E15A31
> > Blog [http://unlocksmith.org]
> > GlusterFS [http://www.gluster.org]
> > GNU/Linux [http://www.gnu.org]
> >
> > Geoff Kassel wrote:
> > > Hi Gordan,
> > >
> > >> What is production unready (more than Gluster) about PeerFS or
> SeznamFS?
> > >
> > > Well, I'm mostly going by your email from a few months ago comparing
> > > these. Your needs are not that dissimilar to mine.
> > >
> > > I see on the SeznamFS project page that there's now apparently support
> > > for master-master replication 'MySQL' style - with the limitations of
> > > MySQL's master-master replication.
> > >
> > > However, I can't seem to find out exactly what those limitations entail
> -
> > > or how to set it up in this mode. (And I am looking for a system that
> > > would allow more than two masters/peers, which is why I passed over
> DRBD
> > > for GlusterFS originally.)
> > >
> > > I can't get even the PeerFS web page to load. That's a disturbing sign
> to
> > > me.
> > >
> > >> You can fail over NFS servers. If the servers themselves are mirrored
> > >> (DRBD) and/or have a shared file system, NFS should be able to handle
> > >> the IP being migrated between servers. I've found this tends to work
> > >> better with NFS over UDP, provided you have a network that doesn't
> > >> normally suffer packet loss.
> > >
> > > Sorry, thought you were talking about NFS exports from just one local
> > > drive/RAID array.
> > >
> > > My leading fallback option for when I give up on Gluster is pretty much
> > > exactly what you've just described. However - I have the same
> (potential)
> > > issue as you with DRBD and WANs looming over my project i.e. the
> eventual
> > > need to run masters/peers in geographically distributed sites.
> > >
> > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > >
> > > I don't use a vanilla kernel. I use a 'hardened' kernel patched with
> PaX
> > > and a few other security systems, to protect against stack smashing
> > > attacks and other nasties. (Just a little bit of extra, relative
> > > security, to make would-be attackers go after softer targets.)
> > >
> > > PaX is especially intolerant of memory faults in general, which is
> where
> > > my efforts in patching GlusterFS were focused. (And yes, I have
> disabled
> > > PaX features for Gluster. No, it didn't improve anything.)
> > >
> > > When I was looking into GFS, I found that the GFS patches (perhaps I
> was
> > > looking at v2) didn't work with the hardened patchset. GlusterFS had
> more
> > > promise than GFS anyway, so I went with GlusterFS.
> > >
> > >>> An older version of GlusterFS - as buggy as it is for me - is
> > >>> unfortunately still the best option.
> > >>
> > >> Out of interest, what was the last version of Gluster you deemed
> > >> completely stable?
> > >
> > > What works for me, with only (only!) a few crashes a day and no apparent
> > > data corruption, is 1.4.0tla849. TLA 636 worked a little better for me -
> > > only random crashes once in a while. (But again - backwards-incompatible
> > > changes had crept in between the two versions, so I couldn't go back.)
> > >
> > > I had much better stability with the earlier 1.3 releases. I can't
> > > remember exactly which ones now. (I suspect it was 1.3.3, but I'm no
> > > longer sure.) It's been quite a while.
> > >
> > >> I don't agree on that particular point, since the last outstanding bug
> > >> I'm seeing with any significant frequency in my use case is the one of
> > >> having to wait for a few seconds for the FS to settle after mounting
> > >> before doing anything or the operation fails. And to top it off, I've
> > >> just had it succeed without the wait. That seems quite
> > >> heisenbuggy/racy to me. :)
> > >
> > > Sorry, I was talking about the data corruption bugs. Not your
> > > first-access issue.
> > >
> > >> That doesn't help - the first-access-settle-time bug has been around
> for
> > >> a very long time. ;)
> > >
> > > Indeed.
> > >
> > > It's my hope that once testing frameworks (and syslog logging, in your
> > > case) are made available to the community, people like us can attempt
> to
> > > debug our systems with some degree of confidence that we're not causing
> > > other subtle issues with our patches.
> > >
> > > That's got to be better for the project as a whole.
> > >
> > > Geoff.
> > >
> > > On Sun, 5 Jul 2009, Gordan Bobic wrote:
> > >> Geoff Kassel wrote:
> > >>>> Sounds like a lot of effort and micro-downtime compared to a
> migration
> > >>>> to something else. Have you explored other options like PeerFS, GFS
> > >>>> and SeznamFS? Or NFS exports with failover rather than Gluster
> > >>>> clients, with Gluster only server-to-server?
> > >>>
> > >>> These options are not production ready (as I believe has been pointed
> > >>> out already to the list) for what I need;
> > >>
> > >> What is production unready (more than Gluster) about PeerFS or
> SeznamFS?
> > >>
> > >>> or in the case of NFS, defeating the
> > >>> point of redundancy in the first place.
> > >>
> > >> You can fail over NFS servers. If the servers themselves are mirrored
> > >> (DRBD) and/or have a shared file system, NFS should be able to handle
> > >> the IP being migrated between servers. I've found this tends to work
> > >> better with NFS over UDP, provided you have a network that doesn't
> > >> normally suffer packet loss.
> > >>
> > >>> (Also, GFS is not compatible with the kernel patchset I need to use.)
> > >>
> > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > >>
> > >>> I have tried AFR on the server side and the client side. Both display
> > >>> similar issues.
> > >>>
> > >>> An older version of GlusterFS - as buggy as it is for me - is
> > >>> unfortunately still the best option.
> > >>
> > >> Out of interest, what was the last version of Gluster you deemed
> > >> completely stable?
> > >>
> > >>> (That doesn't mean I can't complain about the lack of progress
> towards
> > >>> stability and reliability, though :)
> > >>
> > >> Heh - and would you believe I just rebooted one of my root-on-glusterfs
> > >> nodes and it came up OK, without the bail-out requiring manual
> > >> intervention that's caused by the bug which makes the first access
> > >> after mounting fail before things have settled.
> > >>
> > >>>> One of the problems is that some tests in this case are impossible
> to
> > >>>> carry out without having multiple nodes up and running, as a number
> of
> > >>>> bugs have been arising in cases where nodes join/leave or cause race
> > >>>> conditions. It would require a distributed test harness, which would
> > >>>> be difficult to implement so that it runs on any client that builds
> > >>>> the binaries. Just because the test harness doesn't ship with the
> > >>>> sources doesn't mean it doesn't exist on a test rig the developers use.
> > >>>
> > >>> Okay, so what about the volume of test cases that can be tested
> without
> > >>> a distributed test harness? I don't see any sign of testing
> mechanisms
> > >>> for that.
> > >>
> > >> That point is hard to argue against. :)
> > >>
> > >>> And wouldn't it be prudent anyway - given how often the GlusterFS devs
> > >>> do not have access to the platform with the reported problem - to
> > >>> provide this harness so that people can generate the appropriate test
> > >>> results the devs need for themselves? (Giving a complete stranger from
> > >>> overseas root access is a legal minefield for those who have to work
> > >>> with data held in confidence.)
> > >>
> > >> Indeed. And shifting test-case VM images tends to be impractical (even
> > >> though I have provided both to the gluster developers in the past for
> > >> specific error-case analysis).
> > >>
> > >>> It's been my impression, though, that the relevant bugs are not
> > >>> heisenbugs or race conditions.
> > >>
> > >> I don't agree on that particular point, since the last outstanding bug
> > >> I'm seeing with any significant frequency in my use case is the one of
> > >> having to wait for a few seconds for the FS to settle after mounting
> > >> before doing anything or the operation fails. And to top it off, I've
> > >> just had it succeed without the wait. That seems quite
> > >> heisenbuggy/racy to me. :)
> > >>
> > >>> (I'm judging that on the speed of the follow-up patch, by the way -
> > >>> race conditions notoriously can take a long time to track down.)
> > >>
> > >> That doesn't help - the first-access-settle-time bug has been around
> for
> > >> a very long time. ;)
> > >>
> > >> Gordan
> > >>
> > >>
> > >> _______________________________________________
> > >> Gluster-devel mailing list
> > >> Gluster-devel at nongnu.org
> > >> http://lists.nongnu.org/mailman/listinfo/gluster-devel
> > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel at nongnu.org
> > > http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
>
>
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>