We've been using GlusterFS to manage shared files across a number of hosts
for the past few months and have run into a few problems -- roughly one
per month.  The problems are sometimes extremely difficult to trace back to
GlusterFS, as they often masquerade as something else in our application
log files.  So far we have seen one instance of split-brain, a number of
instances of "stuck" files (i.e. any stat call would block for an hour and
then time out with an error), and a couple of instances of "ghost" files
(a file is removed, but GlusterFS continues to show it for a little while
until the cache times out).
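
To make the "stuck" file symptom easier to spot outside of the application
logs, something like the following probe could be run against a suspect
path.  This is only a sketch: the function name, the 10-second default, and
the /mnt/shared path are placeholders of mine, not anything GlusterFS
provides.  It runs the stat in a child process so that a hung call cannot
block the caller.

```python
import subprocess
import sys


def stat_with_timeout(path, timeout_s=10.0):
    """Stat `path` in a child process; return 'ok', 'error' or 'timeout'."""
    # The child does nothing but a single stat call, so a hang in the
    # filesystem only blocks the child, never this process.
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: we do not wait for the child after the kill,
        # because a process stuck inside the kernel may not exit promptly.
        proc.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real client mount path.
    print(stat_with_timeout("/mnt/shared/some-file", timeout_s=10.0))
```

A result of "timeout" would point at the mount rather than at the
application, which is the distinction we have had trouble making.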

We do *not* place a large amount of load on GlusterFS, and don't have any
significant performance issues to deal with.  With that in mind, the core
question of this e-mail is: "How can I modify our configuration to be the
absolute *most* stable (problem free) that it can be, even if it means
sacrificing performance?"  The GlusterFS bugs that we encounter are quite
problematic, so I'm willing to entertain any suggested stability
improvement, even if it has a negative impact on performance.  I suspect
that the answer here is just "turn off all performance-enhancing gluster
caching", but I wanted to validate that this is actually true before going
that far.  Please suggest anything that could be done to improve the
stability of our setup.

As an aside, I think that this would be a useful thing to add to the FAQ.
Right now the FAQ contains information for *performance* tuning, but not
for *stability* tuning.

Thanks for any help or suggestions that you can offer.

Here are the details of our environment:

GlusterFS Version: 3.1.5
Mount method: glusterfsd/FUSE
GlusterFS Servers: web01, web02
GlusterFS Clients: web01, web02, dj01, dj02

$ sudo gluster volume info

Volume Name: shared-application-data
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Brick1: web01:/var/glusterfs/bricks/shared
Brick2: web02:/var/glusterfs/bricks/shared
Options Reconfigured:
network.ping-timeout: 5
nfs.disable: on

Configuration File Contents:
volume shared-application-data-client-0
    type protocol/client
    option remote-host web01
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5

volume shared-application-data-client-1
    type protocol/client
    option remote-host web02
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5

volume shared-application-data-replicate-0
    type cluster/replicate
    subvolumes shared-application-data-client-0

volume shared-application-data-write-behind
    type performance/write-behind
    subvolumes shared-application-data-replicate-0

volume shared-application-data-read-ahead
    type performance/read-ahead
    subvolumes shared-application-data-write-behind

volume shared-application-data-io-cache
    type performance/io-cache
    subvolumes shared-application-data-read-ahead

volume shared-application-data-quick-read
    type performance/quick-read
    subvolumes shared-application-data-io-cache

volume shared-application-data-stat-prefetch
    type performance/stat-prefetch
    subvolumes shared-application-data-quick-read

volume shared-application-data
    type debug/io-stats
    subvolumes shared-application-data-stat-prefetch

volume management
    type mgmt/glusterd
    option working-directory /etc/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
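
For reference, here is a rough sketch of how the performance translators in
a volfile like the one above could be listed, together with the commands I
would *guess* at for turning them off.  The helper names are my own.  Later
GlusterFS releases expose on/off options named performance.<translator>,
but I have not verified that 3.1.5 accepts them, so please treat the
generated commands as an assumption to be confirmed, not as known-good
syntax.

```python
SAMPLE_VOLFILE = """\
volume shared-application-data-replicate-0
    type cluster/replicate
    subvolumes shared-application-data-client-0

volume shared-application-data-write-behind
    type performance/write-behind
    subvolumes shared-application-data-replicate-0

volume shared-application-data-quick-read
    type performance/quick-read
    subvolumes shared-application-data-write-behind
"""


def performance_translators(volfile_text):
    """Return (volume_name, translator) pairs for each performance/* volume."""
    found = []
    current = None
    for line in volfile_text.splitlines():
        words = line.split()
        if len(words) < 2:
            continue
        if words[0] == "volume":
            current = words[1]
        elif words[0] == "type" and current is not None:
            if words[1].startswith("performance/"):
                found.append((current, words[1].split("/", 1)[1]))
    return found


def candidate_commands(volume, volfile_text):
    """Build one 'volume set ... off' command string per translator found.

    ASSUMPTION: an option called performance.<translator> exists and takes
    on/off. This is unverified for GlusterFS 3.1.5.
    """
    return [
        "gluster volume set %s performance.%s off" % (volume, name)
        for _, name in performance_translators(volfile_text)
    ]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed against the cluster.
    for command in candidate_commands("shared-application-data", SAMPLE_VOLFILE):
        print(command)
```

The script only prints command strings, so it is safe to run anywhere.  If
these options are not valid on 3.1.5, I'd appreciate a pointer to the
correct way to drop the performance translators from the client graph.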

