[Gluster-users] Fwd: files not syncing up with glusterfs 3.1.2

Joe Landman landman at scalableinformatics.com
Mon Feb 21 18:47:13 UTC 2011

On 02/21/2011 01:39 PM, Kon Wilms wrote:
> On Mon, Feb 21, 2011 at 9:45 AM, Steve Wilson<stevew at purdue.edu>  wrote:
>> We had trouble with reliability for small, actively-accessed files on a
>> distribute-replicate volume in both GlusterFS 3.1.1 and 3.1.2.  It seems that
>> the replicated servers would eventually get out of sync with each other on
>> these kinds of files.  For a while, we dropped replication and only ran the
>> volume as distributed.  This has worked reliably for the past week or so
>> without any errors that we were seeing before: no such file, invalid
>> argument, etc.
> I'm running thousands of small files over NFSv3 through NGINX with
> distribute and have had the opposite experience. Unfortunately when
> NGINX can't access a file over NFS it means a customer calling us, so
> right now gluster is basically sitting idle (posted my output to the
> list a while back with no response).

We've had lots of issues with files disappearing or being inaccessible 
prior to 3.1.2 with the NFS client and server translator.  After 3.1.2, 
many of these problems *seem* to have been resolved, though all this 
means in this instance is that the customer hasn't submitted a ticket yet.

I had originally thought it was a timebase issue ... as we had a minute
or two of drift on some of the nodes (since fixed).  But we had a pretty
consistent error in this regard.

We did open problem reports.  Unfortunately, no action so far (they just
closed them this morning, though nothing has been solved per se; the
issue simply has not yet resurfaced).  I'll leave those reports closed
for now.

This said, this error, or one with a very similar signature, has been in 
the code since the 2.x series.  I really ... really want to track it
down, but I can't create a simple reproducer for it to present to the
team.  If you have what you think is a simple reproducer, please email
me offline.  We'll try it here, and if we can get it down to a very
simple reproduction case and test, we'll re-open the bugs.

I'd hate to think it's a heisenbug, but that is where I am leaning now.

