[Gluster-users] "mismatching layouts" errors after expanding volume

Dan Bretherton d.a.bretherton at reading.ac.uk
Sat Feb 25 15:12:19 UTC 2012

We traced the problem to an array bounds error in a FORTRAN program, 
which caused it to spontaneously produce a 250GB sparse file.  We 
think glusterd crashed when the program was terminated halfway through, 
but there is evidence in the client logs that it stopped responding 
temporarily when the program was allowed to run to completion.  I don't 
think that particular problem will happen again now that we know what 
causes it, but the user whose program was responsible said that 
GlusterFS should be able to handle that sort of problem because memory 
errors are common in a research environment, and the fact that GlusterFS 
can't handle it means it is not very robust.  I won't say whether I 
agree or disagree with that statement, but I promised to pass it on to 
the developers.
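For what it's worth, the mechanism is easy to reproduce at a small 
scale: a single write past end-of-file leaves a file whose apparent size 
is huge but whose allocated block count is tiny.  A minimal sketch (the 
1 GiB size and /tmp path are just for illustration):

```shell
# Write one byte at an offset just under 1 GiB; the filesystem allocates
# almost nothing, but the apparent size becomes 1 GiB -- the same
# mechanism that let the runaway FORTRAN write produce a 250GB file.
dd if=/dev/zero of=/tmp/sparse_demo bs=1 count=1 seek=1073741823 2>/dev/null
stat -c 'apparent=%s bytes, allocated=%b blocks' /tmp/sparse_demo
```

du -h on the result reports only a few KB, even though ls -lh shows 1GB.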

The reason I originally linked this problem to a recent volume expansion 
is that I have been seeing layout related errors in the client and NFS 
logs since that expansion.  Here is a sample from two separate clients 
dating back to a few minutes after the add-brick operation was carried out.

[2012-02-07 14:47:07.462200] W 
[fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: 
/users/tjp/QuikSCAT_daily_v4/daily2hrws20040619v4.nc: no gfid found
[2012-02-07 14:47:07.483662] W 
[fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: 
/users/tjp/QuikSCAT_daily_v4/mingmt: no gfid found
[2012-02-07 14:47:25.312719] I 
[dht-layout.c:682:dht_layout_dir_mismatch] 1-atmos-dht: subvol: 
atmos-replicate-4; inode layout - 1651910495 - 1982292593; disk layout - 
2761050402 - 3067833779
[2012-02-07 14:47:25.312917] I [dht-common.c:524:dht_revalidate_cbk] 
1-atmos-dht: mismatching layouts for /users
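If I've understood DHT correctly, each layout range is a slice of the 
32-bit hash space (0 to 2^32-1), divided roughly evenly among the 
subvolumes.  The widths of the two ranges in the message above appear to 
match an even 13-way split (the cached inode layout) versus a 14-way 
split (the on-disk layout), which is what I'd expect after one replica 
pair was added.  A quick check with shell arithmetic:

```shell
# An even N-way split of the 32-bit hash space gives slices of
# (2^32 / N) hash values each.
SPACE=$((1 << 32))
echo "13-way slice: $((SPACE / 13)) hash values"
echo "14-way slice: $((SPACE / 14)) hash values"

# The inode layout in the log spans 330382099 values (a 13-way split);
# the disk layout spans 306783378 values (a 14-way split).
echo "inode range width: $((1982292593 - 1651910495 + 1))"
echo "disk range width:  $((3067833779 - 2761050402 + 1))"
```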

[2012-02-07 15:16:41.256252] I [dht-layout.c:581:dht_layout_normalize] 
0-atmos-dht: found anomalies in /. holes=1 overlaps=0
[2012-02-07 15:16:41.256285] I 
[dht-common.c:362:dht_lookup_root_dir_cbk] 0-atmos-dht: fixing 
assignment on /
[2012-02-07 19:12:58.451690] W 
[fuse-resolve.c:273:fuse_resolve_deep_cbk] 0-fuse: /users/*: no gfid found
[2012-02-07 19:12:58.466364] W 
[fuse-resolve.c:273:fuse_resolve_deep_cbk] 0-fuse: /users/*: no gfid found
[2012-02-07 19:12:58.495660] W 
[fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: /users/lsrf: no gfid found

Unlike the sparse file incident this week, there were no accompanying 
error messages reporting "subvolumes down" or "no child is up" in the 
period following the volume expansion.  At the time I put the layout 
errors down to the fact that the fix-layout operation had not yet 
completed.  Is that a possible explanation, do you think?  When the 
errors were still occurring several days later I thought perhaps that 
fix-layout hadn't completed properly and ran it again.  Now I am 
wondering if the layout errors following the volume expansion were the 
result of any of the other factors I suggested in my original posting.  
Are any of those the likely cause, do you think?
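For the record, this is how I've been checking whether fix-layout has 
finished (syntax as I understand it for this release of the CLI; the 
brick path below is made up):

```shell
# Ask the rebalance/fix-layout operation for its status from any server:
gluster volume rebalance atmos status

# On each server, a directory's layout slice is stored in an extended
# attribute on the brick directory itself; comparing these across bricks
# shows whether the layouts agree (brick path here is hypothetical):
getfattr -n trusted.glusterfs.dht -e hex /mnt/brick/atmos/users
```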

Dodgy FORTRAN programs aside, the layout errors have not recurred on 
most of the clients since I did the following on Tuesday night.

1) Ran fsck.ext4 on all the bricks
2) Restarted glusterd on all the servers
3) Removed the pair of bricks recently added
4) Added that pair of bricks again, using "bdan14.nerc-essc.ac.uk" 
instead of "bdan14" as one of the host names.
5) Re-ran fix-layout
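For completeness, the commands I mean were along these lines (device 
names and brick paths are made up here, and as far as I know remove-brick 
in this release drops the bricks without migrating data, hence re-running 
fix-layout afterwards):

```shell
# 1) Check the underlying filesystems (with the bricks out of service):
fsck.ext4 -f /dev/sdb1

# 2) Restart the management daemon on every server:
/etc/init.d/glusterd restart

# 3+4) Swap the recently added pair out and back in, using the fully
#      qualified host name the second time:
gluster volume remove-brick atmos bdan14:/export/brick bdan15:/export/brick
gluster volume add-brick atmos \
    bdan14.nerc-essc.ac.uk:/export/brick bdan15:/export/brick

# 5) Recalculate directory layouts across the new set of subvolumes:
gluster volume rebalance atmos fix-layout start
```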

I said the errors had not occurred on _most_ of the nodes.  Last night 
the machine where I was running a self-heal on the volume hung, and 
after restarting it I discovered that the GlusterFS client had suddenly 
reported "11 subvolumes down" followed by multiple reports of 
"anomalies" with a few "failed to get fd ctx. EBADFD" warnings thrown in 
for good measure.  Preceding that there were lots of these...

[2012-02-24 18:27:20.21981] C 
0-atmos-client-27: server has not responded in the 
last 42 seconds, disconnecting.

..followed by lots of these...

[2012-02-24 18:56:48.378887] E [rpc-clnt.c:197:call_bail] 
0-atmos-client-27: bailing out frame type(GlusterFS Handshake) 
op(PING(3)) xid = 0x11551459x sent = 2012-02-24 18:26:35.833590. timeout 
= 1800
[2012-02-24 18:56:48.409872] W [client-handshake.c:264:client_ping_cbk] 
0-atmos-client-27: timer must have expired

.. and then a load of these...

[2012-02-24 18:58:28.782252] C 
[rpc-clnt.c:436:rpc_clnt_fill_request_info] 0-atmos-client-27: cannot 
lookup the saved frame corresponding to xid (11551458)
[2012-02-24 18:58:28.782677] W [socket.c:1327:__socket_read_reply] 0-: 
notify for event MAP_XID failed
[2012-02-24 18:58:28.782738] I [client.c:1883:client_rpc_notify] 
0-atmos-client-27: disconnected

It looks as if there was one of each of the above for all the bricks in 
the volume.  There was no evidence of anything happening on the servers 
during all this and the other clients didn't report any problems, so 
this one client just appears to have gone haywire by itself.  I hope 
this was just a consequence of the DHT self heal process dealing with a 
multitude of errors following the recent upheaval, but right now I'm not 
inclined to disagree with the dodgy FORTRAN program owner's views on the 
robustness of GlusterFS.  I upset him by asking if he had been running 
his dodgy program again...
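Incidentally, the 42 seconds in those disconnect messages is, I believe, 
the default network.ping-timeout, after which a client gives up on a 
server that has not answered its pings.  If the real problem was a 
client too overloaded to process replies, raising that timeout on the 
volume might mask it (whether that is wise is another question; the 
value below is arbitrary):

```shell
# Show the volume's options, including any ping-timeout override:
gluster volume info atmos

# Raise the ping timeout from the default 42 seconds:
gluster volume set atmos network.ping-timeout 90
```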


On 02/23/2012 05:21 PM, Jeff Darcy wrote:
> On 02/23/2012 11:45 AM, Dan Bretherton wrote:
>>> The main question is therefore why
>>> we're losing connectivity to these servers.
>> Could there be a hardware issue?  I have replaced the network cables for
>> the two servers but I don't really know what else to check.  The network
>> switch hasn't recorded any errors for those two ports.  There isn't
>> anything sinister in /var/log/messages.
>> It seems a bit of a coincidence that both servers lost connection at
>> exactly the same time.   The only thing the users have started doing
>> differently recently is processing a large number of small text files.
>> There is one particular application they are running that processes this
>> data, but the load on the Glusterfs servers doesn't go up when it is
>> running.
> It does seem like a weird coincidence.  About the only thing I can think of is
> that there's some combination of events that occurs on those two servers but
> not the others.  For example, what if there's some file that happens to live on
> that replica pair, and which is accessed in some particularly pathological way?
>   I used to see something like that with some astrophysics code that would try
> to open and truncate the same file from each of a thousand nodes simultaneously
> each time it started.  Needless to say, this caused a few problems.  ;)  Maybe
> there's something about this new job type that similarly "converges" on one
> file for configuration, logging, something like that?
