[Gluster-users] "mismatching layouts" errors after expanding volume
Jeff Darcy
jdarcy at redhat.com
Thu Feb 23 14:41:50 UTC 2012
On 02/23/2012 08:58 AM, Dan Bretherton wrote:
> It is reassuring to know that these errors are self-repairing. That does
> appear to be happening, but only when I run "find -print0 | xargs --null
> stat >/dev/null" in affected directories.
Hm. Then maybe the xattrs weren't *set* on that brick.
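You can check that directly on the brick (not through the client mount) by
looking at the directory's layout xattr; something like the following, where
the brick path is just an example:

    # on the server hosting the suspect brick; brick path is hypothetical
    getfattr -n trusted.glusterfs.dht -e hex /export/brick1/path/to/dir

If trusted.glusterfs.dht is missing there, or its range doesn't line up with
what the other bricks report, that's the directory DHT self-heal needs to fix.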
> I will run that self-heal on the whole
> volume as well, but I have had to start with specific directories that people
> want to work in today. Does repeating the fix-layout operation have any
> effect, or are the xattr repairs all done by the self-heal mechanism?
AFAICT the DHT self-heal mechanism (not to be confused with the better-known
AFR self-heal mechanism) will take care of this. Running fix-layout would be
redundant for those directories, but not harmful.
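If you want to sweep the whole volume rather than picking directories by hand,
the same trigger works from the top of the client mount. Something like this,
where the mount point is only an example and "atmos" is my guess at your
volume name based on the log prefixes:

    # walk the whole volume from a client mount to trigger DHT self-heal
    find /mnt/atmos -print0 | xargs --null stat >/dev/null

    # the layout-only rebalance, should you ever want to rerun it
    gluster volume rebalance atmos fix-layout start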
> I have found the cause of the transient brick failure; it happened again this
> morning on a replicated pair of bricks. Suddenly the
> etc-glusterfs-glusterd.vol.log file was flooded with these messages every few
> seconds.
>
> E [socket.c:2080:socket_connect] 0-management: connection attempt failed
> (Connection refused)
>
> One of the clients then reported errors like the following.
>
> [2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
> 2-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
> [2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
> 0-atmos-dht: found anomalies in /. holes=1 overlaps=0
Bingo. This is exactly how DHT subvolume #3 could "miss out" on a directory
being created or updated, as seems to have happened.
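For reference, "holes=1" just means the hash ranges collected from the
subvolumes for that directory don't cover the whole 32-bit hash space. A
made-up illustration, assuming four replicated subvolumes for the sake of the
numbers:

    # hypothetical per-directory layout with one range missing (holes=1)
    #   atmos-replicate-0: 0x00000000 - 0x3fffffff
    #   atmos-replicate-1: 0x40000000 - 0x7fffffff
    #   atmos-replicate-2: 0x80000000 - 0xbfffffff
    #   atmos-replicate-3: no trusted.glusterfs.dht xattr   <- the hole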
> [2012-02-23 11:19:22.923714] I [dht-selfheal.c:569:dht_selfheal_directory]
> 0-atmos-dht: 1 subvolumes down -- not fixing
>
> [2012-02-23 11:19:22.941468] W [socket.c:1494:__socket_proto_state_machine]
> 1-atmos-client-7: reading from socket failed. Error (Transport endpoint is not
> connected), peer (192.171.166.89:24019)
> [2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
> 1-atmos-client-7: disconnected
> [2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
> 1-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
>
> The servers causing trouble were still showing as Connected in "gluster peer
> status" and nothing appeared to be wrong except for glusterd misbehaving.
> Restarting glusterd solved the problem, but given that this has happened twice
> this week already I am worried that it could happen again at any time. Do you
> know what might be causing glusterd to stop responding like this?
The glusterd failures and the brick failures are likely to share a common
cause, as opposed to one causing the other. The main question is therefore why
we're losing connectivity to these servers. Secondarily, there might be a bug
to do with the failure being seen in the I/O path but not in the peer path, but
that's not likely to be the *essential* problem.
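If it does happen again, it would be worth grabbing a little state from the
affected server before restarting glusterd. Untested and off the top of my
head, and the grep patterns are only examples:

    gluster peer status
    ps ax | grep gluster          # are glusterd and the glusterfsd brick processes still alive?
    netstat -ntlp | grep gluster  # is 24007 (management) still listening, along with the brick ports?

That should at least tell us whether glusterd died, stopped listening, or is
up but refusing connections.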