[Gluster-users] file locked / inaccessible if auto-heal required & confusing log messages (1.4rc3)
Daniel Maher
dma+gluster at witbe.net
Wed Dec 17 14:40:44 UTC 2008
Hello,
I recently upgraded my infrastructure from a 1.3.12 server-based AFR
cluster to a 1.4rc3 client-based AFR cluster. Among other things, I
have noticed one very obvious change in the behaviour of self-healing
between the two setups...
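For context, the structural difference between the two setups is where the AFR translator lives: in the old 1.3 configuration it sat in the server volfiles, whereas now the client volfile does the replication. A rough sketch of the relevant client-side fragment (all volume names here are invented for illustration; the `protocol/client` volumes `remote1` and `remote2` are assumed to be defined earlier in the file):

```
# client volfile sketch -- hypothetical names, 1.4-era syntax
volume nasdash-afr
  type cluster/afr
  subvolumes remote1 remote2
end-volume
```
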
The scenario is basic : one of the server nodes becomes inaccessible,
and as a result, changes to a given file are not replicated. When the
downed node returns, and the file is accessed, the self-heal feature is
triggered, thus ensuring the integrity of the data across all server nodes.
So far so good ; however, between the previous setup and the current
one, « something » has resulted in differing behaviour vis-à-vis the
availability of said file.
In the previous 1.3 server-based AFR setup, if a client attempted to
write to the file, it was able to do so, with the change being
replicated to the newly-returned node as part of the self-heal process.
Perfect.
However, in the current 1.4 client-based AFR setup, if a client attempts
to write to the file, instead of Gluster accepting the write and
propagating the change during the self-heal, the file becomes
momentarily inaccessible. The self-heal process is then triggered, and
the file - without the current attempted write - is replicated.
Subsequent accesses are successful (and replicate as expected), but that
« triggering write » still fails the first time.
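For the moment I can work around this from the application side by simply retrying the first operation, since the failed access is what triggers the self-heal and a second attempt then goes through. A minimal sketch, not Gluster-specific in any way (paths and the one-second delay are purely illustrative):

```shell
#!/bin/sh
# Workaround sketch: run a command, and if it fails (e.g. with the
# transient EIO on a file awaiting self-heal), wait briefly and retry.
retry_once() {
    "$@" && return 0   # first attempt; may hit the transient I/O error
    sleep 1            # give the triggered self-heal time to finish
    "$@"               # second attempt normally succeeds
}

# Illustrative usage with a plain local copy:
echo "payload" > /tmp/demo_src.txt
retry_once cp /tmp/demo_src.txt /tmp/demo_dst.txt
```

Obviously this is a band-aid, not a fix ; the point of client-based AFR was supposed to be that the application never sees this.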
Furthermore, the log entry related to this particular process is
confusing (log excerpt below). It follows the form :
1. Self-heal triggered
2. Unable to resolve conflicting data
3. Self-heal completed
4. File not found
The reported conflict does not, in fact, appear to affect the self-heal,
in that the file is replicated as expected. Is the error itself
erroneous, or is there actually a problem ? Furthermore, even though
the file clearly exists, and has in fact just been replicated, Gluster
then throws an error on OPEN.
This can't possibly be the expected behaviour. What within the
underlying infrastructure has changed ? How can it be fixed ?
Some log snippets :
Tomcat (on client)
---------
[Thread-25]09:32:09,398 ERROR: Error in copyfile.
java.io.FileNotFoundException: /glusterfs/some/directory/somefile.txt
(Input/output error)
---------
glusterfs.log (on client)
---------
2008-12-17 09:32:09 W [afr-self-heal-common.c:1005:afr_self_heal]
nasdash-afr: performing self heal on
/glusterfs/some/directory/somefile.txt (metadata=0 data=1 entry=0)
2008-12-17 09:32:09 E [afr-self-heal-data.c:777:afr_sh_data_fix]
nasdash-afr: Unable to resolve conflicting data of
/glusterfs/some/directory/somefile.txt. Please resolve manually by
deleting the file /glusterfs/some/directory/somefile.txt from all but
the preferred subvolume
2008-12-17 09:32:09 W [afr-self-heal-data.c:70:afr_sh_data_done]
nasdash-afr: self heal of /glusterfs/some/directory/somefile.txt completed
2008-12-17 09:32:09 E [fuse-bridge.c:662:fuse_fd_cbk] glusterfs-fuse:
189804: OPEN() /glusterfs/some/directory/somefile.txt => -1
(Input/output error)
---------
Comments ?
--
Daniel Maher <dma+gluster AT witbe DOT net>