[Bugs] [Bug 1234891] New: gf_store_save_value() fflush() error-checking bug, leading to corruption of glusterd.info when filesystem is full
bugzilla at redhat.com
Tue Jun 23 13:02:03 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1234891
Bug ID: 1234891
Summary: gf_store_save_value() fflush() error-checking bug,
leading to corruption of glusterd.info when filesystem
is full
Product: GlusterFS
Version: 3.4.2
Component: glusterd
Assignee: bugs at gluster.org
Reporter: tero.marttila at aalto.fi
CC: bugs at gluster.org, gluster-bugs at redhat.com
Description of problem:
When the host filesystem containing the glusterd workdir files (e.g.
/var/lib/glusterd/) is full, this can lead to situations where files such as
glusterd.info or peers/* etc are replaced with empty zero-length files. See Bug
1207534 and [1] for another example of this. I experienced this with
glusterfs-server 3.4.2-1ubuntu1 running on Ubuntu 14.04.
Digging through the related code, I suspect this is caused by the use of
feof() in the following error-handling path of gf_store_save_value() in
libglusterfs/src/store.c, which is intended to handle exactly these kinds of
write errors:
        ret = fprintf (fp, "%s=%s\n", key, value);
        if (ret < 0) {
                ...
        }
        ret = fflush (fp);
        if (feof (fp)) {
                ...
        }
        ret = 0;
This code appears to be present in both glusterfs 3.4.2 and mainline git.
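Based on my understanding, an error condition such as ENOSPC in fflush() will
NOT cause feof() to return nonzero. fflush() reports the failure through its
return value (EOF) and the stream's error indicator (ferror()), not through
the end-of-file indicator that feof() tests.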
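A minimal sketch of how the check could look if the fflush() return value were
tested directly. This is not the GlusterFS code or a proposed patch;
save_value_checked() is a hypothetical name used only for illustration, and
the real function has additional logging and arguments not shown here.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the GlusterFS function): write one key=value
 * line and report any stdio write error to the caller.
 * Returns 0 on success, -1 on failure.
 */
int save_value_checked(FILE *fp, const char *key, const char *value)
{
    assert(fp != NULL && key != NULL && value != NULL);

    /* fprintf() returns a negative value on an output error. */
    if (fprintf(fp, "%s=%s\n", key, value) < 0)
        return -1;

    /*
     * fflush() returns EOF when the underlying write fails (for example
     * with ENOSPC) and sets the stream's error indicator. feof() stays 0
     * in that case, so the return value is what must be tested.
     */
    if (fflush(fp) != 0)
        return -1;

    return 0;
}
```

On Linux, opening /dev/full for writing is a convenient way to exercise the
failure path without filling a real filesystem: the buffered fprintf()
succeeds and the fflush() fails with ENOSPC.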
Steps to Reproduce:
Using the following example test program:
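#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (int argc, char **argv)
{
        int ret;
        const char *path = argv[1];
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        FILE *fp = fdopen(fd, "a+");

        /* Buffered by stdio, so this succeeds even on a full filesystem. */
        ret = fprintf(fp, "foo=bar\n");
        fprintf(stderr, "fprintf = %d: %s\n", ret, strerror(errno));

        /* The actual write(2) happens here and is where ENOSPC shows up. */
        ret = fflush(fp);
        fprintf(stderr, "fflush = %d: %s\n", ret, strerror(errno));

        ret = feof(fp);
        fprintf(stderr, "feof = %d\n", ret);

        /* fclose() also closes the underlying fd; no separate close(fd). */
        fclose(fp);
        return ret;
}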
Running this on a "full" ext4 filesystem:
$ touch mnt/test/test2 && ls -l mnt/test/test2
-rw-r--r-- 1 xxx staff 0 Jun 23 15:52 mnt/test/test2
$ echo foo > mnt/test/test2
-bash: echo: write error: No space left on device
$ rm mnt/test/test2
Gives the following:
$ ./test mnt/test/test2
fprintf = 8: Success
fflush = -1: No space left on device
feof = 0
Actual results:
I believe that this behavior of fflush()/feof() would cause
gf_store_save_value() to ignore the write error, causing the store file to be
replaced by an empty file as per the bug symptoms.
Expected results:
gf_store_save_value() should handle the -1 return code from fflush() by
returning an error, causing the tmpfile to be unlinked and the existing file to
be left intact.
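The expected behavior can be sketched as a write-to-temporary-then-rename
helper. This is an illustration under assumptions, not glusterd's actual store
code: store_save_atomic() and the ".tmp" suffix are hypothetical, and the
fsync() call is an extra precaution not mentioned in this report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with a single key=value line, but
 * only if every write step succeeded. On any failure the temporary file
 * is unlinked and the existing `path` is left untouched.
 * Returns 0 on success, -1 on failure.
 */
int store_save_atomic(const char *path, const char *key, const char *value)
{
    char tmppath[4096];
    int n, failed = 0;
    FILE *fp;

    assert(path != NULL && key != NULL && value != NULL);

    n = snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmppath))
        return -1;

    fp = fopen(tmppath, "w");
    if (fp == NULL)
        return -1;

    /* Check each step by its return value, never by feof(). */
    if (fprintf(fp, "%s=%s\n", key, value) < 0)
        failed = 1;
    if (!failed && fflush(fp) != 0)
        failed = 1;
    if (!failed && fsync(fileno(fp)) != 0)
        failed = 1;
    if (fclose(fp) != 0)
        failed = 1;

    /* Only a fully written temporary file may replace the original. */
    if (failed || rename(tmppath, path) != 0) {
        unlink(tmppath);
        return -1;
    }
    return 0;
}
```

With this shape, an ENOSPC during fflush() makes the function return -1 before
rename() is reached, so a previously valid file such as glusterd.info would
not be replaced by an empty one.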
[1] https://www.mail-archive.com/users@ovirt.org/msg25215.html
--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.