[Bugs] [Bug 1234891] New: gf_store_save_value() fflush() error-checking bug, leading to corruption of glusterd.info when filesystem is full

Tue Jun 23 13:02:03 UTC 2015

https://bugzilla.redhat.com/show_bug.cgi?id=1234891

            Bug ID: 1234891
           Summary: gf_store_save_value() fflush() error-checking bug,
                    leading to corruption of glusterd.info when filesystem
                    is full
           Product: GlusterFS
           Version: 3.4.2
         Component: glusterd
          Assignee: bugs at gluster.org
          Reporter: tero.marttila at aalto.fi
                CC: bugs at gluster.org, gluster-bugs at redhat.com

Description of problem:

When the host filesystem containing the glusterd workdir files (e.g.
/var/lib/glusterd/) is full, this can lead to situations where files such as
glusterd.info or peers/* etc are replaced with empty zero-length files. See Bug
1207534 and [1] for another example of this. I experienced this with
glusterfs-server 3.4.2-1ubuntu1 running on Ubuntu 14.04.

Digging through the related code, I suspect this may be caused by the use of
feof() in following error libglusterfs/src/store.c gf_store_save_value() code
path which is intended to handle exactly these kinds of write errors:

        ret = fprintf (fp, "%s=%s\n", key, value);
        if (ret < 0) {
                ...
        }

        ret = fflush (fp);
        if (feof (fp)) {
                ...
        }

        ret = 0;

This code appears to be present in both glusterfs 3.4.2 and mainline git.

Based on my understanding, an error condition such as ENOSPC in fflush() will
NOT cause feof() to return nonzero.

Steps to Reproduce:

Using the following example test program:

    int main (int argc, char **argv)
    {
        int ret;

        const char *path = argv[1];

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        FILE *fp = fdopen(fd, "a+");

        ret = fprintf(fp, "foo=bar\n");

        fprintf(stderr, "fprintf = %d: %s\n", ret, strerror(errno));

        ret = fflush(fp);

        fprintf(stderr, "fflush = %d: %s\n", ret, strerror(errno));

        ret = feof(fp);

        fprintf(stderr, "feof = %d\n", ret);

        fclose(fp);
        close(fd);

        return ret;
    }

Running this on a "full" ext4 filesystem:

    $ touch mnt/test/test2 && ls -l mnt/test/test2
    -rw-r--r-- 1 xxx staff 0 Jun 23 15:52 mnt/test/test2
    $ echo foo > mnt/test/test2
    -bash: echo: write error: No space left on device
    $ rm mnt/test/test2 

Gives the following:

    $ ./test mnt/test/test2
    fprintf = 8: Success
    fflush = -1: No space left on device
    feof = 0

Actual results:

I believe that this behavior of fflush()/feof() would cause
gf_store_save_value() to ignore the write error, causing the store file to be
replaced by an empty file as per the bug symptoms.

Expected results:

gf_store_save_value() should handle the -1 return code from fflush() by
returning an error, causing the tmpfile to be unlinked and the existing file to
be left intact.

[1] https://www.mail-archive.com/users@ovirt.org/msg25215.html

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.