[Gluster-devel] write-behind bug with ftruncate

Anand Avati anand.avati at gmail.com
Sun Jul 17 11:28:29 UTC 2011

On Sun, Jul 17, 2011 at 2:29 PM, Emmanuel Dreyfus <manu at netbsd.org> wrote:

> However, the reordering does not occur in FUSE, and it seems I was wrong
> about write-behind, and that removing it just made the bug disappear by
> chance.
> As I now understand, the problem is that fuse_setattr_cbk() will request a
> ftruncate() after the SETATTR. Here is what I get in the logs:

Do you mean fuse_setattr_cbk is triggering an ftruncate() when it is not
supposed to? I reviewed the code again just now and cannot find it doing
any such thing.

> fuse_write()    size = 4096, offset = 39981056
> fuse_setattr()  fsi->valid = 0x78 => truncate_needed,  size = 39987632
> fuse_write()    size = 20480, offset = 39985152
> (...)
> client3_1_writev()      size = 4096, offset = 39981056
> fuse_setattr_cbk()      call fuse_do_truncate, offset = 39987632
> client3_1_writev()      size = 2480, offset = 39985152
> (...)
> client3_1_ftruncate()   offset = 39987632

The above sequence of events looks proper for the given fsi->valid (0x78).
Please read below for the explanation.

> Why does it decide to set truncate_needed? fsi->valid = 0x78 means this is

Exactly, FATTR_SIZE getting set means a truncate or ftruncate (depending on
FATTR_FH being set) needs to be done.

> Here is the offending code:
> #define FATTR_MASK   (FATTR_SIZE                        \
>                      | FATTR_UID | FATTR_GID           \
>                      | FATTR_ATIME | FATTR_MTIME       \
>                      | FATTR_MODE)
> (...)
>        if ((fsi->valid & (FATTR_MASK)) != FATTR_SIZE) {
>                if (fsi->valid & FATTR_SIZE) {
>                        state->size            = fsi->size;
>                        state->truncate_needed = _gf_true;
>                }
> The sin is therefore to set FATTR_ATIME | FATTR_MTIME, while glusterfs
> assumes this is an ftruncate() call because only FATTR_SIZE is set. Am I
> correct?

The current behavior is the right behavior. FATTR_SIZE being set indicates
a truncate is necessary. FATTR_ATIME|FATTR_MTIME being set indicates a
utimes() is necessary, FATTR_UID|FATTR_GID being set indicates a chown is
necessary, FATTR_MODE being set indicates a chmod is necessary, and FATTR_FH
being set indicates the fXXXX() variant of the above calls. Multiple flags
can be set at the same time, i.e., FATTR_ATIME|FATTR_MTIME|FATTR_SIZE can
all be set in the same fuse_setattr() call, and the filesystem (glusterfs)
needs to perform all the required actions accordingly.
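
To make the flag decoding concrete, here is a minimal sketch of how
fsi->valid = 0x78 breaks down. The FATTR_* values are the ones from the FUSE
kernel protocol header; the struct and the plan_setattr() helper are
hypothetical names used only for illustration and are not glusterfs code.

```c
#include <stdint.h>

/* FATTR_* bits as defined by the FUSE kernel protocol header. */
#define FATTR_MODE  (1u << 0)
#define FATTR_UID   (1u << 1)
#define FATTR_GID   (1u << 2)
#define FATTR_SIZE  (1u << 3)
#define FATTR_ATIME (1u << 4)
#define FATTR_MTIME (1u << 5)
#define FATTR_FH    (1u << 6)

/* Same mask as in the snippet quoted earlier in this mail. */
#define FATTR_MASK (FATTR_SIZE | FATTR_UID | FATTR_GID | \
                    FATTR_ATIME | FATTR_MTIME | FATTR_MODE)

/* Hypothetical summary of what one setattr request asks for. */
struct setattr_plan {
    int chmod_needed;    /* FATTR_MODE */
    int chown_needed;    /* FATTR_UID and/or FATTR_GID */
    int truncate_needed; /* FATTR_SIZE */
    int utimes_needed;   /* FATTR_ATIME and/or FATTR_MTIME */
    int on_open_fd;      /* FATTR_FH: use the fXXXX() variant */
    int mask_check_true; /* (valid & FATTR_MASK) != FATTR_SIZE */
};

/* Hypothetical helper: decode the 'valid' bitmask into the plan above. */
static struct setattr_plan plan_setattr(uint32_t valid)
{
    struct setattr_plan p;

    p.chmod_needed    = (valid & FATTR_MODE) != 0;
    p.chown_needed    = (valid & (FATTR_UID | FATTR_GID)) != 0;
    p.truncate_needed = (valid & FATTR_SIZE) != 0;
    p.utimes_needed   = (valid & (FATTR_ATIME | FATTR_MTIME)) != 0;
    p.on_open_fd      = (valid & FATTR_FH) != 0;
    /* True whenever something other than "size only" was requested. */
    p.mask_check_true = (valid & FATTR_MASK) != FATTR_SIZE;
    return p;
}
```

For valid = 0x78 this yields truncate_needed, utimes_needed and on_open_fd
set, with no chown or chmod. The mask check is true as well, which matches
the log above: the setattr is sent first and the ftruncate follows from
fuse_setattr_cbk().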

The problem I see here is that the write calls are arriving before setattr
has completed (i.e., before send_fuse_obj() is called for the _entire_
setattr operation). This would naturally lead the writes and the truncate to
race within the filesystem, as they are issued concurrently.
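
Why the order matters can be shown with a toy model that tracks only the
file size. This is a sketch, not filesystem code: the function names are
made up for illustration, and the numbers in the note below are taken from
the fuse_write() and fuse_setattr() lines of the log you pasted.

```c
#include <stdint.h>

/* Toy model: a file is represented only by its size in bytes. */

/* A write extends the file if it ends past the current size. */
static uint64_t apply_write(uint64_t size, uint64_t offset, uint64_t len)
{
    uint64_t end = offset + len;
    return end > size ? end : size;
}

/* A truncate sets the size unconditionally (shrinking or extending). */
static uint64_t apply_truncate(uint64_t size, uint64_t new_size)
{
    (void)size;
    return new_size;
}

/* Final size if the write completes first and the truncate second. */
static uint64_t size_after_write_then_truncate(uint64_t size, uint64_t offset,
                                               uint64_t len, uint64_t new_size)
{
    return apply_truncate(apply_write(size, offset, len), new_size);
}

/* Final size if the truncate completes first and the write second. */
static uint64_t size_after_truncate_then_write(uint64_t size, uint64_t offset,
                                               uint64_t len, uint64_t new_size)
{
    return apply_write(apply_truncate(size, new_size), offset, len);
}
```

With a 20480-byte write at offset 39985152 and a truncate to 39987632, the
first order leaves the file at 39987632 bytes while the second leaves it at
40005632 bytes, so the outcome depends entirely on which operation wins.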

Filesystems guarantee (if at all) that two syscalls complete in a
particular order only if the second syscall was issued after the first one
returned. In this situation, based on the sequence you pasted above, the
setattr and write seem to have been issued concurrently: until you see
fuse_truncate_cbk in the logs, the setattr() processing is not complete.
Any write() issued in the meantime is subject to a race, and the filesystem
need not guarantee any particular order of completion.
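
For contrast, this is what the ordered case looks like from the application
side. It is a minimal sketch using only POSIX calls; the helper name and the
temporary file path are assumptions made for illustration. The ftruncate()
is issued only after the write has returned, so the final size is
well-defined.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write write_len zero bytes at offset 0 of a
 * temporary file, then truncate it to trunc_len. Each call is issued
 * only after the previous one has returned.
 * Returns the final file size, or -1 on error.
 */
static long sequential_write_truncate(size_t write_len, off_t trunc_len)
{
    char path[] = "/tmp/wb-demo-XXXXXX";
    int fd = mkstemp(path);
    long result = -1;
    char *buf;

    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */

    buf = calloc(1, write_len);
    if (buf != NULL) {
        struct stat st;

        /* pwrite() has returned before ftruncate() is issued. */
        if (pwrite(fd, buf, write_len, 0) == (ssize_t)write_len &&
            ftruncate(fd, trunc_len) == 0 &&
            fstat(fd, &st) == 0)
            result = (long)st.st_size;
        free(buf);
    }
    close(fd);
    return result;
}
```

Called as sequential_write_truncate(24576, 6576), the file ends up 6576
bytes long on any filesystem that honors this ordering. If the application
instead issues the write from another thread while ftruncate() is still in
flight, no such result can be relied upon.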

A possible cause of this problem could be that VOP_SETATTR in NetBSD only
sets the vnode attributes in memory and returns from the syscall, so that
fuse_setattr eventually reaches gluster _after_ the sys_ftruncate() syscall
has returned to the application. Is this a possibility? That would explain
multiple writes and the setattr/truncate executing concurrently. (Or the
application is just poorly written, without understanding what to expect
from concurrent syscalls, and does not see this behavior on on-disk
filesystems because they don't have such upcall/scheduling issues.)
