[Gluster-devel] error while reading from an open file

Fri Sep 4 04:28:47 UTC 2009

With write-behind removed from the configuration, success is no longer
returned for writes that have failed. The situation you describe is caused by
replicate in the setup; if replicate is removed from the test setup, the issue
is not observed. As an example, consider replicate over 2 nodes and the
following sequence of operations.

1. Start writing on the gluster mount; data is replicated to both nodes.
2. Stop node1. The application still receives success on writes, but data is
written only to node2.
3. Restart node1. The application still receives success on writes, but data
is still not written to node1, since the file is no longer open on node1. Also
note that self-heal will not be done from node2 to node1, since replicate does
not yet support self-heal on open fds.
4. Stop node2. The application receives either ENOTCONN or EBADFD, depending
on the child from which replicate received the last reply for the write. The
subvolume corresponding to node1 returns EBADFD and that of node2 returns
ENOTCONN.
5. The application, sensing that writes are failing, issues an open on the
file. Note that node2 is still down. If it were up, data on node1 would be
synced from node2.
6. Now writes happen only on node1. Note that the file on node1 is missing
some writes that happened on node2. From now on node2 misses the new writes,
leading to a split-brain situation.
7. Bring up node2. If an open were to happen now, when both nodes are up,
replicate identifies this as a split-brain situation, and manual intervention
is needed to identify the "more" correct version of the file and remove the
other. Replicate then copies the more correct version of the file to the other
node.

Hence the issue here is not that writes are failing. Writes are happening, but
there is a split-brain situation because of the way the servers have been
restarted.
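
The sequence above can be modelled with a toy sketch in C. This is not
GlusterFS code: struct node, replicated_write and replay_sequence are
hypothetical names, and each node is reduced to three facts (is it up, does it
hold the open fd, which writes did it store). It only illustrates why each
node ends up holding writes the other lacks.

```c
#include <stdbool.h>

#define MAX_WRITES 16

/* Toy stand-in for one replicate child; an assumption, not GlusterFS code. */
struct node {
    bool up;                /* server process is running */
    bool fd_open;           /* server holds the client's open fd */
    bool has[MAX_WRITES];   /* write sequence numbers stored on this node */
};

/* Send write number seq to both nodes; returns how many nodes stored it. */
static int replicated_write(struct node *n1, struct node *n2, int seq)
{
    struct node *nodes[2] = { n1, n2 };
    int stored = 0;

    for (int i = 0; i < 2; i++) {
        if (nodes[i]->up && nodes[i]->fd_open) {
            nodes[i]->has[seq] = true;
            stored++;
        }
    }
    return stored;
}

/* Stopping a server drops its fds; restarting does not bring them back. */
static void node_stop(struct node *n)  { n->up = false; n->fd_open = false; }
static void node_start(struct node *n) { n->up = true; }

/* An application-level open() reaches only the nodes that are up. */
static void reopen(struct node *n1, struct node *n2)
{
    n1->fd_open = n1->up;
    n2->fd_open = n2->up;
}

/* Count writes in [0, total) stored on a but missing on b. */
static int only_on(const struct node *a, const struct node *b, int total)
{
    int count = 0;

    for (int seq = 0; seq < total; seq++)
        if (a->has[seq] && !b->has[seq])
            count++;
    return count;
}

/* Replays steps 1-7; returns true when both nodes hold unique writes. */
static bool replay_sequence(int *only_node1, int *only_node2)
{
    struct node n1 = { .up = true, .fd_open = true };
    struct node n2 = { .up = true, .fd_open = true };
    int seq = 0;

    replicated_write(&n1, &n2, seq++);          /* 1: stored on both nodes */
    node_stop(&n1);                             /* 2 */
    replicated_write(&n1, &n2, seq++);          /*    stored on node2 only */
    node_start(&n1);                            /* 3: fd not open on node1 */
    replicated_write(&n1, &n2, seq++);          /*    stored on node2 only */
    node_stop(&n2);                             /* 4 */
    if (replicated_write(&n1, &n2, seq) == 0)   /*    write fails everywhere */
        reopen(&n1, &n2);                       /* 5: only node1 is up */
    replicated_write(&n1, &n2, seq++);          /* 6: stored on node1 only */
    node_start(&n2);                            /* 7 */

    *only_node1 = only_on(&n1, &n2, seq);
    *only_node2 = only_on(&n2, &n1, seq);
    return *only_node1 > 0 && *only_node2 > 0;
}
```

In this model replay_sequence() returns true with one write present only on
node1 and two present only on node2, so neither copy is a superset of the
other.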

On Wed, Sep 2, 2009 at 8:06 PM, Brian Hirt <bhirt at mobygames.com> wrote:

>
> On Sep 2, 2009, at 7:12 AM, Vijay Bellur wrote:
>
>> Brian Hirt wrote:
>>
>>> The first part of this problem (open files not surviving gluster
>>> restarts) seems like a pretty major design flaw that needs to be fixed.
>>>
>> Yes, we do know that this is a problem and we have our sights set on
>> solving this.
>>
>
> That is good to know.  Do you know if this is planned to be backported
> into 2.0 or is it going to be part of 2.1?  Is there a bug report id so we
> can follow the progress?
>
>
>>> The second part (gluster not reporting the error to the writer when
>>> gluster chokes) is a critical problem that needs to be fixed.
>>>
>>
>> This is a bug in the write-behind translator and it is being tracked as
>> bug 242.
>>
>> A discussion from the mailing list archives which could be of interest to
>> you for the tail -f problem:
>>
>> http://gluster.org/pipermail/gluster-users/20090113/001362.html
>>
>>
> Is there any additional information I can provide in this bug report?   I
> have disabled the following section from my test clients and can confirm
> that some errors that were not being reported are now being sent back to the
> writer program.  It's certainly an improvement over no errors being
> reported.
>
> volume writebehind
>  type performance/write-behind
>  option window-size 1MB
>  subvolumes distribute
> end-volume
>
> I've also discovered that this problem is not isolated to the writebehind
> module.  While some errors are being sent back to the writer, there is still
> data corruption in the gluster created file.  Gluster is still reporting
> success to the writer when writes have failed.   I have a simple program
> that writes 0, 1, 2, 3 ... N to a file at the rate of 100 lines per second.
>  Whenever the writer gets an error returned from write() it waits a second,
> reopens the file and continues writing.   While this writer is writing, I
> restart the gluster nodes one by one.  Once this is done, I stop the writer
> and check the file for corruption.
>
> One interesting observation I have made is that when restarting the gluster
> servers, sometimes errno EBADFD is returned and sometimes it's ENOTCONN.
>  When errno is ENOTCONN (107 in Ubuntu 9.04) the file is not corrupted. When
> errno is EBADFD (77 in Ubuntu 9.04) there is file corruption.  These
> statements are based on a limited number of test runs, but were always true
> for me.
>
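
These two errno values line up with step 4 of the sequence at the top of this
mail: ENOTCONN comes from the child whose server is down, EBADFD from the
child that was restarted and no longer has the file open. A minimal sketch of
how a test program could log that distinction, assuming Linux (EBADFD is
Linux-specific); describe_write_errno is a hypothetical helper, not part of
GlusterFS:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: label the two errno values discussed in this thread. */
static const char *describe_write_errno(int err)
{
    switch (err) {
    case ENOTCONN:
        return "ENOTCONN: last reply came from a child whose server is down";
    case EBADFD:
        return "EBADFD: last reply came from a child where the file is not open";
    default:
        /* Anything else is reported with the standard message. */
        return strerror(err);
    }
}
```

Passing errno to it right after a failed write() gives a label that can be
printed next to the numeric value.
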
> Some sample output of some tests:
>
> bhirt at ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 &&
> ./write-numbers /unify/m/test1.2009-09-02
> problems writting to fd, reopening logfile (errno = 77) in one second
> ^C
> bhirt at ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
> 169 <> 480
> bhirt at ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 &&
> ./write-numbers /unify/m/test1.2009-09-02
> problems writting to fd, reopening logfile (errno = 107) in one second
> ^C
>
> bhirt at ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
> OK
>
> The programs I use to test this are:
>
> bhirt at ubuntu:~/gluster-tests$ cat write-numbers.c check-numbers
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <string.h>
>
> #define BUFSIZE        65536
>
> /* write 100 entries per second */
> #define WRITE_DELAY    (1000000 / 100)
>
> int open_testfile(char *testfile)
> {
>    int fd;
>
>    fd = open(testfile, O_WRONLY | O_CREAT | O_APPEND, 0666);
>
>    if (fd < 0) {
>        perror("open");
>        exit(2);
>    }
>
>    return(fd);
> }
>
> void usage(char *s)
> {
>    fprintf(stderr, "\nusage: %s testfile\n\n",s);
> }
>
> int main (int argc, char **argv)
> {
>    char buf[BUFSIZE];
>    int  logfd;
>    int  nread;
>    int counter = 0;
>
>
>    if (argc != 2) {
>        usage(argv[0]);
>        exit(1);
>    }
>
>    logfd = open_testfile(argv[1]);
>
>    /* loop endlessly */
>    for (;;) {
>
>        snprintf(buf, sizeof(buf), "%d\n",counter);
>        nread = strnlen(buf,sizeof(buf));
>
>        /* write data */
>        int nwrite = write(logfd, buf, nread);
>
>        if (nwrite == nread) {
>            counter++;
>            usleep(WRITE_DELAY);
>        } else {
>            /* restarted gluster nodes give this error in 2.0.6 */
>            if (errno == EBADFD || errno == ENOTCONN)
>            {
>              /* wait a second before re-opening the file */
>              fprintf(stderr, "problems writting to fd, reopening logfile "
>                      "(errno = %d) in one second\n", errno);
>              sleep(1);
>
>              /* close the stale fd, then reopen the log file; counter was
>               * not incremented, so the same value is written again */
>              close(logfd);
>              logfd = open_testfile(argv[1]);
>            }
>            else
>            {
>              perror("write");
>              exit(2);
>            }
>        }
>    }
> }
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my $i=0;
>
> while (<>) { die "$i <> $_" if $i++ != $_; }
> print STDERR "OK\n";
>
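
Where Perl is not available, the same check can be sketched in C. This is only
an illustration: first_mismatch is a hypothetical name, and it assumes one
decimal number per line starting at 0, which is what write-numbers produces.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Returns -1 if the lines read from fp are exactly 0, 1, 2, ...
 * Otherwise returns the value that was expected at the first bad line and,
 * if found is not NULL, stores the value actually read there (-1 when the
 * line is not a number).
 */
static long first_mismatch(FILE *fp, long *found)
{
    char line[64];
    long expected = 0;

    while (fgets(line, sizeof(line), fp) != NULL) {
        char *end;
        long value = strtol(line, &end, 10);

        if (end == line || value != expected) {
            if (found != NULL)
                *found = (end == line) ? -1 : value;
            return expected;
        }
        expected++;
    }
    return -1;
}
```

Opening the test file with fopen() and passing the stream to first_mismatch()
shows where the sequence first breaks.
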
> The client log file during one of the tests I ran.
>
> [2009-09-02 09:59:23] E [saved-frames.c:165:saved_frames_unwind] remote1:
> forced unwinding frame type(1) op(FINODELK)
> [2009-09-02 09:59:23] N [client-protocol.c:6246:notify] remote1:
> disconnected
> [2009-09-02 09:59:23] E [socket.c:745:socket_connect_finish] remote1:
> connection to 10.0.1.31:6996 failed (Connection refused)
> [2009-09-02 09:59:26] N [client-protocol.c:5559:client_setvolume_cbk]
> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:30] E [saved-frames.c:165:saved_frames_unwind] remote2:
> forced unwinding frame type(1) op(WRITE)
> [2009-09-02 09:59:30] W [fuse-bridge.c:1534:fuse_writev_cbk]
> glusterfs-fuse: 153358: WRITE => -1 (Transport endpoint is not connected)
> [2009-09-02 09:59:30] N [client-protocol.c:6246:notify] remote2:
> disconnected
> [2009-09-02 09:59:30] E [socket.c:745:socket_connect_finish] remote2:
> connection to 10.0.1.32:6996 failed (Connection refused)
> [2009-09-02 09:59:33] N [client-protocol.c:5559:client_setvolume_cbk]
> remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:34] N [client-protocol.c:5559:client_setvolume_cbk]
> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:37] E [saved-frames.c:165:saved_frames_unwind] remote1:
> forced unwinding frame type(1) op(FINODELK)
> [2009-09-02 09:59:37] W [fuse-bridge.c:1534:fuse_writev_cbk]
> glusterfs-fuse: 153923: WRITE => -1 (File descriptor in bad state)
> [2009-09-02 09:59:37] N [client-protocol.c:6246:notify] remote1:
> disconnected
> [2009-09-02 09:59:40] N [client-protocol.c:5559:client_setvolume_cbk]
> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:41] N [client-protocol.c:5559:client_setvolume_cbk]
> remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:44] N [client-protocol.c:5559:client_setvolume_cbk]
> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
> [2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse:
> 155106: FLUSH() ERR => -1 (File descriptor in bad state)
> [2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse:
> 155108: FLUSH() ERR => -1 (File descriptor in bad state)
>
>
>
>> Regards,
>> Vijay
>>
>>
>>
>
>

-- 
Raghavendra G