[Gluster-devel] error while reading from an open file

Raghavendra G raghavendra at gluster.com
Mon Sep 7 05:13:49 UTC 2009


On Fri, Sep 4, 2009 at 7:31 PM, Brian Hirt <bhirt at mobygames.com> wrote:

> Raghavendra,
>
> Thanks for looking into this; it's great that you have identified the bug.
>  Let's hope you can get a fix out soon, since it's a pretty serious error
> that undermines the data integrity and reliability of gluster.  From your
> message, I can't quite tell whether you are agreeing that there is a serious
> problem, or just explaining what's happening in gluster as if everything
> were working as planned.
>
> I understand what you are saying below, but from the viewpoint of a client
> application, there absolutely is data corruption.  If a program is writing
> to a file, handling all errors reported to it by the operating system, and
> later reads from the same file and finds that the data it wrote isn't
> there, that is fundamentally a data corruption issue with very serious
> implications for any program.
>
> Restarting servers is typical behavior for setups of all sizes.  Machines
> require routine maintenance, and it is guaranteed that servers will be
> brought offline for hardware changes and/or operating system updates.
>
> The documentation is misleading, as it implies split-brain handling is one
> of the new features in 2.0.  I quote:
>        "Replicate correctness (includes cases of splitbrain, proper
> self-heal fixes etc) Replicate is one of GlusterFS's key features. It was
> questioned for its ability to function well in some split brain situations.
> There were some corner cases in AFR which needed to be re-worked. With 2.0.x
> releases, AFR has proper detection of split-brain situations and works
> smoothly in all cases."
>
> This documentation is also misleading and incorrect.  The statement that
> there is no data corruption when reading a file is just sort of silly and
> appears to be an attempt to soften the actual truth: any open, replicated
> file that is being written to during a restart will become corrupt.  Then
> the statement "All file systems are vulnerable to such losses" is just
> another attempt to make it sound like gluster behaves like all other
> filesystems, when in fact that's not even close to the truth.
>

The situation you're facing arises because self-heal is not done on open
fds, which in turn can lead to split-brain scenarios. Self-heal not being
done on open fds is a known issue and will be fixed. As for the split-brain
scenarios, glusterfs can only identify that a split-brain has occurred;
manual intervention is needed to pick the correct copy of the file. The
documentation on split brain explains the reasons for it:
http://www.gluster.org/docs/index.php/Understanding_AFR_Translator
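
As a rough illustration of what that manual intervention involves: the AFR
changelog is kept in trusted.afr.* extended attributes on each brick's
backend copy of the file, as described in the document above. Below is a
minimal, hypothetical sketch that dumps those attributes; dump_afr_xattrs
is our own helper (not a gluster API), it must be run as root directly
against the brick's export path rather than the mount point, and the
attribute naming may differ between releases:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: print the trusted.afr.* changelog xattrs for a
 * file on a brick's backend export. Run as root on the server itself. */
int dump_afr_xattrs(const char *path)
{
    char names[4096];
    unsigned char val[64];
    ssize_t n = listxattr(path, names, sizeof(names));
    if (n < 0) { perror("listxattr"); return -1; }

    /* the buffer holds consecutive NUL-terminated attribute names */
    for (char *p = names; p < names + n; p += strlen(p) + 1) {
        if (strncmp(p, "trusted.afr.", 12) != 0)
            continue;
        ssize_t vlen = getxattr(path, p, val, sizeof(val));
        if (vlen < 0) { perror("getxattr"); continue; }
        printf("%s =", p);
        for (ssize_t i = 0; i < vlen; i++)
            printf(" %02x", val[i]);
        printf("\n");
    }
    return 0;
}

Roughly speaking, when the copies on both bricks carry non-zero pending
counters blaming each other, that is the split-brain signature, and the
copy to keep has to be chosen by hand.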




> http://gluster.org/docs/index.php/GlusterFS_Technical_FAQ#What_happens_in_case_of_hardware_or_GlusterFS_crash.3F
>
> I would suggest you change it to something more accurate and specific:
>        "WARNING:  If gluster is used in environments where files are
> replicated and written to, you will experience data loss/corruption when you
> restart servers.  The longer you leave a file open, the larger your exposure
> to this issue will be."
>
> If the documentation were a bit more honest about what gluster can and
> cannot do, instead of talking it up as the next best thing, it would go a
> long way towards helping the project out.  Right now, after all the problems
> I've experienced and seen other people report on this list, along with the
> lack of responses or the hand-waving that these issues aren't serious, I
> have very little trust in this project.  It's really a shame, because it
> seems like there is so much potential.
>
> Kind Regards,
>
> Brian
>
>
>
> On Sep 3, 2009, at 10:28 PM, Raghavendra G wrote:
>
>> With write-behind removed from the configuration, success is no longer
>> returned for writes which have failed. The situation above is caused by
>> replicate being in the setup; if replicate is removed from the test
>> setup, this issue is not observed. As an example, consider replicate
>> over 2 nodes and the following sequence of operations:
>>
>> 1. Start writing on the gluster mount; data is replicated to both nodes.
>> 2. Stop node1. The application still receives success on writes, but
>> data is written only to node2.
>> 3. Restart node1. The application still receives success on writes, but
>> data is still not written to node1, since the file is no longer open on
>> node1. Also note that self-heal will not be done from node2 to node1,
>> since replicate does not support self-heal on open fds yet.
>> 4. Stop node2. The application receives either ENOTCONN or EBADFD,
>> depending on the child from which replicate received the last reply for
>> the write: the subvolume corresponding to node1 returns EBADFD and that
>> of node2 returns ENOTCONN.
>> 5. The application, sensing that writes are failing, issues an open on
>> the file. Note that node2 is still down; if it were up, data on node1
>> would be synced from node2.
>> 6. Now writes happen only on node1. Note that the file on node1 was
>> already missing some writes that happened on node2, and node2 will now
>> miss all future writes, leading to a split-brain situation.
>> 7. Bring node2 back up. If an open were to happen now, when both nodes
>> are up, replicate identifies this as a split-brain situation, and manual
>> intervention is needed to identify the "more" correct version of the
>> file and remove the other. Replicate then copies the more correct
>> version of the file to the other node.
>>
>> Hence the issue here is not that writes are failing. Writes are
>> happening, but there is a split-brain situation because of the way the
>> servers have been restarted.
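
To make steps 3 and 7 concrete: since self-heal is currently triggered only
at open(), an application can close and reopen its fd once all replicas are
back up, giving replicate a chance to heal the file before further writes.
A minimal sketch of that workaround (reopen_for_selfheal is a hypothetical
helper, not part of any gluster API, and it only helps if both nodes are
actually reachable when it runs):

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical workaround: discard the stale descriptor and reopen,
 * so that the fresh open gives replicate a chance to self-heal. */
int reopen_for_selfheal(int fd, const char *path)
{
    fsync(fd);   /* best effort: flush anything still buffered */
    close(fd);   /* give up the descriptor that is in a bad state */
    return open(path, O_WRONLY | O_APPEND);
}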
>>
>>
>> On Wed, Sep 2, 2009 at 8:06 PM, Brian Hirt <bhirt at mobygames.com> wrote:
>>
>> On Sep 2, 2009, at 7:12 AM, Vijay Bellur wrote:
>>
>> Brian Hirt wrote:
>>
>> The first part of this problem (open files not surviving gluster restarts)
>> seems like a pretty major design flaw that needs to be fixed.
>>
>> Yes, we do know that this is a problem and we have our sights set on
>> solving this.
>>
>> That is good to know.  Do you know if this is planned to be backported
>> into 2.0, or will it be part of 2.1?  Is there a bug report id so we can
>> follow the progress?
>>
>>
>>
>> The second part (gluster not reporting the error to the writer when
>> gluster chokes) is a critical problem that needs to be fixed.
>>
>> This is a bug in the write-behind translator, and bug 242 has been filed
>> to track it.
>>
>> A discussion from the mailing list archives which could be of interest to
>> you for the tail -f problem:
>>
>> http://gluster.org/pipermail/gluster-users/20090113/001362.html
>>
>>
>> Is there any additional information I can provide in this bug report?   I
>> have disabled the following section from my test clients and can confirm
>> that some errors that were not being reported are now being sent back to the
>> writer program.  It's certainly an improvement over no errors being
>> reported.
>>
>> volume writebehind
>>  type performance/write-behind
>>  option window-size 1MB
>>  subvolumes distribute
>> end-volume
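
A side note on why write-behind masked the errors: since the translator
acknowledges a write before the data reaches the server, a failure can only
surface on a later write, flush, or close. An application that must know
each write really landed can check fsync() explicitly, at the cost of the
performance benefit write-behind provides. A minimal, hypothetical sketch
(checked_write is our own helper, not a gluster interface):

#include <sys/types.h>
#include <unistd.h>

/* Write and then force the data out; fsync's return value surfaces
 * errors that a cached write may have deferred. */
ssize_t checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n < 0)
        return -1;      /* immediate failure */
    if (fsync(fd) < 0)
        return -1;      /* deferred failure reported by fsync */
    return n;
}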
>>
>> I've also discovered that this problem is not isolated to the write-behind
>> module.  While some errors are being sent back to the writer, there is
>> still data corruption in the gluster-created file: gluster is still
>> reporting success to the writer when writes have failed.  I have a simple
>> program that writes 1, 2, 3, 4 ... N to a file at the rate of 100 lines
>> per second.  Whenever the writer gets an error returned from write(), it
>> waits a second, reopens the file and continues writing.  While this writer
>> is writing, I restart the gluster nodes one by one.  Once this is done, I
>> stop the writer and check the file for corruption.
>>
>> One interesting observation I have made is that when restarting the
>> gluster servers, sometimes errno EBADFD is returned and sometimes
>> ENOTCONN.  When errno is ENOTCONN (107 on ubuntu 9.04) the file is not
>> corrupted; when errno is EBADFD (77 on ubuntu 9.04) there is file
>> corruption.  These statements are based on a limited number of test runs,
>> but have always held true for me.
>>
>> Some sample output of some tests:
>>
>> bhirt at ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 &&
>> ./write-numbers /unify/m/test1.2009-09-02
>> problems writing to fd, reopening logfile (errno = 77) in one second
>> ^C
>> bhirt at ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
>> 169 <> 480
>> bhirt at ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 &&
>> ./write-numbers /unify/m/test1.2009-09-02
>> problems writing to fd, reopening logfile (errno = 107) in one second
>> ^C
>>
>> bhirt at ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
>> OK
>>
>> The programs I use to test this are:
>>
>> bhirt at ubuntu:~/gluster-tests$ cat write-numbers.c check-numbers
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <string.h>
>>
>> #define BUFSIZE        65536
>>
>> /* write 100 entries per second */
>> #define WRITE_DELAY    (1000000 / 100)
>>
>> int open_testfile(char *testfile)
>> {
>>   int fd;
>>
>>   fd = open(testfile, O_WRONLY | O_CREAT | O_APPEND, 0666);
>>
>>   if (fd < 0) {
>>       perror("open");
>>       exit(2);
>>   }
>>
>>   return(fd);
>> }
>>
>> void usage(char *s)
>> {
>>   fprintf(stderr, "\nusage: %s testfile\n\n",s);
>> }
>>
>> int main (int argc, char **argv)
>> {
>>   char buf[BUFSIZE];
>>   int  logfd;
>>   int  len;
>>   int  counter = 0;
>>
>>   if (argc != 2) {
>>       usage(argv[0]);
>>       exit(1);
>>   }
>>
>>   logfd = open_testfile(argv[1]);
>>
>>   /* loop endlessly */
>>   for (;;) {
>>
>>       snprintf(buf, sizeof(buf), "%d\n", counter);
>>       len = strnlen(buf, sizeof(buf));
>>
>>       /* write data */
>>       int nwrite = write(logfd, buf, len);
>>
>>       if (nwrite == len) {
>>           counter++;
>>           usleep(WRITE_DELAY);
>>       } else {
>>           /* restarted gluster nodes give this error in 2.0.6 */
>>           if (errno == EBADFD || errno == ENOTCONN)
>>           {
>>             /* wait a second before re-opening the file */
>>             fprintf(stderr, "problems writing to fd, reopening logfile "
>>                     "(errno = %d) in one second\n", errno);
>>             sleep(1);
>>
>>             /* close the stale fd and reopen the log file; counter was
>>                not incremented, so the loop retries the failed value */
>>             close(logfd);
>>             logfd = open_testfile(argv[1]);
>>           }
>>           else
>>           {
>>             perror("write");
>>             exit(2);
>>           }
>>       }
>>   }
>> }
>>
>> #!/usr/bin/perl
>>
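>> # check-numbers: verify the input is the unbroken sequence 0, 1, 2, ...
>> # and die showing "expected <> actual" at the first mismatch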
>> use strict;
>> use warnings;
>>
>> my $i=0;
>>
>> while (<>) { die "$i <> $_" if $i++ != $_; }
>> print STDERR "OK\n";
>>
>> The client log file during one of the tests I ran.
>>
>> [2009-09-02 09:59:23] E [saved-frames.c:165:saved_frames_unwind] remote1:
>> forced unwinding frame type(1) op(FINODELK)
>> [2009-09-02 09:59:23] N [client-protocol.c:6246:notify] remote1:
>> disconnected
>> [2009-09-02 09:59:23] E [socket.c:745:socket_connect_finish] remote1:
>> connection to 10.0.1.31:6996 failed (Connection refused)
>> [2009-09-02 09:59:26] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:30] E [saved-frames.c:165:saved_frames_unwind] remote2:
>> forced unwinding frame type(1) op(WRITE)
>> [2009-09-02 09:59:30] W [fuse-bridge.c:1534:fuse_writev_cbk]
>> glusterfs-fuse: 153358: WRITE => -1 (Transport endpoint is not connected)
>> [2009-09-02 09:59:30] N [client-protocol.c:6246:notify] remote2:
>> disconnected
>> [2009-09-02 09:59:30] E [socket.c:745:socket_connect_finish] remote2:
>> connection to 10.0.1.32:6996 failed (Connection refused)
>> [2009-09-02 09:59:33] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:34] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:37] E [saved-frames.c:165:saved_frames_unwind] remote1:
>> forced unwinding frame type(1) op(FINODELK)
>> [2009-09-02 09:59:37] W [fuse-bridge.c:1534:fuse_writev_cbk]
>> glusterfs-fuse: 153923: WRITE => -1 (File descriptor in bad state)
>> [2009-09-02 09:59:37] N [client-protocol.c:6246:notify] remote1:
>> disconnected
>> [2009-09-02 09:59:40] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:41] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:44] N [client-protocol.c:5559:client_setvolume_cbk]
>> remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
>> [2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse:
>> 155106: FLUSH() ERR => -1 (File descriptor in bad state)
>> [2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse:
>> 155108: FLUSH() ERR => -1 (File descriptor in bad state)
>>
>>
>>
>> Regards,
>> Vijay
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Raghavendra G
>>
>>
>


-- 
Raghavendra G