[Gluster-users] file corruption on Gluster 3.5.1 and Ubuntu 14.04

mike mike at luminatewireless.com
Fri Sep 5 23:10:59 UTC 2014


I have found that the O_APPEND flag is key to this failure - I had overlooked that flag when reading the strace and trying to cobble up a minimal reproduction.

I now have a small pair of python scripts that can reliably reproduce this failure.

On Sep 5, 2014, at 11:11 AM, mike <mike at luminatewireless.com> wrote:

> Hi all,
> 
> I've been trying to track down a bug that looks like data corruption coming from the Gluster FUSE client. I'm pretty new to Gluster, so perhaps there are better ways to get to the bottom of this.
> 
> I have a distributed, replicated volume that was initialized as 3.5.1 and it has been functioning fine in general.
> 
> Recently we discovered a write corruption issue that is always reproducible by one of our binaries.
> 
> The scenarios is as follows:
> 
> There is one writer appending binary entries that have a sequence number and checksum. Writes are small - 38 bytes each - and there is no fsync while the file is streaming out.
> 
> If this is the only activity, everything is perfectly fine.
> 
> However, if that file is simultaneously opened for reading, in a tail-like fashion, the data written into gluster is corrupt. Reading the file back out of disk reveals a dropped message (a missing sequence number). This happens when the reader and writer are different processes, so this is not a bug in our userspace.
> 
> What is even more interesting it that the first run during the writing reports the sequence number of the failure differently than all subsequent tests.
> 
> It feels like there is some kind of cache coherency issue here, but after putting glusterfs into -L DEBUG and following along, I'm not closer to a understanding the issue.
> 
> stracing glusterfs suggests that the missing records were actually sent to the remote glusterfs daemons. I have appended that strace at the bottom.
> 
> For reference, here is a bit of strace that shows the overall data flow:
> 
> The file is opened for writing like so:
> 
> 32684 open(".../logtest", O_WRONLY|O_CREAT, 0664) = 7
> 32684 fcntl(7, F_SETFD, FD_CLOEXEC)     = 0
> 32684 fcntl(7, F_GETFL)                 = 0x8001 (flags O_WRONLY|O_LARGEFILE)
> 32684 fcntl(7, F_SETFL, O_WRONLY|O_APPEND|O_LARGEFILE) = 0
> 32686 write(7, "...", 38) = 38
> 32686 nanosleep({0, 1000000}, NULL)     = 0
> 
> 
> The file is opened for reading like so:
> 
> open(".../logtest", O_RDONLY) = 6
> fstat(6, {st_mode=S_IFREG|0644, st_size=912, ...}) = 0
> lseek(6, 0, SEEK_CUR)                   = 0
> read(6, "..."..., 131072) = 912
> read(6, "", 131072)                     = 0
> lseek(6, 0, SEEK_SET)                   = 0
> read(6, "..."..., 912) = 912
> nanosleep({0, 10000000}, NULL)          = 0
> read(6, "...", 131072) = 114
> read(6, "", 131072)                     = 0
> lseek(6, 0, SEEK_SET)                   = 0
> read(6, "..."..., 1026) = 1026
> nanosleep({0, 10000000}, NULL)          = 0
> read(6, "...", 131072) = 114
> read(6, "", 131072)                     = 0
> lseek(6, 0, SEEK_SET)                   = 0
> read(6, "..."..., 1140) = 1102
> lseek(6, 38, SEEK_CUR)                  = 1140
> nanosleep({0, 10000000}, NULL)          = 0
> read(6, "...", 131072) = 114
> 
> I have been trying to duplicate the failure with a small python script, but I have as yet been unable to do so. This C++ binary does reproduce it quite reliably, but the build system for this is quite complex and not readily shareable.
> 
> I hope you can give me some advice about where to look next.
> 
> Thanks,
> -Mike
> 
> 
> 
> Strace from glusterfs client. fd 22 a tcp socket connected to glusterfsd on a remote host.
> 
> 
> 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0M\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1\n\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\357\306\250@\0\0\0\5\0\0\0\0\273\\^\v8\356\243\372\10\1\30\0 \7", 38}], 4) = 230
> 
> This is the last record that shows up in the glusterfs file before the "gap" - sequence number 7.
> 
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0N\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 30049 writev(22, [{"\200\0\0\220", 4}, {"\0\0\0O\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\31\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0", 28}], 3) = 148
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0P\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Q\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1\n\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3 <unfinished ...>
> 30058 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0R\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3 <unfinished ...>
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0S\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0T\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 30049 writev(22, [{"\200\0\1\10", 4}, {"\0\0\0U\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0L\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\357\350\336\20\0\0\0\5\0\0\0\0\345Ob/@P\226\233\10\1\30\0 \10\31D\6#\0\0\0\6\23\221\33-\357\373\321\240\0\0\0\5\0\0\0\0\27$\341,\303\324\25\202\10\1\30\0 \t", 76}], 4) = 268
> 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0V\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1|\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\360\16\212\230\0\0\0\5\0\0\0\0\4t\22\330\323\r4\10\10\1\30\0 \n", 38}], 4) = 230
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0W\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0X\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Y\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\f\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0", 44}], 3) = 164
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Z\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0[\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164
> 30058 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0\\\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0]\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 
> This is the first record that shows up in the file after the gap - sequence number 11. Somehow, records 8,9,10 are getting squashed.
> 
> 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0^\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\360mc\330\0\0\0\5\0\0\0\0\366\37\221\333\365V\4\223\10\1\30\0 \v", 38}], 4 <unfinished ...>
> 30049 writev(22, [{"\200\0\0\200", 4}, {"\0\0\0_\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\20\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\32\0\0\0\1\0\0\0\v\0\0\0\2batch-fsync\0001\0\0\0", 60}], 3) = 132
> 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0`\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1V\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164
> 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0a\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208
> 
> 



More information about the Gluster-users mailing list