[Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors

Sun Oct 19 09:01:58 UTC 2014

On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:
> It is possible, yes, because these are actually a kind of log files. I 
> suppose, like other logging frameworks these files can remain open for 
> a considerable period, and then get renamed to support log rotate 
> semantics.
>
> That said, I might need to check with the team that actually manages 
> the logging framework to be sure. I only take care of the file-system 
> stuff. I can tell you for sure Monday.
>
> If it is the same race that you mention, is there a fix for it?
>
> Thanks,
> Anirban
>
>
I am working on the fix.

RCA:
0) Let's say the file 'abc.log' is open for writing on the replica pair 
(brick-0, brick-1).
1) brick-0 goes down.
2) abc.log is renamed to abc.log.1.
3) brick-0 comes back up.
4) The mount re-opens the old abc.log on brick-0.
5) Self-heal kicks in, deletes the old abc.log on brick-0, and creates 
and syncs abc.log.1 there.
6) But the mount is still writing to the deleted 'old abc.log' on 
brick-0, so abc.log.1 on brick-0 stays the same size while abc.log.1 on 
brick-1 keeps growing. This leads to a size-mismatch split-brain on 
abc.log.1.

The race is between steps 4) and 5). If 5) happens before 4), no 
split-brain will be observed.
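
The sequence above can be replayed on an ordinary local filesystem. The 
following is only a rough sketch, not GlusterFS code: two plain 
directories stand in for brick-0 and brick-1, a file copy stands in for 
self-heal, and the function name simulate_race is made up for 
illustration. It relies on the POSIX behaviour that a descriptor opened 
before an unlink keeps writing to the now-nameless inode.

```python
import os
import shutil
import tempfile

def simulate_race(root):
    """Replay RCA steps 0-6 on two local directories standing in for bricks.

    Returns the final size of abc.log.1 on each stand-in brick.
    """
    brick0 = os.path.join(root, "brick-0")
    brick1 = os.path.join(root, "brick-1")
    os.makedirs(brick0)
    os.makedirs(brick1)

    # 0) abc.log is open for writing on both replicas.
    fd0 = open(os.path.join(brick0, "abc.log"), "ab", buffering=0)
    fd1 = open(os.path.join(brick1, "abc.log"), "ab", buffering=0)
    fd0.write(b"line 1\n")
    fd1.write(b"line 1\n")

    # 1) brick-0 goes down: its descriptor is lost and it misses what follows.
    fd0.close()

    # 2) The rename (and one more write) reach brick-1 only.
    os.rename(os.path.join(brick1, "abc.log"),
              os.path.join(brick1, "abc.log.1"))
    fd1.write(b"line 2\n")

    # 3) + 4) brick-0 returns and the old name is re-opened there.
    fd0 = open(os.path.join(brick0, "abc.log"), "ab", buffering=0)

    # 5) Stand-in for self-heal: delete old abc.log, sync abc.log.1 over.
    os.unlink(os.path.join(brick0, "abc.log"))
    shutil.copyfile(os.path.join(brick1, "abc.log.1"),
                    os.path.join(brick0, "abc.log.1"))

    # 6) Writes continue; on brick-0 they land in the unlinked inode.
    for fd in (fd0, fd1):
        fd.write(b"line 3\n")
    fd0.close()
    fd1.close()

    return (os.path.getsize(os.path.join(brick0, "abc.log.1")),
            os.path.getsize(os.path.join(brick1, "abc.log.1")))

with tempfile.TemporaryDirectory() as root:
    size0, size1 = simulate_race(root)
    print("brick-0:", size0, "brick-1:", size1)
```

Swapping the blocks marked 3) + 4) and 5) makes the open of the old name 
fail on brick-0, which corresponds to the ordering in which no 
split-brain is seen.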

Work-around:

0) Take a backup of the good abc.log.1 file from brick-1. (Just being 
paranoid.)

Do either of the following two steps to make sure the stale file that 
is open gets closed:
1-a) Take down the brick process that has the bad file using 
kill -9 <brick-pid> (brick-0 in my example).
1-b) Introduce a temporary disconnect between the mount and brick-0.
(I would choose 1-a)
2) Remove the bad file (abc.log.1) and its gfid-backend-file from brick-0.
3) Bring the brick back up (gluster volume start <volname> force) or 
restore the connection, and let it heal by doing 'stat' on the file 
abc.log.1 on the mount.
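
For step 2), locating the gfid-backend-file can be sketched as below. 
This assumes the usual brick layout as I understand it, where the hard 
link for a regular file lives at 
<brick>/.glusterfs/<first two hex chars>/<next two>/<full gfid>, and 
that the gfid is stored as 16 raw bytes in the trusted.gfid xattr. The 
helper names, the brick path and the gfid value are hypothetical; 
verify the resulting path on your own brick before removing anything.

```python
import os
import uuid

def gfid_backend_path(brick_root, gfid_bytes):
    """Return the assumed .glusterfs hard-link path for a 16-byte gfid."""
    gfid = str(uuid.UUID(bytes=gfid_bytes))
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def read_gfid(path_on_brick):
    """Read the raw gfid from a file on the brick (Linux only, needs root)."""
    return os.getxattr(path_on_brick, "trusted.gfid")

# Hypothetical brick path and gfid, for illustration only.
example = gfid_backend_path(
    "/bricks/brick-0",
    bytes.fromhex("0a1b2c3d4e5f60718293a4b5c6d7e8f9"),
)
print(example)
```

On a real brick one would pass the result of read_gfid() for the bad 
abc.log.1 into gfid_backend_path() and remove both that path and the 
file itself.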

This bug has existed since 2012, from the first time I implemented 
rename/hard-link self-heal. It is difficult to re-create; I have to put 
break-points at several places in the process to hit the race.

Pranith
>
> ------------------------------------------------------------------------
> *From: * Pranith Kumar Karampuri <pkarampu at redhat.com>;
> *To: * Anirban Ghoshal <chalcogen_eg_oxygen at yahoo.com>; 
> <gluster-users at gluster.org>;
> *Subject: * Re: [Gluster-users] Split-brain seen with [0 0] pending 
> matrix and io-cache page errors
> *Sent: * Sun, Oct 19, 2014 5:42:24 AM
>
>
> On 10/18/2014 04:36 PM, Anirban Ghoshal wrote:
>> Hi,
>>
>> Yes, they do, and considerably. I'd forgotten to mention that in my 
>> last email. Their mtimes, however, as far as I could tell on separate 
>> servers, seemed to coincide.
>>
>> Thanks,
>> Anirban
>>
>>
>
> Are these files always open? And is it possible that the file could 
> have been renamed when one of the bricks was offline? I know of a race 
> which can introduce this one. Just trying to find if it is the same case.
>
> Pranith
>
>
>> ------------------------------------------------------------------------
>> *From: * Pranith Kumar Karampuri <pkarampu at redhat.com>;
>> *To: * Anirban Ghoshal <chalcogen_eg_oxygen at yahoo.com>; 
>> gluster-users at gluster.org <gluster-users at gluster.org>;
>> *Subject: * Re: [Gluster-users] Split-brain seen with [0 0] pending 
>> matrix and io-cache page errors
>> *Sent: * Sat, Oct 18, 2014 12:26:08 AM
>>
>> hi,
>>       Could you see if the size of the file mismatches?
>>
>> Pranith
>>
>> On 10/18/2014 04:20 AM, Anirban Ghoshal wrote:
>>> Hi everyone,
>>>
>>> I have this really confusing split-brain here that's bothering me. I 
>>> am running glusterfs 3.4.2 over linux 2.6.34. I have a replica 2 
>>> volume 'testvol' that is It seems I cannot read/stat/edit the file 
>>> in question, and `gluster volume heal testvol info split-brain` 
>>> shows nothing. Here are the logs from the fuse-mount for the volume:
>>>
>>> [2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 
>>> 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
>>> [2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 
>>> 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 
>>> 0x7fd5c8067d40
>>> [2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>> 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
>>> [2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 
>>> 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 
>>> 0x7fd5c8067d40
>>> [2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>> 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
>>> [2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 
>>> 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 
>>> 0x7fd5c8067d40
>>> [2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>> 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
>>> [2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 
>>> 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 
>>> 0x7fd5c863b1f0
>>> [2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>> 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
>>> [2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 
>>> 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 
>>> 0x7fd5c8067d40
>>> [2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>> 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
>>> [2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 
>>> 0-testvol-replicate-0: failed to open as split brain seen, returning 
>>> EIO
>>> [2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 
>>> 0-glusterfs-fuse: 4561142: OPEN() 
>>> /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log 
>>> => -1 (Input/output error)
>>>
>>> Could somebody please give me some clue on where to begin? I checked 
>>> the xattrs on 
>>> /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log 
>>> and it seems the changelogs are [0, 0] on both replicas, and the 
>>> gfid's match.
>>>
>>> Thank you very much for any help on this.
>>> Anirban
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>
