[Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors

Pranith Kumar Karampuri pkarampu at redhat.com
Mon Oct 20 04:08:05 UTC 2014


On 10/19/2014 06:05 PM, Anirban Ghoshal wrote:
> I see. Thanks a tonne for the thorough explanation! :) I can see that 
> our setup would be vulnerable here because the logger on one server is 
> not generally aware of the state of the replica on the other server. 
> So, it is possible that the log files may have been renamed before 
> heal had a chance to kick in.
>
> Could I also ask you for the bug ID (should there be one) against 
> which you are coding up the fix, so that we can get a notification 
> once the fix goes in?
>
This bug was reported by Red Hat QE and has been cloned upstream. I 
have copied the relevant content so you can see the context:
https://bugzilla.redhat.com/show_bug.cgi?id=1154491

Pranith
>
> Also, as an aside, is O_DIRECT supposed to prevent this from occurring 
> if one were to make allowance for the performance hit?
>
Unfortunately no :-(. As far as I understand, the work-around from my earlier mail (quoted below) is the only one.

Pranith
>
> Thanks again,
> Anirban
>
>
> ------------------------------------------------------------------------
> From: Pranith Kumar Karampuri <pkarampu at redhat.com>
> To: Anirban Ghoshal <chalcogen_eg_oxygen at yahoo.com>, <gluster-users at gluster.org>
> Subject: Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
> Sent: Sun, Oct 19, 2014 9:01:58 AM
>
>
> On 10/19/2014 01:36 PM, Anirban Ghoshal wrote:
>> It is possible, yes, because these are actually a kind of log file. 
>> I suppose, as with other logging frameworks, these files can remain 
>> open for a considerable period and then get renamed to support 
>> log-rotate semantics.
>>
>> That said, I might need to check with the team that actually manages 
>> the logging framework to be sure. I only take care of the file-system 
>> stuff. I can tell you for sure Monday.
>>
>> If it is the same race that you mention, is there a fix for it?
>>
>> Thanks,
>> Anirban
>>
>>
> I am working on the fix.
>
> RCA:
> 0) Let's say the file 'abc.log' is opened for writing on the replica 
> pair (brick-0, brick-1).
> 1) brick-0 goes down.
> 2) abc.log is renamed to abc.log.1.
> 3) brick-0 comes back up.
> 4) A re-open of the old abc.log happens from the mount to brick-0.
> 5) Self-heal kicks in, deletes the old abc.log on brick-0, and creates 
> and syncs abc.log.1.
> 6) But the mount is still writing to the deleted 'old abc.log' on 
> brick-0, so abc.log.1 stays at the same size on brick-0 while abc.log.1 
> keeps growing on brick-1. This leads to a size-mismatch split-brain 
> on abc.log.1.
>
> The race is between steps 4) and 5). If 5) happens before 4), no 
> split-brain is observed.
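>
> To make step 6) concrete, here is a minimal sketch (on a plain local
> filesystem, no GlusterFS involved) of the POSIX behaviour the race relies
> on: a file descriptor that stays open across a rename, and then across an
> unlink, keeps writing to the original inode. The file names are just the
> ones from the example above.
>
>     # writer holds an fd open, like the application writing through the mount
>     exec 3>>abc.log
>     echo one >&3
>     mv abc.log abc.log.1   # rotation: fd 3 still points at the same inode
>     echo two >&3           # the data shows up in abc.log.1
>     rm abc.log.1           # roughly what self-heal does to the stale copy in 5)
>     echo three >&3         # the write still succeeds, but into an unlinked inode
>     exec 3>&-              # close; the inode, and "three", are gone for good
>
> On brick-1 the open fd follows the rename, so the writes keep landing in
> abc.log.1; on brick-0 the re-opened fd ends up in the "rm" situation
> above, so its copy of abc.log.1 never grows.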
>
> Work-around:
>
> 0) Take a backup of the good abc.log.1 file from brick-1 (just being 
> paranoid).
>
> Do either of the following to make sure the stale open file is closed:
> 1-a) Take the brick process with the bad file down using kill -9 
> <brick-pid> (in my example, brick-0).
> 1-b) Introduce a temporary disconnect between the mount and brick-0.
> (I would choose 1-a.)
> 2) Remove the bad file (abc.log.1) and its gfid backend file from brick-0.
> 3) Bring the brick back up (gluster volume start <volname> 
> force) / restore the connection and let it heal by doing a 'stat' on 
> abc.log.1 from the mount.
> (A rough command-level sketch of these steps follows below.)
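>
> Purely as an illustration, the steps above could look something like the
> following on the node hosting the bad copy (brick-0 in the example). The
> brick path (/bricks/brick0), the mount point (/mnt/testvol) and the
> location of abc.log.1 are made-up placeholders, and the .glusterfs path
> has to be built from the gfid you read back, so double-check every path
> before removing anything:
>
>     # 1-a) find and kill the brick process that still holds the stale fd
>     gluster volume status testvol      # note the PID of the brick-0 process
>     kill -9 <brick-pid>
>
>     # 2) remove the bad file and its gfid hard-link under .glusterfs
>     getfattr -n trusted.gfid -e hex /bricks/brick0/abc.log.1
>     rm /bricks/brick0/abc.log.1
>     rm /bricks/brick0/.glusterfs/<first-two-hex-chars-of-gfid>/<next-two>/<full-gfid>
>
>     # 3) bring the brick back up and trigger the heal from a client
>     gluster volume start testvol force
>     stat /mnt/testvol/abc.log.1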
>
> This bug has existed since 2012, when I first implemented 
> rename/hard-link self-heal. It is difficult to re-create; I have to 
> put breakpoints at several places in the process to hit the race.
>
> Pranith
>
>>
>> Thanks,
>> Anirban
>>
>> ------------------------------------------------------------------------
>> From: Pranith Kumar Karampuri <pkarampu at redhat.com>
>> To: Anirban Ghoshal <chalcogen_eg_oxygen at yahoo.com>, <gluster-users at gluster.org>
>> Subject: Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
>> Sent: Sun, Oct 19, 2014 5:42:24 AM
>>
>>
>> On 10/18/2014 04:36 PM, Anirban Ghoshal wrote:
>>> Hi,
>>>
>>> Yes, they do, and considerably. I'd forgotten to mention that in my 
>>> last email. Their mtimes, however, as far as I could tell on the 
>>> separate servers, seemed to coincide.
>>>
>>> Thanks,
>>> Anirban
>>>
>>>
>>
>> Are these files always open? And is it possible that the file could 
>> have been renamed while one of the bricks was offline? I know of a 
>> race which can introduce this; just trying to find out if it is the 
>> same case.
>>
>> Pranith
>>
>>
>>> ------------------------------------------------------------------------
>>> From: Pranith Kumar Karampuri <pkarampu at redhat.com>
>>> To: Anirban Ghoshal <chalcogen_eg_oxygen at yahoo.com>, gluster-users at gluster.org
>>> Subject: Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
>>> Sent: Sat, Oct 18, 2014 12:26:08 AM
>>>
>>> hi,
>>>       Could you see if the size of the file mismatches?
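>>>
>>> (For instance, by comparing each brick's copy of the file directly on 
>>> the servers; the brick path below is only a placeholder:
>>>
>>>     stat -c '%s %Y' /bricks/brick0/path/to/the/file   # size in bytes, mtime (epoch)
>>>
>>> and the same for the other brick's copy.)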
>>>
>>> Pranith
>>>
>>> On 10/18/2014 04:20 AM, Anirban Ghoshal wrote:
>>>> Hi everyone,
>>>>
>>>> I have this really confusing split-brain here that's bothering me. 
>>>> I am running glusterfs 3.4.2 over Linux 2.6.34. I have a replica 2 
>>>> volume 'testvol'. It seems I cannot read/stat/edit the file 
>>>> in question, and `gluster volume heal testvol info split-brain` 
>>>> shows nothing. Here are the logs from the fuse mount for the volume:
>>>>
>>>> [2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 
>>>> 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 
>>>> 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 
>>>> 0x7fd5c8067d40
>>>> [2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>>> 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 
>>>> 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 
>>>> 0x7fd5c8067d40
>>>> [2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>>> 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 
>>>> 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 
>>>> 0x7fd5c8067d40
>>>> [2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>>> 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 
>>>> 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 
>>>> 0x7fd5c863b1f0
>>>> [2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>>> 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 
>>>> 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 
>>>> 0x7fd5c8067d40
>>>> [2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 
>>>> 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
>>>> [2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 
>>>> 0-testvol-replicate-0: failed to open as split brain seen, 
>>>> returning EIO
>>>> [2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 
>>>> 0-glusterfs-fuse: 4561142: OPEN() 
>>>> /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log 
>>>> => -1 (Input/output error)
>>>>
>>>> Could somebody please give me some clue on where to begin? I 
>>>> checked the xattrs on 
>>>> /SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log 
>>>> and it seems the changelogs are [0, 0] on both replicas, and the 
>>>> gfids match.
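>>>>
>>>> (For reference, such a dump is typically taken directly on each 
>>>> brick with something like the following; the brick path here is 
>>>> only a placeholder:
>>>>
>>>>     getfattr -d -m . -e hex \
>>>>         /bricks/brick0/SECLOG/20140908.d/SECLOG_00000000000000427425_00000000000000000000.log
>>>>
>>>> which, when run as root, prints the trusted.afr.testvol-client-* 
>>>> changelogs and the trusted.gfid of that copy.)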
>>>>
>>>> Thank you very much for any help on this.
>>>> Anirban
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
