[Gluster-users] split-brain
Ravishankar N
ravishankar at redhat.com
Mon Mar 20 13:19:47 UTC 2017
On 03/20/2017 06:31 PM, Bernhard Dübi wrote:
> Hi Ravi,
>
> Thank you very much for looking into this.
> The gluster volumes are used by CommVault Simpana to store backup
> data. Nothing/Nobody should access the underlying infrastructure.
>
> While looking at the xattrs of the files, I noticed that the only
> difference was the bit-rot.version. So I assume that something went
> wrong in the synchronization of the bit-rot data, that differing
> bit-rot.versions are treated like a split-brain situation, and that
> access is denied because there is no guarantee of correctness. This is
> just a wild guess.
Hi Bernhard,
The bit-rot version can differ between the bricks of a replica when I/O
succeeded on only one brick while the other brick was down. AFR self-heal
will later heal the file contents, but it does not modify the bitrot
xattrs. So a version mismatch on its own is not a problem.
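To illustrate the point above, here is a rough Python sketch (not part of
gluster) that compares the xattrs of one file as read from two bricks and
separates a bit-rot version mismatch from any other difference. The helper
names and sample values are hypothetical, and the full name
trusted.bit-rot.version is an assumption based on the
'trusted.bit-rot.bad-file' name, so please check it against your own
getfattr output.

```python
# Sketch: compare the xattrs of one file as seen on two replica bricks.
# Assumption: xattr names follow the "trusted.bit-rot.*" pattern mentioned
# in this thread; adjust them to what getfattr shows on your bricks.

BENIGN = {"trusted.bit-rot.version"}   # may differ after one brick was down
BAD_FILE = "trusted.bit-rot.bad-file"  # set when bitrot marks a file as bad


def compare_xattrs(brick_a, brick_b):
    """Return (benign, suspicious) lists of xattr names whose values differ."""
    benign, suspicious = [], []
    for name in sorted(set(brick_a) | set(brick_b)):
        if brick_a.get(name) == brick_b.get(name):
            continue
        (benign if name in BENIGN else suspicious).append(name)
    return benign, suspicious


def marked_bad(xattrs):
    """True if the bad-file marker is present at all, whatever its value."""
    return BAD_FILE in xattrs


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    a = {"trusted.bit-rot.version": b"\x02", "trusted.gfid": b"\xbf\xdf"}
    b = {"trusted.bit-rot.version": b"\x03", "trusted.gfid": b"\xbf\xdf"}
    print(compare_xattrs(a, b))
    print(marked_bad(a), marked_bad(b))
```

The dictionaries would be filled from getfattr output collected as root on
each brick. Anything that lands in the "suspicious" list, or a bad-file
marker on either side, deserves a closer look; a version-only difference
does not.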
>
> Over the weekend I identified hundreds of files with input/output
> errors. I compared the sha256sum of each file on both bricks; they were
> always the same. I then deleted the affected files from gluster and
> recreated them. This should have fixed the issue. Verification is still
> running.
>
> If you're interested in the root cause, I can send you more log files
> and the xattrs of some files.
If you did not access the underlying bricks directly, as you said, then
this could be a bitrot bug. If you don't mind, please raise a BZ under the
bitrot component, against the appropriate gluster version, with all
client and brick logs attached.
Also, if you have some kind of reproducer, that would help a lot.
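If it helps when collecting data for the BZ, the brick-side check
described in this thread (same checksum on both bricks, yet both copies
flagged) could be scripted roughly like this. It is only a sketch: the
function names are made up for illustration, the paths you pass in would
be the brick-side paths of the same file, and the xattr lookup is Linux
only and should be run as root.

```python
import errno
import hashlib
import os

BAD_FILE = "trusted.bit-rot.bad-file"  # xattr name as quoted in this thread


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup containers do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_marked_bad(path):
    """Return True/False for the bad-file xattr, or None if it is unknown.

    Run as root: unprivileged reads of trusted.* xattrs typically look as
    if the attribute were absent.
    """
    try:
        os.getxattr(path, BAD_FILE)
        return True
    except AttributeError:
        return None  # platform without os.getxattr
    except OSError as exc:
        # ENODATA means the xattr is absent; anything else is "unknown".
        return False if exc.errno == getattr(errno, "ENODATA", None) else None


def brick_report(paths):
    """Map each brick-side path to its checksum and bad-file flag."""
    return {p: {"sha256": sha256_of(p), "bad": is_marked_bad(p)} for p in paths}


def checksums_agree(report):
    """True if every path in the report has the same sha256."""
    return len({entry["sha256"] for entry in report.values()}) == 1
```

A report where checksums_agree() is true but "bad" is set on every brick
is the combination seen here, and attaching it together with the logs
would make the BZ easier to triage.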
-Ravi
>
>
> Best Regards
> Bernhard
>
>
> 2017-03-20 12:57 GMT+01:00 Ravishankar N <ravishankar at redhat.com>:
>
> SFILE_CONTAINER_080 is the one that seems to be in split-brain.
> SFILE_CONTAINER_046, for which you have provided the getfattr
> output, hard links, etc., does not seem to be in split-brain. We do
> see that the fops on SFILE_CONTAINER_046 are failing on the client
> translator itself due to EIO:
>
> [2017-03-17 19:49:56.088867] E [MSGID: 114031]
> [client-rpc-fops.c:444:client3_3_open_cbk]
> 0-Server_Legal_01-client-0: remote operation failed. Path:
> /Server_Legal/CV_MAGNETIC/V_944453/CHUNK_9291168/SFILE_CONTAINER_046
> (bfdfe21a-1af3-474b-a6a4-bc0e17edb529) [Input/output error]
>
> [2017-03-17 19:49:56.089012] E [MSGID: 114031]
> [client-rpc-fops.c:444:client3_3_open_cbk]
> 0-Server_Legal_01-client-1: remote operation failed. Path:
> /Server_Legal/CV_MAGNETIC/V_944453/CHUNK_9291168/SFILE_CONTAINER_046
> (bfdfe21a-1af3-474b-a6a4-bc0e17edb529) [Input/output error]
>
> This is why the sha256sum on the mount gave EIO: the file seems to
> be corrupt on both bricks, since the 'trusted.bit-rot.bad-file'
> xattr is set.
>
> Did you write to the files directly on the backend? What is
> interesting is that the sha256sum is the same on both bricks
> even though both copies are marked as bad by bitrot.
>
> -Ravi
>
>
> On 03/18/2017 03:20 AM, Bernhard Dübi wrote:
>> Hi,
>>
>> I have a situation
>>
>> The volume logfile reports a possible split-brain, but when I try
>> to heal, it fails because the file is not in split-brain. Any ideas?
>>
>>
>>
>>
>> Regards
>>
>> Bernhard
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>