[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

Strahil Nikolov hunter86_bg at yahoo.com
Mon Mar 30 19:35:51 UTC 2020

On March 30, 2020 7:54:59 PM GMT+03:00, Erik Jacobson <erik.jacobson at hpe.com> wrote:
>> Hi Erik,
>> Sadly I didn't have the time to take a look in your logs, but I would
>> like to ask you whether you have statistics of the network bandwidth.
>> Could it be possible that the gNFS server is starved for bandwidth
>> and fails to reach all bricks, leading to 'split-brain' errors?
>I understand. I doubt there is a bandwidth issue but I'll add this to
>my checks. We normally have 288 nodes per server and they run fine with
>all servers up. The 76 number is just what we happened to have access
>to on an internal system.
>
>Question: What you mentioned above, and a feeling I have too
>personally, is -- is the split-brain error actually a generic catch-all
>error for not being able to get access to a file? So when it says
>"split-brain", could it really mean any type of access error? Could it
>also be given when there is an I/O timeout or something?
>
>I'm starting to break open the source code to look around, but I think
>my head will explode before I understand it enough. I will still give
>it a try.
>
>I have access to this system until later tonight. Then it goes away. We
>have duplicated it on another system that stays, but that machine is so
>contended for internally that I wouldn't get a time slot until later in
>the week anyway. Trying to make as much use of this "gift" machine as I
>can :) :)
>
>Thanks again for the replies so far.

Hey Erik,

Sadly I am not a developer, so I can't answer your questions.
Still, bandwidth starvation looks like a possible reason (at least to me), although in that case error messages and timeouts should fill the logs.
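For a quick check of whether the NIC is saturated while the errors appear, something along these lines should do (just a sketch: `sar` needs the sysstat package installed, and `eth0` is only an example interface name, substitute whatever carries the gNFS traffic):

```shell
# Sample per-interface throughput every 2 seconds, 30 samples
# (requires the sysstat package; adjust interval/count as needed).
sar -n DEV 2 30

# Or watch a single interface live; 'eth0' is only an example,
# use the interface the gNFS traffic actually goes over.
ifstat -i eth0 2
```

Comparing the rxkB/s and txkB/s columns against the link speed during a reproduction run would show whether starvation is plausible.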

I can recommend increasing the logging for both the bricks and the volume to the maximum and trying to reproduce the issue.
Keep in mind that at that level the logs can grow very fast.
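Something like the following should raise the verbosity (a sketch only: `gvol0` is a placeholder volume name; TRACE is the most verbose level and will eat disk quickly, so drop back to INFO afterwards):

```shell
# Raise brick-side and client-side log verbosity for the volume.
# 'gvol0' is a placeholder; valid levels include INFO, DEBUG, TRACE.
gluster volume set gvol0 diagnostics.brick-log-level TRACE
gluster volume set gvol0 diagnostics.client-log-level TRACE

# After reproducing the issue, restore the defaults:
gluster volume set gvol0 diagnostics.brick-log-level INFO
gluster volume set gvol0 diagnostics.client-log-level INFO
```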

Best Regards,
Strahil Nikolov
