[Gluster-users] Interesting split-brain...

Ludwig Gamache ludwig at elementai.com
Thu Jun 15 13:42:38 UTC 2017


Hi, I did maintenance on our 2 bricks to add RAM. One of
the bricks was down for about 30 minutes and the other for about 10
minutes. Between the two shutdowns, I only gave Gluster a few minutes to
heal. I know that many files were still not in sync when I shut down
the second brick.

The rest is partly assumption. I know that one of the users was trying to
share the zsh history file between multiple Docker containers. He tried to
use the same file and also tried to use a directory to hold multiple
history files.

My guess is that when I shut down the first node, he created the directory.
When I rebooted the first brick and shut down the second one, I most likely
did not give the 2 bricks enough time to heal. Then he created the file
on the second node. When I rebooted the second brick, Gluster was not able
to recover.

Would a third brick have solved this situation? I am not entirely sure.
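
For reference, the trusted.afr values in the getfattr output quoted below
can be decoded with a short script. This is a minimal sketch, assuming the
usual AFR changelog layout of three big-endian 32-bit counters (pending
data, metadata, and entry operations). The helper name decode_afr is made
up for illustration, and mapping client-0 and client-1 to brick 1 and
brick 2 is an assumption based on brick order.

```python
import struct


def decode_afr(hex_value):
    """Split a trusted.afr.* value into (data, metadata, entry) pending counts.

    Assumes the value is 12 bytes: three big-endian unsigned 32-bit counters.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte changelog value")
    return struct.unpack(">III", raw)


# Values copied from the getfattr output quoted below.
xattrs = {
    "brick 1": {
        "data01-client-0": "0x000000000000000000000000",
        "data01-client-1": "0x000000000000000200000000",
    },
    "brick 2": {
        "data01-client-0": "0x000000170000000200000000",
        "data01-client-1": "0x000000000000000000000000",
    },
}

for brick, values in xattrs.items():
    for client, value in values.items():
        data, metadata, entry = decode_afr(value)
        print(f"{brick}: {client} data={data} metadata={metadata} entry={entry}")
```

As far as I understand it, a non-zero counter under
trusted.afr.data01-client-N means the brick holding the xattr has pending
operations recorded against client-N. Here each brick has non-zero counters
for the other one, so neither copy can be picked as the heal source
automatically.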

On Thu, Jun 15, 2017 at 1:43 AM, Mohammed Rafi K C <rkavunga at redhat.com>
wrote:

> Can you please explain how we ended up in this scenario? I think that will
> help us understand more about these scenarios and why Gluster recommends
> replica 3 or arbiter volumes.
>
> Regards
>
> Rafi KC
>
> On 06/15/2017 10:46 AM, Karthik Subrahmanya wrote:
>
> Hi Ludwig,
>
> There is no automatic way to resolve gfid split-brains with a type
> mismatch. You have to resolve it manually by following the steps in [1].
> For a gfid-only mismatch, 3.11 provides a way to
> resolve it by using the *favorite-child-policy*.
> Since the file is not important, you can simply delete it.
>
> [1] https://gluster.readthedocs.io/en/latest/Troubleshooting/split-brain/#fixing-directory-entry-split-brain
>
> HTH,
> Karthik
>
> On Thu, Jun 15, 2017 at 8:23 AM, Ludwig Gamache <ludwig at elementai.com>
> wrote:
>
>> I am new to Gluster but already like it. I did maintenance last week
>> where I shut down both nodes (one after the other). I had many files that
>> needed to be healed after that. Everything worked well, except for 1 file.
>> It is in split-brain, with 2 different GFIDs. I read the documentation but
>> it only covers the cases where the GFID is the same on both bricks. BTW, I
>> am running Gluster 3.10.
>>
>> Here are some details...
>>
>> [root at NAS-01 .glusterfs]# gluster volume heal data01 info
>>
>> Brick 192.168.186.11:/mnt/DATA/data
>>
>> /abc/.zsh_history
>>
>> /abc - Is in split-brain
>>
>>
>> Status: Connected
>>
>> Number of entries: 2
>>
>>
>> Brick 192.168.186.12:/mnt/DATA/data
>>
>> /abc - Is in split-brain
>>
>>
>> /abc/.zsh_history
>>
>> Status: Connected
>>
>> Number of entries: 2
>>
>> On brick 1:
>>
>> [root at NAS-01 abc]# ls -lart
>>
>> total 75
>>
>> drwxr-xr-x.  2 root  root  2 Jun  8 13:26 .zsh_history
>>
>> drwxr-xr-x.  3 12078 root  3 Jun 12 11:36 .
>>
>> drwxrwxrwt. 17 root  root 17 Jun 12 12:20 ..
>>
>> On brick 2:
>>
>> [root at DC-MTL-NAS-02 abc]# ls -lart
>>
>> total 66
>>
>> -rw-rw-r--.  2 12078 12078 1085 Jun 12 04:42 .zsh_history
>>
>> drwxr-xr-x.  2 12078 root     3 Jun 12 10:36 .
>>
>> drwxrwxrwt. 17 root  root    17 Jun 12 11:20 ..
>>
>> Notice that on one brick, it is a file and on the other one it is a
>> directory.
>>
>> On brick 1:
>>
>> [root at NAS-01 abc]# getfattr -d -m . -e hex /mnt/DATA/data/abc/.zsh_history
>>
>> getfattr: Removing leading '/' from absolute path names
>>
>> # file: mnt/DATA/data/abc/.zsh_history
>>
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>
>> trusted.afr.data01-client-0=0x000000000000000000000000
>>
>> trusted.afr.data01-client-1=0x000000000000000200000000
>>
>> trusted.gfid=0xdee43407139d41f091d13e106a51f262
>>
>> trusted.glusterfs.dht=0x000000010000000000000000ffffffff
>>
>> On brick 2:
>>
>> [root at NAS-02 abc]# getfattr -d -m . -e hex /mnt/DATA/data/abc/.zsh_history
>>
>> getfattr: Removing leading '/' from absolute path names
>>
>> # file: mnt/DATA/data/abc/.zsh_history
>>
>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>
>> trusted.afr.data01-client-0=0x000000170000000200000000
>>
>> trusted.afr.data01-client-1=0x000000000000000000000000
>>
>> trusted.bit-rot.version=0x060000000000000059397acd0005dadd
>>
>> trusted.gfid=0xa70ae9af887a4a37875f5c7c81ebc803
>>
>> Any recommendation on how to recover from that? BTW, the file is not
>> important and I could easily get rid of it without impact. So, if this is
>> an easy solution...
>>
>> Regards,
>>
>> --
>> Ludwig Gamache
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>


-- 
Ludwig Gamache
IT Director - Element AI
4200 St-Laurent, suite 1200
514-704-0564