[Gluster-users] Random and frequent split brain

Thu Jul 17 06:24:31 UTC 2014

On 07/17/2014 08:41 AM, Nilesh Govindrajan wrote:
> log1 and log2 are brick logs. The others are client logs.
I see a lot of log entries like the ones below in the 'log1' you attached.
It seems that the inode/device ID of the file where it is actually stored
and that of the gfid-link of the same file, i.e. the link stored inside
<brick-dir>/.glusterfs/, are different. Which devices/filesystems are
present inside the brick represented by 'log1'?

[2014-07-16 00:00:24.358628] W [posix-handle.c:586:posix_handle_hard] 
0-home-posix: mismatching ino/dev between file 
/data/gluster/home/techiebuzz/techie-buzz.com/wp-content/cache/page_enhanced/techie-buzz.com/social-networking/facebook-will-permanently-remove-your-deleted-photos.html/_index.html.old 
(1077282838/2431) and handle 
/data/gluster/home/.glusterfs/ae/f0/aef0404b-e084-4501-9d0f-0e6f5bb2d5e0 
(1077282836/2431)
[2014-07-16 00:00:24.358646] E [posix.c:823:posix_mknod] 0-home-posix: 
setting gfid on 
/data/gluster/home/techiebuzz/techie-buzz.com/wp-content/cache/page_enhanced/techie-buzz.com/social-networking/facebook-will-permanently-remove-your-deleted-photos.html/_index.html.old 
failed
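
To check one of the reported files by hand, something like the following
Python sketch could be run on the brick server. It is only an illustration,
not GlusterFS code: the function names are made up here, and the handle
layout (<brick-dir>/.glusterfs/<first two hex chars>/<next two>/<gfid>) is
inferred from the paths in the log above. If the pairs in parentheses are
printed as ino/dev, as the message wording suggests, the two entries above
share the device value 2431 and differ only in inode number, which would
mean the file and its handle are not hard links to the same inode.

```python
import os


def gfid_handle_path(brick_dir, gfid):
    """Build the handle path for a gfid, following the layout seen in the log.

    Assumption: handles live at <brick_dir>/.glusterfs/<aa>/<bb>/<gfid>,
    where <aa> and <bb> are the first and second pairs of hex characters.
    """
    return os.path.join(brick_dir, ".glusterfs", gfid[0:2], gfid[2:4], gfid)


def ino_dev(path):
    """Return (inode number, device ID) without following a final symlink."""
    st = os.lstat(path)
    return st.st_ino, st.st_dev


def is_same_file(path, handle):
    """True when both names refer to one inode on one device (hard links)."""
    return ino_dev(path) == ino_dev(handle)
```

For example, is_same_file(<file path from the log>, gfid_handle_path(
"/data/gluster/home", "aef0404b-e084-4501-9d0f-0e6f5bb2d5e0")) returning
False would match the warning above; printing ino_dev() for both paths shows
whether it is the inode, the device, or both that differ.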

Pranith

>
> On Thu, Jul 17, 2014 at 8:08 AM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
>> On 07/17/2014 07:28 AM, Nilesh Govindrajan wrote:
>>> On Thu, Jul 17, 2014 at 7:26 AM, Nilesh Govindrajan <me at nileshgr.com>
>>> wrote:
>>>> Hello,
>>>>
>>>> I'm having a weird issue. I have this config:
>>>>
>>>> node2 ~ # gluster peer status
>>>> Number of Peers: 1
>>>>
>>>> Hostname: sto1
>>>> Uuid: f7570524-811a-44ed-b2eb-d7acffadfaa5
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> node1 ~ # gluster peer status
>>>> Number of Peers: 1
>>>>
>>>> Hostname: sto2
>>>> Port: 24007
>>>> Uuid: 3a69faa9-f622-4c35-ac5e-b14a6826f5d9
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> Volume Name: home
>>>> Type: Replicate
>>>> Volume ID: 54fef941-2e33-4acf-9e98-1f86ea4f35b7
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: sto1:/data/gluster/home
>>>> Brick2: sto2:/data/gluster/home
>>>> Options Reconfigured:
>>>> performance.write-behind-window-size: 2GB
>>>> performance.flush-behind: on
>>>> performance.cache-size: 2GB
>>>> cluster.choose-local: on
>>>> storage.linux-aio: on
>>>> transport.keepalive: on
>>>> performance.quick-read: on
>>>> performance.io-cache: on
>>>> performance.stat-prefetch: on
>>>> performance.read-ahead: on
>>>> cluster.data-self-heal-algorithm: diff
>>>> nfs.disable: on
>>>>
>>>> sto1 and sto2 are aliases for node1 and node2 respectively.
>>>>
>>>> As you see, NFS is disabled so I'm using the native fuse mount on both
>>>> nodes.
>>>> The volume contains files and php scripts that are served on various
>>>> websites. When both nodes are active, I get split brain on many files,
>>>> and the mount on node2 returns 'input/output error' for many of them,
>>>> which causes HTTP 500 errors.
>>>>
>>>> I delete the files from the brick using find -samefile. That fixes it
>>>> for a few minutes, and then the problem is back.
>>>>
>>>> What could be the issue? This happens even if I use the NFS mounting
>>>> method.
>>>>
>>>> Gluster 3.4.4 on Gentoo.
>>> And yes, network connectivity is not an issue between them as both of
>>> them are located in the same DC. They're connected via 1 Gbit line
>>> (common for internal and external network) but external network
>>> doesn't cross 200-500 Mbit/s leaving quite a good window for gluster.
>>> I also tried enabling quorum but that doesn't help either.
>> Hi Nilesh,
>>        Could you attach the mount and brick logs so that we can inspect what
>> is going on in the setup?
>>
>> Pranith