[Gluster-users] Self heal problem

Marcus Wellhardh wellhardh at roxen.com
Wed Dec 4 08:45:12 UTC 2013


Hi,

I reduced my setup to two nodes and did a graceful restart of one node
(rod). Still same problem, split-brain on the vSphere-HA lock file. I
found some additional log entries that might give som clues, especially
the "CREATE (null)" error on the live node while the other is offline. 

[root@todd ~]# tail -F /var/log/glusterfs/bricks/data-gv0.log
[2013-12-04 07:37:00.339843] I [server.c:762:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.339866] I [server-helpers.c:729:server_connection_put] 0-gv0-server: Shutting down connection rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.339889] I [server-helpers.c:617:server_connection_destroy] 0-gv0-server: destroyed connection of rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.810926] I [server.c:762:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:37:00.810950] I [server-helpers.c:729:server_connection_put] 0-gv0-server: Shutting down connection rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:37:00.811005] I [server-helpers.c:617:server_connection_destroy] 0-gv0-server: destroyed connection of rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:38:01.696398] I [server-rpc-fops.c:1618:server_create_cbk] 0-gv0-server: 445781: CREATE (null) (f0648215-68ff-441e-88aa-99a553c6d4e6/.lck-21133152dee76ab0) ==> (File exists)
[2013-12-04 07:41:11.841299] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2447-2013/12/04-07:41:11:718343-gv0-client-0-0 (version: 3.4.1)
[2013-12-04 07:41:17.345764] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2875-2013/12/04-07:41:17:416820-gv0-client-0-0 (version: 3.4.1)
[2013-12-04 07:41:17.395322] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2873-2013/12/04-07:41:17:400240-gv0-client-0-0 (version: 3.4.1)


[root@rod ~]# tail -F /var/log/glusterfs/bricks/data-gv0.log
[2013-12-04 07:41:17.928235] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-14615-2013/12/03-13:58:46:20150-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:18.273444] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-15161-2013/12/03-14:02:51:809483-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:18.372988] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (79b19fbf-4fc9-45e4-bb2c-c0f7cabf3de5) is not found. anonymous fd creation failed
[2013-12-04 07:41:18.373367] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (79b19fbf-4fc9-45e4-bb2c-c0f7cabf3de5) is not found. anonymous fd creation failed
[2013-12-04 07:41:18.860315] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2447-2013/12/04-07:41:11:718343-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:20.030341] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-14617-2013/12/03-13:58:46:34506-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:20.597249] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597409] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (3b6c53f2-cefc-46ad-81f3-0fdaa4c97414) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597422] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597468] I [server-rpc-fops.c:293:server_finodelk_cbk] 0-gv0-server: 438245: FINODELK -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)
[2013-12-04 07:41:20.597546] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (62050a26-ab8c-4dd7-b1ac-c4be46a42cbb) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597664] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (86c7c599-3715-4df1-9a35-3ba8703cbf60) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597823] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (49fcdf96-5ea1-4005-b394-6c581ab93a64) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.598239] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.638964] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.638984] I [server-rpc-fops.c:293:server_finodelk_cbk] 0-gv0-server: 438250: FINODELK -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)
[2013-12-04 07:41:20.639913] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.640078] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.640095] I [server-rpc-fops.c:1401:server_fsync_cbk] 0-gv0-server: 438252: FSYNC -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)


[root@todd ~]# getfattr -m . -d -e hex /data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
getfattr: Removing leading '/' from absolute path names
# file: data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000410000000100000000
trusted.gfid=0x8e7f2d5951a44cd8a8c12d55baccf1dc


[root@rod ~]# getfattr -m . -d -e hex /data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
getfattr: Removing leading '/' from absolute path names
# file: data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000
trusted.gfid=0x90fef179bc4e4b2d9cccd13a9a7d859f
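
If I read the AFR changelog xattrs correctly (each trusted.afr value is
three 32-bit counters: pending data, metadata and entry operations), todd
is blaming rod for 0x41 pending data and 1 pending metadata operation,
while both bricks consider todd clean. More importantly, the two bricks
now have different trusted.gfid values for the same path, and rod's brick
log above keeps failing on exactly the gfid that todd has
(8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc), so this looks like a gfid
mismatch rather than an ordinary data split-brain.

Unless someone has a better idea, I plan to clear it by hand on rod, since
rod was the node that was offline and its copy should be the stale one.
This is an untested sketch, with the path and gfid taken from my output
above (the .glusterfs entry is the standard hard link named after the
gfid, under directories derived from its first two bytes):

  # on rod, the brick with the stale copy
  FILE=/data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
  GFID=90fef179-bc4e-4b2d-9ccc-d13a9a7d859f
  rm -f "$FILE"                              # remove the stale file from the brick
  rm -f /data/gv0/.glusterfs/90/fe/$GFID     # and its gfid hard link
  # then trigger a heal so todd's copy is recreated on rod
  gluster volume heal gv0 full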

Regards,
Marcus

On Tue, 2013-12-03 at 09:01 +0100, Marcus Wellhardh wrote: 
> Hi,
> 
> I did a trivial test to verify my delete/recreate theory:
> 
>   1) File exists on all nodes.
>   2) One node is powered down.
>   3) File is deleted and recreated with same filename.
>   4) Failing node is restarted.
>   5) Self heal worked on the modified file.
> 
> GlusterFS handled the above scenario perfectly. So the question is why
> does self heal fail on the vSphere-HA lock file? Does anyone have any
> troubleshooting ideas?
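> 
> For reference, the test was essentially the following, run against a
> FUSE mount of the volume (the mount point and test file name below are
> placeholders from memory, not the exact ones I used):
> 
>   mount -t glusterfs todd-storage:/gv0 /mnt/gv0
>   echo one > /mnt/gv0/heal-test          # 1) file exists on all nodes
>   # 2) power down one node (ned), then:
>   rm /mnt/gv0/heal-test
>   echo two > /mnt/gv0/heal-test          # 3) delete and recreate, same name
>   # 4) start ned again and wait for its brick to come back
>   gluster volume heal gv0 info           # 5) the entry is healed and disappears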
> 
> I am using:
> 
>   glusterfs-3.4.1-3.el6.x86_64
>   CentOS release 6.4
> 
> Regards,
> Marcus
> 
> On Fri, 2013-11-29 at 14:05 +0100, Marcus Wellhardh wrote: 
> > Hi,
> > 
> > I have a GlusterFS volume replicated on three nodes. I am planning to
> > use the volume as storage for VMware ESXi machines over NFS. The reason
> > for using three nodes is to be able to configure quorum and avoid
> > split-brains. However, during my initial testing, when I intentionally
> > and gracefully restarted the node "ned", a split-brain/self-heal error
> > occurred.
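> > 
> > For completeness, quorum was configured roughly like this (the exact
> > commands are from memory, so treat them as approximate;
> > cluster.server-quorum-ratio is a cluster-wide option, hence "all"):
> > 
> >   gluster volume set gv0 cluster.server-quorum-type server
> >   gluster volume set all cluster.server-quorum-ratio 51%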
> > 
> > The log on "todd" and "rod" gives:
> > 
> >   [2013-11-29 12:34:14.614456] E [afr-self-heal-data.c:1270:afr_sh_data_open_cbk] 0-gv0-replicate-0: open of <gfid:09b6d1d7-e583-4cee-93a4-4e972346ade3> failed on child gv0-client-2 (No such file or directory)
> > 
> > The reason is probably that the file was deleted and recreated with the
> > same file name while the node was offline, i.e. a new inode and thus a
> > new gfid.
> > 
> > Is this expected? Is it possible to configure the volume to
> > automatically handle this?
> > 
> > The same problem happens every time I test a restart. It looks like
> > VMware is constantly creating new lock files in the vSphere-HA
> > directory.
> > 
> > Below you will find various information about the glusterfs volume. I
> > have also attached the full logs for all three nodes. 
> > 
> > [root@todd ~]# gluster volume info
> >  
> > Volume Name: gv0
> > Type: Replicate
> > Volume ID: a847a533-9509-48c5-9c18-a40b48426fbc
> > Status: Started
> > Number of Bricks: 1 x 3 = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: todd-storage:/data/gv0
> > Brick2: rod-storage:/data/gv0
> > Brick3: ned-storage:/data/gv0
> > Options Reconfigured:
> > cluster.server-quorum-type: server
> > cluster.server-quorum-ratio: 51%
> > 
> > [root@todd ~]# gluster volume heal gv0 info
> > Gathering Heal info on volume gv0 has been successful
> > 
> > Brick todd-storage:/data/gv0
> > Number of entries: 2
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > 
> > Brick rod-storage:/data/gv0
> > Number of entries: 2
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > 
> > Brick ned-storage:/data/gv0
> > Number of entries: 0
> > 
> > [root@todd ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000002810000000100000000
> > trusted.gfid=0x09b6d1d7e5834cee93a44e972346ade3
> > 
> > [root@todd ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        	Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d	Inode: 1191        Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:38:36.285091183 +0100
> > Modify: 2013-11-29 13:26:24.668822831 +0100
> > Change: 2013-11-29 13:26:24.668822831 +0100
> > 
> > [root@rod ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000002810000000100000000
> > trusted.gfid=0x09b6d1d7e5834cee93a44e972346ade3
> > 
> > [root@rod ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        	Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d	Inode: 1558        Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:38:36.284671510 +0100
> > Modify: 2013-11-29 13:26:24.668985155 +0100
> > Change: 2013-11-29 13:26:24.669985185 +0100
> > 
> > [root@ned ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000000000000000000000000
> > trusted.gfid=0x76caf49a25d74ebdb711a562412bee43
> > 
> > [root@ned ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        	Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d	Inode: 4545        Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:34:45.199330329 +0100
> > Modify: 2013-11-29 11:37:03.773330311 +0100
> > Change: 2013-11-29 11:37:03.773330311 +0100
> > 
> > Regards,
> > Marcus Wellhardh
> 
> 



