[Gluster-users] Server outage, file sync/self-heal doesn't sync ALL files?!

Martin Schenker martin.schenker at profitbricks.com
Fri Apr 29 17:35:55 UTC 2011


Sorry, I had to sync manually because of the imminent server upgrades.
50 min. after the initial sync I was asked to bring the servers into a 
safe state for an upgrade and did a manual 
"touch-on-server13-client-mountpoint", which triggered an immediate 
self-heal on the rest of the files.

All files were in sync across all four servers after this action. Will 
run this command next time!!
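
For the archive, a minimal sketch of how that manual trigger could be 
scripted. It assumes that a lookup from the client side is what starts 
self-heal on a replicated volume, so running stat on every entry under 
the client mount should have the same effect as the touch. The function 
name and the mount path in the usage comment are made up for 
illustration; this is not the exact command I ran.

```shell
# Walk a GlusterFS client mount and stat every entry so that each file
# gets looked up once from the client side.
# Assumption: "$1" is the client (FUSE) mount point, not a brick directory.
trigger_self_heal() {
    mountpoint=$1
    if [ ! -d "$mountpoint" ]; then
        echo "not a directory: $mountpoint" >&2
        return 1
    fi
    # -print0 / -0 keep file names with spaces or newlines intact.
    find "$mountpoint" -print0 | xargs -0 stat > /dev/null
}

# Usage (hypothetical mount path):
#   trigger_self_heal /mnt/gluster-client
```

On a large volume this touches every file once, so it is probably best 
run outside peak hours.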

Best, Martin

On 29.04.2011 19:30, Pranith Kumar. Karampuri wrote:
> hi Martin,
>        Could you please send the getfattr output with -m "trusted*" instead of "trusted.afr" for the remaining 24 files from both servers? I would like to see the gfids of these files on both machines.
>
> Pranith.
> ----- Original Message -----
> From: "Martin Schenker"<martin.schenker at profitbricks.com>
> To: gluster-users at gluster.org
> Sent: Friday, April 29, 2011 8:39:46 PM
> Subject: [Gluster-users] Server outage, file sync/self-heal doesn't sync ALL files?!
>
> Hi all!
>
> We have another incident over here.
>
> One of the servers (pserver12) in a pair (12 & 13) has been rebooted.
> After the 2 h outage, pserver13 showed 63 files not in sync.
>
> Both servers are clients as well.
>
> Starting pserver12 brought up the self-heal mechanism, but self-heal was
> only triggered for 39 files within the first 10 min. Now the system seems
> dormant and 24 files are left hanging.
>
> On the other three servers no inconsistencies are seen.
>
> tail of client log file:
>
> [2011-04-29 14:48:23.820022] I
> [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done]
> 0-storage0-replicate-2: diff self-heal on /pserver13-17: 1960 blocks of
> 22736 were different (8.62%)
> [2011-04-29 14:48:23.887651] E [afr-common.c:110:afr_set_split_brain]
> 0-storage0-replicate-2: invalid argument: inode
> [2011-04-29 14:48:23.887740] I
> [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk]
> 0-storage0-replicate-2: background  data self-heal completed on
> /pserver13-17
> [2011-04-29 14:48:24.272220] I
> [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done]
> 0-storage0-replicate-2: diff self-heal on /pserver13-19: 1960 blocks of
> 22744 were different (8.62%)
> [2011-04-29 14:48:24.341868] E [afr-common.c:110:afr_set_split_brain]
> 0-storage0-replicate-2: invalid argument: inode
> [2011-04-29 14:48:24.341959] I
> [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk]
> 0-storage0-replicate-2: background  data self-heal completed on
> /pserver13-19
> [2011-04-29 14:48:24.758131] I
> [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done]
> 0-storage0-replicate-2: diff self-heal on /pserver13-23: 1952 blocks of
> 22752 were different (8.58%)
> [2011-04-29 14:48:24.766054] E [afr-common.c:110:afr_set_split_brain]
> 0-storage0-replicate-2: invalid argument: inode
> [2011-04-29 14:48:24.766137] I
> [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk]
> 0-storage0-replicate-2: background  data self-heal completed on
> /pserver13-23
> [2011-04-29 14:48:24.884613] I
> [afr-self-heal-algorithm.c:526:sh_diff_loop_driver_done]
> 0-storage0-replicate-2: diff self-heal on /pserver13-10: 1952 blocks of
> 22760 were different (8.58%)
> [2011-04-29 14:48:24.895631] E [afr-common.c:110:afr_set_split_brain]
> 0-storage0-replicate-2: invalid argument: inode
> [2011-04-29 14:48:24.895721] I
> [afr-self-heal-common.c:1527:afr_self_heal_completion_cbk]
> 0-storage0-replicate-2: background  data self-heal completed on
> /pserver13-10
> 0 root@pserver13:/var/log/glusterfs # date
> Fri Apr 29 15:08:18 UTC 2011
>
>
> Search for mismatch:
>
> 0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr."
> /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | grep
> -B1 -A1 trusted | grep -c file
> getfattr: Removing leading '/' from absolute path names
> *24*
>
>
> 0 root@pserver13:~ # getfattr -R -d -e hex -m "trusted.afr."
> /mnt/gluster/brick?/storage | grep -v 0x000000000000000000000000 | grep
> -B1  trusted
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-33
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-26
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
> trusted.afr.storage0-client-4=0x000000000000001600000001
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-24
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-8
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-21
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-22
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-30
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-20
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-9
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick0/storage/de-dc1-c1-pserver5-38
> trusted.afr.storage0-client-4=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-18
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-2
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-23
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-4
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-3
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-34
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-37
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-12
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-27
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/images/1831/9a039a81-60fe-5fa3-f562-8f6d3828382b/hdd-images/13169
> trusted.afr.storage0-client-6=0x100000020000000000000000
> --
> # file: mnt/gluster/brick1/storage/images/1959/cd55c5f3-9aa1-bfd9-99a0-01c13a7d8559/hdd-images
> trusted.afr.storage0-client-6=0x000000000000001600000002
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-25
> trusted.afr.storage0-client-6=0x270000010000000000000000
> --
> # file: mnt/gluster/brick1/storage/de-dc1-c1-pserver5-7
> trusted.afr.storage0-client-6=0x270000010000000000000000
>
>
>
> I could trigger it manually, but why isn't the sync/self-heal working on
> all files shown as inconsistent? Or am I assuming something wrong here?!?
>
> Best, Martin
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>    
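
A note on reading the trusted.afr.* values in the getfattr output quoted 
above. As far as I understand the replicate changelog, each value is 12 
bytes holding three big-endian 32-bit counters of pending operations, in 
the order data, metadata, entry. The helper below is only a sketch under 
that assumption; the name decode_afr is made up and is not part of any 
Gluster tool.

```shell
# Decode one trusted.afr.* value as printed by `getfattr -e hex`.
# Assumption: 24 hex digits = three 32-bit big-endian counters,
# in the order data, metadata, entry.
decode_afr() {
    hex=${1#0x}
    case $hex in
        ''|*[!0-9a-fA-F]*)
            echo "not a hex value: $1" >&2
            return 1
            ;;
    esac
    if [ "${#hex}" -ne 24 ]; then
        echo "expected 24 hex digits, got ${#hex}" >&2
        return 1
    fi
    # Split the string into three 8-digit fields.
    data=$(printf '%s' "$hex" | cut -c1-8)
    meta=$(printf '%s' "$hex" | cut -c9-16)
    entry=$(printf '%s' "$hex" | cut -c17-24)
    # printf accepts C-style hex constants for %d.
    printf 'data=%d metadata=%d entry=%d\n' "0x$data" "0x$meta" "0x$entry"
}

# Example with a value from the listing above:
#   decode_afr 0x000000000000001600000001
#   prints: data=0 metadata=22 entry=1
```

Read that way, the hdd-images directories have pending metadata and 
entry operations, while the de-dc1-c1-pserver5-* files only have a 
non-zero data counter.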



