[Gluster-users] Replication delay

Pranith Kumar Karampuri pkarampu at redhat.com
Fri Jan 24 11:52:44 UTC 2014


Fabio,
     It has nothing to do with SELinux, in my opinion. You said self-heal succeeds when the VM is paused, which means writes through the self-heal daemon's fd are going through. So something must have happened to the fd that kvm uses to write to that VM image. When did you start getting this problem, and what changed at that time?

Pranith

----- Original Message -----
> From: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> Sent: Friday, January 24, 2014 5:09:25 PM
> Subject: Re: [Gluster-users] Replication delay
> 
> You're right! In the brick log from the first peer (networker AKA
> nw1glus.gem.local) I found lots of these errors:
> 
> [2014-01-24 11:32:28.482639] E [posix.c:2135:posix_writev] 0-gv_pri-posix:
> write failed: offset 4812114432, Invalid argument
> [2014-01-24 11:32:28.485334] I [server-rpc-fops.c:1439:server_writev_cbk]
> 0-gv_pri-server: 31817: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==>
> (Invalid argument)
> [2014-01-24 11:32:28.483791] E [posix.c:2135:posix_writev] 0-gv_pri-posix:
> write failed: offset 5562239488, Invalid argument
> [2014-01-24 11:32:28.485416] I [server-rpc-fops.c:1439:server_writev_cbk]
> 0-gv_pri-server: 31820: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==>
> (Invalid argument)
> [2014-01-24 11:32:28.484275] E [posix.c:2135:posix_writev] 0-gv_pri-posix:
> write failed: offset 5757467136, Invalid argument
> [2014-01-24 11:32:28.482841] E [posix.c:2135:posix_writev] 0-gv_pri-posix:
> write failed: offset 3742501376, Invalid argument
> [2014-01-24 11:32:28.485494] I [server-rpc-fops.c:1439:server_writev_cbk]
> 0-gv_pri-server: 31822: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==>
> (Invalid argument)
> [2014-01-24 11:32:28.485534] I [server-rpc-fops.c:1439:server_writev_cbk]
> 0-gv_pri-server: 31818: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==>
> (Invalid argument)
> [2014-01-24 11:32:28.530943] E [posix.c:2135:posix_writev] 0-gv_pri-posix:
> write failed: offset 3156122112, Invalid argument
> [2014-01-24 11:32:28.530997] I [server-rpc-fops.c:1439:server_writev_cbk]
> 0-gv_pri-server: 31832: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==>
> (Invalid argument)
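A quick sanity check on the failing offsets (my own sketch, not from the original thread): EINVAL from write() is a classic symptom of O_DIRECT alignment requirements, and every offset in the log above is 512-byte aligned but sits 2560 bytes past a 4096-byte boundary, so a device or filesystem demanding 4KB alignment would reject exactly these writes:

```python
# Offsets copied from the posix_writev errors in the brick log above.
offsets = [4812114432, 5562239488, 5757467136, 3742501376, 3156122112]

for off in offsets:
    # Every offset is 512-byte aligned, but none is 4096-byte aligned:
    # each line prints "<offset> 0 2560".
    print(off, off % 512, off % 4096)
```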
> 
> Then I noticed that the SELinux contexts on the two bricks are different; I
> don't know whether this could be the cause of the errors:
> 
> [root@networker gluspri]# ll -Z /glustexp/pri1/brick/
> -rw-------. qemu qemu system_u:object_r:file_t:s0      alfresco.qc2
> 
> [root@networker2 ~]# ll -Z /glustexp/pri1/brick/
> -rw-------. qemu qemu unconfined_u:object_r:file_t:s0  alfresco.qc2
> 
> 
> Fabio
> 
> ----- Original Message -----
> > From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > To: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > Sent: Friday, 24 January 2014 12:27:56
> > Subject: Re: [Gluster-users] Replication delay
> > 
> > Fabio,
> >       It looks like writes from the mount are failing on the first brick of
> >       this replica pair. Could you check both the client and brick logs to
> >       see where these failures are coming from?
> > 
> > Pranith
> > ----- Original Message -----
> > > From: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > Sent: Friday, January 24, 2014 4:50:52 PM
> > > Subject: Re: [Gluster-users] Replication delay
> > > 
> > > Ok, that's the output after the VM has been halted:
> > > 
> > > [root@networker ~]# getfattr -d -m. -e hex
> > > /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000001390000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > > 
> > > [root@networker2 ~]# getfattr -d -m. -e hex
> > > /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000001390000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
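For readers decoding these values (my own sketch; the 12-byte layout of three big-endian 32-bit counters for pending data, metadata, and entry operations matches the AFR changelog format in GlusterFS 3.x): trusted.afr.gv_pri-client-0 = 0x00000139... on both bricks means 313 data operations are still pending against the first brick, which lines up with the failed writes in its log.

```python
def decode_afr(value):
    """Decode a trusted.afr.* xattr as printed by `getfattr -e hex`.

    The 12-byte value holds three big-endian 32-bit counters: pending
    (data ops, metadata ops, entry ops) blamed on the named brick.
    """
    raw = bytes.fromhex(value.removeprefix("0x"))
    return tuple(int.from_bytes(raw[i:i + 4], "big") for i in (0, 4, 8))

# Value taken from the getfattr output above, while the VM was halted:
print(decode_afr("0x000001390000000000000000"))  # -> (313, 0, 0)
```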
> > > 
> > > 
> > > When "heal info" stops reporting alfresco.qc2 I get:
> > > 
> > > [root@networker glusterfs]# getfattr -d -m. -e hex
> > > /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000000000000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > > 
> > > [root@networker2 ~]# getfattr -d -m. -e hex
> > > /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000000000000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
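The security.selinux values are just NUL-terminated strings printed in hex; decoding them (a quick sketch of my own) confirms the two bricks carry the same SELinux type (file_t) and differ only in the SELinux user field, matching the `ll -Z` output earlier in the thread:

```python
def decode_label(value):
    # getfattr -e hex prints the raw bytes; SELinux labels are
    # NUL-terminated ASCII strings.
    raw = bytes.fromhex(value.removeprefix("0x"))
    return raw.rstrip(b"\x00").decode("ascii")

# Hex values copied from the getfattr output above:
print(decode_label("0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000"))
# -> system_u:object_r:file_t:s0
print(decode_label("0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000"))
# -> unconfined_u:object_r:file_t:s0
```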
> > > 
> > > 
> > > Fabio
> > > 
> > > ----- Original Message -----
> > > > From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > > To: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > > Sent: Friday, 24 January 2014 11:36:12
> > > > Subject: Re: [Gluster-users] Replication delay
> > > > 
> > > > This time when you stop the VM, could you get the output of "getfattr -d
> > > > -m. -e hex <file-path-on-brick>" on both bricks so we can debug further?
> > > > 
> > > > Pranith
> > > > ----- Original Message -----
> > > > > From: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > > > To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > > > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > > > Sent: Friday, January 24, 2014 3:58:38 PM
> > > > > Subject: Re: [Gluster-users] Replication delay
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > > > > To: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > > > > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > > > > Sent: Friday, 24 January 2014 11:02:15
> > > > > > Subject: Re: [Gluster-users] Replication delay
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > ----- Original Message -----
> > > > > > > From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > > > > > > To: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > > > > > Cc: "Gluster-users at gluster.org List" <gluster-users at gluster.org>
> > > > > > > Sent: Friday, January 24, 2014 3:29:19 PM
> > > > > > > Subject: Re: [Gluster-users] Replication delay
> > > > > > > 
> > > > > > > Hi Fabio,
> > > > > > >      This is a known issue that has been addressed on master and
> > > > > > >      may be backported to 3.5. When a file is undergoing changes,
> > > > > > >      it may appear in 'gluster volume heal <volname> info' output
> > > > > > >      even when it doesn't need any self-heal.
> > > > > > > 
> > > > > > > Pranith
> > > > > > 
> > > > > > Sorry, I just saw that a self-heal runs for about 15 minutes when
> > > > > > you stop the VMs. How are you checking that the self-heal is
> > > > > > happening?
> > > > > 
> > > > > When I stop the VM for alfresco.qc2, "heal info" still reports
> > > > > alfresco.qc2 as needing healing for about 15 minutes.
> > > > > This seems to be a real out-of-sync situation: if I check the two
> > > > > bricks, I see different modification times until the file is healed
> > > > > (no longer reported by "heal info"). This is the bricks' status for
> > > > > alfresco.qc2 while the VM is halted:
> > > > > 
> > > > > [root@networker ~]# ll /glustexp/pri1/brick/
> > > > > total 27769492
> > > > > -rw-------. 2 qemu qemu 8212709376 24 Jan 11:16 alfresco.qc2
> > > > > [...]
> > > > > 
> > > > > [root@networker2 ~]# ll /glustexp/pri1/brick/
> > > > > total 27769384
> > > > > -rw-------. 2 qemu qemu 8212709376 24 Jan 11:05 alfresco.qc2
> > > > > [...]
> > > > > 
> > > > > Bricks' status after "heal info" no longer reports alfresco.qc2:
> > > > > 
> > > > > [root@networker ~]# ll /glustexp/pri1/brick/
> > > > > total 27769492
> > > > > -rw-------. 2 qemu qemu 8212709376 24 Jan 11:05 alfresco.qc2
> > > > > 
> > > > > [root@networker2 ~]# ll /glustexp/pri1/brick/
> > > > > total 27769384
> > > > > -rw-------. 2 qemu qemu 8212709376 24 Jan 11:05 alfresco.qc2
> > > > > 
> > > > > Thanks for helping!
> > > > > 
> > > > > Fabio
> > > > > 
> > > > > > 
> > > > > > Pranith
> > > > > > > 
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Fabio Rosati" <fabio.rosati at geminformatica.it>
> > > > > > > > To: "Gluster-users at gluster.org List"
> > > > > > > > <gluster-users at gluster.org>
> > > > > > > > Sent: Friday, January 24, 2014 3:17:27 PM
> > > > > > > > Subject: [Gluster-users] Replication delay
> > > > > > > > 
> > > > > > > > Hi All,
> > > > > > > > 
> > > > > > > > in a distributed-replicated volume hosting some VM disk images
> > > > > > > > (GlusterFS 3.4.2 on CentOS 6.5, qemu-kvm with native GlusterFS
> > > > > > > > support, no fuse mount), I always see the same two files needing
> > > > > > > > healing:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > [root@networker ~]# gluster volume heal gv_pri info
> > > > > > > > Gathering Heal info on volume gv_pri has been successful
> > > > > > > > 
> > > > > > > > Brick nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Number of entries: 2
> > > > > > > > /alfresco.qc2
> > > > > > > > /remlog.qc2
> > > > > > > > 
> > > > > > > > Brick nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Number of entries: 2
> > > > > > > > /alfresco.qc2
> > > > > > > > /remlog.qc2
> > > > > > > > 
> > > > > > > > Brick nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Number of entries: 0
> > > > > > > > 
> > > > > > > > Brick nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Number of entries: 0
> > > > > > > > 
> > > > > > > > This is not a split-brain situation (I checked), and if I stop
> > > > > > > > the two VMs that use these images, the two files get
> > > > > > > > healed/synced in about 15 minutes. This is too much time, IMHO.
> > > > > > > > Other VMs with (smaller) disk images replicated on the same
> > > > > > > > bricks in this volume get synced "in real-time".
> > > > > > > > 
> > > > > > > > These are the volume's details; the host "networker" is
> > > > > > > > nw1glus.gem.local:
> > > > > > > > 
> > > > > > > > [root@networker ~]# gluster volume info gv_pri
> > > > > > > > 
> > > > > > > > Volume Name: gv_pri
> > > > > > > > Type: Distributed-Replicate
> > > > > > > > Volume ID: 3d91b91e-4d72-484f-8655-e5ed8d38bb28
> > > > > > > > Status: Started
> > > > > > > > Number of Bricks: 2 x 2 = 4
> > > > > > > > Transport-type: tcp
> > > > > > > > Bricks:
> > > > > > > > Brick1: nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Brick2: nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Brick3: nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Brick4: nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Options Reconfigured:
> > > > > > > > server.allow-insecure: on
> > > > > > > > storage.owner-uid: 107
> > > > > > > > storage.owner-gid: 107
> > > > > > > > 
> > > > > > > > [root@networker ~]# gluster volume status gv_pri detail
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Status of volume: gv_pri
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick : Brick nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Port : 50178
> > > > > > > > Online : Y
> > > > > > > > Pid : 25721
> > > > > > > > File System : xfs
> > > > > > > > Device : /dev/mapper/vg_guests-lv_brick1
> > > > > > > > Mount Options : rw,noatime
> > > > > > > > Inode Size : 512
> > > > > > > > Disk Space Free : 168.4GB
> > > > > > > > Total Disk Space : 194.9GB
> > > > > > > > Inode Count : 102236160
> > > > > > > > Free Inodes : 102236130
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick : Brick nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Port : 50178
> > > > > > > > Online : Y
> > > > > > > > Pid : 27832
> > > > > > > > File System : xfs
> > > > > > > > Device : /dev/mapper/vg_guests-lv_brick1
> > > > > > > > Mount Options : rw,noatime
> > > > > > > > Inode Size : 512
> > > > > > > > Disk Space Free : 168.4GB
> > > > > > > > Total Disk Space : 194.9GB
> > > > > > > > Inode Count : 102236160
> > > > > > > > Free Inodes : 102236130
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick : Brick nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Port : 50182
> > > > > > > > Online : Y
> > > > > > > > Pid : 14571
> > > > > > > > File System : xfs
> > > > > > > > Device : /dev/mapper/vg_guests-lv_brick2
> > > > > > > > Mount Options : rw,noatime
> > > > > > > > Inode Size : 512
> > > > > > > > Disk Space Free : 418.3GB
> > > > > > > > Total Disk Space : 433.8GB
> > > > > > > > Inode Count : 227540992
> > > > > > > > Free Inodes : 227540973
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick : Brick nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Port : 50181
> > > > > > > > Online : Y
> > > > > > > > Pid : 21942
> > > > > > > > File System : xfs
> > > > > > > > Device : /dev/mapper/vg_guests-lv_brick2
> > > > > > > > Mount Options : rw,noatime
> > > > > > > > Inode Size : 512
> > > > > > > > Disk Space Free : 418.3GB
> > > > > > > > Total Disk Space : 433.8GB
> > > > > > > > Inode Count : 227540992
> > > > > > > > Free Inodes : 227540973
> > > > > > > > 
> > > > > > > > fuse-mount of the gv_pri volume:
> > > > > > > > 
> > > > > > > > [root@networker ~]# ll -h /mnt/gluspri/
> > > > > > > > total 37G
> > > > > > > > -rw-------. 1 qemu qemu 7.7G 24 Jan 10:21 alfresco.qc2
> > > > > > > > -rw-------. 1 qemu qemu 4.2G 24 Jan 10:22 check_mk-salmo.qc2
> > > > > > > > -rw-------. 1 qemu qemu  27M 23 Jan 16:42 newnxserver.qc2
> > > > > > > > -rw-------. 1 qemu qemu 1.1G 23 Jan 13:38 newubutest1.qc2
> > > > > > > > -rw-------. 1 qemu qemu  11G 24 Jan 10:17 nxserver.qc2
> > > > > > > > -rw-------. 1 qemu qemu 8.1G 24 Jan 10:17 remlog.qc2
> > > > > > > > -rw-------. 1 qemu qemu 5.6G 24 Jan 10:19 ubutest1.qc2
> > > > > > > > 
> > > > > > > > Do you think this is the expected behaviour, maybe due to
> > > > > > > > caching? What happens if the most up-to-date node goes down
> > > > > > > > while the VMs are running?
> > > > > > > > 
> > > > > > > > Thanks a lot,
> > > > > > > > 
> > > > > > > > Fabio Rosati
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > Gluster-users mailing list
> > > > > > > > Gluster-users at gluster.org
> > > > > > > > http://supercolony.gluster.org/mailman/listinfo/gluster-users


