[Bugs] [Bug 1532868] gluster upgrade causes vm disk errors

bugzilla at redhat.com bugzilla at redhat.com
Fri Jan 19 03:41:13 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1532868

Krutika Dhananjay <kdhananj at redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugs at josemaas.net
              Flags|                            |needinfo?(bugs at josemaas.net)



--- Comment #2 from Krutika Dhananjay <kdhananj at redhat.com> ---
(In reply to bugs from comment #0)
> Description of problem:
> We have a gluster replica 3 volume that is used for VM storage. After
> upgrading from 3.10 to 3.12, oVirt started logging "VM <vm name> has been
> paused due to unknown storage error." and pausing the affected VMs. For the
> upgrade I followed the gluster upgrade guide.
> 
> Looking into it some more, I found the following lines in our mount log. To
> me this suggests an issue with sharding. Note that copying the VM image and
> using the copy seems to fix the problem.
> 
> [2018-01-09 15:45:47.098380] E [shard.c:426:shard_modify_size_and_block_count] (-->/usr/lib64/glusterfs/3.12.3/xlator/cluster/distribute.so(+0x6abed) [0x7f7241b3dbed] -->/usr/lib64/glusterfs/3.12.3/xlator/features/shard.so(+0xbafe) [0x7f72418bbafe] -->/usr/lib64/glusterfs/3.12.3/xlator/features/shard.so(+0xb35b) [0x7f72418bb35b] ) 0-tortoise-shard: Failed to get trusted.glusterfs.shard.file-size for 6493ab88-f4a8-4696-a52e-6425a595fc80
> [2018-01-09 15:45:47.098419] W [fuse-bridge.c:874:fuse_attr_cbk] 0-glusterfs-fuse: 4369015: STAT() /aee19709-5859-4e48-b761-c4f8a140ea61/images/855cfe92-22ad-4f40-94f8-2547fa5e0f8e/e29713bf-557a-4943-a7c5-c29edc141c01 => -1 (Invalid argument)
> 
> Version-Release number of selected component (if applicable):
> centos-release-gluster310.noarch   1.0-1.el7.centos   @extras
> centos-release-gluster312.noarch   1.0-1.el7.centos   @extras
> glusterfs.x86_64                   3.12.3-1.el7       @centos-gluster312
> glusterfs-api.x86_64               3.12.3-1.el7       @centos-gluster312
> glusterfs-cli.x86_64               3.12.3-1.el7       @centos-gluster312
> glusterfs-client-xlators.x86_64    3.12.3-1.el7       @centos-gluster312
> glusterfs-fuse.x86_64              3.12.3-1.el7       @centos-gluster312
> glusterfs-gnfs.x86_64              3.12.3-1.el7       @centos-gluster312
> glusterfs-libs.x86_64              3.12.3-1.el7       @centos-gluster312
> glusterfs-server.x86_64            3.12.3-1.el7       @centos-gluster312
> libntirpc.x86_64                   1.5.3-1.el7        @centos-gluster312
> nfs-ganesha.x86_64                 2.5.3-1.el6        @centos-gluster312
> nfs-ganesha-gluster.x86_64         2.5.3-1.el6        @centos-gluster312
> userspace-rcu.x86_64               0.10.0-3.el7       @centos-gluster312
> 
> previous version:
> Dec 06 03:50:42 Updated: glusterfs-libs.x86_64 3.10.8-1.el7
> Dec 06 03:50:42 Updated: glusterfs.x86_64 3.10.8-1.el7
> Dec 06 03:50:42 Updated: glusterfs-client-xlators.x86_64 3.10.8-1.el7
> Dec 06 03:50:42 Updated: glusterfs-api.x86_64 3.10.8-1.el7
> Dec 06 03:50:42 Updated: glusterfs-fuse.x86_64 3.10.8-1.el7
> Dec 06 03:50:42 Updated: glusterfs-cli.x86_64 3.10.8-1.el7
> Dec 06 03:50:45 Updated: glusterfs-server.x86_64 3.10.8-1.el7
> Dec 06 03:50:45 Updated: glusterfs-ganesha.x86_64 3.10.8-1.el
> 
> How reproducible:
> Happened on multiple VM images. I do not have another cluster to try it on.
> 
> Steps to Reproduce:
> 1.
> 2.
> 3.
> 
> Actual results:
> 
> 
> Expected results:
> 
> 
> Additional info:
> Volume Name: tortoise
> Type: Replicate
> Volume ID: d4c00537-f1e8-4c43-b21d-90c9a6c5dee9
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: stor1.example.com:/tort/brick
> Brick2: stor2.example.com:/tort/brick
> Brick3: stor3.example.com:/tort/brick (arbiter)
> Options Reconfigured:
> user.cifs: off
> features.shard: on
> cluster.shd-wait-qlength: 10000
> cluster.shd-max-threads: 8
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> cluster.server-quorum-type: server
> cluster.eager-lock: enable
> performance.low-prio-threads: 32
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> auth.allow: *
> network.ping-timeout: 10
> cluster.quorum-type: auto
> server.allow-insecure: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> performance.strict-o-direct: enable
> network.remote-dio: enable
> transport.address-family: inet
> nfs.disable: on
> features.shard-block-size: 256MB
> cluster.enable-shared-storage: disable
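
(For reference, one way to check whether the shard size xattr the mount log
complains about is actually present on the image is to read it directly on a
brick. The brick path below is taken from the volume info above and the image
path from the mount log; run this on one of the storage nodes and adjust the
paths as needed:)

getfattr -d -m . -e hex \
  /tort/brick/aee19709-5859-4e48-b761-c4f8a140ea61/images/855cfe92-22ad-4f40-94f8-2547fa5e0f8e/e29713bf-557a-4943-a7c5-c29edc141c01

On a healthy sharded file the output should normally include both
trusted.glusterfs.shard.file-size and trusted.glusterfs.shard.block-size; if
trusted.glusterfs.shard.file-size is missing on all three bricks, that would
match the "Failed to get trusted.glusterfs.shard.file-size" error above.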


1. So this issue was seen *after* the upgrade and not *during* it, right?
2. Were you running rebalance/remove-brick when you saw this issue?
3. If this bug is consistently reproducible, could you please capture tcpdump
output on the client machine and share it along with the newly generated
client and brick logs?
Here's what you need to do to capture tcpdump output:
tcpdump -i <network interface name> -w <output-filename>.pcap
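
For example, assuming the client reaches the bricks over eth0 and that the
volume uses the default glusterd port (24007/tcp) plus brick ports in the
49152+ range (both of these are assumptions; adjust to your setup), the
capture can be kept limited to gluster traffic:

tcpdump -i eth0 -w /tmp/gluster-client.pcap port 24007 or portrange 49152-49251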


-Krutika

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are on the CC list for the bug.

