[Gluster-users] Fwd: vm paused unknown storage error one node out of 3 only

David Gossage dgossage at carouselchecks.com
Sat Aug 13 11:37:28 UTC 2016


Here is the reply again, just in case.  I got a quarantine message, so I'm not
sure if the first went through or will anytime soon.  The brick logs weren't
large, so I'll just include them as text files this time.

The attached file bricks.zip you sent to <kdhananj at redhat.com>;
<Gluster-users at gluster.org> on 8/13/2016 7:17:35 AM was quarantined. As a
safety precaution, the University of South Carolina quarantines .zip and .docm
files sent via email. If this is a legitimate attachment
<kdhananj at redhat.com>;<Gluster-users at gluster.org> may contact the Service
Desk at 803-777-1800 (servicedesk at sc.edu) and the attachment file will be
released from quarantine and delivered.


On Sat, Aug 13, 2016 at 6:15 AM, David Gossage <dgossage at carouselchecks.com>
wrote:

> On Sat, Aug 13, 2016 at 12:26 AM, Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
>> 1. Could you share the output of `gluster volume heal <VOL> info`?
>>
> Results were the same moments after the issue occurred as well:
> Brick ccgl1.gl.local:/gluster1/BRICK1/1
> Status: Connected
> Number of entries: 0
>
> Brick ccgl2.gl.local:/gluster1/BRICK1/1
> Status: Connected
> Number of entries: 0
>
> Brick ccgl4.gl.local:/gluster1/BRICK1/1
> Status: Connected
> Number of entries: 0
>
>
>
>> 2. `gluster volume info`
>>
> Volume Name: GLUSTER1
> Type: Replicate
> Volume ID: 167b8e57-28c3-447a-95cc-8410cbdf3f7f
> Status: Started
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> Options Reconfigured:
> cluster.locking-scheme: granular
> nfs.enable-ino32: off
> nfs.addr-namelookup: off
> nfs.disable: on
> performance.strict-write-ordering: off
> cluster.background-self-heal-count: 16
> cluster.self-heal-window-size: 1024
> server.allow-insecure: on
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: on
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> storage.owner-gid: 36
> storage.owner-uid: 36
> performance.readdir-ahead: on
> features.shard: on
> features.shard-block-size: 64MB
> diagnostics.brick-log-level: WARNING
>
>
>
>> 3. fuse mount logs of the affected volume(s)?
>>
>  [2016-08-12 21:34:19.518511] W [MSGID: 114031] [client-rpc-fops.c:3050:client3_3_readv_cbk]
> 0-GLUSTER1-client-1: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.519115] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-0: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.519203] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-1: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.519226] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-2: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.520737] W [MSGID: 108008]
> [afr-read-txn.c:244:afr_read_txn] 0-GLUSTER1-replicate-0: Unreadable
> subvolume -1 found with event generation 3 for gfid e18650c4-02c0-4a5a-bd4c-bbdf5fbd9c88.
> (Possible split-brain)
> [2016-08-12 21:34:19.521393] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-2: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.522269] E [MSGID: 109040] [dht-helper.c:1190:dht_migration_complete_check_task]
> 0-GLUSTER1-dht: (null): failed to lookup the file on GLUSTER1-dht [Stale
> file handle]
> [2016-08-12 21:34:19.522341] W [fuse-bridge.c:2227:fuse_readv_cbk]
> 0-glusterfs-fuse: 18479997: READ => -1 gfid=31d7c904-775e-4b9f-8ef7-888218679845
> fd=0x7f00a80bde58 (Stale file handle)
> [2016-08-12 21:34:19.521296] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-1: remote operation failed [No such file or directory]
> [2016-08-12 21:34:19.521357] W [MSGID: 114031] [client-rpc-fops.c:1572:client3_3_fstat_cbk]
> 0-GLUSTER1-client-0: remote operation failed [No such file or directory]
> [2016-08-12 22:15:08.337528] I [MSGID: 109066]
> [dht-rename.c:1568:dht_rename] 0-GLUSTER1-dht: renaming
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/ec4f5b10-
> 02b1-435c-a7e1-97e399532597/0e6ed1c3-ffe0-43b0-9863-439ccc3193c9.meta.new
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0) =>
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/ec4f5b10-
> 02b1-435c-a7e1-97e399532597/0e6ed1c3-ffe0-43b0-9863-439ccc3193c9.meta
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0)
> [2016-08-12 22:15:12.240026] I [MSGID: 109066]
> [dht-rename.c:1568:dht_rename] 0-GLUSTER1-dht: renaming
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/78636a1b-
> 86dd-4aaf-8b4f-4ab9c3509e88/4707d651-06c6-446b-b9c8-408004a55ada.meta.new
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0) =>
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/78636a1b-
> 86dd-4aaf-8b4f-4ab9c3509e88/4707d651-06c6-446b-b9c8-408004a55ada.meta
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0)
> [2016-08-12 22:15:11.105593] I [MSGID: 109066]
> [dht-rename.c:1568:dht_rename] 0-GLUSTER1-dht: renaming
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/ec4f5b10-
> 02b1-435c-a7e1-97e399532597/0e6ed1c3-ffe0-43b0-9863-439ccc3193c9.meta.new
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0) =>
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/ec4f5b10-
> 02b1-435c-a7e1-97e399532597/0e6ed1c3-ffe0-43b0-9863-439ccc3193c9.meta
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0)
> [2016-08-12 22:15:14.772713] I [MSGID: 109066]
> [dht-rename.c:1568:dht_rename] 0-GLUSTER1-dht: renaming
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/78636a1b-
> 86dd-4aaf-8b4f-4ab9c3509e88/4707d651-06c6-446b-b9c8-408004a55ada.meta.new
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0) =>
> /7c73a8dd-a72e-4556-ac88-7f6813131e64/images/78636a1b-
> 86dd-4aaf-8b4f-4ab9c3509e88/4707d651-06c6-446b-b9c8-408004a55ada.meta
> (hash=GLUSTER1-replicate-0/cache=GLUSTER1-replicate-0)
>
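To pull out which gfids are hitting the stale file handle on READ, something like the following could be run over the mount log. This is only a sketch: the regular expression is based on the fuse-bridge lines quoted above, and the function name `stale_read_gfids` and the quote-stripping step are my own assumptions, not anything that ships with gluster.

```python
import re
from collections import Counter

# Pattern modelled on the fuse-bridge warnings quoted above (an assumption
# about the format; adjust if your gluster version logs differently).
STALE_READ = re.compile(
    r"READ => -1 gfid=(?P<gfid>[0-9a-f-]{36})\s+fd=\S+\s+\(Stale file handle\)"
)


def stale_read_gfids(log_text: str) -> Counter:
    """Count failed READs per gfid.

    Leading mail-quote markers are stripped and whitespace is collapsed,
    because mail clients wrap long log lines across several lines.
    """
    unquoted = " ".join(line.lstrip("> ") for line in log_text.splitlines())
    flat = " ".join(unquoted.split())
    return Counter(m.group("gfid") for m in STALE_READ.finditer(flat))


sample = """
> [2016-08-12 21:34:19.522341] W [fuse-bridge.c:2227:fuse_readv_cbk]
> 0-glusterfs-fuse: 18479997: READ => -1 gfid=31d7c904-775e-4b9f-8ef7-888218679845
> fd=0x7f00a80bde58 (Stale file handle)
"""

if __name__ == "__main__":
    for gfid, count in stale_read_gfids(sample).most_common():
        print(gfid, count)
```

Feeding it the whole mount log instead of `sample` would give a quick per-gfid tally to compare against the image gfids of the affected VMs.
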
>> 4. glustershd logs
>>
> Nothing recent; same on all 3 storage nodes:
> [2016-08-07 08:48:03.593401] I [glusterfsd-mgmt.c:1600:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
> [2016-08-11 08:14:03.683287] I [MSGID: 100011] [glusterfsd.c:1323:reincarnate]
> 0-glusterfsd: Fetching the volume file from server...
> [2016-08-11 08:14:03.684492] I [glusterfsd-mgmt.c:1600:mgmt_getspec_cbk]
> 0-glusterfs: No change in volfile, continuing
>
>
>
>> 5. Brick logs
>>
>  There have been some errors in the brick logs that I hadn't noticed occurring.
> I've zip'd and attached all 3 nodes' logs, but from this snippet on one node
> none of them seem to coincide with the time window when the migration had
> issues.  The f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42 shard also refers to an
> image for a different VM than the one I had issues with.  Maybe gluster is
> trying to do some sort of make-shard test before writing out changes that
> would go to that image and that shard file?
>
> [2016-08-12 18:48:22.463628] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.697 failed [File exists]
> [2016-08-12 18:48:24.553455] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.698 failed [File exists]
> [2016-08-12 18:49:16.065502] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.738 failed [File exists]
> The message "E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.697 failed [File exists]" repeated 5
> times between [2016-08-12 18:48:22.463628] and [2016-08-12 18:48:22.514777]
> [2016-08-12 18:48:24.581216] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.698 failed [File exists]
> The message "E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.738 failed [File exists]" repeated 5
> times between [2016-08-12 18:49:16.065502] and [2016-08-12 18:49:16.107746]
> [2016-08-12 19:23:40.964678] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 83794e5d-2225-4560-8df6-7c903c8a648a.1301 failed [File exists]
> [2016-08-12 20:00:33.498751] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 0e5ad95d-722d-4374-88fb-66fca0b14341.580 failed [File exists]
> [2016-08-12 20:00:33.530938] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 0e5ad95d-722d-4374-88fb-66fca0b14341.580 failed [File exists]
> [2016-08-13 01:47:23.338036] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 18843fb4-e31c-4fc3-b519-cc6e5e947813.211 failed [File exists]
> The message "E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 18843fb4-e31c-4fc3-b519-cc6e5e947813.211 failed [File exists]" repeated
> 16 times between [2016-08-13 01:47:23.338036] and [2016-08-13
> 01:47:23.380980]
> [2016-08-13 01:48:02.224494] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> ffbbcce0-3c4a-4fdf-b79f-a96ca3215657.211 failed [File exists]
> [2016-08-13 01:48:42.266148] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 18843fb4-e31c-4fc3-b519-cc6e5e947813.177 failed [File exists]
> [2016-08-13 01:49:09.717434] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> 18843fb4-e31c-4fc3-b519-cc6e5e947813.178 failed [File exists]
>
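To see at a glance which images and shard numbers the mknod errors touch, the brick log can be grouped by base gfid. The sketch below assumes the log format quoted above. It also assumes that shard N of a file starts at byte offset N times the shard block size (with block 0 living in the base file), which is my understanding of the shard translator rather than something confirmed in this thread, and that the configured 64MB means 64 MiB. The function names are hypothetical.

```python
import re
from collections import defaultdict

# features.shard-block-size: 64MB from `gluster volume info`;
# assumed here to mean 64 MiB.
SHARD_BLOCK_SIZE = 64 * 1024 * 1024

# Pattern modelled on the posix_mknod errors quoted above.
MKNOD_EEXIST = re.compile(
    r"mknod on \S*/\.shard/\s*(?P<gfid>[0-9a-f-]{36})\.(?P<idx>\d+)"
    r" failed \[File exists\]"
)


def shards_with_eexist(log_text: str) -> dict:
    """Map each base-file gfid to the sorted shard indexes that hit EEXIST."""
    unquoted = " ".join(line.lstrip("> ") for line in log_text.splitlines())
    flat = " ".join(unquoted.split())  # undo mail line wrapping
    hits = defaultdict(set)
    for m in MKNOD_EEXIST.finditer(flat):
        hits[m.group("gfid")].add(int(m.group("idx")))
    return {gfid: sorted(idxs) for gfid, idxs in hits.items()}


def shard_start_offset(index: int) -> int:
    """Byte offset in the image where shard `index` begins (assumption above)."""
    return index * SHARD_BLOCK_SIZE


sample = """
> [2016-08-12 18:48:22.463628] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.697 failed [File exists]
> [2016-08-12 18:48:24.553455] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.698 failed [File exists]
> [2016-08-12 18:49:16.065502] E [MSGID: 113022] [posix.c:1245:posix_mknod]
> 0-GLUSTER1-posix: mknod on /gluster1/BRICK1/1/.shard/
> f9a7f3c5-4c13-4020-b560-1f4f7b1e3c42.738 failed [File exists]
"""

if __name__ == "__main__":
    for gfid, indexes in shards_with_eexist(sample).items():
        for idx in indexes:
            print(gfid, idx, shard_start_offset(idx))
```

The "repeated N times" summary lines match the same pattern, so collecting indexes into a set keeps each shard listed once.
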
>
>> -Krutika
>>
>>
>> On Sat, Aug 13, 2016 at 3:10 AM, David Gossage <
>> dgossage at carouselchecks.com> wrote:
>>
>>> On Fri, Aug 12, 2016 at 4:25 PM, Dan Lavu <dan at redhat.com> wrote:
>>>
>>>> David,
>>>>
>>>> I'm seeing similar behavior in my lab, but it has been caused by
>>>> healing files in the gluster cluster, though I attribute my problems to
>>>> problems with the storage fabric. See if 'gluster volume heal $VOL info'
>>>> indicates files that are being healed, and if those reduce in number, can
>>>> the VM start?
>>>>
>>>>
>>> I haven't had any files in a state of being healed according to any
>>> of the 3 storage nodes.
>>>
>>> I shut down one VM that has been around a while a moment ago, then told it
>>> to start on the one oVirt server that complained previously.  It ran fine,
>>> and I was able to migrate it off and on the host with no issues.
>>>
>>> I told one of the new VMs to migrate to that one node, and within seconds
>>> it paused from unknown storage errors: no shards showing heals, nothing with
>>> an error on the storage nodes.  Same stale file handle issues.
>>>
>>> I'll probably put this node in maintenance later and reboot it.  Other
>>> than that I may re-clone those 2 recent VMs.  Maybe the images just got
>>> corrupted, though why it would only fail on one node of 3 if the image was
>>> bad I'm not sure.
>>>
>>>
>>>> Dan
>>>>
>>>> On Thu, Aug 11, 2016 at 7:52 AM, David Gossage <
>>>> dgossage at carouselchecks.com> wrote:
>>>>
>>>>> Figured I would repost here as well.  One client out of 3 is complaining
>>>>> of stale file handles on a few new VMs I migrated over.  No errors on the
>>>>> storage nodes, just the client.  Maybe just put that one in maintenance
>>>>> and restart the gluster mount?
>>>>>
>>>>> *David Gossage*
>>>>> *Carousel Checks Inc. | System Administrator*
>>>>> *Office* 708.613.2284
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: David Gossage <dgossage at carouselchecks.com>
>>>>> Date: Thu, Aug 11, 2016 at 12:17 AM
>>>>> Subject: vm paused unknown storage error one node out of 3 only
>>>>> To: users <users at ovirt.org>
>>>>>
>>>>>
>>>>> On a 3-node cluster running oVirt 3.6.6.2-1.el7.centos with a replica 3
>>>>> gluster 3.7.14 volume, starting a VM I just copied in gets the following
>>>>> errors on one node of the 3.  On the other 2 the VM starts fine.  All oVirt
>>>>> and gluster nodes are CentOS 7 based.  When the VM starts on the one node
>>>>> it defaults to of its own accord, it is immediately put into paused for an
>>>>> unknown reason.  Telling it to start on a different node works OK.  The
>>>>> node with the issue already has 5 VMs running fine on it from the same
>>>>> gluster storage, plus the hosted engine on a different volume.
>>>>>
>>>>> The gluster nodes' logs did not have any errors for the volume.
>>>>> The node's own gluster logs had this in the log:
>>>>>
>>>>> I couldn't find dfb8777a-7e8c-40ff-8faa-252beabba5f8 in .glusterfs,
>>>>> .shard or images/.
>>>>>
>>>>> 7919f4a0-125c-4b11-b5c9-fb50cc195c43 is the gfid of the bootable
>>>>> drive of the vm
>>>>>
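For anyone wanting to repeat the .glusterfs check, a small helper can build the path where a brick normally keeps its handle for a gfid. This is a sketch: the two-level directory layout (first two hex characters, then the next two) is the standard brick backend layout as I understand it, the brick path comes from the volume info in this thread, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def gfid_backend_path(brick: str, gfid: str) -> PurePosixPath:
    """Return <brick>/.glusterfs/<aa>/<bb>/<gfid> for a gfid string."""
    gfid = gfid.lower()
    return PurePosixPath(brick) / ".glusterfs" / gfid[:2] / gfid[2:4] / gfid


if __name__ == "__main__":
    # Brick path and gfid as reported in this thread.
    print(gfid_backend_path("/gluster1/BRICK1/1",
                            "7919f4a0-125c-4b11-b5c9-fb50cc195c43"))
```

Running `ls -l` on the resulting path on each brick shows whether that brick has an entry for the gfid at all.
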
>>>>> [2016-08-11 04:31:39.982952] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:3050:client3_3_readv_cbk] 0-GLUSTER1-client-2:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.983683] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-2:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.984182] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-0:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.984221] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-1:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.985941] W [MSGID: 108008]
>>>>> [afr-read-txn.c:244:afr_read_txn] 0-GLUSTER1-replicate-0: Unreadable
>>>>> subvolume -1 found with event generation 3 for gfid
>>>>> dfb8777a-7e8c-40ff-8faa-252beabba5f8. (Possible split-brain)
>>>>> [2016-08-11 04:31:39.986633] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-2:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.987644] E [MSGID: 109040]
>>>>> [dht-helper.c:1190:dht_migration_complete_check_task] 0-GLUSTER1-dht:
>>>>> (null): failed to lookup the file on GLUSTER1-dht [Stale file handle]
>>>>> [2016-08-11 04:31:39.987751] W [fuse-bridge.c:2227:fuse_readv_cbk]
>>>>> 0-glusterfs-fuse: 15152930: READ => -1 gfid=7919f4a0-125c-4b11-b5c9-fb50cc195c43
>>>>> fd=0x7f00a80bdb64 (Stale file handle)
>>>>> [2016-08-11 04:31:39.986567] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-0:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:31:39.986567] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-1:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.210145] W [MSGID: 108008]
>>>>> [afr-read-txn.c:244:afr_read_txn] 0-GLUSTER1-replicate-0: Unreadable
>>>>> subvolume -1 found with event generation 3 for gfid
>>>>> dfb8777a-7e8c-40ff-8faa-252beabba5f8. (Possible split-brain)
>>>>> [2016-08-11 04:35:21.210873] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-1:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.210888] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-2:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.210947] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-0:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.213270] E [MSGID: 109040]
>>>>> [dht-helper.c:1190:dht_migration_complete_check_task] 0-GLUSTER1-dht:
>>>>> (null): failed to lookup the file on GLUSTER1-dht [Stale file handle]
>>>>> [2016-08-11 04:35:21.213345] W [fuse-bridge.c:2227:fuse_readv_cbk]
>>>>> 0-glusterfs-fuse: 15156910: READ => -1 gfid=7919f4a0-125c-4b11-b5c9-fb50cc195c43
>>>>> fd=0x7f00a80bf6d0 (Stale file handle)
>>>>> [2016-08-11 04:35:21.211516] W [MSGID: 108008]
>>>>> [afr-read-txn.c:244:afr_read_txn] 0-GLUSTER1-replicate-0: Unreadable
>>>>> subvolume -1 found with event generation 3 for gfid
>>>>> dfb8777a-7e8c-40ff-8faa-252beabba5f8. (Possible split-brain)
>>>>> [2016-08-11 04:35:21.212013] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-0:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.212081] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-1:
>>>>> remote operation failed [No such file or directory]
>>>>> [2016-08-11 04:35:21.212121] W [MSGID: 114031]
>>>>> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 0-GLUSTER1-client-2:
>>>>> remote operation failed [No such file or directory]
>>>>>
>>>>> I attached vdsm.log starting from when I spun up vm on offending node
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick1.1.log
Type: application/octet-stream
Size: 109504 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160813/3b557238/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick1.2.log
Type: application/octet-stream
Size: 115036 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160813/3b557238/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: brick1.4.log
Type: application/octet-stream
Size: 123166 bytes
Desc: not available
URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20160813/3b557238/attachment-0005.obj>

