[Bugs] [Bug 1641440] New: [RHHi] Mount hung and not accessible

bugzilla at redhat.com bugzilla at redhat.com
Mon Oct 22 05:56:38 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1641440

            Bug ID: 1641440
           Summary: [RHHi] Mount hung and not accessible
           Product: GlusterFS
           Version: 5
         Component: sharding
          Keywords: Triaged
          Severity: high
          Priority: high
          Assignee: bugs at gluster.org
          Reporter: kdhananj at redhat.com
        QA Contact: bugs at gluster.org
                CC: bugs at gluster.org, jbyers at stonefly.com,
                    knarra at redhat.com, rhs-bugs at redhat.com,
                    sabose at redhat.com, sankarshan at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1603118, 1605056



+++ This bug was initially created as a clone of Bug #1605056 +++

+++ This bug was initially created as a clone of Bug #1603118 +++

Description of problem:

One of the hosts in the ovirt-gluster hyperconverged cluster is in a
non-operational state. Looking through the vdsm logs, the following error is
seen:

2018-07-18 18:02:08,353+0530 WARN  (itmap/1) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore (fileSD:845)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 834, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 107, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 560, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 11] Resource temporarily unavailable

There are no errors in the vmstore mount logs; however, the mount is hung and
/rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore
cannot be accessed.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-54.12

How reproducible:
This was seen on a running environment.

Steps to Reproduce:
NA

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-07-19
04:33:28 EDT ---

This bug is automatically being proposed for the release of Red Hat Gluster
Storage 3 under active development and open for bug fixes, by setting the
release flag 'rhgs-3.4.0' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Krutika Dhananjay on 2018-07-19 12:07:20 EDT ---

Found the root cause (RCA).

For this bug to be hit, the LRU list in shard needs to become full. Its size is
16K. That means at least 16K shards should have been accessed from a
glusterfs/RHHI mount.

(How do you know when you've hit that mark and can stop generating more data
while testing? Take a statedump of the mount. Under the section
"[features/shard.$VOLNAME-shard]", you should see the following line:
"inode-count=16384".)
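As a sketch of that check (assuming a FUSE client with the default statedump
directory /var/run/gluster; the pgrep pattern is an assumption you would adapt
to your mount):

```shell
# Find the glusterfs client process serving the vmstore mount
# (adjust the pgrep pattern to match your mount's command line).
PID=$(pgrep -f "glusterfs.*vmstore" | head -n1)

# SIGUSR1 asks a gluster process to write a statedump,
# by default under /var/run/gluster.
kill -USR1 "$PID"
sleep 2

# Look for the shard inode count in the newest dump file for that pid.
DUMP=$(ls -t /var/run/gluster/glusterdump."$PID".dump.* 2>/dev/null | head -n1)
grep -A2 'features/shard' "$DUMP" | grep 'inode-count'
```

When the lru list is full, the last command should print inode-count=16384.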

Next, a VM that was created/used for a while on one host should be migrated to
another host.

Next, delete the VM from the new host.

Access this VM from the old host. You should get "No such file or directory".
(I don't know if RHV accessed this file from the first host, triggering
destruction ("forget") of the base inode of this VM. But it is also quite
likely in this case that fuse forced the destruction of the inode upon seeing
high memory pressure, and at some point this now-invalid pointer is accessed,
leading to strange behavior - like a crash or a hang.)

Now perform some more I/O on one of the other VMs that the first host might be
managing. At some point, your client will either crash or hang.

======================================================
Here is a simpler way to hit the bug on your non-RHHI (even single node) setup:

1. Create a replica 3 volume and start it.
2. Enable shard on it. And set shard-block-size to 4MB.
3. Create 2 fuse mounts - $M1 and $M2.
4. From $M1, create a 64GB file (use dd maybe).
(Why 64GB? To hit the lru limit, you need 16K shards. That's 16K * 4MB = 64GB.
This is where setting shard-block-size to 4MB helps for the purpose of this
test. With the default 64MB size, more time and space would be needed to
recreate the issue, since a 1TB image would have to be created to hit the bug.)

5. Read that file entirely from $M2. (use cat maybe).
6. Delete the file from $M1.
7. Stat the file from $M2. (Should fail with "No such file or directory").
8. Now start dd on a second file from $M2.

The mount process associated with $M2 should crash soon after.

-Krutika
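The eight steps above can be sketched as a script. The volume name, brick
paths, and mount points here are placeholders, and the 64GB write sizes the
file at exactly 16384 shards of 4MB each:

```shell
#!/bin/sh
# Sketch of the repro; volume name, brick paths and mount points are placeholders.
HOST=$(hostname)
M1=/mnt/shard-m1; M2=/mnt/shard-m2

# Steps 1-2: create and start a replica 3 volume, enable 4MB sharding.
gluster volume create testvol replica 3 \
    "$HOST":/bricks/b1 "$HOST":/bricks/b2 "$HOST":/bricks/b3 force
gluster volume start testvol
gluster volume set testvol features.shard on
gluster volume set testvol features.shard-block-size 4MB

# Step 3: two separate fuse mounts of the same volume.
mkdir -p "$M1" "$M2"
mount -t glusterfs "$HOST":/testvol "$M1"
mount -t glusterfs "$HOST":/testvol "$M2"

# Step 4: a 64GB file = 16384 shards of 4MB, enough to fill the lru list.
dd if=/dev/zero of="$M1"/file1 bs=1M count=65536

# Step 5: read it entirely from the second mount to fill $M2's lru list too.
cat "$M2"/file1 > /dev/null

# Steps 6-7: delete from $M1; stat from $M2 should fail with ENOENT.
rm -f "$M1"/file1
stat "$M2"/file1   # expected: No such file or directory

# Step 8: fresh I/O from $M2 - with the bug present, its mount process
# soon crashes on the stale base-inode pointer.
dd if=/dev/zero of="$M2"/file2 bs=1M count=1024
```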

--- Additional comment from Krutika Dhananjay on 2018-07-19 12:09:50 EDT ---

(In reply to Krutika Dhananjay from comment #2)
> Found the RCA.
> 
> For this bug to be hit, the LRU list in shard needs to become full. Its size
> is 16K. That means at least 16K shards should have been accessed from a
> glusterfs/RHHI mount.
> 
> (How do you know when you've hit that mark and can stop generating more data
> when you're testing? Take the statedump of the mount. Under section
> "[features/shard.$VOLNAME-shard]" you should see the following line
> "inode-count=16384")
> 
> Next, a vm should have been migrated from the host where it was created/used
> for a while to another host.
> 
> Next, delete the vm from the new host.
> 
> Access this vm from the old host. You should get "No such file or directory".
> (I don't know if RHV accessed this file from the first host triggering
> destruction ("forget") of the base inode of this vm. But it is also quite
> likely in this case that fuse forced the destruction of the inode upon
> seeing high memory pressure, and at some point this now invalid pointer is
> accessed leading to strange behavior - like a crash or a hang).
> 
> Now perform some more io on one of the other vms that the first host might
> be managing. At some point, your client will either crash or hang.
> 
> ======================================================
> Here is a simpler way to hit the bug on your non-RHHI (even single node)
> setup:
> 
> 1. Create a replica 3 volume and start it.
> 2. Enable shard on it. And set shard-block-size to 4MB.
> 3. Create 2 fuse mounts - $M1 and $M2.
> 4. From $M1, create a 65GB size file (use dd maybe).

Sorry, typo. This should be 64GB (although the bug is reproducible with a 65GB
file too!)

-Krutika

> (Why 64GB? To hit the lru limit, you need 16K shards. That's 16K * 4MB =
> 64GB. This is where setting shard-block size to 4MB helps for the purpose of
> this test. With default 64MB size, more time and space will be needed to
> recreate the issue since now a 1TB image will need to be created to hit the
> bug).
> 
> 5. Read that file entirely from $M2. (use cat maybe).
> 6. Delete the file from $M1.
> 7. Stat the file from $M2. (Should fail with "No such file or directory").
> 8. Now start dd on a second file from $M2.
> 
> Mount process associated with $M2 must crash soon.
> 
> -Krutika

--- Additional comment from Worker Ant on 2018-07-23 03:33:01 EDT ---

COMMIT: https://review.gluster.org/20544 committed in master by "Krutika
Dhananjay" <kdhananj at redhat.com> with a commit message- features/shard: Make
lru limit of inode list configurable

Currently this lru limit is hard-coded to 16384. This patch makes it
configurable to make it easier to hit the lru limit and enable testing
of different cases that arise when the limit is reached.

The option is features.shard-lru-limit. It is by design allowed to
be configured only in init() but not in reconfigure(). This is to avoid
all the complexity associated with eviction of least recently used shards
when the list is shrunk.

Change-Id: Ifdcc2099f634314fafe8444e2d676e192e89e295
updates: bz#1605056
Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>
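With this option in place, a tester could lower the limit to hit the lru bound
quickly instead of writing 64GB of data (the volume name here is a
placeholder; since the option is read only in init(), it must be set before
the client is mounted):

```shell
# Hypothetical tuning for a test setup: a small lru limit makes the
# eviction path easy to hit with only a handful of shards.
gluster volume set testvol features.shard-lru-limit 64

# The option is applied in init() but not in reconfigure(), so remount
# the client for it to take effect.
umount /mnt/shard-m2
mount -t glusterfs "$(hostname)":/testvol /mnt/shard-m2
```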

--- Additional comment from Worker Ant on 2018-07-23 03:43:43 EDT ---

REVIEW: https://review.gluster.org/20550 (features/shard: Hold a ref on base
inode when adding a shard to lru list) posted (#1) for review on master by
Krutika Dhananjay

--- Additional comment from Worker Ant on 2018-10-15 23:38:29 EDT ---

COMMIT: https://review.gluster.org/20550 committed in master by "Krutika
Dhananjay" <kdhananj at redhat.com> with a commit message- features/shard: Hold a
ref on base inode when adding a shard to lru list

In __shard_update_shards_inode_list(), previously shard translator
was not holding a ref on the base inode whenever a shard was added to
the lru list. But if the base shard is forgotten and destroyed either
by fuse due to memory pressure or due to the file being deleted at some
point by a different client with this client still containing stale
shards in its lru list, the client would crash at the time of locking
lru_base_inode->lock owing to illegal memory access.

So now the base shard is ref'd into the inode ctx of every shard that
is added to lru list until it gets lru'd out.

The patch also handles the case where none of the shards associated
with a file that is about to be deleted are part of the LRU list and
where an unlink at the beginning of the operation destroys the base
inode (because there are no refkeepers) and hence all of the shards
that are about to be deleted will be resolved without the existence
of a base shard in-memory. This, if not handled properly, could lead
to a crash.

Change-Id: Ic15ca41444dd04684a9458bd4a526b1d3e160499
updates: bz#1605056
Signed-off-by: Krutika Dhananjay <kdhananj at redhat.com>


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1603118
[Bug 1603118] [RHHi] Mount hung and not accessible
https://bugzilla.redhat.com/show_bug.cgi?id=1605056
[Bug 1605056] [RHHi] Mount hung and not accessible