[Bugs] [Bug 1722390] "All subvolumes are down" when all bricks are online

bugzilla at redhat.com
Mon Feb 7 11:40:58 UTC 2022


https://bugzilla.redhat.com/show_bug.cgi?id=1722390

ryan at 7fivefive.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CLOSED                      |MODIFIED
         Resolution|INSUFFICIENT_DATA           |---
              Flags|needinfo?(ksubrahm at redhat.c |
                   |om)                         |
           Keywords|                            |Reopened



--- Comment #10 from ryan at 7fivefive.com ---
Hi @ksubrahm at redhat.com,

Sorry, I didn't see a notification from this bug regarding the update.
We're still seeing this issue with Gluster 8.5.

Please find the requested information below:
- Is this problem seen only with volume "mcv01"?
MCV01 is the only volume we have exported via Samba; the other volume is just
for CTDB. We have, however, seen this issue on other clusters with 2x2
replicated volumes.

- Logs are showing that bricks on nodes "mcn01" & "mcn03" are down. There are
other volumes with bricks on these nodes. Are they not flooding the client logs
with similar messages?
The logs say this, but as far as I can tell, all bricks in the volume are
healthy and online. I'm not seeing these messages in the FUSE client logs. We
have MCV01 mounted on all nodes via FUSE; however, clients never access the
volume that way, as we use the Samba VFS module.

- Are you able to do IO from this client without any errors?
Yes, the system is fully functional; we can read from and write to the volume
as expected. Replication also appears to be working fine, as there are no
pending heals.
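For context, these are roughly the checks that statement is based on (standard
gluster CLI, run on one of the nodes):

# every brick should show Online "Y" with a valid PID
gluster volume status mcv01

# no entries should be listed as needing heal on any brick
gluster volume heal mcv01 info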

- Whether the bricks which needed replacing were on these nodes and are they
the first 10 bricks of volume mcv01?
It seems that all replicated sub-volumes on all nodes are affected:
[2022-02-07 10:28:15.890242] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-0: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890549] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-1: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890771] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-3: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.890933] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-4: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891049] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-2: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891225] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-5: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891355] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-6: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891528] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-7: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891655] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-8: All
subvolumes are down. Going offline until at least one of them comes back up.
[2022-02-07 10:28:15.891788] E [MSGID: 108006]
[afr-common.c:6071:__afr_handle_child_down_event] 0-mcv01-replicate-9: All
subvolumes are down. Going offline until at least one of them comes back up.

- Provide the output of "gluster volume info mcv01".
Please find this uploaded to the ticket with filename mcv01_info_07022022.txt

- Provide the statedump of the client processes which are showing these
messages.
(https://docs.gluster.org/en/latest/Troubleshooting/statedump/#generate-a-statedump)
Could you advise how to get a statedump from a VFS client? I tried the usual
way, but no dump was generated.
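The "usual way" above was the SIGUSR1 approach that works for the FUSE mount
process. If I'm reading the linked statedump page correctly, a gfapi client
(which is what the Samba VFS module is) has to be triggered from the gluster
CLI instead; roughly (hostname and PIDs below are placeholders):

# FUSE mount process - this is what I tried; dumps should land under
# /var/run/gluster by default
kill -USR1 <pid-of-fuse-mount-process>

# libgfapi client such as smbd with vfs_glusterfs - triggered via the CLI,
# where <pid> is the smbd process on that node
gluster volume statedump mcv01 client <hostname>:<pid>

Please correct me if the gfapi syntax above is not right for Gluster 8.5.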

- Give the client vol file which will be present inside
"/var/lib/glusterd/vols/<volume-name>/"
There were quite a few client volfiles. I've compressed them and attached them
to this ticket with the filename mcv01_client_volfiles_07022022.zip.

I did have one thought: could 'auth.allow: 172.30.30.*' be at fault here? We
use that to prevent non-cluster nodes from mounting the volume, but I'm
wondering if we also need to include 127.0.0.1 and the other loopback
addresses. The Samba server is installed on all Gluster nodes, so the
connection may be coming in over a local loopback IP rather than the backend
network IP.
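If that does turn out to be the cause, I'm assuming the fix is simply to widen
the allow list, e.g. something along these lines:

# check what is currently set
gluster volume get mcv01 auth.allow

# add loopback alongside the backend subnet (hypothetical value)
gluster volume set mcv01 auth.allow "172.30.30.*,127.0.0.1"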


-- 
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.redhat.com/show_bug.cgi?id=1722390


