Zenon Panoussis oracle at provocation.net
Tue Mar 16 18:15:46 UTC 2021

> Yes if the dataset is small, you can try rm -rf of the dir 
> from the mount (assuming no other application is accessing 
> them on the volume) launch heal once so that the heal info 
> becomes zero and then copy it over again .

I did roughly that; the rm -rf took its sweet time and the
number of entries to be healed kept diminishing as the deletion
progressed. In the end I was left with

Mon Mar 15 22:57:09 CET 2021
Gathering count of entries to be healed on volume gv0 has been successful

Brick node01:/gfs/gv0
Number of entries: 3

Brick mikrivouli:/gfs/gv0
Number of entries: 2

Brick nanosaurus:/gfs/gv0
Number of entries: 3

and that's where I've been ever since, for the past 20 hours.
SHD has kept trying to heal them all along and the log brings
us back to square one:

[2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f
[2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]

In other words, deleting and recreating the unhealable files
and directories was a workaround, but the underlying problem
persists and I can't even begin to look for it when I have no
clue what errno 22 means in plain English.
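
For what it's worth, the number itself can be decoded with the Python
standard library. A minimal sketch, assuming a Linux host with the
usual errno numbering; it only translates the code and says nothing
about which argument the mkdir call rejected:

```python
import errno
import os

code = 22  # value taken from {errno=22} in the client4_0_mkdir_cbk lines

# Symbolic name of the error code on this platform
name = errno.errorcode[code]

# Human-readable description, the same text strerror(3) returns
message = os.strerror(code)

print(f"errno {code} = {name}: {message}")
```

On Linux this gives EINVAL, "Invalid argument", which is the same text
already present in the error= field of the log lines above.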

In any case, glusterd.log is full of messages like

[2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
[2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]

Every single "received heal vol req" message is immediately followed
by a "dict get failed", always for server-quorum-type, for hours on
end. And I begin to smell a bug. The CLI can query the value OK:

# gluster volume get gv0 cluster.server-quorum-type
Option                                  Value
------                                  -----
cluster.server-quorum-type              off
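
The "every heal request is followed by a dict get failure" observation
can be checked mechanically rather than by eye. A rough sketch, assuming
the log line layout shown above; the function name and the sample list
are made up for illustration:

```python
import re

# Header of a glusterd log line: [timestamp] LEVEL [MSGID: n]
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) \[MSGID: (?P<msgid>\d+)\]"
)

def count_heal_requests(lines):
    """Return (heal_requests, requests_followed_by_dict_failure)."""
    requests = 0
    followed = 0
    prev_was_request = False
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not regular log entries
        is_request = (
            m["msgid"] == "106533" and "Received heal vol req" in line
        )
        is_failure = (
            m["msgid"] == "106061"
            and "Key=cluster.server-quorum-type" in line
        )
        if is_request:
            requests += 1
        elif is_failure and prev_was_request:
            followed += 1
        prev_was_request = is_request
    return requests, followed

# The two log lines quoted above, used as sample input
sample = [
    "[2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0",
    "[2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]",
]
print(count_heal_requests(sample))
```

To run it against a real log, pass an open file instead of the sample
list, e.g. open("/var/log/glusterfs/glusterd.log"); that path is an
assumption and may differ per installation. If both numbers are equal,
every heal request was immediately followed by the failure.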

Checking all quorum-related settings, I get

# gluster volume get gv0 all |grep quorum
cluster.quorum-type                     auto
cluster.quorum-count                    (null) (DEFAULT)
cluster.server-quorum-type              off
cluster.server-quorum-ratio             51
cluster.quorum-reads                    no (DEFAULT)
disperse.quorum-count                   0 (DEFAULT)

I never touched any of them and none of them appear in volume info
under "Options Reconfigured", so I don't know why three of them are
not marked as defaults.
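
Picking out the rows that lack the (DEFAULT) tag is easy to script when
the list gets longer than six lines. A small sketch, assuming the
two-column layout of "gluster volume get" shown above; the function
name is hypothetical:

```python
def non_default_options(text):
    """Return {option: value} for rows not tagged (DEFAULT)."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed row
        # Skip the "Option / Value" header and the dashed separator
        if parts[0] == "Option" or set(parts[0]) == {"-"}:
            continue
        if parts[-1] == "(DEFAULT)":
            continue
        result[parts[0]] = " ".join(parts[1:])
    return result

# Output of "gluster volume get gv0 all | grep quorum" as quoted above
sample = """\
cluster.quorum-type                     auto
cluster.quorum-count                    (null) (DEFAULT)
cluster.server-quorum-type              off
cluster.server-quorum-ratio             51
cluster.quorum-reads                    no (DEFAULT)
disperse.quorum-count                   0 (DEFAULT)
"""

for option, value in non_default_options(sample).items():
    print(option, value)
```

For the sample this lists quorum-type, server-quorum-type and
server-quorum-ratio, i.e. the three options in question.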

Next, I tried setting server-quorum-type=server. The server-quorum-type
problem went away and I got a new kind of dict get failure:

The message "E [MSGID: 106061] [glusterd-volgen.c:2564:brick_graph_add_pump] 0-management: Dict get failed [{Key=enable-pump}]" repeated 2 times between [2021-03-16 17:12:18.677594 +0000] and [2021-03-16 17:12:18.779859 +0000]

I tried rolling back server-quorum-type=server and got this error:

# gluster volume set gv0 cluster.server-quorum-type off
volume set: failed: option server-quorum-type off: 'off' is not valid (possible options are none, server.)

Aha, but previously and by default it was clearly "off", not "none".
That's a bug somewhere, and it is what was causing the dict get
failures on server-quorum-type. The missing dict key enable-pump
that server-quorum-type=server requires also looks like a bug,
because there is no such setting:

# gluster volume get gv0 all |grep pump

There are more similarly strange complaints in the glusterd log:

[2021-03-16 17:25:43.134207 +0000] E [MSGID: 106434] [glusterd-utils.c:13379:glusterd_get_value_for_vme_entry] 0-management: xlator_volopt_dynload error (-1)
[2021-03-16 17:25:43.141816 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for localtime-logging key
[2021-03-16 17:25:43.143185 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-seckey key
[2021-03-16 17:25:43.143340 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-keyid key
[2021-03-16 17:25:43.143484 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-bucketid key
[2021-03-16 17:25:43.143621 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-hostname key

If none of this stuff is used in the first place, it should not
be triggering errors and warnings. If the S3 plugin is not enabled,
the S3 keys should not even be checked. Both the checking of the
keys and the error logging are bugs.
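
To see which option keys trigger these warnings and how often, the log
can be tallied per key. A minimal sketch, assuming the "Failed to get
option for ... key" wording shown above; the function name is made up
for illustration:

```python
import re
from collections import Counter

# Captures the option name in "Failed to get option for <name> key"
KEY_RE = re.compile(r"Failed to get option for (?P<key>\S+) key")

def failed_option_keys(lines):
    """Count MSGID 106332-style warnings per option key."""
    counts = Counter()
    for line in lines:
        m = KEY_RE.search(line)
        if m:
            counts[m["key"]] += 1
    return counts

# Two of the warnings quoted above, used as sample input
sample = [
    "[2021-03-16 17:25:43.141816 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for localtime-logging key",
    "[2021-03-16 17:25:43.143185 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-seckey key",
]

for key, count in failed_option_keys(sample).most_common():
    print(count, key)
```

Fed with the whole glusterd log, this gives one line per offending key,
which makes it easier to list them all in a bug report.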

Cool, I'm discovering more and more stuff that needs fixing, but
I'm making zero progress with my healing problem. I'm still stuck
with errno=22.

