[Gluster-users] Broken status, peer probe, "DNS resolution failed on host" and "Error disabling sockopt IPV6_V6ONLY: "Protocol not available" after updating from gluster 7.9 to 9.1

Artem Russakovskii archon810 at gmail.com
Fri Jul 23 00:05:30 UTC 2021


Hi all,

I just filed this ticket https://github.com/gluster/glusterfs/issues/2648,
and wanted to bring it to your attention. Any feedback would be appreciated.

Description of problem:
We have a 4-node replicate cluster running gluster 7.9. I'm currently
setting up a new cluster on a new set of machines and went straight for
gluster 9.1.

However, I was unable to probe any servers due to this error:

[2021-07-17 00:31:05.228609 +0000] I [MSGID: 106487]
[glusterd-handler.c:1160:__glusterd_handle_cli_probe] 0-glusterd:
Received CLI probe req nexus2 24007
[2021-07-17 00:31:05.229727 +0000] E [MSGID: 101075]
[common-utils.c:3657:gf_is_local_addr] 0-management: error in
getaddrinfo [{ret=Name or service not known}]
[2021-07-17 00:31:05.230785 +0000] E [MSGID: 106408]
[glusterd-peer-utils.c:217:glusterd_peerinfo_find_by_hostname]
0-management: error in getaddrinfo: Name or service not known
 [Unknown error -2]
[2021-07-17 00:31:05.353971 +0000] I [MSGID: 106128]
[glusterd-handler.c:3719:glusterd_probe_begin] 0-glusterd: Unable to
find peerinfo for host: nexus2 (24007)
[2021-07-17 00:31:05.375871 +0000] W [MSGID: 106061]
[glusterd-handler.c:3488:glusterd_transport_inet_options_build]
0-glusterd: Failed to get tcp-user-timeout
[2021-07-17 00:31:05.375903 +0000] I
[rpc-clnt.c:1010:rpc_clnt_connection_init] 0-management: setting
frame-timeout to 600
[2021-07-17 00:31:05.377021 +0000] E [MSGID: 101075]
[common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo
[{family=10}, {ret=Name or service not known}]
[2021-07-17 00:31:05.377043 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host nexus2
[2021-07-17 00:31:05.377147 +0000] I [MSGID: 106498]
[glusterd-handler.c:3648:glusterd_friend_add] 0-management: connect
returned 0
[2021-07-17 00:31:05.377201 +0000] I [MSGID: 106004]
[glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management:
Peer <nexus2> (<00000000-0000-0000-0000-000000000000>), in state
<Establishing Connection>, has disconnected from glusterd.
[2021-07-17 00:31:05.377453 +0000] E [MSGID: 101032]
[store.c:464:gf_store_handle_retrieve] 0-: Path corresponding to
/var/lib/glusterd/glusterd.info. [No such file or directory]
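
Incidentally, family=10 in that getaddrinfo error is AF_INET6 on Linux, so
the same lookup can be reproduced outside gluster with glibc's getent (a
quick sanity check, using my hostnames):

getent ahostsv4 nexus2   # AF_INET-only lookup
getent ahostsv6 nexus2   # AF_INET6-only lookup; empty output with exit code 2
                         # matches the "Name or service not known" above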

I then wiped the /var/lib/glusterd dir to start clean, downgraded to
7.9, and attempted to peer probe again. This time, the probe worked fine,
confirming that 7.9 works here just as it does in production.
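
For reference, the wipe/downgrade steps were roughly the following (this is
openSUSE, so zypper; the exact package spec is from memory):

systemctl stop glusterd
rm -rf /var/lib/glusterd/*
zypper install --oldpackage glusterfs-7.9   # plus the matching sub-packages
systemctl start glusterd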

At this point, I made a volume, started it, and played around with testing
to my satisfaction. Then I decided to see what would happen if I tried to
upgrade this working volume from 7.9 to 9.1.

The end result is:

   - gluster volume status is only showing the local gluster node and not
   any of the remote nodes
   - data does seem to replicate, so the connection between the servers is
   actually established
   - logs are now filled with constantly repeating messages like so:

[2021-07-22 23:29:31.039004 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host nexus2
[2021-07-22 23:29:31.039212 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host citadel
[2021-07-22 23:29:31.039304 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host hive
The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service
not known}]" repeated 119 times between [2021-07-22 23:27:34.025983
+0000] and [2021-07-22 23:29:31.039302 +0000]
[2021-07-22 23:29:34.039369 +0000] E [MSGID: 101075]
[common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo
[{family=10}, {ret=Name or service not known}]
[2021-07-22 23:29:34.039441 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host nexus2
[2021-07-22 23:29:34.039558 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host citadel
[2021-07-22 23:29:34.039659 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host hive
[2021-07-22 23:29:37.039741 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host nexus2
[2021-07-22 23:29:37.039921 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host citadel
[2021-07-22 23:29:37.040015 +0000] E
[name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
resolution failed on host hive
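
The hostnames glusterd keeps retrying come from its peer store, which can be
inspected directly to confirm what it has recorded for each peer (standard
location):

cat /var/lib/glusterd/peers/*   # one file per peer: uuid=, state=, hostname1=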

When I issue a command in the CLI:

==> cli.log <==
[2021-07-22 23:38:11.802596 +0000] I [cli.c:840:main] 0-cli: Started
running gluster with version 9.1
[2021-07-22 23:38:11.804007 +0000] W [socket.c:3434:socket_connect]
0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Operation not
supported"
[2021-07-22 23:38:11.906865 +0000] I [MSGID: 101190]
[event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started
thread with index [{index=0}]
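
My reading of that warning (an assumption, I haven't traced the source) is
that socket_connect is trying to clear the IPv6-level IPV6_V6ONLY option on a
socket that ends up being plain IPv4, hence "not supported". Note that
glusterd's own transport family is configured in /etc/glusterfs/glusterd.vol,
separately from the volume-level option:

grep -n address-family /etc/glusterfs/glusterd.vol
# if unset, this line inside the "volume management" block pins glusterd to IPv4:
#     option transport.address-family inet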

Mandatory info:

- The output of the gluster volume info command:

gluster volume info

Volume Name: ap
Type: Replicate
Volume ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/nexus2_block1/ap
Brick2: forge:/mnt/forge_block1/ap
Brick3: hive:/mnt/hive_block1/ap
Brick4: citadel:/mnt/citadel_block1/ap
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.self-heal-daemon: enable
client.event-threads: 4
cluster.data-self-heal-algorithm: full
cluster.lookup-optimize: on
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.readdir-optimize: on
cluster.heal-timeout: 1800
disperse.eager-lock: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
network.inode-lru-limit: 500000
network.ping-timeout: 7
network.remote-dio: enable
performance.cache-invalidation: on
performance.cache-size: 1GB
performance.io-thread-count: 4
performance.md-cache-timeout: 600
performance.rda-cache-limit: 256MB
performance.read-ahead: off
performance.readdir-ahead: on
performance.stat-prefetch: on
performance.write-behind-window-size: 32MB
server.event-threads: 4
cluster.background-self-heal-count: 1
performance.cache-refresh-timeout: 10
features.ctime: off
cluster.granular-entry-heal: enable
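
Note that transport.address-family is already inet here, yet the logs above
show getaddrinfo being called with family=10 (AF_INET6), so 9.1 appears not
to be honouring the volume-level setting. The effective value can be
double-checked per option:

gluster volume get ap transport.address-family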

- The output of the gluster volume status command:

gluster volume status
Status of volume: ap
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick forge:/mnt/forge_block1/ap            49152     0          Y       2622
Self-heal Daemon on localhost               N/A       N/A        N       N/A

Task Status of Volume ap
------------------------------------------------------------------------------
There are no active volume tasks
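
Since the data does replicate, the peers themselves should still show as
connected, which can be cross-checked with:

gluster peer status   # each peer should report "State: Peer in Cluster (Connected)"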

- The output of the gluster volume heal command:

gluster volume heal ap enable
Enable heal on volume ap has been successful

gluster volume heal ap
Launching heal operation to perform index self heal on volume ap has
been unsuccessful:
Self-heal daemon is not running. Check self-heal daemon log file.
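
The self-heal daemon's side of this should be visible in its own status
output and log (standard locations):

gluster volume status ap shd
tail -n 50 /var/log/glusterfs/glustershd.log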

- The operating system / glusterfs version:
openSUSE Leap 15.2, glusterfs 9.1.


Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>