[Gluster-users] Broken status, peer probe, "DNS resolution failed on host" and "Error disabling sockopt IPV6_V6ONLY: "Protocol not available" after updating from gluster 7.9 to 9.1

Artem Russakovskii archon810 at gmail.com
Tue Jul 27 02:14:13 UTC 2021


Hi everyone,

Well, I went back to Gluster's own repo (
https://download.opensuse.org/repositories/home:/glusterfs:/Leap15.2-9/openSUSE_Leap_15.2/x86_64/)
rather than using OpenSUSE's filesystem one (
https://download.opensuse.org/repositories/filesystems/openSUSE_Leap_15.2/x86_64/),
and tried upgrading from 7.9 to 9.1 again. This time, everything worked
fine, though the status commands kept failing until all the nodes were
upgraded, with errors like this:

gluster volume status
Locking failed on hive. Please check log file for details.
Locking failed on citadel. Please check log file for details.


Something about RPC:

[2021-07-27 00:47:49.288575] E
[rpcsvc.c:194:rpcsvc_get_program_vector_sizer] 0-rpc-service: RPC procedure
7 not available for Program GlusterD svc mgmt v3
[2021-07-27 00:47:49.288608] E [rpcsvc.c:350:rpcsvc_program_actor]
0-rpc-service: RPC Program procedure not available for procedure 7 in
GlusterD svc mgmt v3 for  192.168.210.155:49141


During this time, the servers kept syncing, and the fuse fs was available,
so there was no downtime. And as I mentioned, after upgrading, everything
seems fine.
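
As a side note, once every node is on 9.1 it's probably also worth checking
and bumping the cluster op-version so the newer management RPCs are
negotiated everywhere. A sketch of the usual post-upgrade step (assuming
max-op-version reports 90000 for 9.1):

gluster volume get all cluster.op-version        # what the cluster currently runs at
gluster volume get all cluster.max-op-version    # highest version all nodes support
gluster volume set all cluster.op-version 90000  # bump once every node is upgraded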

With this in mind, it would have been much more user-friendly if the build
from the filesystems repo, which apparently doesn't have the IPv6 options
compiled in, had failed more gracefully instead of exploding the way it
did: I couldn't even probe for peers after installing it. IMO there's still
a legitimate issue to be fixed here, and a test worth adding to the
codebase.
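
For anyone landing here from a search: the workaround Strahil suggests
below, pinning the address family in /etc/glusterfs/glusterd.vol on every
node, seems like the cleanest fix for these peer probe failures on an
IPv4-only setup. A sketch, with most of the stock options omitted:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.address-family inet
end-volume

glusterd needs a restart on each node after the edit. Note that the
volume-level transport.address-family option (settable via gluster volume
set) is already inet in the volume info quoted further down, which suggests
it's glusterd's own connections, governed by glusterd.vol, that still try
IPv6.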

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror
<http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>


On Fri, Jul 23, 2021 at 7:09 PM Strahil Nikolov <hunter86_bg at yahoo.com>
wrote:

> Can you try setting "transport.address-family: inet" at
> /etc/glusterfs/glusterd.vol on all nodes ?
>
> As for the rpms: if they are not yet built, the only other option is to
> build them from source.
>
> I assume that the second try is on a fresh set of systems without any
> remnants of an old Gluster install.
>
> Best Regards,
> Strahil Nikolov
>
>
>
>
>
>
> On Friday, July 23, 2021 at 07:55:01 GMT+3, Artem Russakovskii <
> archon810 at gmail.com> wrote:
>
>
>
>
>
> Hi Strahil,
>
> I am using repo builds from
> https://download.opensuse.org/repositories/filesystems/openSUSE_Leap_15.2/x86_64/
> (currently glusterfs-9.1-lp152.88.2.x86_64.rpm) and don't build them.
>
> Perhaps the builds at
> https://download.opensuse.org/repositories/home:/glusterfs:/Leap15.2-9/openSUSE_Leap_15.2/x86_64/
> are better (currently glusterfs-9.1-lp152.112.1.x86_64.rpm)? Does anyone
> know?
>
> None of the repos currently have 9.3.
>
> And regardless, I don't care for gluster using IPv6 if IPv4 works fine. Is
> there a way to make it stop trying to use IPv6 and only use IPv4?
>
> Sincerely,
> Artem
>
> --
> Founder, Android Police, APK Mirror, Illogical Robot LLC
> beerpla.net | @ArtemR
>
>
> On Thu, Jul 22, 2021 at 9:09 PM Strahil Nikolov <hunter86_bg at yahoo.com>
> wrote:
> > Did you try with the latest 9.x? Based on the release notes, that should
> be 9.3.
> >
> > Best Regards,
> > Strahil Nikolov
> >
> >
> >>
> >>
> >> On Fri, Jul 23, 2021 at 3:06, Artem Russakovskii
> >> <archon810 at gmail.com> wrote:
> >>
> >>
> >>
> >> Hi all,
> >>
> >> I just filed this ticket
> https://github.com/gluster/glusterfs/issues/2648, and wanted to bring it
> to your attention. Any feedback would be appreciated.
> >>
> >> Description of problem:
> >> We have a 4-node replicate cluster running gluster 7.9. I'm currently
> setting up a new cluster on a new set of machines and went straight for
> gluster 9.1.
> >> However, I was unable to probe any servers due to this error:
> >> [2021-07-17 00:31:05.228609 +0000] I [MSGID: 106487]
> [glusterd-handler.c:1160:__glusterd_handle_cli_probe] 0-glusterd: Received
> CLI probe req nexus2 24007
> >> [2021-07-17 00:31:05.229727 +0000] E [MSGID: 101075]
> [common-utils.c:3657:gf_is_local_addr] 0-management: error in getaddrinfo
> [{ret=Name or service not known}]
> >> [2021-07-17 00:31:05.230785 +0000] E [MSGID: 106408]
> [glusterd-peer-utils.c:217:glusterd_peerinfo_find_by_hostname]
> 0-management: error in getaddrinfo: Name or service not known
> >>  [Unknown error -2]
> >> [2021-07-17 00:31:05.353971 +0000] I [MSGID: 106128]
> [glusterd-handler.c:3719:glusterd_probe_begin] 0-glusterd: Unable to find
> peerinfo for host: nexus2 (24007)
> >> [2021-07-17 00:31:05.375871 +0000] W [MSGID: 106061]
> [glusterd-handler.c:3488:glusterd_transport_inet_options_build] 0-glusterd:
> Failed to get tcp-user-timeout
> >> [2021-07-17 00:31:05.375903 +0000] I
> [rpc-clnt.c:1010:rpc_clnt_connection_init] 0-management: setting
> frame-timeout to 600
> >> [2021-07-17 00:31:05.377021 +0000] E [MSGID: 101075]
> [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo
> [{family=10}, {ret=Name or service not known}]
> >> [2021-07-17 00:31:05.377043 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host nexus2
> >> [2021-07-17 00:31:05.377147 +0000] I [MSGID: 106498]
> [glusterd-handler.c:3648:glusterd_friend_add] 0-management: connect
> returned 0
> >> [2021-07-17 00:31:05.377201 +0000] I [MSGID: 106004]
> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
> <nexus2> (<00000000-0000-0000-0000-000000000000>), in state <Establishing
> Connection>, has disconnected from glusterd.
> >> [2021-07-17 00:31:05.377453 +0000] E [MSGID: 101032]
> [store.c:464:gf_store_handle_retrieve] 0-: Path corresponding to
> /var/lib/glusterd/glusterd.info. [No such file or directory]
> >>
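> >> (Side note: the family=10 in the getaddrinfo errors is AF_INET6, i.e.
> >> glusterd is trying an IPv6 lookup for the peer names. A rough sanity
> >> check of what the resolver returns for each family, using nexus2 as the
> >> example (the flags gluster passes internally may differ):
> >>
> >> getent ahostsv4 nexus2   # IPv4 resolution for the peer name
> >> getent ahostsv6 nexus2   # IPv6 resolution, the family=10 path
> >> )
> >>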
> >> I then wiped the /var/lib/glusterd dir to start clean, downgraded to
> 7.9, and attempted to peer probe again. This time it worked fine, proving
> that 7.9 works here just as it does on prod.
> >> At this point, I made a volume, started it, and played around with
> testing to my satisfaction. Then I decided to see what would happen if I
> tried to upgrade this working volume from 7.9 to 9.1.
> >> The end result is:
> >>     * gluster volume status is only showing the local gluster node and
> not any of the remote nodes
> >>     * data does seem to replicate, so the connection between the
> servers is actually established
> >>     * logs are now filled with constantly repeating messages like so:
> >> [2021-07-22 23:29:31.039004 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host nexus2
> >> [2021-07-22 23:29:31.039212 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host citadel
> >> [2021-07-22 23:29:31.039304 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host hive
> >> The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6]
> 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not
> known}]" repeated 119 times between [2021-07-22 23:27:34.025983 +0000] and
> [2021-07-22 23:29:31.039302 +0000]
> >> [2021-07-22 23:29:34.039369 +0000] E [MSGID: 101075]
> [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo
> [{family=10}, {ret=Name or service not known}]
> >> [2021-07-22 23:29:34.039441 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host nexus2
> >> [2021-07-22 23:29:34.039558 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host citadel
> >> [2021-07-22 23:29:34.039659 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host hive
> >> [2021-07-22 23:29:37.039741 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host nexus2
> >> [2021-07-22 23:29:37.039921 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host citadel
> >> [2021-07-22 23:29:37.040015 +0000] E
> [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS
> resolution failed on host hive
> >>
> >> When I issue a command in the CLI:
> >> ==> cli.log <==
> >> [2021-07-22 23:38:11.802596 +0000] I [cli.c:840:main] 0-cli: Started
> running gluster with version 9.1
> >> [2021-07-22 23:38:11.804007 +0000] W [socket.c:3434:socket_connect]
> 0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Operation not
> supported"
> >> [2021-07-22 23:38:11.906865 +0000] I [MSGID: 101190]
> [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index [{index=0}]
> >>
> >> Mandatory info:
> >> - The output of the gluster volume info command:
> >> gluster volume info
> >>
> >> Volume Name: ap
> >> Type: Replicate
> >> Volume ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> >> Status: Started
> >> Snapshot Count: 0
> >> Number of Bricks: 1 x 4 = 4
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: nexus2:/mnt/nexus2_block1/ap
> >> Brick2: forge:/mnt/forge_block1/ap
> >> Brick3: hive:/mnt/hive_block1/ap
> >> Brick4: citadel:/mnt/citadel_block1/ap
> >> Options Reconfigured:
> >> performance.client-io-threads: on
> >> nfs.disable: on
> >> storage.fips-mode-rchecksum: on
> >> transport.address-family: inet
> >> cluster.self-heal-daemon: enable
> >> client.event-threads: 4
> >> cluster.data-self-heal-algorithm: full
> >> cluster.lookup-optimize: on
> >> cluster.quorum-count: 1
> >> cluster.quorum-type: fixed
> >> cluster.readdir-optimize: on
> >> cluster.heal-timeout: 1800
> >> disperse.eager-lock: on
> >> features.cache-invalidation: on
> >> features.cache-invalidation-timeout: 600
> >> network.inode-lru-limit: 500000
> >> network.ping-timeout: 7
> >> network.remote-dio: enable
> >> performance.cache-invalidation: on
> >> performance.cache-size: 1GB
> >> performance.io-thread-count: 4
> >> performance.md-cache-timeout: 600
> >> performance.rda-cache-limit: 256MB
> >> performance.read-ahead: off
> >> performance.readdir-ahead: on
> >> performance.stat-prefetch: on
> >> performance.write-behind-window-size: 32MB
> >> server.event-threads: 4
> >> cluster.background-self-heal-count: 1
> >> performance.cache-refresh-timeout: 10
> >> features.ctime: off
> >> cluster.granular-entry-heal: enable
> >>
> >> - The output of the gluster volume status command:
> >> gluster volume status
> >> Status of volume: ap
> >> Gluster process                             TCP Port  RDMA Port  Online  Pid
> >> ------------------------------------------------------------------------------
> >> Brick forge:/mnt/forge_block1/ap            49152     0          Y       2622
> >> Self-heal Daemon on localhost               N/A       N/A        N       N/A
> >>
> >> Task Status of Volume ap
> >> ------------------------------------------------------------------------------
> >> There are no active volume tasks
> >>
> >> - The output of the gluster volume heal command:
> >> gluster volume heal ap enable
> >> Enable heal on volume ap has been successful
> >>
> >> gluster volume heal ap
> >> Launching heal operation to perform index self heal on volume ap has
> been unsuccessful:
> >> Self-heal daemon is not running. Check self-heal daemon log file.
> >>
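> >> (Two quick checks for the self-heal daemon, assuming a standard install
> >> layout:
> >>
> >> pgrep -af glustershd                          # is the shd process running at all?
> >> tail -n 50 /var/log/glusterfs/glustershd.log  # the log the error points at
> >> )
> >>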
> >> - The operating system / glusterfs version:
> >> OpenSUSE 15.2, glusterfs 9.1.
> >>
> >>
> >> Sincerely,
> >> Artem
> >>
> >> --
> >> Founder, Android Police, APK Mirror, Illogical Robot LLC
> >> beerpla.net | @ArtemR
> >>
>
>