[Gluster-devel] Update on georep failure

Yaniv Kaul ykaul at redhat.com
Tue Feb 2 19:06:09 UTC 2021


On Tue, Feb 2, 2021 at 8:14 PM Michael Scherer <mscherer at redhat.com> wrote:

> Hi,
>
> so we finally found the cause of the georep failure, after several days
> of work by Deepshika and me.
>
> Short story:
> ============
>
> side effect of adding libtirpc-devel on EL 7:
> https://github.com/gluster/project-infrastructure/issues/115


Looking at
https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/191 - we
weren't supposed to use it?
From
https://github.com/gluster/glusterfs/blob/d1d7a6f35c816822fab51c820e25023863c239c1/glusterfs.spec.in#L61
:
# Do not use libtirpc on EL6, it does not have xdr_uint64_t() and xdr_uint32_t
# Do not use libtirpc on EL7, it does not have xdr_sizeof()
%if ( 0%{?rhel} && 0%{?rhel} <= 7 )
%global _without_libtirpc --without-libtirpc
%endif
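
To make the conditional above easier to read: 0%{?rhel} expands to 0 when the
rhel macro is undefined, so the test is true only on EL releases up to 7. Here
is a rough Python sketch of that logic. The function name is made up for
illustration, and I am assuming the macro ends up as an argument to configure;
this is not code from the gluster tree.

```python
from typing import List, Optional


def libtirpc_configure_flags(rhel: Optional[int]) -> List[str]:
    """Hypothetical mirror of: %if ( 0%{?rhel} && 0%{?rhel} <= 7 )."""
    # An undefined rhel macro expands to 0, which makes the condition false.
    rhel_value = rhel or 0
    if rhel_value and rhel_value <= 7:
        # Assumption: this value is later handed to ./configure.
        return ["--without-libtirpc"]
    return []


for release in (None, 6, 7, 8):
    print(release, libtirpc_configure_flags(release))
```

Note this only describes the RPM build; a plain source build on a builder is
not covered by the spec conditional.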


CentOS 7 has an ancient version, CentOS 8 has a newer version, so perhaps
just on CentOS 8 slaves?
Y.


>
> Long story:
> ===========
>
> So we first puzzled over why it was failing on only some builders and not
> others, especially since it was working fine on softserve VMs.
>
> We went through the usual suspects: rebooted, reinstalled, and searched
> for anything weird (too many ssh keys, not enough inodes, some
> hardware issue), but found nothing obvious.
>
> After trying to find my way through the log files and following a few
> weird leads (like, why was gsyncd running gcc? Answer: ctypes), I was
> left with a rather cryptic message:
>
> [2021-02-02 15:19:00.040817 +0000] I [socket.c:929:__socket_server_bind] 0-socket.gfchangelog: closing (AF_UNIX) reuse check socket 18
> [2021-02-02 15:19:02.041641 +0000] W [xdr-rpcclnt.c:68:rpc_request_to_xdr] 0-rpc: failed to encode call msg
> [2021-02-02 15:19:02.041673 +0000] E [rpc-clnt.c:1507:rpc_clnt_record_build_record] 0-gfchangelog: Failed to build record header
> [2021-02-02 15:19:02.041683 +0000] W [rpc-clnt.c:1664:rpc_clnt_submit] 0-gfchangelog: cannot build rpc-record
> [2021-02-02 15:19:02.041692 +0000] E [MSGID: 132023] [gf-changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
> [2021-02-02 15:19:02.041809 +0000] E [MSGID: 132022] [gf-changelog.c:583:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator
>
> Given that all of gluster is built around RPC, it seemed unlikely that
> RPC itself was broken, but those were the only messages we had.
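
When comparing builders, it can help to pull only the warning and error
entries out of such a log and see which source file and function emitted them.
Below is a small Python sketch that does this. The regular expression is
inferred from the six lines quoted above and assumes each entry sits on a
single line; it is not an official description of the gluster log format, and
the names LOG_LINE and parse_log_line are mine.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the quoted excerpt only (an assumption):
# [timestamp] LEVEL [MSGID: n]? [file:line:function] domain: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<function>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields of one log entry, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = [
    "[2021-02-02 15:19:00.040817 +0000] I [socket.c:929:__socket_server_bind] "
    "0-socket.gfchangelog: closing (AF_UNIX) reuse check socket 18",
    "[2021-02-02 15:19:02.041641 +0000] W [xdr-rpcclnt.c:68:rpc_request_to_xdr] "
    "0-rpc: failed to encode call msg",
    "[2021-02-02 15:19:02.041692 +0000] E [MSGID: 132023] "
    "[gf-changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: "
    "Could not initiate probe RPC, bailing out!!!",
]

# Keep only warnings (W) and errors (E), in the order they were logged.
for raw in sample:
    entry = parse_log_line(raw)
    if entry and entry["level"] in ("W", "E"):
        print(entry["level"], entry["file"], entry["function"], entry["message"])
```

On the excerpt, the first non-informational entry comes from
rpc_request_to_xdr, which is the encoding step of the RPC call.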
>
>
> We also found that the only builder that was working was builder 210.
> On closer inspection, we found that 210 had failed to update through
> ansible, due to some debugging we forgot to revert, which made this
> task fail:
>
> https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/gluster_qa_scripts/tasks/main.yml#L7
>
> But it wasn't clear how that would change anything, since the only diff
> was a "set -e" that wasn't removed.
>
> Then Deepshika started to test more than georep, and she noticed that a
> lot of other tests were failing, with exactly the same message about
> rpc.
>
> And she started to wonder if anything was recently changed. And indeed:
>
> # rpm -qa --last | head -n 15
> yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch  mar. 02 févr. 2021 14:04:59 UTC
> python3-debuginfo-3.6.8-18.el7.x86_64                     mar. 02 févr. 2021 14:04:58 UTC
> glibc-debuginfo-2.17-317.el7.x86_64                       mar. 02 févr. 2021 14:04:57 UTC
> glibc-debuginfo-common-2.17-317.el7.x86_64                mar. 02 févr. 2021 14:04:53 UTC
> gpg-pubkey-b6792c39-53c4fbdd                              mar. 02 févr. 2021 14:04:34 UTC
> tzdata-java-2021a-1.el7.noarch                            mer. 27 janv. 2021 09:09:27 UTC
> tzdata-2021a-1.el7.noarch                                 mer. 27 janv. 2021 09:09:26 UTC
> sudo-1.8.23-10.el7_9.1.x86_64                             mer. 27 janv. 2021 09:09:26 UTC
> libtirpc-devel-0.2.4-0.16.el7.x86_64                      mar. 26 janv. 2021 12:53:45 UTC
> java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64           mar. 26 janv. 2021 05:06:44 UTC
>
> We added libtirpc-devel on 26/01 (January 26).
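
The same question ("what was installed since a given date?") can be asked
without depending on the locale of the date column. The Python sketch below
assumes the input was produced by something like
rpm -qa --queryformat '%{INSTALLTIME} %{NVRA}\n', which prints the install
time as seconds since the epoch. The function names are hypothetical, and the
sample input is rebuilt from three of the rows listed above rather than taken
from a real run of that command.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Entry = Tuple[datetime, str]


def parse_install_lines(output: str) -> List[Entry]:
    """Parse '<epoch seconds> <package>' lines into (install time, package)."""
    entries = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        epoch, package = line.split(maxsplit=1)
        installed = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
        entries.append((installed, package))
    return entries


def installed_since(entries: List[Entry], cutoff: datetime) -> List[Entry]:
    """Return entries installed at or after cutoff, newest first."""
    return sorted((e for e in entries if e[0] >= cutoff), reverse=True)


def utc_epoch(*fields: int) -> int:
    """Helper for the sample data: UTC date fields to epoch seconds."""
    return int(datetime(*fields, tzinfo=timezone.utc).timestamp())


# Sample input rebuilt from the listing above (not real command output).
sample_output = "\n".join([
    f"{utc_epoch(2021, 1, 27, 9, 9, 26)} sudo-1.8.23-10.el7_9.1.x86_64",
    f"{utc_epoch(2021, 1, 26, 12, 53, 45)} libtirpc-devel-0.2.4-0.16.el7.x86_64",
    f"{utc_epoch(2021, 1, 26, 5, 6, 44)} "
    "java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64",
])

cutoff = datetime(2021, 1, 26, 12, 0, tzinfo=timezone.utc)
for installed, package in installed_since(parse_install_lines(sample_output), cutoff):
    print(installed.isoformat(), package)
```

Running the same filter on a working and a failing builder and comparing the
two lists would show which packages differ.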
>
> libtirpc-devel would, as the name implies, change something around the
> rpc subsystem.
>
> That was about a week ago, which is when we started to notice the problem.
>
> It was not applied to 210, because 210 failed before it got to that
> point (ansible stops as soon as the git update fails, and the jenkins
> builder role runs after the gluster-qa-script update).
>
> It was not applied to the softserve-provided VMs either, so tests were
> working fine there.
>
> And indeed, once the package got removed, the tests were working again.
>
> Follow up
> =========
>
> So, I would like to know exactly what should be tested. Is gluster not
> compatible with libtirpc on C7 (as it works on C8), or is there some
> weird issue? (Because from what I remember, the RPC format is supposed
> to be compatible and covered by a specification.)
>
> Should we test on C8 only ?
>
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure