[Gluster-infra] The curious case of the failing test of CentOS 8

Deepshikha Khandelwal dkhandel at redhat.com
Mon Sep 28 11:32:34 UTC 2020


Thank you Michael for the detailed information.



On Mon, Sep 28, 2020 at 4:37 PM Michael Scherer <mscherer at redhat.com> wrote:

> Hi,
>
>
> The intro
> =========
>
> So we are trying to get the tests running on CentOS 8. Upon
> investigation, we had a few tests failing, and thanks to Deepshikha's
> work on https://github.com/gluster/project-infrastructure/issues/20,
> this was narrowed down to 6 tests, and I decided to investigate them
> one by one.
>
> I fixed some of them on the infra side, but three of them were caused
> by a single bug, and RH people pointed us to a kernel issue:
> https://github.com/gluster/glusterfs/issues/1402
>
> Since we are running CentOS 8.2, it should have been fixed, but it wasn't.
>
> So the question was "why is the kernel not up to date on our builder",
> which will be our epic quest for today.
>
> The builders
> ============
>
> We are running our test builds on AWS EC2. We have a few CentOS 8
> builders, installed from the only public image we had at that time, an
> unofficial one from Frontline. While I tend to prefer official images
> (for reasons that will become clear later), this was the easiest way to
> get things running while waiting for the official CentOS 8 images that
> would surely be there "real soon".
>
> We have automated upgrades on the builders with cron, so they should be
> using the latest kernel, and if that is not the case, it should just be
> one reboot away. As we reboot on failure, and as the kernel version
> seldom impacts tests, we are usually fine on that point, but maybe the
> CentOS 8 scripts were a bit different. They were a WIP after all.
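>
> For reference, the automation is just a plain cron job. The exact entry
> does not matter here, but it is in the spirit of this hypothetical
> sketch (needs-restarting comes from dnf-utils, which may not be
> installed everywhere):
>
> # hypothetical nightly upgrade job, not our exact entry; needs-restarting -r
> # exits non-zero when a reboot is required
> 0 3 * * * root dnf -y upgrade && { needs-restarting -r || shutdown -r +5; }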
>
> I take the first builder I can get, connect to it and check the kernel:
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> We need a newer one, so I reboot and test again:
>
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> Ok, that's a problem. I check the grub configuration, and indeed there
> is no trace of anything but a single kernel. Curiously, there are also
> traces of CentOS 7 on the disk, a point that will be important later,
> and that does not smell good:
>
> # cat /etc/grub.conf
> default=0
> timeout=0
>
>
> title CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64)
>         root (hd0)
>         kernel /boot/vmlinuz-3.10.0-957.1.3.el7.x86_64 ro
> root=UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d console=hvc0
> LANG=en_US.UTF-8
>         initrd /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img
>
>
> So the kernel configuration was not changed, which means something went
> wrong.
>
> The kernel
> ==========
>
> The Linux kernel is quite an important part of the system and requires
> some special care on upgrade. For example, you have to generate an
> initramfs based on what is on the disk, and you have to modify the
> configuration files for grub/lilo/etc.
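>
> To give an idea of what that entails, this is roughly what one would
> run by hand on EL8 (a sketch; the package scriptlets shown below do the
> real work):
>
> # rebuild the initramfs for the new kernel
> dracut --force /boot/initramfs-4.18.0-193.19.1.el8_2.x86_64.img 4.18.0-193.19.1.el8_2.x86_64
> # regenerate the grub2 configuration
> grub2-mkconfig -o /boot/grub2/grub.cfg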
>
> While there are efforts to move this mess from grub to the system
> firmware with UEFI and systemd, we are not there yet, so on EL8 we are
> still doing it the old way, with scripts run after package installation
> (called scriptlets later in this document). Said scripts are shipped
> with the package, and in the case of the kernel, the work is done by
> /bin/kernel-install, as can be seen with the command that shows the
> scriptlets:
>
> # rpm -q --scripts kernel-core-4.18.0-193.19.1.el8_2.x86_64
> postinstall scriptlet (using /bin/sh):
>
> if [ `uname -i` == "x86_64" -o `uname -i` == "i386" ] &&
>    [ -f /etc/sysconfig/kernel ]; then
>   /bin/sed -r -i -e 's/^DEFAULTKERNEL=kernel-
> smp$/DEFAULTKERNEL=kernel/' /etc/sysconfig/kernel || exit $?
> fi
> preuninstall scriptlet (using /bin/sh):
> /bin/kernel-install remove 4.18.0-193.19.1.el8_2.x86_64
> /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --remove-kernel 4.18.0-193.19.1.el8_2.x86_64
> || exit $?
> fi
> posttrans scriptlet (using /bin/sh):
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --add-kernel 4.18.0-193.19.1.el8_2.x86_64 ||
> exit $?
> fi
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64
> /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
>
> Here, we can see that there are 3 shell scriptlets, run in 3 different
> phases of the package upgrade:
> - postinstall
> - preuninstall
> - posttrans
>
> Postinstall is run after the installation of the package, preuninstall
> is run before the removal, and posttrans is run once the whole
> transaction is finished. See
> https://rpm-packaging-guide.github.io/#triggers-and-scriptlets
>
> The interesting one is posttrans, since that's the one that installs
> the kernel configuration, and either it failed or it wasn't run.
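>
> On a stock EL8 install (grub2 with BootLoaderSpec entries), one quick
> way to check what kernel-install is supposed to have produced would be:
>
> # each installed kernel normally gets a BLS entry here
> ls /boot/loader/entries/
> # and the bootloader default should point to the newest one
> grubby --default-kernel
>
> But as seen above, the bootloader setup on this image is anything but
> stock, so the more direct test is to just rerun the scriptlet.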
>
> So to verify that, the next step is to run the command:
>
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64
> /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz
>
> And from a quick look, this seemed to work fine. A quick reboot later,
> I confirmed that it had, and the kernel was now up to date.
>
> While I could have stopped here, I think it is important to find the
> real problem.
>
> On the internals of rpm package upgrades
> ========================================
>
> Since the scriptlet having issues is %posttrans, that means the
> transaction, i.e. the complete upgrade, failed. %posttrans is run after
> the transaction, if it was successful, so if it wasn't run, the
> transaction wasn't successful.
>
> This is not something that usually happens, unless people reboot during
> the upgrade (a seemingly bad idea, but one that happened on a regular
> basis in the past on end users' laptops), or an rpm failed to upgrade
> properly.
>
> While the hypothesis of a reboot during the upgrade was tempting, there
> is no way it could have happened several times on 3 different, unused
> systems. So I went to the next step: running yum upgrade to check. And
> while looking at the yum output, these lines caught my eye:
>
>   Upgrading        : yum-4.2.17-7.el8_2.noarch                  129/526
> Error unpacking rpm package yum-4.2.17-7.el8_2.noarch
>   Upgrading        : python3-dnf-plugins-core-4.0.12-4.el8_2.noarch   130/526
> error: unpacking of archive failed on file /etc/yum/pluginconf.d: cpio:
> File from package already exists as a directory in system
> error: yum-4.2.17-7.el8_2.noarch: install failed
>
> It seems that yum itself failed to be upgraded, which in turn means the
> whole transaction failed, which in turn means that:
> - the %posttrans scriptlet was not run (so no bootloader config change)
> - yum would try to upgrade again in the future (and fail again)
> - all other packages would be on the disk, so the kernel is installed,
> just not used on boot (see the quick check below)
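>
> A quick way to spot that last state (a hypothetical check, not part of
> our tooling) is to compare the newest installed kernel with the one
> actually running:
>
> # most recently installed kernel package
> rpm -q kernel-core --last | head -1
> # kernel currently running
> uname -r
>
> If the first is newer than the second and a reboot does not change
> anything, the bootloader configuration was never updated.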
>
> The error "File from package already exists as a directory in system"
> is quite specific, and to explain the issue, we must look at how rpm
> upgrades files.
>
> The naive way of doing package upgrades is to remove the old files
> first and then add the new ones. But this is not exactly a good idea if
> there is an issue during the upgrade, so rpm first adds the new files,
> and later removes what needs to be removed. This way, you do not lose
> files in case of problems such as reboots, hardware errors, etc.
>
> However, this scheme fails in one very specific case: replacing a
> directory with a symlink/file, since you cannot create the symlink
> while the directory is still there in the first place (especially if
> there is something in the directory). There are various workarounds,
> see
>
> https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/
>
> but this is a long-standing, well-known limitation of various packaging
> systems.
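>
> The underlying constraint is easy to reproduce by hand, outside of rpm
> (a toy illustration; the exact error wording may differ):
>
> $ mkdir -p demo/pluginconf.d
> $ touch demo/pluginconf.d/leftover.conf
> $ ln -sT ../dnf/plugins demo/pluginconf.d
> ln: failed to create symbolic link 'demo/pluginconf.d': File exists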
>
> The symptoms are that error message, plus various files left on disk
> with a random-looking suffix (the rpm transaction id, a hex timestamp):
>
> # ls /etc/yum
>  fssnap.d
>  pluginconf.d.bak
> 'pluginconf.d;5f719e89'
>  protected.d
> 'protected.d;5f71a160'
> 'protected.d;5f71a1be'
> 'vars;5f71a8c8'
>  yum7
>  pluginconf.d
> 'pluginconf.d;5e4ba094'
> 'pluginconf.d;5f71a052'
>  protected.d.bak
> 'protected.d;5f71a18f'
>  vars
>  version-groups.conf
>
> So one way to fix this is to get the directory out of the way; another
> is to fix the package (cf. the packaging doc). I decided to fix it by
> moving /etc/yum/pluginconf.d out of the way; it then failed again with
> protected.d and vars, which got the same treatment.
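>
> The manual fix boiled down to something like this (a sketch; the .bak
> names for pluginconf.d and protected.d are the ones visible in the
> listing above):
>
> # move the CentOS 7 leftover directories out of the way
> mv /etc/yum/pluginconf.d /etc/yum/pluginconf.d.bak
> mv /etc/yum/protected.d /etc/yum/protected.d.bak
> mv /etc/yum/vars /etc/yum/vars.bak
> # and retry the failed upgrade
> yum upgrade -y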
>
>
> So once that's done, yum can be upgraded, and the system should
> work, right?
>
>
> But as we are diving deep: why is there an upgrade issue in the first
> place? Where does this yum package come from? Why has no one reported
> such a glaring problem?
>
>
> A quick look on a fresh VM shows that it indeed comes from CentOS:
>
> [centos at ip-172-31-31-150 ~]$ rpm -qi yum
> Name        : yum
> Version     : 4.0.9.2
> Release     : 5.el8
> Architecture: noarch
> Install Date: Sat 26 Oct 2019 04:44:11 AM UTC
> Group       : Unspecified
> Size        : 60284
> License     : GPLv2+ and GPLv2 and GPL
> Signature   : RSA/SHA256, Tue 02 Jul 2019 02:09:17 AM UTC, Key ID
> 05b555b38483c65d
> Source RPM  : dnf-4.0.9.2-5.el8.src.rpm
> Build Date  : Mon 13 May 2019 07:35:13 PM UTC
> Build Host  : ppc64le-01.mbox.centos.org
> Relocations : (not relocatable)
> Packager    : CentOS Buildsys <bugs at centos.org>
> Vendor      : CentOS
> URL         : https://github.com/rpm-software-management/dnf
> Summary     : Package manager
> Description :
> Utility that allows users to manage packages on their systems.
> It supports RPMs, modules and comps groups & environments.
>
>
> It looks like a legit rpm, so there goes my hope of just being able to
> blame some bad rpm. But that's still odd.
>
> On why you shouldn't use non-official images
> ============================================
>
> As I said, one weird thing about that image was the fact that CentOS 7
> was in the grub 1 configuration, but also the fact that a grub 1 config
> was around at all. Investigating, I found that several files in
> /etc/yum were not owned by any rpm, which is kinda curious, because
> that would mean some custom changes on the image.
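>
> A crude way to list them (a hypothetical one-liner, nothing we run as
> part of our tooling):
>
> # print every top-level entry of /etc/yum that no installed rpm claims
> for f in /etc/yum/*; do rpm -qf "$f" > /dev/null 2>&1 || echo "unowned: $f"; done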
>
> For example, what is that yum7 directory:
>
> # rpm -qf /etc/yum/yum7/
> file /etc/yum/yum7 is not owned by any package
>
> A quick search on github led me to this:
> https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86
>
> So it seems that the image was created by taking a CentOS 7 image,
> upgrading it in place to CentOS 8, and then uploading it, hence the
> leftover CentOS 7 files lying around. But still, the rpm is signed, so
> I needed to verify that.
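>
> Checking the signature itself is easy enough (a sketch, assuming the
> usual CentOS 8 key location):
>
> # verify the downloaded package against the CentOS GPG key
> rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial
> rpm -K yum-4.0.9.2-5.el8.noarch.rpm
>
> But I also wanted to see what the package is supposed to ship.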
>
> I found the original package at
> http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm
>
> After a quick download and extract:
> $ curl
> http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm
> |rpm2cpio |cpio -id
>
> we can see:
> $ ls -l etc/yum
> total 0
> lrwxrwxrwx. 1 misc misc 14 28 sept. 12:36 pluginconf.d -> ../dnf/plugins
> lrwxrwxrwx. 1 misc misc 18 28 sept. 12:36 protected.d -> ../dnf/protected.d
> lrwxrwxrwx. 1 misc misc 11 28 sept. 12:36 vars -> ../dnf/vars
>
> So on a freshly installed CentOS 8 system, the symlinks should be there.
>
> So this means that the upgrade issue is a leftover from the CentOS 7 to
> CentOS 8 in-place upgrade operation. On CentOS 7, /etc/yum/vars is a
> directory; on CentOS 8, it is a symlink. The newer version of the
> upgrade script takes that into account (there is a mv), but not the one
> that was used to create the image, and so the in-place yum upgrade
> fails, just as it failed when we tried it on CentOS 8.
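>
> The "there is a mv" part presumably amounts to something in this spirit
> (a hypothetical reconstruction; the real script is in the repository
> linked above):
>
> # before installing the el8 packages, move the el7 directories aside
> for d in pluginconf.d protected.d vars; do
>     [ -d /etc/yum/$d ] && mv /etc/yum/$d /etc/yum/$d.el7
> done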
>
> And since that's an unsupported operation, there is no chance of having
> it fixed in CentOS 8 (or RHEL 8 for that matter) by adding the right
> scriptlet to yum.
>
>
> Conclusion
> ==========
>
> Since this mail is already long, my conclusion would be:
>
> - trust no one, especially unofficial images on the AWS marketplace.
>
> I would love to add that the truth and the official images are out
> there, but I checked this morning, and that is still not the case.
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure
>
>
>
> _______________________________________________
> Gluster-infra mailing list
> Gluster-infra at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra