[Gluster-infra] The curious case of the failing tests on CentOS 8

sankarshan sankarshan.mukhopadhyay at gmail.com
Mon Sep 28 12:38:43 UTC 2020


After reading the excellent analysis of the issue I am perplexed -
what is the path to the desired end state? If "official images"
continue to be absent, then we'll have to make do with the
"unofficial" ones. However, this note also seems relevant and
important to the CentOS community - can we find a contact to
highlight it to?

/s

On Mon, 28 Sep 2020 at 16:37, Michael Scherer <mscherer at redhat.com> wrote:
>
> Hi,
>
>
> The intro
> =========
>
> So we are trying to get the tests running on CentOS 8. Upon
> investigation, we had a few tests failing, and thanks to Deepshika's
> work on https://github.com/gluster/project-infrastructure/issues/20,
> we narrowed it down to 6 tests, and I decided to investigate them one
> by one.
>
> I fixed some of them on the infra side, but a single bug was causing
> failures in 3 tests, and RH people pointed us to a kernel issue:
> https://github.com/gluster/glusterfs/issues/1402
>
> As we are running CentOS 8.2, it should have been fixed, but it wasn't.
>
> So the question was "why is the kernel not up to date on our builder",
> which will be our epic quest for today.
>
> The builders
> ============
>
> We are running our test builds on AWS EC2. We have a few CentOS 8
> builders, installed from the only public image we had at that time, an
> unofficial one from Frontline. While I tend to prefer official images
> (for reasons that will become clear later), this was the easiest way
> to get it running while waiting for the official CentOS 8 images that
> would for sure be there "real soon".
>
> We have automated upgrades on the builders with cron, so they should
> be using the latest kernel, and if that is not the case, it should be
> just one reboot away. As we reboot on failure, and as the kernel
> version seldom impacts tests, we are usually good on that point, but
> maybe the CentOS 8 scripts were a bit different. This was a WIP after
> all.
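>
> For context, the "automated upgrade" is essentially a cron job calling
> the package manager. A minimal sketch of the idea (path and schedule
> are illustrative, not our exact script):
>
> #!/bin/sh
> # /etc/cron.daily/auto-upgrade (illustrative path, not our actual file)
> # upgrade everything non-interactively; a failed transaction here is
> # exactly what leaves %posttrans, and thus the bootloader config, behind
> dnf -y upgrade >> /var/log/auto-upgrade.log 2>&1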
>
> I take the first builder I can get, connect to it and check the kernel:
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> We need a newer one, so I reboot and test again:
>
> # uname -a
> Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
> Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> Ok, that's a problem. I check the grub configuration: indeed, there is
> no trace of anything but a single kernel. Curiously, there are also
> traces of CentOS 7 on the disk, a point that will be important later,
> and that does not smell good:
>
> # cat /etc/grub.conf
> default=0
> timeout=0
>
>
> title CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64)
>         root (hd0)
>         kernel /boot/vmlinuz-3.10.0-957.1.3.el7.x86_64 ro root=UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d console=hvc0 LANG=en_US.UTF-8
>         initrd /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img
>
>
> So the kernel configuration was not changed, which means something
> went wrong.
>
> The kernel
> ==========
>
> The Linux kernel is quite an important part of the system and requires
> some special care during upgrades. For example, you have to generate an
> initramfs based on what is on the disk, you have to modify the
> configuration files for grub/lilo/etc, and so on.
>
> While there is an effort to move that mess from grub to the system
> firmware with UEFI and systemd, we are not there yet, so on EL8 we are
> still doing it the old way, with scripts run after package
> installation (called scriptlets later in this document). The said
> scripts are shipped with the package, and in the case of the kernel,
> the work is done by /bin/kernel-install, as can be seen with the
> command that shows the scriptlets:
>
> # rpm -q --scripts kernel-core-4.18.0-193.19.1.el8_2.x86_64
> postinstall scriptlet (using /bin/sh):
>
> if [ `uname -i` == "x86_64" -o `uname -i` == "i386" ] &&
>    [ -f /etc/sysconfig/kernel ]; then
>   /bin/sed -r -i -e 's/^DEFAULTKERNEL=kernel-smp$/DEFAULTKERNEL=kernel/' /etc/sysconfig/kernel || exit $?
> fi
> preuninstall scriptlet (using /bin/sh):
> /bin/kernel-install remove 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --remove-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
> fi
> posttrans scriptlet (using /bin/sh):
> if [ -x /usr/sbin/weak-modules ]
> then
>     /usr/sbin/weak-modules --add-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
> fi
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
>
> Here, we can see that there are 3 scriptlets in shell, run in 3
> different phases of the package upgrade:
> - postinstall
> - preuninstall
> - posttrans
>
> Postinstall is run after the installation of the package, preuninstall
> is run before the removal, and posttrans is run once the whole
> transaction is finished. See
> https://rpm-packaging-guide.github.io/#triggers-and-scriptlets
>
> The interesting one is posttrans, since that's the one that installs
> the kernel configuration, and it either failed or wasn't run.
>
> So to verify that, the next step is to run the command:
>
> /bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz
>
> From a quick look, this seemed to work fine, and a quick reboot later,
> I confirmed it: the kernel was now up to date.
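>
> For the record, on a stock EL8 install the bootloader choice can also
> be double-checked without rebooting, assuming grubby is present (it is
> by default), with something like:
>
> # which kernel grub will boot by default, and all known entries
> grubby --default-kernel
> grubby --info=ALL | grep -E '^(title|kernel)'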
>
> While I could have stopped here, I think it is important to find the
> real problem.
>
> On the internals of rpm package upgrades
> ========================================
>
> Since the scriptlet facing the issue is %posttrans, that means the
> transaction, i.e. the complete upgrade, failed. %posttrans is run
> after the transaction, and only if it was successful; if it wasn't
> run, that means the transaction wasn't successful.
>
> This is usually not something that happens, unless people reboot
> during the upgrade (a seemingly bad idea, but one that happened on a
> regular basis in the past on end users' laptops), or if an rpm failed
> to upgrade properly.
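>
> As an aside, dnf keeps a record of past transactions, so ones that did
> not complete cleanly can usually be spotted after the fact; a quick
> (hedged) way to look:
>
> # list recent transactions; ones that did not complete cleanly are flagged
> dnf history list
> # then inspect a suspicious transaction id (42 is just an example)
> dnf history info 42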
>
> While the hypothesis of a reboot during upgrade was tempting, there is
> no way it could have happened several times on 3 different unused
> systems. So I went to the next step and ran yum upgrade to check.
> While looking at the yum output, these lines caught my eye:
>
>   Upgrading        : yum-4.2.17-7.el8_2.noarch                              129/526
> Error unpacking rpm package yum-4.2.17-7.el8_2.noarch
>   Upgrading        : python3-dnf-plugins-core-4.0.12-4.el8_2.noarch         130/526
> error: unpacking of archive failed on file /etc/yum/pluginconf.d: cpio: File from package already exists as a directory in system
> error: yum-4.2.17-7.el8_2.noarch: install failed
>
> It seems that yum itself failed to be upgraded, which in turn means
> the whole transaction failed, which in turn means that:
> - the %posttrans scriptlet was not run (so no bootloader config change)
> - yum would try to upgrade again in the future (and fail again)
> - all the other packages are on disk, so the kernel is installed, just
> not used at boot (a quick check for that state is sketched below)
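>
> A quick way to see that state on a builder is to compare the kernels
> on disk with the one actually running:
>
> # kernels installed on disk vs the kernel currently booted
> rpm -q kernel-core
> uname -r
> # if the newest kernel-core is newer than `uname -r` even after a
> # reboot, the bootloader configuration was never updated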
>
> The error "File from package already exists as a directory in system"
> is quite specific, and to explain the issue, we must look at how rpm
> upgrade files.
>
> The naive way of doing packages upgrades is to remove first the files,
> and then add the new ones. But this is not exactly a good idea if there
> is a issue during the upgrade, so rpm first add the new files, and
> later remove what need to be removed. This way, you do not lose files
> in case of problem such as reboot, hardware errors, etc.
>
> However, this scheme fails in one very specific case: replacing a
> directory by a symlink/file, since you cannot create the symlink while
> the directory is still there in the first place (especially if there
> is something in the directory). There are various workarounds, see
> https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/
>
> But this is a long-standing, well-known limitation of various
> packaging systems.
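>
> A trivial illustration of the problem, outside of rpm entirely (the
> /tmp/demo path is just for the example): creating a symlink at a path
> that is already a non-empty directory fails, which is the same wall
> rpm hits while unpacking the new yum package.
>
> mkdir -p /tmp/demo/pluginconf.d
> touch /tmp/demo/pluginconf.d/foo.conf
> # fails with "File exists": the path is already a directory
> ln -sT ../dnf/plugins /tmp/demo/pluginconf.d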
>
> The symptoms are that error message, plus various files with a random
> suffix lying around on disk:
>
> # ls /etc/yum
>  fssnap.d
>  pluginconf.d.bak
> 'pluginconf.d;5f719e89'
>  protected.d
> 'protected.d;5f71a160'
> 'protected.d;5f71a1be'
> 'vars;5f71a8c8'
>  yum7
>  pluginconf.d
> 'pluginconf.d;5e4ba094'
> 'pluginconf.d;5f71a052'
>  protected.d.bak
> 'protected.d;5f71a18f'
>  vars
>  version-groups.conf
>
> So one way to fix it is to get the directory out of the way; another
> is to fix the package (cf. the packaging doc). I decided to fix it by
> moving /etc/yum/pluginconf.d aside; the upgrade then failed again on
> protected.d and vars, which got the same treatment.
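>
> Roughly, the manual fix on the affected builders boiled down to this
> (backup names are illustrative):
>
> cd /etc/yum
> # get the leftover CentOS 7 directories out of rpm's way
> mv pluginconf.d pluginconf.d.bak
> mv protected.d protected.d.bak
> mv vars vars.bak
> # retry; yum/dnf can now lay down its symlinks
> dnf -y upgrade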
>
>
> So once that's done, yum can be upgraded, and so the system should
> work, right ?
>
>
> But as we are diving deep, why is there an upgrade issue in the first
> place? Where does this yum package come from? And why has no one
> reported such a glaring problem?
>
>
> A quick look on a fresh VM shows that it indeed comes from CentOS:
>
> [centos at ip-172-31-31-150 ~]$ rpm -qi yum
> Name        : yum
> Version     : 4.0.9.2
> Release     : 5.el8
> Architecture: noarch
> Install Date: Sat 26 Oct 2019 04:44:11 AM UTC
> Group       : Unspecified
> Size        : 60284
> License     : GPLv2+ and GPLv2 and GPL
> Signature   : RSA/SHA256, Tue 02 Jul 2019 02:09:17 AM UTC, Key ID
> 05b555b38483c65d
> Source RPM  : dnf-4.0.9.2-5.el8.src.rpm
> Build Date  : Mon 13 May 2019 07:35:13 PM UTC
> Build Host  : ppc64le-01.mbox.centos.org
> Relocations : (not relocatable)
> Packager    : CentOS Buildsys <bugs at centos.org>
> Vendor      : CentOS
> URL         : https://github.com/rpm-software-management/dnf
> Summary     : Package manager
> Description :
> Utility that allows users to manage packages on their systems.
> It supports RPMs, modules and comps groups & environments.
>
>
> It looks like a legit rpm, so there goes my hope of simply being able to blame some bad rpm. But that's still odd.
>
> On why you shouldn't use non-official images
> =============================================
>
> As I said, something weird about that image was the fact that CentOS 7 was in the grub 1 configuration, but
> also the fact that a grub1 config was around at all. Investigating, I found that several files in /etc/yum were not owned by any rpm,
> which is kinda curious, because that would mean some custom changes on the image.
>
> For example, what is that yum7 directory:
>
> # rpm -qf /etc/yum/yum7/
> file /etc/yum/yum7 is not owned by any package
>
> A quick search on github led me to this:
> https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86
>
> So it seems that the image was created by taking a CentOS 7 image, upgrading it in place to CentOS 8, and then uploading it, hence
> the CentOS 7 files left over here and there. But still, the rpm is signed, so I needed to verify that.
>
> I found the original package on http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm ,
>
> After a quick download and extract:
> $ curl  http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm  |rpm2cpio |cpio -id
>
> we can see:
> $ ls -l etc/yum
> total 0
> lrwxrwxrwx. 1 misc misc 14 28 sept. 12:36 pluginconf.d -> ../dnf/plugins
> lrwxrwxrwx. 1 misc misc 18 28 sept. 12:36 protected.d -> ../dnf/protected.d
> lrwxrwxrwx. 1 misc misc 11 28 sept. 12:36 vars -> ../dnf/vars
>
> So on a freshly installed CentOS 8 system, the symlinks should be there.
>
> So this means that the upgrade issue is a leftover from the CentOS 7 to CentOS 8 in-place upgrade. On CentOS 7, /etc/yum/vars is
> a directory; on CentOS 8, it is a symlink. The newer version of the upgrade script takes that into account (there is a mv), but not
> the version used to create the image, and so upgrading yum fails, just as it failed when we tried on our CentOS 8 builders.
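>
> On an affected builder, the leftover is easy to confirm by checking
> the file type of those three paths; on a clean CentOS 8 install they
> are symlinks, on an upgraded-in-place image they are still directories:
>
> stat -c '%F  %n' /etc/yum/pluginconf.d /etc/yum/protected.d /etc/yum/vars
> # "directory"     -> leftover from CentOS 7
> # "symbolic link" -> what the CentOS 8 yum package ships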
>
> And since that's an unsupported operation, there is no chance of having it fixed in CentOS 8 (or RHEL 8, for that matter) by adding
> the right scriptlet to yum.
>
>
> Conclusion
> ==========
>
> Since this mail is already long, my conclusion would be:
>
> - trust no one, especially unofficial images on the AWS marketplace.
>
> I would love to add that the truth and the official images are out there, but I checked this morning, and that's still not the case.
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure
>
>
>
> _______________________________________________
> Gluster-infra mailing list
> Gluster-infra at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-infra



-- 
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>

