[Gluster-infra] The curious case of the failing tests on Centos 8

Michael Scherer mscherer at redhat.com
Mon Sep 28 11:06:44 UTC 2020


Hi,


The intro
=========

So we are trying to get the tests running on Centos 8. Upon
investigation, we had a few tests failing, and thanks to Deepshika's
work on https://github.com/gluster/project-infrastructure/issues/20,
this was narrowed down to 6 tests, and I decided to investigate them
one by one.

I fixed some of them on the infra side, but one single bug was
triggering failures in 3 tests, and RH people pointed us to a kernel
issue: https://github.com/gluster/glusterfs/issues/1402

As we are running Centos 8.2, it should have been fixed, but it wasn't.

So the question was "why is the kernel not up to date on our builders",
which will be our epic quest for today.

The builders
============

We are running our test builds on AWS EC2. We have a few Centos 8
builders, installed from the only public image we had at that time, an
unofficial one from Frontline. While I tend to prefer official images
(for reasons that will become clear later), this was the easiest way to
get things running, while waiting for the official Centos 8 images that
would surely be there "real soon".

We have automated upgrades on the builders with cron, so they should be
using the latest kernel, and if that is not the case, it should just be
one reboot away. As we reboot on failure, and as the kernel version
seldom impacts tests, we are usually good on that point, but maybe the
Centos 8 scripts were a bit different. This was a WIP after all.
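
For context, the auto-upgrade is nothing fancy; it boils down to the
equivalent of a cron entry like this (a sketch, the file name and exact
command line are hypothetical, the real job differs a bit):

# /etc/cron.d/auto-upgrade (hypothetical)
# run a full upgrade every night; a new kernel only becomes the
# running one after the next reboot
30 3 * * * root yum -y upgrade >> /var/log/auto-upgrade.log 2>&1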

I take the 1st builder I get, connect to it and check the kernel:
# uname -a 
Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

We need a newer one, so I reboot and test again:

# uname -a 
Linux builder212.int.aws.gluster.org 4.18.0-80.11.2.el8_0.x86_64 #1 SMP
Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Ok, that's a problem. I check the grub configuration: indeed, there is
no trace of anything but a single kernel. Curiously, there are also
traces of Centos 7 on the disk, a point that will be important later,
and that does not smell good:

# cat /etc/grub.conf 
default=0
timeout=0

title CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64)
        root (hd0)
        kernel /boot/vmlinuz-3.10.0-957.1.3.el7.x86_64 ro root=UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d console=hvc0 LANG=en_US.UTF-8
        initrd /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img


So the kernel configuration was not changed, which means
something went wrong.

The kernel
==========

The Linux kernel is a quite important part of the system and requires
some special care when upgrading. For example, you have to generate an
initramfs based on what is on the disk, you have to modify the
configuration files for grub/lilo/etc, etc, etc.

While there are efforts to move the mess from grub to the firmware of
the system with UEFI and systemd, we are not there yet, so on EL8, we
are still doing it the old way, with scripts run after package
installation (called scriptlets later in this document). The said
scripts are shipped with the package, and in the case of the kernel,
the work is done by /bin/kernel-install, as can be seen with the
command that shows scriptlets:

# rpm -q --scripts kernel-core-4.18.0-193.19.1.el8_2.x86_64
postinstall scriptlet (using /bin/sh):

if [ `uname -i` == "x86_64" -o `uname -i` == "i386" ] &&
   [ -f /etc/sysconfig/kernel ]; then
  /bin/sed -r -i -e 's/^DEFAULTKERNEL=kernel-smp$/DEFAULTKERNEL=kernel/' /etc/sysconfig/kernel || exit $?
fi
preuninstall scriptlet (using /bin/sh):
/bin/kernel-install remove 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?
if [ -x /usr/sbin/weak-modules ]
then
    /usr/sbin/weak-modules --remove-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
fi
posttrans scriptlet (using /bin/sh):
if [ -x /usr/sbin/weak-modules ]
then
    /usr/sbin/weak-modules --add-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?
fi
/bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?

Here, we can see that there are 3 scriptlets in shell, run in 3
different phases of the package upgrade:
- postinstall
- preuninstall
- posttrans

Postinstall is run after the installation of the package, preuninstall
is run before the removal, and posttrans is run once the whole
transaction is finished. See 
https://rpm-packaging-guide.github.io/#triggers-and-scriptlets 

The interesting one is posttrans, since that's the one that installs
the kernel configuration, and either it failed, or it wasn't run.

So to verify that, the next step is to run the command:

/bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz

And from a quick look, this seemed to work. A quick reboot later, I
confirmed it did, and the kernel was now up to date.
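
For the record, the result can also be checked without rebooting, for
example with grubby (a sketch, and given how non-standard this image
turned out to be, the exact output may vary):

# grubby should now list the new kernel among the boot entries
grubby --info=ALL | grep ^kernel
# and report it as the default
grubby --default-kernel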

While I could have stopped here, I think it is important to find the
real problem.

On the internals of rpm package upgrades
========================================

Since the scriptlet that faces the issue is %posttrans, that means the
transaction, i.e. the complete upgrade, failed. %posttrans is run after
the transaction, if it was successful, and since it wasn't run, that
means the transaction wasn't successful.

This is usually not something that happens, unless people reboot during
the upgrade (a seemingly bad idea, but that happened on a regular basis
in the past on end users' laptops), or if an rpm failed to upgrade
properly.
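
As an aside, the transaction history is another place where an
interrupted or failed upgrade leaves traces; a sketch of that check,
though I went a more direct way just after:

# list past transactions, a failed or aborted one stands out here
dnf history list
# then inspect the suspicious transaction id
dnf history info <id>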

While the hypothesis of a reboot during upgrade was tempting, there is
no way it could happen on 3 different unused systems several times. So
I went on to the next step: running yum upgrade to check. And while
looking at the yum output, these lines caught my eye:

  Upgrading        : yum-4.2.17-7.el8_2.noarch                        129/526
Error unpacking rpm package yum-4.2.17-7.el8_2.noarch
  Upgrading        : python3-dnf-plugins-core-4.0.12-4.el8_2.noarch   130/526
error: unpacking of archive failed on file /etc/yum/pluginconf.d: cpio: File from package already exists as a directory in system
error: yum-4.2.17-7.el8_2.noarch: install failed

It seems that yum itself failed to be upgraded, which in turn means the
whole transaction failed, which in turn means that:
- the %posttrans scriptlet was not run (so no bootloader config change)
- yum would try to upgrade again in the future (and fail again)
- all the other packages would be on the disk, so the kernel is
installed, just not used on boot (which is easy to check, see below)
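
Checking that last point on the builder is straightforward (the
versions shown earlier make the mismatch obvious):

# the rpm database knows about the newly installed kernels...
rpm -q kernel-core
# ...but the kernel actually running is still the old one
uname -r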

The error "File from package already exists as a directory in system"
is quite specific, and to explain the issue, we must look at how rpm
upgrade files. 

The naive way of doing package upgrades is to first remove the files,
and then add the new ones. But this is not exactly a good idea if there
is an issue during the upgrade, so rpm first adds the new files, and
later removes what needs to be removed. This way, you do not lose files
in case of problems such as reboots, hardware errors, etc.

However, this scheme fails in one very specific case: replacing a
directory with a symlink/file, since you cannot create the symlink
while the directory is still there in the first place (especially if
there is something in the directory). There are various workarounds, see
https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/

But this is a long standing, well known limitation of various packaging
systems.
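
To make the limitation concrete, here is a minimal illustration outside
of rpm (hypothetical paths, just to show the failure mode):

mkdir -p /tmp/demo/pluginconf.d        # the old layout: a real directory
touch /tmp/demo/pluginconf.d/a.conf    # with content inside
# the new layout wants a symlink at that same path; this fails, as a
# directory cannot simply be overwritten, it has to be moved or
# emptied and removed first
ln -sfn ../dnf/plugins /tmp/demo/pluginconf.d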

The symptoms are that error message, and finding various files with a
random suffix on disk:

# ls /etc/yum
 fssnap.d
 pluginconf.d
 pluginconf.d.bak
'pluginconf.d;5e4ba094'
'pluginconf.d;5f719e89'
'pluginconf.d;5f71a052'
 protected.d
 protected.d.bak
'protected.d;5f71a160'
'protected.d;5f71a18f'
'protected.d;5f71a1be'
 vars
'vars;5f71a8c8'
 version-groups.conf
 yum7

So one way to fix it is to get the directory out of the way, another
is to fix the package (cf. the packaging doc). I decided to fix it by
moving /etc/yum/pluginconf.d; the upgrade then failed again on
protected.d and vars.
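
Roughly, the manual fix was along these lines (a sketch from memory,
the exact names I used may differ, but the .bak directories do show up
in the ls output above):

# get the conflicting directories out of rpm's way
mv /etc/yum/pluginconf.d /etc/yum/pluginconf.d.bak
mv /etc/yum/protected.d /etc/yum/protected.d.bak
mv /etc/yum/vars /etc/yum/vars.bak
# then let the previously failed upgrade run again
yum -y upgrade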


So once that's done, yum can be upgraded, and the system should
work, right?


But as we are diving deep, why is there an upgrade issue in the first
place? Where does this yum package come from? Why has no one reported
such a glaring problem?


A quick look on a fresh VM shows that it indeed comes from Centos:

[centos@ip-172-31-31-150 ~]$ rpm -qi yum
Name        : yum
Version     : 4.0.9.2
Release     : 5.el8
Architecture: noarch
Install Date: Sat 26 Oct 2019 04:44:11 AM UTC
Group       : Unspecified
Size        : 60284
License     : GPLv2+ and GPLv2 and GPL
Signature   : RSA/SHA256, Tue 02 Jul 2019 02:09:17 AM UTC, Key ID 05b555b38483c65d
Source RPM  : dnf-4.0.9.2-5.el8.src.rpm
Build Date  : Mon 13 May 2019 07:35:13 PM UTC
Build Host  : ppc64le-01.mbox.centos.org
Relocations : (not relocatable)
Packager    : CentOS Buildsys <bugs@centos.org>
Vendor      : CentOS
URL         : https://github.com/rpm-software-management/dnf
Summary     : Package manager
Description :
Utility that allows users to manage packages on their systems.
It supports RPMs, modules and comps groups & environments.


It looks like a legit rpm, so there goes my hope of being able to just
blame some bad rpm. But that's still odd.

On why you shouldn't use non-official images
============================================

As I said, something weird about that image was the fact that Centos 7
was in the configuration of grub 1, and also the fact that a grub 1
config was around at all. Investigating, I found that several files in
/etc/yum were not owned by rpm, which is kinda curious, because that
would mean some custom changes were made on the image.

For example, what is that yum7 directory:

# rpm -qf /etc/yum/yum7/
file /etc/yum/yum7 is not owned by any package
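
For the curious, one quick (and entirely hypothetical, I did not keep
the exact command) way to list everything in /etc/yum that rpm does not
know about:

# rpm -qf exits non-zero for files not owned by any package
for f in /etc/yum/*; do
    rpm -qf "$f" > /dev/null 2>&1 || echo "not owned: $f"
done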

A quick search on github led me to this:
https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86

So it seems that the image was created by taking a Centos 7 image,
upgrading it in place to Centos 8, and then uploading it, hence the
leftover Centos 7 files lying around. But still, the rpm is signed, so
I needed to verify that.

I found the original package on http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm , 

After a quick download and extract:
$ curl  http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm  |rpm2cpio |cpio -id 

we can see the following:
$ ls -l etc/yum
total 0
lrwxrwxrwx. 1 misc misc 14 28 sept. 12:36 pluginconf.d -> ../dnf/plugins
lrwxrwxrwx. 1 misc misc 18 28 sept. 12:36 protected.d -> ../dnf/protected.d
lrwxrwxrwx. 1 misc misc 11 28 sept. 12:36 vars -> ../dnf/vars

So on a freshly installed Centos 8 system, the symlinks should be there.

So this means that the upgrade issue is a leftover from the Centos 7 to
Centos 8 in-place upgrade operation. On Centos 7, /etc/yum/vars is a
directory; on Centos 8, it's a symlink. The newer version of the
upgrade script takes that into account (there is a mv), but not the one
that was used to create the image, so upgrading the yum package keeps
failing, exactly the failure we hit on our Centos 8 builders.

And since that's an unsupported operation, there is no chance of having
it fixed in Centos 8 (or RHEL 8 for that matter) by adding the right
scriptlet to yum.


Conclusion
==========

Since this mail is already long, my conclusion would be:

- trust no one, especially unofficial images on the AWS marketplace.

I would love to add that the truth and the official images are out
there, but I checked this morning, and that's still not the case.

-- 
Michael Scherer / He/Il/Er/Él
Sysadmin, Community Infrastructure




