<div dir="ltr">Thank you Michael for the detailed information. <div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 28, 2020 at 4:37 PM Michael Scherer &lt;<a href="mailto:mscherer@redhat.com">mscherer@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

<br>

The intro<br>

=========<br>

<br>

So we are trying to get the tests running on Centos 8. Upon<br>

investigation, we had a few tests failing, and thanks to Deepshika&#39;s work<br>

on <a href="https://github.com/gluster/project-infrastructure/issues/20" rel="noreferrer" target="_blank">https://github.com/gluster/project-infrastructure/issues/20</a>, this <br>

was narrowed down to 6 tests, and I decided to investigate them one by one.<br>

<br>

I fixed some of them infra side, but one was caused by 1 single bug<br>

triggering failure in 3 tests, and RH people pointed us to a kernel<br>

issue: <a href="https://github.com/gluster/glusterfs/issues/1402" rel="noreferrer" target="_blank">https://github.com/gluster/glusterfs/issues/1402</a> <br>

<br>

As we are running Centos 8.2, it should have been fixed, but it wasn&#39;t.<br>

<br>

So the question was &quot;why is the kernel not up to date on our builder&quot;,<br>

which will be our epic quest for today.<br>

<br>

The builders<br>

============<br>

<br>

We are running our test builds on AWS EC2. We have a few centos 8<br>

builders, installed from the only public image we had at that time, an<br>

unofficial one from Frontline. While I tend to prefer official images<br>

(for reasons that will be clear later), this was the easiest way to get<br>

it running, while waiting for the Centos 8 images that would for sure<br>

be there &quot;real soon&quot;.<br>

<br>

We have automated upgrade on the builders with cron, so they should be<br>

using the latest kernel, and if that is not the case, it should just be<br>

1 reboot away. As we reboot on failure, and as the kernel version seldom<br>

impacts tests, we are usually good on that point, but maybe the Centos 8<br>

scripts were a bit different. This was a WIP after all.<br>
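<br>
As an aside, the &quot;1 reboot away&quot; assumption can be checked mechanically.<br>
What follows is only a minimal sketch, not what our cron scripts do: the<br>
names newest_kernel and needs_reboot are made up for illustration, and it<br>
assumes GNU sort with -V, whose ordering is close to, but not the same as,<br>
rpm&#39;s own version comparison.<br>

```shell
#!/bin/sh
# Hypothetical helpers, not part of the builders' real scripts.

# Print the highest version among the arguments, using GNU sort -V.
newest_kernel() {
    printf '%s\n' "$@" | sort -V | tail -n 1
}

# Succeed (exit 0) when the running kernel ($1) is not the newest
# one among the installed kernels (all remaining arguments).
needs_reboot() {
    running=$1
    shift
    [ "$running" != "$(newest_kernel "$@")" ]
}
```

On a builder, needs_reboot could be given $(uname -r) as first argument,<br>
followed by the output of rpm -q kernel-core --qf &#39;%{VERSION}-%{RELEASE}.%{ARCH}\n&#39;.<br>
A machine where this still succeeds right after a reboot is not booting<br>
the kernel it has installed.<br>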

<br>

I take the 1st builder I get, connect to it and check the kernel:<br>

# uname -a <br>

Linux <a href="http://builder212.int.aws.gluster.org" rel="noreferrer" target="_blank">builder212.int.aws.gluster.org</a> 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux<br>

<br>

We need a newer one, so I reboot and test again:<br>

<br>

# uname -a <br>

Linux <a href="http://builder212.int.aws.gluster.org" rel="noreferrer" target="_blank">builder212.int.aws.gluster.org</a> 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux<br>

<br>

Ok, that&#39;s a problem. I check the grub configuration, and indeed there is<br>

no trace of anything but 1 single kernel. Curiously, there is also a<br>

trace of Centos 7 on the disk, a point that will be important<br>

later, and that does not smell good:<br>

<br>

# cat /etc/grub.conf <br>

default=0<br>

timeout=0<br>

<br>

<br>

title CentOS Linux 7 (3.10.0-957.1.3.el7.x86_64)<br>

        root (hd0)<br>

        kernel /boot/vmlinuz-3.10.0-957.1.3.el7.x86_64 ro root=UUID=f41e390f-835b-4223-a9bb-9b45984ddf8d console=hvc0 LANG=en_US.UTF-8<br>

        initrd /boot/initramfs-3.10.0-957.1.3.el7.x86_64.img<br>

<br>

<br>

So the configuration of the kernel was not changed, which means<br>

something went wrong.<br>

<br>

The kernel<br>

==========<br>

<br>

The Linux kernel is a quite important part of the system and requires<br>

some special care for upgrades. For example, you have to generate an<br>

initramfs based on what is on the disk, you have to modify<br>

configuration files for grub/lilo/etc, and so on. <br>

<br>

While there is an effort to move the mess from grub to the firmware of the<br>

system with UEFI and systemd, we are not there yet, so on EL8, we are<br>

still doing it the old way, with scripts run after package<br>

installation (called scriptlets in the rest of this document). These<br>

scripts are shipped with the package, and in the case of the kernel, the<br>

work is done by /bin/kernel-install, as can be seen with the command<br>

that shows scriptlets:<br>

<br>

# rpm -q --scripts kernel-core-4.18.0-193.19.1.el8_2.x86_64<br>

postinstall scriptlet (using /bin/sh):<br>

<br>

if [ `uname -i` == &quot;x86_64&quot; -o `uname -i` == &quot;i386&quot; ] &amp;&amp;<br>

   [ -f /etc/sysconfig/kernel ]; then<br>

  /bin/sed -r -i -e &#39;s/^DEFAULTKERNEL=kernel-smp$/DEFAULTKERNEL=kernel/&#39; /etc/sysconfig/kernel || exit $?<br>

fi<br>

preuninstall scriptlet (using /bin/sh):<br>

/bin/kernel-install remove 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?<br>

if [ -x /usr/sbin/weak-modules ]<br>

then<br>

    /usr/sbin/weak-modules --remove-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?<br>

fi<br>

posttrans scriptlet (using /bin/sh):<br>

if [ -x /usr/sbin/weak-modules ]<br>

then<br>

    /usr/sbin/weak-modules --add-kernel 4.18.0-193.19.1.el8_2.x86_64 || exit $?<br>

fi<br>

/bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz || exit $?<br>

<br>

Here, we can see that there are 3 shell scriptlets, run in 3<br>

different phases of the package upgrade:<br>

- postinstall<br>

- preuninstall<br>

- posttrans<br>

<br>

Postinstall is run after the installation of the package, preuninstall<br>

is run before the removal, and posttrans is run once the whole<br>

transaction is finished. See <br>

<a href="https://rpm-packaging-guide.github.io/#triggers-and-scriptlets" rel="noreferrer" target="_blank">https://rpm-packaging-guide.github.io/#triggers-and-scriptlets</a> <br>

<br>

The interesting one is the posttrans one, since that&#39;s the one that<br>

installs the kernel configuration, and either it failed, or it wasn&#39;t run. <br>

<br>

So to verify that, the next step is to run the command:<br>

<br>

/bin/kernel-install add 4.18.0-193.19.1.el8_2.x86_64 /lib/modules/4.18.0-193.19.1.el8_2.x86_64/vmlinuz<br>

<br>

And from a quick look, this seemed to work fine. And a quick reboot<br>

later, I confirmed it worked fine, and the kernel was now up to date.<br>

<br>

While I could have stopped here, I think it is important to find the<br>

real problem.<br>

<br>

On the internals of rpm package upgrades<br>

========================================<br>

<br>

Since the scriptlet that has the issue is %posttrans, that means the<br>

transaction, i.e. the complete upgrade, failed. %posttrans is run after<br>

the transaction, if it was successful, and if it wasn&#39;t run, that<br>

means the transaction wasn&#39;t successful.<br>

<br>

This is usually not something that happens, unless people reboot during<br>

the upgrade (a seemingly bad idea, but one that happened on a regular basis<br>

in the past on end users&#39; laptops), or if an rpm failed to upgrade<br>

properly. <br>

<br>

While the hypothesis of a reboot during upgrade was tempting, there is no<br>

way it could happen on 3 different unused systems several times. So I<br>

went on to the next step and ran yum upgrade to check. And while looking at<br>

the yum output, these lines caught my eye:<br>

<br>

  Upgrading        : yum-4.2.17-7.el8_2.noarch                                129/526<br>

Error unpacking rpm package yum-4.2.17-7.el8_2.noarch<br>

  Upgrading        : python3-dnf-plugins-core-4.0.12-4.el8_2.noarch           130/526<br>

error: unpacking of archive failed on file /etc/yum/pluginconf.d: cpio: File from package already exists as a directory in system<br>

error: yum-4.2.17-7.el8_2.noarch: install failed<br>

<br>

It seems that yum itself failed to be upgraded, which in turn means the<br>

whole transaction failed, which in turn means that:<br>

- %posttrans scriptlet was not run (so no bootloader config change)<br>

- yum would try to upgrade again in the future (and fail again)<br>

- all other packages would be on the disk, so the kernel is installed,<br>

just not used on boot<br>
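<br>
A failure like this is easy to miss among several hundred packages of<br>
output. As a minimal sketch, assuming the error lines always have the<br>
&quot;error: NAME: install failed&quot; shape seen above (this one sample is all the<br>
evidence for that), a small filter could surface it. The function name<br>
failed_packages is made up for illustration.<br>

```shell
#!/bin/sh
# Hypothetical helper: read yum/dnf output on stdin and print the
# name of every package reported as "install failed".
failed_packages() {
    sed -n 's/^error: \(.*\): install failed$/\1/p'
}
```

Fed with the saved output of a yum upgrade run, for example<br>
cat upgrade.log | failed_packages (upgrade.log being a hypothetical file),<br>
it prints one package per line, and nothing if the transaction went fine.<br>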

<br>

The error &quot;File from package already exists as a directory in system&quot;<br>

is quite specific, and to explain the issue, we must look at how rpm<br>

upgrades files. <br>

<br>

The naive way of doing package upgrades is to first remove the old files,<br>

and then add the new ones. But this is not exactly a good idea if there<br>

is an issue during the upgrade, so rpm first adds the new files, and<br>

later removes what needs to be removed. This way, you do not lose files<br>

in case of a problem such as a reboot, hardware errors, etc. <br>

<br>

However, this scheme fails in a very specific case: replacing a<br>

directory by a symlink/file, since you cannot create the symlink while<br>

the directory is still there (especially if there is<br>

something in the directory). There are various workarounds, see <br>

<a href="https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/" rel="noreferrer" target="_blank">https://docs.fedoraproject.org/en-US/packaging-guidelines/Directory_Replacement/</a><br>

<br>

But this is a long-standing, well-known limitation of various packaging<br>

systems.<br>
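<br>
To make the failing case concrete, here is a minimal sketch that reports<br>
paths which are real directories where a package wants to put a symlink<br>
or a file. The function name blocking_dirs is hypothetical, and the list<br>
of paths to check has to come from somewhere else (for example from the<br>
content of the new package).<br>

```shell
#!/bin/sh
# Hypothetical helper: print each given path that is a real directory,
# i.e. one that would prevent creating a symlink or file of that name.
blocking_dirs() {
    for path in "$@"; do
        # A symlink pointing at a directory is fine, skip it.
        if [ -L "$path" ]; then
            continue
        fi
        if [ -d "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

Paths that do not exist are silently ignored, since nothing blocks the<br>
package there.<br>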

<br>

The symptoms are that error message, and various files left on disk with<br>

a random suffix:<br>

<br>

# ls /etc/yum<br>

 fssnap.d       <br>

 pluginconf.d.bak    <br>

&#39;pluginconf.d;5f719e89&#39;<br>

 protected.d<br>

&#39;protected.d;5f71a160&#39; <br>

&#39;protected.d;5f71a1be&#39; <br>

&#39;vars;5f71a8c8&#39;<br>

 yum7<br>

 pluginconf.d <br>

&#39;pluginconf.d;5e4ba094&#39; <br>

&#39;pluginconf.d;5f71a052&#39;<br>

 protected.d.bak <br>

&#39;protected.d;5f71a18f&#39;   <br>

 vars                    <br>

 version-groups.conf<br>

<br>

So one way to fix this is to get the directory out of the way, another is to<br>

fix the package (cf packaging doc). I decided to fix it by moving<br>

/etc/yum/pluginconf.d; it then failed again with protected.d and vars, which needed the same treatment.<br>
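<br>
Moving the directories one failure at a time works, but all of them can<br>
be handled in one go. A minimal sketch, assuming a .bak suffix as the<br>
backup name (the function name move_aside is made up, and this is an<br>
illustration rather than the exact commands that were typed):<br>

```shell
#!/bin/sh
# Hypothetical helper: rename each real directory to NAME.bak so that
# a package can create a symlink in its place. Symlinks and missing
# paths are left alone, and an existing NAME.bak is never overwritten.
move_aside() {
    for path in "$@"; do
        if [ -L "$path" ]; then
            continue
        fi
        if [ ! -d "$path" ]; then
            continue
        fi
        if [ -e "$path.bak" ]; then
            echo "not moving $path: $path.bak already exists" >&2
            return 1
        fi
        mv -- "$path" "$path.bak"
    done
}
```

It would be called with the three paths from the error messages, that is<br>
/etc/yum/pluginconf.d, /etc/yum/protected.d and /etc/yum/vars, before<br>
running yum upgrade again.<br>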

<br>

<br>

So once that&#39;s done, yum can be upgraded, and so the system should<br>

work, right?<br>

<br>

<br>

But as we are diving deep, why is there an upgrade issue in the first<br>

place? Where does this yum package come from? Why has no one reported<br>

such a glaring problem?<br>

<br>

<br>

A quick look on a fresh VM shows that it indeed comes from Centos:<br>

<br>

[centos@ip-172-31-31-150 ~]$ rpm -qi yum<br>

Name        : yum<br>

Version     : 4.0.9.2<br>

Release     : 5.el8<br>

Architecture: noarch<br>

Install Date: Sat 26 Oct 2019 04:44:11 AM UTC<br>

Group       : Unspecified<br>

Size        : 60284<br>

License     : GPLv2+ and GPLv2 and GPL<br>

Signature   : RSA/SHA256, Tue 02 Jul 2019 02:09:17 AM UTC, Key ID 05b555b38483c65d<br>

Source RPM  : dnf-4.0.9.2-5.el8.src.rpm<br>

Build Date  : Mon 13 May 2019 07:35:13 PM UTC<br>

Build Host  : <a href="http://ppc64le-01.mbox.centos.org" rel="noreferrer" target="_blank">ppc64le-01.mbox.centos.org</a><br>

Relocations : (not relocatable)<br>

Packager    : CentOS Buildsys &lt;<a href="mailto:bugs@centos.org" target="_blank">bugs@centos.org</a>&gt;<br>

Vendor      : CentOS<br>

URL         : <a href="https://github.com/rpm-software-management/dnf" rel="noreferrer" target="_blank">https://github.com/rpm-software-management/dnf</a><br>

Summary     : Package manager<br>

Description :<br>

Utility that allows users to manage packages on their systems.<br>

It supports RPMs, modules and comps groups &amp; environments.<br>

<br>

<br>

It looks like a legit rpm, so there goes my hope of being able to just blame some bad rpm. But that&#39;s still odd.<br>

<br>

On why you shouldn&#39;t use non-official images<br>

============================================<br>

<br>

As I said, something weird about that image was the fact that Centos 7 was in the grub 1 configuration, but <br>

also the fact that a grub 1 config was around at all. Investigating, I found that several files in /etc/yum were not owned by any rpm, which is <br>

kinda curious, because that would mean some custom change was made to the image.<br>

<br>

For example, what is that yum7 directory:<br>

<br>

# rpm -qf /etc/yum/yum7/<br>

file /etc/yum/yum7 is not owned by any package<br>

<br>

A quick search on github led me to this:<br>

<a href="https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86" rel="noreferrer" target="_blank">https://github.com/johnj/centos-to8-upgrade/blob/master/to8.sh#L86</a><br>

<br>

So it seems that the image was created by taking a Centos 7 image, upgrading it in place to Centos 8, and then uploading it, hence<br>

the leftover files from Centos 7. But still, the rpm is signed, so I need to verify that.<br>

<br>

I found the original package on <a href="http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm" rel="noreferrer" target="_blank">http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm</a>.<br>

<br>

After a quick download and extract:<br>

$ curl  <a href="http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm" rel="noreferrer" target="_blank">http://vault.centos.org/8.0.1905/BaseOS/x86_64/os/Packages/yum-4.0.9.2-5.el8.noarch.rpm</a>  |rpm2cpio |cpio -id <br>

<br>

we can see:<br>

$ ls -l etc/yum<br>

total 0<br>

lrwxrwxrwx. 1 misc misc 14 28 sept. 12:36 pluginconf.d -&gt; ../dnf/plugins<br>

lrwxrwxrwx. 1 misc misc 18 28 sept. 12:36 protected.d -&gt; ../dnf/protected.d<br>

lrwxrwxrwx. 1 misc misc 11 28 sept. 12:36 vars -&gt; ../dnf/vars<br>

<br>

So on a freshly installed Centos 8 system, the symlinks should be there.<br>

<br>

So this means that the upgrade issue is a leftover from the in-place Centos 7 to Centos 8 upgrade. On Centos 7, /etc/yum/vars is <br>

a directory; on Centos 8, that&#39;s a symlink. The newer version of the upgrade script takes that into account (there is a mv), but not<br>

the one that was used to create the image, and so upgrading yum in place fails, as it did when we tried on Centos 8.<br>

<br>

And since that&#39;s an unsupported operation, there is no chance to have it fixed in Centos 8 (or RHEL 8 for that matter) by adding<br>

the right scriptlet in yum.<br>

<br>

<br>

Conclusion<br>

==========<br>

<br>

Since this mail is already long, my conclusion would be:<br>

<br>

- trust no one, especially unofficial images on AWS marketplace.<br>

<br>

I would love to add that the truth and the official images are out there, but I checked this morning, and that is still not the case.<br>

<br>

-- <br>

Michael Scherer / He/Il/Er/Él<br>

Sysadmin, Community Infrastructure<br>

<br>

<br>

<br>

_______________________________________________<br>

Gluster-infra mailing list<br>

<a href="mailto:Gluster-infra@gluster.org" target="_blank">Gluster-infra@gluster.org</a><br>

<a href="https://lists.gluster.org/mailman/listinfo/gluster-infra" rel="noreferrer" target="_blank">https://lists.gluster.org/mailman/listinfo/gluster-infra</a></blockquote></div>