Warm Reboot on Linux with kexec (Remember QEMM?)

If you are old enough to remember QEMM from back in the ’90s, along with other tools we used to squeeze every last byte of memory under the 640KB limit, you may remember a rather cool feature it had – warm reboot.

What is a Warm Reboot?

A reboot involves the computer going through its Power-On Self Test (POST). This takes time, often as much as a few minutes on some servers and workstations. When you are setting something up and need to test frequently that things come up correctly at boot time, the POST can make progress painfully slow. If only we had something like the warm reboot feature QEMM had back in the ’90s, which allowed us to reset the RAM and reboot DOS without resetting the entire machine and suffering the POST delay. Well, such a thing does exist in modern Linux.

Enter kexec

kexec allows us to do exactly this – load a new kernel, kill all processes, and hand over control to the new kernel, much as the bootloader does at boot time. What do we need for this magic to work? On a modern distro, not much – it is all already included. Let’s start with the script I use and explain what each component does:

#!/bin/bash

# Drop to the non-graphical target so Xorg stops and the
# nvidia modules can be unloaded.
systemctl isolate multi-user.target

# Unload the nvidia driver stack, dependent modules first.
rmmod nvidia_drm nvidia_modeset nvidia_uvm
rmmod nvidia

# Stage the currently running kernel, its initramfs and its
# boot parameters for the warm reboot.
kexec --load /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initramfs-$(uname -r).img \
      --command-line="$(cat /proc/cmdline)"

# Shut everything down and jump into the new kernel.
kexec --exec

Let’s look at the kexec lines first. uname -r returns the current kernel version. The $(uname -r) bash syntax allows us to take the output of a command and use it as a string in the invoking command. On a recent CentOS 8 system, here is what we get:

$ uname -r
4.18.0-193.6.3.el8_2.centos.plus.x86_64
$ echo $(uname -r)
4.18.0-193.6.3.el8_2.centos.plus.x86_64

The kernel and initial ramdisk usually have the kernel version in their names in /boot/:

$ ls /boot/
initramfs-4.18.0-193.6.3.el8_2.centos.plus.x86_64.img
vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64

So in our warm reboot script, vmlinuz-$(uname -r) will expand to vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64. The same happens with the initramfs file name.

Next, what is in /proc/cmdline? It contains the boot parameters that our currently running kernel was booted with, as provided in our grub configuration, for example:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64 root=ZFS=tank/ROOT quiet elevator=deadline transparent_hugepage=never

This is the minimum needed to boot the kernel. Once we have supplied this information, we initiate the shutdown and process purge, and hand over to the new kernel, using:

kexec --exec
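
If you want to confirm that the staging step actually worked before pulling the trigger, the kernel exposes a flag for it (not part of the script above, just a convenient sanity check):

$ cat /sys/kernel/kexec_loaded
1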

But what are the systemctl and rmmod lines about? They are mostly there to work around the finickiness of Nvidia drivers and GPUs. If you execute kexec immediately, with the Nvidia driver still loaded, the GPU won’t reset properly and won’t get properly re-initialised by the driver when the kernel warm-boots. So we have to rmmod the nvidia driver first. The legacy nvidia driver only includes the nvidia module. Newer versions also include nvidia_drm, nvidia_modeset and nvidia_uvm, which depend on the nvidia module, so we have to remove those first. But before we do that, we have to make sure that Xorg isn’t running, otherwise we won’t be able to unload the nvidia driver. To make sure the graphical environment isn’t running, we switch the runlevel target to multi-user.target (on a workstation we are probably running graphical.target by default). Once Xorg is no longer running, we can unload the nvidia driver modules. And with that done, we can proceed with the warm boot and enjoy the reboot time saving.
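
If the unload step turns out to be flaky on your machine, a slightly more defensive variant of the script (a sketch, not part of the original) can refuse to warm-boot while the driver is still resident:

# abort rather than kexec with the nvidia driver still loaded
if lsmod | grep -q '^nvidia'; then
    echo "nvidia modules still loaded, aborting warm reboot" >&2
    exit 1
fi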

Virtual Performance – Or Lack Thereof

People always seem very shocked when I suggest that virtualization comes with a very substantial performance penalty even when virtualization hardware extensions are used. Concerningly, this surprise often comes from people who have already either committed their organization’s IT infrastructure to virtualization, or have made firm plans to do so. The only thing I can conclude in these cases, unbelievable as it may appear, is that they haven’t done any performance testing of their own to assess the solution they are planning to adopt.

So I decided to document some basic performance tests that show just how substantial the performance hit of virtualization is.

Test Setup

Hardware:
Core2 Quad 3.2GHz
8GB of RAM
2x500GB 7200rpm SATA DM RAID1 for the main system
1x250GB 7200rpm SATA for testing

Virtual Test Configuration (VMware Player 4.0.4, Xen 4.1.2 (PV and HVM), KVM (RHEL6), VirtualBox 4.1.18):
CPU Cores: 4 (all)
RAM: 6GB
Disk: System booting off the 2×500 RAID1. Raw 250GB SATA disk passed to the VM.

Disk write caching was enabled in the VMware configuration. You may think that this unfairly gives the VM configuration an advantage, but as you will see from the results, even with this “cheat”, the performance is still very disappointing compared to bare metal. In any case, the amount of disk I/O is negligible – the caches and the working set always fit into memory.

Physical Test Configuration:
CPU Cores: 4 (all)
RAM: 6GB (limited using the mem=6G boot parameter – see the note below this list)
Disk: Booting directly off the same 250GB SATA disk used for VM testing, with the same kernel and configuration.
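
For the bare-metal run, the RAM limit is just a kernel boot parameter. On a grub-legacy system of that era it would be appended to the kernel line in /boot/grub/grub.conf, along these lines (the kernel version and root device here are illustrative, not the actual test machine’s configuration):

kernel /vmlinuz-2.6.32.59 ro root=/dev/sda1 mem=6G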

The Test

The test performed is a compile of the vanilla 2.6.32.59 Linux kernel. This is the script used for testing:

#!/bin/bash

echo Cleaning...
make clean > /dev/null 2>&1
make mrproper > /dev/null 2>&1
sync
echo 3 > /proc/sys/vm/drop_caches
echo Configuring...
make allmodconfig > /dev/null 2>&1
echo Syncing...
sync
find . -type f -print0 | xargs --null cat > /dev/null 
echo "Timing build..."
time (make -j16 all > /dev/null 2>&1)

The source tree is cleaned and all caches dropped. The allmodconfig configuration is used to get some degree of disk I/O testing by creating the maximum number of files. Caches are then primed by pre-loading all the source files, in order to measure the CPU and RAM subsystems more accurately without bottlenecking on disk I/O. The CPU in the system has 4 cores, and 16 build threads are used to ensure the CPU and memory I/O are saturated, without creating enough memory pressure to cause swapping.

On the host and in the guest, all unnecessary services and processes were stopped (especially crond which could theoretically cause additional load on the system that would distort the results).

All tests were carried out 3 times in a row, and the best result for each is considered here (the differences between the runs were minimal).

This is very much a redneck, brute-force test. There isn’t much finesse to it. But I like tests like this because they cannot be cheated with the sort of smoke and mirrors illusions that virtualization software is very good at applying.

Results

Bare metal:           1,042.523s  (100%)
Xen 4.1.2 (PV):       1,316.984s  (79.16%)
VMware ESXi 5.0.0:    1,361.321s  (76.58%)
VMware Player 5.0.0:  1,478.732s  (70.50%)
VMware Player 4.0.4:  1,520.023s  (68.59%)
KVM (RHEL6):          1,691.849s  (61.62%)
Xen 4.1.2 (HVM):      2,839.442s  (36.72%)
VirtualBox 4.1.18:    8,876.945s  (11.74%)

Note: No, this is not a typo – VirtualBox really is that bad.

To make this difference easier to visualise, here it is in graph form:

Virtualization Performance – Time in Seconds

To give a better idea of relative performance, here it is in % points, with bare metal being 100%.

Virtualization Performance – Relative Difference

The difference is substantial even with the least poorly performing hypervisor. Performance is down by more than a fifth (21%) with paravirtualized Xen compared to bare metal, down by nearly a quarter (23%) with VMware ESXi, and worse still with KVM. Or if you prefer to look at it the other way around, bare metal is more than a quarter as fast again (26.32%) as the best performing hypervisor on the same hardware.
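
For anyone who wants to double-check the arithmetic, the percentages are simply ratios of the measured times. A quick check with bc (illustrative only, any calculator will do):

$ echo "scale=4; 1042.523 / 1316.984" | bc    # Xen PV speed relative to bare metal
.7916
$ echo "scale=4; 1316.984 / 1042.523" | bc    # extra time Xen PV needs vs bare metal
1.2632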

Don’t get me wrong – virtualization is handy for all sorts of low-performance tasks. In cases where it is used to consolidate a number of mostly idle systems into one mostly idle system, it brings clear benefits. (Except maybe in the case of VirtualBox – the performance there is just too appalling for anything, and HVM Xen is pretty poor, too.) But for uses where performance is important, thoughts of virtualizing need to undergo a serious reality check. Even if your system is designed to scale completely horizontally, requiring 26%+ of extra hardware (best case scenario, it could be a lot worse depending on which hypervisor you use) is likely to put a significant strain on your budget and running costs.

Note: It is worth stressing that these tests are carried out on hardware with VT-x, and support for this is enabled and used for all the tested hypervisors. So the results here are based on optimal hardware support.

Here is a link to an excellent paper on virtualisation performance overheads with similar findings to my brief research.


Hardware Accelerated SSL on SheevaPlug (Marvell Kirkwood ARM) Using OpenSSL on Fedora

I have recently been spending quite a lot of time working on Linux on various ARM devices. It is quite amazing what ARM hardware is capable of nowadays. One of the most popular ARM based machines available is the SheevaPlug. Its performance is pretty good for a small server – my experience shows that the 1.2GHz Marvell Kirkwood 88F6281 compares quite favourably to the likes of the 1.66GHz Intel Atom N450 in terms of both server performance and especially power usage. Atom N450 systems have a typical power draw of about 22W idle and 28W under load – a far cry from the supposed 7.6W total of the 5.5W N450 + 2.1W NM10. The SheevaPlug, on the other hand, draws 2.3W idle and 7W under load.

In some areas, however, the Atom does hold a performance advantage, especially in usage that requires heavy number crunching – unlike the Marvell Kirkwood, the Atom N450 has an FPU and SIMD capability via the SSE/SSE2/SSSE3 instruction sets. One class of applications that gets better performance on the Atom N450 is anything doing encryption, for example OpenSSL. Or does it…

Not quite. The Kirkwood ARM has an ace up its sleeve, and as it turns out, it is one powerful enough to close the gap against a processor with 4x the power budget. It has a hardware crypto engine that supports MD5, SHA1 and AES-128 acceleration.

Unfortunately, mainstream Linux distributions don’t come with the hardware crypto acceleration enabled, and most of the documentation available is sufficiently out of date to be inapplicable to the current generation of distributions. All of it points at OCF Linux, which hasn’t been updated for kernels past 2.6.33 and OpenSSL past 0.9.8n, both of which are deprecated. I have modified the kernel patches to make them work on 2.6.35, but unfortunately the cryptodev driver uses the locked ioctl operation, which has been removed from the kernel starting with 2.6.36, so further modifications are required to make it work on later kernels. OCF Linux also doesn’t appear to have been updated since late 2010. But things are not as bad as they initially seem – it turns out that there is an alternative.

The reason kernel patches are required is because acceleration depends on the BSD style cryptodev kernel interface. There is an alternative, more up to date project that provides this much less intrusively: Cryptodev-linux. It provides a standalone driver that doesn’t require the entire kernel to be recompiled for it, and it works with the 2.6.36+ kernels.

That just leaves OpenSSL support. Well, it turns out that OpenSSL 1.0.0 already comes with support for cryptodev hardware offload, it just isn’t enabled by default. It has to be enabled during the configure stage by providing -DHAVE_CRYPTODEV (for encryption offload) and -DUSE_CRYPTODEV_DIGESTS (for hashing offload). If you are building against Cryptodev-linux you will also have to provide the -DHASH_MAX_LEN=64 parameter – this is normally in OCF‘s cryptodev.h header file, but isn’t present in the header files that Cryptodev-linux provides. Not a big deal, but something to bear in mind when you are building your own OpenSSL with cryptodev engine support.
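
As a rough illustration of what this looks like when building OpenSSL 1.0.0 from source (the shared and threads options here are just typical choices, not a prescription – the Fedora spec file line further down shows the exact set used for the packages at the end of this post):

./config shared threads -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64
make
make install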

So, how big a difference does the Kirkwood's acceleration make? Quite a substantial one. Here is what the openssl speed test produces:

Kirkwood without cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 1870065 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 516074 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 132474 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 33342 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 4171 aes-128 cbc's in 3.00s

Kirkwood with cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 85277 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 64 size blocks: 82960 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 59806 aes-128-cbc's in 0.03s
Doing aes-128-cbc for 3s on 1024 size blocks: 40939 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 8227 aes-128-cbc's in 0.00s

The results show, predictably, that with very small (unrealistically small) data blocks, software-only userspace crypto is faster due to less context switching. With 1KB blocks, however, hardware crypto is 23% faster, and with 8KB blocks the hardware engine goes twice as fast as the software-only option. But what is really impressive is the reduction in CPU time. Because the hardware crypto engine is asynchronous, there is practically no CPU time required when using it, which is important since it leaves the CPU free to get on with other tasks.
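
One caveat when reading these numbers: openssl speed reports CPU time by default, which is why the cryptodev runs above show fractions of a second while still completing more operations. To benchmark against wall-clock time instead, the tool accepts an option for that (not used in the runs above):

# measure against wall-clock time instead of CPU user time
openssl speed -elapsed -evp aes-128-cbc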

For comparison, here are the Atom N450 results:

# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 3813930 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 1098375 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 294884 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 74520 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 9245 aes-128-cbc's in 2.99s

So the Atom is faster all around – on 1KB blocks it is 82% faster, which reduces to a 12% advantage using 8KB blocks. But let us not forget that we could, in theory, run two instances of OpenSSL, one with hardware offload and one without, which would give us the combined total performance of both, if that is all we needed the machine to do. This would give us figures of approximately:

1KB: 33342+40939=74281
8KB: 4171+8227=12398

This ties with the Atom using 1KB blocks, and beats it by 34% using 8KB blocks – all in a power envelope 4x smaller. Pretty impressive.

Installing Cryptodev-linux is trivial – just the usual “make; make install” procedure after extracting the tarball (make sure you have the kernel headers for your kernel installed and available in /lib/modules/$(uname -r)/build/).
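
In practice that looks roughly like the following (the archive name is illustrative – substitute whatever version you downloaded):

tar xzf cryptodev-linux-<version>.tar.gz
cd cryptodev-linux-<version>
make                 # builds the cryptodev.ko module against the running kernel's headers
make install
modprobe cryptodev
ls -l /dev/crypto    # the device node the OpenSSL cryptodev engine talks to

With a cryptodev-enabled OpenSSL build, openssl engine should then list a cryptodev entry.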

I mentioned above the additional parameters required to make OpenSSL build with cryptodev support. In Fedora 13’s OpenSSL source package, you can edit the relevant line in the spec file. The relevant section in my version reads:

./Configure --prefix=/usr --openssldir=%{_sysconfdir}/pki/tls ${sslflags} zlib enable-camellia enable-seed enable-tlsext enable-rfc3779 enable-cms enable-md2 no-idea no-mdc2 no-rc5 no-ec no-ecdh no-ecdsa --with-krb5-flavor=MIT --enginesdir=%{_libdir}/openssl/engines --with-krb5-dir=/usr -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64 shared threads ${sslarch} fips

In case you cannot modify/build it yourself, here are the packages:
/wp-content/uploads/2011/05/openssl-1.0.0-1.kw.fc13.src.rpm
/wp-content/uploads/2011/05/openssl-1.0.0-1.kw.fc13.armv5tel.rpm
/wp-content/uploads/2011/05/openssl-devel-1.0.0-1.kw.fc13.armv5tel.rpm