Aug 9 ’10

Linux on System z Kernel Dumps

by Editor in z/Journal

The Linux kernel code is stable, but even the best kernel hackers are only human and make mistakes. So, while kernel crashes are rare, they can occur and are unpleasant events; all services the machine provides are interrupted and the system must be rebooted. To find the cause of such crashes, kernel dumps containing the crashed system state are often the only approach.

When a user-space crash occurs, a core dump is written containing memory and register contents at the time of the crash. Writing such core dumps is possible because the Linux kernel is still fully operational. This is clearly more difficult when the kernel itself crashes. Either the dying kernel must dump itself or some other program independent from the kernel must perform that task.

This article reviews Linux kernel dump methods, describes the current Kdump process, compares System z dump tools, and offers an introduction to Linux dump analysis tools.

History

The Linux Kernel Crash Dumps (LKCD) project implemented one of the first Linux kernel dump methods. However, Linus Torvalds never accepted those patches into the Linux kernel because the currently active kernel was responsible for creating the dump. This meant the code creating the dump relied on kernel infrastructure that could have been affected by the original kernel problem. For example, if the kernel crashed because of a disk driver failure, a successful LKCD dump was unlikely because that code was also needed to write the dump. LKCD is no longer active; the last LKCD kernel patch was released for Linux 2.6.10.

Diskdump and Netdump were other Linux dump mechanisms; both had problems similar to LKCD's and were never accepted into the mainline kernel.

For Linux on System z, IBM developers used another approach: standalone dump tools. When a kernel crash occurs, the standalone dump tool is started and loads into the first 64KB of memory, which Linux doesn’t use. Available since 2001, this functionality writes the memory and register information to a separate DASD partition or to a channel-attached tape device. z/VM also supports VMDUMP, a hypervisor dump method.

Kdump Operation

Kdump, developed after LKCD and its successors failed to gain acceptance, uses a completely separate kernel to write the dump. With Kdump, the first (production) kernel reserves some memory for a second (Kdump) kernel; depending on the architecture, 128MB to 256MB are currently reserved. The second kernel is loaded into the reserved memory region; if the first kernel crashes, kexec boots the second kernel from memory, which then writes the dump file. Kdump was accepted upstream for Linux 2.6.13 in 2005; Red Hat Enterprise Linux 5 (RHEL 5) and SUSE Linux Enterprise Server 10 (SLES 10) were the first distributions to include it.

Kdump is supported on the i686, x86_64, ia64, and ppc64 architectures. Depending on the architecture, the first and second kernels may or may not be the same. When the second kernel gets control, it runs in the reserved memory and doesn’t alter the rest of memory. It then exports all memory from the first kernel to user space with two virtual files: /dev/oldmem and /proc/vmcore. The /proc/vmcore file is in Executable and Linkable Format (ELF) core dump format and contains memory and CPU register information. An init script (see Figure 1) tests whether /proc/vmcore exists and copies the contents into a local file system or sends it to a remote host using scp. After the dump is saved, the first kernel is started again using the reboot command.

 

System z Dumps

Unlike Kdump, the Linux on System z standalone dump tools don't require reserved memory; they're installed on a storage device, and IPLing (performing an Initial Program Load) from that device triggers the dump process.

Note that some of the features described in the following might be available only on the latest Red Hat Enterprise Linux and SUSE Linux Enterprise Server distributions.

 

DASD and tape standalone dump tools: Standalone dump tools for DASD and channel-attached tape devices are available. The tools are written in Assembler and are loaded into the first 64KB of memory, which isn't used by the Linux kernel.

System z standalone dumps use two tools from the s390-tools package: zipl prepares dump devices, and zgetdump copies kernel dumps from DASD or tape into a file system (see Figure 2).

 

These steps prepare partition /dev/dasdd1 on DASD 1000 for a standalone dump:

  1. Format DASD: dasdfmt /dev/dasdd.
  2. Create a partition: fdasd -a /dev/dasdd.
  3. Install the dump tool: zipl -d /dev/dasdd1.

After a system crash, an IPL from the DASD device creates the dump. Before the IPL, all CPUs must be stopped and the register state of the boot CPU saved by issuing the commands in Figure 3 on the VM console of the crashed guest. After rebooting the guest, the dump can be copied into a file system using zgetdump:

# zgetdump /dev/dasdd1 > /mydumps/dump.s390

It’s also possible to copy the dump to a remote system using Secure Shell (ssh):

# zgetdump /dev/dasdd1 | ssh user@host "cat > dump.s390"

The zipl and zgetdump tools for channel-attached devices currently support single- and multi-volume ECKD DASD, single-volume Fixed Block Architecture (FBA) DASD, and 3480, 3490, 3590, and 3592 tape.

 

SCSI dump: Support was also added for Linux on System z guests and LPARs that have only Small Computer System Interface (SCSI) Fibre Channel Protocol (FCP) disks. Accessing these disks in a Storage Area Network (SAN) using zSeries FCP (ZFCP) is complex, and the support couldn't be fitted into the first 64KB of memory as it was for the DASD and tape dump tools. Instead, a second Linux kernel is used that's conceptually similar to the Kdump approach. But unlike Kdump, this kernel isn't loaded into guest memory in advance.

The ZFCP dump kernel is IPLed from SCSI disk using a new dump operand on IPL. With this operand, the first few megabytes of memory are saved in a hidden area the Processor Resource/Systems Manager (PR/SM) or z/VM hypervisor owns. The ZFCP dump kernel is then loaded into that saved memory region. Using a z-specific hardware interface, the ZFCP dump kernel can access the hidden memory. As with Kdump, the ZFCP dump kernel then exports all memory using a virtual file. A ZFCP dump user-space application running in a ramdisk then copies that file into a local file system on the SCSI disk where the dump tool was installed (see Figure 4).

 

Preparing a SCSI disk for ZFCP dumps requires these steps:

  1. Prepare partition on SCSI disk: fdisk /dev/sdb.
  2. Create ext3 file system: mke2fs -j /dev/sdb1.
  3. Mount file system: mount /dev/sdb1 /mnt.
  4. Install SCSI dump tool: zipl -D /dev/sdb1 -t /mnt.

When running in a Logical Partition (LPAR), IPLing that SCSI disk using the SCSI dump load type on the HMC triggers a dump.

When running under z/VM, the dump device must be defined using a cp command (the example shown uses WWPN=500507630300C562, LUN=401040B400000000):

# set dumpdev portname 50050763 0300C562 lun 401040B4 00000000

To trigger the dump under z/VM, a ZFCP adapter (device number 1700 in this example) must be specified for IPL with the dump (see Figure 5). The ZFCP dump tool writes the dump as a file into the specified file system. This file can be used directly for dump analysis; no zgetdump tool is needed.

 

VMDUMP: Under the z/VM hypervisor, the VMDUMP command can be used to create dumps for VM guests. To use VMDUMP, no preparation of any dump device is required; the dump file is written into virtual SPOOL. This dump mechanism should be used only for small guests because it’s quite slow. A Linux tool called vmur copies dump SPOOL files into Linux, and the vmconvert tool converts VMDUMPs into Linux-readable dump format (the --convert option on vmur can also convert the dump on the fly while receiving it from SPOOL). VMDUMP is the only non-disruptive dump method for Linux on System z and is also the only method to dump Named Saved Systems (NSSs).

Example of VMDUMP use:

  1. Trigger VMDUMP on the VM console: #cp vmdump.
  2. Boot Linux.
  3. Receive dump in Linux format from reader: vmur rec -c <dump spool id> dumpfile.

Automatic dump: When the Linux kernel crashes because of a non-recoverable error, normally, a kernel function named panic is called. By default, panic stops Linux. With Linux on System z, panic can be configured to automatically take a dump and re-IPL. The dump device is specified in the file /etc/sysconfig/dumpconf. Figure 6 shows the configuration for a DASD dump device. The service script dumpconf enables the configuration:

# service dumpconf start

and chkconfig can make the behavior persistent across reboot:

# chkconfig --add dumpconf

With this configuration, DASD device 0.0.4000 will be used for the dump in case of a kernel crash. After the dump process finishes, the system is rebooted. To instead stop the system after dumping, specify ON_PANIC=dump instead of ON_PANIC=dump_reipl.
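As a sketch of the configuration described above (key names follow the s390-tools dumpconf conventions; verify the exact syntax against your distribution's documentation), /etc/sysconfig/dumpconf for a DASD dump device might contain:

```shell
# /etc/sysconfig/dumpconf (sketch for a DASD dump device)
ON_PANIC=dump_reipl   # dump on kernel panic, then re-IPL the system
DUMP_TYPE=ccw         # CCW dump device, i.e. DASD
DEVICE=0.0.4000       # bus ID of the DASD holding the dump tool
```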

How System z Dumps Compare to Kdump

When Kdump was released, the IBM Linux on System z team considered adopting that dump method, but rejected it due to reliability concerns.

The IPL mechanism on System z performs a hardware reset on all attached devices, so the dump tools can work with fully initialized devices. An IPL to start the dump process will always work, even if CPUs are looping with disabled interrupts. Like Kdump, the System z dump tools are independent of the state of the first kernel; however, the System z dump tools don't share memory with the first kernel, so there's no way to overwrite the code of the tools, as can happen with Kdump. Another advantage of the System z dump tools is that they don't require reserved memory. This is especially important under z/VM with many guests.

The main disadvantage of the System z tools is that they’re different from Kdump, which is used on most other platforms; this makes them unfamiliar to many. Installer dump support under Red Hat and SUSE Linux is limited for System z. Kdump also has filtering mechanisms for dumping only kernel pages that are important for dump analysis, reducing dump size and dump time.

Dump Analysis Tools

After a kernel dump has been created, it must be read by an analysis tool for problem determination. Two dump analysis tools are available for Linux: lcrash and crash. The lcrash tool is part of the LKCD project and isn't being actively developed; crash, originally developed by a company called Mission Critical Linux and now maintained by Red Hat, will probably be the Linux dump analysis tool of the future.

The kernel dump analysis tools support many commands; crash, for example, provides ps to list processes, bt to display stack backtraces, and log to show the kernel message buffer.

A Simple Dump Analysis Scenario

Let’s consider how crash is used. The sleep program is started (this is an example only); then a dump is created, Linux is rebooted, and the dump is opened with crash. Apart from the dump file, crash normally needs two additional files: vmlinux and vmlinux.debug. These contain kernel symbol addresses and the datatype description, respectively. In some distributions, these two files are merged. For our example, the following steps have been performed:

  1. Start sleep program: /bin/sleep 1000.
  2. Create DASD dump (/dev/dasdd1).
  3. Reboot Linux system.
  4. Copy dump: zgetdump /dev/dasdd1 > dump.s390.
  5. Start crash tool: crash /boot/vmlinux /usr/lib/debug/boot/vmlinux.debug dump.s390.

Figure 7 shows all processes in the dump as well as the sleep process. The Process Identifier (PID) of the sleep process is 26735. The parent of the sleep process is the bash shell process with PID 26617 (see Figure 8). The sleep process has executed the system call “nanosleep”; the top-most function on the stack is “schedule” (the Linux function where all processes normally sleep until the scheduler wakes them up again).
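The listings in Figures 7 and 8 come from standard crash commands; a session along these lines, with the PIDs from the example above and the output abbreviated to placeholders, might look like:

```
crash> ps | grep sleep
  26735  26617  ...  IN  ...  sleep      (parent: bash, PID 26617)
crash> bt 26735
PID: 26735  TASK: ...  COMMAND: "sleep"
 #0 [...] schedule
 ...
 #n [...] sys_nanosleep
```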

 

Summary

This article described the history of Linux dump methods. After Linus Torvalds rejected dump methods such as LKCD, the Kdump method was finally accepted in the mainline kernel. On System z, architecture-specific dump tools existed several years before Kdump, and remain in use. These include standalone dump tools for DASD and channel-attached tapes, a dump tool for ZFCP SCSI disks, and the hypervisor dump method, VMDUMP. The main advantage of these System z dump tools is reliability.