Apr 15 ’09
OpenSSL Performance for Linux on System z: Exploiting the Crypto Express2 Accelerator
The best way to secure data being exchanged over an insecure network such as the Internet is to encrypt it. The Crypto Express2 (CEX2) feature for IBM System z provides cryptographic functions implemented in hardware that would otherwise have to be computed in software. A software implementation of a cryptographic algorithm costs considerably more CPU time than a hardware-accelerated one.
CEX2 is an optional card and is a replacement for the older PCI Cryptographic Accelerator (PCICA). PCICA was available for the z800 and z900. CEX2 was introduced for the zSeries z890 and z990 machines and is supported for Systems z9 and z10. Each CEX2 card provides two PCI-X adapters. A PCI-X adapter can be configured either as a cryptographic coprocessor (CEX2C) for secure key encrypted transactions (not discussed in detail here) or as a cryptographic accelerator (CEX2A) for the Secure Sockets Layer (SSL) protocol. Because SSL uses a clear key to protect its data in an SSL session, a CEX2A works only in clear key mode. This article shows the performance throughput improvements when exploiting a CEX2A for the Linux SSL implementation (OpenSSL).
There’s also a lower cost cryptographic feature, Crypto Express2-1P (CEX2-1P), designed to address small and midrange security requirements (e.g., System z10 BC). CEX2-1P provides one PCI-X adapter per feature instead of two.
CEX2 executes cryptographic requests asynchronously to the Central Processor (CP) on a System z, so cryptographic requests are processed in parallel while the CP executes other tasks. When configured as CEX2A, a subset of cryptographic functions is enabled that accelerates the compute-intensive public key operations often used in the SSL protocol stack. A CEX2A was designed for SSL acceleration and should be used only for that purpose.
CEX2C provides a high-security, high-throughput cryptographic subsystem. The cryptographic hardware relieves the main processor from the tasks involved in performing functions such as:
• Advanced Encryption Standard (AES)
• Data Encryption Standard (DES)
• Triple DES (TDES)
• Rivest-Shamir-Adleman (RSA) cipher
• Secure Hash Functions (SHA).
The coprocessor design protects cryptographic keys and sensitive custom applications. The software running in the coprocessor can be customized to meet special requirements.
A CEX2C can speed up an SSL connection as well, but it isn’t optimized for that purpose. Using CEX2A allows much higher SSL handshake rates.
Using Cryptographic Hardware
The SSL communication protocol was designed to provide secure communication over an open, insecure network. The connection between a client and server is established by executing a so-called SSL handshake process. During an SSL handshake, the keys for the symmetric cipher are exchanged and the type of symmetric cipher for the data encryption is negotiated. An asymmetric cipher (also known as a public key cipher) called RSA is used for that purpose. RSA is quite CPU-intensive, so SSL uses it as little as possible: usually, public key encryption is applied only to agree on an encryption key for a symmetric cipher.
The CEX2A provides hardware support for the cryptographic functions SSL uses in the handshake process. By offloading these cipher calculations from the CP or Integrated Facility for Linux (IFL) to external cryptographic hardware, the overall SSL handshake rate can dramatically increase. As mentioned, the data exchanged via the SSL secured connection is encrypted with a symmetric cipher (e.g., TDES or AES-128).
Symmetric ciphers are used for encrypting larger data portions because they’re significantly faster than asymmetric ciphers. By choosing a symmetric cipher supported by the System z Central Processor Assist for Cryptographic Function (CPACF) of a certain System z machine, the overall SSL performance can be increased, too. There are several symmetric ciphers supported for that purpose. Usually, they differ in the number of bits used for the key length. For example, AES-128 uses 128 bits for the key length and AES-256 uses 256 bits. The more bits used for the key length, the more secure the symmetric cipher. A longer key length usually implies less cipher throughput performance and higher CPU costs—especially when the cipher must be calculated in software and CPACF acceleration can’t be used.
By offloading the symmetric cipher calculations from the CP to CPACF, you can reduce the CPU costs and gain higher SSL traffic throughput rates. Moreover, CPACF supports different flavors of secure hash functions, which the SSL protocol uses, too (see Figure 1).
The Linux on System z SSL Environment
Applications such as secure Web servers can use the SSL protocol to encrypt their network traffic. OpenSSL is the open source library and toolkit implementing the SSL v2/v3 protocols. It’s available for many different operating systems, including Linux. Since OpenSSL 0.9.6, the library has been able to interact with external cryptographic hardware. The interfaces for a specific hardware vendor are put into so-called engine modules. For example, the engine module ibmca contains a shared object for the IBM Cryptographic Accelerator (ICA). The engine ibmca requires the interface library libICA to communicate with the ICA. Usually, both packages (openssl-ibmca and libica) are pre-installed or available as separate packages for the current Linux on System z distributions. These two packages are required when using any System z cryptographic hardware (CPACF and/or CEX2A) support for OpenSSL.
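One way to see whether the engine makes a difference is OpenSSL's built-in speed benchmark, run once in software and once through the engine. The following is a minimal sketch and not the workload used for this article. The helper name rsa_speed_compare is hypothetical, and the sketch assumes that the ibmca engine is installed and that RSA operations are routed through it.

```shell
#!/bin/sh
# Hypothetical helper: benchmark RSA with "openssl speed", first in
# software and then through an engine (default: ibmca, assumed installed).
rsa_speed_compare() {
    bits=${1:-1024}
    engine=${2:-ibmca}
    echo "== rsa$bits, software =="
    openssl speed "rsa$bits"
    echo "== rsa$bits, engine $engine =="
    openssl speed -engine "$engine" "rsa$bits"
}
```

Calling rsa_speed_compare 1024 prints the sign and verify rates for both runs. Only RSA is covered here because it is the public key operation of the handshake; benchmarking symmetric ciphers through an engine would additionally need the -evp option of openssl speed.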
Once a CEX2 feature is properly configured for your Logical Partition (LPAR) or z/VM, the Linux generic cryptographic device driver, zcrypt, must be loaded to use the CEX2 hardware. Figure 2 shows all involved software/hardware layers.
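After loading the driver, it can help to confirm that Linux actually sees an adapter. The sketch below lists the entries on the AP bus in sysfs. The helper name list_ap_adapters is hypothetical, and the layout (cardNN directories carrying type and online attributes) is an assumption about the driver version in use; check the device driver documentation for your distribution.

```shell
#!/bin/sh
# Hypothetical helper: list crypto adapters registered on the AP bus.
# Assumption: adapters appear as cardNN directories that carry
# "type" and "online" attributes.
list_ap_adapters() {
    dir=${1:-/sys/bus/ap/devices}
    if [ ! -d "$dir" ]; then
        echo "no AP bus at $dir (is the zcrypt driver loaded?)"
        return 1
    fi
    found=0
    for card in "$dir"/card*; do
        [ -d "$card" ] || continue
        found=1
        kind=$(cat "$card/type" 2>/dev/null || echo unknown)
        online=$(cat "$card/online" 2>/dev/null || echo "?")
        echo "$(basename "$card") type=$kind online=$online"
    done
    if [ "$found" -ne 1 ]; then
        echo "no adapters found under $dir"
        return 1
    fi
}
```

Run list_ap_adapters without arguments on the Linux guest or LPAR. An adapter configured as an accelerator would be expected to report a type such as CEX2A.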
CEX2A supports public key operations in clear key mode only for SSL handshake processes, which are slow and CPU-intensive. Figure 3 shows the SSL handshake rates for a certain number of parallel connections. The workload used to measure SSL handshakes exchanges only a few bytes of data so the data encryption part (symmetric cipher) can be ignored.
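A handshake-only workload of this kind can be approximated with the OpenSSL s_time tool, which opens connections for a fixed time and reports connections per second. This is a sketch under assumptions, not the benchmark used for the measurements here: the helper names are hypothetical, and the host and port must point to your own SSL test server.

```shell
#!/bin/sh
# Hypothetical helper: drive SSL handshakes against a test server.
# -new requests a new session for every connection, so each connection
# performs a full handshake instead of reusing a session.
measure_handshakes() {
    host=$1
    port=${2:-443}
    seconds=${3:-30}
    openssl s_time -connect "$host:$port" -new -time "$seconds"
}

# Start N measurements at once to approximate N parallel connections.
measure_parallel() {
    n=$1
    shift
    i=0
    while [ "$i" -lt "$n" ]; do
        measure_handshakes "$@" &
        i=$((i + 1))
    done
    wait
}
```

For example, measure_parallel 16 myserver 443 30 starts 16 clients for 30 seconds each (myserver is a placeholder). The per-client rates printed by s_time must be added up to obtain the total handshake rate.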
One CEX2A adapter can drive up to 3,300 new connections per second (i.e., SSL handshakes), which is the adapter limit when using more than 16 parallel SSL connections. By adding further CEX2A adapters, you can go beyond the limit of one adapter. For example, a second adapter doubles the number of possible handshakes. When using no CEX2 feature, the maximum handshake limit is already reached at approximately 750 connections per second when using four logical processors. In this case, the four processors are the limiting factor. Based on this measurement environment, you can drive 4.4 times more connections per second with a single CEX2A adapter available.
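The factor of 4.4 follows directly from the two rates quoted above. As a quick check, using only the numbers from the text:

```shell
#!/bin/sh
# Handshake rates quoted in the text (new connections per second).
cex2a_rate=3300      # one CEX2A adapter, more than 16 parallel connections
software_rate=750    # no CEX2 feature, four logical processors

# Ratio of hardware-assisted to software-only rate, one decimal place.
speedup=$(LC_ALL=C awk -v hw="$cex2a_rate" -v sw="$software_rate" \
    'BEGIN { printf "%.1f", hw / sw }')
echo "single adapter: ${speedup}x the software-only rate"

# The text states that a second adapter doubles the possible handshakes.
two_adapters=$((2 * cex2a_rate))
echo "two adapters: $two_adapters handshakes per second"
```

The script reports a ratio of 4.4 for one adapter and 6600 handshakes per second for two adapters. The second figure is the doubling described in the text, not a separately measured value.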
Figure 4 displays the CPU load for 32 parallel SSL connections. The left bar is the CPU load for the measurement with no CEX2A adapter. All four processors are busy doing the RSA operations in software. So 94 percent of the processor load is the user time portion. The right bar shows the processor load for the same measurement with the CEX2A adapter available. Because the RSA operations are now offloaded to CEX2A, only two out of four processors are busy. The larger system time (kernel code running) is a result of using the generic crypto device driver, zcrypt. However, the total processor load is only 50 percent.
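The split between user and system time can be derived from two samples of the aggregate cpu line in /proc/stat. The sketch below is one possible way to do that and is not the tool used to produce Figure 4. The helper name cpu_split is hypothetical, and it counts only the system field as system time, ignoring interrupt time.

```shell
#!/bin/sh
# Hypothetical helper: user/system share between two "cpu ..." lines
# taken from /proc/stat some seconds apart.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | LC_ALL=C awk '
        {
            n = (NF > 9) ? 9 : NF   # skip guest fields, already part of user/nice
            total = 0
            for (i = 2; i <= n; i++) total += $i
            user[NR] = $2 + $3      # user + nice ticks
            sys[NR] = $4            # system ticks
            tot[NR] = total
        }
        END {
            d = tot[2] - tot[1]
            if (d <= 0) { print "no elapsed ticks"; exit 1 }
            printf "user %.0f%% system %.0f%%\n", \
                100 * (user[2] - user[1]) / d, 100 * (sys[2] - sys[1]) / d
        }'
}
```

On a live system, take the samples with a=$(head -n 1 /proc/stat), wait a few seconds, take b the same way, and call cpu_split "$a" "$b". The percentages are relative to all processors together, so 50 percent on a four-processor LPAR corresponds to two busy processors.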
Generic Cryptographic Device Driver Polling Thread
Starting with the zcrypt device driver version 2.1.0 (available since Novell SUSE Linux Enterprise 10 [SLES10] SP1 and RHEL5.1), the device driver provides a configurable polling thread. The polling thread queries the cryptographic adapter for finished cryptographic requests that were offloaded to the adapter.
To list the version of the currently loaded crypto device driver, enter:
Figure 5 shows that the CEX2A adapter is used best with the polling thread enabled. This advantage is especially noticeable in the ranges where the adapter isn’t fully utilized (one through eight parallel SSL connections). With the polling thread disabled, the handshake rate may decrease, depending on the number of parallel SSL connections. This occurs because finished cryptographic requests are fetched from the adapter only with a Linux kernel timer interrupt, which occurs every one-hundredth of a second. This explains why the maximum handshake rate for a single SSL connection is 100 connections per second.
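The limit of 100 connections per second can be reproduced with simple arithmetic. The sketch assumes that a single connection performs its handshakes strictly one after another and that each handshake waits for exactly one offloaded request; the helper name max_serial_rate is hypothetical.

```shell
#!/bin/sh
# Upper bound on serial handshakes when finished requests are fetched
# only on a timer interrupt.
# Assumption: one handshake at a time, one offloaded request per handshake.
max_serial_rate() {
    timer_hz=$1          # timer interrupts per second
    requests=${2:-1}     # offloaded requests per handshake (assumed)
    echo $((timer_hz / requests))
}

tick_ms=$((1000 / 100))  # 100 interrupts per second, as stated in the text
echo "tick interval: ${tick_ms} ms"
echo "max handshakes per second: $(max_serial_rate 100)"
```

With a tick every 10 ms, a connection that has to wait for the next tick before it can continue cannot complete more than 100 handshakes per second, no matter how fast the adapter itself is.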
To fully exploit a CEX2A adapter (especially when the adapter isn’t fully utilized) when running in LPAR or as a guest under z/VM, turn on the device driver polling thread. If the polling thread is enabled, the benefit is faster retrieval of any finished cryptographic requests from the adapter. When enabling the polling thread, remember that there are slightly higher CPU costs for the polling thread itself. However, as seen before, the overall processor load dramatically drops when using a CEX2A adapter. All measurements shown in this article were conducted with the polling thread enabled, except for the polling thread comparison measurements.
When the cryptographic adapter is idle and there are no more outstanding cryptographic requests, the polling thread is inactive and there’s no additional overhead. Because there are additional CPU costs for the polling thread, it can be turned off. This is a trade-off between throughput and CPU cost (see Figure 6).
Configuring the Polling Thread
For older distributions (shipping device driver version 2.1.0), the polling thread is enabled by default. Starting with Novell SLES10 SP2 and RHEL 5.2, the polling thread is disabled by default. To load the device driver with an enabled polling thread, enter:
modprobe z90crypt [options] poll_thread=1 (for monolithic module)
modprobe ap [options] poll_thread=1 (for discrete modules)
To load the device driver with a disabled polling thread, enter:
modprobe z90crypt [options] poll_thread=0 (for monolithic module)
modprobe ap [options] poll_thread=0 (for discrete modules)
When running as a guest under z/VM, APAR VM64440 is recommended (available for z/VM 5.2 and 5.3). To dynamically enable the polling thread, enter:
echo 1 > /sys/bus/ap/poll_thread
To dynamically disable the polling thread, enter:
echo 0 > /sys/bus/ap/poll_thread
To check the current status of the polling thread, enter:
cat /sys/bus/ap/poll_thread
1 – means enabled
0 – means disabled
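For scripts, the sysfs attribute can be wrapped in two small helpers. This is a minimal sketch: the function names are hypothetical, the attribute path is the one used in the commands above, and writing to it requires root privileges.

```shell
#!/bin/sh
# Hypothetical helpers around the poll_thread sysfs attribute.
# The optional last argument overrides the attribute path (useful for testing).
poll_thread_status() {
    attr=${1:-/sys/bus/ap/poll_thread}
    if [ ! -r "$attr" ]; then
        echo "unavailable"
        return 1
    fi
    case $(cat "$attr") in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

set_poll_thread() {
    case $1 in
        0|1) ;;
        *) echo "usage: set_poll_thread 0|1 [attribute]" >&2; return 2 ;;
    esac
    echo "$1" > "${2:-/sys/bus/ap/poll_thread}"
}
```

For example, set_poll_thread 1 followed by poll_thread_status should print enabled on a system where the zcrypt driver is loaded.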
The measurement environment used for this article was an IBM System z9 (2094-S18) configured with:
• An LPAR with four logical processors
• IBM System Storage DS8300 (2107-922)
• One Crypto Express2 (CEX2) feature
• SLES10 SP1, including Linux generic zcrypt device driver 2.1.0, OpenSSL 0.9.8a, Interface Library (libICA) 1.3.7, and OpenSSL HW engine support ibmca 1.3.7.
Important data that’s exchanged over an open, insecure network should be protected to safeguard company assets. Encryption algorithms and protocols are available for that purpose. The SSL protocol was designed to secure a connection over an open network. However, cryptographic algorithms are CPU-intensive, and the stronger the encryption algorithm, the more application performance can suffer.
Additional cryptographic hardware is recommended to provide acceptable performance for a secured system such as a secure Web server. Ensure that your LPAR, Linux, and application configuration can use the CPACF and CEX2 hardware. Usually, this requires correct settings in the application configuration files and doesn’t work out of the box.
To learn more, refer to IBM documentation, including Linux on System z Device Drivers, Features, and Commands (SC33-8411-00). You can access this and other resources at http://www.ibm.com/developerworks/linux/linux390/development_documentation.html. More performance-related information regarding Linux on System z is accessible at http://www.ibm.com/developerworks/linux/linux390/perf/