IT Management

When it comes to storage networks, or networks in general, being proactive hasn’t always been easy or, in some cases, even possible. However, the storage networking world has evolved considerably over the past two years when it comes to managing the storage area network (SAN) infrastructure. A proactive management methodology will pay immense dividends across the entire SAN, both for FICON and open systems.

Note: To better understand the concepts presented here, first take a look at the column titled “Proactive vs. Reactive Management of Your Storage Networks” on page 56 of this issue.

A Proactive Management Methodology

The network matters for storage, and that’s becoming increasingly apparent for mainframe environments. Two-site business continuity architectures are standard, and many mainframe shops are moving to, or have already implemented, a three-site business continuity architecture. These sites are connected via cascaded FICON directors. The cross-site connectivity, provided by interswitch links (ISLs), is crucial. In shops implementing synchronous DASD replication between sites, an outage or failure on these ISLs simply can’t be tolerated. Your company can’t afford to have an ISL fail; you need to anticipate when it may fail so you can take corrective action ahead of time.

Storage network fabric operating systems have introduced several features over the past two years that have enabled proactive management. As the SAN hardware (directors and switches) and fabric operating systems are enhanced, the management software for the SAN typically introduces functionality to manage those enhancements and provides the capability to monitor them.

Being More Proactive

There are several ways you can be more proactive in the management of your storage network. Many of these best practices focus on the ISLs, as those are often the most critical component in your SAN infrastructure. You should ask your SAN vendor for specific details on whether and how they implement these technologies and features:

Pre-production diagnostics. Before you even put a new SAN fabric into production, you should put at least your ISLs through some testing. Diagnostic capabilities exist in Gen 5 Fibre Channel (FC) platforms to verify the signal integrity of the optics and cables. This is sometimes referred to as diagnostic port (D_Port) mode.

Forward error correction (FEC). FEC enables recovery from bit errors on ISLs by adding error-correcting code to the transmitted data, so the receiver can repair errors without a retransmission. This error recovery code doesn’t impact latency in any measurable way, but it does significantly enhance transmission reliability and performance. FEC will be required for Gen 6 FC platforms as part of the standard. It’s available today on Gen 5 SAN hardware.

Buffer credit loss recovery. Buffer credits are crucial to performance over distance. Therefore, they’re crucial to the performance of traffic traversing ISLs. Buffer credit loss recovery helps overcome performance degradation and ISL congestion due to buffer credit loss. Buffer credit shortages are detected early and proactively corrected. Exact implementation of this technology varies by vendor.
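To see why buffer credits matter so much over distance, a rough back-of-the-envelope estimate can help. The sketch below is not any vendor's sizing method. It assumes light travels through fiber at roughly 5 microseconds per kilometer, that every frame is a full-size 2,148-byte frame, and that the link sustains a given usable throughput. The function name, parameters and example numbers are illustrative only.

```python
import math

FIBER_DELAY_S_PER_KM = 5e-6   # assumption: ~5 microseconds per km in fiber
FULL_FRAME_BYTES = 2148       # assumption: every frame is a full-size FC frame


def credits_to_fill_link(distance_km, throughput_bytes_per_s,
                         frame_bytes=FULL_FRAME_BYTES):
    """Estimate the buffer credits needed to keep a long link busy.

    A credit is held from the moment a frame is sent until the receiver's
    acknowledgment returns, so the sender needs about one credit per frame
    that fits into a round trip.
    """
    round_trip_s = 2 * distance_km * FIBER_DELAY_S_PER_KM
    frame_time_s = frame_bytes / throughput_bytes_per_s
    return math.ceil(round_trip_s / frame_time_s)


# Illustrative numbers only: a 50 km ISL at an assumed 800 MB/s.
print(credits_to_fill_link(50, 800e6))
```

Smaller average frame sizes push the estimate up, which is one reason losing even a few credits on a long ISL can hurt throughput.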

Policy-based threshold monitoring and alerting. Although threshold monitoring and alerts/notifications were available prior to last year, the process of setting up these thresholds and the corresponding alerts was tedious and time-consuming; the thresholds and alerts typically had to be set up on an individual port-by-port basis. The larger the storage network, the more complex this process could be, and the more time it would take. Advancements made over the past 15 months have introduced the concept of policy-based threshold monitoring and alerting. These tools are part of the Fabric Operating System (FOS). They leverage pre-built rule/policy templates to greatly simplify threshold configuration, monitoring and alerting. Organizations can configure the entire SAN fabric (or multiple fabrics) at one time using common rules and policies, or customize policies for specific ports or switch elements, all through a single dialog. The integrated dashboard displays an overall switch health report, along with details on out-of-policy conditions, to help administrators quickly pinpoint potential issues and easily identify trends and other behaviors occurring on a switch or fabric.
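The difference between port-by-port setup and a policy can be illustrated with a small sketch. This is not the FOS interface or any vendor's rule syntax; the class names, counter names and thresholds below are all hypothetical. It only shows the idea of one set of rules applied fabric-wide, with optional overrides for specific ports.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical counter name, e.g. "crc_errors"
    threshold: float  # out of policy when the observed value exceeds this


@dataclass
class Policy:
    rules: List[Rule]                                   # applied to every port
    overrides: Dict[str, List[Rule]] = field(default_factory=dict)

    def rules_for(self, port: str) -> List[Rule]:
        # A port with an override uses its own rules instead of the defaults.
        return self.overrides.get(port, self.rules)


def out_of_policy(policy, samples):
    """Return (port, metric, value, threshold) for every violated rule."""
    violations = []
    for port, counters in samples.items():
        for rule in policy.rules_for(port):
            value = counters.get(rule.metric, 0)
            if value > rule.threshold:
                violations.append((port, rule.metric, value, rule.threshold))
    return violations


# One policy for the whole fabric, with stricter rules for a critical ISL.
policy = Policy(
    rules=[Rule("crc_errors", 10), Rule("link_resets", 2)],
    overrides={"isl-1": [Rule("crc_errors", 0), Rule("link_resets", 0)]},
)
samples = {
    "port-7": {"crc_errors": 3, "link_resets": 0},
    "isl-1": {"crc_errors": 1, "link_resets": 0},
}
print(out_of_policy(policy, samples))
```

In this made-up sample, only the ISL is flagged: its stricter override treats a single CRC error as out of policy, while the same count on an ordinary port would pass.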

Bottleneck detection mechanisms. A bottleneck is a port in the fabric where frames can’t get through as fast as they should; i.e., where the offered load is greater than the achieved egress throughput. Bottlenecks can cause undesirable degradation in throughput on various links. When a bottleneck occurs at one place, other points in the fabric can experience bottlenecks as the traffic backs up. The bottleneck detection feature detects two types of bottlenecks: latency bottlenecks and congestion bottlenecks.

A latency bottleneck is a port where the offered load exceeds the rate at which the other end of the link can continuously accept traffic, but doesn’t exceed the physical capacity of the link. This condition can be caused by a device attached to the fabric that’s slow to process received frames and send back credit returns. A latency bottleneck due to such a device can spread through the fabric and slow down unrelated flows that share links with the slow flow.

A congestion bottleneck is a port that’s unable to transmit frames at the offered rate because the offered rate is greater than the physical data rate of the line. For example, this condition can be caused by trying to transfer data at 8 Gbps over a 4 Gbps ISL.

Bottleneck detection mechanisms identify and alert you to device or ISL congestion as well as abnormal latency levels in the storage network. The mechanism typically works in conjunction with the SAN management software to automatically monitor and detect network congestion and latency in the fabric, provide visualization of bottlenecks in a GUI-based connectivity map, and proactively (there’s that word again) identify which devices and hosts are impacted and/or potentially impacted by a bottlenecked port. You can set alert thresholds for the severity and duration of the bottleneck. If a bottleneck is reported, you can then investigate and optimize the resource allocation for the fabric.
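The two bottleneck definitions above, and the idea of alerting on severity and duration, can be expressed as a short sketch. This is a simplified illustration, not how any fabric operating system actually measures bottlenecks; the function names, the sampling scheme and the threshold parameters are assumptions made for the example.

```python
def classify_bottleneck(offered_gbps, achieved_gbps, link_gbps):
    """Classify a port using the two bottleneck definitions in the text."""
    if offered_gbps <= achieved_gbps:
        return "none"          # frames leave as fast as they are offered
    if offered_gbps > link_gbps:
        return "congestion"    # more offered than the line can physically carry
    return "latency"           # link has capacity, but the far end is slow


def should_alert(states, window, min_fraction):
    """Alert when enough of the last `window` samples were bottlenecked.

    `min_fraction` stands in for a severity threshold and `window` for a
    duration threshold; both names are hypothetical.
    """
    recent = states[-window:]
    if len(recent) < window:
        return False           # not enough history yet
    affected = sum(1 for s in recent if s != "none")
    return affected / window >= min_fraction


# Made-up samples of (offered, achieved, link) rates in Gbps for one port.
samples = [(2, 2, 8), (3, 1, 8), (3, 1, 8), (8, 4, 4)]
states = [classify_bottleneck(*s) for s in samples]
print(states)
print(should_alert(states, window=4, min_fraction=0.5))
```

The last sample mirrors the example in the text: 8 Gbps offered to a 4 Gbps ISL is a congestion bottleneck, whereas 3 Gbps offered to an 8 Gbps link that only drains 1 Gbps points to a slow device on the far end.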

Flow monitors. Monitoring real-time bandwidth consumption by hosts/applications on ISLs can help identify hot spots and potential network congestion. The leading tools available today are able to automatically learn (discover) flows and non-disruptively monitor flow performance. Users can monitor all flows from a specific host to multiple targets/logical unit numbers (LUNs) or from multiple hosts to a specific target/LUN; monitor all flows across a specific ISL; or perform LUN-level monitoring of specific frame types to identify resource contention or congestion that’s impacting application performance. More advanced implementations include the capability to mirror these flows for further analysis if troubleshooting is needed. Ideally, you won’t need to add intrusive network taps or pay for additional third-party software for this functionality; your SAN vendors currently have these tools built into their hardware, operating system and management software.
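To make the different monitoring views concrete, here is a minimal sketch that groups flow records by host, by target/LUN, or by ISL. The record layout and all names are hypothetical; real flow monitoring tools collect this data in the switch hardware and expose it through their own interfaces.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowSample(NamedTuple):
    host: str
    target: str
    lun: int
    isl: str
    bytes_sent: int


def bytes_by(samples, *keys):
    """Total bytes grouped by the given fields, e.g. ("host",) or ("isl",)."""
    totals = defaultdict(int)
    for s in samples:
        totals[tuple(getattr(s, k) for k in keys)] += s.bytes_sent
    return dict(totals)


def top_flows_on_isl(samples, isl, n=3):
    """The n busiest host/target/LUN flows crossing one ISL."""
    on_isl = [s for s in samples if s.isl == isl]
    per_flow = bytes_by(on_isl, "host", "target", "lun")
    return sorted(per_flow.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up samples for illustration.
samples = [
    FlowSample("hostA", "array1", 0, "isl-1", 500),
    FlowSample("hostA", "array1", 1, "isl-1", 200),
    FlowSample("hostB", "array1", 0, "isl-1", 900),
    FlowSample("hostB", "array2", 4, "isl-2", 100),
    FlowSample("hostA", "array1", 0, "isl-1", 300),
]
print(bytes_by(samples, "isl"))          # load per ISL
print(bytes_by(samples, "host"))         # one host to many targets/LUNs
print(top_flows_on_isl(samples, "isl-1", n=2))
```

Grouping the same records by different fields is what lets one data set answer several questions: which ISL is the hot spot, which host is driving it, and which LUN is on the receiving end.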

Dashboard functionality in your SAN management software. Dashboards are all the rage, and the latest SAN management platforms have built-in dashboard functionality designed to work with all the other technologies previously described. Customizable dashboards let you closely monitor the components of the SAN you’re responsible for. You no longer need to go through endless GUI drop-down menus. The dashboards let you watch overall status and then drill down into potential areas of concern by simply double-clicking on the dashboard component you wish to examine in greater detail.

Education. The tools and technology are only as good as the level of knowledge of the personnel managing them. Learn how to fully utilize the technology and features you have at your disposal. Also, take advantage of the (often free) training offered by your SAN vendor. They have every interest in you learning how to use their tools. An educated user is likely a happier user. 

Conclusion

A wide variety of technology enhancements have been introduced to let you be much more proactive in managing your storage network and its components. At the same time, these enhancements have become much more user-friendly. Isn’t it time to be proactive?