Modern Fibre Channel (FC) Storage Area Network (SAN) fabrics, both Fibre Channel Protocol (FCP) and Fibre Connection (FICON), have evolved from simple data transport mechanisms to much more complex infrastructure. Fabrics containing multiple hosts across a wide spectrum of operating systems and hundreds of storage ports are common today. I/O levels and data traffic volumes, particularly across the core of a SAN fabric, are much higher than they used to be.
Fabric usage has also changed significantly. In 2013, there are many more High Availability (HA) requirements and more complex workloads. Hypervisors and virtualized hosts in significant quantity (such as in a Linux on System z implementation) make it more difficult to isolate application problems when application performance becomes an issue. Storage virtualization exacerbates this with its own unique I/O requirements.
All this has a serious impact on storage, and particularly on SAN fabric problem determination. There are now more entities to manage, including storage volumes, Logical Unit Numbers (LUNs), hosts, storage arrays and virtual machines, and therefore more things that can go wrong. As a result, the operational environment is much more difficult to manage than it was even a few years ago. Rogue, or poorly behaving, devices have more impact on production environments than they did previously. All the innovation in workloads and storage infrastructures has generated new behaviors and a corresponding change in SAN fabric traffic patterns and management demands.
The result of all this change is that the user is likely to see increasing issues with application performance. These issues seem to be associated with storage performance, but can be difficult to pinpoint and correct. Faulty or improperly configured devices, misbehaving hosts and faulty or substandard FC media can significantly impact the performance of FC fabrics and the applications they support. In most cases, these issues can’t be corrected or completely mitigated in the fabric itself; the behavior must be addressed directly. However, with the proper knowledge and capabilities, the fabric can often identify and sometimes mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.
Here we will discuss three aspects of SAN fabric resiliency:
1. HA five 9s architecture and designing for redundancy
2. Detecting abnormal behavior in external components (typically, servers/hosts or storage devices) that can negatively impact the SAN fabric so you can identify and fix the faulty device
3. Mechanisms that protect the SAN fabric from adverse effects caused by a faulty component, including one or more actions you can invoke automatically using a switch when faulty behavior is detected. This can contain and isolate the impact of the misbehaving component in the fabric. This should be considered a temporary measure. Ultimately, the faulty or improperly configured component must be addressed to resolve the problem completely and permanently.
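The detect-then-protect idea in items 2 and 3 can be sketched as a simple threshold check. This is only an illustration, not any switch vendor's feature: the counter names (`crc_errors`, `link_resets`), the threshold values and the action names (`alert`, `fence`) are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    """Error counters observed on one port during one sampling interval.

    The fields are hypothetical; a real switch exposes its own counters.
    """
    port: str
    crc_errors: int
    link_resets: int


# Hypothetical per-interval limits; real values depend on the environment.
THRESHOLDS = {"crc_errors": 20, "link_resets": 5}


def evaluate(sample, thresholds=THRESHOLDS):
    """Return the names of the counters that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if getattr(sample, name) > limit]


def choose_action(violations):
    """Map violations to a hypothetical action.

    One violation only raises an alert; several violations isolate
    ("fence") the port as a temporary protective measure.
    """
    if not violations:
        return "none"
    if len(violations) == 1:
        return "alert"
    return "fence"


if __name__ == "__main__":
    for s in (PortSample("port-1", 0, 0),
              PortSample("port-2", 50, 0),
              PortSample("port-3", 50, 9)):
        print(s.port, choose_action(evaluate(s)))
```

In this sketch, isolating the port only contains the impact. As item 3 notes, the faulty or misconfigured device behind the port still has to be fixed.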
Creating Five 9s Availability for SAN Fabrics
Massive amounts of data are created, transmitted and stored every day. Such data—whether in the form of financial transactions, online purchases, customer demographics, correspondence, spreadsheets or any number of business applications—is the livelihood of businesses across the globe. When it comes to customer transactions, it’s imperative that none is lost due to an IT system failure. Users demand near-100 percent system and infrastructure availability. This is no longer a luxury; it’s a necessity. Mission-critical applications and operations require truly reliable services and support, especially for their important I/O traffic.
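To make "five 9s" concrete, the short sketch below converts an availability figure into a yearly downtime budget and shows how redundancy raises availability. It assumes an average year of 365.25 days and, for the redundancy calculation, that the redundant paths fail independently, which is a simplification. The function names are illustrative only. Under these assumptions, 99.999 percent availability allows about 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included


def downtime_minutes_per_year(availability):
    """Yearly downtime budget, in minutes, for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(single, copies=2):
    """Availability of `copies` redundant paths when any one is enough.

    Assumes the paths fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for nines in (3, 4, 5):
        availability = 1.0 - 10.0 ** -nines
        minutes = downtime_minutes_per_year(availability)
        print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")

    # Two independent fabrics, each available 99.9 percent of the time.
    print(f"dual fabric: {parallel_availability(0.999, 2):.6f}")
```

The second function shows why designing for redundancy matters: two independent paths that each reach only three 9s combine to roughly six 9s, provided that a single failure cannot take down both.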
HA is valuable to all businesses, but to some it’s more crucial than to others. Deploying HA must be a conscious objective; it requires time, resources and money. HA ensures that servers stay connected to storage networks and storage devices and that data flows reliably, but HA equipment also carries a premium in Total Cost of Acquisition (TCA).
However, the Internet has emphasized that HA equals viability. If companies lack reliable HA solutions to keep their equipment operating, they lose money. If a company’s server fails, customers are apt to click over to a competitor. If mission-critical computers involved in manufacturing are damaged through machine failure, inventory may come up short and schedules could be missed. If a database application can’t reach its data due to I/O fabric failures, seats might not get filled on flights, hotel room reservations might go to a competitor or credit card transactions might be delayed, costing many thousands, sometimes even millions, of dollars in damage to the company’s bottom line.
Storage networking is an important part of this infrastructure. Because they depend on electronic devices, storage networks can fail. The cause may be a software or a hardware problem, but failures do occur. That’s why, rather than taking big risks, businesses running strategic processes in their computer environments integrate HA solutions into their operations.