May 30 ’14
Private Enterprise Cloud: From Insight to an Implementation Strategy
The article “Architecting for a Private Cloud: From Caching Framework to Elastic Caching Cloud Platform Service” (available at http://esmpubs.com/f2ivm) discussed an elastic caching service for private clouds. Interestingly enough, the majority of questions and feedback we received were actually related to the subject of private clouds rather than elastic caching. The spectrum of views and opinions was remarkably wide, including suggestions that private clouds don’t really exist; i.e., clouds can only be public. This article explores the subject of private cloud as well as the drivers and technologies behind it, and examines two architectural models for establishing a private cloud.
Workload Characteristics Define Your Cloud
Fundamentally, in the IT world, it’s all about the workload, its characteristics and requirements. A workload describes the work that an application or application component performs and the load it places on the underlying system infrastructure. There are many types of IT workloads, each with different characteristics:
• Traditional relational database server workloads that store data on disk, thus requiring dedicated, high-performance I/O, reliable storage and data sharing for high availability. These workloads usually scale vertically. DB2 and Oracle are among the top relational database management systems (RDBMSes).
• Mission-critical, online, high-volume transaction (online transaction processing [OLTP]) workloads with strict quality of service (QoS) characteristics such as predictable response time, high or continuous availability and ACID (atomicity, consistency, isolation and durability) properties. Typically used in conjunction with an RDBMS.
• NoSQL workloads that store data without using a relational data model on shared-nothing, horizontally scalable system infrastructures. Here data is distributed evenly across independent servers by a process called sharding, which allows large volumes of data to be stored and capacity to be scaled out or in easily. Data is often automatically replicated, so a failed copy can be quickly and transparently replaced with no application disruption. Examples of this workload are in-memory caching products such as IBM eXtreme Scale and document stores such as MongoDB.
• Hadoop and the rest of the family (MapReduce, Hive, Pig, ZooKeeper), which are workloads based on horizontally scalable, inexpensive, distributed file system architecture with built-in fault tolerance and fault compensation capabilities.
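To make the sharding and replication idea from the NoSQL bullet concrete, here is a minimal Python sketch. It assumes a simple hash-modulo placement scheme and hypothetical server names; it isn't how eXtreme Scale or MongoDB actually place data, only an illustration of the principle that each key maps deterministically to a primary server and to one or more replicas on other servers.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def placement(key: str, servers: list[str], replicas: int = 1) -> list[str]:
    """Return the primary server followed by its replica servers.

    Replicas go on the next servers in the list, so a copy of the data
    survives the loss of the primary. (Illustrative scheme only.)
    """
    if replicas >= len(servers):
        raise ValueError("need more servers than replicas")
    primary = shard_for(key, len(servers))
    return [servers[(primary + i) % len(servers)] for i in range(replicas + 1)]


# Hypothetical server names, used only for illustration.
servers = ["node-a", "node-b", "node-c", "node-d"]
for key in ("customer:42", "order:1001", "session:xyz"):
    print(key, "->", placement(key, servers, replicas=1))
```

With plain modulo placement, changing the number of servers remaps most keys, which is one reason production systems tend to use more elaborate schemes such as consistent hashing or a fixed number of partitions.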
Every type of workload has specific scalability, performance, reliability and security characteristics that the underlying infrastructure must accommodate, but we can identify two primary types:
• Workloads that rely on one or a few large nodes that scale vertically by adding resources to the nodes; i.e., scale-up architectures, where the underlying infrastructures are responsible for fault tolerance and high availability. These workloads are often stateful and frequently dependent on shared storage architecture.
• Distributed workloads that scale horizontally by adding nodes; i.e., scale-out architectures. These workloads are often stateless and fault tolerance is built into the software.
The strategy for providing scalability and fault tolerance is arguably the major differentiator between these workloads, since it has a major impact on the architectures of the underlying system infrastructures.
In traditional data centers, reliability is typically achieved by implementing active-passive, active-active or N+1 redundancy and using enterprise-grade hardware that detects and mitigates hardware failures. Traditional data center high-availability engineering is founded on redundant hardware: clustered servers, data replication to ensure consistency between redundant servers, redundant network cards and RAID disk arrays are all techniques that provide redundancy for possible points of failure in the system. When it comes to database high availability, particularly for transaction processing systems, enterprises favor shared-disk architecture; for example, Oracle Real Application Clusters (RAC) with shared cache and shared disk architecture, Parallel Sysplex technology supporting data sharing for DB2 for z/OS and IBM’s pureScale with its cluster-based, shared-disk architecture. To further mitigate the impact of hardware failures, virtualization platforms and hypervisors offer a range of mechanisms such as automated virtual machine restart and virtual machine relocation; these include VMware vMotion and Live Guest Relocation (LGR) in the z/VM world.
When it comes to public cloud providers such as Amazon Web Services (AWS), the picture changes. AWS, which according to Gartner is overwhelmingly the dominant vendor in the Cloud Infrastructure as a Service (Cloud IaaS) market, approaches the resiliency problem quite differently. Werner Vogels, Amazon’s CTO, is often quoted as saying: “Everything fails all the time.” AWS infrastructures integrate the entire solution: hardware, software and data center designs that don’t provide traditional redundancy. An application running on AWS should expect failure of hardware and failure of storage. In an AWS architecture, software must be resilient and able to distribute load across loosely coupled compute, network and storage nodes. SQL database services are also focused on a shared-nothing, scale-out architecture that leverages the principles of distributed computing to provide scale while maintaining compliance with ACID, SQL and all the properties of an RDBMS. For many people, this model (scale-out, ready-to-fail, distributed applications with partitioned logic and data, running on system infrastructures composed of commodity-grade hardware without traditional resiliency and redundancies) constitutes a cloud architecture.
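The “ready-to-fail” idea, where resiliency lives in the software instead of the hardware, can be illustrated with a short Python sketch. The node names and the simulated failure set below are hypothetical, and this isn't an AWS API; the sketch only shows a caller that treats node failure as a normal event and moves on to another loosely coupled node.

```python
import random


class NodeDown(Exception):
    """Raised by a simulated node that is unavailable."""


def call_node(node: str, failed: set[str]) -> str:
    """Simulate a request to one node; nodes in `failed` are down."""
    if node in failed:
        raise NodeDown(node)
    return f"response from {node}"


def resilient_call(nodes: list[str], failed: set[str]) -> str:
    """Try nodes in random order; succeed if any one of them answers."""
    candidates = nodes[:]
    random.shuffle(candidates)  # spread load across the nodes
    for node in candidates:
        try:
            return call_node(node, failed)
        except NodeDown:
            continue  # failure is expected: move on to the next node
    raise RuntimeError("all nodes failed")


# Hypothetical nodes; two of the three are simulated as failed.
nodes = ["node-1", "node-2", "node-3"]
print(resilient_call(nodes, failed={"node-1", "node-3"}))
```

The request succeeds as long as one node is healthy, with no reliance on redundant hardware underneath.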
Of course, there are other public clouds as well. VMware vCHS (vCloud Hybrid Service), despite its name, is a public cloud service. VMware focused its public cloud on providing the same high-availability service-level agreements (SLAs) that exist in traditional data centers, so you don’t have to rewrite or rearchitect existing applications to ensure their availability. In other words, vCHS seems to focus on supporting traditional enterprise workloads using traditional resiliency patterns.
Enterprise private clouds are more likely to resemble the VMware vCHS offering than AWS. Typically, these clouds run on virtualized, enterprise-grade hardware, network-attached storage and redundant network devices. They may offer services on demand, enabled via a user portal. However, according to Forrester, only 25 percent of enterprises do (see “Four Common Approaches to Private Cloud” by Lauren Nelson under “Resources”). Clearly, enterprise private clouds don’t offer the promise of unlimited capacity. Not many private clouds can boast of abstracted storage, network, security, load-balancing and full-stack automated provisioning. Most important of all, typical enterprise clouds focus on supporting enterprise workloads with scale-up characteristics that require high availability/disaster recovery (DR) from the underlying infrastructure.
Since we have two different architectural models for building system infrastructures targeting different types of workloads, the opinion that “enterprises don’t have clouds” has some grounds. It points out the differences between a dominant public cloud provider’s model and an enterprise model, which is based on data centers with system infrastructures that provide resiliency technologies and a cloud-like, self-service delivery model. One of the root causes for this is traditional enterprise workloads, particularly common in the financial industry, that tend to require massive, I/O-bound OLTP processing with ACID qualities. Additionally, it may not be visible from the outside, but enterprise workloads also include third-party applications that are essential to support core business functions. The architecture of those third-party applications can’t be described as forward-thinking by any stretch of the imagination, but, nevertheless, to support business requirements, private enterprise system infrastructures must provide efficient hosting for these workloads.
Another aspect that often muddies the waters around private clouds is a phenomenon commonly known as “cloudwashing,” the conscious or unconscious practice of describing capabilities associated with cloud technologies ambiguously. Avoiding cloudwashing by accurately describing the capabilities enabled in the enterprise data centers is highly desirable. Announcing to business partners or potential customers that “we have cloud services,” while in fact delivering something quite different, hardly helps establish trusting communications and improve future relationships.
Ultimately, whether we refer to the enterprise model as a cloud-like delivery model or as a private cloud, that doesn’t change the fact that enterprises have significant drivers to introduce it in their data centers in order to:
• Support business agility and respond quicker to business needs when hosting enterprise workloads, whose characteristics (I/O and CPU bound, infrastructure-level resiliency) may be better supported by custom-tailored system infrastructures in private data centers
• Ensure compliance with stringent corporate security policies and keep sensitive data on-premises
• Speed up infrastructure resource provisioning and application deployment
• Have full control over problem resolution
• Decrease overall complexity of managing enterprise data centers.
According to a Forrester survey, 55 percent of North American and European enterprises plan to prioritize building an internal private cloud in 2014 (see “Four Common Approaches to Private Cloud” under “Resources” at the end of this article). Given the real demand for a cloud-like delivery model in private data centers, it would be helpful to clear the ambiguity around “private cloud” and have a working definition of private cloud architecture that reflects the current realities and focuses on addressing current enterprise pain points:
• Accommodating workloads with different QoS characteristics:
o Traditional workloads that require high availability on the infrastructure level
o Scale-out, distributed workloads that require elasticity but don’t rely on the underlying infrastructure for high availability.
• On-demand, self-service access to pools of infrastructure resources for enterprise users while supporting multitenancy. There are opinions that private clouds are “single-tenant,” but most enterprise insiders would tell you that enterprises are very much multitenant environments and require a high degree of isolation between workloads due to both technical and organizational reasons.
Two Architectural Models for Private Cloud Infrastructures
Many existing enterprise private cloud strategies appear to be extensions or a natural evolution of virtualization strategies. After all, it’s normal to want to use existing virtualization platforms, making the most of investments and supporting the traditional workloads that currently prevail in enterprise portfolios. However, traditional DR solutions for business continuity typically require deploying data and applications across multiple data centers using an active-active or, more often, an active-passive standby approach. Additionally, hardware estimations are normally done for peak usage. Unfortunately, this strategy usually results in numerous resources sitting idle. Focusing just on establishing and maintaining redundant enterprise hardware across multiple data centers to support traditional enterprise workloads may be a costly, complex and unfulfilling proposition.
Times are changing. Newer, distributed types of scale-out workloads, such as web and mobile applications and NoSQL, have started to make their way into enterprise portfolios. For many enterprises, it may be prudent to balance and augment cloud strategies using a workload-centric approach.
The workload-centric approach will likely lead private cloud owners to consider different high-availability strategies, including an approach where the uptime metrics are met through replication and failure mitigation provided by the software capable of accommodating infrastructure failures.
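A rough worked example shows why replication in software can meet uptime targets on less reliable infrastructure. The Python sketch below assumes that replicas fail independently and that the service is up whenever at least one replica is up; both are simplifying assumptions, and the 99 percent figure is purely illustrative.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1.0 - (1.0 - node_availability) ** replicas


# Illustrative figure: each node is assumed to be available 99% of the time.
for n in (1, 2, 3):
    print(f"{n} replica(s): {replicated_availability(0.99, n):.6f}")
```

Under these assumptions, two copies on 99 percent nodes yield roughly 99.99 percent availability and three copies roughly 99.9999 percent. Correlated failures, such as a shared rack or power feed, would reduce these numbers.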
IBM’s eXtreme Scale, which offers in-memory caching, is an example of an elastic, middleware-layer product that’s architected for software-level fault tolerance (see the earlier referenced article). In other words, enterprises may want to take a page from the book of the dominant cloud providers such as AWS and develop a more fine-grained approach to resiliency and high availability, embracing distributed, scale-out types of workloads (see Figure 1).
To support a cloud-like delivery model for private system infrastructures for either type of workload, traditional enterprise or distributed, it’s insufficient to have automated provisioning of compute resources. The cloud-like delivery model requires self-service provisioning capabilities for other infrastructure resources: network, storage and security (see “Technology Overview for Cloud-Enabled System Infrastructure” by Lydia Leong under “Resources”). And that’s exactly the area where many enterprise private cloud efforts have stumbled. Until recently, the automation and provisioning domain was occupied only by vendor-specific, highly proprietary solutions. The choices available to enterprises were limited either to completely proprietary, single-vendor-driven products that offered a combination of data center and virtualization management or, alternatively, to DIY approaches with help from automation and configuration tools such as Chef and Puppet. As for the first option, the problems with vendor lock-in aren’t limited to the vendor’s viability. Vendor lock-in puts private cloud owners at the mercy of a single company’s vision of cloud management. It allows the vendor to control your access to industry trends, which impacts your economic model.
Last year, the landscape of products available for establishing private enterprise cloud infrastructures changed significantly with OpenStack entering the enterprise market. OpenStack is a three-year-old, open source cloud operating framework that provides fundamental infrastructure services around compute, storage and networking for the cloud. OpenStack is among the most active open source projects, with around 200 companies contributing to its development. The list of OpenStack sponsors is so long that it’s much easier to point out who isn’t on the list: AWS and Microsoft. Oracle is the latest addition to the list, announcing integration of OpenStack into many of its products.
OpenStack is modular in nature and consists of several open source projects and APIs that manage compute, storage and networking resources. OpenStack supports pluggable architecture and mechanisms for hypervisor, bare-metal, container-based virtualization, storage and networking, thus allowing enterprises to utilize existing enterprise hardware investments. OpenStack brings transparency and competition to the cloud management market, which is a great thing for enterprises. From an enterprise perspective, the biggest benefit of OpenStack may not be the open source OpenStack distributions themselves, but the ecosystem and competition created around OpenStack by vendors that deliver products, drivers, devices, plug-ins and services based on OpenStack. With OpenStack, enterprises are getting a chance to become more vendor-independent (see the sidebar).
OpenStack design tenets are clearly focused on distributed, scale-out models similar to the AWS model. However, various vendors have started to offer products that are based on OpenStack but extend support to traditional enterprise workloads. These offerings typically provide support for traditional enterprise hypervisors and additional features that cater to infrastructure-level resiliency.
Consequently, based on common OpenStack fabric, we may have the ability to establish enterprise private clouds supporting two different architectural models:
• The distributed model of public elastic clouds such as AWS
• The enterprise model based on enterprise-class hardware with added high-availability benefits.
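The workload-centric choice between these two models can be sketched as a simple decision rule. The Python below is a hypothetical illustration: the `Workload` fields, the model labels and the example workloads are assumptions made for this sketch, and a real placement decision would weigh many more factors such as security, I/O profile and cost.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    scales_horizontally: bool          # scale-out rather than scale-up
    fault_tolerant_in_software: bool   # survives node loss on its own


def choose_model(workload: Workload) -> str:
    """Pick an infrastructure model from the workload's characteristics."""
    if workload.scales_horizontally and workload.fault_tolerant_in_software:
        return "distributed"   # elastic, commodity-style infrastructure
    return "enterprise"        # enterprise-class hardware with infrastructure HA


# Hypothetical portfolio, for illustration only.
portfolio = [
    Workload("core-banking-oltp", False, False),
    Workload("elastic-cache-grid", True, True),
    Workload("third-party-erp", False, False),
]
for w in portfolio:
    print(f"{w.name}: {choose_model(w)}")
```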
Further following in AWS’s footsteps, OpenStack ventures toward an “IaaS+” model, providing more high-level services and focusing on application infrastructures. The latest release of OpenStack, codenamed “Havana,” includes the notable addition of the Heat service. Heat is a template-based orchestration service for provisioning “stacks,” which are simply sets of application infrastructure resources defined by blueprints. The blueprints can include virtual machines, floating IP addresses, storage and security groups.
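To give a feel for what such a blueprint contains, the Python sketch below assembles a small Heat-style template as a dictionary and prints it as JSON. The resource names, parameter names and default values are hypothetical. The resource types and property names follow the Heat template conventions as we understand them, but they vary between OpenStack releases, so treat this as an outline to check against the Heat documentation for your release, not a template to deploy as is.

```python
import json

# A minimal, hypothetical stack: one server with one attached data volume.
template = {
    "heat_template_version": "2013-05-23",
    "description": "Hypothetical single-server stack with one data volume",
    "parameters": {
        "image": {"type": "string", "description": "Image name or ID"},
        "flavor": {"type": "string", "default": "m1.small"},
    },
    "resources": {
        "app_server": {
            "type": "OS::Nova::Server",
            "properties": {
                "image": {"get_param": "image"},
                "flavor": {"get_param": "flavor"},
            },
        },
        "data_volume": {
            "type": "OS::Cinder::Volume",
            "properties": {"size": 10},  # size in GB (illustrative value)
        },
        "attachment": {
            "type": "OS::Cinder::VolumeAttachment",
            "properties": {
                "instance_uuid": {"get_resource": "app_server"},
                "volume_id": {"get_resource": "data_volume"},
                "mountpoint": "/dev/vdb",
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

The references between resources (`get_resource`) are what let an orchestration engine work out that the server and the volume must exist before the attachment is created.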
Template-based infrastructure provisioning has increasingly become a de facto standard for cloud offerings. The VMware stack features vApp, a relatively simple concept that allows users to pull together a collection of one or more virtual machines, their associated network configuration and additional settings. You can start and stop a vApp as a single unit and specify the start-up order for all virtual machines included in the vApp. AWS offers a pretty sophisticated CloudFormation service that allows users to describe templates that consist of multiple virtual machines, storage, networking, security groups, elastic load-balancing settings and start-up order. AWS CloudFormation also notifies users when each individual resource, as well as the entire stack, is up and ready to run. It includes variables, user inputs, outputs, flow control and a small standard library of functions. Similar to AWS CloudFormation, the OpenStack Heat service addresses the problem of creating repeatable cloud infrastructures for complex, multitiered architectures, which likely will be a welcome addition for private clouds. The Heat engine can take in AWS CloudFormation templates, but also has its own syntax called HOT. Using a template-based approach to creating cloud infrastructures simplifies provisioning across various enterprise infrastructures managed by OpenStack, thus reducing human errors (see Figure 2).
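Start-up order in a template is essentially a dependency-ordering problem. The Python sketch below uses the standard library’s `graphlib` module (Python 3.9 or later) to derive an order from declared dependencies. The three-tier stack is a hypothetical example, and the sketch illustrates the general technique, not the internal algorithm of vApp, CloudFormation or Heat.

```python
from graphlib import TopologicalSorter

# Hypothetical three-tier stack: each resource maps to the set of
# resources that must be running before it can start.
depends_on = {
    "database": set(),
    "app_server": {"database"},
    "load_balancer": {"app_server"},
}


def startup_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every resource starts after its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(startup_order(depends_on))
```

For this chain the only valid order is database, then app_server, then load_balancer. A circular dependency is rejected outright, which is the kind of error a template engine can catch before any resource is provisioned.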
OpenStack on Top of VMware Technologies and z/VM
It’s safe to assume that efforts aimed at establishing private OpenStack-based clouds for scale-out, distributed workloads will be Greenfield-type deployments. In this context, Greenfield deployment refers to efforts to build brand new infrastructures that aren’t constrained by any existing constructs. However, not many companies are ready to throw away their investments in existing virtualization products and platforms. More important, enterprises must provide hosting and support for traditional enterprise workloads, which currently dominate enterprise portfolios. Hence, there’s a good chance that a large share of OpenStack deployments in the enterprise will be Brownfield-type deployments. A Brownfield deployment is a type of project that aims to extend existing enterprise infrastructures with cloud-like delivery models. Greenfield efforts may be somewhat simpler to undertake, as you don’t have to worry about disrupting the support of the existing applications. Additionally, public-cloud-like models don’t require establishing complex resiliency solutions. On the flip side, Brownfield deployment scenarios that introduce cloud-like delivery models for the existing enterprise virtualized platforms using OpenStack will need to address some pretty complex problems, particularly related to hypervisor management.
The default OpenStack hypervisor is the open source Kernel-based Virtual Machine (KVM). The KVM hypervisor may be an attractive option for Greenfield, public-cloud-like enterprise cloud deployments. For Brownfield OpenStack deployments that aim to augment the capabilities of existing system infrastructures, support for enterprise hypervisors such as VMware ESX and IBM’s PowerVM and z/VM is likely to be critical.
VMware and Mirantis have announced their intent to provide enterprise-class support for OpenStack services. This solution will permit enterprise customers to enable a cloud-like delivery model based on OpenStack sitting on top of vSphere-managed virtualized servers.
When it comes to IBM hypervisors, the IBM SmartCloud Orchestrator (SCO) product offers an OpenStack layer for PowerVM and z/VM. In this IBM offering, OpenStack distributions are extended with value-add, enterprise-level features and complemented with a pattern (template)-driven deployment engine (e.g., the IWD component) and a full-scale, global orchestration engine based on BPM software from IBM (formerly Lombardi). Additionally, IBM changed a number of default products used by OpenStack. The publicly available OpenStack distribution uses RabbitMQ as the default message queue and MySQL as the default database, while IBM uses Apache Qpid and DB2 LUW. SCO version 2.3, released in October 2013, is based on the OpenStack release codenamed Grizzly and offers support for PowerVM. The product will offer support for all hypervisors, including z/VM, based on the OpenStack Icehouse release coming later in 2014.
We discussed various workload characteristics at the beginning of this article, but realistically, complex business software systems often consist of different types of workloads. Some components of the system may have a scale-out type of architecture while others can scale only vertically and require high availability at the infrastructure level. IBM’s System z servers offer efficient support for both types of workload (scale-out and scale-up) with their z/VM and z/OS virtualization, which adds considerable value to Brownfield OpenStack deployment efforts. If the qualities we’re after are scalability, resource pooling and reliability at the infrastructure level (remember, we’re talking about private clouds here), then System z servers, engineered for multilayer reliability and serviceability with built-in redundancy, fit the bill nicely. Scale-out workloads designed to withstand hardware failures don’t require the reliability of the System z server, but the additional benefits of consolidation, which can be differentiators for complex enterprise applications, need to be taken into consideration as well.
There are some prerequisites to enabling SmartCloud Orchestrator/OpenStack support for z/VM. On the z/VM side, it requires z/VM at the 6.3 level. On the SmartCloud Orchestrator side, enabling high availability of SmartCloud Orchestrator 2.3 components involves a VMware vSphere High Availability (HA) cluster. A VMware vSphere HA cluster, shared storage and a vMotion-enabled network are required to keep individual VM instances with specific SmartCloud services up and running (see Figure 3).
IBM System z servers are high-end systems, topping the charts of enterprise-grade hardware. The IBM System z virtualization platform for Linux offers massive horizontal scalability and is capable of hosting thousands of virtual machines on a single server. The control plane for System z cloud management must scale accordingly. Further focus on scalability and high availability would likely be welcome additions in subsequent SCO releases. Having a distributed, elastic architecture for each individual service within the cloud management control plane would help ensure that the individual services have sufficient capacity to meet the increasing demand. Correspondingly, it would be beneficial to have an option to install SmartCloud Orchestrator on Linux on z/VM.
Private enterprise clouds don’t have to be extensions of virtualization strategies. Following a workload-centered approach and pulling together balanced, optimized, end-to-end solutions that leverage both a public cloud model and a cloud-enabled model based on enterprise-grade hardware may be a compelling strategy that provides tangible benefits to enterprise customers.
Resources
• “Four Common Approaches to Private Cloud” by Lauren Nelson at http://blogs.forrester.com/lauren_nelson/13-10-28-four_common_approaches_to_private_cloud_0
• “Technology Overview for Cloud-Enabled System Infrastructure” by Lydia Leong at https://www.gartner.com/doc/2256515/technology-overview-cloudenabled-infrastructure
• OpenStack Design Tenets at https://wiki.openstack.org/wiki/BasicDesignTenets
OpenStack Design Tenets
1. Scalability and elasticity are our main goals.
2. Any feature that limits our main goals must be optional.
3. Everything should be asynchronous. If you can’t do something asynchronously, see #2.
4. All required components must be horizontally scalable.
5. Always use shared nothing (SN) architecture or sharding.
6. If you can’t use SN, see #2.
7. Distribute everything—especially logic. Move logic to where state naturally exists.
8. Accept eventual consistency and use it where it’s appropriate.