The CDP Operational Database (COD) builds on the foundation of existing operational database capabilities that were available with Apache HBase and/or Apache Phoenix in legacy CDH and HDP deployments. Within the context of a broader data and analytics platform implemented in the Cloudera Data Platform (CDP), COD will function both as a highly scalable relational and non-relational transactional database, allowing users to leverage big data in operational applications, and as the backbone of the analytical ecosystem, leveraged by other CDP experiences (e.g., Cloudera Machine Learning or Cloudera Data Warehouse) to deliver fast data and analytics to downstream components. Compared to legacy Apache HBase or Phoenix implementations, COD has been architected to enable organizations to optimize infrastructure costs, streamline the application development lifecycle and accelerate time to value.
The intent of this article is to demonstrate the value proposition of COD as a multi-modal operational database capability over legacy HBase deployments across three value areas:
The sections that follow dive into the technology capabilities of COD and, more broadly, the Cloudera Data Platform that deliver these value propositions.
There are two major drivers of technology cost optimization with COD:
The cloud-native consumption model delivers lower cloud infrastructure TCO versus both on-premises and IaaS deployments of Apache HBase by employing a) elastic compute resources b) cloud-native design patterns for high-availability and c) cost efficient object storage as the primary storage layer.
As a cloud native offering, COD uses a pricing model that comprises Cloud Consumption Units (CCUs). Spend based on CCUs depends on actual usage of the platform, as COD invokes compute resources dynamically based on read / write usage patterns and releases them automatically when usage declines. Consequently, cost is commensurate to business value derived from the platform and organizations will avoid high CapEx outlays, prolonged procurement cycles and significant administrative effort to meet future capacity needs.
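The difference between paying for peak capacity and paying for actual usage can be illustrated with a minimal sketch. It does not model actual CCU pricing: the hourly demand profile, the flat rate per node-hour and the function names are all hypothetical, chosen only to show why usage-based spend falls below fixed provisioning when load varies.

```python
from typing import Sequence


def provisioned_cost(hourly_demand: Sequence[int], rate_per_node_hour: float) -> float:
    """Cost when capacity is fixed at peak demand for every hour (always-on model)."""
    peak_nodes = max(hourly_demand)
    return peak_nodes * rate_per_node_hour * len(hourly_demand)


def consumption_cost(
    hourly_demand: Sequence[int], rate_per_node_hour: float, min_nodes: int = 1
) -> float:
    """Cost when capacity follows demand hour by hour, never dropping below min_nodes."""
    billed_node_hours = sum(max(nodes, min_nodes) for nodes in hourly_demand)
    return billed_node_hours * rate_per_node_hour


# Hypothetical 24-hour profile: quiet night, busy day, moderate evening.
demand = [2] * 8 + [6] * 8 + [3] * 8
rate = 1.0  # hypothetical price per node-hour, not a real CCU rate

fixed = provisioned_cost(demand, rate)
elastic = consumption_cost(demand, rate)
saving_fraction = 1 - elastic / fixed

print(f"fixed={fixed:.0f} elastic={elastic:.0f} saving={saving_fraction:.1%}")
```

In this illustrative profile the elastic model bills 88 node-hours against 144 for peak provisioning; the flatter the real load, the smaller that gap becomes.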
To avoid duplication of compute resources in high availability (HA) deployments, COD has adopted vendor-specific cloud-native design patterns (e.g., AWS and Azure standards), reducing the cost, complexity and associated risk-mitigation effort of HA scenarios:
That type of architecture results in consolidation of compute and storage resources by up to a factor of 6 (moving to COD from an HA based IaaS model) reducing associated cloud infrastructure costs.
Before we delve into the topic of storage, however, we will quantify compute savings over the lift-and-shift deployment model by conducting a sensitivity analysis across different combinations of factors that drive the variation in cost savings on a node instance basis. These factors include current environment utilization, deployment region (which influences compute unit costs by cloud provider), the type of instances used in the IaaS deployment etc.
To quantify the savings opportunity on AWS, we compared the annual costs of a highly available IaaS deployment (dual availability zone configuration) across all supported COD regions and for three different 'HDFS capacity overhead' scenarios reflecting the low, mid and high end of that overhead, which corresponds to the incremental compute deployed over and above the nodes required by the Apache HBase and / or Phoenix storage footprint:
The chart above presents the average annual cost savings potential per Apache HBase node deployed in a highly available IaaS deployment for a range of node utilization scenarios between 25% and 60%, which we have observed in most client environments. The cost comparison was conducted using list EC2 pricing for 3-Year (All Upfront reserved) RHEL instances, comparing five instance types that are commonly used in IaaS scenarios against the i3.2xlarge instance used by COD on AWS. As we can see from the chart, organizations should expect to see annual savings in the range of $12K-$40K on a node basis for most instance types used in IaaS deployments.
Similarly, in the case of Azure, the annual savings opportunity was estimated by employing a scenario-based approach, using analogous assumptions based on Azure-specific characteristics, available virtual machines and compute billing types. For instance, we are using the D8 v3 instance type for COD workloads on Azure and we calculated the savings opportunity based on 1-year reserved pricing for RHEL instances, since Azure doesn’t offer the 3-year reserved pricing billing type for most of the regions where RHEL-based Virtual Machines are available:
When it comes to storage, COD takes advantage of cloud-native capabilities for data storage by:
To quantify the range of benefits for storage when moving from an HA IaaS deployment to COD in the Public Cloud, we will consider the same scenario as above: an HA deployment with the dual site configuration and a 3x data replication factor. In addition, we have assumed an HDFS buffer of ~25% (incremental storage capacity to accommodate storage consumption growth without manually scaling the cluster):
The violin plot above illustrates the distribution of storage savings on a per-TB basis for three SSD storage types used in most IaaS implementations across different regions where COD is available. The dots in the chart correspond to the different deployment regions and, as the plot suggests, clients should typically expect to see savings between 85% – 95% on the total storage bill.
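The scenario assumptions above (dual site, 3x replication, ~25% buffer) can be turned into a rough capacity calculation. This is a sketch under stated assumptions rather than the method behind the plot: it treats the buffer as 25% added on top of the replicated data, and it assumes object storage is billed on a single logical copy. It covers capacity only and ignores per-TB price differences between storage types and regions.

```python
def raw_capacity_tb(
    logical_tb: float,
    sites: int = 2,          # dual site HA configuration
    replication: int = 3,    # 3x data replication factor
    buffer: float = 0.25,    # ~25% growth buffer, assumed to be added on top
) -> float:
    """Block storage that must be provisioned to hold logical_tb of data."""
    return logical_tb * sites * replication * (1 + buffer)


def footprint_reduction(logical_tb: float = 1.0) -> float:
    """Fraction of provisioned capacity avoided if only one logical copy is billed.

    Assumption: the object store bills for logical_tb once; its internal
    durability mechanisms are not charged as extra capacity.
    """
    return 1 - logical_tb / raw_capacity_tb(logical_tb)


provisioned = raw_capacity_tb(1.0)
reduction = footprint_reduction(1.0)

print(f"{provisioned} TB provisioned per logical TB, {reduction:.1%} less capacity billed")
```

Under these assumptions each logical TB requires 7.5 TB of provisioned block storage in the HA IaaS scenario, so capacity consolidation alone already accounts for a large share of the difference, before any unit-price effect.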
The migration from previous versions of Apache HBase to version 2.2.x included in CDP PvC and CDP Public Cloud will also deliver substantial performance improvements that will translate into infrastructure cost savings / avoidance (e.g., avoidance of further OpEx / CapEx for use case growth). For example, in a recent performance comparison between CDH 5 and CDP 7, workload performance was up to 20% better on CDP 7 based on the YCSB benchmark:
In addition, CDP 7 with JDK 11 delivered 5-10% better performance in the YCSB benchmark than CDP 7 with JDK 8:
In the section above, we presented in detail the potential for optimizing infrastructure costs (both on-premises and in the cloud) by migrating a CDH or HDP deployment of Apache HBase and / or Apache Phoenix to COD, the cloud native experience of the Cloudera Data Platform for operational database workloads:
Operational efficiency is the value area where COD delivers the greatest improvement, and spans across all operational domains, including database management and administration and application development activities:
The sections below drill down into the specific capabilities that accelerate different data lifecycle activities:
Platform management streamlines activities related to initial environment build-out, ongoing management and issue resolution. The major capabilities that improve the day-to-day tasks of a platform / database administrator include the following:
When it comes to Security and Governance, COD leverages capabilities available with the Shared Data Experience (SDX) to streamline authorization, authentication and auditing across all Cloudera experiences:
In addition to the database / platform management efficiencies introduced previously, COD delivers additional capabilities that improve the DevOps lifecycle:
Based on the framework above and the empirical evidence from successful COD implementations, we expect to see the following operational benefits throughout the application development lifecycle:
The metrics above correspond to the efficiency delivered with COD by migrating an existing Apache HBase and / or Apache Phoenix implementation that has been deployed on-premises or retrofitted to run in the Public Cloud as an IaaS deployment with CDH / HDP. The ranges reflect different environment configurations / levels of maturity that will determine the level of benefits introduced with COD. Those parameters include, for example:
Environment complexity in terms of different clusters / environments, number of technical use cases intertwined together (i.e., Apache HBase, Store and Spark) etc. In general, the more complex the current CDH / HDP environment is, the greater the improvement potential given the improved automation that COD delivers (thus reducing manual and repetitive steps across multiple environments) and the greater simplicity in scaling and tuning separate CDP data experiences (that the technical use cases currently deployed would be converted to).
Baseline Environment Performance given the current read / write workload pattern. Organizations that have historically faced challenges with read-heavy and write-heavy consumption patterns (e.g., large backlogs of incoming data or regionserver hotspotting that could cause instability to the environment) would benefit the most, given the increased automation and self-tuning / self-healing capabilities that we have introduced with the Cloudera Operational Database.
Internal Technical Expertise: Existing users that have deployed Apache HBase and / or Apache Phoenix but lack the internal expertise required to scale their existing deployment will find that COD removes that adoption barrier by simplifying deployment of more complex environments. That is because COD requires less expertise / effort to deploy and manage more complex Apache HBase and / or Apache Phoenix use cases. That improvement applies to all stakeholders involved in such a deployment: Platform Engineers, Database Administrators and Application Developers. The latter group benefits the most from the enriched developer toolset that includes ANSI SQL support, which makes writing applications easier for Software Engineers familiar with RDBMS app development concepts and programming languages.
Ultimately, the level of operational improvements will vary by client. However, efficiencies will apply both to mature, large scale implementations of Apache HBase and / or Apache Phoenix, which will benefit from improved complexity management and automated issue resolution, and to smaller, emerging deployments, where organizations will be able to use familiar concepts to build enterprise-grade applications without the configuration and scalability challenges of the past (e.g., capacity projections, environment sizing and tuning).
The underlying motivation behind the evolution of the Operational Database was to develop a modern multi-modal dbPaaS offering that improves agility and simplicity, eliminating the need for the complex management and tuning required for HBase. As a consequence, COD enables faster revenue realization for new revenue streams and de-risks revenue realization for existing ones.
In the sections above, we outlined the value proposition of COD over legacy Apache HBase deployments on CDH and HDP across value and technology areas:
To learn more about the technology capabilities that we have added to COD, please refer to some of the more technical blogs, such as those on distributed transaction support and performance configurations. Further reading on some of the CDP capabilities such as data exploration, security automation using Ranger and automated TLS management will provide greater insights into platform ecosystem improvements.
The Value Management team can help you quantify the value of migrating your on-prem or IaaS environments to CDP Public Cloud.
The authors would like to thank Mike Forrest, who helped with the arduous task of collecting AWS pricing metrics.
The post Value Proposition of the Cloudera Operational Database over Legacy Apache HBase Deployments appeared first on Cloudera Blog.