Data pipelines are in high demand in today’s data-driven organizations. As critical elements in supplying trusted, curated, and usable data for end-to-end analytic and machine learning workflows, data pipelines are becoming indispensable. To keep up, they are being vigorously reshaped with modern tools and techniques. At Cloudera, we recently introduced several cutting-edge innovations in our Cloudera Data Engineering experience (CDE), part of our Enterprise Data Cloud product, Cloudera Data Platform (CDP), to serve these growing demands.
In this three-part blog series, we will outline key elements of our state-of-the-art CDE service – covering motivations (in Part 1), key capabilities (in Part 2), and a step-by-step how-to guide (in Part 3).
As data pipelines rapidly grow in complexity, scale, and scope, the burden of keeping up and staying agile falls on the strength and versatility of the solutions that power these pipelines. Most data pipelines deployed in production suffer from one or more of the following shortcomings:
Often, these can be traced back to weaknesses in the underlying data engineering solution architectures, which have become archaic for modern data pipelines and pose a perennial problem for data architects, data engineers, and data administrators. The problem becomes especially acute as the downstream consumers of these pipelines multiply, feeding the likes of data warehouses and machine learning practitioners.
Furthermore, the need for a robust data engineering solution architecture comes to a head when viewed through the lens of the Lines of Business (LOBs) that use data pipelines as part of the end-to-end workflows that feed their use cases. In the most common scenario, data is ingested into object stores in the cloud from myriad sources and then curated (formatted, corrected, transformed), optimized (structured for specific needs), and orchestrated (sequenced, managed) in a timely manner to feed the downstream LOB use cases. Today’s enterprises are required to ingest, prepare, and deliver data faster than ever. Because of this, automated, intelligent, and reliable data engineering workflows are key to ensuring a robust end-to-end workflow.
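As a rough illustration of the curate, optimize, and orchestrate steps described above, here is a minimal Python sketch. The function names and the sample records are hypothetical and are not part of CDE or any Cloudera API; in practice these stages would typically run on an engine such as Apache Spark and be sequenced by an orchestrator such as Apache Airflow.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Stage = Callable[[List[Record]], List[Record]]


def curate(records: List[Record]) -> List[Record]:
    """Format and correct: trim whitespace and drop records with no id."""
    cleaned = []
    for record in records:
        trimmed = {key: value.strip() for key, value in record.items()}
        if trimmed.get("id"):
            cleaned.append(trimmed)
    return cleaned


def optimize(records: List[Record]) -> List[Record]:
    """Structure for a specific need: here, simply order records by id."""
    return sorted(records, key=lambda record: record["id"])


def orchestrate(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Sequence the stages: each stage receives the previous stage's output."""
    for stage in stages:
        records = stage(records)
    return records


# Hypothetical raw records, standing in for data landed in an object store.
raw = [
    {"id": " 2 ", "region": "emea "},
    {"id": "", "region": "apac"},
    {"id": "1", "region": " amer"},
]

result = orchestrate(raw, [curate, optimize])
print(result)
```

The point of the sketch is only the shape of the workflow: each stage is an independent step, and the orchestration layer decides the order in which they run.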
CDE is the only cloud-native service purpose-built for enterprise data engineering teams who are tasked with crafting complex yet reliable data pipelines at scale and across many LOBs. CDE is an all-inclusive data engineering toolset that enables orchestration automation, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.
With CDE, we have specifically addressed the shortcomings highlighted earlier, especially in the context of end-to-end workflows, seamlessly integrating tools such as Apache Spark, Apache Hive, Apache Airflow, and Apache Atlas to enable:
Unlike other software products in the market that have taken a fragmented approach towards data engineering, Cloudera is taking a more integrative approach. With CDE, we are satisfying the demand for modernizing critical data pipelines not as isolated data processing, but as part of end-to-end workflows that power LOB use cases.
In the next blog in this series (Part 2), we will explore in detail the key capabilities of the Airflow-orchestrated CDE solution and highlight their value in modernizing data pipelines.
To learn more about leveraging data engineering for analytics success, download the Taking Your Data Lifecycle to the Next Level eBook.
The post Modernizing Data Pipelines using Cloudera Data Platform – Part 1 appeared first on Cloudera Blog.