At Ascend, we are excited to introduce a new paradigm for managing the development lifecycle of data products — Declarative Pipeline Workflows. Keying off the movement toward declarative descriptions in the DevOps community, and leveraging Ascend’s Dataflow Control Plane, Declarative Pipeline Workflows are a powerful tool that allows data engineers to develop, test, and productionize data pipelines with an agility and stability that has so far been lacking in the DataOps world.
In this blog post, we will look at three scenarios that have traditionally been very challenging for DataOps practitioners and demonstrate how Declarative Pipeline Workflows solve these problems.
Scenario 1: Deploy a Data Product to Production
Like software products, data products are developed in isolation from the production systems exposed to downstream consumers, in order to preserve the stability of those systems. The “push to production” for data products is much more burdensome, however. Not only is code being pushed, but the data generated by that code must also be updated. Doing this without disrupting production is difficult.
Challenge: Managing the sequence of events required to modify a large number of big data jobs and configure them to run in the proper sequence is traditionally a difficult process, involving hundreds to thousands of lines of code and considerable manual oversight. Every code change potentially affects the global configuration of the production environment, and each new deploy requires that new dependencies, schedules, storage management, and failure conditions be identified and accounted for.
With Declarative Pipeline Workflows and Ascend’s Dataflow Control Plane, data engineers need only run a few shell commands to move an entire Data Service from staging into production. The entire system is described declaratively in YAML files, organized hierarchically. These files can be committed to a Git repository and controlled using the familiar patterns from DevOps: branching, merging, pull requests, and code reviews. Because the files are declarative, they contain no orchestration code. This simplifies reviewing changes, since every change to the files is a logic change: reviewers do not need to sift through lines of imperative orchestration code to find the relevant changes.
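As a concrete illustration, a single component of a Dataflow might be described in a file like the one below. The schema and field names here are a hypothetical sketch for illustration only, not Ascend’s exact YAML format:
# Df1/clean_events.yaml (hypothetical schema, for illustration only)
kind: transform
name: clean_events
inputs:
  - raw_events
sql: |
  SELECT event_id, user_id, event_time
  FROM raw_events
  WHERE event_time IS NOT NULL
The file describes only what the data should be; there is no scheduling, retry, or dependency-ordering code for a reviewer to wade through.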
Once a production branch is prepared in Git, a single ascend apply command communicates with the Ascend Dataflow Control Plane and handles all the reconfiguration and rescheduling of jobs to “make it so”. As part of this process, a configuration file can also be read, and production parameters interpolated into the YAML via Jinja templates. The same set of files can be applied, using a different configuration file, to another location in Ascend, making it simple to set up independent data services for Development, Staging, and Production, all the while leveraging the Dataflow Control Plane to optimize the computations.
==> git checkout production
# push the code on this branch to the production Data Service in Ascend
==> ascend apply --recursive --config production.yaml \
--input ./Service1 Service1_Prod
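The configuration file passed with --config is where environment-specific parameters live. As an illustrative sketch (the parameter names and the templated field below are assumptions, not Ascend’s documented format), production.yaml might supply values that a dataflow YAML references through Jinja:
# production.yaml (illustrative parameters only)
output_location: s3://acme-prod/service1
refresh_schedule: hourly
# referenced from a dataflow YAML via Jinja templating
location: "{{ output_location }}/final"
Applying the same files with a staging configuration instead interpolates staging values, producing an independent but structurally identical Data Service.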
Challenge: Deploying new logic to a data product in production is traditionally slow and expensive, because there are a large number of big data jobs that need to be run to recompute data in the production environment, even though the same (or nearly the same) data has already been calculated at least once before during development and staging.
With data products, it’s not just code that gets pushed to production, but all the data, too, and simultaneously. Since it’s so difficult to determine the state of all that data, and which portions of the data are affected by the code changes being pushed, standard practice is to recompute all data in the production environment.
Using Declarative Pipeline Workflows, these data dependencies are automatically handled by the Dataflow Control Plane. Moreover, for computation that has already been completed and tested in the staging environment, all intermediate datasets will be intelligently reused for production. Ascend will intelligently reprocess only the datasets that are different due to configuration parameters, such as storing the final dataset into a production data location.
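As a sketch of this promotion path (the staging instance name and staging.yaml config file are assumptions for illustration), moving a tested branch from staging to production is simply a second apply with different parameters:
# deploy the branch to the staging Data Service and validate there
==> ascend apply --recursive --config staging.yaml \
--input ./Service1 Service1_Staging
# promote the same files to production; intermediate datasets that are
# unchanged between the two environments are reused, not recomputed
==> ascend apply --recursive --config production.yaml \
--input ./Service1 Service1_Prod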
During this process, and after the new production code goes live, Ascend continues to automatically:
- generate the Spark jobs based on the SQL or PySpark logic provided by the users
- optimize the Spark jobs, calculating how many executors are needed and how large each executor should be
- monitor the Spark jobs until they complete, and recover with best effort when any failures are detected
Scenario 2: CI/CD for a Data Product
Continuous Integration and Deployment (CI/CD) is a well-entrenched practice for software products, but it is rarely applied to data products. If CI/CD were practical for data products, data engineers would be able to confidently push incremental changes without worrying about downstream data integrity.
Challenge: Running CI/CD for data products has traditionally been viewed as impractical to the point of being impossible—nobody is prepared to automatically apply each logic change to the full set of historic production data, and then update production consumers with the latest results. Not only is it infeasible, it’s considered too risky due to the possibility of cascading failures or the generation of bad data that may pollute multiple downstream consumers.
With Ascend Declarative Pipeline Workflows, frequent integration deploys can be incorporated into the development life cycle: Ascend automatically runs each commit against the designated scope of data, generates new data results, and executes all predefined data validations. This workflow requires deploying multiple instances of a pipeline and maintaining the appropriate live data for each, a task the Ascend Dataflow Control Plane is uniquely capable of performing. Whereas the compute and storage costs of maintaining completely independent replicas of pipelines would be prohibitive, Ascend keeps the instances logically separate while sharing storage and amortizing compute costs between them.
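As a sketch of what the CI job itself might run (the Service1_CI instance name and ci.yaml config file are assumptions for illustration), each commit merged to master can simply be applied to a dedicated CI instance of the Data Service:
# run by the CI system on every commit merged to master
==> git checkout master
==> git pull
==> ascend apply --recursive --config ci.yaml \
--input ./Service1 Service1_CI
Ascend then recomputes only the datasets affected by the commit and executes the predefined data validations against the designated scope of data.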
Ascend also makes it easy for individual developers to maintain their own copy of an entire Data Service, allowing them to develop in isolation using the “latest and greatest” code and data. With CI/CD in place on the master branch, all developers have instant access to the current state of the product.
Here is an example of how a data engineer can use Declarative Pipeline Workflows to set up a personal instance of the pipeline in Ascend to test a change before submitting a PR to the CI/CD system.
==> git checkout -b ken/dev
# make sure my personal copy of Service1 is current in Ascend
==> ascend apply --recursive --input ./Service1 Service1_ken
==> cd Service1
==> emacs Df1/*.yaml
# deploy my changes to my personal copy of Service1 in Ascend
==> ascend apply --recursive --input ./Df1 Service1_ken.Df1
==> git add Df1
==> git commit
==> git push origin ken/dev
While CI/CD aims to keep errors from being pushed to production, sometimes, despite our best efforts, we don’t recognize an error until it has been rolled out. We then have to decide either to roll back to the last known good version or to fix the problem in the current version (roll forward). In software products, reverting code is such an easy and safe proposition that the immediate response to an error escaping into production is usually to roll back and buy the team some time to analyze and fix the problem. With data products, rollback is such a daunting proposition that data engineering teams are forced to work on the problem while the error is still impacting production, which puts them under tremendous pressure.
Scenario 3: Production Rollback
Challenge: Traditionally, it’s expensive, time-consuming, and sometimes impossible to roll back changes deployed to a production data product, because the previous version of the production results may not be directly recoverable even if it was backed up manually. Incorrectly computed data usually has cascading effects, and it can be difficult to prove which data is unaffected, so a global rollback is often required. A rollback usually means rerunning a set of expensive and likely slow big data jobs, and as a result most data teams would rather live with long release cycles for new features in production data products.
With Declarative Pipeline Workflows, Ascend allows users to roll back both code and data changes in a synchronized manner. When the code is reverted, Ascend automatically tracks data lineage across the entire system to determine which data was affected. The Dataflow Control Plane then intelligently identifies the data fragments associated with the previous version of the code and resurfaces those to downstream data consumers.
Here’s an example of using Declarative Pipeline Workflows to revert a code change that has already been deployed to production:
# revert the offending commit on the production branch
==> git revert <commit>
==> ascend apply --recursive --config production.yaml \
--input ./Service1 Service1_Prod