Many enterprise data teams choose Ascend for its ease of use, single pane of glass, and high levels of automation. The data engineers on those teams are often pleasantly surprised when they discover the powerful partitioning capabilities of the platform, and the precise level of control they have over how Ascend handles data. Let’s take a closer look at these capabilities.
What is data partitioning?
In this age of big data, it is not uncommon to deal with billions of records and terabytes of data. Processing such volumes requires smart approaches to segmenting the data into manageable subsets in order to:
- Distribute and manage workloads over time and resources
- Apply processing techniques that save time and money
- Avoid painful failures when large datasets exceed the limits of the underlying technology
Partitioning creates subsets of the data that can greatly improve performance and reduce costs in that they:
- Can be stored on distributed file systems
- Can be processed in parallel across multiple servers or nodes
- Can be processed individually, as subsets of the overall data
- Can be handled by smaller and cheaper compute resources
- Can be operated on in bulk: merged, moved, indexed, or deleted
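To make the benefits above concrete, here is a minimal Python sketch of splitting records into partitions and processing each partition independently. It is a generic illustration, not Ascend code: the function names (`partition_by_key`, `process_in_parallel`, `total_amount`) and the sample records are assumptions made up for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition_by_key(records, key, num_partitions):
    """Assign each record to a partition using a stable hash of its key."""
    partitions = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        index = zlib.crc32(str(record[key]).encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return dict(partitions)


def total_amount(partition):
    """Example per-partition work: sum one numeric field."""
    return sum(record["amount"] for record in partition)


def process_in_parallel(partitions, worker, max_workers=4):
    """Run the worker on every partition independently, keyed by partition id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pid: pool.submit(worker, part) for pid, part in partitions.items()}
        return {pid: future.result() for pid, future in futures.items()}


# Hypothetical sample data: 100 records spread over 5 customers
records = [{"customer": f"c{i % 5}", "amount": i} for i in range(100)]
partitions = partition_by_key(records, "customer", num_partitions=3)
partial_sums = process_in_parallel(partitions, total_amount)
grand_total = sum(partial_sums.values())
```

Because each partition is self-contained, the per-partition results can be computed on separate workers and combined afterwards. A thread pool is used here only to keep the sketch short; a real workload would more likely distribute partitions across processes or nodes.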
Common partitioning techniques include:
Why does data partitioning matter in Ascend?
Modern data clouds have removed the need for you to explicitly partition data for performance reasons. Similarly, the Ascend platform has harnessed partitioning techniques to achieve scalability that is unmatched by conventional ETL/ELT tools. With Ascend’s DataAware™ technology, data is partitioned in order to reduce the number of times it needs to be processed over the lifespan of the pipeline.
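The idea of processing each partition as few times as possible can be sketched generically. The Python example below is not how DataAware is implemented; it is an assumed, simplified model in which each partition gets a fingerprint of its data plus the code version, and a partition is recomputed only when that fingerprint changes. All names (`fingerprint`, `run_pipeline`, `count_rows`, the `cache` layout) are hypothetical.

```python
import hashlib
import json


def fingerprint(rows, code_version):
    """Hash a partition's contents together with the version of the transform code."""
    payload = json.dumps(rows, sort_keys=True) + "|" + code_version
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(partitions, transform, code_version, cache):
    """Recompute only partitions whose fingerprint differs from the cached one.

    cache maps partition id -> (fingerprint, result) and is updated in place.
    Returns the results for all partitions and the ids that were recomputed.
    """
    results, recomputed = {}, []
    for pid, rows in partitions.items():
        fp = fingerprint(rows, code_version)
        cached = cache.get(pid)
        if cached is not None and cached[0] == fp:
            results[pid] = cached[1]  # unchanged data and code: reuse prior result
        else:
            results[pid] = transform(rows)
            cache[pid] = (fp, results[pid])
            recomputed.append(pid)
    return results, recomputed


def count_rows(rows):
    """Stand-in transform for the example."""
    return len(rows)


cache = {}
partitions = {"2024-01": [{"id": 1}, {"id": 2}], "2024-02": [{"id": 3}]}

_, first_run = run_pipeline(partitions, count_rows, "v1", cache)

partitions["2024-02"].append({"id": 4})  # new data lands in one partition only
results, second_run = run_pipeline(partitions, count_rows, "v1", cache)

_, third_run = run_pipeline(partitions, count_rows, "v2", cache)  # code change
```

In this sketch the first run processes both partitions, the second run processes only the partition that received new data, and the third run reprocesses everything because the code version changed.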
When developing your pipelines in Ascend, you have direct access to these capabilities and can use them to tune your intelligent data pipelines. You can control how data is processed to tune for performance, or rely on the defaults built into the platform, which are already optimized for the most common use cases.
What are the best strategies for partitioning in Ascend?
The primary purpose of partitioning Ascend data pipelines is to reduce the re-processing of data as changes happen to either the code or the data. There are two primary strategies to partition data:
- When the data is a time series: Range partitioning is particularly efficient because changes are applied only to the most recent partitions, which are still in active use.
- When the data contains specific objects: Horizontal “object”-based partitioning can dramatically reduce resource usage in subsequent transformation processing.
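The two strategies above can be illustrated with a small, generic Python sketch. It does not use any Ascend API; the helper names (`range_partition_by_month`, `object_partition`, `touched_partitions`) and the sample events are assumptions for illustration. The point is that a late-arriving record touches only one partition under either scheme, so only that partition needs reprocessing.

```python
from collections import defaultdict
from datetime import date


def range_partition_by_month(events):
    """Time-series strategy: group events into one partition per calendar month."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event["day"].strftime("%Y-%m")].append(event)
    return dict(partitions)


def object_partition(events, key):
    """Object-based strategy: group events by the object they belong to."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event[key]].append(event)
    return dict(partitions)


def touched_partitions(new_events, partition_fn):
    """Return the ids of the partitions that a batch of new events would change."""
    return sorted(partition_fn(new_events).keys())


# Hypothetical sample data
events = [
    {"day": date(2024, 1, 5), "device": "a", "value": 1},
    {"day": date(2024, 1, 20), "device": "b", "value": 2},
    {"day": date(2024, 2, 3), "device": "a", "value": 3},
    {"day": date(2024, 3, 9), "device": "c", "value": 4},
]
by_month = range_partition_by_month(events)
by_device = object_partition(events, "device")

# A new record arrives for the most recent month
late_arrivals = [{"day": date(2024, 3, 10), "device": "a", "value": 5}]
months_to_refresh = touched_partitions(late_arrivals, range_partition_by_month)
devices_to_refresh = touched_partitions(
    late_arrivals, lambda rows: object_partition(rows, "device")
)
```

Here only the `2024-03` month partition, or only the partition for device `a`, would need to be refreshed; the older months and the other devices are left alone.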
How to Do Partitioning in Ascend
On Ascend, data engineers can adapt partitioning strategies as data flows through their intelligent pipelines. You can change the partitioning at different points along the way:
- At the entry point of data: Ascend Read Connectors will apply your partitioning preference automatically as data enters your intelligent pipelines.
- During transformation steps: Ascend automatically provides a default “after-state” partitioning strategy for you, which you can override with your own configuration for special cases.
- At output: Ascend Write Connectors automatically choose between time series and full reduction partitions in order to minimize the rewriting of data to output systems whenever possible. You can override this behavior with your own configuration.
Data Partitioning Still Matters
While modern data platforms have largely eliminated the need to partition data for performance, partitioning data in pipelines remains a powerful tool for reducing the amount of reprocessing triggered by changing data and revisions to pipeline logic. The Ascend data pipeline automation platform saves you money by automatically applying partitioning to all of your data. In addition, the platform puts several partitioning strategies at your fingertips to tune and optimize your pipelines further.