I went for a hike the other day. As much as I enjoy nature, I’m a geek at heart, so I naturally started thinking about the various levels of awareness that data tooling has. Simple scripts have almost no awareness: they can connect to data and execute some logic, but they catalog no usable metadata. Scripts are only “code aware.”
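To make that concrete, here is a minimal sketch of the kind of “code aware” script I have in mind. The file name and column are hypothetical; the point is that the script runs its logic and records nothing about what it touched, when, or why.

```python
# A hypothetical "code aware" script: it can read data and apply logic,
# but it keeps no metadata about what it processed, when, or why.
import csv

def load_and_total(path: str) -> float:
    """Sum the (hypothetical) 'amount' column from a CSV file."""
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["amount"])
    return total

if __name__ == "__main__":
    # The script has no idea whether orders.csv changed since the last run.
    print(load_and_total("orders.csv"))
```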
A script doesn’t know about scheduling unless you configure a cron job, or another scheduler, to kick off the script. The cron job doesn’t understand the code or the changes to the input data. These configurations are “calendar aware.”
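For illustration, the “calendar aware” layer is often nothing more than a single crontab entry like the hypothetical one below. It fires on schedule whether or not anything upstream actually changed.

```
# Hypothetical crontab entry: run the script at 2:00 AM every day,
# regardless of whether the input data changed.
0 2 * * * /usr/bin/python3 /opt/jobs/load_orders.py
```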
Event-driven implementations might wrap the code in a lambda function that listens for changes to an object store and executes after object updates are complete. These implementations tend to be more efficient because they run only when needed. Of course, if a new version of an object arrives that is identical to the previous version, the lambda will happily process it anyway. Software like this is “event aware.”
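As a rough sketch of the “event aware” pattern, an AWS-style handler reacting to S3 object-created notifications might look like this. The processing step is hypothetical, and notice that nothing checks whether the new object actually differs from the old one.

```python
# Sketch of an "event aware" handler (AWS Lambda-style), assuming S3
# object-created notifications. It fires on every upload, even if the
# new object is byte-for-byte identical to the previous version.
import json

def process_object(bucket: str, key: str) -> None:
    # Hypothetical processing step: the handler has no memory of earlier
    # versions, so identical re-uploads are reprocessed.
    print(f"Processing s3://{bucket}/{key}")

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        process_object(bucket, key)
    return {"statusCode": 200, "body": json.dumps("ok")}
```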
In real-world implementations, there are dozens, perhaps thousands, of these kinds of workloads, so companies start to invest in more sophisticated tools. The most mature ETL/ELT tools store metadata about the processes: which jobs to run in what order, how long to wait for a job to complete, and who to notify about errors. This information is needed, for example, to manage traditional databases, where dimension loads must complete before facts are added or the load risks relational errors. Introducing capabilities like these makes the software “process aware.”
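A toy sketch of what “process aware” metadata buys you might look like the following. The job names, timeouts, and notification addresses are made up; the point is that the orchestration layer, not the transformation code, knows that dimensions load before facts.

```python
# A toy sketch of "process aware" metadata: the orchestration layer, not the
# transformation code, knows run order, timeouts, and who to page on failure.
# Job names, timeouts, and emails are hypothetical.
from graphlib import TopologicalSorter

JOBS = {
    "load_dim_customer": {"depends_on": [], "timeout_min": 30, "notify": "dw-team@example.com"},
    "load_dim_product":  {"depends_on": [], "timeout_min": 30, "notify": "dw-team@example.com"},
    # Facts wait for dimensions so foreign keys resolve correctly.
    "load_fact_sales":   {"depends_on": ["load_dim_customer", "load_dim_product"],
                          "timeout_min": 90, "notify": "oncall@example.com"},
}

def run_order(jobs: dict) -> list[str]:
    """Return a valid execution order derived purely from the metadata."""
    ts = TopologicalSorter({name: meta["depends_on"] for name, meta in jobs.items()})
    return list(ts.static_order())

if __name__ == "__main__":
    print(run_order(JOBS))  # dimensions first, then the fact load
```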
As I balanced on river rocks while navigating across a creek, I started wondering: Is “process aware” the most mature automation awareness? I’ve personally sat through countless conversations about when and where to restart the loading process to correct quality issues for a few records. Truncating and restarting a failed transformation is the low-risk “process aware” method to reprocess the entire dataset. Solutions for data quality management, missing records, late records, duplicate records, and more can certainly be engineered into a “process aware” data flow. But is this how I want to allocate precious engineering time? What needs to be true for these problems to be solved out of the box by software that is more than simply “process aware”? Would a higher level of awareness shift engineering work away from manual patching and toward business value generation?
The lightbulb came on while I was untying my hiking boots. What if the automation understood deeper details, even record-level details, about the data it is managing? What problems would be solved if the system knew whether changes to either the code or the input data would affect the outcome? Could the system leverage knowledge of the data to touch only the data that needs to be touched, and simplify scheduling along the way? Answering these questions would require the automation to become “data aware.”
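Here is a rough sketch of the kind of bookkeeping a “data aware” system might do internally, assuming it fingerprints the code and each input partition and reprocesses only what changed. The structure and function names are hypothetical, not a description of any particular product.

```python
# Rough sketch of "data aware" bookkeeping: fingerprint the code and each
# input partition, and only reprocess partitions whose fingerprint changed.
# The storage layout and function names are hypothetical.
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def plan_work(code: bytes, partitions: dict[str, bytes], state: dict[str, str]) -> list[str]:
    """Return the partitions that need to run, updating recorded fingerprints."""
    code_fp = fingerprint(code)
    code_changed = state.get("__code__") != code_fp
    state["__code__"] = code_fp

    to_run = []
    for name, payload in partitions.items():
        fp = fingerprint(payload)
        if code_changed or state.get(name) != fp:
            to_run.append(name)   # logic or data changed: reprocess
        state[name] = fp          # remember what we saw
    return to_run

if __name__ == "__main__":
    state: dict[str, str] = {}
    first = plan_work(b"v1 of the transform", {"2024-01-01": b"rows A", "2024-01-02": b"rows B"}, state)
    second = plan_work(b"v1 of the transform", {"2024-01-01": b"rows A", "2024-01-02": b"rows B plus late arrivals"}, state)
    print(first)   # both partitions on the first run
    print(second)  # only the partition whose data changed
```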
Fortunately, a self-managed “data aware” automation platform exists! Schedule some demo time to see how data awareness drives value with out-of-the-box functionality.