We build AI software in two modes: experimentation and productization. During experimentation, we are trying to see if modern technology will solve our problem. If it does, we move on to productization and build reliable data pipelines at scale.
This presents a cyclical dependency when it comes to data engineering. We need reliable and maintainable data engineering pipelines during experimentation, but don’t know what that pipeline should do until after we’ve completed the experiments. In the past, I and many data scientists I know have used an ad-hoc combination of bash scripts and Jupyter Notebooks to wrangle experimental data. While this may have been the fastest way to get experimental results and model building, it’s really a technical debt that has to be paid down the road.
Specifically, the ad-hoc approach to experimental data pipelines causes pain points around:
- Reproducibility: Ad-hoc experimentation structures puts you at risk of making results that others can’t reproduce, which can lead to product downtime if or when you need to update your approach. Simple mistakes like executing a notebook cell twice or forgetting to seed a random number generator can usually be caught. But other, more insidious problems can occur, such as behavior changes between dependency versions.
- Readability: If you’ve ever come across another person’s experimental code, you know it’s hard to find where to start. Even documented projects might just say “run x script, y notebook, etc”, and it’s often unclear where the data come from and if you’re on the right track. Similarly, code reviews for data science projects are often hard to read: it’s asking a lot for a reader to differentiate between notebook code for data manipulation and code for visualization.
- Maintainability: It’s common during data science projects to do some exploratory analysis or generate early results, and then revise how your data is processed or gathered. This becomes difficult and tedious when all of these steps are an unstructured collection of notebooks or scripts. In other words, the pipeline is hard to maintain: updating or changing it requires you to keep track of the whole thing.
- Shareability: Ad-hoc collections of notebooks and bash scripts are also difficult for a team to work on concurrently. Each member has to ensure their notebooks are up to date (version control on notebooks is less than ideal), and that they have the correct copy of any intermediate data.
A lot of the issues above aren’t new to the software engineering discipline and have been largely solved in that space. This is where Kedro comes in. Kedro is a framework for building data engineering pipelines whose structure forces you to follow good software engineering practices. By using Kedro in the experimentation phase of projects, we build maintainable and reproducible data pipelines that produce consistent experimental results.
Specifically, Kedro has you organize your data engineering code into one or more pipelines. Each pipeline consists of a number of nodes: a functional unit that takes some data sets and parameters as inputs and produces new data sets, models, or artifacts.
This simple but strict project structure is augmented by their data catalog: a YAML file that specifies how and where the input and output data sets are to be persisted. The data sets can be stored either locally or in a cloud data storage service such as S3.
I started using Kedro about six months ago, and since then have leveraged it for different experimental data pipelines. Some of these pipelines were for building models that eventually were deployed to production, and some were collaborations with team members. Below, I’ll discuss the good and bad things I’ve found with Kedro and how it helped us create reproducible, maintainable data pipelines.
- Reproducibility: I can’t say enough good things here: they nailed it. Their dependency management took a bit of getting used to but it forces a specific version on all dependencies, which is awesome. Also, the ability to just type
kedro install and
kedro run to execute the whole pipeline is fantastic. You still have to remember to seed random number generators, but even that is easy to remember if you put it in their
- Function Isolation: Kedro’s fixed project structure encourages you to think about what logical steps are necessary for your pipeline, and write a single node for each step. As a result, each node tends to be short (in terms of lines of code) and specific (in terms of logic). This makes each node easy to write, test, and read later on.
- Developer Parallelization: The small nodes also make it easier for developers to work together concurrently. It’s easy to spot nodes that won’t depend on each other, and they can be coded concurrently by different people.
- Intermediate Data: Perhaps my favorite thing about Kedro is the data catalog. Just add the name of an output data set to
catalog.yml and BOOM, it’ll be serialized to disk or your cloud data store. This makes it super easy to build up the pipeline: you work on one node, commit it, execute it, and save the results. It also comes in handy when working on a team. I can run an expensive node on a big GPU machine and save the results to S3, and another team member can simply start from there. It’s all baked in.
- Code Re-usability: I’ll admit I have never re-used a notebook. At best I pulled up an old one to remind myself how I achieved some complex analysis, but even then I had to remember the intricacies of the data. The isolation of nodes, however, makes it easy to re-use them. Also, Kedro’s support for modular pipelines (i.e., packaging a pipeline into a pip package) makes it simple to share common code. We’ve created modular pipelines for common tasks such as image processing.
While Kedro has solved many of the quality challenges in experimental data pipelines, we have noticed a few gotchas that required less than elegant work arounds:
- Incremental Dataset: This support exists for reading data, but it’s lacking for writing datasets. This affected us a few times when we had a node that would take 8-10 hours to run. We lost work if the node failed part of the way through. Similarly, if the result data set didn’t fit in memory, there wasn’t a good way to save incremental results since the writer in Kedro assumes all partitions are in memory. This GitHub issue may fix it if the developers address it, but for now you have to manage partial results on your own.
- Pipeline Growth: Pipelines can quickly get hard to follow since the input and outputs are just named variables that may or may not exist in the data catalog. Kedro Viz helps with this, but it’s a bit annoying to switch between the navigator and code. We’ve also started enforcing name consistency between the node names and their functions, as well as the data set names in the pipeline and the argument names in the node functions. Finally, making more, smaller pipelines is also a good way to keep your sanity. While all of these techniques help you to mentally keep track, it’s still the trade off you make for coding the pipelines by naming the inputs and outputs.
- Visualization: This isn’t really considered much in Kedro, and is the one thing I’d say notebooks still have a leg up on. Kedro makes it easy for you to load the Kedro context in a notebook, however, so you can still fire one up to do some visualization. Ultimately, though, I’d love to see better support within Kedro for producing a graphical report that gets persisted to the 08_reporting layer. Right now we worked around this by making a node that renders a notebook to disk, but it’s a hack at best. I’d love better support for generating final, highly visual reports that can be versioned in the data catalog much like the intermediate data.
So am I a Kedro convert? Yah, you betcha. It replaces the spider-web of bash scripts and Python notebooks I used to use for my experimental data pipelines and model training, and enables better collaboration among our teams. It won’t replace a fully productionalized stream-based data pipeline for me, but it absolutely makes sure my experimental pipelines are maintainable, reproducible, and shareable.