Building Blocks of Production-Ready Code: Reproducibility
Reproducibility and test-driven development are foundational to production-ready code
In engineering, reproducibility is foundational to creating production-ready projects. It allows engineers to build on each other's work and allows for the development of new processes and tools for the field as a whole.
But in data engineering and data science, reproducibility is often neglected. Data scientists don't always track how they got their results, or even which methods they used, and that makes it hard for others to replicate those results.
Reproducibility in Code
It's easy to write a function or class that works once, but what if you want to use it again, or build on it? This is where the idea of reproducible code comes in. Reproducible code is software that can be easily reused in different situations, by different people, and with different sets of data. That makes the code more flexible and adaptable.
So what does reproducible code look like?
Easy to read
When you go through someone else's code and see something that you do not understand, clearly labeled functions and variables help you figure out what the code is doing. If a function name gives no context, or if the variables are labeled with numbers instead of words, the code becomes hard to follow, especially if the author did not think about how others would read their work.
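As a small illustration of the naming point, here is a sketch in Python. Both functions are hypothetical and do exactly the same work; only the names differ.

```python
# Hard to read: the names give no hint about intent.
def f(a, b):
    return [x for x in a if x[1] > b]


# Easier to read: same logic, descriptive names and a docstring.
def customers_older_than(customers, age_threshold):
    """Return the (name, age) pairs whose age is above age_threshold."""
    return [(name, age) for name, age in customers if age > age_threshold]


# Made-up sample data, used only for this illustration.
customers = [("Ana", 34), ("Ben", 19), ("Chloe", 52)]
older_customers = customers_older_than(customers, age_threshold=30)
```

A reader can guess what the second function does from the call site alone, without opening its body.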
Reusable
One of the biggest benefits of reproducible code is that it can be reused later without rewriting it from scratch again and again. For example, if I am working on an app that uses machine learning for image recognition tasks, such as identifying faces in photos, then having a library of existing algorithms available makes it much easier to add this feature to my app.
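One way to picture reuse at a small scale is a function that takes its data and column name as parameters instead of hard-coding them. The sketch below is a hypothetical helper with made-up data, not code from any particular library.

```python
def summarize_column(rows, column):
    """Return min, max, and mean for one numeric column of any dataset."""
    values = [float(row[column]) for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Two unrelated, made-up datasets handled by the same function.
sales = [{"amount": "10"}, {"amount": "30"}]
temperatures = [{"celsius": "18.5"}, {"celsius": "21.5"}]

sales_summary = summarize_column(sales, "amount")
temperature_summary = summarize_column(temperatures, "celsius")
```

Because nothing about the sales data is baked into the function, the same code serves the temperature data unchanged.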
Modularized
Reusable code is easier to maintain and update when it is divided into smaller modules (or functions and classes) with clearly defined responsibilities. For example, if you need to integrate some new machine learning models into your app, it makes sense to keep the code that handles those models in separate files, rather than mixing it into a single large file that does everything from loading data from disk through training and testing algorithms.
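A minimal sketch of that idea, assuming a tiny CSV dataset with hypothetical `item` and `price` columns: each step is its own function with one responsibility, and in a real project each could live in its own file.

```python
import csv
import io
from statistics import mean


def load_rows(csv_file):
    """Read rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def clean_rows(rows):
    """Drop rows with a missing price and convert price to float."""
    return [
        {**row, "price": float(row["price"])}
        for row in rows
        if row["price"].strip()
    ]


def average_price(rows):
    """Compute the mean price over cleaned rows."""
    return mean(row["price"] for row in rows)


def run_pipeline(csv_file):
    """Compose the steps; each one can be tested or swapped on its own."""
    return average_price(clean_rows(load_rows(csv_file)))


# In-memory stand-in for a file on disk, so the sketch runs anywhere.
sample = io.StringIO("item,price\napple,1.50\npear,\nplum,2.50\n")
result = run_pipeline(sample)
```

If the cleaning rules change, only `clean_rows` needs to be touched; loading and aggregation stay as they are.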
Ensuring Reproducibility
Ensuring reproducibility requires a more extensive analysis of your code, but I would start by implementing test-driven development.
The typical software development process is to write the function first and then write the tests. However, what if we were to write the test functions first? That’s what test-driven development suggests.
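To make the test-first order concrete, here is a small sketch in Python. The `normalize` function and its expected behavior are assumptions chosen for illustration. The tests are written first and describe what we want, and the implementation comes after.

```python
# Step 1: write the tests first. They describe the behavior we want
# before any implementation exists.
def test_normalize_scales_values_to_unit_range():
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_normalize_handles_constant_input():
    # Design decision for this sketch: no spread in the data means all zeros.
    assert normalize([3.0, 3.0]) == [0.0, 0.0]


# Step 2: write just enough code to make the tests pass.
def normalize(values):
    """Min-max scale a list of numbers into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    spread = highest - lowest
    if spread == 0:
        return [0.0 for _ in values]
    return [(value - lowest) / spread for value in values]


# Step 3: run the tests. A test runner such as pytest would normally
# collect and run them; here they are called directly.
test_normalize_scales_values_to_unit_range()
test_normalize_handles_constant_input()
```

Writing the constant-input test up front forces a decision about that edge case before it can surprise anyone in a pipeline.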
The benefit of using this type of methodology is that we have to know what we are trying to build before we actually build it. The process of designing and laying out these functions will save you a ton of time in the end.
The disadvantage is that it takes more time upfront to build out functions or classes. In data science or engineering, time is typically of the essence. However, the initial time you spend creating these tested functions and classes will make iterating on data pipelines or feature engineering much quicker, more accurate, and less prone to human error.
Examples of reproducible projects will be available in the "Jupyter Notebook to Production" course coming out this month on my website.