Solving the Data Engineer's Struggle for Clean Code
Defining the problems we face as data engineers and how to solve them.
In Robert C. Martin’s book Clean Code: A Handbook of Agile Software Craftsmanship, he invited well-known and deeply experienced programmers to define clean code.
Amongst their definitions, a common theme emerged: simplicity, and an emphasis on code that does one thing well. I do agree with this idea of simplicity, but one thing that stuck with me was the quote that Michael Feathers wrote.
“Clean code always looks like it was written by someone who cares.”
-Michael Feathers, author of Working Effectively with Legacy Code
I write a ton of code daily, so this quote hit close to home. I asked myself, “When is the last time I looked at my code, and it looked like I cared?” and “When is the last time I saw any great code?”
These questions brought me back to when I worked closely with software engineers as the lone data scientist or data engineer. I learned so much during those times about what it means to be a great engineer, and I had their support throughout that process. They lived out the caring part, so why am I struggling to write that type of code?
And that’s when I realized what the main difference was. Software engineering has been a respected field for a longer time. Software engineers have processes in place, accountability, and, most often, support from their managers. Most importantly, software engineers’ return on investment (ROI) depends on how well they write their code.
For data engineers, it is not that simple. Because a data engineer’s ROI depends on the data we produce, our priority always lies with the arrival of the data to the end user. Those priorities are then pushed to us by stakeholders; most do not have any tech or code background, which only adds to the dilemma. We are forced to write code that is not the simplest to maintain or care about. And when do we clean up that code? Never. Because, supposedly, that means we are not adding value to the team.
In the short term, sure, forgetting about the technical debt works out. However, in the long term, ignoring that technical debt means the data pipelines will not withstand the test of time. Pipelines will start failing, and the delays in delivering data to customers will start adding up.
How can we avoid this altogether? Is there a way to avoid it together? There is. Let the data engineers be engineers. What sets data engineers apart from the rest of the data team is that we love to code. Still, with the added job of talking with stakeholders, we continue to get bogged down with random questions and context-switching that drain our energy. We need processes that encourage internal and external stakeholders to work out the problem with managers and project managers before it reaches the engineers, so that engineers have time to design, write code, and make beautiful pieces of art.
There is another aspect that I need to mention here. Data engineers are blamed when things are not working out in the data team. When things are working out, and the data is accessible, the data analysts or data scientists get the credit. On top of that, data engineers are an afterthought in most business scenarios. For example, the product team decides to change things on the front end. Software engineers then change how APIs are called and how the data is pushed to the backend. And then, after all of those changes, we get called in because the product team needs that data. If data engineers were included in the initial design process, we could design and prepare for those changes, and the product team would have their data faster. This example also shows that we are the foundation for figuring out whether a project is successful.
4 practical solutions
The data engineering team is the foundation of the company’s success, so how can you support this often overlooked and underappreciated team?
It will require a few intentional but practical changes:
1. Product and engineering changes will require the involvement of a spokesperson from the data team.
I am not saying that this person has to be the same person all the time. However, the individual will have to know the ins and outs of the data pipeline architecture and ensure that changes in the product and engineering teams stay in sync with what data engineering is doing.
2. Give regular shout-outs to the data engineers who worked behind the scenes on the project.
We all want to feel appreciated, but data engineers tend to get left out. Some suggestions that I have seen work:
- Give them shout-outs or list the names of those who supported the project in the Monthly Stakeholder Reviews
- Create a Slack channel where the rest of the data team or engineering team can celebrate successes
3. Treat data and its team as part of the product or feature change.
I know this is the same as number 1, but it is important.
You would not want to launch a product without a marketing team. Why would you want to do a product launch without the data team? They provide the numbers to give you feedback on how well the product or feature launch is doing.
4. Build code contracts between the data producers and data consumers.
This change will keep both software engineering and data engineering accountable to each other. Writing these data contracts in code will clear up a lot of the confusion we see today: when API requests do not match what data engineers expect, data engineers have to go in and try to fix it instead of working on other features. By eliminating or at least limiting the number of fire drills and the reactive work we get, we will have time and energy to support and care about our code and learn from one another.
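To make the idea of a contract in code a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the "checkout" event, its field names, and the helpers `FieldSpec` and `find_violations` are hypothetical, not part of any real API or library. The point is only that producers and consumers agree on one shared definition that both sides can check against.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """One field in the contract: its name, expected type, and whether it is required."""
    name: str
    expected_type: type
    required: bool = True


# Hypothetical contract for a "checkout" event; the field names are illustrative only.
CHECKOUT_EVENT_CONTRACT = (
    FieldSpec("user_id", str),
    FieldSpec("order_total_cents", int),
    FieldSpec("currency", str),
    FieldSpec("coupon_code", str, required=False),
)


def find_violations(payload: dict[str, Any], contract: tuple[FieldSpec, ...]) -> list[str]:
    """Return readable contract violations; an empty list means the payload conforms."""
    violations = []
    for spec in contract:
        if spec.name not in payload:
            if spec.required:
                violations.append(f"missing required field: {spec.name}")
            continue
        value = payload[spec.name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, spec.expected_type) or (
            spec.expected_type is int and isinstance(value, bool)
        )
        if wrong_type:
            violations.append(
                f"field {spec.name}: expected {spec.expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about usually signal an unannounced change.
    known = {spec.name for spec in contract}
    for extra in sorted(set(payload) - known):
        violations.append(f"unexpected field: {extra}")
    return violations


if __name__ == "__main__":
    good = {"user_id": "u-1", "order_total_cents": 1999, "currency": "USD"}
    bad = {"user_id": 42, "currency": "USD", "total": 19.99}
    print(find_violations(good, CHECKOUT_EVENT_CONTRACT))
    for problem in find_violations(bad, CHECKOUT_EVENT_CONTRACT):
        print(problem)
```

A check like this could run in the producer's test suite and again at the start of the pipeline, so a mismatch surfaces as a failed test on the software engineering side instead of a fire drill on the data engineering side. In practice a team might reach for a schema format or validation library instead; the hand-rolled version here is just the smallest thing that shows the shape of the agreement.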
You probably noticed a theme in those actionable items: communication. The key is to be intentional about communicating upcoming changes to the data engineering team so that their work becomes less reactive and more proactive. Shifting to proactive work will give data engineers the energy to be engineers, to start taking care of their code so that it scales, and to provide an ROI.