5 Steps to Set Your Data Projects Up for Success
Strengthening the data relationship one step at a time
Building a relationship is complicated. Wait, what? Why are we talking about relationships in this article? Hang in there with me!
How long did it take for you to build up your relationship with your friends? How about trust? It usually takes time, energy, and consistency.
Similarly, we need to build a relationship with data. Many of us do not realize that our relationship with data is tainted by poor software practices or insufficient resources. And it is hard to convince upper-level management to invest in a project that has no clear connection to its data or that has already proven futile.
Naturally, it makes us wonder whether we can have a healthy relationship with this data. And I do believe we can. Specifically, time, effort, and consistency can heal the data relationship.
I will walk through five steps that will serve as stepping stones in building your relationship with your data.
1. A relationship needs a strong foundation
The first step in building your relationship with data is to locate all of the data pipelines and transformations in one location.
When a relationship starts, we will need to create a strong foundation. We usually establish this by spending time together and figuring out whether we have hobbies in common, etc. The same is true for your relationship with data. We will need to figure out where our data pipelines are, what data they are pulling, etc. These initial investigations will give us a solid foundation.
I am surprised at how many times I have seen data pipelines everywhere. And I mean everywhere. Some would be running in an old database on a Mac in the office, others in Azure Data Factory, and some on a virtual machine. Each location had a different version of the code, so I was never sure which one was the truth. If changes needed to happen in these pipelines, we updated them by hand. As a result, I hit a ton of rough patches in my relationship with data.
The result was astonishing when I moved all of the code into GitHub: we finally had one definitive source of truth and could keep track of every version.
2. A relationship requires a variety of experiences
The second step in building your relationship with data is to provide an atmosphere where we know what to expect before testing it out in an unknown environment.
Whenever I meet someone for the first or second time, I tend to pick a place where I am comfortable and know what to expect. The same applies here: we first need to choose an environment we are comfortable with. In our world, this is the staging environment, essentially a copy of production without end users. Staging will not catch every possible problem, but it will limit the number that reach production.
And fewer mistakes from our transformations reaching end users means a stronger relationship with data.
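To make this concrete, here is a minimal sketch of how a pipeline might pick its environment; the DATA_ENV variable and the hostnames are hypothetical, and defaulting to staging keeps an unconfigured run away from production:

```python
import os

# Hypothetical connection strings; the hosts and the variable name are
# illustrative, not from any specific tool.
CONNECTIONS = {
    "staging": "postgresql://staging-db.internal:5432/warehouse",
    "production": "postgresql://prod-db.internal:5432/warehouse",
}

def get_connection_string() -> str:
    """Default to staging so a pipeline never touches production by accident."""
    env = os.getenv("DATA_ENV", "staging")
    if env not in CONNECTIONS:
        raise ValueError(f"Unknown environment: {env!r}")
    return CONNECTIONS[env]
```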
3. Balance your focus
The third step is to limit the number of transformations you run inside the database.
Building a relationship requires focus and intention. If either side does too much in the relationship, the relationship will not grow. This will often lead to a re-prioritization or self-examination.
As your relationship with data grows and the initial data warehouse is set up, the warehouse can handle both the transformations and the pulling of data for reports. However, as the product and the data grow, the transformations become more prominent and take up more resources, leading to an overworked database and failing queries.
To create a healthy balance between transformations and queries, we will need to limit the number of transformations in the database. Your team will know best here, but here are some suggestions: if you have a ton of data, something like Spark will be great for running those transformations; meanwhile, plain Python code can process a small dataset just as quickly.
In my case, I moved the transformations out of the SQL in AWS Redshift and Tableau, translated them to Python, containerized the pipelines, and ran them continuously using AWS Fargate.
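To give a feel for that translation, here is a rough sketch of a small SQL aggregation rewritten in pandas; the table, columns, and file paths are made up for illustration:

```python
import pandas as pd

# Hypothetical example: the aggregation that used to run inside the warehouse,
#   SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id;
# rewritten as a standalone Python step that a container can run.
def transform(orders: pd.DataFrame) -> pd.DataFrame:
    return (
        orders.groupby("customer_id", as_index=False)
              .agg(total_spend=("amount", "sum"))
    )

if __name__ == "__main__":
    # A real pipeline would read from the warehouse or object storage;
    # a local file keeps this sketch self-contained.
    orders = pd.read_parquet("orders.parquet")
    transform(orders).to_parquet("customer_spend.parquet")
```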
4. Examine your relationship
The fourth step is to look at your data model.
What do you want out of this relationship? Are there parts that are missing? What kind of questions can you ask to grow that part of the relationship?
This is essentially what you are doing when analyzing your data model. Data modeling captures the semantics and structure of the data in a database. It involves identifying the entities, attributes, relationships, and constraints needed to represent the information in a database.
A good data model should be easy to understand, maintain and use. Additionally, it should be structured to make sense for the people who need to use it and built on a sound database design that reflects the needs of the business.
There are a few indications that your data model needs to be updated:
Your SQL queries from your database are longer than a page.
Your SQL queries take longer than a minute to run.
Re-analyzing the data model needs to happen regularly — it is an iterative process. As you learn more about your data and your relationship grows, your data model will improve.
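To see what entities, attributes, relationships, and constraints look like in code, here is a small, hypothetical sketch using SQLAlchemy's declarative models; your own model will of course reflect your business:

```python
from sqlalchemy import Column, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    """An entity, its attributes, and a uniqueness constraint."""
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    email = Column(String, unique=True, nullable=False)
    orders = relationship("Order", back_populates="customer")

class Order(Base):
    """A related entity; the foreign key encodes the relationship."""
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    amount = Column(Numeric(10, 2), nullable=False)
    customer_id = Column(Integer, ForeignKey("customers.id"), nullable=False)
    customer = relationship("Customer", back_populates="orders")
```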
5. Check in regularly
In every relationship, most people will check in regularly to see how the other is doing. So why do we not do this with our data?
The fifth step is to check your data regularly. Out of all the steps, this one is the most energy- and time-intensive, as it needs to be repeated periodically, just like building a relationship :).
First, we will need to profile the data. Compared to streaming data, batch data is relatively straightforward to examine: you can “scan” your data and return the number of null values, the average and median of numeric columns, and similar metrics.
We will need to save those metrics to a metrics repository every time we run them.
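A minimal version of that scan-and-save loop might look like the sketch below; it assumes a batch already loaded into a pandas DataFrame, and the parquet file standing in for the metric repository is just one illustrative choice:

```python
from datetime import datetime, timezone
import os

import pandas as pd

def profile(df: pd.DataFrame, run_id: str) -> pd.DataFrame:
    """Scan a batch and return one row of metrics per numeric column."""
    numeric = df.select_dtypes("number")
    metrics = pd.DataFrame({
        "column": numeric.columns,
        "null_count": numeric.isna().sum().values,
        "mean": numeric.mean().values,
        "median": numeric.median().values,
    })
    metrics["run_id"] = run_id
    metrics["profiled_at"] = datetime.now(timezone.utc)
    return metrics

def save_metrics(metrics: pd.DataFrame, path: str = "metrics.parquet") -> None:
    """Append this run's metrics to the historical metric repository."""
    if os.path.exists(path):
        metrics = pd.concat([pd.read_parquet(path), metrics])
    metrics.to_parquet(path, index=False)
```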
Finally, we will need to analyze those values against their historical counterparts. Unusual values will appear in the data, and we can quickly identify them with a z-score or other anomaly detection. Once those outliers are detected, a notification should go out.
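The z-score check itself can be as simple as this sketch, which flags any metric more than three standard deviations away from its own history (three is a common convention, not a rule):

```python
import pandas as pd

def is_anomalous(history: pd.Series, latest: float, threshold: float = 3.0) -> bool:
    """Flag a value whose z-score against historical runs exceeds the threshold."""
    mean, std = history.mean(), history.std()
    if pd.isna(std) or std == 0:  # too little history or variation to judge
        return False
    return abs(latest - mean) / std > threshold
```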
Using something like Flyte or Prefect, we can calculate these metrics in batches and push the results to Slack or PagerDuty. Similarly, I have set up tools like re_data or deequ for this as well.
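As a hedged sketch of what that orchestration might look like with Prefect 2, the flow below computes one example metric and posts to a Slack incoming webhook; the webhook URL, file path, column name, and threshold are all placeholders:

```python
import pandas as pd
import requests
from prefect import flow, task

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

@task
def null_rate(path: str, column: str) -> float:
    """One example metric: the share of nulls in a column of a batch file."""
    df = pd.read_parquet(path)
    return float(df[column].isna().mean())

@task
def notify(message: str) -> None:
    # Slack incoming webhooks accept a simple {"text": ...} payload.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

@flow
def data_quality_check(path: str = "orders.parquet") -> None:
    rate = null_rate(path, "amount")
    if rate > 0.05:  # an arbitrary example threshold
        notify(f"Null rate for 'amount' in {path} is {rate:.1%}")
```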
I am still learning how to check data quality on streaming data sources. You could say that I don’t have any experience with speed dating data. Ha.
Conclusion
It takes time, energy, and consistency to grow a relationship, especially with data. Building this strong relationship requires a solid foundation of data pipelines, a variety of experiences, a balanced focus, an honest examination of the current relationship, and finally, regular check-ins. You will find that implementing these steps results in a stronger relationship. And it is ultimately this strong relationship that will allow you to deliver successful projects.