I did not realize, initially, how important it was to incorporate clean code practices into my development process. However, as I have progressed in my data engineering career, I have come to understand that following DataOps principles is a key step in the right direction. Building out these practices can be time-consuming, but it is well worth the effort in the long run. In this article, I want to walk through practical steps toward implementing DataOps practices.
Code Tests
There are a few ways to test code.
The first major way is unit tests; these automated tests are used to test individual units of code, such as functions or methods, to ensure they are working as intended. When a feature is created or fixed, we will run these tests to verify that the changes do not break existing functionality. Unit tests are an important part of the software development process, as they help to catch bugs early and make it easier to maintain and update the codebase over time.
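As a quick illustration, here is a minimal unit test sketch using pytest; the normalize_phone helper is a hypothetical example written for this article, not code from any specific project.

```python
# A minimal unit test sketch (run with pytest). The normalize_phone helper
# is a made-up example of a small, isolated unit of code.

def normalize_phone(raw: str) -> str:
    """Keep only the digits of a phone number string."""
    return "".join(ch for ch in raw if ch.isdigit())


def test_normalize_phone_strips_formatting():
    # Exercises one unit of code in isolation.
    assert normalize_phone("(555) 123-4567") == "5551234567"


def test_normalize_phone_handles_empty_string():
    assert normalize_phone("") == ""
```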
The second major way is integration tests, which test how different units of code work together. Unlike unit tests, which exercise individual units in isolation, integration tests verify how different parts of the system interact with each other. For example, an integration test can check how a smaller data process that pulls data from a database affects the overall pipeline. Integration tests are typically more complex than unit tests because they involve multiple components and dependencies, but they help ensure that the various parts of the system function correctly together and that there are no issues with the integration of different components.
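Below is a rough integration-test sketch that wires an in-memory SQLite database to a small extract-and-transform step; the extract_customers and transform_emails functions are hypothetical. The point is that the two pieces are verified together rather than in isolation.

```python
# A rough integration-test sketch (run with pytest): a real, in-memory SQLite
# database feeds a small extract step whose output flows into a transform step.
import sqlite3


def extract_customers(conn: sqlite3.Connection) -> list[tuple]:
    return conn.execute("SELECT id, email FROM customers").fetchall()


def transform_emails(rows: list[tuple]) -> list[tuple]:
    # Lower-case emails before loading them downstream.
    return [(cid, email.lower()) for cid, email in rows]


def test_extract_and_transform_work_together():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Alice@Example.COM')")

    rows = transform_emails(extract_customers(conn))

    assert rows == [(1, "alice@example.com")]
```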
Data Tests
In data checks, there are two distinct subtypes: static checks and dynamic checks. Static checks are usually where most of us begin our data quality journey, as they are the simplest form of data checks. With static checks, data is evaluated against certain predetermined criteria, such as data type and field length, to determine whether it meets the specified expectations. Dynamic checks, on the other hand, are more complex and examine the data against multiple criteria in order to identify discrepancies or irregularities. Dynamic checks may also take into account any changes that have occurred in the data over time to ensure accuracy.
Regardless of which type of check is used, it is important to consider the potential data quality issues that could arise and to address them proactively. Below is a list of the most common data quality issues so that you can address each of them individually; a short code sketch follows the list.
✅ Syntactic checks
These checks ensure that the data conforms to the specified format, such as validating that a date field is in the correct format or that a phone number contains only numeric characters.
✅ Semantic checks
These checks ensure that the data makes sense in the context of the domain, such as validating that a date of birth is not in the future or that a product code exists in the product catalog.
✅ Completeness checks
These checks ensure that all required data is present, such as checking that all mandatory fields have been filled in.
✅ Validity checks
These checks validate that the data is accurate and conforms to a set of business rules, such as checking that an email address is valid or that a credit card number passes a Luhn check.
✅ Consistency checks
These checks ensure that the data is consistent across different sources, such as checking that a customer's name and address match between different systems.
✅ Accuracy checks
These checks validate that the data is accurate according to external standards or data, such as comparing a mailing address against a reference dataset to detect errors.
✅ Uniqueness checks
These checks ensure that there are no duplicate values in the data, such as checking for duplicate customer records or unique keys.
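To make a few of these checks concrete, here is a small sketch of static checks in plain Python; the field names (id, email, signup_date) and the rules are assumptions made purely for illustration.

```python
# A small sketch of static data checks over a batch of records. The field
# names and rules are illustrative only.
from datetime import datetime


def check_syntactic(record: dict) -> bool:
    """Syntactic check: the date field must parse as YYYY-MM-DD."""
    try:
        datetime.strptime(record["signup_date"], "%Y-%m-%d")
        return True
    except (KeyError, ValueError):
        return False


def check_completeness(record: dict, required=("id", "email", "signup_date")) -> bool:
    """Completeness check: all mandatory fields must be present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required)


def check_uniqueness(records: list[dict], key: str = "id") -> bool:
    """Uniqueness check: no duplicate values in the key column."""
    values = [record[key] for record in records]
    return len(values) == len(set(values))


batch = [
    {"id": 1, "email": "a@example.com", "signup_date": "2023-01-15"},
    {"id": 2, "email": "b@example.com", "signup_date": "2023-13-40"},  # bad date
]

for record in batch:
    print(record["id"], check_syntactic(record), check_completeness(record))
print("unique ids:", check_uniqueness(batch))
```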
Use a version control system
A version control system manages changes to files. Some common use cases include:
Keeping track of different versions of a file, so you can easily switch back and forth between them.
Allowing multiple people to work on the same files simultaneously, without conflicts.
Keeping a record of who made each change and when.
Reverting to a previous version of a file if something goes wrong.
Creating branches for new features or bug fixes, which can be developed separately from the main codebase and then merged back in when they are ready.
Set up multiple environments for your code
I would implement three distinct environments — dev, staging, and production — each of which serves a unique purpose. The dev environment would provide the space to develop and test out new changes and features, while the staging environment would be used for more comprehensive testing and to finalize any changes before pushing them to the production environment. The production environment would ultimately be used to deploy changes and features live, and would also provide the space to monitor and troubleshoot any ongoing issues.
NOTE: This also includes setting up branches that correspond to each of these environments. The staging environment should be an exact replica of the production environment. Ideally that means replicating the entire database, but realistically that can be expensive, so focus on keeping at least one year's worth of data for each of the tables in production. A minimal configuration sketch follows.
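For example, parameterizing a pipeline by environment might look like the sketch below; the APP_ENV variable name and the connection strings are assumptions, not a prescribed convention.

```python
# A minimal sketch of selecting configuration per environment. The APP_ENV
# variable and the connection strings are made up for illustration.
import os

CONFIG = {
    "dev": {"db_url": "postgresql://localhost/analytics_dev", "debug": True},
    "staging": {"db_url": "postgresql://staging-host/analytics", "debug": True},
    "production": {"db_url": "postgresql://prod-host/analytics", "debug": False},
}


def load_config(env: str | None = None) -> dict:
    """Pick the config block for the current environment."""
    env = env or os.environ.get("APP_ENV", "dev")
    if env not in CONFIG:
        raise ValueError(f"Unknown environment: {env}")
    return CONFIG[env]


if __name__ == "__main__":
    print(load_config())  # defaults to dev unless APP_ENV is set
```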
Reuse, modularize, and parameterize
Teams can boost productivity by reusing and modularizing code. Today, data pipelines are often built as one big pipeline, referred to as a monolith. In every pipeline there are parts that are used repeatedly: think reading files, removing stray characters, and similar operations. Functions of this kind should be grouped together.
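As a rough sketch, a repetitive step such as reading and lightly cleaning CSV files could be pulled into one reusable, parameterized function; the name read_csv_records and its defaults are hypothetical.

```python
# A sketch of pulling a repeated step out of a monolithic pipeline into a
# reusable, parameterized function. The defaults are illustrative only.
import csv
from pathlib import Path


def read_csv_records(path: Path, delimiter: str = ",", strip_chars: str = " \t") -> list[dict]:
    """Read a CSV file and strip stray characters from every value.

    The delimiter and the characters to strip are parameters, so the same
    function can serve several pipelines instead of being copy-pasted.
    """
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return [
            {column: value.strip(strip_chars) for column, value in row.items()}
            for row in reader
        ]
```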
Not all of these functions need to live in classes, but those that are organized into classes should follow the SOLID principles, so that they remain easy to maintain and update (a short sketch follows the list below).
The Single Responsibility Principle (SRP) states that each class should only have one responsibility and that responsibility should be clearly defined.
The Open/Closed Principle (OCP) states that a class should be open to extension, but closed to modification.
The Liskov Substitution Principle (LSP) states that any derived class should be a substitute for its parent class.
The Interface Segregation Principle (ISP) states that clients should not be forced to depend on methods they do not use.
The Dependency Inversion Principle (DIP) states that high-level modules should not depend on low-level modules; instead, both should depend on abstractions.
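The sketch below applies a few of these principles (SRP, OCP, and DIP) to pipeline-style code; the Source, CsvSource, and Deduplicator names are hypothetical examples, not a prescribed design.

```python
# A short sketch of SOLID principles applied to pipeline code.
from abc import ABC, abstractmethod
import csv


class Source(ABC):
    """Abstraction that high-level code depends on (Dependency Inversion)."""

    @abstractmethod
    def read(self) -> list[dict]:
        ...


class CsvSource(Source):
    """Single responsibility: only knows how to read rows from a CSV file."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> list[dict]:
        with open(self.path, newline="") as handle:
            return list(csv.DictReader(handle))


class Deduplicator:
    """Single responsibility: only removes duplicate rows."""

    def run(self, rows: list[dict], key: str) -> list[dict]:
        seen, unique = set(), []
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                unique.append(row)
        return unique


def pipeline(source: Source) -> list[dict]:
    # The pipeline depends on the Source abstraction, not a concrete reader,
    # so new sources can be added without modifying this function (Open/Closed).
    return Deduplicator().run(source.read(), key="id")
```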
Containerization and deployment
Each pipeline and its respective dependencies can then be containerized and packaged in a way that allows for portability and scalability, making deployment to different environments straightforward. Packaging pipelines into containers is an effective way to manage the complexity of the system: the pipelines stay independent of one another and can be deployed to different environments without complex reconfiguration, and containers can be spun up quickly, which allows for faster development cycles. Containers also provide a degree of isolation, reducing the risk of one pipeline affecting the others.
Work without fear or heroism
The most essential element of DataOps is to promote ongoing improvement and evolution while ensuring a supportive environment free of judgment or criticism. This requires a strong focus on collaboration and communication between teams, a shared goal, and a shared understanding of what success looks like. The idea is to create an open, understanding environment where everyone can contribute their own ideas and solutions without fear, and where mistakes are treated as learning opportunities rather than moments of blame or failure. This culture of continual improvement and development is essential to the success of DataOps.
Final thoughts
This article discusses the steps for implementing DataOps practices for clean code, such as using code tests (unit and integration) and data tests (static and dynamic) as well as setting up multiple environments, using a version control system, and reusing, modularizing, and parameterizing code.
If you are looking for coding examples, my “Practical Dataops” course will be coming out at the end of the month and can be pre-ordered at a 30% discount. Check it out here!