DataOps is a fascinating and innovative field. It has a significant impact on everyone in the data industry, whether you are a data engineer, data scientist, data analyst, or machine learning engineer. Since joining a new team that lacks solid code best practices, I have become acutely aware of the need for them.
And I love it.
I love collaborating with a team to identify effective strategies. It is an empowering experience with the ultimate objective of streamlining operations and enhancing our processes.
That is why, this week, I will focus on the first three tasks I would undertake when starting a project from scratch. Whether you are looking at a pile of messy code or starting fresh, you should always prioritize the following: separating code environments, implementing automated testing, and introducing code reviews.
Separating Environments
The first step I take in a new project is to ensure that we can have multiple environments. I will always suggest that the team set up three:
Development — This environment is where everything gets hooked up and features get built. Changes you make here have no impact on customers.
Staging — Once you have built the necessary components, it's time to test them. The staging environment is an exact replica of the production environment where your features will be deployed. This lets you test your changes before going live, ensuring everything works as intended and avoiding unexpected issues. Testing in a separate environment also makes debugging and troubleshooting easier, saving you time and resources. Use the opportunity to thoroughly test all aspects of your new feature and make any adjustments needed for success in production.
Production — This environment is exposed to customers (whether internal or external), so any change made here has a real impact.
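One lightweight way to enforce this separation in code is to resolve resources by environment. Here is a minimal sketch in Python; the APP_ENV variable name and the connection strings are illustrative placeholders, not a prescribed setup:

```python
import os

# Hypothetical connection strings; replace with your own.
DATABASES = {
    "development": "postgresql://localhost:5432/app_dev",
    "staging": "postgresql://staging-db.internal:5432/app",
    "production": "postgresql://prod-db.internal:5432/app",
}

def database_url(env=None):
    """Resolve the database for the current environment.

    Defaults to development, so a missing variable can never
    accidentally point a local run at production.
    """
    env = env or os.environ.get("APP_ENV", "development")
    if env not in DATABASES:
        raise ValueError(f"Unknown environment: {env!r}")
    return DATABASES[env]
```

The key design choice is that the safe environment is the default and anything unrecognized fails loudly.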
Ideally, each of those three environments would have its own replica of the production database. But before you decide to duplicate your database across all three environments, consider that this could carry a significant financial cost.
There are a few options for keeping a representative sample of the data:
Downsample the data so that you still retain broad coverage (e.g. rows across all dates)
Only keep the period of data the use case needs (e.g. only the last year of data)
Whatever you end up choosing, make sure newly created tables are also replicated into staging. There is nothing worse than realizing that the table you are working with does not even exist in the staging environment.
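To make the downsampling option concrete, here is a minimal sketch in plain Python. The function and field names are my own for illustration: keep only recent rows, capped per day, so the staging copy stays small but every day is still represented.

```python
import random
from datetime import date, timedelta

def downsample(rows, date_key="created_at", keep_days=365, per_day=100, today=None):
    """Keep only the last `keep_days` of rows, with at most `per_day`
    randomly sampled rows per date, so the staging copy stays small
    but every recent day is still represented."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    by_day = {}
    for row in rows:
        row_date = row[date_key]
        if row_date >= cutoff:
            by_day.setdefault(row_date, []).append(row)
    sample = []
    for day_rows in by_day.values():
        if len(day_rows) > per_day:
            day_rows = random.sample(day_rows, per_day)
        sample.extend(day_rows)
    return sample
```

In a real warehouse you would express the same idea in SQL, but the shape of the logic is the same: a date cutoff plus a per-day cap.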
Implementing automated testing
We just talked a ton about the importance of environments and how we need to test our code. We need to ensure that our tests are up to standard and that errors are caught before they reach staging and production. So why are we not making our testing automated as well? Automated testing refers to the practice of running tests on your codebase using a tool or service to detect problems. There are many types of automated testing, but some common ones include unit testing, integration testing, and end-to-end testing.
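As a quick illustration of the first type, a unit test is just a small function that asserts a piece of code behaves as expected. A minimal pytest-style sketch, where the normalize_email helper is a made-up example:

```python
def normalize_email(raw):
    """Lowercase and strip whitespace from an email address."""
    return raw.strip().lower()

# pytest discovers and runs any function whose name starts with test_
def test_normalize_email_strips_and_lowercases():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

def test_normalize_email_leaves_clean_input_alone():
    assert normalize_email("bob@example.com") == "bob@example.com"
```

Running `pytest` locally or in CI executes every such test and fails the build if any assertion breaks.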
I often use the following GitHub Action for Python. Every time a pull request is created, it will run the pre-commit hook. It will run flake8, a linter that is often used to check code against the standards in PEP 8.
name: Commit Checks
on:
  pull_request:
    types: [opened, edited, reopened, synchronize]
jobs:
  run-pre-commit:
    name: Pre-commit check
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
      - name: Install pre-commit
        run: pip install pre-commit && pre-commit install
      - name: Run pre-commit on committed files
        run: pre-commit run flake8
You can change that last line,
pre-commit run flake8
to run any of the hooks you want.
Just make sure you also set them up locally so you have the right files in place, especially the .pre-commit-config.yaml file, which tells pre-commit what needs to be done. Here is an example of that file:
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
Here we are telling pre-commit to check that our YAML files are valid, that our files have no trailing whitespace, and that formatting is consistent regardless of who pushes the code.
Implementing a code review process
Code reviews are a critical part of maintaining code quality and ensuring that best practices are being followed. A code review is a process in which one or more developers review a piece of code and provide feedback on its quality, readability, and adherence to best practices. A code review can identify potential bugs, security issues, and other problems before they make their way into staging and production.
And I do not mean simply talking to each other about the code. I mean setting up teams in your GitHub repository, adding a CODEOWNERS file, and then enabling auto-assignment of reviewers.
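For example, a CODEOWNERS file at the root of the repository (or in .github/) maps paths to the teams that must review changes to them. A small sketch, where the organization and team names are placeholders:

```
# Hypothetical team names; GitHub auto-requests these reviewers
# on any pull request touching the matching paths.
*              @my-org/data-engineering
/dags/         @my-org/data-platform
/dashboards/   @my-org/analytics
```

Later rules take precedence over earlier ones, so the catch-all `*` line should come first.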
Most importantly, code reviews are not about the individuals themselves; they are about ensuring the code is up to standard. Be respectful about the changes authors need to make so that they can make them and learn from them in the future. Remember, we are all here to learn and grow together.
Conclusion
The first three recommendations I make to any team starting from scratch are to separate code environments, implement automated testing, and introduce a code review process. We often forget that these somewhat simple practices will save us so much in the long run. Just ask any data engineering team and you will see the power of DataOps.