Data Engineering Archetypes
How do these archetypes differ from software engineers and analytic engineers?
Data engineers are the bread and butter of every data team or organization out there. They are responsible for designing, constructing, and maintaining the systems and infrastructure that enable the collection, storage, processing, and retrieval of data.
In my years of experience as a data engineer, I have found that two distinct archetypes exist: the engineer archetype and the analytics archetype.
In this article, we will cover:
what an engineer archetype is and how it differs from a software engineer,
what the analytics archetype looks like and how it differs from an analytics engineer,
and lastly, the factors that identify the perfect composition for a data team.
Engineer Archetype
This type of data engineer stands out for their strong coding abilities. These engineers are highly skilled in languages such as Scala and Java, other lower-level programming languages, and frameworks such as Kubernetes. They have a deep understanding of how to write efficient code that can handle large amounts of data on any schedule: daily batch, hourly batch, near real-time, real-time, and everything in between. Their main goal is to build and maintain the infrastructure that supports the data pipelines built on top of it.
This archetype is commonly observed in larger organizations and big tech that require the management of extensive open-source infrastructure. This includes handling big data and ensuring that the systems in place can handle the scale of operations. Additionally, since many off-the-shelf data tools cannot cope with this level of scale, it becomes essential to develop custom solutions that can handle the unique needs of the organization.
Many data engineers in this archetype come from software development backgrounds. Software engineers are great at optimizing software, but they often do not know what to do with the data itself. Engineer-archetype data engineers, on the other hand, know how to manage a Kubernetes cluster (or another distributed framework) and apply data engineering fundamentals, so they can ingest close to all of the data even at the largest scale, without needing to downsample or drop any of it.
Analytics Archetype
The analytics archetype data engineer is a critical asset to any modern business. Their coding skills are top-notch, but their primary focus is providing direct value to the company as a whole. This means they take a holistic approach to data engineering and work closely with analysts to understand their needs. They create data pipelines and have an excellent understanding of database design, which allows them to pull data efficiently and accurately. In addition, they are skilled in data visualization techniques (including building their own visualization sites with tools like Plotly and Dash) to showcase aggregated data to internal and external customers.
The analytics archetype data engineer is proficient with a wide range of tools. Many of these tools abstract away the heavy engineering tasks that the engineer archetype would otherwise handle. That trade-off is deliberate: managing those heavy engineering tasks would consume the time and energy needed to sync with the analytics engineers and analysts and ensure they have the tools they need.
That brings up a good point — how does this role differ from analytics engineers?
I did not see a real use case for this type of engineer until recently, when the companies I worked for implemented dbt (Data Build Tool).
Within our organization, we have data analysts, data engineers, and data scientists.
Here is a definition for each of our roles:
Data analysts pull data from the database and visualize that data to provide insights.
Data engineers move data from source systems into a database.
Data scientists create models from the data in the data lake or database.
Unfortunately, our data engineers were not well acquainted with what the analysts needed, so data just ended up being dumped into the database without any particular structure. On top of that, we had years of data modeling technical debt, so many tables were denormalized, duplicated, or simply unused. Implementing dbt forced us to identify which tables were actually needed, and even building this out the first time took significant time and effort from the analysts. While that work is underway, the analysts cannot create the dashboards or research that stakeholders ask of them, so they build the tables they need ad hoc and add those tables back into dbt. And the technical debt cycle continues.
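To make the "which tables are actually needed" step concrete, here is a hypothetical sketch of flagging tables that no dashboard query references. The table names, query list, and regex approach are all illustrative assumptions; in practice, dbt's lineage graph or warehouse query logs are far more robust sources of usage information.

```python
import re


def unused_tables(all_tables: set, dashboard_queries: list) -> set:
    """Return the tables that none of the given SQL queries reference.

    Purely illustrative: matches table names as whole words inside raw
    query text, which misses views, aliases, and dynamically built SQL.
    """
    referenced = set()
    for query in dashboard_queries:
        for table in all_tables:
            # Whole-word, case-insensitive match of the table name.
            if re.search(rf"\b{re.escape(table)}\b", query, re.IGNORECASE):
                referenced.add(table)
    return all_tables - referenced


# Hypothetical inventory: one backup table nothing reads from.
tables = {"orders", "orders_v2_backup", "customers"}
queries = ["SELECT * FROM orders JOIN customers USING (customer_id)"]
print(unused_tables(tables, queries))  # candidates for cleanup
```

A list like this is only a starting point for the conversation with analysts; a table with no dashboard references may still feed an ad hoc report.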
So what if we had a new team whose sole focus is maintaining the data model that the analysts need? That would be the analytics engineers. They would build out the data model (with the analysts' help, of course), use the data that the data engineers pulled for them, and create a repo with the defined business logic. The data engineers, in turn, would support the analytics engineers by providing them with the data they need.
What is the perfect composition for my data team?
The composition can range from all engineer archetypes to all analytics archetypes, or some mix of the two. It is difficult to identify the right composition without knowing certain factors.
Amount of data
The composition of your data engineering team should be heavily influenced by the amount of data that needs to be transferred each day. If the amount is less than 5 GB, the team does not need a custom solution and can stick with a pre-built one, which is perfect for the analytics archetype.
However, as the need for a custom solution grows, it is advisable to steer the composition towards the engineer archetype, which loves to build new solutions that can scale.
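As a purely illustrative sketch, the rule of thumb above could be expressed as a tiny heuristic. The function name, and treating the 5 GB figure as a hard threshold, are my assumptions, not a formula from any tool:

```python
def recommend_archetype(daily_gb: float, custom_threshold_gb: float = 5.0) -> str:
    """Illustrative heuristic mapping daily data volume to an archetype focus.

    Below the threshold, pre-built tools usually suffice (analytics
    archetype); above it, custom scalable solutions favor the engineer
    archetype. Real decisions also weigh tooling and budget.
    """
    if daily_gb < custom_threshold_gb:
        return "analytics"
    return "engineer"


print(recommend_archetype(2.0))    # small daily volume
print(recommend_archetype(500.0))  # big-data territory
```

In reality the boundary is fuzzy, and the other factors below (tooling and finances) can easily override a volume-only answer.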
Tooling being used
Tools can have a significant impact on the makeup of your team. If you are already using Kubernetes or Apache Spark, you will likely need more engineers with those skill sets. Conversely, if you are using tools that abstract away the software layer, you can put a greater focus on the analytics archetype.
Finances
Smaller budgets typically lean towards the engineer archetype. These data engineers are adept at creating customized setups from readily available open-source resources, making them ideal for scrappier budgets. They have a unique ability to handle complex (think custom) projects that require a high degree of scalability, and they can often find creative solutions that allow for growth and expansion. By leveraging their skills and expertise, an engineer archetype can help a business achieve its goals without breaking the bank.
The amount of data, the current (or future) tooling being used, and finances all strongly influence the composition of data engineer archetypes on your team.
Final thoughts
Data engineers are responsible for designing, constructing, and maintaining data systems and infrastructure. They can be divided into two archetypes: engineer and analytics. Engineer archetypes are skilled in programming languages and frameworks and focus on infrastructure that supports data pipelines. Analytics archetypes focus on providing value to the company and work closely with analysts to understand their needs. The composition of a data team should be influenced by factors such as the amount of data, tooling being used, and finances.