
Insights from Data Engineer Recruitment

It has been two years since we started our data science team. At the beginning, we built a strong group focused mostly on machine learning. Since then, the team has naturally evolved into its present form of two subgroups: data engineering and machine learning. Each focuses on one of the two ends of successful AI adoption, and they work hand in hand.


This is how we understand the end-to-end solutions we propose to our clients. The Data Engineering team works with data, including big data, and ensures it is processed efficiently for the Machine Learning team, which produces fine-tuned, complex machine learning or deep learning models. This division is inevitable: machine learning has drifted apart from data processing, and if you want to propose the best solutions in both domains, you need to specialize. More sophisticated and data-hungry ML models need experts in data processing.


However, our experience over the last two months is that many candidates treat these two roles as interchangeable. We noticed many people with an ML background applying to our Data Engineering team. Nearly 78% of candidates had only ML-related experience, and only 11% mentioned data engineering in their applications.


We checked our role description and it looked fine to us. We are looking for someone who has worked in a cloud environment, with large volumes and a variety of data (using both RDBMSs and cluster processing). Someone who knows how to set up a healthy backbone for our ML/DL models and has some basic knowledge of how models work and what type of data they ingest. The perfect candidate would have a combination of the skills presented in the following diagram:


They should be able to build scalable cloud solutions with existing providers (AWS, GCP, Azure) and manipulate data using their services (simple storage like S3 and data streaming in Firehose). They should also be able to use existing infrastructure like Oracle or Postgres databases (or big data equivalents like AWS Athena or GCP BigQuery) to create the pipelines that matter to stakeholders. And, finally, they should be able to scale these pipelines out with computational frameworks to support big data analysis.


We know that back in the old Data Science times (around 2015) this would have been like searching for a unicorn. But this type of specialization is more cohesive: it groups skills and technologies that are close to each other.

Still, candidates declared more ML-related skills (training classification models, deep learning neural networks, hyperparameter search). Some of them had experience working with cloud storage such as S3. They rarely mentioned ETL pipelines, building data warehouse solutions, or parallel computing clusters.


And how is it in your Data Science departments? Have you noticed similar trends? Or maybe the opposite? Please share your story with us.
