At Netguru, we’re continuously trying to optimise our internal processes. One of them is balancing the skills of individual employees when composing project teams, which ranks rather high on the headaches-per-hour scale, so we needed to get smart about it.

I conducted a little experiment, taking advantage of our internal People app. The app is essentially a directory of all Netguru employees, containing information on the skills they have, how good they are at them and what projects they are currently working on. For now, I am only playing with the data and searching for the right solution. My goals are to:

take advantage of the fact that our internal apps store interesting data that can be fetched and processed with ease.

show that Machine Learning is not so hard. It will become easy as pie when you look at my code. :)

learn about k-means clustering and how to naively postprocess the results of clustering.

Mission Statement

I will show you how to use the k-means algorithm to cluster all of Netguru’s tech employees into data-driven teams based on their skills from the People app. We will not only run the algorithm (which comes from a gem, obviously), but also try to learn something from our data and better understand the results that machine learning algorithms produce.

Dataset

I downloaded the People production database to my localhost and created a simple service object that fetched the users and their skills and saved them to a CSV file. Our dataset consists of rows that represent users and columns that represent skills:

user_id | user_name  | Ruby on Rails | Sinatra | Grape | Spree | ...
138     | Jon Doe    | 3             | 2       | 3     | 0     | ...
90      | Will Smith | 3             | 2       | 2     | 0     | ...

Our dataset needs a numeric representation of skill mastery in every column. I intentionally skipped all boolean skills to avoid the problem of a non-unified representation (integers vs. booleans).
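To make the shape of the dataset concrete, here is a minimal sketch of how such a CSV could be turned into labels and skill vectors. It uses Ruby’s standard CSV library; the inline CSV text simply repeats the two sample rows above, and the variable names are my own assumptions, not the code of the actual service object.

```ruby
require 'csv'

# Stand-in for the exported file; the real data comes from the People app.
csv_text = <<~CSV
  user_id,user_name,Ruby on Rails,Sinatra,Grape,Spree
  138,Jon Doe,3,2,3,0
  90,Will Smith,3,2,2,0
CSV

rows = CSV.parse(csv_text, headers: true)

# Every column except the two identifying ones is treated as a skill.
skill_names = rows.headers - %w[user_id user_name]

# Labels identify the employee; vectors hold one mastery level per skill.
labels  = rows.map { |row| row['user_name'] }
vectors = rows.map { |row| skill_names.map { |skill| Integer(row[skill]) } }

labels.zip(vectors).each { |name, vector| puts "#{name}: #{vector.inspect}" }
```

Using Integer(...) rather than to_i makes the script fail loudly if a non-numeric value (for example a boolean skill) slips into the export.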

The K-means Algorithm

We are going to use the popular k-means clustering algorithm to cluster Netguru’s employees. I won’t describe how the algorithm works in detail, but you can read about it here. Simply put, k-means works by randomly selecting k points (called centroids, where k is the number of our teams) in an n-dimensional space (where n is the number of skills, and each coordinate is the mastery level of one skill). After this, each iteration of the algorithm:

assigns each employee to the nearest centroid;

recalculates the position of each centroid to match the mean position of all employees assigned to this centroid.
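The two steps above can be sketched in plain Ruby. This is a naive illustration written for this post, not the implementation used by the gem; the method names, the sample vectors and the choice to start from k randomly picked employees are all assumptions.

```ruby
# Euclidean distance between two skill vectors of equal length.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a non-empty list of vectors.
def mean(vectors)
  vectors.transpose.map { |column| column.sum.to_f / column.size }
end

# Naive k-means: returns [centroids, assignments], where assignments[i]
# is the index of the centroid that vectors[i] belongs to.
def naive_kmeans(vectors, k, rng: Random.new, max_iterations: 100)
  # Assumption: start from k distinct employees picked at random.
  centroids = vectors.sample(k, random: rng)
  assignments = nil

  max_iterations.times do
    # Step 1: assign each employee to the nearest centroid.
    new_assignments = vectors.map do |vector|
      (0...k).min_by { |index| distance(vector, centroids[index]) }
    end

    # No change between iterations means the clustering has settled.
    break if new_assignments == assignments
    assignments = new_assignments

    # Step 2: move each centroid to the mean of its assigned employees.
    centroids = centroids.each_with_index.map do |centroid, index|
      members = vectors.each_index.select { |i| assignments[i] == index }
      members.empty? ? centroid : mean(members.map { |i| vectors[i] })
    end
  end

  [centroids, assignments]
end

# Made-up skill vectors: two backend-leaning and two frontend-leaning people.
vectors = [[3, 3, 0, 0], [3, 2, 0, 0], [0, 0, 3, 3], [0, 0, 2, 3]]
centroids, assignments = naive_kmeans(vectors, 2, rng: Random.new(42))
p assignments
```

On this tiny, well-separated example the first two and the last two people end up together whichever seed is used; only the cluster numbering can change. On real data, different seeds can produce genuinely different teams.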

At some point, in this very simple two-step algorithm, there will be no more change between iterations. This will mean that the employees are clustered. We have to bear in mind that we should always analyse our dataset in terms of the characteristics of machine learning algorithms to understand the results we obtain. K-means has some key points to remember:

when the number of dimensions is much bigger than the number of clusters, the clusters might be a little bit random (we have 15 teams and over 120 skills – not good);

k-means clustering works heuristically, and it starts with randomly placed centroids, which will yield different results in each run;

the algorithm produces clusters of unequal size, which means that outliers will probably end up in single-datapoint groups of their own; on the other hand, clustering into equally sized groups is NP-hard.

We will try to address these points in the next steps.
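As an illustration of what naive postprocessing for the unequal-size problem could look like, here is one possible sketch: keep the centroids found by k-means, but cap the team size and place people greedily, closest pairs first. This is a heuristic I am assuming for the sake of the example, not an optimal balanced clustering, and the method name, centroids and capacity are made up.

```ruby
# Euclidean distance between two skill vectors of equal length.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Greedy, capacity-limited assignment: visit (person, team) pairs from the
# closest to the farthest and place each person in the first team with room.
def balanced_assignments(vectors, centroids, capacity)
  if centroids.size * capacity < vectors.size
    raise ArgumentError, 'not enough room for everyone'
  end

  pairs = vectors.each_index.to_a.product(centroids.each_index.to_a)
  # Person and team indices act as tie-breakers to keep the order deterministic.
  pairs = pairs.sort_by do |person, team|
    [distance(vectors[person], centroids[team]), person, team]
  end

  assignments = Array.new(vectors.size)
  team_sizes = Array.new(centroids.size, 0)

  pairs.each do |person, team|
    next if assignments[person] || team_sizes[team] >= capacity
    assignments[person] = team
    team_sizes[team] += 1
  end

  assignments
end

# Made-up example: three people sit close to the first centroid, one to the second.
vectors = [[3, 3, 0], [3, 2, 0], [1, 3, 0], [0, 0, 3]]
centroids = [[3.0, 3.0, 0.0], [0.0, 0.0, 3.0]]

p balanced_assignments(vectors, centroids, 2) # => [0, 0, 1, 1]
```

With a capacity of 2, the third person is pushed to the second team because the first one is already full. The price of equal sizes is that some people land in a team that is not their nearest one.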

Postprocessing Data

To address the problem of having a much bigger number of skills than the number of teams, we can easily get rid of some of the skills we think are not so important for the clustering.
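A minimal sketch of such pruning might look like this. The hand-picked ignore list is purely illustrative, and dropping skills that nobody has is an extra rule I am assuming here; the data repeats the sample rows from the dataset section.

```ruby
# Slice of the dataset: skill names plus one vector per employee.
skill_names = ['Ruby on Rails', 'Sinatra', 'Grape', 'Spree']
vectors = [[3, 2, 3, 0], [3, 2, 2, 0]]

# Skills we decide to ignore; this list is only an illustration.
ignored_skills = ['Sinatra']

# Keep a column if it is not ignored and at least one person has the skill.
kept_indices = skill_names.each_index.select do |index|
  next false if ignored_skills.include?(skill_names[index])
  vectors.any? { |vector| vector[index] > 0 }
end

reduced_names = skill_names.values_at(*kept_indices)
reduced_vectors = vectors.map { |vector| vector.values_at(*kept_indices) }

p reduced_names   # => ["Ruby on Rails", "Grape"]
p reduced_vectors # => [[3, 3], [3, 2]]
```

Columns that are zero for everyone add nothing to the distances between employees, so removing them lowers the number of dimensions without changing the clustering.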

Clustering

I performed the clustering with the k-means-clusterer gem. It turns out that it’s much simpler than you’d think. Check it out in this gist.

Conclusions from Clustering

What can we learn from automated data-driven clustering of our employees’ skills? Well, there are multiple situations in which you might take advantage of the results. Think about using these clusters to make better matches when building teams for upcoming projects. It could also be very interesting to apply labels based on an employee’s role in the company (e.g. junior, regular and senior developer) to check if there are some patterns, e.g. whether all senior employees land in one cluster. If you find some regular employees there, it might be a good time to promote them – data don’t lie!

That said, we should also take a few things into consideration when making decisions based on data. First of all, not everyone understands the skill scale in the same way. I am sure that a lot of people have some skills ranked too high, while others have them ranked too low. Also, if we represent skills on a scale from 0 to 4, going from 3 to 4 is much harder than going from 0 to 1. The steps are not equal in reality, but we must assume they are when we calculate distances in a Euclidean space.

Machine learning and data science are absolutely amazing and will definitely shape the future of every business in the world. We need to bear in mind, though, that to make smart decisions we need to understand the foundations and origins of our data, know its limits and take a step back when analysing the results.