- take advantage of the fact that our internal apps store interesting data that can be fetched and processed with ease.
- show that machine learning is not so hard. It will become easy as pie when you look at my code. :)
- learn about k-means clustering and how to naively postprocess the results of clustering.
Mission Statement
I will show you how to use the k-means algorithm to cluster all of Netguru’s tech employees into data-driven teams based on their skills from the People app. We will not only run the algorithm (using a ready-made gem, obviously) but also try to learn something from our data and better understand the results that machine learning algorithms produce.
Dataset
I downloaded the People production database to my localhost and created a simple service object that fetched the users and their skills and saved them to a CSV file. Our dataset consists of rows that represent users and columns that represent skills:
| user_id | user_name | Ruby on Rails | Sinatra | Grape | Spree | ... |
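To make the shape of the dataset concrete, here is a minimal sketch of how such a CSV could be read back into feature vectors using only Ruby's standard library. The sample rows, the 0 to 4 skill levels shown, and the `load_dataset` helper are assumptions made for illustration; they are not the actual People data or the original service object.

```ruby
require "csv"

# Hypothetical sample in the shape described above; the names and
# skill levels are made up for illustration.
SAMPLE_CSV = <<~ROWS
  user_id,user_name,Ruby on Rails,Sinatra,Grape,Spree
  1,Alice,4,2,3,0
  2,Bob,1,0,0,0
  3,Carol,3,3,1,2
ROWS

# Columns that identify a user rather than describe a skill.
ID_COLUMNS = %w[user_id user_name].freeze

# Returns [labels, skill_names, vectors], where vectors[i] holds the
# numeric skill levels of the user named labels[i].
def load_dataset(csv_text)
  table = CSV.parse(csv_text, headers: true)
  skill_names = table.headers - ID_COLUMNS
  labels = table.map { |row| row["user_name"] }
  vectors = table.map { |row| skill_names.map { |skill| row[skill].to_f } }
  [labels, skill_names, vectors]
end

labels, skill_names, vectors = load_dataset(SAMPLE_CSV)
```

Each user becomes one point in the skill space, with one coordinate per skill column.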
The K-means Algorithm
We are going to use the popular k-means clustering algorithm to cluster Netguru’s employees. I won’t describe how the algorithm works in detail, but you can read about it here. Simply put, k-means works by randomly selecting k points (called centroids, where k is the number of our teams) in an n-dimensional space (where n is the number of skills and each coordinate is the mastery level of one skill). After this, each iteration of the algorithm:
- assigns each employee to the nearest centroid;
- recalculates the position of each centroid to match the mean position of all employees assigned to this centroid.
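The two steps above can be sketched in plain Ruby. This is a naive illustration and not the implementation used by any gem. It assumes the starting centroids are drawn from the data points themselves (one common initialisation choice) and that the loop stops once assignments no longer change. The names `naive_kmeans`, `squared_distance` and `mean_vector` are made up for this example.

```ruby
# Squared Euclidean distance between two equal-length vectors.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Component-wise mean of a non-empty list of vectors.
def mean_vector(vectors)
  vectors.transpose.map { |column| column.sum / column.size.to_f }
end

# Returns [assignments, centroids]; assignments[i] is the index of the
# centroid that points[i] belongs to.
def naive_kmeans(points, k, max_iterations: 100, random: Random.new)
  # Initialisation (assumption): use k distinct data points as centroids.
  centroids = points.sample(k, random: random)
  assignments = nil

  max_iterations.times do
    # Step 1: assign each point to its nearest centroid.
    new_assignments = points.map do |point|
      (0...k).min_by { |i| squared_distance(point, centroids[i]) }
    end
    break if new_assignments == assignments # nothing moved, so we are done

    assignments = new_assignments

    # Step 2: move each centroid to the mean of the points assigned to it.
    centroids = centroids.each_with_index.map do |centroid, i|
      members = points.each_index.select { |j| assignments[j] == i }.map { |j| points[j] }
      members.empty? ? centroid : mean_vector(members)
    end
  end

  [assignments, centroids]
end

points = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 3.0]]
assignments, centroids = naive_kmeans(points, 2, random: Random.new(42))
```

On this tiny example the first two points end up in one cluster and the last two in the other. Which cluster gets index 0 depends on the random seed.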
K-means has a few limitations that we should keep in mind:
- when the number of dimensions is much bigger than the number of clusters, the clusters might be somewhat arbitrary (we have 15 teams and over 120 skills, which is not good);
- k-means is a heuristic that starts from randomly placed centroids, so it can yield different results on each run;
- the algorithm produces clusters of unequal size, which means that outliers will probably end up in single-datapoint groups of their own; on the other hand, clustering into equally sized groups is NP-hard.
Postprocessing Data
To address the problem of having far more skills than teams, we can simply drop the skills we consider less important for the clustering.
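As a rough sketch of this step, the helper below removes a hand-picked list of skill columns from every feature vector. The `drop_skills` name, the sample data and the choice of which skills to drop are all hypothetical; in practice the list would come from your own judgement about what matters for team building.

```ruby
# Removes the named skill columns from every feature vector.
# Returns [kept_skill_names, reduced_vectors].
def drop_skills(skill_names, vectors, skills_to_drop)
  kept_indices = skill_names.each_index.reject do |i|
    skills_to_drop.include?(skill_names[i])
  end
  kept_names = skill_names.values_at(*kept_indices)
  reduced = vectors.map { |vector| vector.values_at(*kept_indices) }
  [kept_names, reduced]
end

skill_names = ["Ruby on Rails", "Sinatra", "Grape", "Spree"]
vectors = [[4, 2, 3, 0], [1, 0, 0, 0]]

kept_names, reduced = drop_skills(skill_names, vectors, ["Spree", "Sinatra"])
# kept_names => ["Ruby on Rails", "Grape"]
# reduced    => [[4, 3], [1, 0]]
```

Fewer dimensions will not make the clusters "correct", but it brings the number of skills closer to the number of teams.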
Clustering
I performed the clustering with the kmeans-clusterer gem. It turns out to be much simpler than you’d think. Check it out in this gist.
Conclusions from Clustering
What can we learn from automated, data-driven clustering of our employees’ skills? There are multiple situations in which you might take advantage of the results. Think about using these clusters to make better matches when building teams for upcoming projects. It could also be very interesting to apply labels based on an employee’s role in the company (e.g. junior, regular and senior developer) and check for patterns, e.g. whether all senior employees land in one cluster. If you find some regular employees there, it might be a good time to promote them. Data don’t lie!
That said, we should also take a few things into consideration when trying to make decisions based on data. First of all, not everyone understands the skill scale in the same way. I am sure that a lot of people have some skills ranked too high, while others have them ranked too low. Also, if we represent skills on a scale from 0 to 4, going from 3 to 4 is much harder than going from 0 to 1. The steps are therefore not equal in reality, but we have to assume they are when we calculate distances in a Euclidean space.
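The last caveat is easy to see in a tiny calculation. In this sketch (the two-skill vectors and the `euclidean_distance` helper are made up for illustration), an employee moving from level 0 to 1 in a skill travels exactly as far in the skill space as one moving from level 3 to 4, even though the second step takes far more effort in real life.

```ruby
# Plain Euclidean distance between two equal-length skill vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# The first skill changes by one level; the second skill stays at 2.
beginner_step = euclidean_distance([0, 2], [1, 2]) # 0 -> 1
expert_step   = euclidean_distance([3, 2], [4, 2]) # 3 -> 4

puts beginner_step == expert_step # prints "true": both distances are 1.0
```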
Machine learning and data science are absolutely amazing and will definitely shape the future of every business in the world. We need to bear in mind, though, that to make smart decisions we need to understand the foundations and origins of our data, know its limits and take a step back when analysing the results.