How to Create Agile Data Science Projects?

Piotr Gloger

Aug 29, 2022 • 16 min read

Traditional Scrum processes, although useful in standard software development, may not be the best choice when it comes to data science projects. To follow an agile software development approach to data science, Kanban may be a much better choice of framework.

Projects from the field of Data Science differ substantially from standard software development projects. The main difference is that developing a machine learning (ML) model involves a lot of experimentation, which means implementing modifications depending on the results. It is an iterative process: a Machine Learning Engineer needs to return to an earlier task (e.g. data processing) whenever an experiment result suggests so. Consequently, the whole project is hard to split into discrete tasks that can regularly be moved to the ‘Done’ column on the JIRA board. Moreover, the tasks often tend to be long (i.e., more than 5 days), so they are not a good fit for sprint-based work. This means that the traditional Scrum process - using sprints etc. - should be modified for such projects.

Below you can find some sets of rules that a well-managed Data Science project should follow, and how Kanban can help you achieve success in implementing agile in data science.

What Does ‘agile’ Mean in Data Science?

Essentially, agile data science means using a set of practices to manage the project in an iterative fashion. This allows teams to keep updating their process and goals based on the results of previous experiments and investigations. This isn’t to say that regular software development isn’t agile, but the source of the amendments is different: in traditional software development, changes are most often driven by client requirements, whereas in agile data science, continuous changes are driven by investigations carried out by the team.

Data science projects follow the Agile Manifesto, but its execution depends on the specifics of the project. Let’s take a look at the core values:

  • Working software over comprehensive documentation
  • Responding to change over following a predetermined plan
  • Customer collaboration over contract negotiation
  • Individuals and interactions over processes and tools

What is the Data Science Project Lifecycle?

The data science project lifecycle consists of 11 steps spanning everything from problem identification to model deployment, monitoring and acting on the results:

Data Science Project Lifecycle

  • Problem identification: The very first thing for any data science team is to identify what the problem is and how data science can be leveraged to fix it.
  • Business understanding: The process of understanding what the customer wants to gain from the project from a business perspective.
  • Data Collection: Collecting relevant data to achieve the end business goals.
  • Data pre-processing: Data collected in various forms and formats has to be pre-processed to allow proper usage.
  • Data analysis: Available and pre-processed data can now be analysed by data scientists to understand the data in-depth.
  • Data modelling: Deciding how we model the data and which tasks would be suitable for modelling.
  • Model training: Training the chosen model using the pre-processed data.
  • Model evaluation: Comparing the trained models to decide which one is the most effective.
  • Model deployment: Putting the trained model to use by exposing it to real-time data.
  • Model monitoring: Observing how the model behaves in a real-world scenario to see whether key indicators are achieved or not.
  • Taking action: Acting on the insights gained helps businesses make strategic decisions, for example planning supply based on predictions.

What is Needed for a Successful Machine Learning Model?

There are three major things we need to create a successful machine learning model: Data, Deliverables, and Rationality. Each has its own role, and it is easiest to see why each one is so vital by looking at examples where it was missing:

First: Data

By that, we mean large amounts of good-quality, useful data. However, even if there is not enough data, we can sometimes still solve the client's problem. There are two main approaches to handling that situation.

The first is to extend the dataset. The simplest way is to label more data. It may sound silly, but sometimes it is really worth considering whether spending one week labelling new data will solve our problem. It may be a quicker and better solution than working on any technically advanced method of compensating for the lack of data.

However, labelling is often too expensive or difficult. In such a case, we can still extend the dataset by finding similar publicly available datasets, buying data from third-party vendors, or enlarging the dataset artificially using data augmentation techniques. The second approach is to use a technique called transfer learning: you take an ML model that has been trained on large amounts of data and fine-tune it to solve your task.
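To make the idea of data augmentation more concrete, here is a minimal sketch in plain Python. It assumes tiny greyscale images stored as nested lists of pixel values between 0 and 1; the function names (horizontal_flip, add_noise, augment) and the toy dataset are hypothetical and only serve as an illustration. In a real project you would more likely use the augmentation utilities of your ML framework.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def add_noise(image, scale, rng):
    """Add uniform noise in [-scale, scale] to every pixel, clamped to [0, 1]."""
    return [
        [min(1.0, max(0.0, pixel + rng.uniform(-scale, scale))) for pixel in row]
        for row in image
    ]


def augment(dataset, noise_scale=0.05, seed=0):
    """Return the original (image, label) pairs plus a flipped and a noisy copy of each."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
        augmented.append((add_noise(image, noise_scale, rng), label))
    return augmented


# Hypothetical toy dataset: one 2x2 "image" with a label.
tiny_dataset = [([[0.0, 0.5], [1.0, 0.2]], "cat")]
bigger_dataset = augment(tiny_dataset)
print(len(bigger_dataset))  # 3 samples: original, flipped, noisy
```

Keep in mind that a transformation is only useful if it preserves the label: mirroring a photo of a cat still shows a cat, but mirroring a scanned document would produce unreadable text.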

Example: We had a situation on a project where the client didn't have any data to train on at all. Luckily, our data science team could turn to language models that had been pre-trained on large datasets, usually by large tech companies. Thanks to that, we were able to build a working solution - the client had PDF materials which were useful for testing it. The quality of the first output wasn't optimal, but because it was a human-in-the-loop approach (which means the user was able to edit the outcome), we could still generate a lot of savings (of time and effort) with this model.

Second: Deliverables

Getting to know exactly what problem needs to be solved is key for successful ML implementation. Having a Definition of Done (DoD) sounds obvious to us as PMs, but some clients have come to us asking for an already defined output that had little to do with their real problem (or with what the provided data could be used for).

Example: iPhone face recognition. Recognising faces is a nice feature, but the crucial result here is to recognise the faces that are actually allowed to access the iPhone - these two scenarios call for different DoDs.

Third: Rationality

We must be aware that the model learns based on received data and we are not always able to shape the direction of outcome.

Example: A few years ago, Amazon started using a Machine Learning model to support its recruitment process by ranking candidates. The model was trained on resumes submitted during the previous 10 years, a period in which IT was a male-dominated field - as a result, the model didn’t consider women to be strong candidates (no matter their skills).
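One simple way to spot this kind of problem early is to compare how often the model selects candidates from different groups. The sketch below is only an illustration on made-up data: the group labels, the toy predictions and the function names (selection_rates, parity_gap) are our assumptions, and this is not a description of how Amazon analysed its model.

```python
from collections import defaultdict


def selection_rates(records):
    """Compute the share of selected candidates per group.

    records: iterable of (group, was_selected) pairs, where was_selected is a bool.
    """
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Difference between the highest and the lowest selection rate."""
    return max(rates.values()) - min(rates.values())


# Made-up model decisions for two hypothetical groups, "A" and "B".
toy_predictions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
rates = selection_rates(toy_predictions)
print(rates)              # {'A': 0.75, 'B': 0.25}
print(parity_gap(rates))  # 0.5
```

A large gap does not prove on its own that the model is unfair, but it is a strong signal that the training data and the results deserve a closer look before the model is used.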

What’s the Best Way to Manage a Machine Learning Project?

Successful, agile data science projects are based on a few factors running smoothly: gathering information, client cooperation, deliverables and overall progress. Here are some tips and tricks to boost each area to create your own successful data science project:

Gathering information

  • Get access to the client’s data as early as possible - it is very helpful to discover early on how we can use it (or whether it is useful at all) and what value it holds.
  • Reserve a substantial amount of time for data exploration instead of rushing into experimentation. Let your ML Engineers and data scientists spend some time on finding solutions to similar problems - maybe there is already labelled data that can be used, or they can compare different approaches to find the best one. The ML Canvas can be especially useful here.
  • Highlight risks and assumptions based on the data exploration to avoid missing the client’s expectations.

Client cooperation

  • Educate clients on ML and help them understand how the solution will work.
  • Define early on how the results/demo of the ML model are going to be presented to the client.
  • Get to know the business context (building a tool for optimising marketing costs was much easier for me thanks to my earlier background in that area).

Deliverables

  • Once you know what problem is to be solved, define the expected results, so that you and your data science team can tell whether you have actually delivered what you previously defined.
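Expected results are easiest to verify when they are written down as explicit, measurable thresholds. The following is a minimal sketch of that idea; the metric names, the threshold values and the helper check_expected_results are hypothetical and should be replaced with whatever you agreed on with your client.

```python
def check_expected_results(measured, expected):
    """Compare measured metrics with the agreed minimum values.

    Returns a (delivered, failures) pair, where failures maps each metric that
    is missing or below its target to a (measured value, required minimum) tuple.
    """
    failures = {}
    for name, minimum in expected.items():
        value = measured.get(name)
        if value is None or value < minimum:
            failures[name] = (value, minimum)
    return (not failures, failures)


# Hypothetical targets agreed with the client and hypothetical measurements.
expected_results = {"precision": 0.90, "recall": 0.80}
measured_results = {"precision": 0.93, "recall": 0.76}

delivered, failures = check_expected_results(measured_results, expected_results)
print(delivered)  # False
print(failures)   # {'recall': (0.76, 0.8)}
```

Writing the targets down in this form also makes it obvious to the client which metric is currently holding the delivery back.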

Progress

  • Training, testing, training, testing, training, testing, training, testing… You may feel like it never ends! Sometimes training takes up to a few days but it turns out to be completely useless, so it is very hard to fit it into time-boxes. This is why we recommend value-boxes.
  • If the results of the ML model are not going as planned, involve senior ML leadership to help dig up the root cause.
  • Keep an eye on the metric results - check that the metric still reflects the actual problem and targets the business value.

What Framework Should You Choose for Your Data Science Project?

The first question that arises when commencing a project is often this: what framework should be used to run the project efficiently?

In the intro we have already mentioned that Scrum might not be the best choice when we are dealing with the experimental phase of the project. Experiments generate new questions that need to be answered (thus generating new tasks), and by their nature these cannot be predicted before the start of the sprint. A typical data-centred project involves a lot of tasks connected with data processing and transformations. We then use that pre-processed data to train models. When we investigate the model's performance, more often than not, we come back to the data and perform further pre-processing to improve the results. That is why it is so difficult to estimate at the beginning of the process how long a research task will take, and to fit such tasks into neat time-boxes.

We now know why Scrum is not the best choice, but what framework would work better for such projects?

From our experience, working with Kanban yields the best results. It gives more space for experimentation; it lets us add tickets while we are performing experiments. It also gives better status tracking and we often write down results of experiments as comments to the ticket, meaning they don’t get lost when the sprint ends.

At some point, a machine learning project moves to more mature phases, during which the tasks might be more predictable in terms of length (model deployment, data versioning etc.) and it could be a moment to move to Scrum (not recommended, but possible).

To sum up:

  • We can choose to follow Scrum (with sprints) or Kanban (separate from the rest of the development process). The choice can change as the project progresses.
  • In the early experimental phase, always use Kanban.
  • When the ML system is already running and only time-boxed improvements are performed, you can move to Scrum.

What are the Scrum Guidelines to Manage Data Science Projects at Netguru?

If for some reason you still decide that Scrum is the best choice for your project, there is a set of guidelines that we follow to optimise the process:

  • The default length of the sprint should be 2 weeks.
  • The sprint should NOT be shorter than the time estimated for the task at hand (if the task is estimated to take 3 weeks - the sprint should last 3 weeks).
  • We set a goal for the sprint and the direction of experiments.
  • We do NOT set any specific number of experiments that will be run during a sprint.
  • We do NOT specify what exact experiments will be conducted, as any succeeding experiment is based on the previously run experiments (the directions might change as a result of previous experiment) - we make sure to report that to the PM.
  • We choose an epic to get released during a sprint, without the exact assigning of story points for each task contained in that epic.
  • During a sprint, each ML developer reports a traffic-light status for the current epic once a week (whether it is on fire, under control, or going better than expected).
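The weekly traffic-light report from the guidelines above can be captured in a very small data structure. The sketch below is one possible way to do it: the TrafficLight enum, the epic_status helper and the rule that an epic is as bad as the worst individual report are our assumptions for illustration, not a fixed part of the process.

```python
from enum import IntEnum


class TrafficLight(IntEnum):
    """Weekly status of an epic; a higher value means more attention is needed."""
    BETTER_THAN_EXPECTED = 0
    UNDER_CONTROL = 1
    ON_FIRE = 2


def epic_status(reports):
    """Summarise an epic by the worst status reported by any developer.

    reports: dict mapping a developer name to a TrafficLight value.
    """
    if not reports:
        raise ValueError("at least one report is required")
    return max(reports.values())


# Hypothetical reports for one week.
weekly_reports = {
    "developer_a": TrafficLight.UNDER_CONTROL,
    "developer_b": TrafficLight.ON_FIRE,
}
print(epic_status(weekly_reports).name)  # ON_FIRE
```

Taking the worst status errs on the side of caution: a single "on fire" report is enough to bring the epic to the PM's attention.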

What are the Meeting Guidelines in Data Science Projects at Netguru?

Whilst you’re in the throes of your project, it’s important to remember to meet regularly with your team, data scientists and with your client so you can make sure everything is on the right track. Here are some useful rule-of-thumb guidelines we use at Netguru around project meetings:

  • The default frequency of stand-up meetings with the Project Manager is 2 times a week. The tasks are most often not granular enough to be discussed more frequently (stand-ups may be held more often if the tasks are more granular).
  • The default frequency of stand-up meetings of a project ML team is 3 times a week.
  • Machine Learning engineers should provide a demo of the results to the clients after performing some set of experiments. As a general rule, some kind of demo (new set of results, visuals if possible) should be produced every 2 weeks (this may differ substantially depending on specific project).
  • Internal demos can be produced asynchronously – this could be a picture, video or results obtained with some approach regarding some metric posted on the project’s Slack channel.
  • Full daily routine makes more sense with a full team (backend, frontend, etc.) onboard. Also, when there are some questions towards the data and the client possesses expert knowledge, it makes sense to conduct dailies with the client more often.
  • When a PM needs to better understand the current status and timeline, a roadmap meeting every week or two is a good idea. In a roadmap meeting, simply look at the JIRA roadmap of epics (or a Gantt chart) and ask the ML engineers to adjust the start and end dates of the epics. This will allow you to find potential dependencies with other technologies. If such a meeting is regular - like sprint planning with the rest of the development team - the ML Engineers should only be needed for the first 15 minutes of it.

What are the Project Manager Responsibilities During Data Science Projects?

As the project manager on a team working on agile data science projects, you should prioritise certain responsibilities over others. For example, identifying red flags on the project should NOT be the PM's responsibility. It should be part of the Senior / Tech Lead ML Engineer's role, as Data Science projects are complex in their nature (especially on regular-only projects). For the PM, it is a better idea to collect feedback from the team. For example: each ML developer reports a traffic-light status for the current epic once a week (whether it is on fire, under control, or going better than expected).

PMs should be, for the most part, responsible for solving data-related issues such as:

  • Pushing for required accesses to the data.
  • Hiring annotators (if the project needs it).
  • Creating accounts on labelling tools.
  • Coordinating the flow of incoming data for the Data Science team, while consulting a Data Science team member to make sure nothing is lost in translation along the chain ML Developer => PM => client => data provider.
  • When needed, the PM should organise calls with data providers.
  • When needed, the PM should make sure both parties (developer, client / expert) can initiate easy contact between each other.

Data Scientists and Project Managers Working Shoulder to Shoulder in Agile Data Science Projects

Machine Learning projects may ruin most of your project management habits, but hey - we are all still learning! You should be aware that you may not deliver anything in particular during sprints (again: working in time-boxes may be very frustrating). The whole process of delivering a product is a constant, repeated data exploration. This is why feedback loops and close cooperation between project managers and data scientists are so important. They allow the process to be adjusted whenever required.

Due to this, it is crucial to keep an eye on all the risks and to keep them up to date - you, the development team and the client must all be prepared for the fact that not all of the goals will be achieved.

However, remember that rules regarding data science project management are not set in stone and, if for some specific project an alternative setup makes sense, feel free to make some modifications. It is also a good idea to discuss your approach to the project with Project Managers who have already had a chance to work on such projects.

Piotr Gloger

Senior Machine Learning Engineer at Netguru