What Is Kaggle? An Overview

Robert Kostrzewski

Apr 9, 2019 • 5 min read

Have you ever wondered what steps you should take first to dive deep into the Machine Learning world?

You can consider various routes, including academic courses, online tutorials, and books. Each of these approaches has its own benefits and may help you become an AI Engineer. On the other hand, it is often said that learning works best when it is hands-on.

The best way to improve your Machine Learning skills is to practice, mostly because each real AI case is different: the more cases you are familiar with, the better your chances of success.

One of the possible ways of improving your skills in solving real problems is taking part in Kaggle competitions.

Quoting Wikipedia:

Kaggle is the world's largest community of data scientists and machine learners, owned by Google, Inc. Kaggle got its start by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and short form AI education.

Indeed, Kaggle is not only a platform for running competitions. It is a whole community eager to share knowledge.

Let’s see how it works. After registering, you can browse the active and past competitions. By a competition, we mean a specific Machine Learning problem to solve. Often, the task is to find the AI model that best fits the given conditions. Sometimes accuracy is not the only goal: some task authors also require your algorithm to run fast.

Kaggle competitions are real contests, with real-time leaderboards and sometimes real-world prizes. For instance, a competition sponsored by the Two Sigma corporation offered prizes with a total value of $100,000! To give another example, Santander Bank sponsored $65,000 worth of prizes for the best solutions to its case study.

In most cases, each user submits their own solutions and is placed individually in the general ranking. In some competitions, it is possible to enroll as a team, where the overall score is the average of the team members’ results.

Competition authors always share the datasets that competitors should use to build their models. These are split into train and test datasets. The train dataset contains both features and “answers”, while the test dataset is often only a small part of the final dataset on which a solution is validated. The score of each iteration is just the evaluation of that specific solution on another part of the test dataset. That is why guessing the answers is not possible in some cases, or is at least quite difficult.
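To make the idea of scoring on separate parts of the test data concrete, here is a minimal sketch. It is not Kaggle's actual implementation: the 30% fraction, the accuracy metric, the toy data, and the names `split_public_private` and `accuracy` are all assumptions made for illustration.

```python
import random

def split_public_private(ids, public_fraction=0.3, seed=0):
    """Randomly assign test row ids to two disjoint subsets (assumed 30/70 split)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])

def accuracy(predictions, answers, subset):
    """Share of rows in `subset` whose prediction matches the hidden answer."""
    correct = sum(1 for i in subset if predictions[i] == answers[i])
    return correct / len(subset)

# Toy hidden answers for ten test rows (made-up data).
answers = {i: i % 2 for i in range(10)}
# A submission that predicts 0 for every row.
predictions = {i: 0 for i in range(10)}

public_ids, private_ids = split_public_private(answers.keys())
# Score shown after each iteration: computed on one part of the test data.
public_score = accuracy(predictions, answers, public_ids)
# Final validation: computed on the remaining, unseen part.
private_score = accuracy(predictions, answers, private_ids)
print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

Because the two scores come from different rows, tuning a solution against the score of each iteration does not guarantee the same result on the final validation.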

Types of Kaggle competitions

Because of data safety concerns and the possibility of cheating, there are two methods of evaluating results. The first is a standard competition, in which only the final results count. In this case, a user evaluates their machine learning model locally, prepares a file with the final predictions, and submits it to the Kaggle website. After a few seconds, the user knows their score. The next step is a decision: is this the best possible score, or could the next iteration improve on it? Five iterations are possible per day. This limit is a safeguard, designed to keep anyone from overloading Kaggle servers with a flood of submissions.
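Preparing the predictions file for a standard competition can be sketched as follows. The required file layout differs between competitions, so the column names `id` and `target`, the helper `write_submission`, and the sample predictions are hypothetical.

```python
import csv
import io

def write_submission(predictions, out, id_column="id", target_column="target"):
    """Write (row id, prediction) pairs as CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column, target_column])
    for row_id, prediction in predictions:
        writer.writerow([row_id, prediction])

# Made-up predictions from a locally trained model.
predictions = [(101, 0), (102, 1), (103, 0)]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To produce a real file, the same function can be given a file object instead, for example one opened with `open("submission.csv", "w", newline="")`.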

The other type of competition is a Kernel-based competition. The user can still test their model locally, but sending a solution file is no longer the way to get a score. In this type of competition, you create a Kernel: an online script with your code that builds the model and evaluates the data. Kaggle servers then run and evaluate your script. This type of competition is usually a bit more complex and strict, because your solution has to be portable, smooth, and CPU- and memory-efficient.
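Since a Kernel has to be CPU- and memory-efficient, it can help to measure a script locally before running it online. The sketch below uses only the Python standard library. The `profile` helper and `toy_pipeline` are hypothetical stand-ins, and the actual resource limits are set by each competition, not by this code.

```python
import time
import tracemalloc

def profile(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak

def toy_pipeline(n):
    """Stand-in for model training plus prediction (hypothetical)."""
    return sum(x * x for x in range(n))

result, elapsed, peak = profile(toy_pipeline, 100_000)
print(f"result={result}, time={elapsed:.3f}s, peak={peak / 1024:.1f} KiB")
```

Note that `tracemalloc` only tracks memory allocated through Python, so memory used inside native libraries may not be counted.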

Most interesting recent Kaggle competitions

  1. Quora Insincere Questions Classification – Detection of toxic content to improve online conversations
  2. Two Sigma: Using News to Predict Stock Movements – Using news analytics to predict stock price performance
  3. Airbus Ship Detection Challenge – Detection of ships on satellite images as quickly as possible
  4. Home Credit Default Risk – Credit default prediction
  5. Digit Recognizer – Learn computer vision fundamentals with the famous MNIST data

The best platform for Machine Learning beginners

Kaggle might not be the best approach for advanced Machine Learning developers, since its values and evaluations differ a bit from real-world cases, but I still believe it is a good entry point for beginners. By taking part in exciting competitions, we can learn new Machine Learning trends and methods, with Deep Learning topics covered especially well. If we start from scratch, we can jump into a basic competition, which is more of a Machine Learning tutorial. Then, we can move on to Deep Learning tutorials, and so on.

Another advantage of Kaggle is that we can use a variety of Machine Learning frameworks. Kaggle supports Caffe, Keras, TensorFlow, PyTorch, and others. We don’t have to limit our learning to only one technology.
