Getting ‘More’ out of your Kaggle Notebooks
I joined Kaggle about five years ago, after hearing about the site in a MOOC I was taking at the time. From 2015 to 2019, I used Kaggle only to download datasets. I did attempt the immensely popular Titanic competition to change my status from green to blue, i.e. from Novice to Contributor, but other than that I wasn’t very active on the platform. It was only in late 2019 that I started actively contributing and writing notebooks on Kaggle. From simple analyses to exploratory data analysis, I experimented with a lot of ideas. I studied other people’s work, took inspiration and learnt a lot. Finally, after months of Kaggling, I became a Kaggle Notebooks Grandmaster in June 2020.
This article is a compilation of what I have learnt about writing effective notebooks. It isn’t a guide, just my experiences over the course of those months. Let’s begin by understanding what a Kaggle Notebook is and how it is used.
What are Kaggle Notebooks?
A Notebook is a storytelling format for sharing code and analyses. It is a cloud computing environment that enables reproducible and collaborative work. Anyone can create a Notebook right in Kaggle and embed charts directly into it. Kaggle Notebooks are of two kinds:
- Scripts — files that execute everything as code sequentially
- Notebooks — Jupyter notebooks consisting of a sequence of cells
In the Notebooks IDE, you have access to an interactive session running in a Docker container with pre-installed packages, the ability to mount versioned data sources, customizable compute resources like GPUs, and more.
Tips to get ‘More’ out of your Kaggle Notebooks
Let’s now quickly jump into some of the tips that I keep in mind before attempting to create a new notebook.
1. Tell compelling stories through Notebooks
“The purpose of a storyteller is not to tell you how to think, but to give you questions to think upon.” - Brandon Sanderson, The Way of Kings
Notebooks are an excellent tool to get your ideas across. They allow you to interactively explore data, create visualizations and then share the results with the world. In a way, you can combine both code and a writeup in the same environment. Use crisp visualizations to create a compelling storyline. Pick a unique problem and try to work through it. Define the purpose of the notebook at the very beginning and wrap it up with a fitting conclusion to create an impactful story.
In the notebook titled Geek Girls Rising: Myth or Reality!, I analysed the 2019 Kaggle ML and DS Survey data for women’s representation in Machine Learning and Data Science. The problem statement I chose to work on was whether women’s participation in Kaggle was improving and how it compared to previous years. I created a report by analysing the data across various attributes like gender, country, age group etc. Finally, I concluded the analysis with key takeaways and some recommendations.
2. Collaborate with others
“Many ideas grow better when transplanted into another mind than the one where they sprang up.” - Oliver Wendell Holmes
Collaboration is an integral part of Data Science, be it in research or the Open Source area. The importance of teaming up in Kaggle competitions cannot be emphasized enough. However, even the Kaggle notebooks have an extremely powerful collaboration feature. Multiple users can co-own and edit a Notebook. This could be helpful in a couple of scenarios.
- When you are taking part in competitions, you can collaboratively work on your code with your teammates, in the notebook.
- When working on analyzing a dataset, you can collaborate with people and create impactful project reports.
3. Contributing to Competitions via starter notebooks
Contribution is the key
Do you want to contribute to competitions but are not ready to compete yet? Start writing starter notebooks when a new competition launches. Such notebooks are typically of two kinds:
- Notebooks which perform a basic or advanced Exploratory data analysis. These notebooks help others quickly understand the nature and pattern of data, thereby saving them a lot of time. Others highly appreciate an excellent EDA notebook.
- Notebooks containing quick baselines. Such notebooks act as a stepping stone for people looking to compete in the competitions. They can use the baseline as a foundation for their own analysis.
Here is a glance at some of the EDA and starter notebooks for the competition: SIIM-ISIC Melanoma Classification for identifying melanoma in lesion images
4. Teach something new
The Best Way to Learn Something is to Teach it to Someone Else
Try teaching about a new library or some new functions. This is especially helpful for beginners who sometimes have difficulty following the official documentation. However, make sure you use some new datasets to showcase how the libraries or functions work. Duplicating the entire documentation as it stands is not a good idea.
In the notebook Useful Python libraries for Data Science, I compiled some of the useful but lesser-known Python libraries which can really come in handy for data analysis and machine learning tasks.
Similarly, in the notebook Advanced Pyspark for Exploratory Data Analysis, Tien Tran showcases how to use PySpark and its advantages over Pandas for handling big data.
5. Beware of the Data Visualization Pitfalls
“The purpose of visualisation is insight and not pictures.” - Ben Shneiderman
A picture is definitely worth a thousand words, but too many pictures defeat the purpose of clarity. There are a few points that you should consider if you want to make visualizations that stand out and also help in the storyline.
- Keep the visualizations simple and to the point.
- Explain each chart concisely. Do not leave the charts to be interpreted by the reader. Make others aware of your point of view.
- Make clear and readable charts. Be mindful of colour-blind readers and make charts that can be interpreted by everyone.
- Do not overdo animations. Use them only if they fit in the storyline.
- Make the axes and labels clearly visible. Use proper font selection and font size. Display the legends and title clearly.
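As a rough illustration of the points above, here is a minimal Matplotlib sketch. The data, group names and labels are invented purely for this example, and the two colours are taken from the Okabe-Ito palette, which is one commonly used colour-blind friendly choice rather than the only option.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical data, invented purely for illustration.
years = [2017, 2018, 2019]
share_a = [10, 12, 15]
share_b = [30, 28, 27]

# Two colours from the Okabe-Ito palette (colour-blind friendly).
BLUE, ORANGE = "#0072B2", "#E69F00"

fig, ax = plt.subplots(figsize=(8, 5))

# Different markers and line styles mean the series do not rely on colour alone.
ax.plot(years, share_a, color=BLUE, marker="o", label="Group A")
ax.plot(years, share_b, color=ORANGE, marker="s", linestyle="--", label="Group B")

# A clear title, labelled axes, readable font sizes and a visible legend.
ax.set_title("Share of respondents by year (illustrative data)", fontsize=14)
ax.set_xlabel("Survey year", fontsize=12)
ax.set_ylabel("Share of respondents (%)", fontsize=12)
ax.set_xticks(years)
ax.legend(fontsize=11)
fig.tight_layout()
```

In a notebook, the figure renders inline once the cell finishes. A sentence or two of commentary right below the chart then tells the reader what to take away from it.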
6. Make your notebooks reproducible
“Why should you care about reproducibility? Because the person most likely to need to reproduce your work… is you.” - Dr Rachael Tatman, Reproducible Machine Learning
A reproducible example allows someone else to recreate your analysis using the same data. This makes a lot of sense since you are putting your work out in public for others to use. This purpose gets defeated if others cannot reproduce your work on or off Kaggle. Rachael Tatman has published a wonderful kernel, Reproducible research best practices, which lists some of the best practices for doing reproducible work. Here are some of the tips from that kernel:
- Put all your imports (import x or library(x)) at the top of your notebook
- Break up long lines at logical places
- Make your variable names sensible and human-readable
- Comment your code!
- Make sure to set the seeds of all the random number generators (RNGs)
- Ensure that both your code and data are logically organised.
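A few of these tips can be sketched in a handful of lines. This is my own minimal illustration using only Python's standard library, not code from the kernel mentioned above; the seed value and the `sample_rows` helper are hypothetical names chosen for the example.

```python
# Imports first, so readers can see every dependency at a glance.
import random

SEED = 2020  # hypothetical value; any fixed integer works


def sample_rows(row_ids, n_rows, seed=SEED):
    """Return a repeatable random sample of row identifiers."""
    rng = random.Random(seed)  # a local, explicitly seeded generator
    return rng.sample(list(row_ids), n_rows)


# Two runs with the same seed pick exactly the same rows.
first_run = sample_rows(range(100), n_rows=5)
second_run = sample_rows(range(100), n_rows=5)
print(first_run == second_run)  # True
```

Libraries such as NumPy, TensorFlow and PyTorch keep their own random number generators, so each one you use needs to be seeded separately.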
The following example (again taken from Rachael’s slides) clearly emphasises the power of writing clean, modular and reproducible code.
7. Keep an eye on the Errors
“One man’s crappy software is another man’s full-time job.” - Jessica Gaston
Run your entire notebook before publishing. A notebook containing errors or graphs that do not render isn’t something that you would want to share with the world, let alone expect upvotes for. Also, make sure the dataset connected to the notebook is error-free.
8. Upvote before you Fork
The term “Forking” comes from version control. Forking a notebook means making a copy of it as it currently is. It is common for people to fork good notebooks or baselines and build their own code on top of them. However, some people will fork a notebook or use others’ work without showing any appreciation by upvoting the original. If you have found somebody’s code so useful that you ended up using it, why not show the author some gratitude?
The notebook highlighted below has more forks than upvotes. Strange!
9. Follow Notebook Etiquettes
Getting Inspired is Human, but Plagiarising is evil
- Do not lift content directly from other notebooks. If you think you want to reuse a chunk of code, give clear attribution.
- Refrain from spamming notebooks by asking for votes. A good notebook will catch the eye of fellow Kagglers. You can also share your content on social media like LinkedIn or Twitter to tell the world that you have created a new notebook. But do not keep asking for votes, especially in return for a vote that you have given.
- Do not follow people blindly because of their ‘Title’ on Kaggle. Have a look at their work, make sure it is honest and genuine and adds value. Only then follow people.
10. Show Appreciation for good work
Showing appreciation is one of the simplest yet one of the most powerful things humans can do for each other
Finally, do not shy away from appreciating great notebooks by upvoting them and giving them a shoutout. Here’s a great example of appreciating others’ work by Kaggle GM Heads or Tails. He periodically showcases Kaggle Notebooks which he feels haven’t gotten their due.
Kaggle Notebooks are a great tool to get your thoughts across. Search or curate some cool datasets and use notebooks to create some outstanding analysis. In the end, do not forget to enjoy the process. There is so much to learn from the fantastic Kaggle community out there. But the most important thing is to attempt — for the secret of getting ahead is getting started.