Getting ‘More’ out of your Kaggle Notebooks


I joined Kaggle about five years ago, after hearing about the site in a MOOC that I was taking at the time. From 2015 till 2019, I used Kaggle only to download datasets. I did attempt the immensely popular Titanic competition to change my status from green to blue, i.e. from Novice to Contributor, but other than that I wasn’t very active on the platform. It was only in late 2019 that I started actively contributing and writing notebooks on Kaggle. From analysis to exploratory data analysis, I experimented with a lot of ideas. I studied other people’s work, took inspiration and learnt a lot. Finally, after months of Kaggling, I became a Kaggle Notebooks Grandmaster in June 2020.

This article is a compilation of what I have learnt about writing effective notebooks. It isn’t a guide, just my experiences over the course of several months. Let’s begin by understanding what a Kaggle Notebook is and how it is used.


What are Kaggle Notebooks?

A Notebook is a storytelling format for sharing code and analyses. It is a cloud computing environment that enables reproducible and collaborative work. Anyone can create a Notebook right in Kaggle and embed charts directly into it. Kaggle Notebooks are of two kinds:

  • Scripts — files that execute everything as code sequentially
  • Notebooks — Jupyter notebooks consisting of a sequence of cells

In the Notebooks IDE, you have access to an interactive session running in a Docker container with pre-installed packages, the ability to mount versioned data sources, customizable compute resources like GPUs, and more.


Tips to get ‘More’ out of your Kaggle Notebooks

Let’s now quickly jump into some of the tips that I keep in mind before attempting to create a new notebook.

1. Tell compelling stories through Notebooks

“The purpose of a storyteller is not to tell you how to think, but to give you questions to think upon.”

Brandon Sanderson, The Way of Kings

Notebooks are an excellent tool to get your ideas across. They allow you to interactively explore data, create visualizations and then share the results with the world. In a way, you can combine both code and a writeup in the same environment. Use crisp visualizations to create a compelling storyline. Pick a unique problem and try to work through it. Define the purpose of the notebook at the very beginning and wrap it up with a fitting conclusion to create an impactful story.

In the notebook titled Geek Girls Rising: Myth or Reality!, I analysed the 2019 Kaggle ML and DS Survey data for women’s representation in Machine Learning and Data Science. The problem statement that I chose to work on was whether women’s participation on Kaggle was improving and how it compared to previous years. I created a report by analysing the data across various attributes like gender, country, age group etc. Finally, I concluded the analysis with key takeaways and some recommendations.

An excerpt from the Notebook: Geek Girls Rising: Myth or Reality!

2. Collaborate with others

“Many ideas grow better when transplanted into another mind than the one where they sprang up.”

Oliver Wendell Holmes

Collaboration is an integral part of Data Science, be it in research or in open source. The importance of teaming up in Kaggle competitions cannot be emphasized enough. Kaggle Notebooks themselves also have an extremely powerful collaboration feature: multiple users can co-own and edit a Notebook. This can be helpful in a couple of scenarios.

  • When you are taking part in competitions, you can collaboratively work on your code with your teammates, in the notebook.
  • When working on analyzing a dataset, you can collaborate with people and create impactful project reports.

“Creating, Reading & Writing Data”, a Notebook from the Advanced Pandas Kaggle Learn track, is one example of a great collaborative notebook.

Collaborating through Kaggle Notebooks

3. Contributing to Competitions via starter notebooks

Contribution is the key

Do you want to contribute to competitions but are not ready to compete yet? Well, start writing starter notebooks when a new competition launches. Such notebooks are typically of two kinds:

  • Notebooks which perform a basic or advanced exploratory data analysis (EDA). These notebooks help others quickly understand the nature and patterns of the data, thereby saving them a lot of time. An excellent EDA notebook is highly appreciated by the community.
  • Notebooks containing quick baselines. Such notebooks act as a stepping stone for people looking to compete in the competitions. They can build their own analysis on top of the baseline.

Here is a glance at some of the EDA and starter notebooks for the SIIM-ISIC Melanoma Classification competition, which is about identifying melanoma in lesion images.

SIIM-ISIC Melanoma Classification competition Notebooks

4. Teach something new

The Best Way to Learn Something is to Teach it to Someone Else

Try teaching a new library or some new functions. This is especially helpful for beginners, who sometimes have difficulty following the official documentation. However, make sure you use some new datasets to showcase how the libraries or functions work. Duplicating the documentation as it is is not a good idea.

In the notebook Useful Python libraries for Data Science, I compiled some useful but lesser-known Python libraries which can really come in handy for data analysis and machine learning tasks.

Libraries covered in Kaggle Notebook: Useful Python libraries for Data Science

Similarly, in the notebook Advanced Pyspark for Exploratory Data Analysis, Tien Tran showcases how to use PySpark and its advantages over Pandas for handling big data.


5. Beware of the Data Visualization Pitfalls

“The purpose of visualisation is insight and not pictures.”

Ben Shneiderman

A collection of top 10 DataViz caveats by data-to-viz.com

A picture is definitely worth a thousand words, but too many pictures defeat the purpose of clarity. There are a few points that you should consider if you want to make visualizations that stand out and also help in the storyline.

  • Keep the visualizations simple and to the point.
  • Explain each chart concisely. Do not leave the charts to be interpreted by the reader. Make others aware of your point of view.
  • Make clear and readable charts. Be mindful of colour-blind readers and make charts that can be interpreted by everyone.
  • Do not overdo animations. Use them only if they fit in the storyline.
  • Make the axes and labels clearly visible. Use a proper font and font size. Display the legends and title clearly.
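
To make a few of these points concrete, here is a minimal sketch of a simple, labelled bar chart. It assumes matplotlib is available (it usually is in a Kaggle Notebook). The helper name make_readable_bar_chart, the font sizes and the data are all made up for illustration, and the colours are the Okabe-Ito palette, which is commonly recommended as colour-blind friendly. Treat it as one possible starting point rather than the right way to style a chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# A subset of the Okabe-Ito palette, often suggested for colour-blind-friendly charts.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#0072B2", "#D55E00", "#CC79A7"]


def make_readable_bar_chart(categories, values, title, xlabel, ylabel):
    """Hypothetical helper: one simple bar chart with a clear title and axis labels."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # Cycle through the palette so any number of bars gets a colour.
    colours = [OKABE_ITO[i % len(OKABE_ITO)] for i in range(len(categories))]
    ax.bar(categories, values, color=colours)
    # Explicit title and axis labels in a readable font size.
    ax.set_title(title, fontsize=16)
    ax.set_xlabel(xlabel, fontsize=13)
    ax.set_ylabel(ylabel, fontsize=13)
    ax.tick_params(axis="both", labelsize=11)
    return fig, ax


# Made-up data, purely for illustration.
fig, ax = make_readable_bar_chart(
    ["A", "B", "C"], [12, 30, 7],
    title="Made-up counts per group", xlabel="Group", ylabel="Count",
)
```

In a notebook, I would follow such a chart with a sentence or two stating what the reader should take away from it.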

6. Make your notebooks reproducible

“Why should you care about reproducibility?

Because the person most likely to need to reproduce your work… is you.”

Dr Rachael Tatman, Reproducible Machine Learning

A reproducible example allows someone else to recreate your analysis using the same data. This matters because you are putting your work out in public for others to use, and that purpose is defeated if they cannot reproduce your work on or off Kaggle. Rachael Tatman has published a wonderful kernel, Reproducible research best practices, which lists some of the best practices for doing reproducible work. Here are some of the tips from that kernel:

  • Put all your imports (import x or library(x)) at the top of your notebook.
  • Break up long lines at logical places.
  • Make your variable names sensible and human-readable.
  • Comment your code!
  • Make sure to seed all the random number generators (RNGs).
  • Ensure that both your code and data are logically organised.
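
As a small illustration of the tips on imports, naming, comments and seeding, here is a minimal sketch using only Python’s standard library. The seed value, the function name sample_rows and the data are hypothetical. Libraries such as NumPy, TensorFlow or PyTorch keep their own random state, so if you use them you would need to seed them separately as well.

```python
# All imports live at the top of the notebook.
import random

# One named seed, defined once, so that every random step can be traced back to it.
# The value itself is arbitrary and chosen only for this example.
SEED = 2020


def sample_rows(rows, sample_size, seed):
    """Return a random sample of rows that is identical on every run for a given seed."""
    rng = random.Random(seed)  # local generator, independent of the global random state
    return rng.sample(rows, sample_size)


# Made-up data: 100 row indices standing in for a real dataset.
row_indices = list(range(100))

first_run = sample_rows(row_indices, 5, SEED)
second_run = sample_rows(row_indices, 5, SEED)

# Both calls return the same rows because they start from the same seed.
print(first_run == second_run)
```

Using a local random.Random instance instead of the global random.seed is a design choice: it keeps the sampling reproducible even if another cell draws random numbers in between.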

The following example (again taken from Rachael’s slides) clearly emphasises the power of writing clean, modular and reproducible code.

Do’s and Don’ts for writing reproducible code (source: Rachael Tatman’s slides)

7. Keep an eye on the Errors

“One man’s crappy software is another man’s full-time job.”

Jessica Gaston

Run your entire notebook before publishing. A notebook containing errors or graphs that do not render isn’t something that you would want to share with the world, let alone expect upvotes for. Also, make sure the dataset connected to the notebook is error-free.


8. Upvote before you Fork

The term “forking” comes from version control. Forking a notebook means making a copy of it as it currently is. It is common for people to fork good notebooks or baselines and build their own code on top of them. However, some people will fork a notebook or use others’ work without showing any appreciation by upvoting the original. If you have found somebody’s code so useful that you ended up using it, why not show the author some gratitude?

The notebook highlighted below has more forks than upvotes. Strange!


9. Follow Notebook Etiquette

Getting Inspired is Human, but Plagiarising is evil

  • Do not lift content directly from other notebooks. If you think you want to reuse a chunk of code, give clear attribution.
  • Refrain from spamming the notebooks by asking for votes. A good notebook will catch the eye of fellow Kagglers. You can also share your content on social media like LinkedIn or Twitter to tell the world that you have created a new notebook. But do not keep asking for votes, especially in return for a vote that you have given.
  • Do not follow people blindly because of their ‘Title’ on Kaggle. Have a look at their work and make sure it is honest, genuine and adds value. Only then follow people.

10. Show Appreciation for good work

Showing appreciation is one of the simplest yet one of the most powerful things humans can do for each other

Finally, do not shy away from appreciating great notebooks by upvoting them and giving them a shoutout. Here’s a great example of appreciating others’ work by Kaggle GM Heads or Tails. He periodically showcases Kaggle Notebooks which he feels haven’t gotten their due.


Conclusion

Kaggle Notebooks are a great tool to get your thoughts across. Search for or curate some cool datasets and use notebooks to create some outstanding analyses. In the end, do not forget to enjoy the process. There is so much to learn from the fantastic Kaggle community out there. But the most important thing is to attempt, for the secret of getting ahead is getting started.
