Streamline your data science code repository and tooling quickly and efficiently
Good Code is its own best documentation
Dr. Rachael Tatman, in one of her presentations, highlighted the importance of code reproducibility in a very subtle way:
“Why should you care about reproducibility? Because the person most likely to need to reproduce your work… is you.”
This is true on so many levels. Have you ever found yourself in a situation where it became difficult to decipher your own codebase? Do you often end up with multiple files like untitled2.ipynb? If not all of us, at least a few of us have undoubtedly faced the brunt of bad coding practices on a few occasions. The situation is even more common in data science. Often, we limit our focus to the analysis and the end product while ignoring the quality of the code responsible for that analysis.
Why is reproducibility a vital ingredient in the data science pipeline? I have touched upon this topic in another blog post, and I’ll borrow a few lines from there. A reproducible example allows someone else to recreate your analysis using the same data. This makes a lot of sense: you put your work out in public for others to use, and that purpose is defeated if they cannot reproduce it. In this article, let’s look at three useful tools that can help you create structured and reproducible projects.
Creating a good project structure
Let’s say you want to create a project that contains code to analyze the sentiment of movie reviews. There are three essential steps to creating a good project structure:
1. Automating project template creation with Cookiecutter Data Science
There is no clear consensus in the community on best practices for organizing machine learning projects. As a result, there is a plethora of choices, and this lack of clarity leads to confusion. Fortunately, there is a workaround, thanks to the people at DrivenData. They have created a tool called Cookiecutter Data Science, a standardized but flexible project structure for doing and sharing data science work. A few lines of code set up a whole series of subdirectories and make it easier to start, structure, and share an analysis. You can read more about the tool on its project home page. Let’s get to the interesting part and see it in action.
Install it with pip:
pip install cookiecutter
or with conda:
conda config --add channels conda-forge
conda install cookiecutter
Starting a new project
Head over to your terminal and run the following command. It will automatically populate a directory with the required files.
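Running it looks roughly like this (a sketch: the project-name values are illustrative, and --no-input is passed only so the command runs unattended; drop it to have Cookiecutter prompt you for each value):

```shell
# Install the tool first if you have not already (see above).
pip install -q cookiecutter

# Work from wherever the project folder should live; the Desktop here,
# to match the walkthrough.
mkdir -p ~/Desktop && cd ~/Desktop

# Scaffold the project from the DrivenData template, pinning the v1
# template (see the note below on version 2). Drop --no-input to be
# prompted for project_name, author_name, open_source_license, etc.
cookiecutter --no-input -c v1 \
  https://github.com/drivendata/cookiecutter-data-science \
  project_name="Sentiment Analysis" repo_name="sentiment_analysis"
```

This produces a sentiment_analysis/ folder with data/, notebooks/, src/, reports/, and the other standard subdirectories, ready to be filled in.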
A Sentiment Analysis directory gets created at the specified path, which in the above case is the Desktop.
Note: Cookiecutter Data Science will be moving to version 2 soon, so there will be a slight change in how the command is used in the future. You will have to use ccds ... rather than cookiecutter ... in the command above. As per the GitHub repository, this version of the template will still be available, but you will have to explicitly pass -c v1 to select it. Keep an eye on the documentation for when the change happens.
2. Creating a good README with readme.so
After creating the skeleton of the project, you next need to populate it. But before that, there is an important file to update: the README. A README is a markdown file that communicates essential information about your project. It tells others what the project is about, the project’s license, how others can contribute to the project, and so on. I have seen many people put tremendous effort into their projects but fail to create decent READMEs. If you are one of them, there is some good news in the form of a project called readme.so.
A good soul has put an end to writing READMEs manually. Katherine Peterson recently created a simple editor that lets you quickly create and customize your project’s README. GitHub even retweeted Katherine’s tweet.
The editor is pretty intuitive. You only need to click on a section to edit its content, and the section gets added to your README. Choose the ones you like from an extensive collection. You can also move sections around depending on where you want them on the page. Once you have everything in place, copy the content or download the file and add it to your existing project.
3. Pushing your code to GitHub
We are almost done. The only thing left is to push the code to GitHub (or any version-control platform of your choice). You can do that easily via Git. Here is a handy cheat sheet containing the most important and commonly used Git commands for easy reference.
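If you prefer the terminal, the typical sequence is short. In sketch form (the remote URL is a placeholder for your own repository, and the default branch may be master on older setups):

```shell
# Turn the project folder into a Git repository and record a first commit.
cd sentiment_analysis
git init
git add .
git commit -m "Initial project structure"

# Connect it to an empty GitHub repository and push.
# The URL below is a placeholder -- substitute your own repository's URL.
git remote add origin https://github.com/<username>/sentiment_analysis.git
git push -u origin main
```

The -u flag sets the upstream branch, so subsequent pushes from this branch need only a plain git push.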
Alternatively, if you use Visual Studio Code (VS Code) like me, it is already taken care of. VS Code makes it possible to publish any project directly to GitHub without having to create a repository first. VS Code creates the repository for you and lets you control whether it should be public or private. The only thing required from your side is to authenticate to GitHub through VS Code.
That is all you need to set up a robust and structured project base. All the above steps are summarized in the following video, in case you want to see them in sequence.
End-to-end video showcasing the tools used in the article
Creating structured and reproducible projects might seem difficult in the beginning, but it offers advantages in the long run. In this article, we looked at three useful tools that can help with this task. While Cookiecutter Data Science gives a clean project template, readme.so lets you quickly put together a README file. Finally, VS Code can help push the project onto the web for source control and collaboration. This creates the necessary foundation for a good data science project. Now you can begin working on your data and deriving insights to share with various stakeholders.
👉 Interested in reading more of my articles? This repo contains all the articles I have written, organized by category.