Automate your data science project structure in three easy steps

Streamline your data science code repository and tooling quickly and efficiently. Good code is its own best documentation. Dr. Rachael Tatman, in one of her presentations, highlighted the importance of code reproducibility in a very subtle way: “Why should you care about reproducibility? Because the person most likely to need to reproduce …

The curious case of Simpson’s Paradox

Statistical tests and analysis can be confounded by a simple misunderstanding of the data. Statistics rarely offers a single “right” way of doing anything (Charles Wheelan, Naked Statistics). In 1996, Appleton, French, and Vanderpump conducted an experiment to study the effect of smoking on a sample of people. The study was conducted over twenty years and included 1314 …
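To make the idea concrete, here is a minimal sketch of how a Simpson's Paradox reversal can arise. The counts are made up purely for illustration and are not taken from the study mentioned above; the group names `A` and `B` and the two strata are hypothetical.

```python
# Hypothetical (successes, total) counts for two groups within two strata.
# These numbers are invented for illustration only.
data = {
    "stratum_1": {"A": (18, 20), "B": (80, 100)},
    "stratum_2": {"A": (30, 100), "B": (4, 20)},
}


def rate(successes, total):
    """Return the success rate as a fraction between 0 and 1."""
    return successes / total


# Within each stratum, compare the two groups.
for stratum, groups in data.items():
    a_rate = rate(*groups["A"])
    b_rate = rate(*groups["B"])
    print(f"{stratum}: A={a_rate:.0%}, B={b_rate:.0%}")

# Pool the strata and compare again.
overall = {}
for group in ("A", "B"):
    successes = sum(data[s][group][0] for s in data)
    total = sum(data[s][group][1] for s in data)
    overall[group] = rate(successes, total)

print(f"overall: A={overall['A']:.0%}, B={overall['B']:.0%}")
```

In this toy data, group A has the higher rate inside each stratum, yet group B has the higher rate once the strata are pooled, because the two groups are distributed very differently across the strata.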

Reducing memory usage in pandas with smaller datatypes

Optimizing pandas memory usage by the effective use of datatypes. Managing large datasets with pandas is a pretty common issue. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library, covered in “Using Python’s datatable library seamlessly on Kaggle”. Despite this, there are …
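As a rough illustration of the idea, the sketch below shrinks a small DataFrame by choosing narrower dtypes. The column names and values are hypothetical, and the approach shown (downcasting integers, storing floats as `float32`, and converting repeated strings to `category`) is one common way to do this, not necessarily the exact method the full article uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data: small values stored in pandas' default 64-bit dtypes.
df = pd.DataFrame({
    "age": np.arange(1000, dtype="int64") % 90,
    "score": np.linspace(0.0, 1.0, 1000),
    "city": ["London", "Paris", "Delhi", "Tokyo"] * 250,
})

before = df.memory_usage(deep=True).sum()

small = df.copy()
# Values 0..89 fit in an unsigned 8-bit integer.
small["age"] = pd.to_numeric(small["age"], downcast="unsigned")
# float32 halves the storage, at the cost of some precision.
small["score"] = small["score"].astype("float32")
# A column with few distinct strings is stored compactly as a category.
small["city"] = small["city"].astype("category")

after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes)
```

Whether a narrower type is safe depends on the data: an integer column must fit in the smaller range, and `float32` is only appropriate when the reduced precision is acceptable.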

5 Real World datasets for honing your Exploratory Data Analysis skills

The best way to learn data science is by doing it. If you are just getting started in data science and looking for some cool datasets to play with, this might be the article for you. A lot of courses and books never really move beyond the classic Titanic and Iris datasets. Not that …