Beware of the Dummy variable trap in pandas

Important caveats to keep in mind when encoding data with pandas.get_dummies()

Handling categorical variables is an essential part of a machine learning pipeline. While machine learning algorithms handle numerical variables naturally, the same is not true of their categorical counterparts. Although algorithms such as LightGBM and CatBoost can handle categorical variables natively, it is …
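As a rough illustration of the trap named in the title, the sketch below one-hot encodes a small, made-up `color` column with `pandas.get_dummies()`. The data and column name are assumptions for demonstration only and are not taken from the article. With the full encoding, the dummy columns of every row sum to 1, so any one column is fully determined by the others; passing `drop_first=True` removes that redundancy.

```python
import pandas as pd

# Hypothetical toy data; the "color" column is an assumption for illustration.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one dummy column per category.
full = pd.get_dummies(df, columns=["color"], dtype=int)

# Each row has exactly one 1, so the dummy columns are linearly dependent.
row_sums = full.sum(axis=1)

# Dropping the first category removes the redundant column.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The redundancy mostly matters for models that estimate one coefficient per column alongside an intercept, such as unregularized linear regression; tree-based models are generally less sensitive to it.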

A better way to visualize Decision Trees with the dtreeviz library

An open-source package for decision tree visualization and model interpretation

It is rightly said that a picture is worth a thousand words. This axiom applies equally to machine learning models. If one can visualize and interpret the result, it instills more confidence in the model’s predictions. Visualizing how a machine learning …

A hands-on guide to ‘sorting’ dataframes in Pandas

My tryst with the pandas library continues. Of late, I have been trying to look deeper into this library and to consolidate some of its features in byte-sized articles. I have written articles on reducing memory usage while working with pandas, converting XML files into a pandas dataframe easily, getting started with time series in pandas, and many more. In this article, …
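For readers who want a quick taste of the topic before opening the article, here is a minimal sketch of sorting a dataframe by a column and then restoring the original order by index. The dataframe contents are invented for illustration and are not from the article.

```python
import pandas as pd

# Hypothetical toy data, assumed for this example only.
df = pd.DataFrame({"name": ["c", "a", "b"], "score": [2, 3, 1]})

# Sort rows by a column, highest score first; the original index labels are kept.
by_score = df.sort_values(by="score", ascending=False)

# Sorting by index brings the rows back to their original order.
by_index = by_score.sort_index()

print(by_score)
print(by_index)
```

Both methods return a new dataframe by default and leave `df` unchanged.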

Using Python’s datatable library seamlessly on Kaggle

Managing large datasets on Kaggle without fearing the out-of-memory error

Datatable is a Python package for manipulating large dataframes. It was created to provide big data support and enable high performance. The toolkit closely resembles pandas but is more focused on speed. It supports out-of-memory datasets, multi-threaded data processing, …