Beware of the Dummy variable trap in pandas

Important caveats to keep in mind when encoding data with pandas.get_dummies()

Handling categorical variables is an essential part of a machine learning pipeline. While machine learning algorithms can work with numerical variables directly, the same is not true of their categorical counterparts. Although algorithms such as LightGBM and CatBoost can handle categorical variables natively, it is …
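As a rough illustration of the trap named in the title (a minimal sketch, not the article's own code), the snippet below encodes a made-up `color` column twice. The column name and values are assumptions chosen for the example. With every category kept, the dummy columns always sum to 1, which makes them perfectly collinear with an intercept term; passing `drop_first=True` removes one column and breaks that dependency.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(df, columns=["color"], dtype=int)

# Each row has exactly one indicator set, so the columns sum to 1 in every row.
# That redundancy is the dummy variable trap for models with an intercept.
row_sums = full.sum(axis=1)

# Dropping the first category leaves k-1 columns; the dropped category
# becomes the baseline, encoded as all zeros.
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether dropping a column matters depends on the model: linear models with an intercept are affected by the redundancy, while tree-based models generally are not.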

There is more to ‘pandas.read_csv()’ than meets the eye

A deep dive into some of the parameters of the read_csv function in pandas

Pandas is one of the most widely used libraries in the data science ecosystem. This versatile library gives us tools to read, explore, and manipulate data in Python. The primary tool for importing data in pandas is read_csv(). This function accepts the file path of a …

A hands-on guide to ‘sorting’ dataframes in Pandas

My tryst with the pandas library continues. Of late, I have been looking deeper into this library and consolidating some of its features in byte-sized articles. I have written articles on reducing memory usage while working with pandas, converting XML files into a pandas dataframe, getting started with time series in pandas, and many more. In this article, …

Reducing memory usage in pandas with smaller datatypes

Optimizing pandas memory usage through the effective use of datatypes

Managing large datasets with pandas is a common challenge. As a result, a lot of libraries and tools have been developed to ease that pain. Take, for instance, the pydatatable library, covered in "Using Python’s datatable library seamlessly on Kaggle". Despite this, there are …
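To make the idea of smaller datatypes concrete, here is a minimal sketch under stated assumptions: the frame, its column names (`age`, `score`), and its values are invented for illustration and do not come from the article. It compares memory use before and after converting the columns to narrower types.

```python
import numpy as np
import pandas as pd

# Hypothetical data: small non-negative integers and floats in [0, 1].
df = pd.DataFrame({
    "age": np.arange(1000, dtype="int64") % 90,
    "score": np.linspace(0.0, 1.0, 1000),  # float64 by default
})

before = df.memory_usage(deep=True).sum()

small = df.copy()
# Let pandas pick the smallest unsigned integer type that fits the values.
small["age"] = pd.to_numeric(small["age"], downcast="unsigned")
# Assumption: float32 precision is acceptable for this column.
small["score"] = small["score"].astype("float32")

after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"bytes before: {before}, after: {after}")
```

Downcasting trades range and precision for space, so it is worth checking each column's minimum and maximum before choosing a narrower type.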