My tryst with the pandas library continues. Of late, I have been looking deeper into this library and consolidating some of its features in byte-sized articles. I have written articles on reducing memory usage while working with pandas, converting XML files into a pandas dataframe easily, getting started with time series in pandas, and many more. In this article, I’ll touch upon a very useful aspect of data analysis: sorting. We’ll begin with a brief introduction and then quickly jump into some ways to perform sorting efficiently in pandas.
If you are an Excel user, one of the first things you’ll do when you open a spreadsheet is sort it in ascending or descending order based on a column (or columns). Technically, sorting is a way of arranging elements in an order based on their rank. A great deal of documentation in the programming field is dedicated to sorting algorithms. Commonly used ones include bubble sort, selection sort, and merge sort.
The different sorting algorithms have different underlying principles for sorting data. For instance, in bubble sort, the algorithm compares an element to its neighbor and swaps them into the desired order. In selection sort, the algorithm repeatedly picks the smallest remaining element and moves it to the end of the sorted portion. Merge sort, on the other hand, is based on the divide and conquer technique: the unordered array is first divided into equal halves, which are then combined in a sorted manner. Explaining the intricacies of all these algorithms is beyond the scope of this article, but if it has piqued your interest, here is a great article that explains the concepts with great visualisations.
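To make the bubble sort and merge sort ideas concrete, here is a minimal plain-Python sketch of both. The function names and the sample list are my own illustrations; pandas does not run code like this and relies on optimized implementations instead.

```python
def bubble_sort(values):
    """Repeatedly swap neighboring elements that are out of order."""
    items = list(values)  # work on a copy so the input stays untouched
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # no swaps in a full pass: already sorted
            break
    return items


def merge_sort(values):
    """Split the list into halves, sort each half, then merge them."""
    items = list(values)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # append whatever is left over
    merged.extend(right[j:])
    return merged


data = [5, 2, 9, 1, 5, 6]
bubble_result = bubble_sort(data)
merge_result = merge_sort(data)
print(bubble_result)
print(merge_result)
```

Both functions return a new list and leave the input unchanged, which mirrors how the pandas sorting functions behave by default.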
Sorting in pandas
Let’s now see how we can perform sorting in pandas. For this, I’ll use a very interesting dataset consisting of a list of the top hundred most starred GitHub repositories. It is publicly available on Kaggle. The dataset is updated daily and includes other attributes such as the number of forks, the language, and the project description. Let’s load the dataset and look at its structure.
import pandas as pd
df = pd.read_csv('Most starred Github Repositories.csv')
Let’s quickly go over the various columns of the dataset:
- Project Name: Name of the repository on GitHub.
- Stars: A bookmark or display of appreciation for a repository.
- Forks: A fork is a copy of a repository that you manage.
- Language: Main programming languages used in the project.
- Open Issues: Issues are suggested improvements, tasks or questions related to the repository. The issues which haven’t been resolved are labelled as open issues.
- Description: A paragraph detailing the purpose of the project.
- Last Commit: A commit, or “revision”, is an individual change to a file (or set of files). This field stores the date and time of the last commit.
Note: All the above definitions have been taken from the GitHub glossary.
The current dataset is ordered by the number of Stars ⭐️, i.e., the project with the maximum number of stars comes first, and so on. Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both. Let’s now look at the different ways of sorting this dataset with some examples:
1. Sorting on a single column
The function used for sorting in pandas is called DataFrame.sort_values(). It is used to sort a DataFrame by its column or row values. Let’s sort the dataset by the Forks column.
forks = df.sort_values(by='Forks',ascending=False)
DataFrame.sort_values comes with a number of parameters. We will touch upon a few important ones as we advance through the article. In the above example, we have encountered two of them:
- by: The required by parameter specifies the column(s) used to determine the sorted order.
- ascending: Specifies whether to sort the dataframe in ascending or descending order. The default is ascending (ascending=True). To sort in descending order, we need to specify ascending=False.
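As a quick check of how by and ascending behave, here is a minimal sketch on a tiny made-up dataframe. The project names and numbers are invented for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Stars": [400, 300, 200, 100],
    "Forks": [30, 80, 10, 50],
})

# ascending=True is the default, so the smallest fork count comes first
ascending_forks = toy.sort_values(by="Forks")

# ascending=False puts the largest fork count first
descending_forks = toy.sort_values(by="Forks", ascending=False)

print(ascending_forks)
print(descending_forks)
```

Note that sort_values returns a new dataframe each time, so toy keeps its original row order.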
2. Sorting on multiple columns
Pandas also makes it possible to sort the dataset on multiple columns. Simply pass a list of the desired column names to the sort_values function as follows:
df.sort_values(by=['Open Issues', 'Stars'])
In the example above, we have sorted the dataframe based on the number of open issues and the number of stars a project has. Note that by default, the sorting has been done in ascending order.
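To see how the second column acts as a tie-breaker, here is a small sketch on invented data (the names and numbers are illustrative assumptions, not values from the dataset).

```python
import pandas as pd

# Made-up values; two pairs of projects share the same Open Issues count
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Open Issues": [5, 2, 5, 2],
    "Stars": [100, 300, 50, 200],
})

# Rows are ordered by Open Issues first; Stars decides between ties
by_issues_then_stars = toy.sort_values(by=["Open Issues", "Stars"])

print(by_issues_then_stars)
```

The order of the names in the list matters: the first column is the primary sort key, and later columns are only consulted when earlier ones are equal.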
3. Sorting by multiple columns with different sort orders
When sorting by multiple columns, it is also possible to pass in different sort orders for different columns.
df.sort_values(by=['Open Issues', 'Stars'],
               ascending=[True, False])
In the above example, the dataframe will be first sorted on the Open Issues column in ascending order and then on the Stars column in descending order.
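Here is a minimal sketch of mixed sort orders on invented data (again, the values are illustrative assumptions). The ascending list must have one entry per column named in by.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Open Issues": [5, 2, 5, 2],
    "Stars": [100, 300, 50, 200],
})

# Open Issues ascending, and within equal Open Issues, Stars descending
mixed_order = toy.sort_values(
    by=["Open Issues", "Stars"],
    ascending=[True, False],
)

print(mixed_order)
```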
4. Sorting by index
Another way of sorting a dataframe would be by its index. In section 1, we created a dataframe named forks. This is just another version of the original dataframe, which has been sorted on the Forks column. The dataframe appears like this:
As is evident, the index is unsorted. We can sort it by using the sort_index() function:
forks.sort_index()
Alternatively, you can sort the index in descending order by passing the ascending=False argument to sort_index().
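The following sketch shows the index getting shuffled by a value sort and then restored with sort_index(). The dataframe is made up for illustration; only the variable name forks mirrors the article.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Forks": [30, 80, 10, 50],
})

# Sorting by values carries the original index labels along
forks = toy.sort_values(by="Forks", ascending=False)
print(forks.index.tolist())

# sort_index() orders the rows by index label instead
index_ascending = forks.sort_index()
index_descending = forks.sort_index(ascending=False)

print(index_ascending)
print(index_descending)
```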
5. Ignore the index while sorting
The original index labels can also be discarded while sorting the dataframe by setting ignore_index=True. This results in an index labeled from 0 to n-1, where n refers to the number of observations.
If ignore_index is not set to True (it is False by default), the sorted dataframe keeps its original index labels, which then appear out of order.
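A small sketch comparing the two behaviours side by side, using an invented dataframe. It assumes a pandas version in which sort_values accepts ignore_index (1.0 or later).

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Forks": [30, 80, 10, 50],
})

# Default (ignore_index=False): original labels travel with the rows
kept = toy.sort_values(by="Forks", ascending=False)

# ignore_index=True: the result is relabeled 0, 1, ..., n-1
relabeled = toy.sort_values(by="Forks", ascending=False, ignore_index=True)

print(kept.index.tolist())
print(relabeled.index.tolist())
```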
6. Choosing the sorting algorithm
We touched upon the topic of different sorting algorithms in the beginning. By default, sort_values uses the quicksort algorithm. However, we can choose between the ‘quicksort’, ‘mergesort’, and ‘heapsort’ algorithms using the kind parameter. Remember that this option is only applied when sorting on a single column or label.
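One practical reason to pick a particular algorithm is stability: merge sort keeps rows with equal keys in their original relative order. Here is a minimal sketch on invented data showing kind='mergesort'.

```python
import pandas as pd

# Made-up values; each language appears twice so there are ties
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Language": ["Python", "C", "Python", "C"],
})

# mergesort is stable: tied rows stay in their original order
stable_sorted = toy.sort_values(by="Language", kind="mergesort")

print(stable_sorted)
```

With the default quicksort, the relative order of tied rows is not guaranteed, so it is worth choosing a stable algorithm whenever that order matters.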
7. Sorting by column names
Additionally, we can sort the columns of the dataframe by their names, rather than sorting the rows, using the sort_index() function. For this, we need to set the axis parameter to 1.
The columns above have been sorted in ascending alphabetical order. By setting ascending=False, the sorting can also be done in descending order.
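A minimal sketch of reordering columns by name with axis=1, on a made-up dataframe whose columns are deliberately out of alphabetical order.

```python
import pandas as pd

# Made-up values; column order is intentionally not alphabetical
toy = pd.DataFrame({
    "Stars": [400, 300],
    "Forks": [30, 80],
    "Project Name": ["alpha", "beta"],
    "Language": ["Python", "C"],
})

# axis=1 sorts the column labels; the rows are left where they are
columns_ascending = toy.sort_index(axis=1)
columns_descending = toy.sort_index(axis=1, ascending=False)

print(columns_ascending.columns.tolist())
print(columns_descending.columns.tolist())
```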
8. Performing operations in-place
By setting the inplace parameter to True, all the sorting operations are done in place. This means that the existing dataframe gets modified. When inplace=False, the operations take place on a copy of the dataframe, which is then returned. The original dataframe remains unchanged.
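The difference is easy to see in a short sketch on invented data. One detail worth knowing is that a call with inplace=True returns None, so its result should not be assigned back to the dataframe variable.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Forks": [30, 80, 10, 50],
})

# inplace=False (the default): a new sorted dataframe is returned
sorted_copy = toy.sort_values(by="Forks")
order_after_copy = toy["Forks"].tolist()  # toy itself is unchanged

# inplace=True: toy is reordered and the call returns None
return_value = toy.sort_values(by="Forks", inplace=True)
order_after_inplace = toy["Forks"].tolist()

print(order_after_copy)
print(return_value)
print(order_after_inplace)
```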
9. Handling missing values
Data usually contains null values. By setting the na_position parameter of the sort_values() function to 'first' or 'last', we can choose to put NaNs at the beginning or at the end.
df.sort_values(by='Forks', na_position='first') # NaN placed first
df.sort_values(by='Forks', na_position='last') # NaN placed at the end
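Since the original dataset may not contain missing fork counts, here is a sketch on a made-up dataframe with one NaN deliberately inserted.

```python
import pandas as pd

# Made-up values; beta has a missing fork count
toy = pd.DataFrame({
    "Project Name": ["alpha", "beta", "gamma", "delta"],
    "Forks": [30.0, float("nan"), 10.0, 50.0],
})

nan_first = toy.sort_values(by="Forks", na_position="first")
nan_last = toy.sort_values(by="Forks", na_position="last")  # the default

print(nan_first)
print(nan_last)
```

The NaN row is placed at the chosen end regardless of whether the remaining values are sorted in ascending or descending order.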
10. Apply the key function to the values before sorting
We can also apply a key function to the values before sorting. The function expects a Series and returns a Series with the same shape as the input. It will be applied to each column in by independently. In the example below, we first convert the Project Name column to lowercase and then sort the dataframe on this column:
df.sort_values(by='Project Name',key=lambda col: col.str.lower())[:5]
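To see why the key matters, here is a sketch on invented project names with mixed capitalization. Without a key, uppercase letters sort before lowercase ones. It assumes a pandas version in which sort_values accepts key (1.1 or later).

```python
import pandas as pd

# Made-up names with mixed capitalization
toy = pd.DataFrame({
    "Project Name": ["banana", "Apple", "cherry", "Date"],
})

# Plain sort: capitalized names come before lowercase ones
case_sensitive = toy.sort_values(by="Project Name")

# Key function: compare lowercase versions, but keep the original values
case_insensitive = toy.sort_values(
    by="Project Name",
    key=lambda col: col.str.lower(),
)

print(case_sensitive["Project Name"].tolist())
print(case_insensitive["Project Name"].tolist())
```

The key only affects the comparison; the values stored in the returned dataframe keep their original capitalization.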
Conclusion and additional resources
In this article we looked at the different ways of sorting a dataframe using the pandas library. We looked at the usage of both sort_values() as well as the sort_index() functions along with their parameters. The official documentation is an excellent resource if you are thinking of going deeper into the details.