A quick overview of the super useful Collections module of Python.
If the implementation is hard to explain, it’s a bad idea. (The Zen of Python)
Python is a powerful language, and a large part of this power comes from its support for modular programming. Modular programming is essentially the process of breaking down a large and complex programming task into smaller, more manageable subtasks or modules. Modules are like LEGO bricks that can be combined to build a larger program.
Modularity brings several advantages when writing code, such as:
- Reusability
- Maintainability
- Simplicity
Functions, modules and packages are all constructs in Python that promote code modularization.
Objective
Through this article, we will explore Python’s collections module. This module extends and provides alternatives to Python’s general purpose built-in containers such as dict, list, set, and tuple.
Introduction
Let’s begin the article with a quick glance at the concept of modules and packages.
Module
A module is simply a .py file that can be imported into another .py file. It contains Python definitions and statements that implement a set of related functions. The file name is the module name with the suffix .py appended. Modules are imported into other modules using the import statement. Let’s import the math module.
# import the library
import math
#Using it for taking the log
math.log(10)
2.302585092994046
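The import statement comes in a few forms. The short sketch below shows the three most common ones using the same math module; the variable names are only illustrative.

```python
import math                  # import the whole module
from math import log, sqrt   # import selected names only
import math as m             # import the module under an alias

natural_log = math.log(10)   # access through the module name
same_log = log(10)           # access the imported name directly
aliased_root = m.sqrt(16)    # access through the alias
print(natural_log, same_log, aliased_root)
```

All three forms load the same module; they differ only in which names end up in the importing file's namespace.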
Python’s in-built modules
Python ships with a large standard library of built-in modules, and third-party packages exist for almost any use case you can think of. Check out the complete list here.
Two very important functions come in handy when exploring modules in Python: the dir and help functions.
- The built-in function dir() is used to find out which names (functions, classes, and variables) a module defines. It returns a sorted list of strings:
print(dir(math))
- After locating our desired function in the module, we can read more about it using the help function, inside the Python interpreter:
help(math.factorial)
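As a small sketch of how the two functions can be combined, the snippet below filters the output of dir() down to the public names and then prints the docstring that help() would display. The variable name public_names is just an illustration.

```python
import math

# Names that do not start with an underscore make up the module's public API.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])

# help() displays the text stored in the object's docstring.
print(math.factorial.__doc__)
```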
Packages
Packages are collections of related modules bundled together. NumPy and SciPy, two core packages for scientific computing and machine learning, are each made up of many modules. Here is a partial list of sub-packages available within SciPy.
Let us now turn to the actual objective of this article, which is to get to know Python’s collections module. This is just an overview; for detailed explanations and examples, please refer to the official Python documentation.
Collections Module
Collections is a built-in Python module that implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers such as dict, list, set, and tuple.
Some of the useful data structures present in this module are:
1. namedtuple()
The data stored in a plain tuple can only be accessed through indexes as can be seen in the example below:
plain_tuple = (10,11,12,13)
plain_tuple[0]
10
plain_tuple[3]
13
We can’t give names to individual elements stored in a tuple. This might not be needed in simple cases. However, when a tuple has many fields, names become close to a necessity, and their absence hurts the code’s readability.
This is where namedtuple comes in. It is a factory function that creates tuple subclasses with named fields and can be seen as an extension of the built-in tuple data type. Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. Each field can be accessed through a unique, human-readable name, which frees us from having to remember integer indexes. Let’s see it in action.
from collections import namedtuple
fruit = namedtuple('fruit','number variety color')
guava = fruit(number=2,variety='HoneyCrisp',color='green')
apple = fruit(number=5,variety='Granny Smith',color='red')
We construct the namedtuple by first passing the type name (fruit) and then a single string containing the field names separated by spaces. We can then access the fields as attributes:
guava.color
'green'
apple.variety
'Granny Smith'
Namedtuples are also a memory-efficient option when defining an immutable class in Python.
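The following is a minimal sketch of a few other things a namedtuple can do: positional access and unpacking still work, instances are immutable, and the helper methods _replace() and _asdict() derive new objects from an existing one. The Fruit type and its values are illustrative only.

```python
from collections import namedtuple

Fruit = namedtuple("Fruit", "number variety color")
apple = Fruit(number=5, variety="Granny Smith", color="green")

# Fields are still reachable by position and by unpacking.
number, variety, color = apple
first_field = apple[0]

# Instances are immutable: assigning to a field raises AttributeError.
try:
    apple.color = "red"
    mutation_allowed = True
except AttributeError:
    mutation_allowed = False

# _replace() returns a new tuple; _asdict() and _fields expose the structure.
red_apple = apple._replace(color="red")
as_dict = apple._asdict()
print(Fruit._fields, red_apple, as_dict)
```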
2. Counter
Counter is a dict subclass which helps to count hashable objects. The elements are stored as dictionary keys while the object counts are stored as the value. Let’s work through a few examples with Counter.
#Importing Counter from collections
from collections import Counter
- With Strings
c = Counter('abcacdabcacd')
print(c)
Counter({'a': 4, 'c': 4, 'b': 2, 'd': 2})
- With Lists
lst = [5,6,7,1,3,9,9,1,2,5,5,7,7]
c = Counter(lst)
print(c)
Counter({5: 3, 7: 3, 1: 2, 9: 2, 6: 1, 3: 1, 2: 1})
- With Sentence
s = 'the lazy dog jumped over another lazy dog'
words = s.split()
Counter(words)
Counter({'lazy': 2, 'dog': 2, 'the': 1, 'jumped': 1, 'over': 1, 'another': 1})
Counter objects support additional methods beyond those available for all dictionaries. Two of the most useful are:
- elements()
Returns an iterator over the elements, repeating each one as many times as its count. If an element’s count is less than one, it is ignored.
c = Counter(a=3, b=2, c=1, d=-2)
sorted(c.elements())
['a', 'a', 'a', 'b', 'b', 'c']
- most_common([n])
Returns a list of the n most common elements and their counts, from the most common to the least. If n is omitted or None, it returns all elements in the counter.
s = 'the lazy dog jumped over another lazy dog'
words = s.split()
Counter(words).most_common(3)
[('lazy', 2), ('dog', 2), ('the', 1)]
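Another method that Counter adds on top of a regular dictionary is subtract(), and its update() method adds counts rather than replacing values. The sketch below uses made-up stock numbers to show both.

```python
from collections import Counter

stock = Counter(apple=5, guava=2)
sold = Counter(apple=3, guava=4)

# subtract() changes the counter in place and keeps zero or negative counts.
stock.subtract(sold)
after_subtract = dict(stock)

# update() adds to the existing counts, unlike dict.update() which overwrites.
stock.update(["guava", "guava", "pear"])
after_update = dict(stock)

print(after_subtract, after_update)
```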
Common patterns when using the Counter() object
sum(c.values()) # total of all counts
c.clear() # reset all counts
list(c) # list unique elements
set(c) # convert to a set
dict(c) # convert to a regular dictionary
c.items() # convert to a list like (elem, cnt)
Counter(dict(list_of_pairs)) # convert from a list of (elem, cnt)
c.most_common()[:-n-1:-1] # n least common elements
c += Counter() # remove zero and negative counts
3. defaultdict
Dictionaries are an efficient way to store data for later retrieval as a collection of key: value pairs. Keys must be unique and hashable, which in practice usually means immutable objects such as strings, numbers, or tuples.
fruits = {'apple':300, 'guava': 200}
fruits['guava']
200
Things are simple if the values are ints or strings. However, if the values are collections such as lists or dictionaries, the value (an empty list or dict) must be initialized the first time a given key is used. defaultdict automates and simplifies this initialization. The example below will make it more obvious:
d = {}
print(d['A'])
Here, the Python dictionary raises a KeyError since 'A' is not currently in the dictionary. Let us now run the same example with defaultdict.
from collections import defaultdict
d = defaultdict(object)
print(d['A'])
<object object at 0x7fc9bed4cb00>
The defaultdict, in contrast, simply creates any item that you try to access, provided it does not exist yet. A defaultdict is a dictionary subclass and supports all the usual dictionary methods. The difference is that its first argument (default_factory) is a callable, such as list, int, or object, that is called with no arguments to produce the default value for a missing key.
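The case of list values mentioned earlier can be sketched as follows. The pairs data is invented for illustration; the point is to compare the manual initialization a plain dict needs with what defaultdict(list) does automatically.

```python
from collections import defaultdict

pairs = [("fruit", "apple"), ("vegetable", "carrot"), ("fruit", "guava")]

# With a plain dict, the list has to be created before the first append.
plain = {}
for kind, name in pairs:
    if kind not in plain:
        plain[kind] = []
    plain[kind].append(name)

# With defaultdict(list), a missing key gets an empty list automatically.
grouped = defaultdict(list)
for kind, name in pairs:
    grouped[kind].append(name)

print(dict(grouped))
```

Note that merely reading a missing key from a defaultdict inserts it, so membership tests with `in` are the safer choice when a key should not be created.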
4. OrderedDict
An OrderedDict is a dictionary subclass that remembers the order in which its keys were first inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added. Since an ordered dictionary remembers its insertion order, it can be used in conjunction with sorting to make a sorted dictionary:
- regular dictionary
from collections import OrderedDict
d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}
- dictionary sorted by key
OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
- dictionary sorted by value
OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])
- dictionary sorted by the length of the key string
OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('banana', 3), ('orange', 2)])
A point to note here is that regular dictionaries also preserve insertion order: this was an implementation detail of CPython 3.6 and has been a guaranteed language feature since Python 3.7. Read the discussion here.
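Even with ordered regular dictionaries, OrderedDict still provides some behaviour of its own. The sketch below, using illustrative data, shows reordering with move_to_end(), removing the oldest entry with popitem(last=False), and order-sensitive equality.

```python
from collections import OrderedDict

od = OrderedDict([("apple", 4), ("banana", 3), ("pear", 1)])

# move_to_end() repositions an existing key; last=False moves it to the front.
od.move_to_end("apple")
od.move_to_end("pear", last=False)
order_after_moves = list(od)

# popitem(last=False) removes and returns the first (oldest) entry.
first_item = od.popitem(last=False)

# Equality between two OrderedDicts is order-sensitive; plain dicts ignore order.
same_as_ordered = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)
same_as_dict = dict(a=1, b=2) == dict(b=2, a=1)
print(order_after_moves, first_item, same_as_ordered, same_as_dict)
```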
Conclusion
The collections module also contains some other useful data types such as deque, ChainMap, UserString, and a few more. However, I have shared the ones that I use in my day-to-day programming to make things simple. For a detailed explanation and usage, visit the official Python documentation page.
Originally published here