Python’s Collections Module — High-performance container data types.


A quick overview of the super useful Collections module of Python.

“If the implementation is hard to explain, it’s a bad idea.” (The Zen of Python)

Python is a powerful language, and a large part of this power comes from its support for modular programming. Modular programming is essentially the process of breaking down a large and complex programming task into smaller, more manageable subtasks or modules. Modules are like LEGO bricks that can be combined to build a larger program.

Modularity has several advantages when writing code, such as:

  • Reusability
  • Maintainability
  • Simplicity

Functions, modules and packages are all constructs in Python that promote code modularization.


Objective

In this article, we will explore Python’s collections module. This module provides specialized alternatives to Python’s general-purpose built-in containers such as dict, list, set, and tuple.


Introduction

Let’s begin the article with a quick glance at the concept of modules and packages.

Module

A module is a .py file that can be imported into another Python script. It contains Python definitions and statements, typically a set of related functions. The file name is the module name with the suffix .py appended. Modules are imported into other modules using the import statement. Let’s import the math module.

# import the module
import math
# use it to take the natural log
math.log(10)
2.302585092994046
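The import statement also has a couple of other common spellings. The sketch below uses only the standard math module; the alias m is an arbitrary name chosen for illustration.

```python
import math                 # import the whole module
from math import log, pi    # import specific names from the module
import math as m            # import under an alias ("m" is an arbitrary choice)

# All three spellings reach the same function
natural_log = math.log(10)
same_value = log(10)
aliased = m.log(10)

print(natural_log)          # 2.302585092994046
print(m.pi == pi)           # True
```

Importing the whole module keeps the origin of each name visible at the call site, which tends to help readability in longer scripts.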

Python’s built-in modules

Python ships with a large standard library of built-in modules, and third-party packages exist for almost any use case you can think of. Check out the complete list here.

Two functions come in handy when exploring modules in Python: dir() and help().

  • The built-in function dir() lists the names (functions, constants, and other attributes) that a module defines. It returns a sorted list of strings:
print(dir(math))


  • After locating our desired function in the module, we can read more about it using the help function, inside the Python interpreter:
help(math.factorial)
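The two functions combine naturally when exploring an unfamiliar module. Here is a small sketch, again using math; the variable name public_names is only illustrative.

```python
import math

# dir() returns a sorted list of names; filter out the underscore-prefixed ones
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])

# help() builds its output from the docstring, which can also be read directly
print(math.factorial.__doc__)

# Confirm that a name exists before using it
print("factorial" in public_names)   # True
print(math.factorial(5))             # 120
```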


Packages

Packages are collections of related modules grouped together. NumPy and SciPy, two core packages for scientific computing and machine learning, are each made up of hundreds of modules. Here is a partial list of sub-packages available within SciPy.


Let us now turn to the actual objective of this article, which is to get to know Python’s collections module. This is just an overview; for detailed explanations and examples, please refer to the official Python documentation.

Collections Module

collections is a module in Python’s standard library that implements specialized container datatypes, providing alternatives to Python’s general-purpose built-in containers such as dict, list, set, and tuple.

Some of the useful data structures present in this module are:

1. namedtuple()

The data stored in a plain tuple can only be accessed through indexes as can be seen in the example below:

plain_tuple = (10,11,12,13)
plain_tuple[0]
10
plain_tuple[3]
13

We can’t give names to the individual elements stored in a tuple. This may not matter in simple cases. However, when a tuple has many fields, names become close to a necessity, and their absence hurts the code’s readability.

This is where namedtuple comes into the picture. It is a factory function that creates tuple subclasses with named fields and can be seen as an extension of the built-in tuple data type. Named tuples assign meaning to each position in a tuple and allow for more readable, self-documenting code. Each field can be accessed through a unique, human-readable name, which frees us from having to remember integer indexes. Let’s see it in action.

from collections import namedtuple
fruit = namedtuple('fruit','number variety color')
guava = fruit(number=2,variety='HoneyCrisp',color='green')
apple = fruit(number=5,variety='Granny Smith',color='red')

We construct the namedtuple type by passing the type name (fruit) first, followed by a string of field names separated by spaces. We can then access the individual fields as attributes:

guava.color
'green'
apple.variety
'Granny Smith'

Named tuples are also a memory-efficient option when defining an immutable class in Python: instances have no per-instance dictionary, so they require no more memory than regular tuples.
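A few other behaviours follow from the fact that a named tuple is still a tuple. The sketch below reuses the field names from the example above; the capitalised type name Fruit and the variable green_apple are illustrative choices.

```python
from collections import namedtuple

# Field names can also be given as a list of strings
Fruit = namedtuple("Fruit", ["number", "variety", "color"])
apple = Fruit(number=5, variety="Granny Smith", color="red")

# Still a tuple: indexing and unpacking work as usual
number, variety, color = apple
print(apple[1])                  # Granny Smith

# Convert to a dictionary keyed by field name
print(apple._asdict())

# Fields are read-only; _replace() returns a modified copy instead
green_apple = apple._replace(color="green")
print(green_apple.color)         # green
print(apple.color)               # red

try:
    apple.color = "yellow"
except AttributeError as exc:
    print("cannot modify:", exc)
```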

2. Counter

Counter is a dict subclass for counting hashable objects. The elements are stored as dictionary keys and their counts are stored as the values. Let’s work through a few examples with Counter.

#Importing Counter from collections
from collections import Counter
  • With Strings
c = Counter('abcacdabcacd')
print(c)
Counter({'a': 4, 'c': 4, 'b': 2, 'd': 2})
  • With Lists
lst = [5,6,7,1,3,9,9,1,2,5,5,7,7]
c = Counter(lst)
print(c)
Counter({5: 3, 7: 3, 1: 2, 9: 2, 6: 1, 3: 1, 2: 1})
  • With Sentence
s = 'the lazy dog jumped over another lazy dog'
words = s.split()
Counter(words)
Counter({'another': 1, 'dog': 2, 'jumped': 1, 'lazy': 2, 'over': 1, 'the': 1})

Counter objects support a few methods beyond those available for all dictionaries. Two of the most useful are:

  • elements()

Returns an iterator over the elements, repeating each one as many times as its count. If an element’s count is less than one, it is ignored.

c = Counter(a=3, b=2, c=1, d=-2)
sorted(c.elements())
['a', 'a', 'a', 'b', 'b', 'c']
  • most_common([n])

Returns a list of the n most common elements and their counts, from the most common to the least. The argument n is optional; if it is omitted, all elements are returned.

s = 'the lazy dog jumped over another lazy dog'
words = s.split()
Counter(words).most_common(3)
[('lazy', 2), ('dog', 2), ('the', 1)]
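Counter also provides subtract(), which behaves differently from the - operator. The following is a minimal sketch with made-up stock and sales figures; the names stock, sold, and remaining are hypothetical.

```python
from collections import Counter

stock = Counter(apple=5, guava=2, pear=1)
sold = Counter(apple=3, guava=4)

# subtract() updates counts in place and may leave zero or negative values
stock.subtract(sold)
print(stock)                 # Counter({'apple': 2, 'pear': 1, 'guava': -2})

# The - operator returns a new Counter and keeps only positive counts
remaining = Counter(apple=5, guava=2, pear=1) - sold
print(remaining)             # Counter({'apple': 2, 'pear': 1})

# most_common() with no argument lists every element, highest count first
print(stock.most_common())   # [('apple', 2), ('pear', 1), ('guava', -2)]
```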

Common patterns when using the Counter() object

sum(c.values())                 # total of all counts
c.clear()                       # reset all counts
list(c)                         # list unique elements
set(c)                          # convert to a set
dict(c)                         # convert to a regular dictionary
c.items()                       # convert to a list like (elem, cnt)
Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
c.most_common()[:-n-1:-1]       # n least common elements
c += Counter()                  # remove zero and negative counts
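To see what these patterns return, here is a short run on a small counter. The counts, including the zero and negative ones, are made up for illustration.

```python
from collections import Counter

c = Counter(a=3, b=2, c=1, d=0, e=-1)

print(sum(c.values()))            # 5
print(list(c))                    # ['a', 'b', 'c', 'd', 'e']

n = 2
print(c.most_common()[:-n-1:-1])  # [('e', -1), ('d', 0)]

# Round trip through a list of (elem, cnt) pairs
pairs = list(c.items())
rebuilt = Counter(dict(pairs))
print(rebuilt == c)               # True

# Adding an empty Counter drops zero and negative counts
c += Counter()
print(c)                          # Counter({'a': 3, 'b': 2, 'c': 1})

c.clear()
print(len(c))                     # 0
```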

3. defaultdict

Dictionaries are an efficient way to store data for later retrieval as a set of key: value pairs. Keys must be unique and hashable, which in practice usually means immutable objects such as strings, numbers, or tuples.

fruits = {'apple':300, 'guava': 200}
fruits['guava']
200

Things are simple if the values are ints or strings. However, if the values are collections like lists or dictionaries, the value (an empty list or dict) must be initialized the first time a given key is used. defaultdict automates this initialization. The example below makes the difference clearer:

d = {}
print(d['A'])
KeyError: 'A'

Here, the Python dictionary raises a KeyError since 'A' is not currently in the dictionary. Let us now run the same example with defaultdict.

from collections import defaultdict
d = defaultdict(object)
print(d['A'])
<object object at 0x7fc9bed4cb00>

In contrast, a defaultdict simply creates any item that you try to access, provided it does not exist yet. A defaultdict is a dict subclass and supports all the usual dictionary methods. The difference is that its first argument (default_factory) is a callable, such as list, int, or object, that is invoked with no arguments to produce the default value for a missing key.
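The benefit is clearest when the values are collections. Here is a minimal sketch that groups colours by fruit, first with a plain dict and then with defaultdict; the sample data and variable names are made up for illustration.

```python
from collections import defaultdict

fruit_colors = [("apple", "red"), ("guava", "green"), ("apple", "green")]

# With a plain dict, the list must be created before the first append
plain = {}
for fruit, color in fruit_colors:
    if fruit not in plain:
        plain[fruit] = []
    plain[fruit].append(color)

# defaultdict(list) calls list() for each missing key automatically
grouped = defaultdict(list)
for fruit, color in fruit_colors:
    grouped[fruit].append(color)

print(dict(grouped))    # {'apple': ['red', 'green'], 'guava': ['green']}

# defaultdict(int) starts every missing key at 0, which is handy for tallies
tally = defaultdict(int)
for fruit, _ in fruit_colors:
    tally[fruit] += 1
print(dict(tally))      # {'apple': 2, 'guava': 1}
```

Note that merely reading a missing key from a defaultdict inserts it, so membership tests should use the in operator rather than indexing.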

4. OrderedDict

An OrderedDict is a dictionary subclass that remembers the order in which keys were first inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added. Since an ordered dictionary remembers its insertion order, it can be used in conjunction with sorting to make a sorted dictionary:

  • regular dictionary
from collections import OrderedDict
d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}
  • dictionary sorted by key
OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
  • dictionary sorted by value
OrderedDict(sorted(d.items(), key=lambda t: t[1]))
OrderedDict([('pear', 1), ('orange', 2), ('banana', 3), ('apple', 4)])
  • dictionary sorted by the length of the key string
OrderedDict(sorted(d.items(), key=lambda t: len(t[0])))
OrderedDict([('pear', 1), ('apple', 4), ('banana', 3), ('orange', 2)])

A point to note here is that since Python 3.7, regular dictionaries are guaranteed to preserve insertion order; in CPython 3.6 this was already the case as an implementation detail. Read the discussion here.
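Even with ordered regular dictionaries, OrderedDict keeps a few features of its own. The sketch below shows reordering, removal from the front, and order-sensitive equality; the sample keys and values are illustrative.

```python
from collections import OrderedDict

od = OrderedDict([("apple", 4), ("banana", 3), ("pear", 1)])

# move_to_end() repositions an existing key at either end
od.move_to_end("apple")
print(list(od))                  # ['banana', 'pear', 'apple']
od.move_to_end("pear", last=False)
print(list(od))                  # ['pear', 'banana', 'apple']

# popitem(last=False) removes and returns the first entry
first = od.popitem(last=False)
print(first)                     # ('pear', 1)

# Equality between two OrderedDicts is order-sensitive; for plain dicts it is not
a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])
print(a == b)                    # False
print(dict(a) == dict(b))        # True
```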


Conclusion

The collections module also contains other useful datatypes such as deque, ChainMap, UserString, and a few more. However, I have shared the ones I use in my day-to-day programming to make things simple. For detailed explanations and usage, visit the official Python documentation page.
