“Getting information off the internet is like taking a drink from a firehose.” — Mitchell Kapor
The internet is flooded with data. However, getting that data in the right format is the real deal. It takes some amount of cleaning and filtering to bring the data into a format which is ideal for analysis. Last week I came across some pretty good tabular data sets on Wikipedia. As always, I copied the data and pasted on the excel sheet. For most parts, it worked well, but there were few instances when the typical copy-paste methodology failed miserably:
- When the table was typically long and spread across a complete webpage
- When the data wasn’t static, i.e., it updated after regular intervals. So every time, there was a change in the original dataset, I had to fetch the dataset again.
For situations like these, simply copy-pasting wasn’t enough. Scraping was another option, but I was looking for something quick and easy and which didn’t involve a ton of code. It was then that I came across a convenient function in Google Sheets called
IMPORTHTML, which is ideal for importing data from a table or list within an HTML page. In this article, I’ll describe the end to the end process of fetching tables( and lists) into google sheets.
Prior to extracting or importing data that hasn’t been curated by you make sure you have the necessary permissions. Also, this method works only if the data is publicly available, which means there shouldn’t be requirements for authorisation etc.
Before using the function, let’s quickly understand the syntax of the
IMPORTHTML function. The
IMPORTHTML function takes in three distinct parameters and is written as follows:
IMPORTHTML(URL, query, index)
- URL: Refers to the URL of the page where the table is. Do not forget to add the protocol (e.g. http://) and make sure the URL is between quotes.
- Query: query could be either a table or a list of items. For example, if you want to import a table, mention table within quotes, else mention a list.
- Index: Refers to the position of the table on the web page, which starts at 1.
Usage: Importing a Table
Let’s now see how we can quickly and seamlessly import an HTML table referenced in a Wikipedia article into a Google sheet. The Page is titled: List of Netflix original films and contains information about various films and programs distributed by Netflix, an American global on-demand Internet streaming media provider.
Step 1:Copy the URL
Step 2: Select the table to be imported
Let’s say we wanted to import all the Feature Films, which is the first table on the page.
Step 3: Add the formula in the Google Sheet
Create a new Google Sheet, or create a new tab on an existing sheet. Enter the following formula in a cell specifying the URL, query, and index number.
Voila! The entire table is extracted and populated in the spreadsheet. Here’s what you get in the google sheet:
Let’s combine all the three steps above and look at an end to end demo below:
You can import a list in the same way. The only change would be to replace the word table with the list in the function parameter.
Customizing the output: Use Query Function
There are a bunch of things that you can do to customize the imported data. This can be achieved by using the
Query function in conjunction with
Query function runs a Google Visualisation API Query across data. Let’s see some of the ways to do this:
Limiting the number of imported columns
Let’s say you do not want to import all the columns, rather a select few. You can pass the index of the desired columns. For instance, if we only want the first three columns, our function would be as follows:
Look how we have enclosed the
IMPORTHTML function with the
query function and specified the column indices.
Filtering data based on a column
Another scenario would be when we want some specific data, i.e. data about only the Comedy genre. This can be done very easily as follows:
=query(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films","table",1),"Select * where Col2='Comedy'")
Check out the official documentation of the
QUERY function for other ways to use it.
In this article, we saw the power of the
IMPORTHTML function and how it could be used to import tables and lists from the website, blogs, and other HTML sources directly into a google spreadsheet. From there, you can perform several analytics tasks, on the dataset and derive meaningful information from it. You can also extract the data and share it with others so that they can also make use of it.
Originally published here