Import HTML tables into Google Sheets effortlessly

“Getting information off the internet is like taking a drink from a firehose.” — Mitchell Kapor

The internet is flooded with data. However, getting that data in the right format is the real deal. It takes some amount of cleaning and filtering to bring the data into a format which is ideal for analysis. Last week I came across some pretty good tabular data sets on Wikipedia. As always, I copied the data and pasted on the excel sheet. For most parts, it worked well, but there were few instances when the typical copy-paste methodology failed miserably:

  • When the table was typically long and spread across a complete webpage
  • When the data wasn’t static, i.e., it updated after regular intervals. So every time, there was a change in the original dataset, I had to fetch the dataset again.

For situations like these, simply copy-pasting wasn’t enough. Scraping was another option, but I was looking for something quick and easy and which didn’t involve a ton of code. It was then that I came across a convenient function in Google Sheets called IMPORTHTMLwhich is ideal for importing data from a table or list within an HTML page. In this article, I’ll describe the end to the end process of fetching tables( and lists) into google sheets.


Prior to extracting or importing data that hasn’t been curated by you make sure you have the necessary permissions. Also, this method works only if the data is publicly available, which means there shouldn’t be requirements for authorisation etc.


Syntax

Image by Author

Before using the function, let’s quickly understand the syntax of the IMPORTHTML function. The IMPORTHTML function takes in three distinct parameters and is written as follows:

IMPORTHTML(URL, query, index)

Parameters

  • URL: Refers to the URL of the page where the table is. Do not forget to add the protocol (e.g. http://) and make sure the URL is between quotes.
  • Query: query could be either a table or a list of items. For example, if you want to import a table, mention table within quotes, else mention a list.
  • Index: Refers to the position of the table on the web page, which starts at 1.

Usage: Importing a Table

Let’s now see how we can quickly and seamlessly import an HTML table referenced in a Wikipedia article into a Google sheet. The Page is titled: List of Netflix original films and contains information about various films and programs distributed by Netflix, an American global on-demand Internet streaming media provider.

Step 1:Copy the URL

Image Source: Wikipedia

Step 2: Select the table to be imported

Let’s say we wanted to import all the Feature Films, which is the first table on the page.

Image Source: Wikipedia

Step 3: Add the formula in the Google Sheet

Create a new Google Sheet, or create a new tab on an existing sheet. Enter the following formula in a cell specifying the URL, query, and index number.

=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films_(2012%E2%80%932019)","table",1)

Voila! The entire table is extracted and populated in the spreadsheet. Here’s what you get in the google sheet:

Image by Author

Let’s combine all the three steps above and look at an end to end demo below:

Image by Author

You can import a list in the same way. The only change would be to replace the word table with the list in the function parameter.


Customizing the output: Use Query Function

There are a bunch of things that you can do to customize the imported data. This can be achieved by using the Query function in conjunction with IMPORTHTML. The Query function runs a Google Visualisation API Query across data. Let’s see some of the ways to do this:

Limiting the number of imported columns

Let’s say you do not want to import all the columns, rather a select few. You can pass the index of the desired columns. For instance, if we only want the first three columns, our function would be as follows:

=query(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films","table",1),"Select Col1,Col2,Col3")
Image by Author

Look how we have enclosed the IMPORTHTML function with the query function and specified the column indices.

Filtering data based on a column

Another scenario would be when we want some specific data, i.e. data about only the Comedy genre. This can be done very easily as follows:

=query(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films","table",1),"Select * where Col2='Comedy'")
Image by Author

Check out the official documentation of the QUERY function for other ways to use it.


Conclusion

In this article, we saw the power of the IMPORTHTML function and how it could be used to import tables and lists from the website, blogs, and other HTML sources directly into a google spreadsheet. From there, you can perform several analytics tasks, on the dataset and derive meaningful information from it. You can also extract the data and share it with others so that they can also make use of it.


Originally published here

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s