Web Scraping glassdoor.com: Python Tutorial № 2

Getting started: gathering a list of links.

Cierra Andaur
May 4, 2021

<summoning my best podcast voice> Hello! And welcome back to another edition of my web scraping series, where we learn how to web scrape mission statements and ratings from glassdoor.com! Last week I gave an overview of the web scraping function I used to gather information from 3000+ company urls. This week I’d like to start breaking it down into bite-sized chunks. This article explains lines 1–28 of the function, which handle creating a list of links. In the next article, we’ll use this list of urls to gather the variables we need for our data frame.

So let’s create that list!

Note: As this article only covers lines 1–28 of the function (the full function is referenced at the end), please check out my previous post if you need help getting started or would like a general overview.

Setting Up

Imports

from selenium import webdriver
# time is used as a buffer to allow webpages to load:
import time

Define a starting point (url_main) and start the webdriver. I chose the main page, but you can start with any webpage; this is merely a starting point.

# Start the webdriver (this assumes ChromeDriver; see Resources at the end)
driver = webdriver.Chrome()

url_main = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page=1&isHiringSurge=0'
driver.get(url_main)

Choose where you want your driver to start from.

Many websites start with a root url and simply iterate through pages using a page number.** Looking at Glassdoor’s browse-companies function, for example:

**Side Note: The number of pages you’re allowed to reach from a root url depends on the site. Glassdoor has hundreds of pages of company urls, but will only let you manually type up to page number 398. Try it! https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page=398. Scroll to the bottom and you can click on and on and on, but try to manually type 399 or 408 or 590, and it shoots you right back to page one. You may be thinking: “At 10 urls a page, why on earth would you need more than 3,980 companies of information?” Obviously… you’re a terrible data scientist ;) because we all know… say it with me: “More data: more better.” Unfortunately, it can be difficult to predict when you’ll hit your brick wall (unless you read it somewhere on some blog, or you’re incredibly lucky… or both). For now, let’s focus on what we know.


Diving In

Now that we know our root url, we’ll need to define some necessary variables:

  • url_root — where we’re starting from
  • num_pages — how many Glassdoor search pages we want to start with (each search page contains 10 companies)
  • nums — list comprehension to iterate through each of those num_pages
  • url_mains — list of all urls from each Glassdoor search page
# Define root URL
url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
# Let's start off by scraping 3 pages
num_pages = 3
# List comprehension to iterate through each of those 3 pages
nums = [x+1 for x in range(num_pages)]
# Create a list of the URLs we wish to scrape by adding 'n' to the end of url_root
url_mains = list(map(lambda n: url_root + str(n), nums))
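
A quick sanity check: printing url_mains should show one url per search page, ending in page=1 through page=3 (output abridged here):

print(url_mains)
# ['https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page=1', ..., '...page=3']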

Now we iterate through the url_mains list we’ve gathered with a for loop.

for u in url_mains:
    driver.get(u)

Ok, but wait: this function grabs every clickable link from the search page (pictured below).

(screenshot) That’s too many links to scrape!

We’re only interested in the main page of each company to get the variables we want. Clicking on “Reviews,” “Salaries,” “Jobs,” etc. would waste a lot of unnecessary time. We simply want to click on “Continue reading” to pull up each company’s main url. So how do we pick these out? The glorious “Inspect.”

Right click “Continue reading” and choose “Inspect”

Right click (or Ctrl+click on Mac), and choose “Inspect” from the drop-down menu. This will split your browser screen into two parts so you can inspect the behind-the-scenes HTML code (screenshot below).

After clicking ‘Inspect,’ hover the cursor over the box of HTML that pops up on the right side of your screen. This highlights the part of the webpage that piece of HTML controls (here, “Continue reading,” shown highlighted in blue to the left).

(screenshot) Resulting window view after right clicking and choosing “Inspect”
(screenshot) a close-up of the HTML we’re inspecting (highlighted above)

Notice this element is indicated by <a href= followed by the url https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.11,17.htm. This line contains the information we’ll need to create our list of company urls.
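
The markup looks roughly like this (other attributes trimmed for readability; only the href matters to us):

<a href="https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.11,17.htm">Continue reading</a>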

Creating a list of urls

Back to our code:

  1. Define an empty list (company_links) to save each company’s ‘Overview’ url
  2. Use the webdriver method find_elements_by_tag_name('a') to collect every ‘a’ element on the page
  3. For each element, pull its ‘href’ attribute using get_attribute and check whether it contains ‘Overview’
  4. Append each ‘Overview’ link to your empty list (company_links)
# Define an empty list
company_links = []
# Define elems by searching the HTML for the 'a' tag
elems = driver.find_elements_by_tag_name('a')
# Loop through elems and return every item with an 'href' attribute
for elem in elems:
    company_link = elem.get_attribute('href')
    # Append links with the 'Overview' keyword to our empty list
    # (the guard skips elements whose 'href' comes back as None)
    if company_link and 'Overview' in company_link:
        company_links.append(company_link)
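
Depending on the page layout, the same ‘Overview’ url can show up more than once. If you see duplicates, one optional line (not part of the original function) removes them while preserving order:

# Optional: drop duplicate links, keeping first-seen order
company_links = list(dict.fromkeys(company_links))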

Awesome! We have a list of company urls from Glassdoor! Now, we’ll iterate through each link in the company_links list we’ve populated.

# Iterating through each company's "Overview" url
for url in company_links:
    driver.get(url)

At this point, we’ve covered the full function up through line #27, and we’re ready to grab some variables for our data frame.


time.sleep()

One last thing to note: use time.sleep() to make sure each webpage has plenty of time to load before your driver starts trying to gather more information. For example, time.sleep(5) pauses for 5 seconds before the next line runs, giving the webpage that long to load. You may need more or less time depending on your internet speed and machine capabilities. If you don’t use time.sleep() and the page hasn’t finished loading before the next action, your code could throw an error because it will be searching an essentially blank page and therefore won’t find the elements you’ve specified.
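
As an aside: fixed sleeps are the simplest approach (and what this series uses), but Selenium also ships explicit waits, which pause only until a condition is met instead of for a fixed time. A minimal sketch, if you’d like to experiment:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one 'a' tag to be present,
# then continue immediately once it appears
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'a'))
)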

All together, now!

This is the code we have so far:

url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
num_pages = 3
nums = [x+1 for x in range(num_pages)]
url_mains = list(map(lambda n: url_root + str(n), nums))

company_links = []
for u in url_mains:
    driver.get(u)
    time.sleep(10)
    # looking for 'Overview' links from each main search page
    elems = driver.find_elements_by_tag_name('a')
    for elem in elems:
        company_link = elem.get_attribute('href')
        if company_link and 'Overview' in company_link:
            company_links.append(company_link)

# iterating through each company's "Overview" url
for url in company_links:
    driver.get(url)
    time.sleep(5)
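
When the scrape is finished, it’s also good practice to shut the browser down (one extra line, not shown in the original function):

driver.quit()  # close the browser window and end the webdriver session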

Next up: we’ll drill down further and take a look at getting the variables we need for our machine learning project.

Happy Scraping!

Reference and Resources

Entire web scraping function:

This project can be found on my personal GitHub account: https://github.com/cierra4prez/NLP_Diversity_and_Inclusion

Direct link to my web scraper can be found here.

ChromeDriver installation instructions found here.

Update

This article is a part of a series regarding a web scraping function used for an NLP project scraping Glassdoor.com. For in-depth explanations of different aspects of the function, please check out my other posts which include python tutorials.
