Web Scraping glassdoor.com: Handling Modals

Cierra Andaur
5 min read · May 18, 2021


What’s a Modal?

A modal is a dialog box or popup window on a website that is displayed on top of the current page. Modals aren’t present on every page, but you’ve definitely seen them before. They’re the “Read more” button or the little caret you click on to view a popup/hidden element.

In web scraping, modals aren’t terribly tricky, but they can be annoying if you’ve never seen one and don’t know how to deal with them. For one, the site’s URL doesn’t change when you click on them, so you can’t simply point the driver to a unique URL. Perhaps more annoyingly, the information contained in a popup window can’t be found with the right click + “Inspect” method until after you click on it. This means you’ll need your webdriver to “click” these elements while it’s running before the data can be accessed and scraped.

Note: This article covers only the handling of the modals needed for my web scraping function (which you can check out at the end of this article). The tutorial assumes that you already have Chrome Driver installed and the necessary libraries imported. To catch up on these parts of the function, please check out my previous posts, which include Python tutorials so you can code along.

Briefly reviewed: if you’ve been following along in my previous posts, you already have Chrome Driver installed and these libraries imported, which this article also requires:

from selenium import webdriver
import pandas as pd
import numpy as np
import time
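If you need a quick refresher, a minimal setup sketch looks like this (it assumes chromedriver is installed and on your PATH; the URL is just an example listing):

# launch Chrome and open a company's Glassdoor listing
driver = webdriver.Chrome()
driver.get('https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.11,17.htm')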

Accessing Modals

For my NLP project web scraping glassdoor.com, I needed access to a couple of different modals:

  1. The “Read more” buttons under company description and mission statement.
  2. A popup displaying detailed rating scores.

Pictured here on Google’s Glassdoor listing:

Modals that need to be accessed to scrape additional data

Modal Example #1: “Read more”

As discussed in my previous article, we’ll find the element we want to point our driver to by right clicking it and choosing “Inspect” from the menu.

Right click and choose “Inspect”

Remember, you may need to click the carets to expand your view of the HTML before you’re able to find the exact path. You can also use the icon that looks like a box with an arrow (shown below), or use the hot keys Cmd+Shift+C and then click on the element. This will point you straight to where you need to go by highlighting the HTML for you.

Looking at the HTML, we see only part of the excerpt, the same as what is visible on the webpage (before clicking “Read more”):

close-up

Our entire view looks like this:

full view

To direct our webdriver to click “Read more,” we’ll find the element by the button’s class.

1. Copy “css-568d5y e16x8fv00” and replace the spaces with a period (“.”), as shown in the box above.

Note: to make sure there are no other buttons on the webpage with this class name, use the hot keys Cmd+F to search by string, selector, or XPath.

2. Place this copied button class into our code using the find_element_by_class_name() driver method.

driver.find_element_by_class_name('css-568d5y.e16x8fv00')

3. Assign this to a variable so we can use the .click() method to click on the “Read more” button. (If there is no button with this class name, the driver will raise a NoSuchElementException, which the try/except block in step 7 will handle.)

read_more = driver.find_element_by_class_name('css-568d5y.e16x8fv00')
read_more.click()

4. Give the page a couple of seconds to load with time.sleep().

time.sleep(2)

5. Next, we’ll copy the XPath of the text we wish to scrape. (This XPath is the same before and after we click the button.) We can find it the same way we found our other variables: right click the HTML and choose “Copy XPath.”

6. Assign this path to a variable, “description,” and remember to pull out the text using .text.

description = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/div[1]/span').text

7. To handle cases of listings that have no description, we’ll use a try/except block here, as used and explained previously. The code below records a missing description as “N/A”.
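Assembled from the steps above, a minimal sketch of that block (the selectors are the ones we copied earlier):

from selenium.common.exceptions import NoSuchElementException

# Listings without a "Read more" button or description won't crash the scraper
try:
    read_more = driver.find_element_by_class_name('css-568d5y.e16x8fv00')
    read_more.click()
    time.sleep(2)
    description = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/div[1]/span').text
except NoSuchElementException:
    description = "N/A"  # no description found on this listing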

Now we do the same for the Mission statement! Lucky for us, the button class is the same, and so the same block of code works. Check out the HTML for yourself on Cisco Systems’ Glassdoor listing.
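Since the button class is the same, a sketch for the mission statement needs only a different text XPath (the one below is a placeholder; copy the real one from the “Inspect” window):

try:
    driver.find_element_by_class_name('css-568d5y.e16x8fv00').click()
    time.sleep(2)
    # placeholder XPath; copy the actual one from the listing's HTML
    mission = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/div[2]/span').text
except NoSuchElementException:
    mission = "N/A"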

Modal Example #2: Ratings

Now, let’s work with the ratings caret.

The ratings 🥕

We need to scrape the itemized list of ratings, which is only accessible by opening the popup. We’ll use similar logic as with the “Read more” button to access it. This time, we can use find_element_by_xpath().

1. Find the element by right clicking and choosing “Inspect,” as we did before.

2. Find the element in the HTML window, right click it, and choose “Copy XPath.”

3. Place this copied XPath into our code using the find_element_by_xpath() driver method and add the .click() method at the end.

driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[1]/div[2]').click()
Modal with itemized rating data

4. Give the page a few seconds to load with time.sleep() so that we can then scrape the ratings popup.

time.sleep(5)

5. On glassdoor.com, there are two common layouts that need to be accounted for, so we’ll use another try/except block to see which one sticks.
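A minimal sketch of that fallback, with placeholder XPaths standing in for the two layouts (copy the real ones from each layout’s HTML):

# try the first common layout, then fall back to the second
try:
    ratings = driver.find_element_by_xpath('//*[@id="LAYOUT-1-XPATH"]').text  # placeholder XPath
except NoSuchElementException:
    try:
        ratings = driver.find_element_by_xpath('//*[@id="LAYOUT-2-XPATH"]').text  # placeholder XPath
    except NoSuchElementException:
        ratings = "N/A"  # neither layout matched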

Hooray! We’re finally ready to put everything together! Go ahead and test out your scraping_pages() function on a few pages to make sure you have enough time between page loads and no errors to worry about before letting it loose and walking away.

I also recommend giving a quick look at my article: Web Scraping 101: Avoiding Detection. It’s a quick read on a couple of ways you can make sure that you don’t get blocked out of the site you’re scraping.

Happy Scraping!

Our completed scraping_pages() function!
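As a sketch of how the pieces above fit together (the URL list and the ratings XPath are placeholders; the full version lives in the GitHub repo linked below):

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time

def scraping_pages(urls):
    """Scrape the description and itemized ratings from a list of Glassdoor listing URLs."""
    driver = webdriver.Chrome()
    rows = []
    for url in urls:
        driver.get(url)
        time.sleep(3)  # let the page load
        # Modal #1: expand "Read more" and grab the description
        try:
            driver.find_element_by_class_name('css-568d5y.e16x8fv00').click()
            time.sleep(2)
            description = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/div[1]/span').text
        except NoSuchElementException:
            description = "N/A"
        # Modal #2: open the ratings popup and grab the itemized scores
        try:
            driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[1]/div[2]').click()
            time.sleep(5)
            ratings = driver.find_element_by_xpath('//*[@id="RATINGS-MODAL-XPATH"]').text  # placeholder XPath
        except NoSuchElementException:
            ratings = "N/A"
        rows.append({'url': url, 'description': description, 'ratings': ratings})
    driver.quit()
    return pd.DataFrame(rows)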

Reference and Resources

This project can be found on my personal GitHub account:

Direct link to the web scraper can be found here.


Written by Cierra Andaur

Data Scientist | Analytics Nerd | Pythonista | Professional Question Asker