Web Scraping glassdoor.com: Python Tutorial № 3
Gathering variables for a data frame
When web scraping (or working on any project, for that matter), I always like to tackle one piece at a time. This makes debugging your code so much easier. That is why I’ve broken this web scraping tutorial into several parts. In this article, we will walk through the process of retrieving variables with our web scraper and saving them into a data frame.
This article explains the code starting at line #29 of my web scraping function (referenced at the end of this article), beginning at “Gathering Variables — Main Page.” To get set up, please check out my previous posts:
Gathering Variables
Begin with the page from where you will gather your data. We’ll start with Google’s glassdoor.com listing, which pops up first on the list on the “Browse Companies” search page.
Next, we’ll need to tell our webdriver which elements we want to scrape from that url. Let’s start with some easy/straight-forward variables:
- name — company name
- size — company size
- headquarters — location of headquarters
- industry — company industry
- num_reviews — number of reviews
We’ll use Selenium’s find_element_by_xpath to do this. To figure out the XPath for your first variable (name), there are a few steps we need to follow:
1. Right-click the information you want to collect (here, the company’s name) and choose “Inspect.”
Clicking “Inspect” will split your browser window into 2 parts so that we can inspect the behind-the-scenes HTML code (screenshot below).
2. Find your first element
We now want to mouse over this “Elements” window where we see the company’s name (title data-company="Google"). You can also click the carets to expand the code further if you don’t see the element you need. You may have to play around a little to find what you’re looking for.
**Side Note: You’ll notice that when you hover over different elements in the HTML window to the right, the corresponding elements are highlighted on the webpage view. This will help you find your way.
Alternatively, you can click the icon at the menu bar (the box with an arrow, shown below), and then click on the element on the webpage (in the left viewing screen) and the HTML box (the window on the right) will highlight and jump to the corresponding code for that element.
Finding our first element looks like this:
3. Copy the XPath
Once you find the line you need, right click.
Choose Copy > Copy XPath. This copied xpath will be what we paste into our web scraping notebook.
4. Place the XPath into your code
Assign a variable to that xpath and turn the output into readable text. For this example:
- name = driver.find_element_by_xpath('copied_xpath_here').text
I love to do a sanity check at this point just to make sure everything is coming through as I expect it to.
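For example, a quick sanity check can be as simple as printing the variable right after you scrape it (the XPath string below is just the placeholder from step 4, not a real XPath):

name = driver.find_element_by_xpath('copied_xpath_here').text
print(name)   # expect this to print "Google" for the Google listing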
Alright! If all is well, follow the same 4 steps for your other variables (size, headquarters, industry, num_reviews) and code it out! Remember, you may have to play around to find the exact xpath for each element.
Now that we have our variables defined, we can test this code to see if it works on other companies’ Glassdoor urls.
Handling Nuance
Before we get too far, it’s a good idea to test these coded variables on a few other urls to make sure we can actually scrape efficiently. We don’t want to put in too much of our valuable time only to find out, for example, that some pages have a unique structure we haven’t accounted for.
Also, a tip on handling web scraping errors: it’s a good idea to implement some try/except failsafes. This way, if for some reason there’s an error in scraping a particular piece of information, your function doesn’t stomp its foot and refuse to do any more work. (Like if, for example, your function comes across an odd page layout or your internet slows to a crawl.)
I also like to print out my own friendly “error” message so that I’m aware if there are problems and I can troubleshoot if I need to. I just don’t want my function throwing a tantrum in the middle of the night while I’m sleeping soundly and dreaming of all the data I’ll be waking up to… only to wake up to no freshly gathered data.
More about these elements below.
Testing Additional URLs
In a previous article, we mapped out how to web scrape a list of urls. We started with a root url formula, searched for specific links (containing the word “Overview”), and appended these overview links to a list.
That produced the code below:
from selenium import webdriver
import time

# The webdriver (driver) was set up in the previous post, e.g. driver = webdriver.Chrome()

url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
num_pages = 3

nums = [x+1 for x in range(num_pages)]
url_mains = list(map(lambda n: url_root + str(n), nums))

time.sleep(10)

for u in url_mains:
    driver.get(u)
    time.sleep(10)

    # Looking for 'Overview' links from each main search page
    elems = driver.find_elements_by_tag_name('a')
    company_links = []
    for elem in elems:
        company_link = elem.get_attribute('href')
        if company_link and 'Overview' in company_link:
            company_links.append(company_link)

    # Iterating through each company's 'Overview' url
    for url in company_links:
        driver.get(url)
        time.sleep(5)
We’ll place this code in a function to scrape just one page of Glassdoor’s “Browse Companies” search page. We’ll also add a try/except statement here to let the “except” block handle any errors.
try:
- This block will instruct the driver to scrape our variables and print them out.
- If there is a problem in this block, however, the code will then move to the “except” block.
- Don’t forget to add time.sleep() to give the page some time to load before scraping.
for url in company_links:
    try:
        driver.get(url)
        time.sleep(5)

        name = driver.find_element_by_xpath('//*[@id="EmpHeroAndEmpInfo"]/div[3]/div[2]').text
        size = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[3]/div').text
        headquarters = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[2]/div').text
        industry = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[6]/div').text
        num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[3]/a').text

        print(name, size, headquarters, industry, num_reviews)
except:
As mentioned, the except block handles the case where something goes wrong in the “try” block. It will do a few things:
- Append the troublemaker url to a list (unsuccessful_links) so we can reference it later if needed, for example if we wish to attempt to scrape those urls at a later time. (We will need to define an empty list named unsuccessful_links outside of the function.)
- Print an error message with the troublemaker url.
- Give extra time for the page to load before moving on to the next link. This is to let your browser catch up in case the error happened due to slow internet speed.
    except:
        unsuccessful_links.append(url)
        print('ERROR: ', url)
        time.sleep(10)
All 3 of these blocks together now in a function:
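Here’s a minimal sketch of what that combined function might look like at this point, assuming driver, url_root, time, and the empty unsuccessful_links list are already defined as above (your exact XPaths may differ):

def scraping_pages(num_pages):
    # Build the list of "Browse Companies" search pages to visit
    nums = [x+1 for x in range(num_pages)]
    url_mains = list(map(lambda n: url_root + str(n), nums))

    for u in url_mains:
        driver.get(u)
        time.sleep(10)

        # Gather the 'Overview' links on this search page
        elems = driver.find_elements_by_tag_name('a')
        company_links = []
        for elem in elems:
            company_link = elem.get_attribute('href')
            if company_link and 'Overview' in company_link:
                company_links.append(company_link)

        # Visit each company's 'Overview' page and scrape the variables
        for url in company_links:
            try:
                driver.get(url)
                time.sleep(5)
                name = driver.find_element_by_xpath('//*[@id="EmpHeroAndEmpInfo"]/div[3]/div[2]').text
                size = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[3]/div').text
                headquarters = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[2]/div').text
                industry = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[6]/div').text
                num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[3]/a').text
                print(name, size, headquarters, industry, num_reviews)
            except:
                # Keep the troublemaker url, print a friendly error, and give the browser time to catch up
                unsuccessful_links.append(url)
                print('ERROR: ', url)
                time.sleep(10)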
Test out your function on one page by running:
scraping_pages(1)
Troubleshooting
If you’re not coding along with me up to this point, I’ll save you some trouble: our error message prints out for some of the urls (Cisco and Capital One, to name a couple).
In order to inspect what was happening here and see a more productive error message, I took pieces of the code out of the function, removed the try/except statements, and tested the troublemaker urls. I found that the trouble was the XPath for num_reviews.
- Google’s num_reviews XPath: //*[@id="EIOverviewContainer"]/div/div[3]/div[3]/a
- Cisco and Capital One’s num_reviews XPath: //*[@id="EIOverviewContainer"]/div/div[4]/div[3]/a
Luckily, the XPath for these two companies was the same, and it matched the XPaths for the other troublemakers on the first scraped page. Thank goodness! That could have been a LOT more trouble than it turned out to be.
To address this, we can add another try/except statement with our num_reviews variable. If the try xpath doesn’t gather the info, the except xpath should catch it. Our updated code will look like this:
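A sketch of that nested try/except for num_reviews, using the two XPaths found above:

# First try the layout we saw on Google's page; fall back to the alternate layout
try:
    num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[3]/a').text
except:
    num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[4]/div[3]/a').text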
To test this function, try it on 2 pages of the “Browse Companies” url by running:
scraping_pages(2)
This is exciting because now it’s time to finally start putting our data into a data frame!
Creating a Data Frame
First, we create an empty list that will hold your data (companies). This should be outside of your function.
companies = []
In the function, after the variables are scraped, map out the dictionary with the variables we have so far:
companies.append({
"NAME" : name,
"SIZE" : size,
"LOCATION_HQ" : headquarters,
"INDUSTRY" : industry,
"NUM_REVIEWS" : num_reviews,
})
Each “Browse Companies” search page contains 10 “Overview” urls. After all 10 urls from a page have been scraped, I like to add a print statement showing how many companies the function has scraped so far. This way, we get visual confirmation that everything is working as expected while the code is running, and we can see in real time how quickly (or slowly) our code is working.
print(f'Finished scraping {len(companies)} companies')
Adding this print statement should return something like this:
This shows we’ve successfully scraped 2 pages of company urls, hit 3 troublemaker urls on page 3, and then resumed successfully for pages 4 and 5. Errors are inevitable, since some pages will have a slightly different layout, but this indicates that our 2 try/except blocks are working for the majority of companies.
Next, we turn our companies list into a data frame:
df = pd.DataFrame(companies)
and finally, we return that data frame!
return df
All together now:
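Here’s a sketch of the full function at this stage, again assuming driver, url_root, unsuccessful_links, and the empty companies list are defined outside the function (your exact XPaths may differ):

import time
import pandas as pd

def scraping_pages(num_pages):
    # Build the list of "Browse Companies" search pages to visit
    nums = [x+1 for x in range(num_pages)]
    url_mains = list(map(lambda n: url_root + str(n), nums))

    for u in url_mains:
        driver.get(u)
        time.sleep(10)

        # Gather the 'Overview' links on this search page
        elems = driver.find_elements_by_tag_name('a')
        company_links = []
        for elem in elems:
            company_link = elem.get_attribute('href')
            if company_link and 'Overview' in company_link:
                company_links.append(company_link)

        for url in company_links:
            try:
                driver.get(url)
                time.sleep(5)
                name = driver.find_element_by_xpath('//*[@id="EmpHeroAndEmpInfo"]/div[3]/div[2]').text
                size = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[3]/div').text
                headquarters = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[2]/div').text
                industry = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[1]/ul/li[6]/div').text
                # Fall back to the alternate layout's XPath for num_reviews
                try:
                    num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[3]/div[3]/a').text
                except:
                    num_reviews = driver.find_element_by_xpath('//*[@id="EIOverviewContainer"]/div/div[4]/div[3]/a').text

                companies.append({
                    "NAME" : name,
                    "SIZE" : size,
                    "LOCATION_HQ" : headquarters,
                    "INDUSTRY" : industry,
                    "NUM_REVIEWS" : num_reviews,
                })
            except:
                unsuccessful_links.append(url)
                print('ERROR: ', url)
                time.sleep(10)

        print(f'Finished scraping {len(companies)} companies')

    df = pd.DataFrame(companies)
    return df

You can then run something like df = scraping_pages(5) to get back a data frame of everything scraped from the first 5 search pages.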
So exciting! We’re on a good track to running this function on a bunch of pages!
Handling Modals
If all of the data you need can be scraped from the front page, then you’re good to go! However, for my particular NLP project, I need some variables that don’t show on the front page.
- I need to scrape company descriptions and mission statements, and this text is sometimes hidden behind a “Read more” modal.
- I also need access to the Diversity & Inclusion rating, which can only be gathered by clicking the ratings caret.
I’ll be diving into how we can handle these modals in my next post. I hope to see you there!
Reference and Resources
Entire web scraping function:
This project can be found on my personal GitHub account: https://github.com/cierra4prez/NLP_Diversity_and_Inclusion
Direct link to my web scraper can be found here.