Web Scraping glassdoor.com: Python Tutorial № 2

Getting started: gathering a list of links.

Setting Up

from selenium import webdriver
# time is used as a buffer to allow webpages to load:
import time
# Assumes Chrome with a matching driver installed; swap in the browser you use
driver = webdriver.Chrome()
url_main = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page=1&isHiringSurge=0'
driver.get(url_main)

Choose where you want your driver to start from.

Diving In

# Define root URL
url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
# Let's start off by scraping 3 pages
num_pages = 3
# List comprehension to iterate through each of those 3 pages
nums = [x+1 for x in range(num_pages)]
# Create a list of the URLs we wish to scrape by adding 'n' to the end of url_root
url_mains = list(map(lambda n: url_root + str(n), nums))
for u in url_mains:
    driver.get(u)
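
Before pointing the driver at anything, it can help to print the URLs and confirm they look right. The snippet below is a minimal sketch of the same idea as a small function; `build_page_urls` is a hypothetical helper name, not something from Selenium or Glassdoor.

```python
def build_page_urls(url_root, num_pages):
    """Return one browse-page URL per page number, starting at page 1."""
    return [url_root + str(n) for n in range(1, num_pages + 1)]


url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
page_urls = build_page_urls(url_root, 3)

# Each printed URL should end in page=1, page=2, page=3
for page_url in page_urls:
    print(page_url)
```

Using `range(1, num_pages + 1)` does the same job as the `x+1` list comprehension above, just in one step.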
(screenshot) That’s too many links to scrape!
Right-click “Continue reading” and choose “Inspect”.
(screenshot) Resulting window view after right-clicking and choosing “Inspect”
(screenshot) A close-up of the HTML we’re inspecting (highlighted above)

Creating a list of URLs

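# Used to locate elements by tag name (the find_elements_by_* helpers were removed in Selenium 4.3)
from selenium.webdriver.common.by import By

# Define an empty list
company_links = []
# Define elems by searching the HTML for the 'a' tag
elems = driver.find_elements(By.TAG_NAME, 'a')
# Loop through elems and return every item with 'href' attribute
for elem in elems:
    company_link = elem.get_attribute('href')
    # Append links with 'Overview' keyword to our empty list
    # (get_attribute returns None for anchors without an href, so check for that first)
    if company_link and 'Overview' in company_link:
        company_links.append(company_link)
# Iterating through each company's "Overview" url
for url in company_links:
    driver.get(url)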
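
The filtering step can be tried out without a browser. Here is a rough sketch that runs the same 'Overview' check over a plain list of strings; `filter_overview_links` is a hypothetical helper, and the sample URLs are made up to stand in for whatever `get_attribute('href')` returns on a real page. It also drops duplicates, which is an addition of mine rather than something the loop above does.

```python
def filter_overview_links(hrefs, keyword='Overview'):
    """Keep hrefs that contain keyword, skipping None/empty values and duplicates."""
    seen = set()
    kept = []
    for href in hrefs:
        if href and keyword in href and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept


# Made-up sample values standing in for elem.get_attribute('href') results
sample_hrefs = [
    'https://www.glassdoor.com/Overview/Working-at-Example-EI_IE0000.htm',
    None,
    'https://www.glassdoor.com/Reviews/Example-Reviews-E0000.htm',
    'https://www.glassdoor.com/Overview/Working-at-Example-EI_IE0000.htm',
]

overview_links = filter_overview_links(sample_hrefs)
print(overview_links)
```

With a live driver, the input would be something like `[elem.get_attribute('href') for elem in elems]`.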

time.sleep()
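
time.sleep() pauses the script for a fixed number of seconds so a page has time to load before we look for elements. One optional variation, sketched below, adds a little random extra time to each pause. The helper name `polite_pause` and the jitter idea are my own assumptions, not part of Selenium or of the original walkthrough.

```python
import random
import time


def polite_pause(base_seconds, jitter_seconds=2.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay actually used. The sleep argument exists only so the
    function can be tried out without really waiting.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay


# Short values here so the example finishes quickly
waited = polite_pause(0.1, jitter_seconds=0.1)
print(f'Paused for {waited:.2f} seconds')
```

In the scraping loop this would go right after each `driver.get(...)` call, for example `polite_pause(5)` in place of `time.sleep(5)`.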

All together, now!

from selenium.webdriver.common.by import By

url_root = 'https://www.glassdoor.com/Explore/browse-companies.htm?overall_rating_low=0&page='
num_pages = 3
nums = [x+1 for x in range(num_pages)]
url_mains = list(map(lambda n: url_root + str(n), nums))
time.sleep(10)
for u in url_mains:
    driver.get(u)
    time.sleep(10)

    # looking for 'Overview' links from each main search page
    elems = driver.find_elements(By.TAG_NAME, 'a')
    company_links = []
    for elem in elems:
        company_link = elem.get_attribute('href')
        # skip anchors without an href (get_attribute returns None for them)
        if company_link and 'Overview' in company_link:
            company_links.append(company_link)
    # iterating through each company's "Overview" url
    for url in company_links:
        driver.get(url)
        time.sleep(5)

