Web Scraping Mission Statements and Ratings from glassdoor.com

An overview, with a code-along in Python

Apr 27, 2021

In response to a summer of unrest surrounding racial injustice in 2020, more companies began promising to the public that they would “do better” in terms of diversity. This made me wonder: Is there a way to look at how a company describes itself and determine whether that company lives up to its self-proclaimed diversity standards? As a data scientist, I can’t help but want to explore quantifiable ways to prove or disprove statements such as these.

To answer this question of accountability, I went to glassdoor.com, a website you’re probably familiar with if you’ve ever wanted to look up a company’s salary expectations or cultural rating. Glassdoor is built on the foundation of increasing workplace transparency. Users can rate companies they work for (or have worked for) on a scale of 1 to 5 in categories like Culture & Values, Work/Life Balance, Compensation, and others.

In October 2020, Glassdoor added a new metric called the “Diversity & Inclusion Rating.” So, to answer my business question, I set out to use Natural Language Processing (NLP) to analyze the text data in each company’s mission statement and compare it to this particular rating.

This was a huge project, and one that I’ll likely dive into in a later blog post. For today, however, I want to talk to you about how I went about gathering the data needed for this project.

Web Scraping glassdoor.com

In order to begin, I needed to create a data frame with the necessary metrics. Variables from each company included name, size, headquarters location, industry, ratings (overall, diversity & inclusion, and others), number of reviews, description, and mission statement.

The below function is a beast. This post will be an overview of the entire function, and I’ll be breaking it down into bite-sized bits in future posts.

Here’s how I web scraped 3,000+ company urls:

1. Install ChromeDriver.

ChromeDriver does exactly what it sounds like: it “drives” your web browser according to the code you point the driver to. If you’ve never used it before, it’s pretty cool: an actual browser pops up and it kinda looks like a ghost is clicking through the webpages. There are a bunch of tutorials on installation and it’s pretty straightforward, so I won’t go into that here. You can check out the documentation here. There is even some Python code you can copy and paste; you’ll just have to make sure to change “/path/to/chromedriver” to wherever you decide to install it.

2. Import your driver and necessary libraries, then start your driver on the url you want to begin from.

This will pull up a new Chrome window.
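The import and start-up steps might look something like the sketch below. It assumes Selenium 4, where the ChromeDriver location is passed in through a Service object. The function name start_driver and the example arguments are illustrative assumptions, so swap in your own path and starting url.

```python
def start_driver(chromedriver_path, start_url):
    """Launch Chrome through ChromeDriver and open the starting page."""
    # Imported here so the sketch can be loaded before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path))
    driver.get(start_url)  # a new Chrome window opens on this page
    return driver


# Example call (both arguments are placeholders to replace with your own):
# driver = start_driver("/path/to/chromedriver", "https://www.glassdoor.com/")
```

Keeping the returned driver in a variable lets you reuse the same browser session for every page you scrape afterwards.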

3. Create your master function.

The function takes num_pages as an argument, which specifies how many urls you’d like to scrape in one go.
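To give a feel for the overall shape of such a function, here is a heavily simplified sketch, not the full scraper. It assumes that num_pages counts listing pages, and that loading a page and parsing it are handled by two helper callables, load_page and parse_companies, which are hypothetical names used only for illustration. The cap of 399 reflects the page limit discussed later in this post.

```python
import time

MAX_PAGE = 399  # assumption: highest page number that can be requested


def scrape_companies(num_pages, load_page, parse_companies,
                     start_page=1, pause_seconds=1.0):
    """Collect company records from num_pages consecutive listing pages.

    load_page(page_number) should return the page's HTML as a string.
    parse_companies(html) should return a list of dicts, one per company.
    """
    records = []
    last_page = min(start_page + num_pages - 1, MAX_PAGE)
    for page in range(start_page, last_page + 1):
        html = load_page(page)
        records.extend(parse_companies(html))
        time.sleep(pause_seconds)  # short pause between page loads
    return records


# With a Selenium driver, load_page could be something like:
#     def load_page(page):
#         driver.get(listing_url_for(page))  # listing_url_for is hypothetical
#         return driver.page_source
```

Passing the loader and parser in as arguments keeps the loop itself easy to try out with stand-in functions before pointing it at a live browser, and start_page makes it possible to pick up where an earlier run left off.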

4. Run your function!

When first running this function, I would recommend starting with a small number so you can work out any kinks.

5. Sanity Checks and saving final CSV

I always like to print some sanity checks to make sure everything is in tip-top shape. Then, you can save your data into a pandas data frame, and finally save your data frame to a CSV file.
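As a rough sketch of this step, the function below builds a data frame from a list of scraped records, prints a few basic checks, and writes the result to disk. The column names and the sample rows are made up for illustration, and which checks are worth printing will depend on your own data.

```python
import pandas as pd


def check_and_save(records, csv_path):
    """Print quick sanity checks, then save the records as a CSV file."""
    df = pd.DataFrame(records)

    print("rows, columns:", df.shape)
    print("duplicate rows:", int(df.duplicated().sum()))
    print("missing values per column:")
    print(df.isna().sum())

    df.to_csv(csv_path, index=False)  # index=False leaves out the row index
    return df


# Made-up rows for illustration only; real rows come from the scraper.
sample = [
    {"name": "Example Co", "overall_rating": 4.1, "diversity_rating": 3.9},
    {"name": "Sample Inc", "overall_rating": 3.6, "diversity_rating": None},
]
# df = check_and_save(sample, "glassdoor_companies.csv")
```

Checking the shape, duplicates, and missing values before saving makes it easier to spot a page that failed to load or a field that was not picked up.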

6. Close and quit

If you’re satisfied with the quantity of company urls, you can close your driver here. Otherwise, you can run this function as many times as you like before closing out.
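A minimal sketch of the shutdown step, assuming driver is the Selenium WebDriver object you started earlier. The helper name shut_down is just an illustrative choice.

```python
def shut_down(driver):
    """Close the current browser window, then end the WebDriver session."""
    try:
        driver.close()  # closes the window the driver is currently focused on
    finally:
        driver.quit()   # ends the session and stops the ChromeDriver process


# Example call, once you are done scraping:
# shut_down(driver)
```

Calling quit() in the finally block means the ChromeDriver process is still cleaned up even if closing the window raises an error.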

If you have the time, you can run this code for 399 pages. Why stop at 399 when there are several hundred additional pages of companies? Glassdoor won’t allow the user to type in a root url with a page number of 400 or higher. If you need more pages, as I did, you’ll need to find a workaround. If you’re interested in how I tackled this, I’ll be covering it in a future blog post.

Since this project needed to be completed in a relatively short period of time, I only ran a couple hundred urls at a time in my “off” hours. This way, I could work on the meat of my project during the day without my computer working overtime trying to keep up with my daily coding and everything running at once.

Thanks for tuning in! I hope this article was helpful as an overview of how to get set up with your web scraping function for glassdoor.com. In following articles, I’d like to provide a series breaking down various parts of this mother function. I hope you’ll stay tuned for those coming soon!

Resources

This project can be found on my personal GitHub account: https://github.com/cierra4prez/NLP_Diversity_and_Inclusion

Direct link to the web scraper can be found here.

ChromeDriver installation instructions found here.

Update

This article is a part of a series regarding a web scraping function used for an NLP project scraping Glassdoor.com. For in-depth explanations of different aspects of the function, please check out my other posts which include python tutorials.