Web Scraping 101: Avoiding Detection

How to web scrape without getting blocked

Cierra Andaur
4 min read · May 25, 2021

Before beginning your first web scraping mission, we should talk about a few things that you might want to keep in mind. Especially if you’re thinking of scraping a ton of data.

There are websites that aren’t terribly keen on the idea of web scrapers sweeping through and gathering all of their data, and so they may have anti-scraping mechanisms in place. This could result in your IP address being blocked or your user credentials getting flagged and being locked out. While there are articles to address this, most have an overwhelming amount of information, and not many with specific code examples. This can be tough for beginners, so I’ve set out to explain 2 very simple yet comprehensive ways we can confuse an anti-scraper so that our robot doesn’t look like… a robot. Shall we quickly go through a couple of ways we can try and avoid detection?

time.sleep()

In previous articles, I’ve explained using the time.sleep() method to give our webpage the time it needs to load, so as to avoid errors in case of slow internet speeds. It’s also helpful in avoiding detection by the server you’re scraping. A bot can crawl a website a lot faster than a human can, so when your bot zooms through pages without pause, it can raise some red flags. Use time.sleep() to slow down your code in places.

We can also use time.sleep() in conjunction with NumPy’s random.choice() method, which picks a random element from an array that you define. Below, the array holds the values 0.7, 0.8, and so on up to 2.1 (range(7, 22) stops before 22). This line says the code shall pause for a random amount of time between 0.7 seconds and 2.1 seconds.

import time
import numpy as np
# Pick one delay at random from the list and pause for that many seconds
time.sleep(np.random.choice([x/10 for x in range(7,22)]))

This is good to implement before moving on to your next webpage. I’ve placed mine at lines 71 and 86 (please refer to the scraper function cited at the end of this article).
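If you’d rather not build a list of delays, here is a minimal sketch of the same idea using the standard library’s random.uniform() instead of NumPy’s random.choice(). This is a swapped-in alternative, not the code from my scraper, and the names polite_pause and visit_all are made up for illustration. It pauses between consecutive pages rather than before the first one.

```python
import random
import time


def polite_pause(min_seconds=0.7, max_seconds=2.2, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def visit_all(urls, visit, pause=polite_pause):
    """Call visit(url) for each URL, pausing between consecutive pages."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            pause()  # wait before moving on to the next page
        results.append(visit(url))
    return results
```

With Selenium, something like visit_all(list_of_urls, driver.get) should slot in, assuming driver is the webdriver you already created and list_of_urls is your own list of pages.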

Change Your Headers

More specifically: switch your user agent. When we run driver.get(url), we send a request to that URL, and the request headers (including the user agent) tell the server what kind of client is asking. The server checks our headers and decides whether or not our request is granted access. Some sites may deny requests that arrive with a bot-like user agent such as “python-requests”. To replace this bot header with a human header, simply Google “my user agent” and use the string it shows as your user agent.

Use this user agent in your code
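To see what a “bot header” looks like outside of Selenium, here is a small sketch using Python’s built-in urllib (my choice for illustration, since it needs no extra installs). It only builds the request objects and prints the user agent each one would announce; nothing is sent over the network. The URL is a placeholder, and the browser string is just an example of what Google might show you.

```python
import urllib.request

# Placeholder URL; nothing is sent until urlopen() is called.
URL = "https://example.com/"
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
)

# The user agent urllib announces by default (it starts with "Python-urllib/").
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]

# A request object that announces a browser user agent instead.
request = urllib.request.Request(URL, headers={"User-Agent": BROWSER_UA})

print("default:", default_ua)
print("custom: ", request.get_header("User-agent"))
```

The default value is the kind of string a server can easily spot and refuse, which is why we swap it for a browser one.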

Since we’re using Selenium’s webdriver, we’ll import “Options” and copy + paste your user agent into the .add_argument() method, prefixed with --user-agent= so that Chrome treats it as the user agent setting. I recommend placing this block of code at the very beginning of your notebook:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
# Chrome reads the user agent from the --user-agent switch
opts.add_argument("--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36")

driver = webdriver.Chrome(options=opts)
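If you want to double-check that the switch took effect, one option is to ask the browser what it reports. The sketch below is an assumption-laden example rather than part of my scraper: the helper names user_agent_flag and reported_user_agent are made up, and it assumes Selenium and a matching Chrome driver are installed on your machine.

```python
def user_agent_flag(user_agent):
    """Build the Chrome command-line switch that overrides the user agent."""
    return f"--user-agent={user_agent}"


def reported_user_agent(user_agent):
    """Start Chrome with a custom user agent and return what the browser reports."""
    # Imported here so user_agent_flag() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument(user_agent_flag(user_agent))
    driver = webdriver.Chrome(options=opts)
    try:
        # navigator.userAgent is the string that page JavaScript sees.
        return driver.execute_script("return navigator.userAgent")
    finally:
        driver.quit()  # always close the browser, even if the script fails
```

Calling reported_user_agent() with the string you copied from Google should hand the same string back if the option was applied.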

YouTuber John Watson Rooney does an excellent job at explaining what is happening behind the scenes, and why we switch our headers, in his video: User Agent Switching — Python Web Scraping.

Side Note: In fact, everything John Rooney does with web scraping is pretty awesome, especially if you’re new to the scene. I learned the fundamentals of how to web scrape by following along in his video: Render Dynamic Pages — Web Scraping Product Links with Python. He has a TON of great material.

Don’t use your real log-in info!

A parting word of advice: If you are required to sign in to access the data you need, don’t use your real username and password. Say, for example, you’re web scraping glassdoor.com, which is a website that you personally use. Creating a separate login and password is a good fail-safe: if the scraping account gets blacklisted, you can still use the site later on with your own account.

I hope you find this article helpful in narrowing down what you need to know to avoid getting blocked by an anti-scraper (and some helpful code to get you started). As I mentioned before, there are certainly websites that have more advanced methods of catching web scrapers. For additional resources on the matter, I found the article How to scrape websites without getting blocked useful in understanding those more advanced cases.

Resources and Works Cited

This article is part of a series regarding a web scraping function I used for an NLP project scraping Glassdoor.com (complete scraper function at the end of this article). For in-depth explanations of different aspects of the function, please check out my other posts, which include Python tutorials.

From the Web:

My NLP Project:

Completed web scraping function
