Choosing the Right Web Scraping Approach
When it comes to web scraping, there are two main approaches: using HTTP requests directly (with libraries like requests
in Python) or automating a browser (using tools like Playwright or Selenium). Each approach has its strengths and weaknesses, and choosing the right one can save you hours of development time and maintenance headaches.
HTTP Requests: Keeping It Simple
The most straightforward approach is using HTTP requests. This means sending requests directly to the server and parsing the HTML response. Here's a simple example using Python's requests
library:
import requests
response = requests.get('https://quotes.toscrape.com/page/1/')
print(response.text)
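The response is just raw HTML, so in practice you'd feed it to a parser rather than print it. Here's a minimal sketch using BeautifulSoup (an extra dependency I'm assuming here; the CSS classes are the ones quotes.toscrape.com happens to use):
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com/page/1/')
soup = BeautifulSoup(response.text, 'html.parser')

# Each quote on the page sits in a div with the "quote" class
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text()
    author = quote.select_one('small.author').get_text()
    print(f'{author}: {text}')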
It's fast, easy to implement, and easy to deploy. Truly great, until you hit a site that requires authentication or has anti-bot measures.
Handling Authentication
Authentication is a pain in the butt. To handle it with HTTP requests, you'll need to reverse engineer the login request, extract the necessary tokens and cookies, and then use them in your requests. Sometimes, these cookies are added by JavaScript, so good luck reverse-engineering where and how they are set. Some smartass websites even set these cookies within arbitrary CSS file responses.
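For a concrete (and friendly) example, the quotes.toscrape.com demo has a login form with a hidden CSRF token. A rough sketch of reverse engineering it with a requests session might look like this; the csrf_token field name and the "any credentials work" behaviour are specific to that demo, so treat them as assumptions for a real site:
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Load the login page to get a session cookie and the hidden CSRF token
login_page = session.get('https://quotes.toscrape.com/login')
soup = BeautifulSoup(login_page.text, 'html.parser')
token = soup.find('input', {'name': 'csrf_token'})['value']

# Replay the login request the way the browser would send it
session.post('https://quotes.toscrape.com/login', data={
    'csrf_token': token,
    'username': 'admin',  # the demo accepts any credentials
    'password': 'admin',
})

# The session now carries the authenticated cookies for further requests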
Handling Anti-Bot Measures
At this point, it's very unlikely that HTTP requests will suffice. Anti-bot systems rely on JavaScript challenges and browser fingerprinting, so if you aren't running a browser, you're screwed. That's where browser automation comes in.
Browser Automation: Power and Reliability
To render pages and execute JavaScript, you'll need to use browser automation tools, like Playwright.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium instance and open a fresh page
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate and dump the fully rendered HTML
    page.goto('https://quotes.toscrape.com/page/1/')
    print(page.content())

    browser.close()
The main advantage is that you get a rendered page with all the content filled in by JavaScript frameworks, and you can interact with the page just like a real user would: click, scroll, fill out forms, etc. No need to reverse engineer anything; you just write code to push buttons and type text.
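For example, logging in through the UI is just filling in the form and clicking the button. Here's a quick sketch against the quotes.toscrape.com demo login (the selectors are my assumptions about that page, not something universal):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/login')

    # Type into the form and submit it, just like a user would
    page.fill('#username', 'admin')
    page.fill('#password', 'admin')
    page.click('input[type="submit"]')

    # Wait for the post-login navigation before reading anything off the page
    page.wait_for_load_state()
    print(page.url)

    browser.close()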
However, it can be sooo slow, because it needs to download all the assets and execute all the JavaScript. It's also harder to deploy, because you need to manage browser dependencies (I even have an article on Playwright setup for AWS Lambda).
But once you set it up, it's pretty easy to maintain: just update the selectors when needed and you're good to go.
Pro Tip: Intercepting Requests
When you're using browser automation, you might still want to get raw HTTP responses for some requests (for example, to scrape a JSON endpoint).
To do that, you can use the browser's network interception features. Here's an example using Playwright:
with page.expect_response(
    lambda response: "/api/quotes" in response.url and response.status == 200
) as response_info:
    # Trigger the navigation inside the block, so the listener is
    # registered before the response can arrive
    page.goto('https://example.com')

data = response_info.value.json()
Hybrid Approach
My rule of thumb is to take the best of both worlds:
- Use browser automation for complex flows, like authentication and bypassing detection.
- Use HTTP requests for everything else.
This means that once you're past the login and the anti-bot checks, you can switch to HTTP requests.
import requests

# ... log in ...
cookies = page.context.cookies()

# Hand the browser's cookies over to a plain requests session
session = requests.Session()
session.cookies.update({cookie["name"]: cookie["value"] for cookie in cookies})

session.get(...)
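One small detail I sometimes carry over as well, purely as a precaution, is the User-Agent the browser actually used, so the follow-up HTTP requests don't look wildly different from the browser that earned the cookies (whether this matters depends entirely on the site):
# Optional: reuse the browser's User-Agent for the requests session
user_agent = page.evaluate("() => navigator.userAgent")
session.headers.update({"User-Agent": user_agent})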
Conclusion
In my experience, the hybrid approach is the best way to go. It allows you to use the speed of HTTP requests for the bulk of the work and the reliability of browser automation for the complex stuff.
I hope to write more about web scraping, so stay tuned!