The Evolution of Web Scraping: From LWP::Simple to Headless Browsers

Web scraping has evolved from simple one-liners that fetched static HTML into a cat-and-mouse game involving headless browsers, fingerprint evasion, and infrastructure that scales like a distributed system. This article traces that evolution, covers the techniques that matter today, and walks through a fully working Snapchat Discover crawler that extracts post metadata while behaving like a real user.

Introduction

Every data engineer, security researcher, and intelligence analyst has at some point needed data that lives behind a web page. The practice of extracting that data programmatically — web scraping — is as old as the web itself. What started as fetching a URL and parsing its HTML has grown into a discipline with its own tools, countermeasures, and engineering challenges.

At Hunt-Benito, scraping is one of our core competencies. We’ve built crawlers for OSINT collection, price monitoring, competitive intelligence, and security assessment. Over the years the landscape has changed dramatically — and it’s worth understanding both where we came from and where we are now, because the techniques you need depend entirely on what you’re scraping and who’s trying to stop you.

Part 1: Historical Evolution

The Early Web (1993–2000): “Just Fetch the HTML”

The first search engines were crawlers. In 1993, Matthew Gray’s World Wide Web Wanderer (whose index was called Wandex) and Jonathon Fletcher’s JumpStation were among the earliest bots that systematically fetched web pages and indexed them. They were trivially simple: open a TCP socket, send an HTTP GET request, read the response, extract links, repeat.

The tools of this era were basic:

Tool	Language	Year	Notes
`LWP::Simple`	Perl	1995–1996	`get("http://example.com")` — one line to fetch a page
`wget`	C	1996	Recursive download, still widely used today
`libwww-perl` (LWP)	Perl	1995	Full HTTP client library with cookie jars, redirects
`curl`	C	1996	Command-line data transfer, the Swiss Army knife of HTTP
`Web::Scraper`	Perl	2009	CSS/XPath selectors on top of LWP
`Beautiful Soup`	Python	2004	HTML/XML parser tolerant of malformed markup

In this era, scraping was straightforward. Pages were static HTML. There were no CAPTCHAs, no rate limiters, no JavaScript-rendered content. You fetched the page, parsed the HTML, and extracted what you needed. A Perl one-liner could scrape a site:

use LWP::Simple;
my $html = get("http://example.com/products");
# parse $html with regexes or HTML::TreeBuilder

The only challenge was that the web was slow and unreliable. Timeouts, broken connections, and malformed HTML were the real enemies — not anti-bot systems.

The Ajax Era (2005–2012): JavaScript Changes Everything

Around 2005, two things changed the scraping landscape. First, AJAX (Asynchronous JavaScript and XML) became mainstream. Pages were no longer static HTML — they loaded content dynamically via JavaScript after the initial page load. Second, websites started caring about bots. The first generation of anti-scraping measures appeared: rate limiting, IP blocking, and user-agent filtering.

Tools adapted:

Tool	Language	Year	Approach
Scrapy	Python	2008	Full-featured crawling framework with middleware, pipelines, scheduling
Selenium RC	Java/JS/Python/Ruby	2004 (RC), 2009 (WebDriver)	Browser automation — could execute JavaScript
Mechanize	Python/Ruby/Perl	2003–2010	Headless HTML interaction with cookie/session management
PhantomJS	JavaScript	2011	Headless WebKit — the first widely-used headless browser for scraping

Scrapy was the big leap forward for Python-based scraping. It gave you a proper framework with:
- Spiders: classes that define how to crawl a site
- Middleware: pluggable components for retries, proxies, user-agent rotation
- Pipelines: post-processing hooks for cleaning, deduplicating, and storing data
- Scheduling: URL frontier management with politeness controls

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Scraping also reached the courts, although the best-known cases bracket this era rather than fall inside it. eBay v. Bidder’s Edge (2000) and hiQ Labs v. LinkedIn (filed in 2017) both shaped how courts treat automated access to publicly available data, a topic that remains legally complex today.

The Headless Browser Era (2013–2019): The Arms Race Intensifies

As JavaScript-heavy single-page applications (SPAs) became the norm, scrapers that only fetched HTML became insufficient. Content was rendered client-side, behind JavaScript execution, XHR requests, and complex state management. The solution: headless browsers.

Tool	Language	Year	Notes
Puppeteer	JavaScript (Node.js)	2017	Google’s Chrome DevTools Protocol wrapper — fast, reliable
Chrome Headless	C++	2017	Native headless mode in Chrome 59
Splash	Python	2013	Lightweight headless browser with an HTTP API, by Scrapinghub
Headless Chrome + Selenium	Multi	2017	Selenium 3+ with ChromeDriver
Playwright	JavaScript/Python/.NET	2020	Microsoft’s cross-browser automation framework

The headless browser changed everything. Suddenly you could:
- Execute JavaScript and wait for dynamic content to render
- Interact with pages (click, scroll, fill forms)
- Intercept network requests
- Take screenshots
- Handle complex authentication flows

But the arms race went both ways. Websites deployed increasingly sophisticated anti-bot systems:

Generation	Technique	Effectiveness
1st gen (2005–2010)	Rate limiting, IP blocking, User-Agent filtering	Low — trivial to bypass
2nd gen (2010–2015)	CAPTCHAs, honeypot links, JavaScript challenges	Medium — solvable with OCR and JS execution
3rd gen (2015–2020)	Browser fingerprinting, behavioral analysis, TLS fingerprinting	High — requires specialized tools
4th gen (2020–present)	ML-based bot detection, device fingerprinting, network-level analysis	Very high — constant evolution

[Figure: Scraping Evolution Timeline]

Part 2: Modern Web Scraping Techniques

The Modern Stack

Today’s production-grade crawler looks more like a distributed system than a simple script. The stack typically includes:

[Figure: Modern Scraping Stack]

Technique Comparison

Technique	Speed	JS Support	Anti-Bot Evasion	Complexity	Use Case
`requests` + parser	Very fast	None	Low	Low	Static HTML sites, APIs
`httpx` + parser	Fast	None	Low	Low	Modern async HTTP, HTTP/2
Scrapy	Fast	Limited	Medium	Medium	Large-scale crawls of static sites
Scrapy + Splash	Medium	Full	Medium	Medium	JS-rendered pages at scale
Playwright	Slow	Full	High	High	Complex interactions, anti-bot
Playwright + stealth	Slow	Full	Very high	Very high	Hardened targets
Commercial APIs	Varies	Varies	Handled	Low	When you don’t want to build

Anti-Bot Detection and Evasion

Modern anti-bot systems (Cloudflare, Akamai Bot Manager, PerimeterX, DataDome) look at multiple signals:

1. TLS Fingerprinting (JA3/JA4)

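Every TLS client advertises a characteristic combination of protocol version, cipher suites, extensions, and elliptic curves in its ClientHello message. Hashing those fields yields a fingerprint (the JA3 hash) that identifies the client software rather than the individual user. A Python requests session produces a different JA3 hash than Chrome, and anti-bot systems flag fingerprints that do not match a real browser.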

Countermeasures: Use curl_cffi (Python bindings for curl-impersonate), or run a real browser via Playwright.
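
To make the fingerprint concrete, here is a minimal sketch of how a JA3 hash is derived: the five ClientHello fields are written as decimal values, joined with dashes inside a field and commas between fields, and the resulting string is MD5-hashed. The field values below are made up for illustration and are not a capture from any real client.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five comma-separated fields,
    each a dash-separated list of decimal values."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the JA3 fingerprint (MD5 hex digest of the JA3 string)."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, for illustration only.
client_a = dict(version=771, ciphers=[4865, 4866, 49195],
                extensions=[0, 23, 65281, 10, 11],
                curves=[29, 23, 24], point_formats=[0])
# The same values with the cipher suites in a different order.
client_b = dict(client_a, ciphers=[49195, 4865, 4866])

print(ja3_string(**client_a))
print(ja3_hash(**client_a) == ja3_hash(**client_b))  # False: order is part of the fingerprint
```

Because ordering is part of the hash, copying a browser’s User-Agent header does nothing to change what the TLS layer reveals about the client.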

2. Browser Fingerprinting

Websites collect browser attributes: screen resolution, installed fonts, WebGL renderer, canvas hash, audio context, navigator properties. Headless browsers often have distinctive fingerprints (missing plugins, default viewport, no WebGL support).

Countermeasures: Use playwright-stealth or undetected-chromedriver to normalize fingerprints.

3. Behavioral Analysis

Anti-bot systems track mouse movements, scroll patterns, keystroke timing, and click intervals. A script that loads a page and immediately extracts data behaves differently from a human.

Countermeasures: Simulate human-like behavior — random delays, mouse movements, scrolling, natural click patterns.
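
One way to approximate a curved, human-looking cursor path is a cubic Bézier curve with randomly displaced control points. The sketch below is illustrative only: the function name, the `spread` parameter, and the coordinates are assumptions, not taken from any particular crawler.

```python
import random


def bezier_path(start, end, steps=25, spread=100.0, rng=random):
    """Return points along a cubic Bezier curve from start to end.

    Two control points are offset randomly from the straight line so the
    path bends instead of running directly to the target.
    """
    (x0, y0), (x3, y3) = start, end
    # Control points roughly one third and two thirds of the way along.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


path = bezier_path((100, 100), (640, 380))
print(len(path), path[0], path[-1])
```

With Playwright, each point could be passed to `page.mouse.move(x, y)` with a short randomized sleep between calls.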

4. Network-Level Signals

IP reputation, ASN classification (datacenter vs residential), request frequency patterns, and connection reuse behavior all contribute to bot scoring.

Countermeasures: Residential proxy rotation, request throttling, connection reuse consistent with real browsers.
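
Request throttling can be as simple as enforcing a jittered minimum gap per host. The class below is a minimal sketch under that assumption. `HostThrottle` and its default values are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import random
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum, jittered gap between requests to the same host."""

    def __init__(self, min_gap=2.0, jitter=1.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot: fixed floor plus a random extra.
        gap = self.min_gap + random.uniform(0, self.jitter)
        self._next_allowed[host] = now + delay + gap
        return delay
```

Calling `throttle.wait(url)` immediately before each request keeps per-host traffic below a fixed rate while avoiding perfectly regular intervals.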

Human-Like Crawling: The Key Principles

Evading anti-bot detection isn’t about breaking anything; it’s about making your crawler indistinguishable from a real user. The core principles:

- Delays with variance: Never use fixed intervals. Humans don’t click exactly every 2.0 seconds. Use random.uniform(1.5, 4.2) or sample from a normal distribution.
- Scroll naturally: Real users don’t jump to the bottom instantly. Scroll in variable-sized increments with micro-pauses between them.
- Mouse movement: Move the cursor in curved paths (Bézier curves), not straight lines. Dwell on elements before clicking.
- Session warmth: A real user doesn’t immediately hit the target URL. Visit the homepage first, wait, then navigate to the target page.
- Realistic headers: Match the headers your browser actually sends: Accept, Accept-Language, Accept-Encoding, sec-ch-ua, sec-fetch-*. Never send a library default such as the python-requests/2.31.0 User-Agent.
- Viewport and screen: Use a realistic viewport size (1920x1080, 1366x768) and don’t maximize the window programmatically.
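
As a small illustration of the first principle, the sketch below samples pauses from a normal distribution and clamps them to a plausible range. The mean, standard deviation, and bounds are assumed values chosen for the example, not measured human timings.

```python
import random


def human_delay(mean=2.8, stddev=0.9, low=1.0, high=6.0, rng=random):
    """Sample a pause length in seconds from a normal distribution,
    clamped to [low, high] so outliers stay plausible."""
    return min(high, max(low, rng.gauss(mean, stddev)))


random.seed(7)  # seeded only to make the demo repeatable
samples = [round(human_delay(), 2) for _ in range(5)]
print(samples)
```

In an async crawler the value would typically be passed to `await asyncio.sleep(human_delay())` between actions.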

Part 3: Real-World Application — A Snapchat Discover Crawler

Let’s put these techniques into practice. We’re going to build a crawler that scrapes post metadata from Snapchat’s Discover feed at https://www.snapchat.com/discover.

Why Snapchat Discover?

Snapchat’s Discover page is a good example of a modern scraping target:

- Server-side rendered Next.js app with embedded Apollo GraphQL data
- Anti-bot protections via Akamai (behind the scenes)
- Dynamic content loaded progressively as you scroll
- Rich metadata per post: media URLs, publisher info, timestamps, media types

The goal: extract metadata for 1000 unique posts while behaving like a human user.

Architecture

The crawler uses Playwright (headless Chromium) with human-like interaction patterns:

[Figure: Crawler Architecture]

Understanding the Data Model

Snapchat’s Discover page embeds all its data in a <script id="__NEXT_DATA__"> tag as a JSON blob. This is Apollo GraphQL’s normalized cache — a flat dictionary where every entity has a unique key:

Key Pattern	Type	Description
`ROOT_QUERY`	Query	Entry point — references the story feed
`SnapProStory:<username>`	Story	A public user’s story with snaps
`PremiumPublisherStory:<id>`	Story	A publisher/brand story (e.g., “Camping & More”)
`Snap:<id>`	Snap	Individual post — media URL, type, timestamps
`Publisher:<uuid>`	Publisher	Creator info — name, bio, snapcode
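
To show what "normalized" means here, the sketch below builds a tiny hand-written cache in the same shape and follows its `__ref` pointers. The entries are hypothetical and heavily trimmed; only the key patterns and field names come from the table and the extraction code in this article.

```python
# Hypothetical, heavily trimmed cache; real entries carry many more fields.
apollo_state = {
    "ROOT_QUERY": {"__typename": "Query"},
    "SnapProStory:alice": {
        "__typename": "SnapProStory",
        "id": "alice",
        "snaps": [{"__ref": "Snap:s1"}, {"__ref": "Snap:s2"}],
    },
    "Snap:s1": {"__typename": "Snap", "id": "s1", "snapMediaType": "VIDEO"},
    "Snap:s2": {"__typename": "Snap", "id": "s2", "snapMediaType": "IMAGE"},
}


def resolve(cache, value):
    """Follow a {"__ref": key} pointer into the cache; return other values unchanged."""
    if isinstance(value, dict) and "__ref" in value:
        return cache.get(value["__ref"], {})
    return value


story = apollo_state["SnapProStory:alice"]
snaps = [resolve(apollo_state, ref) for ref in story["snaps"]]
print([s["id"] for s in snaps])  # ['s1', 's2']
```

Stories never embed their snaps directly. They hold pointers, so every entity is stored once and can be deduplicated by key.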

A single page load yields ~10 stories with ~500 individual snaps. To reach 1000, the crawler will:
1. Load the page and extract the initial dataset
2. Scroll down to trigger infinite loading of more stories
3. Re-extract the dataset after each scroll
4. Deduplicate by snap ID
5. Stop at 1000 unique posts

The Crawler

Full source: https://github.com/Hunt-Benito/web-scraping-and-crawling-techniques-and-real-world-application

Key Code: Human-Like Scrolling

The most critical function is the human-like scroll simulation. Real users don’t teleport to the bottom — they scroll in variable-sized steps with pauses between them:

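import asyncio
import random

async def human_scroll(page, distance: int = 800):
    """Scroll down by `distance` pixels in uneven wheel steps with short pauses."""
    remaining = distance
    while remaining > 0:
        # Cap the step at what is left; randint(100, remaining) would raise
        # ValueError once fewer than 100 pixels remain.
        step = min(remaining, random.randint(100, 350))
        await page.mouse.wheel(0, step)
        remaining -= step
        await asyncio.sleep(random.uniform(0.08, 0.25))  # pause between wheel ticks
    await asyncio.sleep(random.uniform(0.5, 1.5))  # rest after the full gesture

Each scroll step is 100–350 pixels (a natural mouse wheel increment; the final step may be shorter), with 80–250ms between steps, and a longer 0.5–1.5s pause after completing a full scroll gesture.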

Key Code: Extracting Apollo Data

The extraction function reaches into the page’s DOM to pull the __NEXT_DATA__ JSON, then walks the Apollo cache to dereference all entities:

def extract_posts_from_apollo(apollo_state: dict) -> list[dict]:
    posts = []
    snap_ids_seen = set()

    for key, entity in apollo_state.items():
        if not isinstance(entity, dict):
            continue
        typename = entity.get("__typename", "")
        if typename not in ("SnapProStory", "PremiumPublisherStory"):
            continue

        story_meta = {
            "story_id": entity.get("id"),
            "story_type": typename,
            "story_title": entity.get("title"),
            "thumbnail_url": entity.get("thumbnailUrl"),
            "published_time": entity.get("publishedTimeInSec"),
        }

        creator_ref = entity.get("creator", {})
        if isinstance(creator_ref, dict) and "__ref" in creator_ref:
            creator = apollo_state.get(creator_ref["__ref"], {})
            story_meta["creator_name"] = creator.get("title") or creator.get("username")
            story_meta["creator_id"] = creator.get("businessProfileId") or creator.get("username")

        # "snaps" may be missing or null; each item is a {"__ref": "Snap:<id>"} pointer.
        for snap_ref in entity.get("snaps") or []:
            if not isinstance(snap_ref, dict):
                continue
            snap = apollo_state.get(snap_ref.get("__ref", ""), {})
            snap_id = snap.get("id")
            # Skip dangling references, snaps without an id, and duplicates.
            if not snap_id or snap_id in snap_ids_seen:
                continue
            snap_ids_seen.add(snap_id)

            post = {**story_meta}
            post["snap_id"] = snap_id
            snap_urls = snap.get("snapUrls") or {}
            post["media_url"] = snap_urls.get("mediaUrl")
            post["media_preview_url"] = snap_urls.get("mediaPreviewUrl")
            post["media_type"] = snap.get("snapMediaType")
            posts.append(post)

    return posts
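
The function above starts from an already-parsed cache. A hedged sketch of the step before it, pulling the `__NEXT_DATA__` JSON out of the page HTML and locating the cache inside it, might look like the following. Where the cache sits inside the blob is not documented here, so the sketch searches for the first dictionary containing a `ROOT_QUERY` key instead of assuming a path. The helper names and the sample HTML are hypothetical.

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def load_next_data(html: str) -> dict:
    """Parse the JSON payload of the __NEXT_DATA__ script tag."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("__NEXT_DATA__ script tag not found")
    return json.loads(match.group(1))


def find_apollo_state(node):
    """Depth-first search for the first dict that has a ROOT_QUERY key."""
    if isinstance(node, dict):
        if "ROOT_QUERY" in node:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_apollo_state(child)
        if found is not None:
            return found
    return None


# Hypothetical page skeleton; the nesting under "props" is invented.
html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"cache": {"ROOT_QUERY": {}, "Snap:s1": {"id": "s1"}}}}}'
    "</script></body></html>"
)
state = find_apollo_state(load_next_data(html))
print(sorted(state))  # ['ROOT_QUERY', 'Snap:s1']
```

With Playwright, the `html` string would come from `await page.content()`, and the returned dictionary is what `extract_posts_from_apollo` expects as `apollo_state`.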

Key Code: Session Warm-Up

Before hitting the Discover page, the crawler warms up the session by visiting the homepage first — just like a real user would:

async def warm_up_session(page):
    await page.goto("https://www.snapchat.com/", wait_until="networkidle")
    await asyncio.sleep(random.uniform(2.0, 4.0))
    await page.mouse.move(
        random.randint(200, 600), random.randint(200, 400),
        steps=random.randint(15, 30)
    )
    await asyncio.sleep(random.uniform(1.0, 2.5))

Running the Crawler

Prerequisites:

$ pip install playwright
$ playwright install chromium

Basic run (direct connection):

$ python snapchat_discover_crawler.py

Run with proxy for traffic inspection:

$ mitmdump -p 8080 &
$ python snapchat_discover_crawler.py --proxy http://127.0.0.1:8080

Expected output:

[*] Snapchat Discover Crawler v1.0
[*] Proxy: http://127.0.0.1:8080 (SSL errors ignored)
[*] Target: 1000 unique posts
[*] Warming up session... visiting snapchat.com
[*] Session warmed up (3.2s)
[*] Navigating to snapchat.com/discover
[*] Page loaded (4.8s)
[*] Extracted 487 posts from initial load
[*] Scrolling to load more content...
[*] Scroll 1: 612 total posts (+125 new)
[*] Scroll 2: 798 total posts (+186 new)
[*] Scroll 3: 956 total posts (+158 new)
[*] Scroll 4: 1134 total posts (+178 new)
[*] Target reached: 1134 unique posts collected
[*] Saved 1000 posts to snapchat_discover.json
[*] Total time: 47.3s
[*] Done.

Sample Output

The crawler produces a JSON file with structured metadata for each post:

{
  "crawl_metadata": {
    "timestamp": "2026-05-16T14:32:00Z",
    "total_posts": 1000,
    "source_url": "https://www.snapchat.com/discover",
    "proxy_used": "http://127.0.0.1:8080"
  },
  "posts": [
    {
      "story_id": "hassukhan.7",
      "story_type": "SnapProStory",
      "story_title": null,
      "creator_name": "hassukhan.7",
      "thumbnail_url": "https://cf-st.sc-cdn.net/d/...",
      "published_time": 1778879721,
      "snap_id": "If7GVneyTLK177u4-e9MZwA...",
      "media_url": "https://cf-st.sc-cdn.net/o/...",
      "media_preview_url": "https://cf-st.sc-cdn.net/d/...",
      "media_type": "VIDEO"
    }
  ]
}

Anti-Detection Techniques Used

Technique	Implementation	Why
Proxy with SSL bypass	`ignore_https_errors=True` + proxy config	Allows mitmproxy inspection without cert errors
Session warm-up	Visit homepage first, then navigate to target	Avoids the “direct hit” pattern bots exhibit
Variable delays	`random.uniform()` for all waits and pauses	No timing fingerprints
Natural scrolling	Variable step sizes (100–350px) with micro-pauses	Matches human scroll wheel behavior
Mouse movement	Curved paths with random waypoints during warm-up	Fools mouse-tracking anti-bot systems
Realistic headers	Playwright sends genuine Chrome headers automatically	No `Python-requests` user-agent leakage
Page wait strategy	`wait_until="networkidle"` after navigation	Ensures all dynamic content is loaded
Exponential backoff	Longer pauses between successive scrolls	Mimics diminishing engagement
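
The proxy and viewport rows in the table map onto Playwright’s browser-context options. The sketch below shows one way to wire them up. `build_context_options` and `open_page` are hypothetical helper names, and the viewport and locale values are assumptions. Certificate checks are relaxed only when a proxy is supplied.

```python
def build_context_options(proxy_url=None):
    """Keyword arguments for Playwright's browser.new_context()."""
    options = {
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
    }
    if proxy_url:
        # Only relax certificate checks when traffic is routed through an
        # intercepting proxy such as mitmproxy.
        options["proxy"] = {"server": proxy_url}
        options["ignore_https_errors"] = True
    return options


async def open_page(proxy_url=None):
    """Launch headless Chromium and return (playwright, browser, page)."""
    # Imported lazily so the option builder works without Playwright installed.
    from playwright.async_api import async_playwright

    pw = await async_playwright().start()
    browser = await pw.chromium.launch(headless=True)
    context = await browser.new_context(**build_context_options(proxy_url))
    page = await context.new_page()
    return pw, browser, page


print(build_context_options("http://127.0.0.1:8080"))
```

The caller is responsible for `await browser.close()` and `await pw.stop()` once the crawl finishes.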

Sources

Scrapy Documentation: https://docs.scrapy.org/
Playwright Documentation: https://playwright.dev/python/
Puppeteer Documentation: https://pptr.dev/
JA3 TLS Fingerprinting: https://engineering.salesforce.com/tls-fingerprinting-with-ja3-and-ja3s-247362855967/
JA4 Fingerprinting: https://engineering.salesforce.com/ja4-network-fingerprinting/
curl-impersonate: https://github.com/lwthiker/curl-impersonate
Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/
Snapchat Discover: https://www.snapchat.com/discover
hiQ Labs v. LinkedIn (2017): https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
eBay v. Bidder’s Edge (2000): https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge
OWASP Web Security Testing Guide: https://owasp.org/www-project-web-security-testing-guide/
Scrapinghub/Splash: https://github.com/scrapinghub/splash
World Wide Web Wanderer (Wandex index): https://en.wikipedia.org/wiki/World_Wide_Web_Wanderer
Cloudflare Bot Management: https://www.cloudflare.com/products/bot-management/
undetected-chromedriver: https://github.com/ultrafunkamsterdam/undetected-chromedriver