Web Scraping

6 open source tools compared. Sorted by stars. Scroll down for our analysis.

Tool	Stars	Velocity	Language	License	Score
Scrapy: Fast web crawling and scraping framework	62.5k	+240/wk	Python	BSD-3-Clause	86
browser	31.4k	+180/wk	Zig	AGPL-3.0	73
Crawlee: Web scraping and browser automation for Node.js	24.0k	+219/wk	TypeScript	Apache-2.0	83
crawlee-python	9.2k	+45/wk	Python	Apache-2.0	77
httrack: HTTrack Website Copier, copy websites to your computer (Official repository)	4.5k	+17/wk	C	GPL-3.0	62
kage: Shadow any website for offline viewing, with the JavaScript stripped out	2.4k	+735/wk	Go	MIT	68

Our Analysis

Scrapy 62.5k★

Scrapy is the Python framework that most serious web scraping projects end up using. It's not a simple HTTP library. It's an entire crawling engine with request scheduling, middleware pipelines, built-in rate limiting, and export to JSON, CSV, or (through item pipelines) databases. It's fully free under the BSD 3-Clause license, it has nearly two decades of development behind it, and the documentation is excellent. You define 'spiders' that describe how to navigate and extract data from sites, and Scrapy handles the concurrency, retries, and data pipeline.

The catch: Scrapy is asynchronous and built on Twisted, which has a learning curve if you're used to simple requests + BeautifulSoup scripts. JavaScript-rendered pages need an add-on such as scrapy-splash or scrapy-playwright. And the biggest catch isn't technical: it's legal. Scraping at scale runs into rate limits, CAPTCHAs, IP bans, and terms of service issues. Scrapy gives you the tools but not the permission.

For light scraping, requests + BeautifulSoup is simpler. For managed scraping with anti-bot handling, look at Crawlee.
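
To make the spider idea concrete, here is a minimal sketch in plain standard-library Python. It is not Scrapy's API: the `parse` function, the `LinkAndTitleParser` class, and the example.com URLs are hypothetical names chosen for illustration. It only shows the shape of a spider callback, which yields the extracted item and then the URLs to visit next. In a real framework, fetching, scheduling, and retries would happen outside this function.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every <a href> found on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html, base_url):
    """Spider-style callback: yield one item, then the URLs to crawl next."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    # The extracted item (a dict), as it would be handed to an export pipeline.
    yield {"url": base_url, "title": parser.title.strip()}
    # Follow-up URLs, resolved against the page they were found on.
    for href in parser.links:
        yield urljoin(base_url, href)


# Hypothetical page content, used instead of a live request.
SAMPLE_HTML = (
    "<html><head><title> Books </title></head><body>"
    '<a href="/page/2">next</a><a href="about.html">about</a>'
    "</body></html>"
)

results = list(parse(SAMPLE_HTML, "https://example.com/catalog/"))
for entry in results:
    print(entry)
```

The callback never touches the network, which is the point of the spider model: extraction logic stays small and testable, and the engine around it decides when and how often to fetch.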

browser 31.4k★

Lightpanda (the repository is simply named browser) is a headless browser built from scratch in Zig specifically for speed. We're talking 10-50x faster than Chromium-based headless browsers for page loading and JavaScript execution, at a fraction of the memory. The target audience is anyone running browsers at scale: scraping pipelines, automated testing farms, AI agents that need to browse the web. When you're paying per minute of compute and per GB of RAM, a browser that uses 90% less of both changes your infrastructure costs dramatically. It's licensed under AGPL-3.0, and the team offers a cloud service alongside the open source browser.

The catch: it's not a full browser. It doesn't render pixels, so there are no screenshots and no visual testing. JavaScript support is growing but not at Chrome-level compatibility, and sites with complex JS frameworks may not work correctly yet. And AGPL means that if you modify it and offer it to users over a network, you must publish your changes under the same license.

Crawlee 24.0k★

Crawlee is a Node.js web scraping and browser automation library. It's not a point-and-click scraper. It's a full framework for building crawlers that handle the hard stuff: request queuing, proxy rotation, browser fingerprinting, error recovery, and storage. It supports three modes: plain HTTP requests (fast, for simple pages), Cheerio (HTML parsing without a browser), and full browser automation via Playwright or Puppeteer (for JavaScript-heavy sites). You pick the right tool for each job.

It's built by Apify, who run a web scraping platform. They open-sourced their crawler framework and it's legitimately good. It's completely free and Apache-2.0 licensed, with no feature gates. You can deploy crawlers anywhere: your server, AWS, or Apify's cloud platform (which is paid but optional).

The catch: this library is Node.js only. Python developers should look at crawlee-python or Scrapy. Also, web scraping is inherently fragile; sites change, CAPTCHAs evolve, rate limits tighten. Crawlee gives you the tools to handle this, but you still need to maintain your scrapers. Apify Cloud is the easy button for deployment ($49/mo+), but self-hosting works fine.

crawlee-python 9.2k★

crawlee-python gives you a Python framework that handles the ugly parts: browser automation, request queuing, proxy rotation, and anti-bot countermeasures. It's what you build on top of instead of writing raw Selenium or Playwright scripts from scratch. It's Apache-2.0 licensed and built by Apify, who also sell a cloud scraping platform, but the library runs fine without it.

It supports both HTTP crawling (fast, lightweight) and browser-based crawling (Playwright under the hood for JavaScript-heavy sites). Automatic retries, request deduplication, and session management come built in.

It's fully free to use, with no gated features and no paid tier for the library itself. Apify's cloud platform is a separate product that you never need. Solo developers get a production-grade scraping framework for $0. Teams of any size benefit from the built-in queue management and error handling that they'd otherwise build themselves.

The catch: this is the Python port of Crawlee, which started as a Node.js library. The Python version is younger, and the ecosystem of plugins and examples is smaller. If you're in the Node world, the original Crawlee (also by Apify) is more mature. And Apify's cloud integration is deeply embedded; you'll see references to it everywhere in the docs even if you never use it.
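
As a rough illustration of what "request deduplication" and "automatic retries" save you from writing, here is a small standard-library sketch. It is not crawlee-python's API: the `RequestQueue` class, its `max_retries` parameter, and `flaky_handler` are hypothetical names invented for this example, and the handler simulates failures instead of making real requests.

```python
from collections import deque
from urllib.parse import urldefrag


class RequestQueue:
    """Toy crawl queue: skips URLs already seen and retries failed ones."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self._pending = deque()
        self._seen = set()
        self._attempts = {}

    def add(self, url):
        """Enqueue a URL once; '#fragment' variants count as duplicates."""
        key, _fragment = urldefrag(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def run(self, handler):
        """Process every queued URL, re-queueing failures up to max_retries."""
        done, failed = [], []
        while self._pending:
            url = self._pending.popleft()
            try:
                handler(url)
                done.append(url)
            except Exception:
                attempts = self._attempts.get(url, 0) + 1
                self._attempts[url] = attempts
                if attempts <= self.max_retries:
                    self._pending.append(url)  # try again later
                else:
                    failed.append(url)  # give up after the last retry
        return done, failed


calls = {}


def flaky_handler(url):
    """Simulated fetch: '/flaky' fails once, '/broken' fails every time."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/flaky") and calls[url] == 1:
        raise ConnectionError("temporary failure")
    if url.endswith("/broken"):
        raise ConnectionError("permanent failure")


queue = RequestQueue(max_retries=2)
added = [
    queue.add(u)
    for u in (
        "https://example.com/a",
        "https://example.com/a#top",  # duplicate of the first URL
        "https://example.com/flaky",
        "https://example.com/broken",
    )
]
done, failed = queue.run(flaky_handler)
print(added, done, failed)
```

A production framework adds persistence, backoff delays, and per-session state on top of this, which is the part that gets tedious to build and maintain by hand.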

httrack 4.5k★

HTTrack copies an entire website to your hard drive so you can browse it offline. Point it at a URL and it recursively pulls down the HTML, images, and assets, rebuilding the link structure so the local mirror works like the live site. It's been doing this one job well since the late 1990s, on Windows as WinHTTrack and on Linux as WebHTTrack.

There's nothing to set up beyond installing it. It resumes interrupted downloads, updates existing mirrors, and gives you filters to control how deep it crawls and what it grabs. For a tool this old, it's remarkably dependable, which is why it's still the default answer for site archiving.

Anyone who needs a local copy of a site, whether for offline reading, archiving before a shutdown, or grabbing documentation you can't count on staying up, gets exactly that. It's not a modern scraper for structured data extraction; reach for Scrapy or Playwright if you need to parse and transform what you pull. HTTrack mirrors; it doesn't analyze.

The catch: the web moved on. Heavy JavaScript sites that render content client-side don't mirror cleanly, because HTTrack grabs what the server sends, not what a browser builds. It shines on traditional, server-rendered sites and struggles with modern single-page apps. Know which kind you're pointing it at.
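
"Rebuilding the link structure" comes down to two steps: deciding where each URL lives on disk, and rewriting links so they point at those files with relative paths. The sketch below shows one simple way to do that with the Python standard library. It is an assumption-laden illustration, not HTTrack's actual algorithm: `local_path` and `rewrite_link` are hypothetical helpers, and the sketch ignores query strings and non-HTML edge cases.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def local_path(url):
    """Map a URL to a file path inside the mirror folder (query ignored)."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory URLs need a file to hold them
    return posixpath.join(parts.netloc, path.lstrip("/"))


def rewrite_link(page_url, href):
    """Return the href to write into the saved copy of page_url."""
    target = urljoin(page_url, href)
    if urlparse(target).netloc != urlparse(page_url).netloc:
        return target  # off-site or non-HTTP link: leave it untouched
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target), start=page_dir)


page = "https://example.com/docs/intro.html"
print(local_path(page))
print(rewrite_link(page, "/img/logo.png"))
print(rewrite_link(page, "guide/"))
print(rewrite_link(page, "https://other.org/x"))
```

Relative links are what let the mirror be moved or zipped without breaking, since nothing in the saved pages depends on the folder's absolute location.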

kage 2.4k★

kage saves websites that actually still work offline. Hit "save page" in a browser and half the time you get a broken mess later, because the JavaScript phones home to servers that are gone. kage renders the page in headless Chrome, grabs the finished result, strips every script, and pulls the CSS, images, and fonts down to disk. What's left is a static copy that doesn't rot.

It's written in Go, MIT licensed, and free. It's a command-line tool that does more than dump a folder. It can output a browsable mirror, a compressed ZIM archive, or a self-contained executable you can hand to someone, and it'll serve the clone locally so you can preview it. It respects robots.txt and sitemaps while crawling, and the output is deterministic, so the same site produces a byte-identical archive you can checksum. That's a thoughtful detail most scrapers skip.

It's useful for archivists, researchers, or anyone who wants a permanent, readable copy of something before it disappears or changes. It's a single-purpose CLI, free for everyone, with nothing to buy. Run it when you need a snapshot you can trust to still open in five years.

The catch is what it removes. Stripping JavaScript is the whole point, but it also means anything that genuinely needs scripts to function, such as a web app or an interactive tool, won't survive the trip. kage preserves pages you read, not apps you use. Know which one you're archiving before you run it.
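
To show why deterministic output matters, here is a short Python sketch of the general technique: write files in sorted order with fixed metadata, so the same content always produces the same bytes and the same checksum. This is not how kage is implemented (kage is written in Go, and its formats are not reproduced here). ZIP stands in as a convenient format, and `build_archive` and `FIXED_TIME` are hypothetical names for this example.

```python
import hashlib
import io
import zipfile

# Assumption for this sketch: every entry gets the same fixed timestamp,
# so the time of the crawl never leaks into the archive bytes.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def build_archive(files):
    """Pack {name: bytes} into a ZIP whose bytes depend only on the content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # fixed order, whatever the crawl order was
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_TIME)
            info.external_attr = 0o644 << 16  # fixed file permissions
            archive.writestr(info, files[name])
    return buffer.getvalue()


# The same hypothetical site, captured in two different orders.
files_a = {
    "index.html": b"<h1>Hello</h1>",
    "css/site.css": b"h1 { color: black; }",
}
files_b = dict(reversed(list(files_a.items())))

digest_a = hashlib.sha256(build_archive(files_a)).hexdigest()
digest_b = hashlib.sha256(build_archive(files_b)).hexdigest()
names = zipfile.ZipFile(io.BytesIO(build_archive(files_b))).namelist()
print(digest_a == digest_b, names)
```

With byte-identical output, a stored checksum can later confirm that an archive has not been altered or corrupted, and a changed checksum means the site itself changed.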