web-crawler

Here are 1,038 public repositories matching this topic...

mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

markdown crawler data scraper ai html-to-markdown web-crawler scraping webscraping rag llm ai-scraping

Updated Jul 7, 2025
TypeScript

ScrapeGraphAI / Scrapegraph-ai

Python scraper based on AI

markdown crawler ai html-to-markdown web-crawler scraping web-scraping rag automated-scraper scraping-python web-crawlers llm ai-scraping

Updated Jul 3, 2025
Python

apify / crawlee

Crawlee: a web scraping and browser automation library for Node.js to build reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.

nodejs javascript npm crawler scraper automation typescript web-crawler headless scraping crawling web-scraping web-crawling headless-chrome apify puppeteer playwright

Updated Jul 8, 2025
TypeScript

crawlab-team / crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

go docker platform crawler spider web-crawler scrapy webcrawler scrapyd-ui webspider crawling-tasks crawlab spiders-management

Updated Jul 8, 2025
Go

ssssssss-team / spider-flow

A next-generation crawler platform that defines crawling workflows graphically, so crawlers can be built without writing code.

crawler spider web-crawler jsoup xpath webcrawler webspider web-spider spider-flow

Updated Jun 14, 2023
Java

BruceDone / awesome-crawler

A collection of awesome web crawlers and spiders in different languages

crawler scraper awesome spider web-crawler web-scraper node-crawler

Updated Jun 16, 2024

adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

ocr parser-library web-crawler parse-server whisper-api ingestion-api vision-transformer omniparser

Updated Jun 11, 2025
Python

apify / crawlee-python

Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP, in both headful and headless mode, with proxy rotation.

python crawler scraper automation web-crawler headless scraping crawling pip web-scraping beautifulsoup web-crawling hacktoberfest headless-chrome apify playwright

Updated Jul 8, 2025
Python

mendableai / firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

web-crawler web-scraping data-collection batch-processing content-extraction search-api claude llm-tools firecrawl model-context-protocol mcp-server firecrawl-ai javascript-rendering

Updated Jul 3, 2025
JavaScript

apache / nutch

Apache Nutch is an extensible and scalable web crawler

java hadoop web-crawler nutch crawling apache

Updated Mar 28, 2025
Java

sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Updated Sep 9, 2024
C#

jasonxtn / Argus

The Ultimate Information Gathering Toolkit

osint web-crawler whois-lookup virustotal information-gathering server-info dns-lookup reconnaissance cms-detection recon-tools email-harvester ssl-analitcs directory-finder txt-records pastebin-monitoring

Updated Oct 8, 2024
Python

xianhu / PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

python crawler multi-threading spider multiprocessing web-crawler proxies python-spider web-spider

Updated Jun 10, 2022
Python

MarginaliaSearch / MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

search-engine web-crawler indexer language-processing no-ai-used internet-search no-cloud self-hostable small-web alt-search

Updated Jul 7, 2025
HTML

Algebra-FUN / WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

web-crawler selenium weread book-downloader

Updated Sep 19, 2023
Python

apache / stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

java crawler web-crawler distributed apache-storm stormcrawler

Updated Jul 7, 2025
Java

platonai / PulsarRPA

PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖

web-crawler web-scraper web-scraping dom-manipulation web-crawling browser-automation dom-api ai-agents web-extraction rpa web-extractor llm ai-crawler text-to-action browser-use ai-rpa ai-browser-control

Updated Jun 29, 2025
Kotlin

gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

nodejs cli dockerfile crawler web-crawler archiving web-scraper web-scraping web-archiving scraping-websites single-file deno

Updated Jun 2, 2025
JavaScript

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler web-crawler crawling warc web-archiving webrecorder wacz

Updated Jul 8, 2025
TypeScript

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

ruby crawler scraper web spider web-crawler web-scraper web-scraping web-spider spider-links

Updated Jun 30, 2025
Ruby
