Universal web scraper with LLM-ready markdown, RAG chunking, PDF/DOCX support.
Config is the same across clients — only the file and path differ.
```json
{
  "mcpServers": {
    "io-github-manchittlab-thecrawler": {
      "command": "<see-readme>",
      "args": []
    }
  }
}
```
Scrape any webpage and extract every data point: text content, links, images, meta tags, headings (h1-h6), HTML tables, JSON-LD structured data, email addresses, and phone numbers. CSS selector targeting for specific content. Recursive crawling to follow internal links. $0.003/page.
| Data | Description |
|---|---|
| Text | All visible text (scripts/styles stripped), up to 50K chars |
| Links | Every `<a>` tag — href, anchor text, internal/external flag |
| Images | Every `<img>` — src, alt text, width, height |
| Meta tags | All `<meta>` — description, og:title, keywords, robots, etc. |
| Headings | All h1–h6 with level and text |
| Tables | HTML tables as structured arrays (headers + rows) |
| JSON-LD | Schema.org structured data from `<script type="application/ld+json">` |
| Emails | Email addresses found anywhere in the HTML |
| Phones | Phone numbers (7+ digits) found in the HTML |
| Selected | Content matching your CSS selector |
Every extraction type can be toggled on/off.
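For example, a text-and-links-only scrape can disable the other extractors (field names as listed in the parameter table below) — a sketch:

```json
{
  "urls": ["https://example.com"],
  "extractImages": false,
  "extractMeta": false,
  "extractHeadings": false,
  "extractTables": false,
  "extractStructuredData": false,
  "extractEmails": false,
  "extractPhones": false
}
```

Turning off unused extractors keeps the response small, which matters when feeding results directly into an LLM context window.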
Scrape a single page:
```json
{
  "urls": ["https://example.com"]
}
```
Crawl a site (follow links):
```json
{
  "urls": ["https://example.com"],
  "maxDepth": 2,
  "maxPages": 50
}
```
Target specific content:
```json
{
  "urls": ["https://example.com"],
  "cssSelector": ".main-content"
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| urls | array | (required) | URLs to scrape |
| extractText | boolean | true | Visible text content |
| extractLinks | boolean | true | All links with anchor text |
| extractImages | boolean | true | All images with alt/dimensions |
| extractMeta | boolean | true | Meta tags |
| extractHeadings | boolean | true | h1–h6 headings |
| extractTables | boolean | true | HTML tables as arrays |
| extractStructuredData | boolean | true | JSON-LD schema.org data |
| extractEmails | boolean | true | Email addresses |
| extractPhones | boolean | true | Phone numbers |
| cssSelector | string | (optional) | Target a specific element |
| maxDepth | integer | 0 | 0 = listed URLs only; 1+ = follow internal links to that depth |
| maxPages | integer | 100 | Max pages to scrape in total |
| dryRun | boolean | false | Scrape without charges |
$0.003 per page scraped (pay-per-event pricing).