The MCP Web Scraper is a simple and scalable web tool built with Spring Boot, providing reliable and efficient web data extracting and structuring capabilities for various AI-driven tasks.
This tool uses Playwright for browser automation, allowing it to handle modern, JavaScript-heavy websites with ease. It exposes a simple REST API for initiating web scrapes, making it easy to integrate with other services, such CLI Agents, Chatbot service etc.
- Simple Web Scraping: Uses
Playwrightto render and scrape modern web pages. - Simple Search Integration: Searches
DuckDuckGoto find relevant pages before scraping. - RestFul API: Provides a simple API for initiating scrapes.
- Scalable Architecture: Utilizes multiple Playwright instances and
hash routingto handleconcurrent requestssafely. (Will replace this withThreadPoolandExecutorService) - AI: To be integrated with AI models for
structured data extractionand analysis. (TODO)
flowchart TD
User[User Query] --> API[ServiceController]
API -->|Generate RequestID| UID[Unique Request ID<br/>Ex: 1812403247]
UID --> Router{Hash Router<br/>ID % Total Instances<br/>let mut n: i32 = 10}
Router -->|ID % n = 7| P7[Browser Instance #7<br/>SearchEngine]
Router -->|ID % n = 2| P2[Browser Instance #2<br/>SearchEngine]
Router -->|ID % n = 5| P5[Browser Instance #5<br/>SearchEngine]
P7 --> Search7[ScrapeEngine]
P2 --> Search2[ScrapeEngine]
P5 --> Search5[ScrapeEngine]
Search7 --> Scrape7[Content Scraping]
Search2 --> Scrape2[Content Scraping]
Search5 --> Scrape5[Content Scraping]
Scrape7 --> Result[Aggregated Results]
Scrape2 --> Result[Aggregated Results]
Scrape5 --> Result[Aggregated Results]
Result --> API --> User
style User fill:black
style API fill:grey
style Router fill:black
style P7 fill:black
style P2 fill:black
style P5 fill:black
style Result fill:grey
Pseudo code
public PlaywrightBrowserSearchTools getSearchInstance(int requestId) {
int index = Math.abs(requestId % instances);
return searchTools[index]; // allocate a browser to req directly, prone to conflict
}flowchart TD
User[User Query] --> API[ServiceController]
API -->|Generate RequestID| UID[Unique Request ID<br/>Ex: 1812403247]
UID --> Allocator[PlaywrightAllocator<br/>Resource Manager]
Allocator --> Semaphore{Semaphore Gate<br/>Max nth Concurrent}
Semaphore -->|Wait in Queue| Queue[Request Queue<br/>60s Timeout]
Semaphore -->|Available Slot| Available[Slot Available]
Available --> InstancePool[Instance Pool<br/>nth Browser Instances]
InstancePool --> Lock1{Instance #1<br/>Available?}
InstancePool --> Lock2{Instance #2<br/>Available?}
InstancePool --> Lock3{Instance #3<br/>Available?}
Lock1 -->|Acquired| Work1[Perform Search<br/>& Scraping]
Lock2 -->|Acquired| Work2[Perform Search<br/>& Scraping]
Lock3 -->|Acquired| Work3[Perform Search<br/>& Scraping]
Work1 --> Release1[Release Instance #1<br/>Return to Pool]
Work2 --> Release2[Release Instance #2<br/>Return to Pool]
Work3 --> Release3[Release Instance #3<br/>Return to Pool]
Release1 --> Metrics[Usage Metrics<br/>Tracking & Monitoring]
Release2 --> Metrics[Usage Metrics<br/>Tracking & Monitoring]
Release3 --> Metrics[Usage Metrics<br/>Tracking & Monitoring]
Metrics --> Result[Response with Stats]
Result --> API --> User
Queue -->|Timeout| TimeoutError[Request Timeout<br/>503 Service Unavailable]
TimeoutError --> API
style User fill:black
style API fill:grey
style Semaphore fill:black
style Queue fill:grey
style InstancePool fill:black
style Metrics fill:grey
style TimeoutError fill:grey
Pseudo code
public PlaywrightBrowserSearchTools borrowSearchInstance(int requestId) {
// checks if available or busy, conflict-proof
if (searchSemaphore.tryAcquire(60, TimeUnit.SECONDS)) return findAndLockAvailableInstance(requestId);
return null; // timeout
}- Backend: Spring Boot 3.5.5
- Language: Java 24
- Web Scraping: Playwright
- Search Engine: DuckDuckGo (Reliable)
- Build Tool: Maven
- Cloud Service: Oracle Cloud (Ubuntu Environment)
- Java 24+
- Maven
mvn clean install -DskipTests -Bmvn spring-boot:rundocker-compose -f docker-compose.dev.yaml up --builddocker-compose -f docker-compose.prod.yaml up --builddocker build -t webMCP_scraper:latest .Run development container
docker run -d \
--name webscraper-dev \
-p 3000:3000 \
-e SPRING_PROFILES_ACTIVE=dev \
-e playwright.instances=15 \
-v $(pwd)/logs:/app/logs \
--init --ipc=host \
simpl-webscraper:latestRun production container
docker run -d \
--name webscraper-prod \
-p 3001:3000 \
-e SPRING_PROFILES_ACTIVE=prod \
-e playwright.lockInstances=20 \
-v $(pwd)/logs:/app/logs \
--init --ipc=host \
simpl-webscraper:latestThe application (default) will be available at http://localhost:3000.
This endpoint allows you to initiate a web data scrape, with source, snippet and contents.
Request Body:
{
"query": "Best foods for hamsters",
"results": 3
}Example curl command:
curl -X POST -H "Content-Type: application/json" -d '{
"query": "what are mcp server?"
"results": 5
}' http://localhost:3000/api/v1/service/searchExample Response:
{
"responseId": 253600510,
"userQuery": "Best foods for hamsters",
"searchResultList": [
{
"success": true,
"source": "https://www.thesprucepets.com/feeding-pet-hamsters-1238968",
"snippet": "Hamster Food 101: What Can Hamsters Eat? - The Spruce Pets",
"content": "The Ultimate Guide to Hamster Food: What to Feed Your Pet Explore balanced diets and tasty treats for your rodent pet By LIANNE MCLEOD...",
"error": null
},
{
"success": true,
"source": "https://www.animallama.com/hamsters/hamster-food-list/",
"snippet": "Safe & Unsafe Hamster Food List: Veggies, Fruits, Herbs ... - Animallama",
"content": "DIET & HEALTH | HAMSTERS Safe & Unsafe Hamster Food List: Veggies, Fruits, Nuts, Seeds, Herbs, Protein & More By Monika Kucic Updated on July 18...",
"error": null
},
{
"success": true,
"source": "https://www.petmd.com/exotic/what-can-hamsters-eat",
"snippet": "What Can Hamsters Eat? - PetMD",
"content": "Home What Can Hamsters Eat? By Angelina Childree, LVT . Reviewed by Melissa Witherell, DVM Updated Jul. 23, 2024 Inventori/iStock / Getty Images Plus via Getty Images IN THIS ARTICLE...",
"error": null
}
],
"message": "Routing to #4, #4 instances"
}