How We Built a 1000+ Page/Min Crawler
A technical deep-dive into the distributed architecture behind WebAudit's crawler.
The Challenge
When we set out to build WebAudit, we knew that crawl speed would be a critical differentiator. Enterprise clients need to audit thousands of pages quickly, and agencies managing multiple sites can't wait hours for audits to complete.
Our target was ambitious: crawl and analyze over 1,000 pages per minute while maintaining accuracy and respecting target sites.
Architecture Overview
We built our crawler entirely in Go, chosen for its excellent concurrency primitives and low memory footprint. The architecture consists of several key components:
1. URL Frontier
The URL Frontier manages which pages to crawl next. It implements priority queuing based on page depth, sitemap inclusion, and link structure. We use Redis for distributed state, allowing multiple crawler workers to coordinate without conflicts.
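To make the idea concrete, here's a simplified sketch of a Redis-backed priority frontier built on a sorted set. The type and method names are illustrative rather than our exact implementation, and it assumes the go-redis client; lower scores get crawled first.

// Sketch of a Redis-backed priority frontier using a sorted set.
// Assumes github.com/redis/go-redis/v9.
import (
	"context"

	"github.com/redis/go-redis/v9"
)

type Frontier struct {
	rdb *redis.Client
	key string // sorted set holding pending URLs for one crawl
}

// Push enqueues a URL with a priority score derived from depth,
// sitemap inclusion, and link structure (lower = sooner).
func (f *Frontier) Push(ctx context.Context, url string, priority float64) error {
	return f.rdb.ZAdd(ctx, f.key, redis.Z{Score: priority, Member: url}).Err()
}

// Pop atomically removes and returns the highest-priority URL, so multiple
// workers can share one frontier without handing out duplicates.
func (f *Frontier) Pop(ctx context.Context) (string, error) {
	res, err := f.rdb.ZPopMin(ctx, f.key, 1).Result()
	if err != nil {
		return "", err
	}
	if len(res) == 0 {
		return "", redis.Nil
	}
	return res[0].Member.(string), nil
}

This Pop is what the worker loop in the next section pulls from.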
2. Distributed Workers
Crawler workers are stateless and horizontally scalable. They pull URLs from the frontier, fetch pages, and push results to our analysis pipeline. Each worker maintains its own connection pool and rate limiter per domain.
// Simplified worker loop
func (w *Worker) Run(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			url, err := w.frontier.Pop(ctx)
			if err != nil {
				continue
			}
			result := w.fetch(ctx, url)
			w.pipeline.Push(result)
		}
	}
}

3. Rate Limiting
Respecting target websites is crucial. We implement per-domain rate limiting with configurable delays. Our system also respects robots.txt and Crawl-delay directives, adjusting speed based on server response times.
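As a rough sketch of the per-domain limiting (simplified, with illustrative names rather than our exact code), golang.org/x/time/rate makes this straightforward:

import (
	"context"
	"net/url"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// domainLimiters hands out one token-bucket limiter per host.
type domainLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	delay    time.Duration // minimum gap between requests to one host
}

// wait blocks until the next request to rawURL's host is allowed.
func (d *domainLimiters) wait(ctx context.Context, rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	d.mu.Lock()
	lim, ok := d.limiters[u.Host]
	if !ok {
		// One request per delay, burst of 1, so requests never bunch up.
		// The delay can come from config or a robots.txt Crawl-delay.
		lim = rate.NewLimiter(rate.Every(d.delay), 1)
		d.limiters[u.Host] = lim
	}
	d.mu.Unlock()
	return lim.Wait(ctx)
}

Adaptive slowdown then amounts to calling SetLimit on a host's limiter when its response times climb.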
4. Connection Pooling
HTTP connection reuse is essential for performance. We maintain persistent connections per host, with configurable pool sizes. This eliminates TCP handshake and TLS negotiation overhead for subsequent requests.
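In Go this is largely a matter of configuring the standard library's http.Transport; the numbers below are illustrative defaults, not our production tuning:

import (
	"net/http"
	"time"
)

// newCrawlClient returns an HTTP client that keeps idle connections open
// per host, so follow-up requests skip the TCP and TLS handshakes.
func newCrawlClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        1024,             // idle connections across all hosts
		MaxIdleConnsPerHost: 16,               // persistent connections kept per host
		IdleConnTimeout:     90 * time.Second, // recycle connections idle this long
		TLSHandshakeTimeout: 10 * time.Second,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // overall per-request deadline
	}
}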
Handling JavaScript
Many modern websites require JavaScript execution for full content rendering. We handle this with a separate rendering service using headless Chrome, coordinated via message queue. Static content is processed directly, while JavaScript-heavy pages are routed to the renderer.
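The rendering step itself looks roughly like the sketch below. It's simplified, assumes the chromedp package as the headless Chrome driver, and leaves out the message-queue plumbing:

import (
	"context"
	"time"

	"github.com/chromedp/chromedp"
)

// renderHTML loads a page in headless Chrome and returns the DOM after
// scripts have run, ready for the same analysis as static pages.
func renderHTML(parent context.Context, pageURL string) (string, error) {
	ctx, cancel := chromedp.NewContext(parent)
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate(pageURL),
		chromedp.OuterHTML("html", &html),
	)
	return html, err
}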
Analysis Pipeline
Raw crawl data flows through our analysis pipeline, which runs 60+ SEO rules in parallel. Each rule is isolated and stateless, making it easy to add new checks without affecting performance.
- HTML parsing with goquery
- Parallel rule execution (sketched after this list)
- Issue aggregation and scoring
- Real-time progress updates via SSE
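Here's a simplified sketch of the parallel rule stage. The Rule and Issue types and the example rule are illustrative, not our real rule set; it assumes goquery for parsing, as above.

import (
	"strings"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

type Issue struct{ Rule, Message string }

// A Rule inspects the parsed document (read-only) and reports any issues.
type Rule func(doc *goquery.Document) []Issue

func missingTitle(doc *goquery.Document) []Issue {
	if strings.TrimSpace(doc.Find("title").Text()) == "" {
		return []Issue{{Rule: "missing-title", Message: "page has no <title>"}}
	}
	return nil
}

// runRules parses the page once, fans the rules out across goroutines,
// and aggregates their findings.
func runRules(html string, rules []Rule) ([]Issue, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		return nil, err
	}
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		issues []Issue
	)
	for _, r := range rules {
		wg.Add(1)
		go func(r Rule) {
			defer wg.Done()
			found := r(doc)
			mu.Lock()
			issues = append(issues, found...)
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return issues, nil
}

Because each rule only reads the parsed document and returns its own findings, adding a new check is a matter of appending another function to the slice passed to runRules.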
Results
The final system consistently achieves 1,000+ pages per minute on typical websites. For sites with fast servers and simple pages, we've seen rates exceeding 2,000 pages per minute.
Memory usage stays under 500MB per worker, and the entire system can be deployed on modest hardware or scaled horizontally for enterprise workloads.
See It In Action
Experience our fast crawler yourself. Start a free audit and watch your pages get analyzed in real time.
Start Free Audit