A minimalistic, concurrent web crawler written in Go.
- Concurrent Processing: Configurable number of worker goroutines
- Graceful Shutdown: Proper cleanup and signal handling
- Retry Logic: Exponential backoff with a configurable number of attempts (see the sketch after this list)
- Configuration Management: Environment variables and command-line flags
- Error Handling: Detailed error logging
- Memory Management: Efficient memory usage with proper cleanup
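
As a rough illustration of the retry behavior listed above, a fetch with exponential backoff might look like the sketch below. `fetchWithRetry`, its parameters, and the example URL are illustrative only, not the crawler's actual API.

```go
// Sketch of retry with exponential backoff: the delay doubles after each
// failed attempt. All names here are hypothetical, not the crawler's code.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchWithRetry issues a GET request up to `attempts` times, sleeping
// between tries and doubling the delay after every failure.
func fetchWithRetry(url string, attempts int, baseDelay time.Duration) ([]byte, error) {
	var lastErr error
	delay := baseDelay
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode == http.StatusOK {
			defer resp.Body.Close()
			return io.ReadAll(resp.Body)
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("unexpected status: %s", resp.Status)
		}
		lastErr = err
		time.Sleep(delay)
		delay *= 2 // exponential backoff
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	body, err := fetchWithRetry("https://go.dev/learn/", 3, time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```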
```bash
go run cmd/crawler/main.go \
  --max-count 200 \
  --max-concurrent 20 \
  --url "https://go.dev/learn/" \
  --timeout 60s \
  --output-dir "./tmp"
```

```bash
./crawler --max-count=1000 --url "https://pikabu.ru/" --output-dir "./.tmp"
```

| Flag | Environment Variable | Default | Description |
|---|---|---|---|
| --max-count | CRAWLER_MAX_COUNT | 100 | Maximum pages to crawl |
| --max-concurrent | CRAWLER_MAX_CONCURRENT | 10 | Maximum concurrent workers |
| --url | CRAWLER_URL | "" | Starting URL |
| --timeout | CRAWLER_TIMEOUT | 30s | HTTP request timeout |
| --retry-attempts | CRAWLER_RETRY_ATTEMPTS | 3 | Number of retry attempts |
| --retry-delay | CRAWLER_RETRY_DELAY | 1s | Delay between retries |
| --output-dir | CRAWLER_OUTPUT_DIR | ./.tmp/ | Output directory |
| --log-level | CRAWLER_LOG_LEVEL | info | Log level |
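
The table above pairs each flag with an environment variable. A minimal sketch of how that fallback might be wired with the standard `flag` and `os` packages follows, assuming flags take precedence over the CRAWLER_* variables; `envOr` and `envInt` are hypothetical helpers, not the crawler's actual implementation.

```go
// Illustrative flag/environment precedence: each flag's default is seeded
// from its CRAWLER_* variable, so an explicit command-line flag still wins.
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// envOr returns the environment variable's value if set, otherwise def.
func envOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

// envInt is like envOr but parses the value as an integer.
func envInt(key string, def int) int {
	if v, ok := os.LookupEnv(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func main() {
	url := flag.String("url", envOr("CRAWLER_URL", ""), "Starting URL")
	maxCount := flag.Int("max-count", envInt("CRAWLER_MAX_COUNT", 100), "Maximum pages to crawl")
	flag.Parse()

	fmt.Println("url:", *url, "max-count:", *maxCount)
}
```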
- Distributed crawling support
- Advanced filtering and crawling rules (by size, file format)
- Metrics & Monitoring: Comprehensive statistics and performance tracking
- Redirect handling