Pathik


A high-performance web crawler implemented in Go with Python and JavaScript bindings. It saves each crawled page as both raw HTML and converted Markdown.

Features

  • Fast crawling with Go's concurrency model
  • Clean content extraction
  • Markdown conversion
  • Parallel URL processing
  • Cloudflare R2 integration
  • Kafka streaming support
  • Memory-efficient (uses ~10x less memory than browser automation tools)
  • Automatic binary version management

Performance Benchmarks

Memory Usage Comparison

Pathik is significantly more memory-efficient than browser automation tools like Playwright, using roughly 10x less memory.

Parallel Crawling Performance

Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:

Python Performance

Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling

JavaScript Performance

Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling

Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the parallel parameter.
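
To see the difference on your own machine, here is a minimal timing sketch using the parallel parameter described above; the URL list is illustrative:

import time
import pathik

urls = ["https://example.com", "https://httpbin.org/html"]

# Parallel (the default when multiple URLs are passed)
start = time.time()
pathik.crawl(urls, parallel=True)
print(f"Parallel:   {time.time() - start:.2f}s")

# Sequential, for comparison
start = time.time()
pathik.crawl(urls, parallel=False)
print(f"Sequential: {time.time() - start:.2f}s")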

Installation

pip install pathik

The package will automatically download the correct binary for your platform from GitHub releases on first use.
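
Since the first crawl triggers the binary download, a short smoke test is enough to verify the install (importlib.metadata is standard library; the target URL is arbitrary):

from importlib.metadata import version

import pathik

print(version("pathik"))             # confirm the installed package version
pathik.crawl("https://example.com")  # first call downloads the platform binary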

Binary Version Management

Pathik now automatically handles binary version checking and updates:

  • When you install or upgrade the Python package, it will check if the binary matches the package version

  • If the versions don't match, it will automatically download the correct binary

  • You can manually check and update the binary with:

    # Force binary update
    import pathik
    from pathik.crawler import get_binary_path
    binary_path = get_binary_path(force_download=True)
  • Command line options:

    # Check if binary is up to date
    pathik --check-binary
    
    # Force update of the binary
    pathik --force-update-binary

This ensures you always have the correct binary version with all the latest features, especially when using new functionality like Kafka streaming with session IDs.

Usage

Python API

Basic Crawling

import pathik

# Crawl a single URL
result = pathik.crawl("https://example.com")
print(f"HTML saved to: {result['https://example.com']['html']}")
print(f"Markdown saved to: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel
urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://jsonplaceholder.typicode.com"
]
results = pathik.crawl(urls)

# To disable parallel crawling
results = pathik.crawl(urls, parallel=False)

# To specify output directory
results = pathik.crawl(urls, output_dir="./output")
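
Each entry in the returned dictionary maps a URL to the paths of its saved files, so reading the output back is straightforward. A short sketch building on the result layout shown above:

results = pathik.crawl(["https://example.com"], output_dir="./output")
for url, paths in results.items():
    # paths["html"] and paths["markdown"] are the saved file locations
    with open(paths["markdown"], encoding="utf-8") as f:
        md = f.read()
    print(f"{url}: {len(md)} characters of Markdown")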

R2 Upload

import pathik
import uuid

# Generate a UUID or use your own
my_uuid = str(uuid.uuid4())

# Crawl and upload to R2
results = pathik.crawl_to_r2("https://example.com", uuid_str=my_uuid)
print(f"UUID: {results['https://example.com']['uuid']}")
print(f"R2 HTML key: {results['https://example.com']['r2_html_key']}")
print(f"R2 Markdown key: {results['https://example.com']['r2_markdown_key']}")

# Upload multiple URLs
results = pathik.crawl_to_r2([
    "https://example.com",
    "https://httpbin.org/html"
], uuid_str=my_uuid)
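
Because R2 exposes an S3-compatible API, the uploaded objects can be fetched back with boto3. This is a sketch, not part of pathik itself: the endpoint URL, credentials, and bucket name are placeholders you must fill in from your Cloudflare account.

import boto3

# All values in angle brackets are placeholders, not pathik configuration
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    aws_access_key_id="<r2_access_key_id>",
    aws_secret_access_key="<r2_secret_access_key>",
)

# Continuing the example above: look up the key pathik reported, then fetch it
key = results["https://example.com"]["r2_html_key"]
obj = s3.get_object(Bucket="<bucket_name>", Key=key)
html = obj["Body"].read().decode("utf-8")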

Kafka Streaming

import pathik
import uuid

# Generate a session ID for tracking
session_id = str(uuid.uuid4())

# Stream a single URL to Kafka
result = pathik.stream_to_kafka("https://example.com", session=session_id)
print(f"Success: {result['https://example.com']['success']}")

# Stream multiple URLs with custom options
results = pathik.stream_to_kafka(
    urls=["https://example.com", "https://httpbin.org/html"],
    content_type="html",        # Options: "html", "markdown", or "both"
    topic="custom_topic",       # Optional custom topic
    session=session_id,         # Optional session ID
    parallel=True               # Process URLs in parallel (default)
)

# Check results
for url, status in results.items():
    if status["success"]:
        print(f"Successfully streamed {url}")
    else:
        print(f"Failed to stream {url}: {status.get('error')}")

Command Line

# Crawl a single URL
pathik crawl https://example.com

# Crawl multiple URLs
pathik crawl https://example.com https://httpbin.org/html

# Specify output directory
pathik crawl -o ./output https://example.com

# Use sequential (non-parallel) mode
pathik crawl -s https://example.com https://httpbin.org/html

# Upload to R2 (Cloudflare)
pathik r2 https://example.com

# Stream crawled content to Kafka
pathik kafka https://example.com

# Stream only HTML content to Kafka
pathik kafka -c html https://example.com

# Stream only Markdown content to Kafka
pathik kafka -c markdown https://example.com

# Stream to a specific Kafka topic
pathik kafka -t user1_crawl_data https://example.com

# Add a session ID for multi-user environments
pathik kafka --session user123 https://example.com

# Combine options
pathik kafka -c html -t user1_data --session user123 https://example.com

Kafka Streaming

Pathik supports streaming crawled content directly to Kafka. This is useful for real-time processing pipelines.

Kafka Configuration

Configure Kafka connection details in the .env file:

KAFKA_BROKERS=localhost:9092        # Comma-separated list of brokers
KAFKA_TOPIC=pathik_crawl_data       # Topic to publish to
KAFKA_USERNAME=                     # Optional username for SASL authentication
KAFKA_PASSWORD=                     # Optional password for SASL authentication
KAFKA_CLIENT_ID=pathik-crawler      # Client ID for Kafka
KAFKA_USE_TLS=false                 # Whether to use TLS
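
For a SASL-authenticated cluster over TLS, the same file might look like this; the broker addresses and credentials are placeholders:

KAFKA_BROKERS=broker1.example.com:9093,broker2.example.com:9093
KAFKA_TOPIC=pathik_crawl_data
KAFKA_USERNAME=my-kafka-user
KAFKA_PASSWORD=my-kafka-password
KAFKA_CLIENT_ID=pathik-crawler
KAFKA_USE_TLS=true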

Kafka Message Format

When streaming to Kafka, Pathik sends up to two messages per URL, depending on the selected content type:

  1. HTML Content:

    • Key: URL
    • Value: Raw HTML content
    • Headers:
      • url: The original URL
      • contentType: "text/html"
      • timestamp: ISO 8601 timestamp
      • session: Session ID (if provided)
  2. Markdown Content:

    • Key: URL
    • Value: Markdown content
    • Headers:
      • url: The original URL
      • contentType: "text/markdown"
      • timestamp: ISO 8601 timestamp
      • session: Session ID (if provided)

Kafka Consumer Examples

Pathik includes example consumers for Go, Python, and JavaScript in the examples directory.

Python Consumer Example

from kafka import KafkaConsumer
import json

# Connect to Kafka
consumer = KafkaConsumer(
    'pathik_crawl_data',                  # Topic
    bootstrap_servers=['localhost:9092'],
    auto_offset_reset='earliest',         # Start from beginning
    enable_auto_commit=True,
    group_id='pathik-consumer-group'
)

# Optional: filter by session ID
session_filter = "user123"  # Set to None to receive all messages

# Process messages
for message in consumer:
    # Extract headers
    headers = {k: v.decode('utf-8') for k, v in message.headers}
    
    # Filter by session if needed
    if session_filter and headers.get('session') != session_filter:
        continue
        
    # Get message details
    url = message.key.decode('utf-8')
    content_type = headers.get('contentType')
    
    print(f"Received from {url}: {content_type} content ({len(message.value)} bytes)")
    
    # Process content based on type
    if content_type == 'text/html':
        # Process HTML...
        pass
    elif content_type == 'text/markdown':
        # Process Markdown...
        pass

Using in Docker

When using Pathik in a Docker container, you need to install the required dependencies for Chromium:

FROM python:3.10-slim

# Install Chromium dependencies
RUN apt-get update && apt-get install -y \
    libglib2.0-0 \
    libgtk-3-0 \
    libx11-6 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxi6 \
    libxtst6 \
    libnss3 \
    libcups2 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libgdk-pixbuf2.0-0 \
    libpango-1.0-0 \
    libcairo2 \
    libdrm2 \
    libgbm1 \
    libasound2 \
    fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Install pathik
RUN pip install pathik

Development

Setup

# Clone the repository
git clone https://github.com/justrach/pathik.git
cd pathik

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .
