Fast filtering for RubyGems versions index files. Designed for memory-constrained environments like Fastly Compute edge workers.
- Streaming parser: Handles 20+ MB files with minimal memory footprint
- Flexible filtering: Allow mode, block mode, or passthrough (no filtering)
- Combined filters: Use `--allow` and `--block` together (allowlist - blocklist)
- Version stripping: Optionally replace version lists with `0` to reduce size
- Order preservation: Maintains exact original order from the input file
gem-index-filter [OPTIONS] <versions-file> [output-file]
Options:
--allow <file> Filter to only gems in allowlist file (one name per line)
--block <file> Filter out gems in blocklist file (one name per line)
--strip-versions Replace version lists with '0' in output
Examples:
# Pass through all gems (no filtering)
gem-index-filter versions
# Filter to only gems in allowlist
gem-index-filter --allow allowlist.txt versions filtered.txt
# Block specific gems
gem-index-filter --block blocklist.txt versions filtered.txt
# Allow mode with blocked gems removed (allowlist - blocklist)
gem-index-filter --allow allow.txt --block block.txt versions filtered.txt
# Strip version information (replace with '0')
gem-index-filter --strip-versions versions filtered.txt
# Stream from stdin
curl https://rubygems.org/versions | gem-index-filter --allow allowlist.txt - > filtered.txt
Filter file format (one gem name per line, # for comments):
rails
sinatra
activerecord
puma
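The filter file format above is simple enough to read in a few lines. The following is a minimal sketch, not the crate's own loader: `parse_filter_list` is a hypothetical helper, and it assumes that `#` starts a comment anywhere on a line and that blank lines are ignored.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

/// Hypothetical helper (not part of the crate's documented API):
/// reads a filter list with one gem name per line.
pub fn parse_filter_list<R: BufRead>(reader: R) -> io::Result<HashSet<String>> {
    let mut names = HashSet::new();
    for line in reader.lines() {
        let line = line?;
        // Assumption: everything from '#' to the end of the line is a comment.
        let name = line.split('#').next().unwrap_or("").trim();
        if !name.is_empty() {
            names.insert(name.to_string());
        }
    }
    Ok(names)
}
```

The library example builds a `HashSet` of `&str`. If that is the type `FilterMode` expects, a borrowed view of the owned set can be built with `names.iter().map(String::as_str).collect::<HashSet<&str>>()`.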
use gem_index_filter::{filter_versions_streaming, FilterMode};
use std::collections::HashSet;
use std::fs::File;
let input = File::open("versions")?;
let mut output = File::create("versions.filtered")?;
// Create allowlist
let mut allowlist = HashSet::new();
allowlist.insert("rails");
allowlist.insert("sinatra");
// Stream and filter
filter_versions_streaming(input, &mut output, FilterMode::Allow(&allowlist), false)?;
Other modes:
// Block mode - exclude specific gems
let mut blocklist = HashSet::new();
blocklist.insert("big-gem");
filter_versions_streaming(input, &mut output, FilterMode::Block(&blocklist), false)?;
// Passthrough mode - no filtering
filter_versions_streaming(input, &mut output, FilterMode::Passthrough, false)?;
// Strip versions while filtering
filter_versions_streaming(input, &mut output, FilterMode::Allow(&allowlist), true)?;
The format uses one line per rubygem, with additional lines appended for updates:
created_at: 2024-04-01T00:00:05Z
---
gemname [-]version[,version]* MD5
- gemname: The name of the rubygem
- versions: Comma-separated list of versions (may include platform)
- -: Minus prefix indicates yanked version
- MD5: Hash of the gem's "info" file
When a gem appears multiple times, the last occurrence has the authoritative MD5.
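To make the line format concrete, here is a rough sketch of splitting one line into its three fields. The type and function names (`VersionsLine`, `parse_versions_line`, `is_yanked`, `strip_versions`) are hypothetical and chosen for illustration. The stripped output shape (`gemname 0 MD5`) is an assumption based on the description of `--strip-versions`.

```rust
/// One parsed line of a versions file (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct VersionsLine<'a> {
    pub name: &'a str,
    pub versions: Vec<&'a str>,
    pub md5: &'a str,
}

/// Splits `gemname versions MD5` into its three fields.
/// Returns None for lines without exactly three space-separated fields,
/// such as the `created_at:` header and the `---` separator.
pub fn parse_versions_line(line: &str) -> Option<VersionsLine<'_>> {
    let mut fields = line.trim_end().split(' ');
    let name = fields.next()?;
    let versions = fields.next()?;
    let md5 = fields.next()?;
    if fields.next().is_some() || name.is_empty() {
        return None;
    }
    Some(VersionsLine {
        name,
        versions: versions.split(',').collect(),
        md5,
    })
}

/// True when a version entry carries the minus prefix that marks a yank.
pub fn is_yanked(version: &str) -> bool {
    version.starts_with('-')
}

/// Assumed shape of a stripped line: the version list is replaced by `0`.
pub fn strip_versions(line: &str) -> Option<String> {
    let parsed = parse_versions_line(line)?;
    Some(format!("{} 0 {}", parsed.name, parsed.md5))
}
```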
- Parse: Stream input line-by-line using BufReader
- Filter: Based on mode, check gem name against filter list:
  - Passthrough: Include all gems (no filtering)
  - Allow mode: Include only gems where `gemlist.contains(gemname) == true`
  - Block mode: Include only gems where `gemlist.contains(gemname) == false`
  - Combined: Preprocess `allowlist - blocklist` at startup, then use Allow mode
- Output: Write matching lines immediately in original order
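The three steps above can be sketched as a single loop over a `BufRead`. This is an illustration of the approach, not the crate's implementation: `Mode`, `combine`, and `filter_stream` are hypothetical stand-ins, and copying the header (everything up to and including `---`) unchanged is an assumption.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

/// Hypothetical stand-in for the crate's FilterMode, for illustration only.
pub enum Mode<'a> {
    Passthrough,
    Allow(&'a HashSet<&'a str>),
    Block(&'a HashSet<&'a str>),
}

/// Combined mode: compute allowlist - blocklist once, then filter in Allow mode.
pub fn combine<'a>(allow: &HashSet<&'a str>, block: &HashSet<&'a str>) -> HashSet<&'a str> {
    allow.difference(block).copied().collect()
}

/// Copies matching lines from `input` to `output` in their original order.
pub fn filter_stream<R: BufRead, W: Write>(
    mut input: R,
    mut output: W,
    mode: &Mode<'_>,
) -> io::Result<()> {
    let mut line = String::new(); // the only buffer, reused for every line
    let mut in_header = true;
    loop {
        line.clear();
        if input.read_line(&mut line)? == 0 {
            break; // end of input
        }
        if in_header {
            // Assumption: header lines up to and including `---` are copied as-is.
            in_header = line.trim_end() != "---";
            output.write_all(line.as_bytes())?;
            continue;
        }
        // The gem name is everything before the first space.
        let name = line.split(' ').next().unwrap_or("").trim_end();
        let keep = match mode {
            Mode::Passthrough => true,
            Mode::Allow(list) => list.contains(name),
            Mode::Block(list) => !list.contains(name),
        };
        if keep {
            output.write_all(line.as_bytes())?;
        }
    }
    Ok(())
}
```

Because each line is written as soon as it is accepted, memory use stays bounded by the longest line rather than by the size of the file.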
The filtering is optimized for performance and simplicity:
- Streaming architecture: Only current line buffer held in memory
- Order preservation: Maintains exact original order from input
- All occurrences preserved: The versions file is append-only, so every line for a matching gem is kept
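Since every occurrence of a gem is kept, a consumer of the filtered file can still recover the authoritative MD5 by letting later lines win. A minimal sketch, assuming the whole filtered index fits in a string; `latest_md5` is a hypothetical consumer-side helper, not something the crate is documented to provide.

```rust
use std::collections::HashMap;

/// Hypothetical helper: maps each gem name to the MD5 from its last line.
pub fn latest_md5(index: &str) -> HashMap<&str, &str> {
    let mut latest = HashMap::new();
    for line in index.lines() {
        let mut fields = line.split(' ');
        // Header lines have fewer than three fields and are skipped.
        if let (Some(name), Some(_versions), Some(md5)) =
            (fields.next(), fields.next(), fields.next())
        {
            // Later lines overwrite earlier ones, so the last occurrence wins.
            latest.insert(name, md5);
        }
    }
    latest
}
```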
The versions file supports HTTP range requests, enabling incremental updates:
// Future API design
struct FilteredIndex {
data: Vec<u8>,
last_byte_offset: u64, // Track where we've processed to
}
impl FilteredIndex {
fn update(&mut self, range_data: &[u8]) {
// Process only new appended data
// Merge updates into existing filtered index
}
}
Strategy:
- Store byte offset we've processed to
- Fetch `Range: bytes={offset}-` for incremental updates
- Filter new lines and append matching gems to existing filtered index
- All occurrences preserved - simple append operation
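One way the strategy could look in code is sketched below. This is speculative, since the API is described as a future design: `IncrementalIndex` and its `allow` field are hypothetical, the sketch handles Allow mode only, and it assumes `range_data` begins exactly at `last_byte_offset`, past the file header. A trailing partial line is not consumed, so the next range request fetches it again in full.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the future API described above; not an existing type.
pub struct IncrementalIndex {
    pub data: Vec<u8>,
    pub last_byte_offset: u64,
    pub allow: HashSet<String>,
}

impl IncrementalIndex {
    /// Value for the HTTP `Range` header of the next fetch.
    pub fn range_header(&self) -> String {
        format!("bytes={}-", self.last_byte_offset)
    }

    /// Filters newly appended bytes and appends matching lines to `data`.
    pub fn update(&mut self, range_data: &[u8]) {
        // Only consume up to the last newline; leave any partial line behind.
        let complete = match range_data.iter().rposition(|&b| b == b'\n') {
            Some(pos) => &range_data[..=pos],
            None => return, // no complete line yet
        };
        for line in complete.split_inclusive(|&b| b == b'\n') {
            // The gem name is everything before the first space.
            let name_end = line.iter().position(|&b| b == b' ').unwrap_or(line.len());
            if let Ok(name) = std::str::from_utf8(&line[..name_end]) {
                if self.allow.contains(name) {
                    self.data.extend_from_slice(line);
                }
            }
        }
        // Advance by the bytes actually processed, not by the bytes received.
        self.last_byte_offset += complete.len() as u64;
    }
}
```

Advancing the offset only past complete lines keeps the stored offset aligned with a line boundary, so an interrupted or truncated response cannot split a gem entry across two updates.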
# Run tests
cargo test
# Build release binary
cargo build --release
# For Fastly Compute (wasm32-wasi target)
cargo build --target wasm32-wasi --release
# Run all tests
cargo test
# Test with real data (if you have a versions file)
cargo run --release -- versions output.txt
MIT