-
-
Notifications
You must be signed in to change notification settings - Fork 26
/
surt.rb
41 lines (37 loc) · 1.54 KB
/
surt.rb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
require 'addressable/uri'
# Tools for canonicalizing and formatting URLs according to the Internet
# Archive's "Sort-friendly URI Reordering Transform" (SURT) format:
# http://crawler.archive.org/articles/user_manual/glossary.html#surt
#
# For example:
#
# URL: https://energy.gov/eere/sunshot/downloads/
# SURT: gov,energy)/eere/sunshot/downloads
#
# The implementations primarily live in submodules (Canonicalize and Format),
# while the methods here serve as public entry points. See each implementation
# module for a list of options and default values (at the top of each module).
#
# Code in the submodules is generally based on the Internet Archive's Python
# SURT module: https://github.com/internetarchive/surt
# With some added inspiration from Purell: https://github.com/PuerkitoBio/purell
# and normalize_url: https://github.com/rwz/normalize_url
module Surt
# Canonicalize and format a URL according to SURT.
def self.surt(url, options = {})
Surt::Format.url(Surt::Canonicalize.url(parse_url(url), options), options)
end
# Canonicalize a URL. The result of this is a URL, not a SURT string.
def self.canonicalize(url, options = {})
Surt::Canonicalize.url(parse_url(url), options).to_s
end
# Format a URL as SURT without doing any canonicalization or cleanup.
def self.format(url, options = {})
Surt::Format.url(parse_url(url), options)
end
def self.parse_url(url)
url = url.strip.gsub(/[\r\n\t]/, '').gsub(/\s/, '%20')
url = "http://#{url}" unless url.match?(/^[^:\/.]*:/)
Addressable::URI.parse(url)
end
end