Extract SURT into a separate gem #767

Open
Mr0grog opened this issue Oct 21, 2020 · 1 comment

Mr0grog commented Oct 21, 2020

This project has a nearly complete Ruby port of the Internet Archive’s SURT Python package buried in the app/lib/ directory:

# Tools for canonicalizing and formatting URLs according to the Internet
# Archive's "Sort-friendly URI Reordering Transform" (SURT) format:
# http://crawler.archive.org/articles/user_manual/glossary.html#surt
#
# For example:
#
# URL: https://energy.gov/eere/sunshot/downloads/
# SURT: gov,energy)/eere/sunshot/downloads
#
# The implementations primarily live in submodules (Canonicalize and Format),
# while the methods here serve as public entry points. See each implementation
# module for a list of options and default values (at the top of each module).
#
# Code in the submodules is generally based on the Internet Archive's Python
# SURT module: https://github.com/internetarchive/surt
# With some added inspiration from Purell: https://github.com/PuerkitoBio/purell
# and normalize_url: https://github.com/rwz/normalize_url
module Surt

I wrote it because we needed URL canonicalization tools, none of the existing Ruby libraries I could find quite met our needs, and having a method that roughly matched the Internet Archive’s was advantageous. Nobody had written a Ruby port of SURT.
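
For readers unfamiliar with the format, here is a minimal sketch of the core reordering shown in the example above. It is not the code in app/lib/: the `naive_surt` helper is hypothetical, it assumes an absolute URL with a host, and it skips the canonicalization options (query handling and so on) that the real module and the Python package provide.

```ruby
require 'uri'

# Hypothetical helper for illustration only; NOT the project's Surt module.
# Reverses the host labels, joins them with commas, and appends the path
# with any trailing slashes removed.
def naive_surt(url)
  uri = URI.parse(url)
  raise ArgumentError, "URL has no host: #{url}" if uri.host.nil?

  # "energy.gov" becomes "gov,energy"
  host = uri.host.downcase.split('.').reverse.join(',')

  # Drop trailing slashes; keeping "/" for an empty path is an assumption.
  path = uri.path.sub(%r{/+\z}, '')
  path = '/' if path.empty?

  "#{host})#{path}"
end

puts naive_surt('https://energy.gov/eere/sunshot/downloads/')
# prints "gov,energy)/eere/sunshot/downloads"
```

Because the most significant host label comes first, sorting these keys groups pages from the same domain together, which is the point of the format.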

Since we have generally been working to break the more reusable, abstract pieces out of the web monitoring projects, this is probably a good candidate for that on the Ruby side. It might be nice to extract it and publish it as a Ruby gem. (Gem name: SURT, repo name: edgi-govdata-archiving/ruby-surt)

@Mr0grog Mr0grog added this to Icebox in Web Monitoring via automation Oct 21, 2020
@Mr0grog Mr0grog moved this from Icebox to Ready in Web Monitoring Oct 21, 2020
@stale stale bot added the stale label Jun 3, 2021
@Mr0grog Mr0grog removed the stale label Jun 4, 2021
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021
stale bot commented Jan 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
