Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mistakes with IRI regular expressions #32

Open
dertseha opened this issue Feb 18, 2023 · 2 comments
Open

Mistakes with IRI regular expressions #32

dertseha opened this issue Feb 18, 2023 · 2 comments

Comments

@dertseha
Copy link

dertseha commented Feb 18, 2023

The code regarding the regular expressions used for rdf/iri contain mistakes:

  • iqueryRE uses ipath instead of iquery, which is thus unused; When replacing it, further mistakes within iquery come to light:
    • iprivate contains an invalid regexp sequence: \x{F0000]-\x{FFFFD} should be \x{F0000}-\x{FFFFD}.
    • iquery is wrongly using "/?" as a sequence; This should be a choice, as in [\/\?].
  • iuserinfo is missing the colon character as per RFC. As such, IRI "https://user:[email protected]" cannot be parsed.
  • h16 regular expression should allow for 1-4 hex digits as per RFC, not require exactly 4 hex digits

As a side-note, the example "http://résumé.example.org", used for testing normalization, is not a properIRI string. The &#xE9 sequence is according to RFC chapter 1.4 the way how non US-ASCII characters are represented within a US-ASCII-only RFC text.
The first # makes the remainder be considered a fragment, which would be invalid because of the second #.

I found these things as I was extracting the package as a separate library, handling all the TODOs (ending up in a large rework), and feeding in many samples from the RFC - especially those about resolving relative IRIs. See https://github.com/contomap/iri .
My rework makes it incompatible with your use in here (different type & behaviour), which is why I collect the mistakes I found only as an issue.

@gonzojive
Copy link
Collaborator

Thanks for the report! When there's time, we'll fix these bugs and add more tests.

One aspect of this library that I never got around to was coming up with a better API. I dislike how url.URL has public fields, which allows the user to set fields to invalid values, so I didn't want to simply copy that approach. Did you consider alternatives when writing the fork?

@dertseha
Copy link
Author

Thank you for your response.
(Note: I was think-typing here, past midnight, and probably wrote more than you were asking for; First half answers your questions :) )

The semi-closest that I came upon (on GitHub) was https://github.com/fogfish/iri (archived) and its successor https://github.com/fogfish/curie , which seems to have a different target.
Otherwise I found IRI code only in the context of RDF in various code bases - which seemed to always treat it as a string and embedded into the RDF system - instead of a general IRI as per RFC. (One example is the </> wrapping in String() that I removed.)
I liked the implementation from this repository here because of the two-phased approach of splitting into general components first (which is essentially IRI/URI agnostic), and then check them per component-specific rules.

As for the potential of invalid values: One approach could be to percent-escape values during String() that would otherwise not be allowed.
A full-on wrapping would be to make the struct an opaque one with no public fields, and then wire everything through setter/getter.

In my approach I let myself be guided more by the design of url.URL, which is why things are public. Yet in hindsight I also see that a full compatibility would (should) not be necessary.


(Here my thoughts went off - if not of interest, at least a thank you for rubber-ducking :) )

An unfinished thought was that the type itself could be a data object with no intelligence apart from String() as it is now, keeping the public fields, and then some sort of filter type Filter interface {Filter(IRI) IRI} would take care of escaping; For example an RFC3897Encoder, which percent-escapes only IRI-reserved characters, and an RFC3896Encoder, which percent-escapes everything non-US-ASCII.

I also had the wild idea to remove the path-resolving code and internally create url.URL instances with only the Path filled out and let them be resolved - and hope they never check for anything besides path delimiter. :)

Coming back to my unfinished thought, truly separating the two mentioned phases.
First into a general ResourceIdentifier type (with public components like Scheme, Authority, ... - like I have now) for which only Parse(string) ResourceIdentifier and a .String() string exist, performing the generic separation and assembly of the components - only the delimiters are considered (://?#) and percent-escaped during String(). What is between these delimiters is not checked.

The second phase would be in dedicated functions -- which would then exist for URI or IRI or whatever:

type Rules interface { // name unclear
  Validate() error // see if the components are following the respective patterns
  Unescape(ResourceIdentifier) ResourceIdentifier // replaces percent-encoding with allowed characters from rules
  Escape(ResourceIdentifier) ResourceIdentifier // replaces reserved characters with percent-encoded
}

// types implementing Rules:
type RFC3897Rules struct {} // IRI rules
type RFC3896Rules struct {} // URI rules

// and general functions
func Resolve(base, rel ResourceIdentifier) ResourceIdentifier // is the same behaviour for URI and IRI -- although there are differences in older RFCs, so perhaps also part of Rules

... contomap/iri is still in version v0, so I might have a go at this; ...Should I take another round - because I actually wanted to implement something completely different that intends to use IRIs :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants