Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should elements suffixed by "Url" be xs:anyURI? #87

Closed
demcg opened this issue Apr 7, 2015 · 18 comments
Closed

Should elements suffixed by "Url" be xs:anyURI? #87

demcg opened this issue Apr 7, 2015 · 18 comments
Assignees
Milestone

Comments

@demcg
Copy link
Contributor

demcg commented Apr 7, 2015

We have many elements that are named "...Url" are type xs:string, should they be xs:anyURI?

@pstenbjorn
Copy link
Contributor

@demcg yes I believe that is right.

@jungshadow
Copy link
Collaborator

@demcg Yes and probably with a pattern to only accept HTTP(S) protocols.

@jungshadow
Copy link
Collaborator

NB: there may be reason for other protocols in the future, but, for the current URIs, HTTP(S) seems the only appropriate option/restriction.

@demcg
Copy link
Contributor Author

demcg commented Apr 7, 2015

@jungshadow
Are we looking at something like:

<!-- very simple filter for http and https prefixes only -->
<xs:simpleType name="Url">
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="http?://.+" />
  </xs:restriction>
</xs:simpleType>

<xs:element name="RegistrationUrl" type="Url" minOccurs="0" />

I am not sure of the syntax, so please fix as required.

@pstenbjorn
Copy link
Contributor

@demcg testing my regex skills, but I think this should work.

<xsd:simpleType name="Url">
  <xsd:restriction base="xs:anyURI">
    <xsd:pattern value="[hH][tT]{2}[pP][sS][:]\/\/[a-zA-Z0-9.-]*|[hH][tT]{2}[pP][:]\/\/[a-zA-Z0-9.-]*" />
  </xsd:restriction>
</xsd:simpleType>

@jungshadow
Copy link
Collaborator

@pstenbjorn Are there situations where the protocols would be in all caps? That seems…interesting. Should that be allowed or flag as something to fix?

@jungshadow
Copy link
Collaborator

I should add that I know it's a part of the spec, but it may be off-putting.

@pstenbjorn
Copy link
Contributor

@jungshadow not really, if we want to enforce a lower case protocol, I think it we are safe in flagging it as something needing fixing. So that would make the regex

[h][t]{2}[p][s][:]\/\/[a-zA-Z0-9.-]*|[h][t]{2}[p][:]\/\/[a-zA-Z0-9.-]*

@demcg
Copy link
Contributor Author

demcg commented Apr 7, 2015

@pstenbjorn will this handle encoded spaces '%20', '#' or query strings '?,&,='

We might want to stay simple, because fully supporting all possible Url styles is a big task.

Thanks

@pstenbjorn
Copy link
Contributor

@demcg thanks to the link you provided me, the following is a validation RegEx that seems to account for both allowed and disallowed uris.

Scary to embed this into a schema, but it works.

^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$

@demcg
Copy link
Contributor Author

demcg commented Apr 7, 2015

@pstenbjorn we could remove "|ftp" ... but that is not much savings. Wow that is a big RegEx, but we only need it in one place ...

@pstenbjorn
Copy link
Contributor

@demcg. Agree on both counts, we certainly don't need to permit ftp URIs and it does only need to be declared once. @jungshadow and @jdmgoogle what are your thoughts?

@cjerdonek
Copy link
Contributor

It seems like overkill to try to have a regex validating an entire url. Why not just check for an initial "http(s)://"? Also, how do we know that the regex is correct and that it doesn't have false negatives, etc?

@jdmgoogle
Copy link
Contributor

I'm actually not that concerned about restricting it to http(s), so I'm okay with just leaving it at anyURI.

@cjerdonek
Copy link
Contributor

Out of curiosity, when would validations like these be applied (and are they ever)? I would also be okay with not imposing a restriction.

@jdmgoogle
Copy link
Contributor

I imagine we (Google and Pew) would encourage data providers to do a test validation before publishing their VIP feeds. I also imagine that we (Google) would perform a validation prior to parsing the file. A http(s)-only regexp would be useful but (a) could be overkill, and (b) could limit jurisdictions who might do something weird like host sample ballots on an FTP site or something. Plus regexps are really hard to get perfectly right, so IMHO the cost/benefit tradeoff isn't worth it.

I may be wrong, though, YMMV, etc, etc.

@jungshadow
Copy link
Collaborator

I wouldn't have brought it up if the URLs weren't so widely inconsistent (e.g. some have a protocol, some have a malformed protocol, some have no protocol...all in the same feed). In the end, I'd like the URLs to work in various tools (ours and others) and, currently, they fail more often than I'd like. Maybe just specifying xs:anyURI is enough, but I'd just like to ensure we get solid data.

@jungshadow
Copy link
Collaborator

Just made up my mind that it should be good enough.

@jungshadow jungshadow added this to the Version 5.0 milestone Apr 7, 2015
@demcg demcg assigned demcg and unassigned jungshadow Apr 8, 2015
demcg added a commit that referenced this issue Apr 8, 2015
Address Issue #87 - changed Urls of type xs:string to type xs:anyURI
@demcg demcg closed this as completed Apr 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants