Relative URLs in WHATWG URL API #12682

Closed
TimothyGu opened this issue Apr 27, 2017 · 44 comments
Labels
feature request Issues that request new features to be added to Node.js. whatwg-url Issues and PRs related to the WHATWG URL implementation.

Comments

@TimothyGu
Member

  • Version: v9.x+
  • Platform: all
  • Subsystem: url

We are on track to slowly deprecate the non-standard url.parse() (#12168 (comment)) in favor of the new WHATWG standard-based URL API. One use case that currently cannot be migrated over from url.parse() is the handling of relative URLs.

Background

url.parse() accepts incomplete, relative URLs by filling unavailable components of a URL with null.

> url.parse('#hash')
Url {
  protocol: null,
  slashes: null,
  auth: null,
  host: null,
  port: null,
  hostname: null,
  hash: '#hash',
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: '#hash' }

On the other hand, the URL constructor guarantees that all URL objects are fully complete and valid URLs, which means that it throws an exception in case of relative URLs:

> new URL('#hash')
TypeError [ERR_INVALID_URL]: Invalid URL: #hash
    at Object.onParseError (internal/url.js:92:17)
    at parse (internal/url.js:101:11)
    at new URL (internal/url.js:184:5)
    at repl:1:1

The WHATWG URL API does have the algorithms necessary to parse relative URLs, however, and they are activated when a base argument is provided:

> new URL('#hash', 'http://complete-url/')
URL {
  href: 'http://complete-url/#hash',
  origin: 'http://complete-url',
  protocol: 'http:',
  username: '',
  password: '',
  host: 'complete-url',
  hostname: 'complete-url',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: '#hash' }

It is not always the case that a base URL is available, though.
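
When a base may or may not be known, one option is to try the one-argument constructor first and only fall back to a base if one was supplied. A minimal sketch of that idea; parseWithOptionalBase is a hypothetical helper name, not part of Node.js or the WHATWG API:

```javascript
// Hypothetical helper: returns a URL object, or null if the input cannot be parsed.
function parseWithOptionalBase(input, base) {
  try {
    // Complete URLs parse on their own.
    return new URL(input);
  } catch (err) {
    // Relative input and no base to resolve it against: give up.
    if (base === undefined) return null;
  }
  try {
    return new URL(input, base);
  } catch (err) {
    return null;
  }
}

console.log(parseWithOptionalBase('#hash'));
console.log(parseWithOptionalBase('#hash', 'http://complete-url/').href);
```

Returning null rather than throwing is only one possible design; callers that prefer exceptions can use the constructor directly.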

Possible solutions

Do nothing

What this entails is that the currently supported ability to parse relative URLs will die as url.parse() becomes deprecated.

Do not deprecate url.parse(); otherwise do nothing

This is the most obvious actual solution, but from tickets like #12168, I don't see this as a good idea.

Add a non-standard TolerantURL class

This could work if we trick the parser into believing we have a legitimate URL, except there are many conditionals in the URL parser algorithm that provide ad-hoc compatibility fixes with legacy implementations. We would have to make a set of opinionated assumptions about the nature of the URL, such as the URL's scheme.

In addition to parsing, the setters will have awkward semantics. Consider the following:

// Case 1
const relativeURL = new TolerantURL('#hash');
console.log(relativeURL.href);
  // Prints #hash

relativeURL.protocol = 'http:';
console.log(relativeURL.href);
  // Should this print http:#hash (what url.format() does)?
  // http://#hash?
  // Or make the setter a noop and therefore just #hash?

// Case 2
// Assuming we have decided to use http scheme semantics
// for TolerantURL if one is not supplied.
// The URL parser does not allow changing special-ness of
// scheme through the protocol setter.
const schemeRelativeURL = new TolerantURL('//username:password@host/');
schemeRelativeURL.protocol = 'abc';
console.log(schemeRelativeURL.href);
  // Should this print abc://username:password@host/?
  // Or //username:password@host/?

// Compare:
const absoluteURL = new URL('http://username:password@host/');
absoluteURL.protocol = 'abc';
console.log(absoluteURL.href);
  // Prints http://username:password@host/

Something else that's better than what I thought of above...

@TimothyGu TimothyGu added the whatwg-url Issues and PRs related to the WHATWG URL implementation. label Apr 27, 2017
@jasnell
Member

jasnell commented Apr 28, 2017

I do not suspect that we will be able to actually deprecate url.parse() for quite some time. For the foreseeable future (at least through 10.x) the two implementations will coexist side-by-side.

@domenic
Contributor

domenic commented Apr 29, 2017

I vote for "do nothing", at least until it's clear what the use cases for these base-less relative URLs are and how people use them.

If we find out people use them a lot, I think a better solution would be (preferably in user-land) a "RelativeURL" class, not "TolerantURL", which only has pathname/search/searchParams/hash/toString(). Someone would have to specify how this works, but maybe as a first-pass it could have an internal real-URL and parse against https://example.com/, then just re-expose the pathname/search/searchParams/hash.
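
A rough first pass at that idea might look like the sketch below. The RelativeURL class and the DUMMY_BASE constant are hypothetical names used for illustration only; nothing here is specified anywhere, and path-relative input such as 'foo' is simply resolved against the dummy base's root.

```javascript
// Sketch only: RelativeURL is not a Node.js or WHATWG API.
const DUMMY_BASE = 'https://example.com/';

class RelativeURL {
  constructor(input) {
    // Parse against a made-up base and keep the real URL object internally.
    this._url = new URL(input, DUMMY_BASE);
    // Inputs that escape the dummy origin (e.g. '//other/x' or 'mailto:a@b')
    // are not base-less relative URLs, so reject them.
    if (this._url.origin !== new URL(DUMMY_BASE).origin) {
      throw new TypeError(`Not a relative URL: ${input}`);
    }
  }
  get pathname() { return this._url.pathname; }
  get search() { return this._url.search; }
  get searchParams() { return this._url.searchParams; }
  get hash() { return this._url.hash; }
  toString() { return this.pathname + this.search + this.hash; }
}

const rel = new RelativeURL('/path?a=1#frag');
rel.searchParams.append('b', '2');
console.log(rel.toString());
```

Because searchParams is the inner URL's live object, edits made through it show up in toString() without extra bookkeeping.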

@benjamingr
Member

benjamingr commented Apr 29, 2017

@domenic it's worth pointing out that working with relative URLs is extremely common in Node - the most common use case I can think of is when an incoming HTTP request arrives - you get a relative URL under request.url which you typically url.parse.

I think it would be a shame to keep two URL APIs just for relative URLs. It would be really nice if URL supported relative URLs somehow - how set in stone is the spec at this point regarding that?

@domenic
Contributor

domenic commented Apr 29, 2017

It just doesn't make sense to have a single API for both relative and absolute URLs---the components and parsing rules are far too different depending on the base used for the rest of the API to make any sense. So I'm pretty sure the spec isn't going to change, just because there's no underlying model that makes sense.

The best you can do if you want to use one API is to make up a base URL.

@benjamingr
Member

@domenic I realize that this is a hard problem, and one I do not understand very well - but I think having a base URL in Node but not in the browser could cause a lot of incompatibility when people expect code using the same spec to run the same way on both platforms.

I think we can only change URL to add example.com or something similar as a default if browsers do. If that isn't practical - then Node should stop calling url.parse a legacy API and embrace using it for relative URLs.

@jasnell
Member

jasnell commented Apr 29, 2017

Which can be done, of course, using the information in an HTTP request (in the case of request.url) so I'm not overly concerned with that particular case.

The key challenge, of course, is that without a base, it's impossible to say for sure which rules to apply to the relative bits. We either must provide a base or we must provide an equivalent context in order to properly handle the URL. Otherwise the parsing will be best guess at best.

@domenic
Contributor

domenic commented Apr 29, 2017

A compromise might be adding request.parsedURL() or similar (a function to show that it's expensive) which uses the request info + the request.url relative URL string to return a properly parsed URL instance with the base constructed from the request info.

@benjamingr
Member

@jasnell

Which can be done, of course, using the information in an HTTP request (in the case of request.url) so I'm not overly concerned with that particular case.

How? An HTTP server is not aware of its host name, and may have several DNS host names or none. In fact, I'd argue that the end server should not be concerned with what hostname it's using.

@jasnell
Member

jasnell commented Apr 29, 2017

It can make a best guess using the protocol and host header, both of which may be modified of course, but it provides enough context to provide a base URL when parsing the request URI.

@watson
Member

watson commented Jul 17, 2017

I'll not claim to know which, if any, of the 3 suggested solutions are best, but I just wanted to add a few things to the discussion:

I've seen incoming HTTP requests to Node servers that don't have a Host header in the wild a few times. Whether it was stripped by a proxy or the client never provided it in the first place, I don't know, but you can't rely on this header.

The 2nd issue is knowing the protocol. This can be inferred by looking at the request.socket.encrypted boolean, but this is not an exact science as the TLS might have been terminated in a load balancer.

Bottom line is that we can only safely get the path (via request.url). Getting the rest can be worked around, but not in a nice way.
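
To make the workaround concrete, here is a sketch of building a best-guess base from the request, with fallbacks for the two caveats above. urlFromRequest and its fallbackHost parameter are hypothetical, and the result is only as trustworthy as the Host header and the socket information:

```javascript
// Hypothetical helper; fallbackHost is an assumed, application-supplied default.
function urlFromRequest(req, fallbackHost = 'localhost') {
  // Best guess only: TLS may have been terminated by a load balancer,
  // in which case socket.encrypted is false even for https traffic.
  const protocol = req.socket && req.socket.encrypted ? 'https:' : 'http:';
  // The Host header may be missing entirely.
  const host = req.headers.host || fallbackHost;
  // Throws a TypeError if the Host header is not a valid host.
  return new URL(req.url, `${protocol}//${host}`);
}

// Usage with plain objects standing in for http.IncomingMessage:
const withHost = urlFromRequest({
  url: '/a?b=1',
  headers: { host: 'example.org:8080' },
  socket: { encrypted: true },
});
console.log(withHost.href);

const withoutHost = urlFromRequest({ url: '/x', headers: {}, socket: {} });
console.log(withoutHost.href);
```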

@TimothyGu
Member Author

So what I'm seeing is that what people really want a partial URL parser for is to parse the origin form of the request target in an HTTP request, which is defined as:

origin-form    = absolute-path [ "?" query ]

We can introduce a URLAbsolutePath class to parse from path state for this specific issue. The semantics are fairly well-defined for this specific case, since we know the scheme is always special (HTTP or HTTPS). Beyond that (including relative, query-only, or fragment-only URLs), however, users are on their own.

@jasnell
Member

jasnell commented Jul 18, 2017

hmm... I'm hesitant to introduce a new class. This could be approximated in userland fairly easily using something like:

const url = new URL(`https://localhost${absolutePath}?${query}`);

Then look at the bits of url that you care about.
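
One detail worth noting with this approach: concatenating onto a dummy origin is not the same as passing that origin as the base argument when the path starts with two slashes. A small sketch, where localhost is just a placeholder origin and the input is made up:

```javascript
const target = '//example.org/path?x=1';

// String concatenation keeps everything after the origin as path and query.
const concatenated = new URL('https://localhost' + target);
console.log(concatenated.host, concatenated.pathname);
// localhost //example.org/path

// As a relative reference, a leading "//" is scheme-relative and replaces the host.
const resolved = new URL(target, 'https://localhost');
console.log(resolved.host, resolved.pathname);
// example.org /path
```

For a request target taken from untrusted input, the concatenation form is the one that keeps the host fixed.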

@Jessidhia

@jasnell that doesn't look very usable on user input 😕

@jasnell
Member

jasnell commented Jul 18, 2017

A userland module can make it more usable. I'd rather avoid adding a convenience class that is not part of the standard

@stevenvachon

stevenvachon commented Aug 17, 2017

const relateURL = require('relateurl');

const base = new URL('http://fake/');
const url = new URL('/path?query', base);

url.searchParams.append('query2', 'value');

relateURL(url, base, { output: relateURL.ROOT_PATH_RELATIVE });
//-> /path?query&query2=value

v1.0 will be released in the near future: https://github.com/stevenvachon/relateurl

Perhaps I'll write a RelativeURL package, following what I'd done with incomplete-url:

new RelativeURL('/path?query');
//-> RelativeURL { protocol: '', hostname: '', pathname: '/path/' ... }

@paulmelnikow

This issue also applies to absolute URLs without a hostname. For example, url.parse('/foo') works, but new URL('/foo') throws a TypeError.

@paulmelnikow

I made url-path to solve this problem for me. It supports absolute paths but it would be easy to add relative paths.

@ghost

ghost commented Mar 30, 2018

I stumbled into this problem, too. I do not like to use a new class or module for this problem, so I resolved it by using a new protocol as the base.

> new URL('/', 'relative:///');
URL {
  href: 'relative:///',
  origin: 'null',
  protocol: 'relative:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }
> new URL('/folder/subfolder/../file.name', 'relative:///');
URL {
  href: 'relative:///folder/file.name',
  origin: 'null',
  protocol: 'relative:',
  username: '',
  password: '',
  host: '',
  hostname: '',
  port: '',
  pathname: '/folder/file.name',
  search: '',
  searchParams: URLSearchParams {},
  hash: '' }

Therefore I suggest internal support for this relative protocol. Maybe registering the protocol with the IANA is a good idea: https://tools.ietf.org/html/rfc7595 & https://www.iana.org/protocols. The WHATWG has made it clear several times that they do not intend to support relative URLs at all: whatwg/url#136.

The only downside of the approach is the strange value of the URL.href property, so you have to explicitly use URL.pathname or do something like URL.href.substr(URL.protocol === 'relative:' ? 11 : 0) if you would like to have everything together. Introducing a non-standard property hrefRelative for this is an option.
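
As a sketch of what such a helper could look like in user code (toRelativeString and RELATIVE_BASE are hypothetical names, and the relative: scheme is the unregistered placeholder used above):

```javascript
// Placeholder base using the made-up "relative:" scheme from this comment.
const RELATIVE_BASE = 'relative:///';

// Hypothetical helper: drops the placeholder scheme again when serializing.
function toRelativeString(url) {
  return url.protocol === 'relative:'
    ? url.pathname + url.search + url.hash
    : url.href;
}

const parsed = new URL('/folder/subfolder/../file.name?x=1#top', RELATIVE_BASE);
console.log(toRelativeString(parsed));
// /folder/file.name?x=1#top

// URLs with any other scheme are returned unchanged.
console.log(toRelativeString(new URL('http://a/b')));
// http://a/b
```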

@TimothyGu
Member Author

@cosycode That works for most cases, but there can be subtle differences between such a relative scheme and a special scheme like http:

> new URL('/folder\\subfolder/../file.name', 'relative:///').pathname
'/file.name'
> new URL('/folder\\subfolder/../file.name', 'http://a/').pathname
'/folder/file.name'

While a non-special scheme treats the backslash as part of the file name, a special scheme treats it as if it were a forward slash.


After spending some more time thinking about this issue, I don't think a one-size-fits-all solution exists, without us running into the same issues as url.parse() did in the first place.

What we could do is have specialized classes for specific tasks -- like the request target of an HTTP request (aka req.url). We could certainly create such a class.

@ghost

ghost commented Mar 30, 2018

@TimothyGu Interesting behavior that I did not know existed. Maybe this special case can be mentioned in the IANA registration, so that the relative scheme behaves the same as http.

Personally I do not like the idea of specialized classes, because it is against the U concept of URL. On the other hand, I do not see how someone can push the WHATWG to change their position regarding protocol/host relative URL objects.

If you really introduce specialized classes, please provide at least a convenient way to transform it into a regular URL like RelativeURL.toURL(base), or better inherit from URL with empty strings for protocol and host.

@JoshuaWise

I think the real crux of the problem is that req.url is NOT a URL. It's a request-target, which may be other values such as relative paths (mentioned above) or even just an asterisk *. Therefore, it should be treated differently.

Although Node.js currently does no validation on the req.url, the HTTP spec actually has many requirements for the value, which differ from a general-case URL.

RFC 7230 does define a way to generate an "effective request URI", but doing so requires other information bundled within the HTTP request (such as req.method and req.headers.host). Therefore, I propose a function that takes the req object as an argument (not a string url), and converts it to a RequestTarget object, which has getters for the different url components, some of which may be null. A convenience method can be added which builds a URL object, but such a method will require a default host to be passed in as an argument (just like the URL constructor).
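
For illustration, the four request-target forms from RFC 7230, section 5.3 can be told apart roughly like this. classifyRequestTarget is a hypothetical name and the function does no validation at all, so treat it as a sketch of the distinction rather than a parser:

```javascript
// Hypothetical classifier; it only inspects the method and the first character.
function classifyRequestTarget(method, target) {
  if (target === '*') return 'asterisk-form';         // e.g. OPTIONS *
  if (target.startsWith('/')) return 'origin-form';   // absolute-path [ "?" query ]
  if (method === 'CONNECT') return 'authority-form';  // host:port
  return 'absolute-form';                             // full URI, as sent to proxies
}

console.log(classifyRequestTarget('GET', '/where?q=now'));
console.log(classifyRequestTarget('OPTIONS', '*'));
console.log(classifyRequestTarget('CONNECT', 'www.example.com:80'));
console.log(classifyRequestTarget('GET', 'http://www.example.org/pub/'));
```

Only the absolute-form can be handed to the URL constructor without a base; the origin-form needs one, and the other two are not URLs in the WHATWG sense.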

I actually wrote a spec-compliant request-target parser, so it could be used for inspiration: https://github.com/JoshuaWise/request-target

@fromi

fromi commented Sep 24, 2020

Yes, I already worked out the problem with a dummy host indeed. I was only surprised that I had to do it when I discovered the spec, but that's of course a non blocking issue :)

@aduh95
Contributor

aduh95 commented Oct 2, 2020

Note that on Webkit-based browsers, running new URL('#hash') on the DevTools automatically assumes about:blank as base URL:

URL {
hash: "#hash"
host: ""
hostname: ""
href: "about:blank#hash"
origin: "null"
password: ""
pathname: "blank"
port: ""
protocol: "about:"
search: ""
searchParams: URLSearchParams {}
username: ""
__proto__: URL
}

Maybe Node.js could adopt a similar behavior?

@stevenvachon

@aduh95 about:blank doesn't seem applicable to anything but web browsers.

@wesleytodd
Member

wesleytodd commented Oct 2, 2020

any utility APIs that could be added can be implemented by the @nodejs/web-server-frameworks themselves.

I will add an agenda item in our meeting to discuss this. That said, I am not sure what utility APIs we would want as only a part of the http interfaces and not via the URL interface.

I did intend to bring up req.url being an instance of URL, but my assumption was the base would either be the value from the host header (or meta header) or what is derived from the server. I think that would be enough for frameworks and end users without any helper utilities.

I don't mean this to say I don't see value in supporting relative URLs, but I am not sure the @nodejs/web-server-frameworks team are the ones to solve this.

@styfle

This comment has been minimized.

@TimothyGu
Member Author

@aduh95 that's a bug in Safari, not a feature. See whatwg/url#539.

@wesleytodd
Member

@styfle want to move that comment over in nodejs/web-server-frameworks#71? It would be a good starter to the conversation I wanted to have there. And I have comments but don't want to hijack this thread to make them.

@phawxby

phawxby commented Dec 1, 2020

The URL API appears to have charged blindly down the route of strictness and standards and lost a whole bunch of utility in the process. I'm not saying that's necessarily a bad thing, but I think it makes the case that the two APIs should probably remain because they serve two different purposes.

Say I want to just grab the hash portion of a relative URL: url.parse('foo#bar') works just fine, new URL('foo#bar') does not. I don't care about the hostname, or the protocol, or anything like that, I just want an easy way to flexibly parse URLs, especially if those URLs have been input by a user. The new API has lost a lot of utility due to the lack of input flexibility.

If I want to strictly parse full URLs, fine, new URL makes sense. If I want to perform various flexible utility actions on parsed URLs, use url.parse. The two are distinct in their function and both should remain, or the URL API should be made more flexible. There's little point delegating this to third-party modules when there's already code to do this that's being deprecated in favour of a new API that serves an entirely different purpose.

nicokaiser added a commit to florajs/flora that referenced this issue Mar 3, 2021
2.0.9 introduced a new URL parser based on WHATWG URL API. However, if
the request.url is relative (which is the case), parsing fails:
nodejs/node#12682
@mitesh1409

mitesh1409 commented May 2, 2021

Check this one
https://nodejs.org/dist/latest-v14.x/docs/api/all.html#http_message_url

To parse the URL into its parts:

new URL(request.url, `http://${request.headers.host}`);

Once URL object is created this way we can use all its methods and properties.
Ref. Link: https://nodejs.org/dist/latest-v14.x/docs/api/url.html#url_the_whatwg_url_api

@jasnell
Member

jasnell commented May 7, 2021

Given that we've (a) added documentation illustrating how to better handle relative URLs with the WHATWG API, and (b) We've backed off the deprecation of the legacy API, I'm going to close this issue for now. There's still an argument that could be made on the standards level for more ergonomic handling of relative URLs but those discussions are better directed to the whatwg/url repository.

@jasnell jasnell closed this as completed May 7, 2021
@ofhouse

ofhouse commented May 7, 2021

For the sake of completeness here is the corresponding issue in the whatwg/url repository: whatwg/url#531

@alwinb

alwinb commented May 7, 2021

I think this is very important. The WHATWG API has been designed to standardise existing browser behaviour, not to be the general URL API for platforms such as NodeJS. This thread shows that this causes issues, but it is a problem not as much with NodeJS as with limitations of the standard.

I can predict that this will cause more problems down the road (and not just in Node) as the WHATWG API is becoming more widespread and people will necessarily hack around it to make it meet their needs.

I recently completed my research on the technical part of the problem by releasing this somewhat low level library. My hope is that the community can use it as a basis for building a number of more polished URL APIs that do support relative URLs whilst maintaining compatibility with URLs as defined in the WHATWG standard. I have one attempt at such an API here (but please, come up with alternatives).
The theory behind it is solid and is (still being) written down here. Any help in motivating the WHATWG to take on the issue of relative URLs is welcome.

yusufkandemir added a commit to yusufkandemir/quasar that referenced this issue Mar 28, 2022
Here is a lengthy discussion about the problem nodejs/node#12682
rstoenescu pushed a commit to quasarframework/quasar that referenced this issue Mar 29, 2022
* fix(app-vite): Fix SSR publicPath check

* refactor(app-vite): Add JSDoc types for #appOptions

* fix(app-vite): Call SSR injectMiddlewares at the right time to enable publicPath middleware

* fix(app-vite): Correctly use WHATWG URL constructor
Here is a lengthy discussion about the problem nodejs/node#12682
@phawxby

phawxby commented Jun 13, 2022

(b) We've backed off the deprecation of the legacy API

@jasnell has that been formally declared anywhere? It may have been and I've just missed it. Should I PR the typings to remove the deprecation notice?
https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/node/url.d.ts#L63
https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/node/v16/url.d.ts#L63

@jasnell
Member

jasnell commented Jun 13, 2022

Yep, if you look here https://nodejs.org/dist/latest-v18.x/docs/api/url.html#legacy-url-api, you'll see that the old API is now explicitly marked "Legacy" rather than "Deprecated" as of Node.js 15.13.0

@styfle
Member

styfle commented Jan 17, 2023

As of Node.js 19, url.parse() is once again Deprecated 😮‍💨

https://nodejs.org/dist/latest-v19.x/docs/api/url.html#urlparseurlstring-parsequerystring-slashesdenotehost
