Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ImportCache of CanonicalizeResult is broken for relative imports handled by any importer that is not a baseImporter #2208

Closed
2 tasks done
ntkme opened this issue Apr 3, 2024 · 13 comments
Labels
Milestone

Comments

@ntkme
Copy link
Contributor

ntkme commented Apr 3, 2024


See https://github.com/sass/sass-spec/pull/1969/files for reproduction of the issue.

@ntkme ntkme changed the title Import cache for child imports within files imported via custom FileImporter is broken Import cache for child imports within files imported via custom Importer/FileImporter is broken Apr 3, 2024
@ntkme ntkme changed the title Import cache for child imports within files imported via custom Importer/FileImporter is broken ImportCache of CanonicalizeResult from relative imports are broken Apr 3, 2024
@ntkme
Copy link
Contributor Author

ntkme commented Apr 3, 2024

After reviewing the code, I think the issue is in the API design that we are expecting a single baseImporter that will resolve all relative URLs.

if (baseImporter != null && url.scheme == '') {
var relativeResult = _relativeCanonicalizeCache.putIfAbsent(
(
url,
forImport: forImport,
baseImporter: baseImporter,
baseUrl: baseUrl
),
() => _canonicalize(baseImporter, baseUrl?.resolveUri(url) ?? url,
baseUrl, forImport));
if (relativeResult != null) return relativeResult;
}
return _canonicalizeCache.putIfAbsent((url, forImport: forImport), () {
for (var importer in _importers) {
if (_canonicalize(importer, url, baseUrl, forImport) case var result?) {
return result;
}
}

In these lines, the _relativeCanonicalizeCache is only used when import is resolved by baseImporter, however, when the base importer could not resolve, or it does not exist, the import is passed to other importers, and if the other importers resolved the relative url, it would cache a relative url into _canonicalizeCache which is suppose to only contain absolute url.

I can see two issues here:

  1. There is no interface difference between baseImporter and importers, that both of them can handle relative url, but when importers handles a relative url, it caches relative url as if it's absolute, which is broken.
  2. There is currently a limit of only one baseImporter to handle relative url, but a user might indeed want to implement multiple importers (e.g. a Importer and a FileImporter at the same time) that can handle relative url. I have a use case that I actually implemented this, and found this bug. See: Optimize custom importer sass-contrib/sassc-embedded-shim-ruby#85

In my opinion, I think the idea of baseImporter is an over complication of the problem, that an single array of importers is enough, and a relative import should always go into _relativeCanonicalizeCache.

@ntkme ntkme changed the title ImportCache of CanonicalizeResult from relative imports are broken ImportCache of CanonicalizeResult is broken for relative imports handled by any importer that is not a baseImporter Apr 3, 2024
ntkme added a commit to sass-contrib/sassc-embedded-shim-ruby that referenced this issue Apr 4, 2024
ntkme added a commit to sass-contrib/sassc-embedded-shim-ruby that referenced this issue Apr 4, 2024
ntkme added a commit to sass-contrib/sassc-embedded-shim-ruby that referenced this issue Apr 4, 2024
@stof
Copy link
Contributor

stof commented Apr 5, 2024

Relative paths are resolved using the importer that loaded the current module (this is what the baseImporter is, which is why there is only one).

When going through other importers, URLs are not meant to be considered relative ones. To me, the test you added in sass-spec does not respect the rules than were defined in the spec for importers (and I think it is impossible to define the same importer in the Dart API).

@nex3 reading the changelog in https://github.com/sass/dart-sass/releases/tag/1.68.0, it looks like filesystem importers have a special behavior in the JS API which has no equivalent in the Dart API. Is it expected ? It looks like this is the root cause of this caching issue because it allows JS filesystem importers to resolve the same URL to different canonical URLs.

@ntkme
Copy link
Contributor Author

ntkme commented Apr 5, 2024

When going through other importers, URLs are not meant to be considered relative ones.

This is exactly where the bug is. We are passing relative urls to importers array, but caching it as “absolute” url. In my opinion this is flawed. If importers are meant to only handle absolute urls why are we even passing relative url to importers, shouldn’t it just fail right after base importer not able to resolve? In that case, shouldn’t compile (file) API also have a base importer option?

The fundamental issue I see is the concept of resolving relative url and canonicalizing a url is not necessarily the same: a resolved url from base url is only meant to be an absolute url, and not necessarily a canonical url. This makes it utterly confusing to ”canonicalize” a relative url with three different patterns in the current API:

  1. Relative url is resolved to absolute url from base URL, and then canonicalized by base importer.
  2. Relative url not resolved (no base url), and then resolved and canonicalized by the base importer.
  3. Relative url not resolved, and then resolved and canonicalized by importers.

In my experience, these mixed patterns make it extremely difficult to write generic custom importer or importers that correctly handles all the cases (and it’s not well documented). As a user of the API, I end up finding out that not relying on the base importer which may or may not resolve the url, and only using importers to handle both url resolution and canonization, is much more flexible, easier and consistent.

To me, the test you added in sass-spec does not respect the rules than were defined in the spec for importers.

From thread I’ve read, it appears that the initial design is exactly as you described, that the base importer is meant to take a resolved url, and user can further canonicalize a resolved url, and then importers are meant to handle any unresolved url, which is “assumed” to be absolute, but not enforced. Unfortunately, the community missed the initial RFC and when the API is already out for a long time, webpack team comes and say they need the ability to customize the “resolving” of the url, rather than just given an already resolved url to canonicalize. This was accepted and that’s why we have containingUrl. This change means the previous restrictions in spec, if any, should have been lifted, that what I’m doing (customizing the resolution of relative url) is completely legal as long as my implementation is deterministic.

One of the major design objectives of new module system and new API is deterministic resolution and canonization. The cache in question is based on the assumption of that importers are deterministic and always returns the canonicalized absolute url with the same input. For any single importer it should be deterministic given the same (url, baseUrl) pair, and when the url is absolute the baseUrl does not matter. - I think this is common sense for resolving relative url, and it is why I’m changing the cache key to align with it.

ntkme added a commit to sass-contrib/sassc-embedded-shim-ruby that referenced this issue Apr 5, 2024
@nex3
Copy link
Contributor

nex3 commented Apr 8, 2024

Sorry to chime in a bit late here, I was out sick for a chunk of last week.

I do think there are bugs here, but I don't think the fix in #2209 is exactly correct. Let me give some background that may help explain the way things work today.

History Lesson

The Old Days

When we first added support for importers to Sass, we didn't have a rigorous understanding or specification of exactly what the strings after an @import rule were. They weren't URLs—in early Sass versions, loading an absolute file: URL would not have worked (I think it may still not work in the last Ruby Sass release). If anything, they were local filesystem paths that happened to be written with forward slashes because that worked on all OSes. One of several loose conventions we inherited from the Ruby milieu in which Sass first emerged.

Given that, all @imports were just relative paths by necessity, since there was no "absolute" format that would work consistently everywhere. If you wanted to load a library file, you'd add that library's root to your load paths and then use a path that was relative to that root. Importers grew out of this logic, as a generalization of the idea of "a load path" to something that could theoretically load from places other than the filesystem (or load from the filesystem in more complex ways). They still just took relative paths that they looked for "within" the importer, whatever that meant for that specific importer. So for example, if you had an importer that looked up Sass files in a database, you'd still write @import "path/to/style", and it would receive "path/to/style", look up that key in the database, and return the Sass contents if it existed and say "I didn't find anything" if it didn't.

In this system, we still wanted to enforce the idea that relative loads always take precedence over loads from the load path. But because we weren't officially working with URLs, the only way we were able to do so was by fiat. We just had a separate Importer.find_relative_url() entrypoint that we just told implementers to ignore the load path for and only do a relative resolution. It wasn't great, and Node Sass dropped even this, allowing load paths to take precedence over local resolution in some cases and muddying the waters even further.

Dawn of Dart Sass

When I was implementing Dart Sass, especially knowing that @use was around the corner and we wanted to have a notion of "the same file" so we could avoid @import's double-load problem, I changed the system around. I came up with the distinction between the "canonicalize" step, which produced a canonical URL that uniquely identified a Sass file that would never change during the compilation (and would only be loaded once by @use) and the "load" step which actually loaded the contents of that file. This naturally led us to being more precise that the things being loaded are URLs specifically, so that canonicalization always produces a fully-resolved absolute URL that we can use as a cache key.

But we still had to deal with the fact that all the existing importers (and users!) thought in load-path terms, where they just used relative URLs to load from their importers. We couldn't just say "all URLs going into importers are absolute from the start". And we also needed to ensure that relative loads would take precedence over these load-path-style loads. I think the solution was rather clever: because all loads were now represented as URLs anyway, we'd represent relative paths as URLs that were absolute but not canonical, based on taking the relative URL that appeared in the file and resolving it based on that file's canonical URL. canonicalize() methods were already expected to be able to deal with absolute URLs, so if they followed the rules this should just work automatically. This corresponds to Pattern 1 in @ntkme's list above.

Then we could safely pass a truly relative URL to canonicalize() when resolving that URL as a load path. This corresponds to Pattern 3. Pattern 2, although there does exist code for it, shouldn't actually happen in practice. Looking through the logic I think it's actually possible through StylesheetGraph.modifiedSince(), but that's a bug. We should just make that an error and fix the case where it occurs.

Contextual Complications

Things get more complex from here. In order to support Node-style package resolution, we had to add a means for even absolute loads to see what file they're being loaded from. While it's true that the WebKit loader folks were the first to ask for this, it's also strictly necessary for the Node package importer, so we'd have wanted to add it sooner or later anyway. Node.js's package resolution algorithm is unavoidably sensitive to where the load occurs, because it needs to use the most local node_modules folder to disambiguate between multiple versions of the same package (itself a bit of madness, but that's another long post).

To accommodate this without going fully back to the old Ruby Sass days of fully relying on importer implementors to correctly follow all the rules, we added a way for importers to "opt in" to getting the contextual information: they would indicate which absolute URL schemes they would only accept as non-canonical absolute URLs and never return from canonicalize(), and in return when they received URLs with those schemes (or load-path-style relative URLs with no scheme at all) they would have access to the base URL for those files. Thus a clear separation is maintained between relative loads (which always use canonical URL schemes) and absolute loads (which may not).

This is where the second bug snuck in, though. Because in the old system canonicalization was expected to be purely dependent on its input arguments, we had started caching it to avoid extra filesystem hits for commonly-loaded stylesheets. But now, in some circumstances, canonicalization could depend on the containing URL, and we didn't include that in the cache. This is the specific bug that's highlighted by sass/sass-spec#1969.

How Should We Fix It?

I think the critical element of the fix here is pretty simple: don't cache canonicalize() when containingUrl is used. The question then is, how do we know if it's used or not? The easy answer is to just never cache canonicalize() if the containing URL is available to canonicalize(). However, this essentially breaks our ability to cache any load-path-style canonicalization, including any use of file system importers in the JS API.

The more complex answer would be to stop caching for a given importer once that importer accesses containingUrl in practice. This would require more infrastructure, including in the embedded protocol, but it may be worth the pain—filesystem resolution involves a lot of disk IO because non-canonical Sass URLs can correspond to so many potential files, and avoiding that for repeated loads can be a real performance boon.

In the short term, since this does produce visible behavior bugs, I'm going to fix those using the broader disable. Then we can look into adding caching back with appropriate cache-busting later on.

@ntkme
Copy link
Contributor Author

ntkme commented Apr 9, 2024

Pattern 2, although there does exist code for it, shouldn't actually happen in practice.

When I use compileString API with no "url" defined for string input, a "relative" import from the root stylesheet would go into pattern 2, with a custom base importer. For example:

sass.compileString('@import "a";', {
  importer: {
    canonicalize (url, _context) {
      console.log(`// canonical(${url})`)
      return new URL('u:a')
    },
    load (_url) {
      return {
        contents: 'a {b: c}',
        syntax: 'scss'
      }
    }
  },
})

@ntkme
Copy link
Contributor Author

ntkme commented Apr 9, 2024

In the short term, since this does produce visible behavior bugs, I'm going to fix those using the broader disable. Then we can look into adding caching back with appropriate cache-busting later on.

My initial solution was to effectively have a per importer canonicalization cache whenever baseUrl (containingUrl) is passed, regardless of whether it is used, but only for relative url that has no scheme. Performance wise it is going to be slower but better than no cache.

I understand this is not necessarily correct. For example, if somehow absolute urls are canonicalized differently depends on different baseUrl, this can very well still be broken (IMO if this really happens it's the importer's fault, that's why I did not consider this case). However, if we expand this idea to both relative and absolute urls to always use a cache key of url, forImport, importer, baseUrl, resolveUrl: although this will increase the cache size and need multiple checks due to cache being per importer which can slow down the performance, but in theory this cache should always be correct regardless of containingUrl is accessed or not.

@nex3 What do you think? I updated the PR to implement this.

@nex3
Copy link
Contributor

nex3 commented Apr 9, 2024

When I use compileString API with no "url" defined for string input, a "relative" import from the root stylesheet would go into pattern 2, with a custom base importer. For example:

sass.compileString('@import "a";', {
  importer: {
    canonicalize (url, _context) {
      console.log(`// canonical(${url})`)
      return new URL('u:a')
    },
    load (_url) {
      return {
        contents: 'a {b: c}',
        syntax: 'scss'
      }
    }
  },
})

This is an invalid invocation per the TypeScript types: StringOptionsWithImporter.url is mandatory, so its behavior is officially undefined (as is the behavior for passing anything as undefined that's required by the TS types).

That said, it looks like the Dart API and the embedded protocol both do allow this pattern. Per spec, this would end up invoking the WHATWG URL parser with a relative URL string and an undefined base URL, which would return a failure. We don't do that currently—we just treat this as a load-path-style canonicalization for the base importer, which doesn't really make sense. We should issue a deprecation for this behavior and clarify that the url must be a valid canonical URL if importer is passed.

I understand this is not necessarily correct. For example, if somehow absolute urls are canonicalized differently depends on different baseUrl, this can very well still be broken (IMO if this really happens it's the importer's fault, that's why I did not consider this case).

This is intentional and desirable behavior: for example, in the Node.js package importer, absolute pkg: URLs depend on the baseUrl (exposed as containingUrl to the importer) because Node.js package resolution is context-sensitive as described above.

However, if we expand this idea to both relative and absolute urls to always use a cache key of url, forImport, importer, baseUrl, resolveUrl: although this will increase the cache size and need multiple checks due to cache being per importer which can slow down the performance, but in theory this cache should always be correct regardless of containingUrl is accessed or not.

An important goal of caching canonicalization is to avoid the repeated process of querying each importer in the chain for URLs we already know how to load. You're right that we could avoid the issue by just building all the possible information into the cache key, but then we'll only get cache hits if we're reloading the same URL from the same source file—something that's only going to happen in --watch mode on the CLI. The much more important case is something like @use "bootstrap/variables", which we want to hit the cache once it's been resolved once anywhere, as long as none of the importers it checks are context-sensitive.

I have an in-progress PR which should address these issues.

@nex3 nex3 added the bug label Apr 9, 2024
@nex3 nex3 added this to the 2.0.0 milestone Apr 9, 2024
@nex3
Copy link
Contributor

nex3 commented Apr 9, 2024

Having spent the day working through all the consequences of deprecating Pattern 2, I'm not quite sure that it's correct to remove it. We do currently use it to represent situations like parsing stdin on the CLI and allowing the entrypoint to load files from the working directory without making that available as a load path everywhere. Using a load-path-style importer for this is a bit of a hack, but any alternative—like creating a fake URL for the file—is also a bit of a hack, and has knock-on consequences like not being able to use null to represent the fact that the file doesn't have a real URL. I'm going to sleep on it and figure out what to do later on.

All that is separate from the caching issue, which is still definitely a bug and essentially independent of the question of how the base importer is handled (since the base importer never has access to the containing URL anyway).

@ntkme
Copy link
Contributor Author

ntkme commented Apr 9, 2024

The more complex answer would be to stop caching for a given importer once that importer accesses containingUrl in practice.

I agree that adding an accessedContainingUrl field into CanonicalizeResult is probably the best solution here, but the quoted statement above does not sound right to me. E.g. I can write an importer, that if the url given is an absolute url with scheme, I don't use containingUrl, but if it's a relative url, I will resolve it against containingUrl. In that case containingUrl is conditionally accessed by this importer, so that completely stop caching that importer is not optimal, it's better to strictly cache the CanonicalizeResult based on whether accessedContainingUrl is true or false.

While this do require additional infrastructure, it should be pretty straight forward to implement. As we already have accessed_argument_lists for ArugmentList in embedded protocol, this accessed_containing_url can work similarly.

@nex3 Let me know if you want to take it, or I'm happy to contribute some time on it.

I took a brief look and it is indeed a bit complicated, because in the native dart code containingUrl is implemented as Zone values, so it would be a bit different than how js importer/embedded host would handle whether the value is accessed or not, but it should be doable.

@nex3
Copy link
Contributor

nex3 commented Apr 9, 2024

Yeah, after sleeping on this, I think leaving the existing behavior as-is for Pattern 2 probably makes the most sense. The most accurate representation of the underlying logic here would be to have a mandatory base URL and a flag indicating whether that's also the canonical URL, but I think that's more complicated than it's worth for what is ultimately an edge case—as is going through a full deprecation process to move to a solution that's also not ideal.

@ntkme I agree that we don't need to cache-bust the entire importer on a containingUrl access. I have an implementation mostly finished that refuses to cache it if containingUrl is passed, which should be fairly straightforward to adapt to one that works similarly if containingUrl is accessed. I'd like to land mine ASAP to fix the bug, but if you want to take on the second half afterwards, I'd be much obliged.

nex3 added a commit to sass/sass that referenced this issue Apr 11, 2024
nex3 pushed a commit to sass/sass-spec that referenced this issue Apr 16, 2024
@ntkme
Copy link
Contributor Author

ntkme commented Apr 18, 2024

One note we might want to add to the change log is that the containingUrl access tracking is "potentially" a breaking change for stateful Importers:

  • For single importer with persisted state: E.g. in embedded-host-node's emulated legacy JS API, it tracks the import stack by persisting the stack without directly using containingUrl. The logical effect is that it depends on containingUrl, but containingUrl was not directly accessed, therefore I had to make changes to force access containingUrl to make sure cache behaves correctly.
  • For multiple importers that shares state: E.g. in sassc-embedded-shim-ruby's emulated legacy SassC Ruby API, it use a pair of Importer and FileImporter to wrap the legacy SassC::Importer interface, which may return either a string content or a file path. The way it works is that, first the modern Importer runs the legacy SassC::Importer. If the legacy importer returns a string content, the modern Importer return the canonicalized url, and handle the load. If the legacy importer returns a file path, the modern Importer will save the result in memory and return nil, then the next FileImporter will read the result from memory and return the canonicalized url without any extra computation. - In this case the FileImporter is not reading containing_url directly, but logically it has been accessed. Therefore, I had to add an extra call to containing_url in the FileImporter, just to mark it as accessed to make sure it is non-cacheable.

In real world, the first case is likely pretty rare. However, the second case can be somewhat common as a performance optimization to 1) avoid compute the same result twice in two different importers; and 2) avoid reading large raw files on embedded host and then send via protobuf to compiler. - In fact, this optimization can also be applied to the legacy importer in embedded-host-node.

We will never know if a user's importer is stateful or stateless. Most of them are likely stateless, but there might be some stateful Importers getting impacted by this change, so I think it's worth to document how caching for canonicalization works and how could it impact a stateful importer.

@nex3
Copy link
Contributor

nex3 commented Apr 18, 2024

Importers are really expected not to be stateful, so I'm not concerned about subtle behavioral changes for them. It's fine to have a few exceptions for backwards-compatibility purposes with old bad importer APIs, but non-tooling authors should never do that.

@ntkme
Copy link
Contributor Author

ntkme commented May 1, 2024

Closing this as we decided to abandon the deprecation and cache optimizations has been delivered in 1.76.0.

@ntkme ntkme closed this as completed May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants