
[Feature] Toggle contenthash for all output filenames #518

Closed
lukeed opened this issue Nov 11, 2020 · 15 comments · Fixed by #1001
Comments

@lukeed
Contributor

lukeed commented Nov 11, 2020

Code-splitting already produces chunks that are hashed (chunk.[contenthash].js), which is awesome, but the entry files themselves are still written with filenames that match the inputs, unless outfile is given something specific.

Ideally, one would have the ability to include a content-hash as part of the entry (and/or all) file outputs too.
Rollup does this via output.entryFileNames, though IMO the full set of template patterns isn't necessary, nor is a string template.

Perhaps a global build() option could be added, e.g. contenthash: boolean?
(Not suggesting an out-prefixed name, since it'd apply generally, much like minify and sourcemap.)

When disabled, filenames are untouched.
When enabled, the content hash is calculated and injected into all file names (bundle.js -> bundle.[hash].js).


This may already be on the agenda via the comments in #268 (comment), but after searching a bit, I couldn't find anything that addressed non-chunk hashing specifically.

@evanw
Owner

evanw commented Nov 25, 2020

Thanks for logging this. Hashing of entry points is indeed on the agenda.

One of the hold-ups is that I haven't yet investigated what other bundlers do as a point of comparison. I actually like Rollup's solution to this since it solves multiple things in an elegant way (e.g. sub-directories too). Thanks for linking their docs. The full solution would be something like #553 but it'd be nice to have something built in.

The other reason is that my current approach of generating the file name by hashing the file doesn't work in this case. Entry points caused by dynamic import() expressions could introduce cycles, which up until this point have been absent. This is a general limitation with my current code splitting approach. I'm planning an overhaul of the code splitting algorithm to address this.

@lukeed
Contributor Author

lukeed commented Nov 25, 2020

Sounds good! Worth mentioning that Rollup's hashing isn't deterministic* and is/was up for reconsideration. I've personally run into this issue a few times.

*While I don't understand it fully, I think their hashing is based on import order and reference counting. In that sense it's deterministic, but the order in which Rollup follows imports may differ between builds. The output contents may be identical yet live under different output names. IMO the hash should be 100% content-based.

@benmccann

I think the reason Rollup's hashes aren't 100% content-based is that Rollup doesn't know the content at the hashing stage. E.g. if file A imports file B, then Rollup writes something like import "./b.48gad2f4211.js"; into file A, so the hash must already be generated at that point. Rollup plugins may dynamically change imports, and ES6 allows circular dependencies; that combination makes pure content hashing not really possible. You also can't just ignore the imports, because it's common that the only thing that changes is the import: if file A didn't change but file B did, then both A and B need new hashes, since A imports B.

@TylorS

TylorS commented Jan 19, 2021

It may or may not be useful, but I had this need while switching over to snowpack, and I wrote a library to help do that - https://github.com/TylorS/typed-content-hash

@Ventajou

Ventajou commented Feb 5, 2021

Hashing based on content is awesome for caching; I was able to do that in my old Webpack build, unless I misunderstood what they meant by hash 😃

The level of control that Webpack offers in regard to output file names and locations (both JS and assets) is really good. I was able to replicate that for assets using plugins in esbuild, but that's not easily feasible for JS files, because any edit to the output files will likely break the sourcemaps...

@TylorS

TylorS commented Feb 5, 2021

@Ventajou FWIW, the library I mentioned above will remap all your sourcemaps with the hash changes.

@TylorS

TylorS commented Feb 19, 2021

Hey @evanw, I don't know much Go, but I would love to help out here if I can. I'd be willing to contribute code if pointed in the right direction, but in the meantime I can offer some unsolicited advice based on the algorithm in the library I linked above.

I used Rollup as the basis for my algorithm since this thread mentioned it, so I figured I could try to break it down. Assume you start with the otherwise-final output (including banners/footers/etc.):

  1. Calculate Strongly Connected Components

Since you're tackling code-splitting and care a lot about perf., you may have already encountered the need to calculate strongly connected components using something like Tarjan's algorithm. This lets you convert a cyclic graph into an acyclic one, usually represented as a list of lists. If a nested list contains a single item, there is no cycle; if it contains multiple items, those items are strongly connected, i.e. they form a cycle.

const output = [ ['a'], ['b', 'c'], ['d'] ] // b + c form a cycle

Conveniently, Tarjan's algorithm will already have produced a topological sort, so you shouldn't need to do any additional sorting.
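For reference, a minimal Tarjan's implementation looks like this (a generic textbook sketch, not code from esbuild or my library; the adjacency-map input format is just for illustration):

```javascript
// Tarjan's strongly connected components. Returns SCCs in reverse
// topological order, i.e. dependencies appear before their dependents.
function tarjan(graph) {
  let index = 0;
  const stack = [];
  const onStack = new Set();
  const idx = new Map(); // discovery index per node
  const low = new Map(); // lowest index reachable per node
  const sccs = [];

  function strongconnect(v) {
    idx.set(v, index);
    low.set(v, index);
    index++;
    stack.push(v);
    onStack.add(v);
    for (const w of graph[v] || []) {
      if (!idx.has(w)) {
        strongconnect(w);
        low.set(v, Math.min(low.get(v), low.get(w)));
      } else if (onStack.has(w)) {
        low.set(v, Math.min(low.get(v), idx.get(w)));
      }
    }
    // v is the root of an SCC: pop everything above it off the stack.
    if (low.get(v) === idx.get(v)) {
      const scc = [];
      let w;
      do {
        w = stack.pop();
        onStack.delete(w);
        scc.push(w);
      } while (w !== v);
      sccs.push(scc);
    }
  }

  for (const v of Object.keys(graph)) if (!idx.has(v)) strongconnect(v);
  return sccs;
}
```

A single-element component means no cycle; a multi-element component is a cycle, and the output order already has each component's dependencies listed first.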

  2. Sequentially rewrite imports and compute hashes

As you traverse the sorted list, when you encounter a list with no cycles you can 1) rewrite its import/export specifiers with the previously calculated hashes of its dependencies, and 2) calculate the hash for that file and cache it for later items in the list.

As you encounter lists that represent cycles, you'll quickly notice it is impossible to follow the same pattern.

Potentially in parallel, you'll create a hash for each item in the cycle upfront, before rewriting the imports. To do so, you concatenate the document's contents with the contents of all of its dependencies recursively, excluding anything that has already been concatenated. This keeps the hash deterministically content-based while still ensuring dependents get new hashes when their dependencies change.

Again we'll want to cache the hashes we computed for later iterations. Repeat until you've made it through all the lists.


I hope it made some sense written out, but I'd be happy to clarify any points and help figure out any specifics for esbuild. I have some TypeScript code samples if those would be helpful at all as well.

I could potentially see going about it differently, such that all content hashes are calculated in parallel by using the same cycle algorithm for everything. I'm not sure if Go is better equipped than JS to handle this, but it could be worth a try. I was worried it would scale poorly with large/complex dependency graphs.

@evanw
Owner

evanw commented Feb 19, 2021

Yes, I'm already thinking along similar lines. Not sure if this is exactly what you said or just very similar, but I can describe the algorithm I am currently in the middle of implementing. It's done in three phases:

  1. In serial (linking is a serial task): Determine chunk boundaries and cross-chunk imports. Assign each chunk a long random id that doesn't appear anywhere else in the source code.

  2. Completely in parallel: Render each chunk using the random ids as the paths for cross-chunk imports. This approach supports plugins making arbitrary modifications to the chunk data. Also calculate the hash of the post-plugin content excluding the random ids, which are easy to identify with a simple string scan (no need to parse the content again).

  3. Completely in parallel: Determine the final hash for each chunk with a depth-first traversal over all dependencies (including yourself). For each dependency, include the content hash and the output path template (the output path without the final hash) in the final hash. So the hash should change if the contents or paths of any dependency changes. Then replace the random ids in the output content with the final output paths (output path template + final hash) of the other imported chunks and also simultaneously adjust source map offsets as you go.

I think this should naturally handle cycles without having to split them up into connected components first. It also lets the whole graph go in parallel instead of getting bottlenecked waiting for dependencies to finish first.

@lukeed
Contributor Author

lukeed commented Feb 19, 2021

Assign each chunk a long random id

Can you explain this part in a little more detail? The "random id" part worries me on its own – will this mean that hashing isn't deterministic? Between builds, output filenames should only get a new hash if their contents changed, and not be subject to a randomized base for hashing on each pass.

@evanw
Owner

evanw commented Feb 19, 2021

It's just a placeholder that gets completely removed at the end and is excluded from the content hash, so all builds would still be completely deterministic. The point of the long random id is just to make it virtually impossible for it to be confused with actual source code. If you used something like __esbuild_id_0__, it's conceivable that your source code might actually contain something like that, since that's something a human might write.

The only way it wouldn't be deterministic is if you have a plugin that does something non-deterministic like sorting imports based on their name I guess. I could also use a deterministic seed for the random number generator, but then you start getting into the same problem where the id might end up in the source code. Output from one build might somehow make it into another build? Maybe that's not something to worry about though.

Hope that makes sense. Any thoughts after reading that?

@lukeed
Contributor Author

lukeed commented Feb 19, 2021

Ah, that sounds great. Thanks!

No immediate thoughts, except a possibly-useless suggestion: you may want to hash the source files' absolute paths to derive the id, instead of entrusting it to a rand() (which is going to be fine basically all the time anyway).

@TylorS

TylorS commented Feb 19, 2021

Hey @evanw,

Thanks for getting back. This does indeed sound very similar with the addition of the temporary random ids and using paths as part of the hash.

The random ids make sense to me for replacements. I just kept track of start/end offsets when parsing for dependencies, but I didn't have the bundling or plugins to worry about.

Adding the output path to the template doesn't quite make as much sense to me, could you elaborate on the intent? My understanding is that if module A depends on module ./B (no cycles), and ./B becomes ./C with no other changes, B and C should have the same deterministic content hash. A, however, would get a new content hash because its content changed via the import of B changing to C.

@evanw
Owner

evanw commented Feb 19, 2021

Adding the output path to the template doesn't quite make as much sense to me, could you elaborate on the intent?

To avoid bottlenecks, all substitutions of the final paths into the final output files (i.e. phase 3) happen in parallel. This means that when determining the final path of A the final path of the import to B/C is not available because it's being computed in parallel. So the final path must only depend on information from the previous phase (i.e. phase 2) which only hashes file contents with the import paths removed (since import paths are computed in phase 3).

Specifically:

  • A's hash includes:
    • The content of A excluding import paths and the output path template of A excluding the hash
    • The content of B excluding import paths and the output path template of B excluding the hash
  • B's hash includes:
    • The content of B excluding import paths and the output path template of B excluding the hash

If you don't include the output path template then you have this:

  • A's hash includes:
    • The content of A excluding import paths
    • The content of B excluding import paths
  • B's hash includes:
    • The content of B excluding import paths

In which case A's hash doesn't change if ./B changes to ./C.

@TylorS

TylorS commented Feb 20, 2021

That all appears sound to me now that you broke it down for me, thanks! I'm really excited for this feature!

Also many thanks for this project ❤️, I really appreciate the goal of keeping the scope limited and focused.

@evanw
Owner

evanw commented Feb 20, 2021

Good to hear. Thanks for sanity-checking it!
