Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rush-lib] Ignore specified files when calculating project state hash #2643

Merged
merged 2 commits into from
May 14, 2021

Conversation

elliot-nelson
Copy link
Collaborator

@elliot-nelson elliot-nelson commented Apr 24, 2021

Summary

Implementation of #2618. The intent is to allow each project (in its rush-project.json file) to optionally specify glob patterns that should be ignored when calculating project state, which in turn, will allow those files to be ignored for the purposes of incremental build watching, cloud build cache hits, etc.

The new option is called incrementalBuildIgnoredGlobs, and is interpreted as a dot-ignore file.

Details

PR needs a little cleanup, but it does work (tested locally on our monorepo here and does exactly what I was hoping). Hoping to get feedback before I finish up the PR, to make sure this approach makes sense.

The new model of ConfigurationFile, which helps define riggable / extendable configuration files, is all async-based and takes a Terminal for possible warning output; this means that anything above it that uses it also needs to be async and to accept a Terminal. This requires a little bit of hacking on several chains (like ProjectWatcher -> ProjectBuildCache -> PackageChangeAnalyzer etc.).

It's worth noting that ProjectBuildCache() always accepted a RushProjectConfiguration that was previously loaded, but, that wasn't that useful in this situation. That config is for the project for which a hash is being calculated; if, for example, A is dependent on B, if B specified disableBuildCache in its project configuration, that doesn't stop A from calculating the files in B as part of its hash. But for this new feature, we do need A to ignore files that B ignores, so we need to load the project configuration for every project in the repo essentially. (Or, possibly, load them on the fly as we hit projects we haven't loaded the project configuration for yet; I opted for the former approach for simplicity but we can enhance that.)

How it was tested

Currently tested in our current monorepo, for example, with project A depending on B:

  • Add ignoreChangedFiles: "*.md" to B's rush-project.json.
  • Tweak B/src/index.ts and rush build; rebuilds B and A as expected.
  • Tweak B/README.md and rush build; B and A get cache hits as expected.
  • Add ignoreChangedFiles: "*.png" to A's rush-project.json.
  • Tweak A/logo.png and rush build; A gets cache hit as expected.
  • Tweak B/logo.png and rush build; rebuilds B and A as expected (A is ignoring PNGs, but B is not)

* A list of file globs under the project root that will be ignored when
* determining whether this project has changed.
*/
ignoreChangedFiles?: string[];
Copy link
Collaborator

@octogonz octogonz Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phrase "changed files" sounds like it is somehow related to the rush change feature, which this isn't.

If they are globs, we should call them "globs". (Heft has some settings like "includeGlobs" and "excludeGlobs" for example.)

You suggested ignoreFilePatternsForBuildState, but that sounds limited to rush build, which this isn't. The incremental build analysis can be enabled for various Rush commands, which includes rush build but also custom commands. A user might define an incremental command like rush generate-docs and then enable this setting:

    //   /**
    //    * If true then this command will be incremental like the built-in "build" command
    //    */
    //   "incremental": false,

(We could provide separate ignore globs for each command, but that seems like an advanced feature that will get entangled with other design problems like phased commands. Note that phases will introduce "incremental phases" in addition to the "incremental commands" that we have today.)

My point is that "incremental" or "incremental build" is perhaps more clear than "build" by itself.

So based on all this, maybe the setting could be something like:

  • incrementalIgnoredGlobs
  • incrementalBuildIgnoredGlobs
  • incrementalCommandIgnoredGlobs

@iclanton what do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think incrementalBuildIgnoredGlobs is just about right. The phrase "incremental build" is the most common phrasing by far, in my opinion, so it makes sense to say "here's ignored globs for incremental build" (the fact that it applies to other commands potentially is OK, since those commands "act like incremental build" -- that's the way you'd describe them to someone).


"ignoreChangedFiles": {
"type": "array",
"description": "A list of file globs under the project root that will be ignored when determining whether this project has changed.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible improved wording:

The incremental analyzer can skip Rush commands for projects whose input files have not changed since the last build. Normally, every Git-tracked file under the project folder is assumed to be an input. Use the incrementalBuildIgnoredGlobs setting to ignore specific file paths. The paths are specified as globs relative to the project folder.


const projectConfiguration:
| RushProjectConfiguration
| undefined = await RushProjectConfiguration.tryLoadForProjectAsync(project, undefined, terminal);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iclanton Can the rush-project.json settings be inherited from a rig? That seems useful

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that they can today, I have it working in our local monorepo with 5.43.0. 🎉


const projectConfiguration:
| RushProjectConfiguration
| undefined = await RushProjectConfiguration.tryLoadForProjectAsync(project, undefined, terminal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd want to do an async parallel invocation if we are waiting for file reads in this loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is very much critical path for all rush commands that do any work

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! I was hunting around the codebase and I just might not have found it yet -- is there a generic bottleneck / rate limiting / parallelism-limited async map somewhere in rush-lib that I can use?

(I'm pretty sure dropping a Promise.all(map) bomb is actually slower for 200+ files, so I guess I'd feel better if we could limit it to a parallelism of 16 or something.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hunting around the codebase and I just might not have found it yet -- is there a generic bottleneck / rate limiting / parallelism-limited async map somewhere in rush-lib that I can use?

@elliot-nelson Weirdly no, I can't seem to find one. (Which is weird -- I'm sure someone encountered this before!) The idiomatic pattern would be an API something like:

import { Async } from '@rushstack/node-core-library';

// (or whatever the right API design is)
Async.throttle(3,
  [
    () => doAsyncThing(),
    () => doAsyncThing(),
    () => doAsyncThing(),
    () => doAsyncThing(),
    () => doAsyncThing(),
    () => doAsyncThing(),
  ] 
);

If you want to recommend a high quality throttling NPM package, we could use this PR to add a @beta API in @rushstack/node-core-library that wraps it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put up a PR with an alternate pretty quick option -- #2665.

@@ -92,7 +110,16 @@ export class PackageChangeAnalyzer {
| RushConfigurationProject
| undefined = this._rushConfiguration.findProjectForPosixRelativePath(filePath);
if (owningProject) {
projectHashDeps.get(owningProject.packageName)!.set(filePath, fileHash);
const relativePath: string = path.relative(owningProject.projectRelativeFolder, filePath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend just slicing off the prefix, since owningProject will only be defined if filePath literally starts with projectRelativeFolder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it keeps all paths normalized to POSIX slashes (which git returns, and which projectRelativeFolder has been normalized to).

@@ -92,7 +110,16 @@ export class PackageChangeAnalyzer {
| RushConfigurationProject
| undefined = this._rushConfiguration.findProjectForPosixRelativePath(filePath);
if (owningProject) {
projectHashDeps.get(owningProject.packageName)!.set(filePath, fileHash);
const relativePath: string = path.relative(owningProject.projectRelativeFolder, filePath);
const ignoreMatchers: IMinimatch[] | undefined = ignorePatternsByProject.get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a dependency on ignore, which can run this like it is processing a .gitignore file and theoretically optimize a bit.

@elliot-nelson elliot-nelson requested a review from patmill as a code owner April 28, 2021 14:25
@elliot-nelson elliot-nelson changed the title [rush-lib] Ignore specified files when calculating project state hash (FEEDBACK ONLY) [rush-lib] Ignore specified files when calculating project state hash Apr 28, 2021
@elliot-nelson
Copy link
Collaborator Author

Updated PR title and body, and made following changes:

  • Renamed option to incrementalBuildIgnoredGlobs, and updated the description based on @octogonz's suggestion.
  • Replaced minimatch with ignore (a great change). Added a unit test and slight wording change to convey this in the description -- this actually makes the feature much more powerful, as it allows configurations like ['**', '!only-this-folder/*']. I don't know that we would necessarily highlight that fact, as a large monorepo with a bunch of fancy ignore logic is bound to start slowing down the incremental change logic.
  • Fixed up remaining unit tests and added rush change entry.

Took PR out of FEEDBACK state, as I think it is ready to go except for the suggested improvement of loading rush-project.json files in parallel. I can make that change if there's a good way in this codebase to load a limit of 10-16 or so files at a time.

@elliot-nelson
Copy link
Collaborator Author

Latest cleanup commit -- load project configuration files in parallel, using the new Async.forEachAsync function that was ported over to the node core library.

At this point I believe all feedback has been addressed, let me know if any other concerns remain.

@@ -29,9 +31,12 @@ export class PackageChangeAnalyzer {
this._git = new Git(this._rushConfiguration);
}

public getPackageDeps(projectName: string): Map<string, string> | undefined {
public async getPackageDeps(
Copy link
Collaborator

@octogonz octogonz May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repo uses a suffix to indicate async methods, so generally a function like this should be renamed to getPackageDepsAsync() when converting. The same applies to other functions modified by this same PR.

): Promise<Ignore> {
const projectConfiguration:
| RushProjectConfiguration
| undefined = await RushProjectConfiguration.tryLoadForProjectAsync(project, undefined, terminal);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Ideally this should be cached somewhere, but we don't need to solve that in this PR.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want to do that here; this is a wrapper for the heft-config-file library which does is own internal caching of all config files you load with it, and it seems like that's the best way to do it.

(I did file a related ticket, #2691, that will improve the caching further.)

projectHashDeps.get(owningProject.packageName)!.set(filePath, fileHash);
const relativePath: string = filePath
.replace(owningProject.projectRelativeFolder, '')
.replace(/^\//, '');
Copy link
Collaborator

@octogonz octogonz May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a code comment explaining what transformation is happening here?

It might be helpful to include a realistic example of the input/output, something like:

        // Example:
        // filePath =               category/project/src/thing.ts
        // projectRelativeFolder =  category/project
        // relativePath =           src/thing.ts
        const relativePath: string = filePath
          .replace(owningProject.projectRelativeFolder, '')
          .replace(/^\//, '');

It seems obvious to me now, but I had to run this code in the debugger to figure out what it is doing. :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correctness of replace() hinges on the assumption that filePath will always contain projectRelativeFolder, but it's difficult to prove that, since filePath and projectRelativeFolder come from fairly different subsystems.

Here's a couple ideas to protect against that:

  1. Use path.relative() to operate on path objects rather than text strings; OR
  2. First test whether filePath starts with projectRelativeFolder and throw an InternalError if that not. The assumption is probably safe, but if not it would be nice to hear about it so we can fix it.

(BTW I tried some experiments to see if I could break this logic, like altering the case of a project folder on Windows. But the LookupByPath API malfunctions first preventing this code from running. Maybe I'm being overly defensive -- it's just if replace() did fail to match, the behavior might be pretty difficult to understand. 🙂)

@dmichon-msft

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base logic here is that, since LookupByPath cannot return a matching project for a path that does not literally begin with projectRelativeFolder, all paths must necessarily begin with said string prefix. Personally I'd just blindly slice the expected length off of the start of the string, because that is how I prefer to do path manipulation (as quickly as possible, with a minimum of fuss).
Actually the .slice() call is safer here, because, contrary to what my memory had said, I only normalize the projectRelativeFolder to / when handing it off to LookupByPath, so you can technically (incorrectly) use \\ inside of your paths in rush.json and weird things will happen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've split the difference by adding a comment and converting it to use @dmichon-msft 's suggested logic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I am content that sufficient paranoia has been applied here heheh

@octogonz octogonz merged commit 9e51c9b into microsoft:master May 14, 2021
@dmichon-msft
Copy link
Contributor

In the repo I work on, this change has increased the run time of PackageChangeAnalyzer._getData() by an order of magnitude (from ~200 ms to 2.4 seconds). The bulk of this time appears to be in reading the rush-project.json files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants