Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extensions - Limit depth of search for "info.xml" #24491

Merged
merged 5 commits into from
Oct 22, 2022

Conversation

totten
Copy link
Member

@totten totten commented Sep 9, 2022

Overview

This is an optimization that may slightly improve the performance of system-flushes. Specifically, it takes the largest file search (find info.xml files) and narrows it in a way that should still work on most deployments.

This is follow-up to some of the side-chatter on #24490 (ping @agileware-justin).

Before

When scanning for info.xml files, Civi will consult:

  • CiviCRM Core (aka [civicrm.root])
  • CiviCRM Custom Extension Dir (aka [civicrm.files]/ext)
  • CMS vendor (aka [cms.root]/vendor)

In each folder, it does an exhaustive search of the subtree.

After

When scanning for info.xml files, Civi will consult:

  • CiviCRM Core's "ext" (aka [civicrm.root]/ext)
  • CiviCRM Core's "tools" (aka [civicrm.root]/tools)
  • CiviCRM Custom Extension Dir (aka [civicrm.files]/ext)
  • CMS vendor (aka [cms.root]/vendor)

In each folder, it will only search 3 levels deep.

Technical Details

On my local dmaster (D7; adhoc assortment of extensions), this reduces the number of directory-checks after a system-flush:

## Before changes
$ cv flush; cv ev 'return $GLOBALS["folderChecks"];'
Flushing system caches
4119

## Limiting to maxDepth=4, searching all the original roots
$ cv flush; cv ev 'return $GLOBALS["folderChecks"];'
Flushing system caches
2074

## Limiting to maxDepth=3, searching the updated set of roots
$ cv flush; cv ev 'return $GLOBALS["folderChecks"];'
Flushing system caches
803

I don't know a better way to measure the impact of this. Qualitatively, I expect the impact varies:

Bigger Improvement Smaller Improvement
Drupal 8/9/10 (bigger filesystem) Drupal 7 (smaller filesystem)
Cheap shared host (more cache competition) Local dev workstation (less cache competition)
Infrequent usage (cold filesystem cache) Frequent usage (warm filesystem cache)

This arrangement should reduce the size of scans while still being drop-in compatible for most deployments -- even with some fairly esoteric customizations (e.g. implementers who deploy their own folder with a subtree of extensions; e.g. git repos with multiple extensions; e.g. Civi extensions installed via composer).

There are two main risks of incompatibility:

  1. If you have a custom folder like [civicrm.root]/my-extensions. This would no longer be scanned. (You could move such a folder; eg to [civicrm.root]/ext/my-extensions.)
  2. If you have a deep hierarchy. (For this, you can override the setting ext_max_depth.)

If this update is accepted, then we may also want to add a pre-upgrade check which performs the old scan and warns about any extensions that would be affected.

(Updated) There is pre-upgrade check to warn about deeply nested extensions, e.g.

Screen Shot 2022-10-20 at 4 48 13 PM

@civibot
Copy link

civibot bot commented Sep 9, 2022

(Standard links)

@civibot civibot bot added the master label Sep 9, 2022
@agileware-justin
Copy link
Contributor

Thanks very much @totten for taking my comments on board.

Asking because I do not know, what is the reason for searching for extensions in directories other than just these two?
CiviCRM Core's "ext" (aka [civicrm.root]/ext)
CiviCRM Custom Extension Dir (aka [civicrm.files]/ext)

@totten
Copy link
Member Author

totten commented Sep 9, 2022

what is the reason for searching for extensions in directories other than just these two?

  • CiviCRM Core's "ext" (aka [civicrm.root]/ext)
  • CiviCRM Custom Extension Dir (aka [civicrm.files]/ext)

Well, basically because they were already being searched.... But more specifically:

  • CMS vendor (aka [cms.root]/vendor) -- If you do composer require to pull in an ext, then this is where (by default) it lands.
  • CiviCRM Core's "tools" (aka [civicrm.root]/tools) -- This was used+suggested before [civicrm.root]/ext existed. Still appears in some builds.

(EDIT) So the overall effect here is to reduce depth of search and to skip some of the larger civicrm-core folders (./CRM, ./Civi, ./templates, etc).

@eileenmcnaughton
Copy link
Contributor

I just did a quick check & got this path

extensions/wmf-thankyou/Civi/WorkflowMessage/ThankYou - which I guess would requre that var to be set. We could perhaps initially set it by scanning to find an appropriate value for the site?

@eileenmcnaughton
Copy link
Contributor

Also - my understanding is there is already a way to opt out cmsroot.vendor - as the most expensive to scan in many / most cases

@totten
Copy link
Member Author

totten commented Sep 9, 2022

@eileenmcnaughton To clarify:

  • CRM_Utils_File::findFiles() is the low-level utility. It is used by many searches (eg searching for info.xml and for *.mgd.php). The PR adds a parameter ?int $maxDepth. This is basically a (REF) - an optional parameter, no systemic effect.
  • CRM_Extension_* does a search for info.xml files. (It calls findFiles($dir, 'info.xml').) The PR applies a max-depth to the info.xml search. This depth is configurable (with a default 3).
  • This has no effect on other searches (e.g. the *.mgd.php search remains the same).

So regarding ...extensions/wmf-thankyou/Civi/WorkflowMessage/ThankYou:

  • The ...extensions needs to be a known base. (If that's your custom extensionsDir, then it's good.)
  • The info.xml should be found at ...extensions/wmf-thankyou/info.xml. (That's only one level below extensionsDir.)
  • The file Civi/WorkflowMessage/ThankYou is found by a different scan. The PR doesn't affect it.

@totten
Copy link
Member Author

totten commented Sep 9, 2022

(@eileenmcnaughton) Also - my understanding is there is already a way to opt out cmsroot.vendor - as the most expensive to scan in many / most cases

Yeah... on configurability, the substance of getFullContainer() should be backed by a configurable list, but that opens a bike-sheddy question. (The list should be a global PHP array! No, it should it an env-var or constant with PATH_SEPARATORs. No, it should be a setting! No, it should be a SQL table! No, it should be an OptionGroup!) Shrug.

I think it's OK to do another issue/discussion/update about configuring the search-list. The option to limit depth remains useful(+simple) either way:

  • If we stick to the non-configurable search-list, then depth-limits improve performance for everyone.
  • If we make the search-list configurable, then depth-limits improve performance for anyone who has a big folder.

(It's fine to say "We don't put extensions in vendor/ on this site" -- but for a site that does have a big folder like vendor/, then they're back to the expensive search. Unless you have the depth-limit. Depth-limit helps any configuration.)

@@ -753,6 +757,7 @@ public static function findFiles($dir, $pattern, $relative = FALSE) {
: '@' . preg_quote(DIRECTORY_SEPARATOR) . '\.@';

$dir = rtrim($dir, '/');
$baseDepth = substr_count($dir, '/');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR does proceed, this would give a random number between 0, and, say, 16, on windows. On a typical install civicrm root looks like C:/path/to/drupal\sites\all\modules\civicrm\, and of course sometimes people edit it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@demeritcowboy Yeah, I had meant to ping you about that. :)

I guess the only way out is through. Pushed a change that should work with both delimiters (while still keeping the string-operations minimal).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erm - sorry I wasn't trying to open a discussion about configuring the search list - my understanding is this IS existing functionality....

/**
 * Which directories should we exclude when scanning the codebase for things
 * like extension .info files, or .html partials or .xml files etc. This needs
 * to be a valid preg_match() pattern.
 *
 * If you do not define it, a pattern that excludes dirs starting with a dot is
 * used, e.g. to exclude .git/). Adding suitable patterns here can vastly speed
 * up your container rebuilds and cache flushes. The pattern is matched against
 * the absolute path. Remember to use your system's DIRECTORY_SEPARATOR the
 * examples below assume /
 *
 * The default excludes node_modules (can be huge), various CiviCRM dirs that
 * are unlikely to have anything we need to scan inside, and (what could be
 * your) Drupal's private file storage area. It does not exclude
 * vendor but you are likely to see an improvement by adding it.
 *
 * See https://docs.civicrm.org/sysadmin/en/latest/setup/optimizations/#exclude-dirs-that-do-not-need-to-be-scanned
 * and also discussion on including vendor (excluded) in https://lab.civicrm.org/dev/core/-/issues/2031
 */
if (strtoupper(substr(PHP_OS, 0, 3)) !== 'WIN' && !defined('CIVICRM_EXCLUDE_DIRS_PATTERN')) {
  define('CIVICRM_EXCLUDE_DIRS_PATTERN', '@/(\.|node_modules|js/|css/|bower_components|packages/|sites/default/files/private)@');
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artfulrobot contributed that I believe

// They are roughly equivalent. (The differences are described by a secret book hidden in the tower of Mordor.)
$depth = substr_count($path, '/');
if (DIRECTORY_SEPARATOR !== '/') {
$depth += substr_count($path, PATH_SEPARATOR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be DIRECTORY_SEPARATOR instead of PATH.

But the bigger question is maybe who is going to use this setting, and, if it should exist, then could an upgrade script calculate what it should be based on the state of the system at the time (extensions get installed and uninstalled, but the depth is unlikely to increase except for a very weird extension).

Also I get an error if I set the setting to 2: Uncaught Error: Class 'Civi\Api4\Subscriber\AutocompleteSubscriber' not found in ...\ext\afform\core\afform.php:59. So it suggests it shouldn't be allowed to be less than 3.

As noted it is a bit difficult to evaluate the effect. I'm not sure where efforts should focus. We have a couple suggestions scattered a bit:

  • Serialize is slow
  • The change in the class scanning PR.
  • excessive cache clearing
  • Does it need to scan anything (maybe it's convenient, but has a trade-off)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably this exclusion would still be in use.

@artfulrobot
Copy link
Contributor

Commenting because pinged by @eileenmcnaughton. @totten's proposal looks ok to me. I can't speak for those avanteguarde types who install extensions with composer, but thinking through the longest "reasonable" path.

  • I could imagine that some people have a subdir in ext/ for contrib/custom (like drupal folks might for modules/themes).
  • Some extension repos bundle a few related extensions under subdirs.

So

{custom-ext-dir}/client-bespoke/some-repo/extension-a/extension-a.info.xml

Is that 3 levels deep (counting dirs) or 4 (counting dirs up to the file)? If three, and therefore would be found, I think that's fine.

@totten
Copy link
Member Author

totten commented Sep 13, 2022

(@demeritcowboy) This should be DIRECTORY_SEPARATOR instead of PATH.

🤦 Fixed.

(@eileenmcnaughton) my understanding is this IS existing functionality... CIVICRM_EXCLUDE_DIRS_PATTERN...

Yeah, that's still the same as before. IMHO, it's reasonable to have a global/crude policy to (eg) exclude .git from all ordinary file-searches.

(That doesn't mean max-depth should be a global/crude policy -- eg the contracts and de-facto practices are different for *.setting.php and *.mgd.php. A depth-limit that works for one may break the other.)

(@artfulrobot) So

{custom-ext-dir}/client-bespoke/some-repo/extension-a/info.xml (edited for typo)

Is that 3 levels deep (counting dirs) or 4 (counting dirs up to the file)? If three, and therefore would be found, I think that's fine.

Thanks, that's a good way to describe the subtlety. It's counting dirs. I agree with this target a "reasonable-long' . The example would match a filter of maxDepth=3. (I wasn't certain off the top of my head -- but testFindFilesDepth() is supposed to control for this, and it has a similar example. Contact/Import/Parser/Contact.php matches maxDepth=3.)

Is it better to describe the depth with or without the file? I'm not sure. FWIW, in Unix, find -maxdepth N counts the filename as part of N. It would be fairly easy to change right now.

(@demeritcowboy) ...who is going to use this setting, and, if it should exist... depth is unlikely to increase except for a very weird extension...

As Rich's example showed, the depth for info.xml is the result of a few independent decisions/opinions (eg sysadmin: "let's use ext/contrib; developer: "let's bundle a related ext"). I think the person who modifies the setting is someone who (a) is upgrading an existing system and (b) has gotten boxed-in (because the prior admin setup complicated folders, and their inhouse scripts depend on these folders, and some developers included bundled exts, and core barfs due to the cumulative depth, and they're not sure what part is safe to move).

So... the fewer people who need to use the setting, the better. It exists as a hedge or pressure-relief-valve.

@demeritcowboy
Copy link
Contributor

@totten Do you still have a commit sitting on your computer that didn't make it to the cloud? It still says PATH_SEPARATOR.

@eileenmcnaughton
Copy link
Contributor

@totten last comment seems to be the blocker

@demeritcowboy demeritcowboy added the merge ready PR will be merged after a few days if there are no objections label Oct 20, 2022
@totten
Copy link
Member Author

totten commented Oct 20, 2022

Pushed up the missing fix re:PATH_SEPARATOR.

I also added the pre-upgrade message. The message only appears if you have deeply nested extensions that would otherwise become unavailable. Here's how it looks in web UI and CLI when two extensions (deeperest and org.example.evenmore) are affected by the change:

Screen Shot 2022-10-20 at 4 48 13 PM

Screen Shot 2022-10-20 at 4 48 32 PM

If you follow the instruction and reload, then the warning will clear up.

@demeritcowboy
Copy link
Contributor

jenkins retest this please (seems like a date rollover thing)

@demeritcowboy demeritcowboy merged commit b054128 into civicrm:master Oct 22, 2022
@totten totten deleted the master-ext-depth branch October 22, 2022 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
master merge ready PR will be merged after a few days if there are no objections
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants