Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes in V8/ICU behaviour should be detected&tested&documented|overridden #42440

Open
LiviaMedeiros opened this issue Mar 22, 2022 · 8 comments

Comments

@LiviaMedeiros
Copy link
Contributor

Version

v14.17.6, v16.14.1, v18.0.0-pre

Platform

Linux nepbook 5.17.0-gentoo-nep #1 SMP Tue Mar 22 08:34:44 +08 2022 i686 Intel(R) Atom(TM) CPU N270 @ 1.60GHz GenuineIntel GNU/Linux

Subsystem

intl, deps

What steps will reproduce the bug?

Example 1

console.log(new Intl.DateTimeFormat('en', { calendar: 'iso8601' }).resolvedOptions().calendar);
console.log(new Intl.DateTimeFormat('en', { calendar: 'BAD_BANANA' }).resolvedOptions().calendar);

node v14.17.6:

gregory
file:///tmp/calendar-14.17.6.mjs:2
console.log(new Intl.DateTimeFormat('en', { calendar: 'BAD_BANANA' }).resolvedOptions().calendar);
            ^

RangeError: Invalid calendars : BAD_BANANA
    at new DateTimeFormat (<anonymous>)
    at file:///tmp/calendar-14.17.6.mjs:2:13
    at ModuleJob.run (internal/modules/esm/module_job.js:170:25)
    at async Loader.import (internal/modules/esm/loader.js:178:24)
    at async Object.loadESM (internal/process/esm_loader.js:68:5)

node v16.14.1:

iso8601
gregory

node v18.0.0-pre:

iso8601
file:///tmp/calendar-18.0.0.mjs:2
console.log(new Intl.DateTimeFormat('en', { calendar: 'BAD_BANANA' }).resolvedOptions().calendar);
            ^

RangeError: Invalid calendar : BAD_BANANA
    at new DateTimeFormat (<anonymous>)
    at file:///tmp/calendar-18.0.0.mjs:2:13
    at ModuleJob.run (node:internal/modules/esm/module_job:197:25)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:341:24)
    at async loadESM (node:internal/process/esm_loader:88:5)
    at async handleMainPromise (node:internal/modules/run_main:61:12)

Example 2

process.env.LANG = 'C';
console.log(new Date().toLocaleString());

node v14.17.6, v16.14.1:

3/23/2022, 1:33:55 AM

node 18.0.0-pre:

2022-3-23 1:33:56 // LANG="jp" was defined in environment

How often does it reproduce? Is there a required condition?

Depends on Node version; maybe on system-icu, environment, etc.

What is the expected behavior?

Failed tests on deps bump, investigation, decision, solution.

What do you see instead?

Inconsistent behaviour in userland, weird and hard-to-debug bugs.

Additional information

I'm not 100% sure, but example 2 seems to be caused by internal caching in v8/icu. Basically it ignores setted values, such as process.env.{LANG,LC_ALL,LC_TIME}. Similar problem happened with TZ setter earlier; Node overrides it with #20026
No clue about example 1.
Not sure if any of these must be "fixed" in Node itself (i.e. as separate issues/PRs).

Both problems could have been catched by simple tests. Improving coverage would allow to prevent such problems in the future.

Note: if CI runs in some sort of default/canonical environments, it might be a good idea to make it possible to do multiple or random combinations somehow, especially for i18n-related parts.
There is a pretty common tendency between developers to set en_US.UTF-8 locale just to have more searchable and shareable outputs, so some bugs might slip off even from a huge multilingual community.

@bnoordhuis
Copy link
Member

As a rule of thumb, you should not expect assigning environment variables inside a script to do anything. Or if it does, to behave consistently across node versions. Set it outside: env LANG=C node script.js

You mention TZ. The special handling of TZ node can't really do for other environment variables.

As to the rest of your bug report, I'm not sure what exactly you expect us to do. If you're arguing for more test coverage, by all means go and open a pull request.

@LiviaMedeiros
Copy link
Contributor Author

I totally agree that reassigning env variables in runtime is usually a bad idea; but I don't see a clear borderline between TZ and LANG here.
If it is strongly discouraged, and users should never expect it to "work", maybe the whole process object in userland should be frozen. Or emit warnings on any mutation.
If we are admitting use cases for that and trying to fix them where possible, it would be great to document known issues, workarounds and node-specific handlings more explicitly.

The general idea is to make relationship between Node releases and corresponding ICU versions a bit closer.

I'll be glad to PR some extra tests for observable behaviours, but first there are few questions:

  • Is it possible to write tests only for non-system-icu? There are ways to detect presence of icu and to distinguish between small and full versions, and there is a convenient common.hasIntl; but how to skip a test for system-icu and run for full-icu?
    My heuristic assumption is isSystemICU = process.config.variables.icu_data_in === undefined && process.config.variables.icu_endianness === undefined but it doesn't look reliable.

  • Are there any methods to run tests in many different env combinations atm? For both manual runs and CI?
    This looks like an acceptable way to test some oneliners with specific variables:

execFile(
  process.execPath,
  ['-p', 'doSomething()'],
  { ...process.env, ...testingEnvVariables },
  common.mustSucceed()
);

...but it might be nice to have a lower-overheaded way to masstest a huge set of env values.

  • Maybe I missed something, and this problem already was addressed before, or there are strong reasons to treat ICU as black box.

@bnoordhuis
Copy link
Member

maybe the whole process object in userland should be frozen

That was tried, briefly, 8 or 9 years ago and it broke the world. It'd be infinitely worse today.

Is it possible to write tests only for non-system-icu?

Kind of ad hoc but process.config.variables.icu_gyp_path === 'tools/icu/icu-generic.gyp' should work.

Are there any methods to run tests in many different env combinations atm?

execFile() and other child_process methods are how we usually do that.

nodejs-github-bot pushed a commit that referenced this issue May 5, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
RafaelGSS pushed a commit that referenced this issue May 10, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
juanarbol pushed a commit that referenced this issue May 31, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
danielleadams pushed a commit that referenced this issue Jun 27, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
targos pushed a commit that referenced this issue Jul 12, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
targos pushed a commit that referenced this issue Jul 31, 2022
PR-URL: #42683
Refs: #42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
guangwong pushed a commit to noslate-project/node that referenced this issue Oct 10, 2022
PR-URL: nodejs/node#42683
Refs: nodejs/node#42440
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Antoine du Hamel <[email protected]>
@srl295
Copy link
Member

srl295 commented Oct 19, 2022

Hi, found this test while reviewing #45068

I applaud the desire to track down challenging bugs, however:

A challenge here is that ICU output reflects ever-changing linguistic cultural and global preferences.

Inconsistent behaviour in userland, weird and hard-to-debug bugs.

Can you give more specifics?

The general idea is to make relationship between Node releases and corresponding ICU versions a bit closer.

It may not need to be closer. Node has a minimum ICU version it can run with, but it doesn't have a maximum. ICU (and the CLDR data it pulls from) can and does update frequently, especially. Data may be subsetted due to space constraints (small icu).

If you want to make a specific test, you should key the test to match exactly the following process.versions keys: icu, unicode, cldr, tz, and skip if process.config.variables.icu_small. That still might fail in some user configured situations.

  • Is it possible to write tests only for non-system-icu? There are ways to detect presence of icu and to distinguish between small and full versions, and there is a convenient common.hasIntl; but how to skip a test for system-icu and run for full-icu?

process.config.variables.icu_path might work, it depends on how it's built.

  • Maybe I missed something, and this problem already was addressed before, or there are strong reasons to treat ICU as black box.

I would treat it as a source of cultural data, and not as something to expect the same results from. That sounds strange, but consider the nature of what is being done.

ICU's own tests already must change dramatically with each CLDR release.

@ryzokuken
Copy link
Contributor

Hello! The ECMA-402 spec specifies the shape of the Intl API. However, all outputs of format functions (which is most of Intl) are implementation and environment dependent and subject to arbitrary changes in the underlying environment.

While we're nearing quite a bit of consistency on many locales, much of the data is still in flux and any attempt to "test" certain outcomes for format functions would end up in a GitHub whack-a-mole with no clear benefit.

If you see the test262 specification tests for the Intl API (https://github.com/tc39/test262/tree/main/test/intl402), you'll see little to no testing of the format outputs since there's really only one restriction in the spec: that the output should be a human-understandable string of text.

Perhaps one thing we could do would be to run ICU tests as part of our testing suite but I doubt we float any patches over ICU for that to be necessary.

@LiviaMedeiros
Copy link
Contributor Author

Sorry for late reply.

Inconsistent behaviour in userland, weird and hard-to-debug bugs.

Can you give more specifics?

For example, let's say, there is a template literal with a Date object in it, processed by regular expression that assumes this part of string to be within en locale.
This might fail in many other locales, and if the literal is generated and processed by third-party libs, we have to force the locale before calling dangerous code.
Updating ICU to v70.x changed if process.env.LANG = 'en'; can do it.

If you want to make a specific test, you should key the test to match exactly the following process.versions keys: icu, unicode, cldr, tz, and skip if process.config.variables.icu_small. That still might fail in some user configured situations.

I assume that we don't have to do it in Node.js as long as we don't update these separately. Whenever icu is user-provided (aka system-icu), its behaviour is arbitrary so the test should be skipped.

Overall the idea here is to test Node.js integration with its dependencies, not the dependencies themselves. For example, we don't test if it produces specific result in specific locale and specific timezone, but sometimes we have to provide specific sample to test if mutating process.env works as expected.

@srl295
Copy link
Member

srl295 commented Nov 3, 2022

Sorry for late reply.

no problem, thanks for replying

Inconsistent behaviour in userland, weird and hard-to-debug bugs.

Can you give more specifics?

For example, let's say, there is a template literal with a Date object in it, processed by regular expression that assumes this part of string to be within en locale. This might fail in many other locales, and if the literal is generated and processed by third-party libs, we have to force the locale before calling dangerous code. Updating ICU to v70.x changed if process.env.LANG = 'en'; can do it.

First of all, the regular expression should not assume an en locale, that's trying to test the dependency (ICU) itself. I don't know about process.env.LANG, but ICU 70.1 is still sensitive to the LANG environment variable. You might be hitting something else having to do with v8?

If you want to make a specific test, you should key the test to match exactly the following process.versions keys: icu, unicode, cldr, tz, and skip if process.config.variables.icu_small. That still might fail in some user configured situations.

I assume that we don't have to do it in Node.js as long as we don't update these separately. Whenever icu is user-provided (aka system-icu), its behaviour is arbitrary so the test should be skipped.

This assumption is problematic— they are updated separately. Time zone 2022f has come out since ICU 72.1 was integrated, and more are coming. ISO 4217 has also issued an updated amendment with a new end date for the HRK (Croatian Kuna) as of just yesterday morning.

And even if ICU is not system-icu, it may be a different ICU - a newer one, or updated data, such as with new time zone data.

Overall the idea here is to test Node.js integration with its dependencies, not the dependencies themselves. For example, we don't test if it produces specific result in specific locale and specific timezone, but sometimes we have to provide specific sample to test if mutating process.env works as expected.

As @ryzokuken noted, run test262. Or, actually it might be possible to run ICU's own tests, but that's by far a more complicated build process.

@LiviaMedeiros
Copy link
Contributor Author

First of all, the regular expression should not assume an en locale, that's trying to test the dependency (ICU) itself.

This situation might easily happen due to misintegration of 3rd-party libs: one part returns Date and assumes that calling code knows how to safely stringify it; another part uses template literal and assumes that the data was a string of an object with safe @@toPrimitive; third part assumes that resulting text doesn't contain anything above lower ASCII range (regex with something like [A-Za-z0-9:.+-/_()] is just one example of such assumption).

I don't know about process.env.LANG, but ICU 70.1 is still sensitive to the LANG environment variable.

AFAICT the issue with LANG was about caching environment variables: it cares if the variable was set beforehand, but ignores mutating of process.env afterwards. This is why it becomes an issue on Node.js side.

You might be hitting something else having to do with v8?

Not that I'm aware of.

This assumption is problematic— they are updated separately. Time zone 2022f has come out since ICU 72.1 was integrated, and more are coming. ISO 4217 has also issued an updated amendment with a new end date for the HRK (Croatian Kuna) as of just yesterday morning.

And even if ICU is not system-icu, it may be a different ICU - a newer one, or updated data, such as with new time zone data.

I'm not quite sure how exactly it might cause any issues. Do you mean different Node.js releases, or that Node.js built with full-icu can ignore system ICU but use system TZdata, or something else?

As @ryzokuken noted, run test262. Or, actually it might be possible to run ICU's own tests, but that's by far a more complicated build process.

Running tests that are updated together with the dependency itself doesn't help in a context of this problem: they will pass even if there are breaking changes. Running tests from previous release might produce too many unrelated failures and still not catch a breaking change with Node.js integration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants