tools: support full-icu by default
Instead of an English-only icudt64l.dat in the repo,
we now have icudt64l.dat.bz2 with all locales.

- updated READMEs and docs
- shrinker now copies source, and compresses (bzip2) the ICU data file
- configure expects deps/icu-small to be full ICU with a full
compressed data file

Fixes: #19214
Co-Authored-By: Richard Lau <[email protected]>
Co-Authored-By: Jan Olaf Krems <[email protected]>
Co-Authored-By: James M Snell <[email protected]>
PR-URL: #29522

Reviewed-By: Jan Krems <[email protected]>
Reviewed-By: Jiawen Geng <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Michaël Zasso <[email protected]>
srl295 authored and Trott committed Oct 3, 2019
1 parent a71fb97 commit 1a25e90
Showing 11 changed files with 194 additions and 135 deletions.
47 changes: 29 additions & 18 deletions BUILDING.md
@@ -35,21 +35,23 @@ file a new issue.
* [Building Node.js](#building-nodejs-1)
* [Android/Android-based devices (e.g. Firefox OS)](#androidandroid-based-devices-eg-firefox-os)
* [`Intl` (ECMA-402) support](#intl-ecma-402-support)
* [Default: `small-icu` (English only) support](#default-small-icu-english-only-support)
* [Build with full ICU support (all locales supported by ICU)](#build-with-full-icu-support-all-locales-supported-by-icu)
* [Unix/macOS](#unixmacos)
* [Windows](#windows-1)
* [Building without Intl support](#building-without-intl-support)
* [Trimmed: `small-icu` (English only) support](#trimmed-small-icu-english-only-support)
* [Unix/macOS](#unixmacos-1)
* [Windows](#windows-2)
* [Use existing installed ICU (Unix/macOS only)](#use-existing-installed-icu-unixmacOS-only)
* [Build with a specific ICU](#build-with-a-specific-icu)
* [Building without Intl support](#building-without-intl-support)
* [Unix/macOS](#unixmacos-2)
* [Windows](#windows-3)
* [Use existing installed ICU (Unix/macOS only)](#use-existing-installed-icu-unixmacOS-only)
* [Build with a specific ICU](#build-with-a-specific-icu)
* [Unix/macOS](#unixmacos-3)
* [Windows](#windows-4)
* [Building Node.js with FIPS-compliant OpenSSL](#building-nodejs-with-fips-compliant-openssl)
* [Building Node.js with external core modules](#building-nodejs-with-external-core-modules)
* [Unix/macOS](#unixmacos-3)
* [Windows](#windows-4)
* [Unix/macOS](#unixmacos-4)
* [Windows](#windows-5)
* [Note for downstream distributors of Node.js](#note-for-downstream-distributors-of-nodejs)

## Supported platforms
@@ -598,31 +600,40 @@ $ make
## `Intl` (ECMA-402) support

[Intl](https://github.com/nodejs/node/blob/master/doc/api/intl.md) support is
enabled by default, with English data only.
enabled by default.

### Default: `small-icu` (English only) support
### Build with full ICU support (all locales supported by ICU)

By default, only English data is included, but
the full `Intl` (ECMA-402) APIs. It does not need to download
any dependencies to function. You can add full
data at runtime.
This is the default option.

### Build with full ICU support (all locales supported by ICU)
#### Unix/macOS

With the `--download=all`, this may download ICU if you don't have an
ICU in `deps/icu`. (The embedded `small-icu` included in the default
Node.js source does not include all locales.)
```console
$ ./configure --with-intl=full-icu
```

#### Windows

```console
> .\vcbuild full-icu
```

### Trimmed: `small-icu` (English only) support

In this configuration, only English data is included, but the full
`Intl` (ECMA-402) APIs are still available. It does not need to download
any dependencies to function. You can add full data at runtime.

#### Unix/macOS

```console
$ ./configure --with-intl=full-icu --download=all
$ ./configure --with-intl=small-icu
```

#### Windows

```console
> .\vcbuild full-icu download-all
> .\vcbuild small-icu
```

### Building without Intl support
87 changes: 61 additions & 26 deletions configure.py
@@ -11,6 +11,8 @@
import shlex
import subprocess
import shutil
import bz2

from distutils.spawn import find_executable as which

# If not run from node/, cd to node/.
@@ -409,7 +411,7 @@
intl_optgroup.add_option('--with-intl',
action='store',
dest='with_intl',
default='small-icu',
default='full-icu',
choices=valid_intl_modes,
help='Intl mode (valid choices: {0}) [default: %default]'.format(
', '.join(valid_intl_modes)))
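The effect of the changed default can be checked in isolation. The following is a minimal sketch, not the real `configure.py`: it rebuilds only the `--with-intl` option with `optparse`, and it assumes `valid_intl_modes` holds the four modes listed in `doc/api/intl.md`. The group title and the helper `parse_intl_mode` are hypothetical names used for illustration.

```python
import optparse

# Assumption: the four modes documented in doc/api/intl.md.
valid_intl_modes = ['none', 'small-icu', 'full-icu', 'system-icu']

parser = optparse.OptionParser()
# Hypothetical group title; the hunk does not show how the group is created.
intl_optgroup = optparse.OptionGroup(parser, 'Internationalization')
intl_optgroup.add_option('--with-intl',
    action='store',
    dest='with_intl',
    default='full-icu',  # was 'small-icu' before this commit
    choices=valid_intl_modes,
    help='Intl mode (valid choices: {0}) [default: %default]'.format(
        ', '.join(valid_intl_modes)))
parser.add_option_group(intl_optgroup)


def parse_intl_mode(argv):
    """Return the Intl mode selected by the given argument list."""
    options, _ = parser.parse_args(argv)
    return options.with_intl
```

With no arguments, `parse_intl_mode([])` falls back to the new default, while `parse_intl_mode(['--with-intl=small-icu'])` still selects the trimmed build explicitly.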
@@ -1399,38 +1401,35 @@ def write_config(data, name):
icu_parent_path = 'deps'

# The full path to the ICU source directory. Should not include './'.
icu_full_path = 'deps/icu'
icu_deps_path = 'deps/icu'
icu_full_path = icu_deps_path

# icu-tmp is used to download and unpack the ICU tarball.
icu_tmp_path = os.path.join(icu_parent_path, 'icu-tmp')

# canned ICU. see tools/icu/README.md to update.
canned_icu_dir = 'deps/icu-small'

# use the README to verify what the canned ICU is
canned_is_full = os.path.isfile(os.path.join(canned_icu_dir, 'README-FULL-ICU.txt'))
canned_is_small = os.path.isfile(os.path.join(canned_icu_dir, 'README-SMALL-ICU.txt'))
if canned_is_small:
warn('Ignoring %s - in-repo small icu is no longer supported.' % canned_icu_dir)

# We can use 'deps/icu-small' - pre-canned ICU *iff*
# - with_intl == small-icu (the default!)
# - with_icu_locales == 'root,en' (the default!)
# - deps/icu-small exists!
# - canned_is_full AND
# - with_icu_source is unset (i.e. no other ICU was specified)
# (Note that this is the *DEFAULT CASE*.)
#
# This is *roughly* equivalent to
# $ configure --with-intl=small-icu --with-icu-source=deps/icu-small
# $ configure --with-intl=full-icu --with-icu-source=deps/icu-small
# .. Except that we avoid copying icu-small over to deps/icu.
# In this default case, deps/icu is ignored, although make clean will
# still harmlessly remove deps/icu.

# are we using default locales?
using_default_locales = ( options.with_icu_locales == icu_default_locales )

# make sure the canned ICU really exists
canned_icu_available = os.path.isdir(canned_icu_dir)

if (o['variables']['icu_small'] == b(True)) and using_default_locales and (not with_icu_source) and canned_icu_available:
if (not with_icu_source) and canned_is_full:
# OK- we can use the canned ICU.
icu_config['variables']['icu_small_canned'] = 1
icu_full_path = canned_icu_dir

icu_config['variables']['icu_full_canned'] = 1
# --with-icu-source processing
# now, check that they didn't pass --with-icu-source=deps/icu
elif with_icu_source and os.path.abspath(icu_full_path) == os.path.abspath(with_icu_source):
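The new selection rule in this hunk (use the canned `deps/icu-small` only when it is marked as full ICU and no `--with-icu-source` was given) can be sketched as a standalone function. This is an illustration under assumptions: `choose_icu_source` and its return shape are hypothetical, and the warning is printed rather than routed through the `warn` helper in `configure.py`.

```python
import os


def choose_icu_source(canned_icu_dir, with_icu_source=None,
                      icu_deps_path='deps/icu'):
    """Return (icu_full_path, use_canned) based on the README marker files."""
    canned_is_full = os.path.isfile(
        os.path.join(canned_icu_dir, 'README-FULL-ICU.txt'))
    canned_is_small = os.path.isfile(
        os.path.join(canned_icu_dir, 'README-SMALL-ICU.txt'))
    if canned_is_small:
        # A small-icu checkout is no longer usable as the canned copy.
        print('WARNING: Ignoring %s - in-repo small icu is no longer '
              'supported.' % canned_icu_dir)
    if (not with_icu_source) and canned_is_full:
        # Default case: build straight from the canned directory.
        return canned_icu_dir, True
    # Otherwise fall back to deps/icu; --with-icu-source handling is not shown.
    return icu_deps_path, False
```

Keying the decision on a marker file rather than on directory existence means a stale `deps/icu-small` from an older checkout is ignored instead of being silently built as if it were full ICU.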
@@ -1508,29 +1507,40 @@ def write_config(data, name):
icu_endianness = sys.byteorder[0]
o['variables']['icu_ver_major'] = icu_ver_major
o['variables']['icu_endianness'] = icu_endianness
icu_data_file_l = 'icudt%s%s.dat' % (icu_ver_major, 'l')
icu_data_file_l = 'icudt%s%s.dat' % (icu_ver_major, 'l') # LE filename
icu_data_file = 'icudt%s%s.dat' % (icu_ver_major, icu_endianness)
# relative to configure
icu_data_path = os.path.join(icu_full_path,
'source/data/in',
icu_data_file_l)
icu_data_file_l) # LE
compressed_data = '%s.bz2' % (icu_data_path)
if not os.path.isfile(icu_data_path) and os.path.isfile(compressed_data):
# unpack. deps/icu-tmp is a temporary path
if os.path.isdir(icu_tmp_path):
shutil.rmtree(icu_tmp_path)
os.mkdir(icu_tmp_path)
icu_data_path = os.path.join(icu_tmp_path, icu_data_file_l)
with open(icu_data_path, 'wb') as outf:
with bz2.BZ2File(compressed_data, 'rb') as inf:
shutil.copyfileobj(inf, outf)
# Now, proceed..

# relative to dep..
icu_data_in = os.path.join('..','..', icu_full_path, 'source/data/in', icu_data_file_l)
icu_data_in = os.path.join('..','..', icu_data_path)
if not os.path.isfile(icu_data_path) and icu_endianness != 'l':
# use host endianness
icu_data_path = os.path.join(icu_full_path,
'source/data/in',
icu_data_file)
# relative to dep..
icu_data_in = os.path.join('..', icu_full_path, 'source/data/in',
icu_data_file)
# this is the input '.dat' file to use .. icudt*.dat
# may be little-endian if from a icu-project.org tarball
o['variables']['icu_data_in'] = icu_data_in
icu_data_file) # will be generated
if not os.path.isfile(icu_data_path):
# .. and we're not about to build it from .gyp!
error('''ICU prebuilt data file %s does not exist.
See the README.md.''' % icu_data_path)

# this is the input '.dat' file to use .. icudt*.dat
# may be little-endian if from a icu-project.org tarball
o['variables']['icu_data_in'] = icu_data_in

# map from variable name to subdirs
icu_src = {
'stubdata': 'stubdata',
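The unpack step added in this hunk can be read as one small function. The sketch below mirrors that logic with the standard `bz2` and `shutil` modules. The function name `unpack_icu_data` is hypothetical, and the paths are parameters here rather than the module-level variables used in `configure.py`.

```python
import bz2
import os
import shutil


def unpack_icu_data(icu_data_path, icu_tmp_path):
    """Return the path of a usable .dat file, unpacking the .bz2 if needed."""
    compressed_data = '%s.bz2' % icu_data_path
    if os.path.isfile(icu_data_path) or not os.path.isfile(compressed_data):
        # Either an uncompressed file already exists, or there is nothing
        # to unpack; the caller decides what to do with a missing file.
        return icu_data_path
    # Start from an empty temporary directory on every run.
    if os.path.isdir(icu_tmp_path):
        shutil.rmtree(icu_tmp_path)
    os.mkdir(icu_tmp_path)
    unpacked = os.path.join(icu_tmp_path, os.path.basename(icu_data_path))
    # Stream the decompressed bytes so the whole file is never held in memory.
    with bz2.BZ2File(compressed_data, 'rb') as inf, \
         open(unpacked, 'wb') as outf:
        shutil.copyfileobj(inf, outf)
    return unpacked
```

Streaming through `shutil.copyfileobj` matters because the full ICU data file is large, and decompressing it in one `read()` call would allocate the whole payload at once.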
@@ -1547,6 +1557,31 @@ def write_config(data, name):
var = 'icu_src_%s' % i
path = '../../%s/source/%s' % (icu_full_path, icu_src[i])
icu_config['variables'][var] = glob_to_var('tools/icu', path, 'patches/%s/source/%s' % (icu_ver_major, icu_src[i]) )
# calculate platform-specific genccode args
# print("platform %s, flavor %s" % (sys.platform, flavor))
# if sys.platform == 'darwin':
# shlib_suffix = '%s.dylib'
# elif sys.platform.startswith('aix'):
# shlib_suffix = '%s.a'
# else:
# shlib_suffix = 'so.%s'
if flavor == 'win':
icu_config['variables']['icu_asm_ext'] = 'obj'
icu_config['variables']['icu_asm_opts'] = [ '-o ' ]
elif with_intl == 'small-icu' or options.cross_compiling:
icu_config['variables']['icu_asm_ext'] = 'c'
icu_config['variables']['icu_asm_opts'] = []
elif flavor == 'mac':
icu_config['variables']['icu_asm_ext'] = 'S'
icu_config['variables']['icu_asm_opts'] = [ '-a', 'gcc-darwin' ]
elif sys.platform.startswith('aix'):
icu_config['variables']['icu_asm_ext'] = 'S'
icu_config['variables']['icu_asm_opts'] = [ '-a', 'xlc' ]
else:
# assume GCC-compatible asm is OK
icu_config['variables']['icu_asm_ext'] = 'S'
icu_config['variables']['icu_asm_opts'] = [ '-a', 'gcc' ]

# write updated icu_config.gypi with a bunch of paths
write(icu_config_name, do_not_edit +
pprint.pformat(icu_config, indent=2) + '\n')
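The platform branching for the `genccode` arguments in this hunk is easier to check as a pure function. The sketch below restates the same decision order. The function name `genccode_asm_settings` and its parameters are hypothetical stand-ins for the `flavor`, `with_intl`, `options.cross_compiling` and `sys.platform` values that `configure.py` reads directly.

```python
def genccode_asm_settings(flavor, with_intl, cross_compiling, platform):
    """Return (icu_asm_ext, icu_asm_opts) in the same order as the hunk."""
    if flavor == 'win':
        # Windows: emit an object file directly.
        return 'obj', ['-o ']
    if with_intl == 'small-icu' or cross_compiling:
        # Plain C output, no assembler options.
        return 'c', []
    if flavor == 'mac':
        return 'S', ['-a', 'gcc-darwin']
    if platform.startswith('aix'):
        return 'S', ['-a', 'xlc']
    # Assume GCC-compatible assembly is acceptable everywhere else.
    return 'S', ['-a', 'gcc']
```

Note that the order of the checks matters: a cross-compiled macOS build takes the `'c'` branch before the `flavor == 'mac'` test is reached.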
8 changes: 8 additions & 0 deletions deps/icu-small/README-FULL-ICU.txt
@@ -0,0 +1,8 @@
ICU sources - auto generated by shrink-icu-src.py

This directory contains the ICU subset used by --with-intl=full-icu
It is a strict subset of ICU 64 source files with the following exception(s):
* deps/icu-small/source/data/in/icudt64l.dat.bz2 : compressed data file


To rebuild this directory, see ../../tools/icu/README.md
8 changes: 0 additions & 8 deletions deps/icu-small/README-SMALL-ICU.txt

This file was deleted.

Binary file removed deps/icu-small/source/data/in/icudt64l.dat
Binary file not shown.
Binary file added deps/icu-small/source/data/in/icudt64l.dat.bz2
Binary file not shown.
26 changes: 11 additions & 15 deletions doc/api/intl.md
@@ -23,11 +23,9 @@ programs. Some of them are:
* [`RegExp` Unicode Property Escapes][]

Node.js (and its underlying V8 engine) uses [ICU][] to implement these features
in native C/C++ code. However, some of them require a very large ICU data file
in order to support all locales of the world. Because it is expected that most
Node.js users will make use of only a small portion of ICU functionality, only
a subset of the full ICU data set is provided by Node.js by default. Several
options are provided for customizing and expanding the ICU data set either when
in native C/C++ code. The full ICU data set is provided by Node.js by default.
However, due to the size of the ICU data file, several
options are provided for customizing the ICU data set either when
building or running Node.js.

## Options for building Node.js
@@ -38,8 +36,8 @@ in [BUILDING.md][].

* `--with-intl=none`/`--without-intl`
* `--with-intl=system-icu`
* `--with-intl=small-icu` (default)
* `--with-intl=full-icu`
* `--with-intl=small-icu`
* `--with-intl=full-icu` (default)

An overview of available Node.js and JavaScript features for each `configure`
option:
@@ -66,8 +64,8 @@ operation is identical to that of `Date.prototype.toString()`.

### Disable all internationalization features (`none`)

If this option is chosen, most internationalization features mentioned above
will be **unavailable** in the resulting `node` binary.
If this option is chosen, ICU is disabled and most internationalization
features mentioned above will be **unavailable** in the resulting `node` binary.

### Build with a pre-installed ICU (`system-icu`)

@@ -106,9 +104,7 @@ console.log(spanish.format(january));
// Should print "enero"
```

This mode provides a good balance between features and binary size, and it is
the default behavior if no `--with-intl` flag is passed. The official binaries
are also built in this mode.
This mode provides a balance between features and binary size.

#### Providing ICU data at runtime

@@ -149,8 +145,9 @@ enable full `Intl` support.

This option makes the resulting binary link against ICU statically and include
a full set of ICU data. A binary created this way has no further external
dependencies and supports all locales, but might be rather large. See
[BUILDING.md][BUILDING.md#full-icu] on how to compile a binary using this mode.
dependencies and supports all locales, but might be rather large. This is
the default behavior if no `--with-intl` flag is passed. The official binaries
are also built in this mode.

## Detecting internationalization support

@@ -205,7 +202,6 @@ to be helpful:
[`String.prototype.toUpperCase()`]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toUpperCase
[`require('buffer').transcode()`]: buffer.html#buffer_buffer_transcode_source_fromenc_toenc
[`require('util').TextDecoder`]: util.html#util_class_util_textdecoder
[BUILDING.md#full-icu]: https://github.com/nodejs/node/blob/master/BUILDING.md#build-with-full-icu-support-all-locales-supported-by-icu
[BUILDING.md]: https://github.com/nodejs/node/blob/master/BUILDING.md
[ECMA-262]: https://tc39.github.io/ecma262/
[ECMA-402]: https://tc39.github.io/ecma402/
42 changes: 20 additions & 22 deletions doc/api/util.md
@@ -932,26 +932,9 @@ Per the [WHATWG Encoding Standard][], the encodings supported by the
one or more aliases may be used.

Different Node.js build configurations support different sets of encodings.
While a very basic set of encodings is supported even on Node.js builds without
ICU enabled, support for some encodings is provided only when Node.js is built
with ICU and using the full ICU data (see [Internationalization][]).
(see [Internationalization][]).

#### Encodings Supported Without ICU

| Encoding | Aliases |
| ----------- | --------------------------------- |
| `'utf-8'` | `'unicode-1-1-utf-8'`, `'utf8'` |
| `'utf-16le'` | `'utf-16'` |

#### Encodings Supported by Default (With ICU)

| Encoding | Aliases |
| ----------- | --------------------------------- |
| `'utf-8'` | `'unicode-1-1-utf-8'`, `'utf8'` |
| `'utf-16le'` | `'utf-16'` |
| `'utf-16be'` | |

#### Encodings Requiring Full ICU Data
#### Encodings Supported by Default (With Full ICU Data)

| Encoding | Aliases |
| ----------------- | -------------------------------- |
@@ -990,6 +973,21 @@ with ICU and using the full ICU data (see [Internationalization][]).
| `'shift_jis'` | `'csshiftjis'`, `'ms932'`, `'ms_kanji'`, `'shift-jis'`, `'sjis'`, `'windows-31j'`, `'x-sjis'` |
| `'euc-kr'` | `'cseuckr'`, `'csksc56011987'`, `'iso-ir-149'`, `'korean'`, `'ks_c_5601-1987'`, `'ks_c_5601-1989'`, `'ksc5601'`, `'ksc_5601'`, `'windows-949'` |

#### Encodings Supported when Node.js is built with the `small-icu` option

| Encoding | Aliases |
| ----------- | --------------------------------- |
| `'utf-8'` | `'unicode-1-1-utf-8'`, `'utf8'` |
| `'utf-16le'` | `'utf-16'` |
| `'utf-16be'` | |

#### Encodings Supported when ICU is disabled

| Encoding | Aliases |
| ----------- | --------------------------------- |
| `'utf-8'` | `'unicode-1-1-utf-8'`, `'utf8'` |
| `'utf-16le'` | `'utf-16'` |

The `'iso-8859-16'` encoding listed in the [WHATWG Encoding Standard][]
is not supported.

@@ -1005,9 +1003,9 @@ changes:
* `encoding` {string} Identifies the `encoding` that this `TextDecoder` instance
supports. **Default:** `'utf-8'`.
* `options` {Object}
* `fatal` {boolean} `true` if decoding failures are fatal. This option is only
supported when ICU is enabled (see [Internationalization][]). **Default:**
`false`.
* `fatal` {boolean} `true` if decoding failures are fatal.
This option is not supported when ICU is disabled
(see [Internationalization][]). **Default:** `false`.
* `ignoreBOM` {boolean} When `true`, the `TextDecoder` will include the byte
order mark in the decoded result. When `false`, the byte order mark will
be removed from the output. This option is only used when `encoding` is