Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove -x- encoding for marker attributes #5038

Merged
merged 11 commits into from
Jun 14, 2024

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jun 11, 2024

#3632

  • baked: use (&str, &str) if needed, otherwise &str as before
  • blob: use U+001E ("record separator") as the separator
  • fs: use additional directory level

@robertbastian robertbastian requested review from sffc, Manishearth and a team as code owners June 11, 2024 16:25
@Manishearth Manishearth removed their request for review June 11, 2024 16:36
@sffc
Copy link
Member

sffc commented Jun 12, 2024

GitHub keeps crashing when I leave inline comments so I'll post them in a comment:

  1. Praise: nice that both locale-only and attributes-only code paths are optimized in baked provider
  2. Nit: Please move the magic separator character 1E to a constant
  3. Question: Why did you make the variables in baked provider start with __, like __BN
  4. Nit: regarding struct Comparator<'a>(&'a DataLocale, &'a DataMarkerAttributes): I still wish we would add a type to icu_provider with all the needed impls. Maybe move the 1E character into icu_provider, too, since deciding what is a safe character is more in the realm of responsibility for icu_provider rather than icu_provider_blob.
    • Actually, you previously had this in the icu_provider crate via the functions on DataLocale like legacy_encode. So, technically speaking, this PR moves the variables to a less desirable location.
  5. Issue: Changing the supported_requests impl to call locale! on each item in the slice significantly increases file size. Why did you make this change?
  6. Observation: by making the aux key marker attribute a path component in fs data provider, it could cause conflicts if the marker attribute is the same as the language. For example, in HelloWorldProvider, if we changed reverse to rev, then it would be a conflict. This is not a problem if every entry has the same number of data key attributes.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments left as a reply to the thread

@robertbastian
Copy link
Member Author

robertbastian commented Jun 12, 2024

Nit: regarding struct Comparator<'a>(&'a DataLocale, &'a DataMarkerAttributes): I still wish we would add a type to icu_provider with all the needed impls. Maybe move the 1E character into icu_provider, too, since deciding what is a safe character is more in the realm of responsibility for icu_provider rather than icu_provider_blob.
Actually, you previously had this in the icu_provider crate via the functions on DataLocale like legacy_encode. So, technically speaking, this PR moves the variables to a less desirable location.

The 1E encoding is a blob-provider-specific encoding. Fs and bake do not use it, hence it should not be something that is dictated by icu_provider.

Issue: Changing the supported_requests impl to call locale! on each item in the slice significantly increases file size. Why did you make this change?

To remove runtime parsing

Observation: by making the aux key marker attribute a path component in fs data provider, it could cause conflicts if the marker attribute is the same as the language. For example, in HelloWorldProvider, if we changed reverse to rev, then it would be a conflict. This is not a problem if every entry has the same number of data key attributes.

Hello world is weird in the regard that some values have attributes and some don't. A marker will either always have attributes, or it won't.

@robertbastian robertbastian requested a review from sffc June 12, 2024 07:51
@robertbastian
Copy link
Member Author

Question: Why did you make the variables in baked provider start with __, like __BN

It's _<attributes>_<locale>. Some attributes start with digits so we need the leading underscore. And attributes we're designed to sit between markers and locales, hence the order.

@sffc
Copy link
Member

sffc commented Jun 12, 2024

I have other comments that weren't addressed.

Nit: regarding struct Comparator<'a>(&'a DataLocale, &'a DataMarkerAttributes): I still wish we would add a type to icu_provider with all the needed impls. Maybe move the 1E character into icu_provider, too, since deciding what is a safe character is more in the realm of responsibility for icu_provider rather than icu_provider_blob.
Actually, you previously had this in the icu_provider crate via the functions on DataLocale like legacy_encode. So, technically speaking, this PR moves the variables to a less desirable location.

The 1E encoding is a blob-provider-specific encoding. Fs and bake do not use it, hence it should not be something that is dictated by icu_provider.

It's blob-specific right now, but it's up to DataProvider to define what a legal separator character or character sequence would be.

This is not blocking.

Issue: Changing the supported_requests impl to call locale! on each item in the slice significantly increases file size. Why did you make this change?

To remove runtime parsing

I would appreciate some numbers for file size and compile time before and after this change, and ideally (optionally) in its own PR. Compile time for baked data really is becoming a real issue so I'm inclined to consider impliterator to be a slow path where I want to minimize impact.

Observation: by making the aux key marker attribute a path component in fs data provider, it could cause conflicts if the marker attribute is the same as the language. For example, in HelloWorldProvider, if we changed reverse to rev, then it would be a conflict. This is not a problem if every entry has the same number of data key attributes.

Hello world is weird in the regard that some values have attributes and some don't. A marker will either always have attributes, or it won't.

Note that if we go with your current proposal in #4984 to model subdivisions as attributes, then we break this invariant.

But I think this isn't actually an issue, since locales always have paths that end in .json.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

@robertbastian
Copy link
Member Author

I would appreciate some numbers for file size and compile time before and after this change, and ideally (optionally) in its own PR. Compile time for baked data really is becoming a real issue so I'm inclined to consider impliterator to be a slow path where I want to minimize impact.

provider/data went from 100MB to 102MB

@robertbastian robertbastian requested a review from sffc June 12, 2024 08:50
@robertbastian
Copy link
Member Author

Please take a detailed look at blob. auxkey_bench panics with a stack overflow on CI, which I cannot reproduce, and I don't see where that could come from.

@sffc
Copy link
Member

sffc commented Jun 12, 2024

I would appreciate some numbers for file size and compile time before and after this change, and ideally (optionally) in its own PR. Compile time for baked data really is becoming a real issue so I'm inclined to consider impliterator to be a slow path where I want to minimize impact.

provider/data went from 100MB to 102MB

Can you give me an idea of compile times?

If there's no substantial difference then I'm happy

@robertbastian
Copy link
Member Author

I don't notice a difference but I haven't benchmarked this.

@sffc
Copy link
Member

sffc commented Jun 12, 2024

The error is

2024-06-12T10:10:03.3907000Z      Running benches/auxkey_bench.rs (target/debug/deps/auxkey_bench-375e559a296b2916)
2024-06-12T10:10:03.3907370Z 
2024-06-12T10:10:03.3924340Z test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.51s
2024-06-12T10:10:03.3924950Z 
2024-06-12T10:10:03.4022910Z Gnuplot not found, using plotters backend
2024-06-12T10:10:03.5713400Z 
2024-06-12T10:10:03.5713780Z thread '
2024-06-12T10:10:03.5714790Z thread '<unknown><unknown>' has overflowed its stack
2024-06-12T10:10:03.5715140Z ' has overflowed its stack
2024-06-12T10:10:03.5717160Z fatal runtime error: stack overflow
2024-06-12T10:10:03.5717600Z fatal runtime error: stack overflow
2024-06-12T10:10:03.5730110Z error: test failed, to rerun pass `-p icu_provider_blob --bench auxkey_bench`
2024-06-12T10:10:03.5730680Z 
2024-06-12T10:10:03.5730860Z Caused by:
2024-06-12T10:10:03.5731740Z   process didn't exit successfully: `/Users/runner/work/icu4x/icu4x/target/debug/deps/auxkey_bench-375e559a296b2916` (signal: 6, SIGABRT: process abort signal)

@robertbastian
Copy link
Member Author

I had seen that, it doesn't really narrow it down for me.

sffc
sffc previously approved these changes Jun 13, 2024
@sffc
Copy link
Member

sffc commented Jun 13, 2024

oh I approve this but it looks like you just added a debug print statement.

auxkey_bench is really slow in CI and we should probably fix that in its own PR. I don't immediately have an idea of the best way to go about making it faster though.

@sffc
Copy link
Member

sffc commented Jun 14, 2024

ZeroTrie builder in const mode assembles everything in a giant stack array. It does this because there is no const alloc. rust-lang/const-eval#20

https://github.com/unicode-org/icu4x/blob/main/utils/zerotrie/src/builder/konst/builder.rs

And it moves the stack arrays around a lot. It does this because the only way to mutate an array in another function in const is to move it since there are no mutable pointers in const. rust-lang/rust#57349

https://github.com/unicode-org/icu4x/blob/main/utils/zerotrie/src/builder/konst/store.rs#L178

I don't know for sure whether this is the stack overflow, but I'm just trying to list out cases where I'm aware of us having stack-heavy function calls.

@robertbastian
Copy link
Member Author

Maybe those methods should be inline(always)

@robertbastian robertbastian merged commit 63fa8da into unicode-org:main Jun 14, 2024
28 checks passed
@robertbastian robertbastian deleted the x- branch June 14, 2024 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants