Skip to content

Enable Error Prone check: DefaultCharset#12313

Merged
findepi merged 2 commits intotrinodb:masterfrom
ksobolew:kudi/default-charset
May 12, 2022
Merged

Enable Error Prone check: DefaultCharset#12313
findepi merged 2 commits intotrinodb:masterfrom
ksobolew:kudi/default-charset

Conversation

@ksobolew
Copy link
Copy Markdown
Contributor

@ksobolew ksobolew commented May 10, 2022

Description

This is focused on charsets, so it won't catch locale-related problems like String#toLowerCase with default locale. But it will catch String to byte[] (and vice versa) conversion with default charset, in addition to the IO stream related issues fixed in this commit.

The existing cases were fixed to explicitly use defaultCharset() where the data being read comes from (or is written to a file on) the local system, otherwise they are fixed to use UTF_8 - unless there are special considerations to use a different charset.

Is this change a fix, improvement, new feature, refactoring, or other?

Static analysis improvement.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

All over the code (some fixes in places reported by Error Prone).

How would you describe this change to a non-technical end user or system administrator?

Some internal improvements that should make everything slightly more stable.

Related issues, pull requests, and links

https://errorprone.info/bugpattern/DefaultCharset

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label May 10, 2022
@ksobolew ksobolew requested review from findepi and hashhar May 10, 2022 11:54
@findepi findepi added the no-release-notes This pull request does not require release notes entry label May 10, 2022
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this read? where does modelData come from?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Looks like nobody calls this method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's called dynamically in io.trino.plugin.ml.ModelUtils#deserialize(io.airlift.slice.Slice). The byte[] data comes from theSlice. I have no idea what are the contents of the slice - some representation of the ML model, it appears, and it doesn't look like a text 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further investigation: the model data is some binary metadata followed by a textual representation of the model. The text is generated by writing to a temporary file using java.io.DataOutputStream#writeBytes(String), which... treats the string as "a sequence of bytes" (out.write((byte)s.charAt(i)) 😱). So it's actually either US_ASCII or ISO_8859_1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch. use UTF-8 and add a comment that this may or may not be correct choice

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is out? sounds like default charset can be appropriate, but idk

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.out = new PrintStream(config.getEventLogFile());

IOW, this is a log file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change it to default then.

@ksobolew ksobolew force-pushed the kudi/default-charset branch 2 times, most recently from 66c8b80 to b0b963f Compare May 10, 2022 13:50
ksobolew added 2 commits May 11, 2022 15:31
This is focused on charsets, so it won't catch locale-related problems
like `String#toLowerCase` with default locale. But it will catch
`String` to `byte[]` (and vice versa) conversion with default charset,
in addition to the IO stream related issues fixed in this commit.

The existing cases were fixed to explicitly use `defaultCharset()` where
the data being read comes from (or is written to a file on) the local
system, otherwise they are fixed to use `UTF_8` - unless there are
special considerations to use a different charset.
@ksobolew ksobolew force-pushed the kudi/default-charset branch from b0b963f to 9c8345e Compare May 11, 2022 13:32
@ksobolew
Copy link
Copy Markdown
Contributor Author

AC, thanks @findepi

@findepi findepi merged commit 08bcca6 into trinodb:master May 12, 2022
@ksobolew ksobolew deleted the kudi/default-charset branch May 12, 2022 08:01
@github-actions github-actions bot added this to the 381 milestone May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

2 participants