Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastCodeAt that's actually fast #9458

Closed
Simn opened this issue May 21, 2020 · 8 comments · Fixed by #9467
Closed

fastCodeAt that's actually fast #9458

Simn opened this issue May 21, 2020 · 8 comments · Fixed by #9467
Milestone

Comments

@Simn
Copy link
Member

Simn commented May 21, 2020

The specification for StringTools.fastCodeAt is pretty silly:

  This method is faster than `String.charCodeAt()` on some platforms, but
  the result is unspecified if `index` is negative or greater than
  `s.length`.
  End of file status can be checked by calling `StringTools.isEof()` with
  the returned value as argument.

These two statements contradict each other: If the result is unspecified then we cannot make any guarantees about what you can do with the returned value.

As a consequence, some targets have to branch here, e.g. to avoid throwing exceptions. For instance, Java does this:

return (index < s.length) ? cast(_charAt(s, index), Int) : -1;

This leads to an unnecessary double-branching on implementations that then use isEof.

At this point we obviously cannot break fastCodeAt, but I would like to propose the introduction of an unsafeCodeAt which really is just the fastest implementation possible, with no out-of-bounds guarantees whatsoever. Consumers can check the bounds themselves by comparing whatever indices they use against the string length.

This could then also be utilized by StringIterator, because Iterator.next also says this:

A call to this method while hasNext() is false yields unspecified behavior.

And no, I don't want to "use Bytes instead" because I don't want to deal with unicode myself.

@Simn Simn added this to the Design milestone May 21, 2020
@Gama11
Copy link
Member

Gama11 commented May 21, 2020

Would this also lead to fastCodeAt() being deprecated, or at least having a note that points to unsafeCodeAt()?

@Simn
Copy link
Member Author

Simn commented May 21, 2020

I don't think I would deprecate it, just update the documentation.

@RealyUniqueName
Copy link
Member

I agree we need unsafeCodeAt

@ncannasse
Copy link
Member

I think for real "fast" , we would need some kind of optimized StringIterator that results in direct access to the string data. Would work well with utf8 for instance.

@ncannasse
Copy link
Member

Wait we have StringIterator already :) but is it a good enough replacement for fastCodeAt ?

@ncannasse
Copy link
Member

oh, StringIterator is not unicode compatible ? :'(

@Simn
Copy link
Member Author

Simn commented May 21, 2020

StringIteratorUnicode can be optimized for UTF8 targets by carrying some state, similar to the cursors eval strings have. That is a separate problem though and needs a fast character access function as a basis.

@Simn
Copy link
Member Author

Simn commented May 21, 2020

Actually it's a separate problem entirely because it's more of a character-offset to byte-offset mapping. Such an iterator would likely be based on Bytes, not String.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants