Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure UTF-8 is used in more places #494

Merged
merged 3 commits into from
Jul 20, 2024

Conversation

Jayman2000
Copy link
Contributor

Pull Request Type

  • GitHub Workflow changes
  • Documentation or Wiki changes
  • Build and Dependency changes
  • Runtime changes
    • Render changes
    • Audio changes
    • Input changes
    • Network changes
    • Other changes

Description

At the moment, Descent 3 doesn’t always specify which character encoding should be used. As a result, the default character encoding gets used. The default character encoding can change depending on what OS you’re using and how the OS is configured. This PR makes Descent 3 use UTF-8 instead of the default character encoding in some situations. The idea is that if we use UTF-8 everywhere, then users won’t have to change their system configuration or avoid using certain characters in order to make Descent 3 work properly.

Related Issues

Fixes #483.

Checklist

  • I have tested my changes locally and verified that they work as intended.
  • I have documented any new or modified functionality.
  • I have reviewed the changes to ensure they do not introduce any unnecessary complexity or duplicate code.
  • I understand that by submitting this pull request, I am agreeing to license my contributions under the project's license.

Additional Comments

@winterheart had a different idea for how to fix #483. I think that it would be valuable to implement both the fix proposed by this PR and the fix proposed by winterheart. The changes in this PR affect a lot more than just fopen() which is why I think that they are valuable. Calling _wfopen() instead of fopen() on Windows would still be valuable even if this PR gets merged because if would be more efficient. Specifically, if you call fopen(), then the filename parameter has to be converted to UTF-16. If you call _wfopen(), then the filename parameter doesn’t have to be converted to UTF-16 because it’s already in UTF-16.

Before this change, there was a chance that a text editor or compiler
would use the wrong character encoding. For text editors, this commit
adds “charset = utf-8” to .editorconfig. That will cause editors that
support EditorConfig files [1] to automatically use UTF-8 when reading
and writing files. For compilers, this commit adds compiler options that
guarantee that compilers will decode source code files using UTF-8. The
compiler options are known to work on MSVC, GCC and Clang. If we ever
want to support additional compilers, then we might have to edit the if
statement that this commit adds.

This commit does not eliminate the chance that a wrong character
encoding will be used. Someone could always use a text editor that
doesn’t support EditorConfig files or a compiler that doesn’t support
the compiler options that we use. This commit does, however, make an
encoding mismatch much less likely.

[1]: <https://editorconfig.org/#pre-installed>
Consider this program:

  #include <cstdio>

  int main(void) {
    const char *example_string = "ディセント3";
    for (size_t i = 0; example_string[i] != '\0'; ++i) {
      printf("%02hhx ", example_string[i]);
    }
    puts("");

    return 0;
  }

What will that program output? The answer is: it depends. If that
program is compiled with a UTF-8 execution character set, then it will
print this:

  e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33

If that program is compiled with a Shift JIS execution character set,
then it will print this:

  83 66 83 42 83 5a 83 93 83 67 33

This is especially a problem when using MSVC. MSVC doesn’t necessarily
default to using UTF-8 as a program’s execution character set [1].

---

Before this change, Descent 3 would use whatever the default execution
character set was. This commit ensures that the execution character set
is UTF-8 as long as Descent 3 gets compiled with MSVC, GCC or Clang. If
Descent 3 is compiled with a different compiler, then a different
execution character set might get used, but as far as I know, we only
support MSVC, GCC and Clang.

I’m not sure whether or not this change has any noticeable effects. If
using different execution character sets do have noticeable effects,
then this change will hopefully ensure that those effects are the same
for everyone.

[1]: <https://learn.microsoft.com/en-us/answers/questions/1805730/what-is-msvc-s-default-execution-character-set>
Consider this program:

  #include <cstdio>

  int main(void) {
    const char *filename = u8"ディセント3.txt";
    auto fp = std::fopen(filename, "r");
    if (fp) {
      std::fclose(fp);
      return 0;
    } else {
      return 1;
    };
  }

If a file named ディセント3.txt exists, then will that program
successfully open it? The answer is: it depends.

filename is going to point to these bytes:

  Raw bytes: e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33 2e 74 78 74 00
  Characters: ディセント3.txt␀

Internally, Windows uses UTF-16. When you call fopen(), Windows will
convert the filename parameter into UTF-16 [1]. If the program is run
with a UTF-8 Windows code page, then the above bytes will be correctly
interpreted as UTF-8 when being converted into UTF-16 [2]. The final
UTF-16 string will be this*:

  Raw bytes: ff fe c7 30 a3 30 bb 30 f3 30 c8 30 33 00 2e 00 74 00 78 00 74 00
  Characters: ディセント3.txt

On the other hand, if the program is run with code page 932, then the
original bytes will be incorrectly interpreted as code page 932 when
being converted into UTF-16. The final UTF-16 string will be this*:

  Raw bytes: ff fe 5d 7e fd ff 67 7e 63 ff 67 7e 7b ff 5d 7e 73 ff 5d 7e fd ff 33 00 2e 00 74 00 78 00 74 00
  Characters: 繝�繧」繧サ繝ウ繝�3.txt

In other words, if that program gets compiled on Windows with a UTF-8
execution character set, then it needs to be run with a UTF-8 Windows
code page. Otherwise, mojibake might happen.

*Unlike the first string, this one does not have a null terminator. This
is because the Windows kernel doesn’t use null terminated strings for
paths [3][4].

---

Before this commit, Descent 3 would pass UTF-8 to fopen(), even if
Descent 3 is run with a non-UTF-8 Windows code page [5]. This commit
makes sure that Descent 3 gets run with a UTF-8 Windows code page.

The Windows code page isn’t just used by fopen(). It also gets used by
many other functions in the Windows API [6]. I don’t know if Descent 3
uses any of those other functions, but if it does, then this commit will
also help make sure that those functions receive strings with the
correct character encoding. Descent 3 uses UTF-8 for strings by default
[7]. Making sure that Descent 3 uses UTF-8 everywhere will make
encoding-related mistakes less likely in the future.

Fixes DescentDevelopers#483.

[1]: <https://stackoverflow.com/a/7950569/7593853>
[2]: <https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170#remarks>
[3]: <https://stackoverflow.com/a/52372115/7593853>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <DescentDevelopers#475 (comment)>
[6]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis>
[7]: adf58ec (Explicitly declare execution character set, 2024-07-07)
@pzychotic
Copy link
Contributor

Confirmed working on a German Win 11 with D3 running out of a Japanese named directory.

I had no idea that a simple manifest entry can switch the ANSI WinApi functions to suddenly except UTF-8. No need to muck around with UTF-16 or other wchar_t based APIs looks like a big win for cross-platform projects.

@Lgt2x
Copy link
Collaborator

Lgt2x commented Jul 20, 2024

How does the manifest file work? Does it have to be next to the executable? Should we install & package it?

@pzychotic
Copy link
Contributor

How does the manifest file work? Does it have to be next to the executable? Should we install & package it?

It will be embedded into the executable, so nothing to do for install.

Copy link
Collaborator

@Lgt2x Lgt2x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice indeed, I did not know that CMake could treat manifest files as sources to embed them into the executable.

Calling _wfopen() instead of fopen() on Windows would still be valuable even if this PR gets merged because if would be more efficient

With the relatively low number of files the game opens, I don't think this is really performance-critical

@Lgt2x Lgt2x merged commit 7cd7902 into DescentDevelopers:main Jul 20, 2024
10 checks passed
@Jayman2000 Jayman2000 deleted the encoding-improvements branch July 20, 2024 19:05
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Aug 19, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
At the moment, we work around this problem with things like the ddio
module. If we make sure that we always use std::filesystem::paths, then
we’ll be able to get rid of or simplify a lot of the ddio module.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Aug 20, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Aug 22, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Sep 12, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>

TODO:
• Test on Windows.
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Sep 12, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Sep 22, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Sep 29, 2024
Before this change, Base_directory was a char array. In general, it’s
better to use an std::filesystem::path to represent paths. Here’s why:

• char arrays have a limited size. If a user creates a file or directory
that has a very long path, then it might not be possible to store that
path in a given char array. You can try to make the char array long
enough to store the longest possible path, but future operating systems
may increase that limit. With std::filesystem::paths, you don’t have
to worry about path length limits.

• char arrays cannot necessarily represent all paths. We try our best to
use UTF-8 for all char arrays [1], but unfortunately, it’s possible to
create a path that cannot be represent using UTF-8 [2]. With an
std::filesystem::paths, stuff like that is less likely to happen because
std::filesystem::path::value_type is platform-specific.

• char arrays don’t have any built-in mechanisms for manipulating paths.
Before this commit, the code would work around this problem in two
different ways:

  1. It would use Descent 3–specific functions that implement path
  operations on char arrays/pointers (for example, ddio_MakePath()).
  Once all paths use std::filesystem::path, we’ll be able to simplify
  the code by removing those functions.

  2. It would temporarily convert Base_directory into an
  std::filesystem::path. This commit simplifies the code by removing
  those temporary conversions because they are no longer necessary.

[1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20)
[2]: <rust-lang/rust#12056>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Runtime Issue]: Unable to set temporary directory to a path that contains Japanese characters
3 participants