
Inconsistent character encoding in exported files #154

Closed
Ik-12 opened this issue Feb 3, 2021 · 14 comments

Ik-12 commented Feb 3, 2021

First of all, many thanks for the excellent tool.

As we use non-ASCII characters in our source files, I have noticed certain
inconsistencies in the character encoding of exported forms and source modules:

  1. Forms with non-ASCII characters are exported as UTF-8 w/ BOM
  2. Forms with only ASCII characters are exported as ASCII

On the other hand:

  3. Source modules with non-ASCII characters are exported as ISO-8859
  4. Source modules with only ASCII characters are exported as ASCII

The encoding used in case 3) causes problems, for example with git diff.

As a solution, I suggest that all components be exported using the same encoding
regardless of their type and content. UTF-8 would be the best choice.

Please find attached a minimal example to reproduce the issue:
Example.zip

joyfullservice (Owner) commented Feb 3, 2021

Thank you for including the sample file. I am getting slightly different results on my computer, so I wanted to clarify a couple of things.

  1. Which version of the add-in are you using? You might want to build the dev branch from source to get the very latest development version. (Currently 82 commits ahead of master)
  2. What is the language locale/default encoding on your system? That might affect how it treats text files without BOM headers.

The current intended behavior is that all files are exported/saved with UTF-8 encoding, and files that contain Unicode characters also have a UTF-8 BOM added to the file. (This allows GitHub Desktop and other applications to display the Unicode characters correctly.)

For a closer look at how encoding is handled, you can review modEncoding on the dev branch.
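
For reference, here is a rough sketch of how a UTF-8 BOM can be detected in VBA. This is illustrative only; the function name is an assumption, not necessarily what modEncoding actually uses.

```vba
' Illustrative sketch only - not necessarily the modEncoding implementation.
' Returns True when the file starts with the 3-byte UTF-8 BOM (EF BB BF).
Public Function HasUtf8Bom(strPath As String) As Boolean
    Dim bytHeader(0 To 2) As Byte
    Dim intFile As Integer

    intFile = FreeFile
    Open strPath For Binary Access Read As #intFile
    If LOF(intFile) >= 3 Then Get #intFile, 1, bytHeader
    Close #intFile

    HasUtf8Bom = (bytHeader(0) = &HEF And bytHeader(1) = &HBB And bytHeader(2) = &HBF)
End Function
```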

Hope that helps!

Ik-12 (Author) commented Feb 3, 2021

Thank you for the prompt reply and apologies for forgetting these details.

  1. We are using the latest release, 3.2.2, but I quickly tested with the dev branch
    and the inconsistency also occurs there.
  2. My system is Windows 10, build 19042.746. Windows display language is "English (United States)",
    regional format is "Finnish (Finland)" and language for non-Unicode programs is "English (United States)".
    Preferred languages list is "Finnish" and "English (United States)". Access version is 2016/64bit.

PowerShell `[System.Text.Encoding]::Default` prints out the following information:

```
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
```

I will take a closer look at modEncoding and debug the issue tomorrow.

joyfullservice (Owner)

Thanks for the additional details! I think I see what you mean now. If I view the file changes in GitHub Desktop, for example, it does not display correctly:

[screenshot: diff displayed incorrectly in GitHub Desktop]

Even though it exports and imports fine, in the sense that it reconstructs the original VBA code, it does not display correctly in the diff. After converting to UTF-8 BOM, it displays correctly:

[screenshot: diff displayed correctly after conversion]

At present, there is no testing or conversion on VBA code modules, but I can see how this would be a helpful enhancement to improve the display and handling of the source files, and like you pointed out, to maintain consistency in our file formats.

joyfullservice pushed a commit that referenced this issue Feb 3, 2021
While VBA modules cannot contain Unicode characters, they may contain extended ASCII characters. These are not displayed correctly in some editors, so we are converting affected source files to UTF-8 BOM whenever they contain extended characters. Fixes #154
joyfullservice (Owner)

I just pushed a commit to the dev branch that I believe will resolve this issue. Any code modules that contain extended characters are converted to UTF-8 for storage in version control, and back to ANSI just before importing back to the database. On my system this tested out great with your sample database for both export and build from source.

My hunch is that this just affects VBA code modules, since most other objects would be saved in UCS-2 to preserve Unicode content, but let me know if you encounter similar issues with any other types of objects.
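
For anyone following along, here is a rough sketch of what that kind of ANSI-to-UTF-8 round trip can look like in VBA using ADODB.Stream. This is illustrative only; the procedure name and the hard-coded code page are assumptions, not the add-in's actual code.

```vba
' Illustrative sketch only - not the add-in's actual implementation.
' Re-saves an ANSI (assumed Windows-1252) source file as UTF-8 with a BOM.
Public Sub ConvertAnsiToUtf8(strSource As String, strDest As String)
    Dim strText As String

    ' Read the exported file using the ANSI code page
    With CreateObject("ADODB.Stream")
        .Type = 2                   ' adTypeText
        .Charset = "windows-1252"   ' assumption: Western European ANSI
        .Open
        .LoadFromFile strSource
        strText = .ReadText(-1)     ' adReadAll
        .Close
    End With

    ' Write it back out as UTF-8 (ADODB.Stream adds the BOM automatically)
    With CreateObject("ADODB.Stream")
        .Type = 2                   ' adTypeText
        .Charset = "utf-8"
        .Open
        .WriteText strText
        .SaveToFile strDest, 2      ' adSaveCreateOverWrite
        .Close
    End With
End Sub
```

Converting back to ANSI before import would simply reverse the two charsets.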

joyfullservice pushed a commit that referenced this issue Feb 3, 2021
Converts a couple code modules to UTF-8 format for proper display of extended characters, thanks to the resolution of #154
Ik-12 (Author) commented Feb 4, 2021

Thanks for the fix. However, I noticed two problems with it:

  1. Converting only the modules that contain extended characters means that if extended characters
    are later added to a file, git diff will show the entire file as changed because the encoding changes.
  2. Detecting extended characters using `StringHasUnicode(ReadFile(strTempFile))` does not work on my system.
    This is because the temp file is in ISO-8859 encoding (e.g., 'ä' = `0xE4`, 'ö' = `0xF6`) and thus the regex `"[^\u0000-\u007F]"`
    used to detect non-ASCII characters does not match.

My suggestion is simply to remove the conditional conversion and always convert to UTF-8. Note that it might
be a good idea to implement this first as an option: changing the default encoding will show full-file changes
in any version-controlled project using this add-in, which can be inconvenient for users.

The main reason this issue is important to us is that, in addition to git diffs, the wrong encoding breaks
the "Stage selected lines" feature in Git Extensions. This feature is very useful because Access often triggers
irrelevant changes in files, and these should not be included in the commits.

hecon5 (Contributor) commented Feb 4, 2021

@Ik-12 and @joyfullservice, I've observed this same behavior in SourceTree; it is especially prevalent with SSMS, which defaults to UTF-16 and will then show the whole file as altered. This has forced us to use another editor to re-encode into UTF-8, and then everything's happy again.

I agree this should definitely be an option, and off by default for exactly the reason @Ik-12 mentioned; it would effectively create a reset in your VCS, and make tracking line changes difficult pre/post this change.

As much as I have a personal disdain for autoNags, I think this one is a perfect use case for them. If VCS detects that Unicode/unexpected characters might be present, instead of just converting the file to UTF-8 (or whatever), it should alert users so they're aware the file encoding is changing and can make an informed decision.

Perhaps if this is detected, instead of continuing, the export should stop, bring up the options and let the user choose.

A more complicated solution could be elegant in the end: maybe a good way to handle this is similar to how tables are handled with their data; tables are given the option of saving just the def, or the def + data, and in what format. I realize this will likely slow down export/import processing by being dynamic, but it would allow users to slow their roll, or just jump and be done with it. If this option were implemented, it would also be a good place to implement a per-form print-options save setting. That would cut down on doubling the number of exported files in VCS, but would be a pretty significant overhaul; the existing print config .json file could be used to store the export format and any other form-specific settings.

joyfullservice pushed a commit that referenced this issue Feb 4, 2021
This file is natively exported as ANSI, and should be read as such when testing for extended characters. See #154
joyfullservice (Owner)

> 2. Detecting extended characters using `StringHasUnicode(ReadFile(strTempFile))` does not work on my system.
>    This is because the temp file is in ISO-8859 encoding (e.g., 'ä' = `0xE4`, 'ö' = `0xF6`) and thus the regex `"[^\u0000-\u007F]"`
>    used to detect non-ASCII characters does not match.

Good catch! I have corrected this in 7e15c40.
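
For context, here is a rough sketch of the kind of check involved: the exported temp file has to be read with the ANSI code page before applying the regex, otherwise the extended characters never reach the pattern. This is illustrative only; the function name and the hard-coded code page are assumptions, not the actual modEncoding code.

```vba
' Illustrative sketch only - not the actual modEncoding code.
' Reads a natively exported (ANSI) file and tests for non-ASCII characters.
Public Function FileHasExtendedChars(strPath As String) As Boolean
    Dim strText As String

    ' Read with the ANSI code page rather than assuming UTF-8
    With CreateObject("ADODB.Stream")
        .Type = 2                   ' adTypeText
        .Charset = "windows-1252"   ' assumption: Western European ANSI
        .Open
        .LoadFromFile strPath
        strText = .ReadText(-1)     ' adReadAll
        .Close
    End With

    ' Same idea as the "[^\u0000-\u007F]" pattern mentioned above
    With CreateObject("VBScript.RegExp")
        .Pattern = "[^\u0000-\u007F]"
        FileHasExtendedChars = .Test(strText)
    End With
End Function
```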

> The main reason this issue is important to us is that, in addition to git diffs, the wrong encoding breaks
> the "Stage selected lines" feature in Git Extensions. This feature is very useful because Access often triggers
> irrelevant changes in files, and these should not be included in the commits.

Yes, I use this feature quite frequently myself, even to split changes in the same file into different commits.

Back to the question of UTF-8 conversion: when I inspect an OnlyAscii.bas file in a hex editor, it is identical whether saved as UTF-8 without a BOM or saved as ANSI. What do you think about only including the BOM on export when extended characters actually exist in the file? Module source files would always be imported as UTF-8, so that would cover cases where extended characters were added without a BOM, but files that don't have any extended characters would not include a BOM on export.

[screenshot: hex comparison of the two files]

The two potential concerns I see with always including a BOM are that 1) it would change all the files in the repository, and 2) it is technically not necessary when there is no non-ASCII content in the file.

Let me know what you think!
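
As a side note, writing UTF-8 without a BOM from VBA takes a small workaround, because ADODB.Stream always emits a BOM for the "utf-8" charset. A rough sketch (illustrative only, not the add-in's code):

```vba
' Illustrative sketch only. ADODB.Stream always writes a BOM for "utf-8",
' so the first three bytes are skipped before saving to disk.
Public Sub WriteUtf8NoBom(strText As String, strPath As String)
    Dim stmText As Object
    Dim stmBinary As Object

    Set stmText = CreateObject("ADODB.Stream")
    stmText.Type = 2              ' adTypeText
    stmText.Charset = "utf-8"
    stmText.Open
    stmText.WriteText strText

    ' Skip the 3-byte BOM and copy the rest into a binary stream
    stmText.Position = 3
    Set stmBinary = CreateObject("ADODB.Stream")
    stmBinary.Type = 1            ' adTypeBinary
    stmBinary.Open
    stmText.CopyTo stmBinary

    stmBinary.SaveToFile strPath, 2   ' adSaveCreateOverWrite
    stmBinary.Close
    stmText.Close
End Sub
```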

hecon5 (Contributor) commented Feb 4, 2021

ASCII, from what I understand, is effectively UTF-8 without a BOM, so it makes sense that they're the same.

I agree that converting files en masse would be not-great, though I personally think conversion should be either consistent or chosen by the user; unexpected conversions will lead to headaches. Being able to plan for it and pick a good point to make the change makes the transition much easier. It's a similar case to how we're switching our VCS from the integrated one to the add-in: we're currently running the add-in output as non-controlled and accepting all changes while we transition, and then we will begin to reverse that (control changes in the add-in folders, approve all integrated VCS changes to keep a trail, and then switch entirely to the add-in version). This allows for a nearly seamless transition at our darn leisure, without a point where the transition is difficult to get through.

joyfullservice pushed a commit that referenced this issue Feb 4, 2021
This should help in situations where extended character content was added to a source file in UTF-8 format, but no BOM was used. #154
Ik-12 (Author) commented Feb 4, 2021

Yes, ASCII is a subset of UTF-8, so the files are binary-identical if only ASCII characters are used.

One reason for including the BOM in all cases would be to "communicate" that
all files are exported in UTF-8. In any case, my opinion is to aim for consistency, that is,
either include the BOM in all files or always leave it out.

Regarding the encoding option, I think it should be global and have (at least) two values:

  1. The current behavior (default)
  2. Export everything always in UTF-8
    (3. Export everything always in some other encoding)

This should make the transition as easy as possible for the users.

joyfullservice (Owner)

As of the latest commits, we are effectively exporting modules only in UTF-8 format. Currently the BOM is only added if the file contains extended characters, which I believe should solve all the diff representation issues. Imports will always treat the module source files as UTF-8, so any edits can be safely made in UTF-8 and will import correctly.

It seems that the question at this point is really on whether we should always include a UTF-8 BOM in every export file, regardless of the content. The reason to do this would be to support other software (SourceTree?) that defaults to other encodings when a BOM is not present. (Would those even be displayed differently anyway?)

If you feel like this is a significant issue that isn't easily solved by adjusting the settings in the other software, I am open to the idea of including a non-default option to always include a UTF-8 BOM in export files. My primary hesitation is that this is a non-standard way of working with file encodings. A UTF-8 BOM is not recommended in general, per the spec, but can be helpful to support certain software systems that use this to identify encoding formats in cases where you actually have Unicode or extended content.

Would you be able to demonstrate an actual scenario where the current behavior (only use a BOM when needed) produces incorrect output or display in diff or version control systems? Again, I am not opposed to the idea of adding this option; I just want to make sure we are solving an actual problem. 😄

Ik-12 (Author) commented Feb 4, 2021

The latest commits solve my issue. Many thanks for this.

The only(?) issue with adding the BOM only when it is needed is that adding non-ASCII characters
to a module for the first time shows an extra line (the first line) in the diff, as illustrated in the screenshot below.
For me this is rather insignificant (and I was actually wrong that the full file would show as changed in the diff).

I think that having "include BOM" as an option would be the best solution, because we cannot be sure how other software
used with this add-in handles the BOM (e.g., SourceTree may need it, some other tool might not work with it).

[screenshot: diff showing the extra BOM line]

joyfullservice (Owner)

Just to clarify, is the Include BOM option something that you would personally plan to use in your environment, or just something that might possibly be useful for someone else in the future?

Ik-12 (Author) commented Feb 5, 2021

Just something that might possibly be useful in the future.

joyfullservice (Owner)

> Just something that might possibly be useful in the future.

Sounds good. It is pretty easy to add later on if the need arises.

I believe this issue has been resolved, so I will go ahead and close it out. Feel free to respond or reopen the issue if you encounter further problems with the encoding of export files!
