-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent character encoding in exported files #154
Comments
Thank you for including the sample file. I am getting a little different results on my computer, so I wanted to clarify a couple things.
The current intended behavior is that all files are exported/saved with UTF-8 encoding, and files that contain Unicode characters also have a UTF-8 BOM added to the file. (This allows GitHub Desktop and other applications to display the Unicode characters correctly.) For a closer look at how encoding is handled, you can review Hope that helps! |
Thank you for the prompt reply and apologies for forgetting these details.
Powershell I will take closer look at |
Thanks for the additional details! I think I see what you mean now. If I view the file changes in GitHub Desktop, for example, it does not display correctly: Even though it exports and imports fine, in the sense that it reconstructs the original VBA code, it does not display correctly in the diff. After converting to UTF-8 BOM, it displays correctly: At present, there is no testing or conversion on VBA code modules, but I can see how this would be a helpful enhancement to improve the display and handling of the source files, and like you pointed out, to maintain consistency in our file formats. |
While VBA modules cannot contain Unicode characters, they may contain extended ASCII characters. These are not displayed correctly in some editors, so we are converting affected source files to UTF-8 BOM whever they contain extended characters. Fixes #154
I just pushed a commit to the My hunch is that this just affects VBA code modules, since most other objects would be saved in UCS-2 to preserve Unicode content, but let me know if you encounter similar issues with any other types of objects. |
Converts a couple code modules to UTF-8 format for proper display of extended characters, thanks to the resolution of #154
Thank for the fix. However, I noticed two problems with it:
My suggestion is just to remove the conditional conversion and convert always to UTF-8. Note that it might The main reason why this issue is important to us is that in addition to git diffs, wrong encoding breaks |
@Ik-12 and @joyfullservice, I've observed this same behavior in SourceTree; this is especially prevalent in SSMS, where it defaults to UTF-16, which then will show the whole file altered. This has caused us to have to use another editor to re-encode into UTF-8, and then everything's happy again. I agree this should definitely be an option, and off by default for exactly the reason @Ik-12 mentioned; it would effectively create a reset in your VCS, and make tracking line changes difficult pre/post this change. As much as I have a personal disdain for autoNags: I think this one is a perfect use case for it. If VCS detects there might be Unicode/unexpected characters present, instead of just converting that file to UTF8/whatever, VCS should alert users, so they're aware the file encoding is changing and they can make an informed decision. Perhaps if this is detected, instead of continuing, the export should stop, bring up the options and let the user choose. As a complicated solution could be elegant in the end: maybe a good way to handle this is similar to how tables are handled with their data; tables are given the option of just saving the def, saving the def + data and in what format. I realize this will likely slow down the export/import processing by being dynamic, it would allow users to slow their roll over, or just jump and be done with it. If this option was implemented, it would be a good place to implement a per-form print options save setting. This would cut down on doubling the number of exported files in VCS, but would be a pretty significant overhaul; the already existing print config |
This file is natively exported as ANSI, and should be read as such when testing for extended characters. See #154
Good catch! I have corrected this in 7e15c40.
Yes, I use this feature quite frequently myself, even to divide changes on the same file into different commits. Back to the question of UTF-8 conversion, when I inspect an The two potential concerns I would see with always including a BOM is that 1.) It would change all the files in the repository, and 2.) It is technically not necessary since there is no non-Ascii content in the file. Let me know what you think! |
ASCII, from what I understand, is effectively UTF8 without BOM, so this makes sense that they're the same. I agree that converting files en-masse would be not-great, though I, personally, think conversion should be either consistent, or chosen by the user; unexpected conversions will lead to headaches. Being able to plan for it and decide a good point to alter makes transition much easier. Similar case to how we're switching over VCS from the integral one to the addin: we're currently running the addin as the non-controlled outputs and accepting all changes while we transition, and then will begin to reverse that (control change in the addin folders, and approve all integrated vcs changes to keep a trail, and then switch out entirely to the addin version. This allows for a nearly seamless transition at our darn leisure and don't have a point where it's difficult to transition through. |
This should help in situations where extended character content was added to a source file in UTF-8 format, but no BOM was used. #154
Yes, ASCII is a subset of UTF-8 so the files are binary equal if only ASCII characters are used. One reason for including BOM in all cases would be to "communicate" that Regarding the encoding option, I think it should be global and have (at least) two values:
This should make the transition as easy as possible for the users. |
As of the latest commits, we are effectively exporting modules only in UTF-8 format. Currently the BOM is only added if the file contains extended characters, which I believe should solve all the diff representation issues. Imports will always treat the module source files as UTF-8, so any edits can be safely made in UTF-8 and will import correctly. It seems that the question at this point is really on whether we should always include a UTF-8 BOM in every export file, regardless of the content. The reason to do this would be to support other software (SourceTree?) that defaults to other encodings when a BOM is not present. (Would those even be displayed differently anyway?) If you feel like this is a significant issue that isn't easily solved by adjusting the settings in the other software, I am open to the idea of including a non-default option to always include a UTF-8 BOM in export files. My primary hesitation is that this is a non-standard way of working with file encodings. A UTF-8 BOM is not recommended in general, per the spec, but can be helpful to support certain software systems that use this to identify encoding formats in cases where you actually have Unicode or extended content. Would you be able to demonstrate an actual case scenario where the current behavior (only use BOM when needed) produces incorrect output or display in diff or version control systems? Again, I am not opposed to the idea of adding this option, I just want to make sure we are solving a problem. 😄 |
The latest commits solve my issue. Many thanks for this. The only(?) issue with adding BOM only when it is need is that adding non-ASCII characters I think that having "include BOM" as option would be the best solution because we cannot be sure how any other software |
Just to clarify, is the Include BOM option something that you would personally plan to use in your environment, or just something that might possibly be useful for someone else in the future? |
Just something that might possibly be useful in the future. |
Sounds good. It is pretty easy to add later on if it becomes a need. I believe this issue has been resolved, so I will go ahead and close it out. Feel free to respond or reopen the issue if you encounter further problems with the encoding of export files! |
First of all, many thanks for the excellent tool.
As we use non-ASCII characters in our source files, I have noticed certain
inconsistencies in the character encoding of exported forms and source modules:
On the other hand:
The encoding used in case 3) cause problems for example with git diff.
As a solution, I suggest that all components should be exported using the same encoding
regardless of the type and the content. The best choice for encoding should be UTF-8.
Please find attached a minimal example to reproduce the issue:
Example.zip
The text was updated successfully, but these errors were encountered: