Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ASCII characters get distorted after ex- and import #217

Closed
cwuensch opened this issue May 4, 2021 · 29 comments
Closed

Non-ASCII characters get distorted after ex- and import #217

cwuensch opened this issue May 4, 2021 · 29 comments
Labels

Comments

@cwuensch
Copy link

cwuensch commented May 4, 2021

BrokenChars
Hi there,
thank you for your great add-in! It is extremely more user-friendly than the original version. :)

But there is still a bug with UTF-8 conversion in your add-in (which is not present in the original version):

When I export all objects from a database, and import them into a new DB, all non-ASCII characters get distorted.
Even worse: They are all mapped to the same byte-code, which makes it impossible to restore them afterwards!!

The exported files in
queries/.bas
forms/
.bas
already contain the damaged chars.

Files in modules/*.bas seem okay.

The (original) branch in https://github.com/msaccess-vcs-integration/msaccess-vcs-integration does preserve special chars correctly.

In the attached screenshots, you can see a comparison of an exported form, left is original repo with correct chars (ANSI, not UTF-8) - right is the distorted export from this fork.
Can you help with this, please?

@joyfullservice
Copy link
Owner

Hi Christian,

Thank you for reporting this! Can you provide some more details on your environment and the versions that you are using? That will help in better understanding the problem.

  • Which version of Microsoft Access are you using?
  • Is it 64-bit or 32-bit?
  • Which version of the add-in are you using?
  • What are the local language settings on your computer?
  • Are you using the Windows 10 Beta System UTF-8 option described in Subscript out of range - (UTF-8 System Encoding) #180?

Also, would you be able to attach a small database example that demonstrates the problem? That will help me to reproduce the issue on my system.

Thank you!

  • Adam

@joyfullservice joyfullservice added the need more details Further information needed to understand issue label May 4, 2021
@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

Hi Adam,

thank you for the reply!
I will shortly answer your questions:

  • Microsoft Access 2010 (32 bit)
  • I tried a lot of versions, including the latest
  • Local language is German (what exactly do you need?)
  • No, I did not find this option. And I am using Windows 7.

Of course, I can provide a demo database. But I will need a bit of time for this...

I think, the best idea would be to do no conversion at all. Just output the chars as provided by MSAccess.
Also, I would rather prefer to omit the Unicode BOM in the resulting files.

@joyfullservice
Copy link
Owner

Hi Christian,

I am also using Microsoft Access 2010 (32-bit) so that will make it easier for me to test. I do not have Windows 7, but we can probably still test the issue on my Windows 10 computer.

The sample database does not have to be very complicated. Just create a blank database file, and add a form that includes some extended characters that demonstrate the issue. Export to source, then build from source to see if the characters are different from the original.

Regarding the UTF-8 conversion, we have had some pretty involved discussions on this and the clear consensus of the users was that UTF-8 is the universal standard for storing and representing multi-lingual content, and worked best across different version control systems and development tools. (Pretty much everything supports UTF-8) All source files are now saved in UTF-8 BOM format. See #180, #187, #154, #187 for additional reading on this.

The original database should be reconstructed using the original encoding. If it isn't, then this is something we should probably fix.

@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

Well, the problem definitely occurs already during export, since the resulting textfiles do replace all special characters with the same byte code.
Attached is a little demo database, which shows the problem on my system.
Demo.zip

@hecon5
Copy link
Contributor

hecon5 commented May 4, 2021

Do you happen to have the code page you use on your machine? https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

The current codepage is windows-1252 (which is default for nearly all languages except for Arabic and Asian ones).
It is nearly identical with iso-8859-15 resp. iso-8859-1 (latin-1)

@joyfullservice joyfullservice added bug and removed need more details Further information needed to understand issue labels May 4, 2021
@joyfullservice
Copy link
Owner

Thank you for the sample database. I can reproduce the issue on my system.

image

@joyfullservice
Copy link
Owner

Thanks to your sample file, I think I have found the issue... The Sanitize routine expects the source file to be exported from Microsoft Access in UTF-8 format. The sample MDB file was using file version 4.0, which is encoded with the system codepage. If I create a new blank database, and import the table and form, they export just fine. (The database version for that file is 14.0)

If I read the export file using the system encoding, it exports and converts the file just fine.

image

This is correctly converted to UTF-8

image

I think the solution here is to use the system encoding when reading the file if the file version is below a certain point. Obviously version 4.0 needs this, but I am not sure technically where the cutoff happens where the export changes to Unicode... I will see if I can find something on this...

joyfullservice added a commit that referenced this issue May 4, 2021
When exporting source from a database in Microsoft Access 2000 format (version 4.0), objects are saved using the system codepage, and must be read as such when opening the file to sanitize it. Newer database versions export as UTF-8 automatically, so specifying the codepage is not necessary when reading the file. Fixes #217
@joyfullservice
Copy link
Owner

I have made an update that I think may resolve this issue when exporting source from older file formats. Can you try the attached version and see if it works correctly in your environment?
Version_Control_v3.3.32.zip

@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

Thank you very much for addressing this issue!
I installed and tested v3.3.32. (The GUI also displayed the new version number)
But the bug persists with no difference, here.

Can I do anything to help you with debugging this issue?

PS: Found out, there is not only the "Access 2000" format, but also "Access 2002-2003", "Access 2007" and I think, also "Access 97". Do they have to be treted separately?

@joyfullservice
Copy link
Owner

@cwuensch - The nice thing about this being a Microsoft Access add-in is that you can actually debug it live. After opening the add-in for the first time in that instance of Microsoft Access, you can see the project loaded in the VBA editor. If you open the modSanitize module, you can add a breakpoint in the SanitizeFile function. This function reads the natively exported form object source file, and where the codepage is determined. If you step through the code (Shift+F8) it should reach the highlighted line. If you then press F8, it should step into the GetSystemEncoding function which is what determines how to read the file contents.

image

I would also be curious to know what the GetSystemEncoding function returns on your system. You can test this by pressing ctl+G to jump to the immediate window, and type ?GetSystemEncoding and hit enter.

image

On my system it returns windows-1252 which reads the file correctly in the ReadFile function.

@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

?GetSystemEncoding returns "windows-1252" on my system as well. (As I said, this is the default setting in Windows for nearly all languages)

But... The problem here is: CurrentDb.Version is a String, not a number! The comparison CurrentDb.Version <= 4 returns False.
It should be replaced by (Left$(CurrentDb.Version,1) <= 4).

Why? Here are some examples:
?CDbl("4.0")
40
?CDbl("4,0")
4
?(CurrentDb.Version <= 4.0)
Falsch (=false)
?(CurrentDb.Version <= 4)
Falsch
?(CDbl(CurrentDb.Version)<= 4.0)
Falsch
?(CDbl(CurrentDb.Version)<= 4)
Falsch
?(CurrentDb.Version = "4.0")
Wahr (=true)
?(Left$(CurrentDb.Version,1) <= 4)
Wahr

Problem is here, that the number <-> String conversins depend on local settings. And in German locale, a decimal point is a comma, instead of a full stop. That's why "4.0" is converted into the number "4" on your system, but into "40" on mine.

@joyfullservice
Copy link
Owner

Nice catch!

@cwuensch
Copy link
Author

cwuensch commented May 4, 2021

Oh no!
Sorry, this was a bad idea...
My suggestion would now ALWAYS read with system encoding, because "12.0" would be converted to 1...

This way it should work:

Public Function MainVer(ver As String) As Integer
  pos = InStr(ver, ".")
  If pos > 0 Then
    MainVer = CInt(Left$(ver, pos - 1))
  Else
    MainVer = CInt(ver)
  End If
End Function

joyfullservice added a commit that referenced this issue May 5, 2021
The database version is stored as a string, but should be converted to a number before performing arithmetic. However, some locales use a period the way English numbers use a comma. See #217 for details.
@joyfullservice
Copy link
Owner

@cwuensch - Give this version a try...
Version_Control_v3.3.33.zip

@cwuensch
Copy link
Author

cwuensch commented May 5, 2021

Thank you very much! I can confirm, that v3.3.33 now exports all elements into working UTF-8.
Is it necessary to consider the different handling of Unicode in Access DB version 4.0 vs. 12.0 regarding the import as well?
(I think you changed this only for export. But maybe this is sufficient?)

May I ask a further question?
After you told me, how to debug the Add-In, I am quite curious: Can I also make changes to the code?
When I do what you told me, the changes will not be saved after closing MS Access. When I open the add-in-file, it just starts an automatic installation process and closes itself afterwards.

@hecon5
Copy link
Contributor

hecon5 commented May 5, 2021

@cwuensch, when Access loads an Add-in (such as this one), if you edit the addin within the "opening" Access file, changes will not be saved. This is a double edged sword: it allows easy debugging, and to try things out that might otherwise ruin files. Downside is that once you close the session, it will discard any settings.

This is a really easy way to load "extras" for users and ensure they don't break things for everyone else.

Anyway, to actually answer your question, I suggest the following;

  1. Fork this MS Access Repository into your GitHub account repos.
    image

  2. Clone your fork to a local repository alongside your other Access dev repos on your machine.

  3. I put mine alongside some other Access repositories I utilize.
    image

  4. Switch to "Dev" Branch: I actually suggest renaming YOUR "dev" branch to something local to you, especially if you still use some of the legacyVCS tools. I named my local fork of this branch to "dev-addin".

  5. Connect a second remote to the joyfullservice/msaccess-vcs-integration repository. This way you can track progress on the upstream ( joyfullservice/msaccess-vcs-integration) repository. There may be changes you don't want to pull into yours, or you may want customization not present on the upstream (in my environment, we have some particular changes required to ensure proper integration in our security environment and configurations that shouldn't be the default for everyone).
    image

  6. Go into your new local repo, and launch the Addin directly.

  7. Make changes, use the add-in to export the add-in code, and commit/push/pull request just like any other repository.
    image

These should probably be made into a wiki page, but they're here for now.

@hecon5
Copy link
Contributor

hecon5 commented May 5, 2021

Updated the wiki; it's here: Editing and Contributing

@cwuensch
Copy link
Author

cwuensch commented May 5, 2021

Thank you @hecon5 for your detailled instructions.
But in fact, my question was slightly different:
How can I - just on my computer - edit (and save!) the MS Access Add-In file?
If I open the Add-In file, it just runs the installation and closes itself. How can I make changes to the file and save them?

@hecon5
Copy link
Contributor

hecon5 commented May 5, 2021

Aha!
image

image

@joyfullservice
Copy link
Owner

@cwuensch - To make changes to the add-in, you simply close the install form by pressing escape or clicking the X in the top right corner as @hecon5 shows in the screenshot. Just remember that the add-in that you are editing is not the same as the add-in that runs when you click the menu item. After making changes to the add-in, you will want to install it by running the Autoexec macro, or by reopening the database. After reinstalling, the new changes will be available in the installed add-in, where you can test it with other databases.

What I will often do is test small changes in the loaded add-in, just like I described in the debugging process. When I am satisfied that it is working the way I want it to, then I copy those changes over to the working copy of the add-in. Also note that if the application-level add-in is loaded, you will not be able to install the update. The application-level add-in (the installed version) opens when it is first called from the menu item, and stays loaded until that instance of Microsoft Access is closed. The add-in will warn you that it cannot install if the installed add-in is already loaded. This probably sounds confusing, but it makes a lot of sense once you understand what is going on. 😄

As to your question on whether the files need to be converted back to the system encoding before import, this would depend on which file version was created when the database is built from source. Right now, I think the add-in just uses the current default database version, so this would work fine if the default is set to 2007 or newer. If the new database was created in a legacy format, you might have problems importing the UTF-8 encoded characters.

Here is what the option looks like in Microsoft Access 2010:

image

In the Build function, this is where you would specify the database format when creating a new database:

image

If we wanted the rebuilt database version to match the original, we should probably have the code specify the version to match the exported database. This hasn't come up as a need yet, but if someone would find this helpful or important, we could create a new issue to add this functionality.

@cwuensch
Copy link
Author

cwuensch commented May 5, 2021

Thank you for all your explanations!

1.) I think, creating a blank database with the user's "Default file format" seems reasonable. Especially, converting an existing database to a new version might be a reasonable use-case for the add-in. So I would not change that.

2.) But, IF the import has to be different, depending on the version of the newly created database, maybe the user's configuration should be considered?
(In my case, for example, the default file format is configured as "Access 2000", because there are customers with legacy software. But in a first quick test, the imported seemed to work nevertheless... that feels strange to me)

@hecon5
Copy link
Contributor

hecon5 commented May 17, 2021

@cwuensch, is this working? If so, we should probably update the released version to include this fix.

joyfullservice added a commit that referenced this issue May 17, 2021
When building from source, the add-in now attempts to use the same file format from the original database. This is important if you are maintaining databases in legacy versions. Also added error handling to give the user some additional clues on how to change the collation order if the database file creation fails. See #217
@joyfullservice
Copy link
Owner

I have just pushed some additional changes to the dev branch relating to the file format of the rebuilt database. The FileFormat of the old database is now stored in project.json and the new database is created using this same version. In other words, if you use Access 2010 to export source from a database in the Access 2000 file format, you can now build that database from source and it will be built in the legacy Access 2000 file format, just like the original database.

Although not directly related to the original issue, this was a side issue I was able to work through for the benefit of those that are working with files in older formats. Attached is a copy of the latest dev version in case you are not familiar with building it from source.
Version_Control_v3.3.36.zip

@cwuensch
Copy link
Author

Hi Adam,
to my mind, it still makes more sense, if a newly created database would use the default format configured in MS Access.
This way, you could convert an "old" *.mdb-file into new *.accdb format. Also, you could stick with legacy formats, when you configure Access respectively.
But this is your choice, of course.

But could you please do me the favor and compile the previous version v3.3.35 for me?
(I did not manage to build this Access add-in file unfortunately...)

@hecon5
Copy link
Contributor

hecon5 commented May 18, 2021

@cwuensch, I took the liberty of making a new issue for that, as it's an interesting idea, and I'm not sure it would be out of scope, but I think it could use some discussion before we do anything with it.

@cwuensch
Copy link
Author

Thank you @hecon5!
But I think this would be a very low priority.
For me, personally, it would be perfectly fine, if someone could just provide a compiled 3.3.35 add-in version?

@joyfullservice
Copy link
Owner

@cwuensch - Did the version 3.3.36 have problems building your database? If so, I would love to be able to resolve the issue for the benefit of everyone.

If you need to convert a database from one version to another, you can use the Save & Publish menu item, and select which version to convert the database to.

image

@cwuensch
Copy link
Author

I don't think so. (Could not yet try 3.3.36, but will do!)
But I would just personally prefer the old logic: That a database will be created in Access' default format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants