REGRESSION: ASCII is mangled in Version 3.3.1 #186
Comments
I took a peek at the code that saves a module... It has changed quite a bit since I last looked at it. There is too much going on. I know these are all features that have reasons, but it's getting quite bloated. I think the solution is to first make unit tests, like I did before.
Fundamentally, the ultimate goal of this project is to facilitate the recreation of a database from source files that matches the original. For the purpose of testing this functionality, I use a testing database. To test the add-in, you can run the testing database through a full round trip: export to source, rebuild from source, then compare.
As of the current version, the only differences I find are a few (non-essential) changes. I recognize that this testing database does not cover every possible scenario and option that could be used, but more database components can be easily added as desired to test the reproduction of a database component from source. Using this approach, every component in the database goes through round-trip testing. For my own purposes, I felt this was adequate to meet my needs for testing, ensuring that changes to the add-in were not causing undesired side effects. If you encounter persistent reproducible issues, feel free to add them to the testing database; this will help us maintain the desired functionality as the add-in evolves.

I know this is kind of a long, drawn-out explanation, but hopefully it communicates some of the rationale behind the approach I have been taking on testing. To be honest, if RubberDuck wasn't so painfully slow, I would probably use it more for unit testing in my projects, but hopefully someday it will improve. 😄
Thanks for explaining. That's the difference between Integration Testing and Unit Testing. I agree that Integration Testing like this is a great way to show that the plugin works as advertised. Unit testing is still useful for critical components. I tried to follow your process with the test DB and I am getting errors on Step 1.
I know that the linked CSV is hard coded to a path on your dev system. Same with the default printer being specific to your environment. I'm not sure about the other errors. It made some error files: errors.txt
errors1.txt
You are doing this on the dev branch?
No, I was using the master branch.
That would be a very logical conclusion. 😄 Setting the connection strings to relative paths is a very new change that hasn't rolled into the master branch yet. Sorry about the confusion on that. After the next release, this should work outside the dev branch as well.

Also, that was a helpful distinction regarding Integration Testing and Unit Testing. I agree that both are very valuable in a project like this. 👍
Ok, so the dev branch works wonderfully. I filled a full ASCII table into that test module. However, my problem didn't repeat. Now I'm left wondering why the export from my main project contained a bunch of wrong characters. The new test only proves that the add-in works...
Spoke too soon! I reproduced it by making that line longer. So it's not the character itself that is getting mangled; if that character is repeated a certain number of times, the whole line gets changed into a different character. Take a look at my PR; it's got the bad characters in it now. Because the extended character set is present, you can see that a bunch of other characters were affected. It's like the code page changed.
I believe I have isolated the problem:
It's getting converted to Unicode when it's a normal ASCII source.
Ok, my PR now contains a proposed fix. While I was in there, I also noticed something odd about the unit test for the function that checks for Unicode content.
The name of this function is a bit misleading... (Due to going back and forth a few times on how to handle extended ASCII characters.) The description indicates its intended purpose: to return true if the string contains either Unicode or extended ASCII characters. See #154 for a more extended discussion on this topic, and why it was important to include the UTF-8 BOM with files that contained extended characters. (Basically, treat them as containing Unicode.)
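For context, a check along those lines might look something like this (purely illustrative; the function name and details here are not the add-in's actual code):

```vb
' Returns True if the string contains anything outside plain 7-bit ASCII,
' i.e. either true Unicode characters or extended (128-255) characters.
' (Hypothetical sketch for illustration only.)
Public Function StringHasExtendedChars(ByVal strText As String) As Boolean
    Dim lngPos As Long
    Dim lngCode As Long
    For lngPos = 1 To Len(strText)
        lngCode = AscW(Mid$(strText, lngPos, 1))
        ' AscW returns negative values for code points above &H7FFF in VBA
        If lngCode > 127 Or lngCode < 0 Then
            StringHasExtendedChars = True
            Exit Function
        End If
    Next lngPos
End Function
```

A check like this treats the extended characters in a line comment exactly the same as true Unicode, which matches the behavior described above.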
Thanks for the backstory. I agree with @Ik-12 that we should always export using the same format regardless of content. I tested forcing the write without a BOM and it did not fix the problem. So the issue is somewhere in the conversion to UTF-8. The extended characters need to be converted to multibyte, but they are coming out wrong. First thing to note: here is what I see when I dump the string coming into the write function with a plain text write.
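(Something like this quick-and-dirty helper; `DumpString` is illustrative, not part of the add-in:)

```vb
' Write the incoming string straight to disk with a plain text write so it
' can be diffed against the add-in's exported file. (Illustrative only.)
Private Sub DumpString(ByVal strContent As String, ByVal strPath As String)
    Dim intFile As Integer
    intFile = FreeFile
    Open strPath For Output As #intFile    ' uses the system ANSI code page
    Print #intFile, strContent
    Close #intFile
End Sub
```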
The output appears identical to the fancy code that converts to a byte array and writes that. This brings me to two conclusions: the byte-array conversion code is not what is mangling the text, and the string is already wrong before it reaches the write.

So the problem is earlier in the process than the write itself. I proved this by swapping in the simple write and seeing no difference in the output. What this says to me is that it has something to do with how the source text comes out of MS Access in the first place.
- Closes joyfullservice#186
Ok, I have written a new Encoder that is far simpler and more straightforward. It uses two streams: it reads from the old file while writing the new file at the same time. This might even be faster. It is faithfully exporting my extended ASCII characters. Take a look and see if you like it. You might find that it could replace a bunch of other methods. What's odd is that I'm using `_autodetect_all` and now it's working.
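For anyone following along, a minimal sketch of this style of two-stream conversion (late-bound ADODB, illustrative names and defaults; the real code is in the PR):

```vb
' Illustrative sketch: convert a file from one encoding to another using two
' ADODB streams, reading the old file and writing the new file line by line.
Public Sub ConvertFileEncoding(strSource As String, strTarget As String, _
    Optional strReadCharset As String = "_autodetect_all", _
    Optional strWriteCharset As String = "utf-8")

    Const adTypeText As Long = 2
    Const adReadLine As Long = -2
    Const adWriteLine As Long = 1
    Const adSaveCreateOverWrite As Long = 2

    Dim stmIn As Object
    Dim stmOut As Object

    ' Input stream: interpret the source file using the requested charset
    Set stmIn = CreateObject("ADODB.Stream")
    stmIn.Type = adTypeText
    stmIn.Charset = strReadCharset
    stmIn.Open
    stmIn.LoadFromFile strSource

    ' Output stream: text is written out in the target charset
    Set stmOut = CreateObject("ADODB.Stream")
    stmOut.Type = adTypeText
    stmOut.Charset = strWriteCharset
    stmOut.Open

    ' Copy one line at a time from the old file into the new file
    Do While Not stmIn.EOS
        stmOut.WriteText stmIn.ReadText(adReadLine), adWriteLine
    Loop

    stmOut.SaveToFile strTarget, adSaveCreateOverWrite
    stmIn.Close
    stmOut.Close
End Sub
```

Because both streams are text streams with their own `Charset`, the OS handles the actual code page conversion during the read and write.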
Thank you! This may end up helping with #180 as well... I will also want to check the performance logs when running this against large projects. We might get some significant performance gains by reading and writing the data in chunks rather than line by line.
Eureka! A clue! So for my encoder to work, it needs to be in text mode, because we are telling it what text format to write. So the byte-mode chunking you were doing for performance can't be done here. I said to myself, "1 character should = 1 byte, right?" So I changed my code from reading a line at a time to reading chunks of characters at a time:
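Roughly this change to the copy loop from the sketch above (the chunk size here is arbitrary):

```vb
' Read a fixed number of characters at a time instead of a line at a time.
' (Sketch only; this is the variation that brought the corruption back.)
Const lngChunkChars As Long = 4096
Do While Not stmIn.EOS
    stmOut.WriteText stmIn.ReadText(lngChunkChars)
Loop
```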
AND the identical corruption returned! So I think this means the problem has something to do with the number of characters falling on the boundary of the chunks! Maybe this is why it only happens when the line is a certain length. This is bad news for your 128k chunk method, but it does isolate the problem even further.
Hmm... I wonder if, by reading the contents as binary and then converting to a text string after reading the file, we could solve both the performance and the corruption issue...
That's what the new method does. It loads the file as binary:
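(A sketch of that step, with illustrative names and late-bound ADODB; the exact code is in the PR.)

```vb
' Pull the raw bytes of the source file into a binary stream first.
Const adTypeBinary As Long = 1

Dim stmRaw As Object
Set stmRaw = CreateObject("ADODB.Stream")
stmRaw.Type = adTypeBinary
stmRaw.Open
stmRaw.LoadFromFile strSourceFile   ' path to the exported source file
```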
Then it reads that into a text stream of the defined type, while writing it to the output file in any defined type:
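(Again just a sketch; the charset names and file variables here are examples.)

```vb
' Reinterpret the loaded bytes as text in the source encoding, then copy
' line by line into a second text stream that writes the target encoding.
Const adTypeText As Long = 2
Const adReadLine As Long = -2
Const adWriteLine As Long = 1
Const adSaveCreateOverWrite As Long = 2

stmRaw.Position = 0
stmRaw.Type = adTypeText
stmRaw.Charset = "_autodetect_all"     ' or any source encoding

Dim stmNew As Object
Set stmNew = CreateObject("ADODB.Stream")
stmNew.Type = adTypeText
stmNew.Charset = "utf-8"               ' or any target encoding
stmNew.Open

Do While Not stmRaw.EOS
    stmNew.WriteText stmRaw.ReadText(adReadLine), adWriteLine
Loop

stmNew.SaveToFile strTargetFile, adSaveCreateOverWrite
stmRaw.Close
stmNew.Close
```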
This allows you to go from any encoding to any encoding supported by the OS, using native OS encoding methods. That is why I said you could use this method for anything, including replacing the UCS2 methods.
Merging numerous updates to simplify the Unicode conversion process. See #186 for additional details.
We have come full circle. The issue has returned with the new 3.3.7 build.
Would you be able to add something to the test database that would allow me to reproduce the issue?
When converting from one encoding to another, copy the stream all at once to avoid accidentally splitting Unicode character blocks. See #186
If the issue is with converting files, fa2ba83 may resolve it. If it is with reading files, you might do some similar testing on the reading side.
It seems that perhaps the only place where I need to use chunked reading is when reading mixed UTF-16 content as generated by Access ADP projects. I don't see any performance impact with reading streams line by line. See #186
Okay, just pushed another update that should fix the reading of files as well, with no apparent performance penalty. The only thing left that uses chunks to read a file is something specific to ADP projects. I would guess that this should fix the encoding issue here, at least if it was caused by the chunking. 😄
And that shows that it is NOT the chunking. This is a tricky problem. Now I'm seeing it happen because of the `_autodetect_all` charset. When the file first comes out of Access it is in the local code page, and the automatic detection is guessing the wrong encoding when it reads it back. ARG! Have a nice weekend. I'm done for the day.
Is there a way to detect the code page/encoding, and have that be the chosen one? Or is autodetect supposed to do that?
I think the GetACP API might be just what we are needing here... Wrap that in a Select Case to map the code page to the correct encoding string (e.g. 1252 to `windows-1252`).
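A minimal sketch of what that could look like (the `GetACP` declaration is the standard kernel32 call; the mapping below is only an example and would need more entries):

```vb
' Retrieve the active ANSI code page from Windows and map it to an
' ADODB.Stream charset string. (Sketch only; extend the mappings as needed.)
#If VBA7 Then
    Private Declare PtrSafe Function GetACP Lib "kernel32" () As Long
#Else
    Private Declare Function GetACP Lib "kernel32" () As Long
#End If

Public Function GetSystemEncoding() As String
    Select Case GetACP()
        Case 1252: GetSystemEncoding = "windows-1252"   ' Western European
        Case 1251: GetSystemEncoding = "windows-1251"   ' Cyrillic
        Case 932: GetSystemEncoding = "shift_jis"       ' Japanese
        Case 65001: GetSystemEncoding = "utf-8"
        Case Else: GetSystemEncoding = "windows-1252"   ' fallback default
    End Select
End Function
```

The returned string could then be fed straight into the `Charset` property of the output stream instead of `_autodetect_all`.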
Due to problems with the `_autodetect_all` encoding parameter when processing source code files, we are now using a simple `GetACP` API call to retrieve the active code page from Windows, and mapping this to the appropriate encoding. (With support to easily add other encoding mappings in the future.) This should solve the issue described in #186 where the automatic detection was guessing the language incorrectly.
I think that I now understand what was happening with the `_autodetect_all` encoding. Instead of trying to auto-detect the language encoding for the non-Unicode VBA module content, I am now using a simple API call to return the active code page that is already being used by Windows, and encoding the file based on that. This removes all the guesswork and inconsistencies of automatic detection.

The one known exception to this is if the user selects the beta UTF-8 option in the Windows 10 regional administrative settings. (See #180) Since the VBA IDE does not internally support UTF-8, it falls back to the code page mapping of the extended characters and may encounter conversion issues if certain extended characters are used in the code modules. For example, the BOM constants used in this project will not export/import correctly if you use this beta feature. I did some testing and did not really find a simple way to resolve this, so at this point I plan to just leave that as a known limitation. (I really don't think it will affect many users.)
@A9G-Data-Droid - Could you confirm that this is working correctly now as of the latest 3.3.8 build?
Thanks! This is the way I would have handled it. This is a powerful feature that should allow people with different system charsets to share source files, since they are traded in a safe format and then imported in the proper format for each system. I tested this latest version and it handles my new test file just fine. My test results can be seen in PR #190.
I have a horizontal line comment I use:
'—————————————————————————————————————————————————————————
When exported it becomes this UNICODE character:
'覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧
This makes me think that other extended ASCII characters might have a similar problem.