Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test ascii #187

Merged
merged 23 commits into from
Mar 5, 2021
Merged

Test ascii #187

merged 23 commits into from
Mar 5, 2021

Conversation

A9G-Data-Droid
Copy link
Contributor

I fleshed out the basExtendedChars.bas based on discussion in issue #186

A9G-Data-Droid and others added 5 commits March 3, 2021 15:25
This adds all the ASCII characters to the test DB to detect export failures
Built and exported with the latest version 3.3.1
@joyfullservice
Copy link
Owner

Could you adjust this PR to target the dev branch? That will give us a change to test and tweak things before rolling it over to master. Thanks!

@joyfullservice joyfullservice changed the base branch from master to dev March 4, 2021 22:05
@joyfullservice
Copy link
Owner

Nice! It allowed me to change the base branch on the PR... On to some testing...

@joyfullservice
Copy link
Owner

Good work! One note is that we need to include the UTF-8 BOM anytime the content of a file contains Unicode characters. For example, in the testing database, frmMain.bas includes Unicode checkmark characters, and needs the BOM to display correctly in some VCS programs (i.e. SourceTree)

@A9G-Data-Droid
Copy link
Contributor Author

A9G-Data-Droid commented Mar 4, 2021

Good news. I am always writing the BOM now. I would rather have 1 format instead of guessing on both export and import. This is the default when you tell a stream to be UTF-8 and so let's keep that convention.

@A9G-Data-Droid
Copy link
Contributor Author

A9G-Data-Droid commented Mar 4, 2021

I hate calling external DLLs from VBA because it looks obtuse. I have replaced the WideChar calls with a native function that is much smaller and easier to read. I also added a unit test to show that they work.

@A9G-Data-Droid
Copy link
Contributor Author

So I did an export of my main project with the new version and I'm seeing almost 5 seconds off total export time. That's anecdotal and not a tightly controlled test but it's promising.

Version 3.3.1

Done. (30.41 seconds)

Version 3.3.3

Done. (25.03 seconds)

@joyfullservice
Copy link
Owner

Good news. I am always writing the BOM now. I would rather have 1 format instead of guessing on both export and import. This is the default when you tell a stream to be UTF-8 and so let's keep that convention.

You have some very valid considerations here... Up to this point I was taking the approach that the UTF-8 BOM should only be used where it was actually needed, where the file contained Unicode or Extended characters. While technically not required, the BOM does help the display in certain programs such as diffing tools.

My reasoning for selectively using the BOM was:

  • It is technically not required, per the Unicode spec.
  • It adds additional characters to the beginning of the file that may be confusing to the user, especially in the case of simple text files with otherwise understandable content.
  • My research seemed to point to avoiding the use of the BOM except for situations where it was truly needed.

However, based on your suggestion, and other conversations in #154, I can see some benefits to always using the UTF-8 BOM (as a signature) in the files used in this project.

  • It gives a single, consistent output format for all source files, as discussed in Inconsistent character encoding in exported files #154.
  • It is the default output when using the stream object to save UTF-8 content, keeping things simple.
  • It eliminates the complexity of needing to check for UTF-8 characters when saving files.
  • It helps in situations where Unicode content may be inserted into an existing source file (i.e. resolving source file conflicts) that might not otherwise have the BOM.
  • The vast majority of software used with version control is already designed to handle the BOM.

Based on this, and the feedback I have seen from others using the add-in, I am willing to go ahead and make this change to always include the BOM in output files. I don't do this lightly, knowing that this has the impact of changing a large number of export files in every project going forward, but it does seem like this is the best long-term direction to take with the UTF-8 standardization in this project.

@@ -74,7 +74,7 @@ Begin Report
Begin FormHeader
KeepTogether = NotDefault
Height =960
BackColor =15849926
BackColor =15064278
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@A9G-Data-Droid, @joyfullservice: This is what I was experiencing on #183, have you noticed this? Any clues as to what the heck is happening?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to do with exporting in different versions of Access? I notice this when comparing exports from Access 2010 and newer versions... I haven't actually dug into comparing the color values.

Copy link
Contributor

@hecon5 hecon5 Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure! I noticed it when me and another dev were exporting the same file (export-import-export) on the same machine will sometimes do this, from their dev machine export - mine import - mine export will almost always cause this.
We verified same OS and Office Version, as well; so this makes it more fun.

Note that we use themes; I'm wondering if this might have something to do with it; but I notice it also when I export the Version Control as well, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens to me every time. Themes might be a clue. Most colors are not defined as a concrete number. Instead they are references to the system colors. Things like "Text Dark", "Text Light", "Background Dark Header" and "Background Light Header" come from Windows system settings and do not correspond to the same RGB value on every computer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, I didn't ask the other dev if they're using the same Windows Theme; I'll check with them. I could also export, change my theme, import, export.

},
"Items": {
"Name": "VCS Testing",
"Description": "For automated testing of Version Control",
"FileName": "rel:Testing.accdb",
"HelpFile": "",
"HelpFile": "99672516",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hecon5 this is even more mysterious than the colors changing. I get random numbers in this HelpFile property every time I export. I never change this value manually. It should be a path. I'm not sure where this number is being introduced.

You can see it from the VBE by going to the Project Properties:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAME! It has repeatedly boggled me; I've simply reverted it and moved on with my life, but yeah; no idea why it does this. I'll even occasionally get kanji characters there, but cannot for the life of me figure it out.

 -  Don't read the full file just to look at the first 3 characters.
Perf.OperationStart "Unicode Check"
Set reg = New VBScript_RegExp_55.RegExp
With reg
With New VBScript_RegExp_55.RegExp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the regex is already included as a reference, you can also call this from With New RegExp; though this does guarantee you get the flavor you wanted, it may make it harder to move to the next one (lol, VBA's never getting updated) if it ever is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I've never seen this explicit version call out. I always use just RegExp.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some functionality differences between versions 1.0 and 5.5 (such as lazy quantifiers) but since this is the only one referenced in the project, I would think we are probably safe to just use RegExp.

 -  Don't manually add a BOM because it can result in two
@A9G-Data-Droid
Copy link
Contributor Author

I apologize for changing all the files in the test database in this PR. I know it gets ugly. I had created a bug and then fixed it. This proves that the bug is fixed because we can export and import against this set of test DB files and it works.

@A9G-Data-Droid
Copy link
Contributor Author

The bug was that if you manually insert the BOM during WriteBinaryFile you can end up with JSON files that have 2 BOMs. Then the JSON detection fails because it's looking for a { bracket and it sees the 2nd BOM and goes BOOM.

@A9G-Data-Droid
Copy link
Contributor Author

A9G-Data-Droid commented Mar 5, 2021

I think I'm finally done messing with this PR. I performed a Full Export on my main project and the index.json shows that all the ExportDate changed and the Hash all stayed the same. This says to me this current working version is capable of exporting all my stuff the same way that it did before! All in 25 seconds.

@joyfullservice
Copy link
Owner

I think I'm finally done messing with this PR. I performed a Full Export on my main project and the index.json shows that all the ExportDate changed and the Hash all stayed the same. This says to me this current working version is capable of exporting all my stuff the same way that it did before! All in 25 seconds.

Excellent! Let me run it on a couple of my projects, including an ADP project, to make sure it is working fine on my end too. I am also curious to compare the performance logs after the change.

We need to find another source for sample extended characters.
@joyfullservice
Copy link
Owner

I did some performance testing to compare this PR with the dev branch using a couple large projects. Exports are pretty similar in performance, but building from source took a pretty drastic performance hit with Unicode file conversion going from less than a second to over 36 seconds on a large project.

My hunch is that this is due to the way we are reading and writing the files, but I am going to do some testing to see if we can bring this back down to close to what we had before.

@A9G-Data-Droid
Copy link
Contributor Author

Those blocks labeled "Unicode Conversion" are areas I think that we could switch to happening line at a time instead of reading it all in to memory, performing some operation on the block, and then writing it out. It could all be done automatically by the stream like I do for the ReEncoder. I know that you have seen good performance improvement with the 128k blocks but I think that was the cause of the original corruption.

@A9G-Data-Droid
Copy link
Contributor Author

Ok, that turned out to be a simple switch.

My time is reduced further:
Done. (23.64 seconds)

@A9G-Data-Droid
Copy link
Contributor Author

A9G-Data-Droid commented Mar 5, 2021

If nothing else, you now have a single place to optimize and it will affect 4 conversions that used to be separate. (This is what I get for saying I was done)

@joyfullservice
Copy link
Owner

If nothing else, you now have a single place to optimize and it will affect 4 conversions that used to be separate. (This is what I get for saying I was done)

Sounds good. I am just about finished with the optimizing... I think I will probably need to resolve some source conflicts when I get through, based on your latest commits... 😄 But hey, that's what this project is for, to allow us to resolve conflicts at the source level, right! 😄

@joyfullservice
Copy link
Owner

Nailed it! Getting performance similar or even slightly faster than the original version. Doing a few more cleanup tweaks...

joyfullservice added 6 commits March 5, 2021 16:41
This allows you to easily rename log files for comparison testing.
Since we are no longer doing anything fancy that requires these to be in `modEncoding`, I have returned them to `modFileAccess` where they fit better with the purpose of the module.
Adjustments to encoding functions to utilize stream approach.
Functions *only* used for testing should be in the unit testing module, not alongside production code.
This is used in several places.
Logging a few more functions, and expanding performance reports table.
@joyfullservice
Copy link
Owner

Okay, @A9G-Data-Droid - Can you try this out on your system? Let me know if you encounter any issues with the chunked reading this time.

Copy link
Contributor Author

@A9G-Data-Droid A9G-Data-Droid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These used to be called elsewhere. If they are no longer used it should all be removed.

@joyfullservice
Copy link
Owner

These used to be called elsewhere. If they are no longer used it should all be removed.

Are you referring to the test functions I moved in 51f0b59? If so, I am happy to remove them. I just didn't want to make the assumption and delete your work. 😄

@A9G-Data-Droid
Copy link
Contributor Author

Yes, I have verified that they aren't being used now that we have centralized all the encoding.

@A9G-Data-Droid
Copy link
Contributor Author

Why is ConvertUcs2Utf8 the odd one out that doesn't match it's complimentary ConvertUtf8Ucs2 ?

Verified the output from the testing database, updating a few files that have changed. (Other than color values.)
@joyfullservice
Copy link
Owner

Why is ConvertUcs2Utf8 the odd one out that doesn't match it's complimentary ConvertUtf8Ucs2 ?

That one involves some special handling for mixed UTF-16 content where the extended/Unicode characters are encoded, but the rest of the file isn't. (I see this with ADP projects.) If there is a way to do that with the other ReEncode function, we can simplify this function as well. (But this can be an additional update after merging this PR.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants