Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UsfmVerseTextUpdater class #160

Merged
merged 1 commit into from
Jan 30, 2024
Merged

Add UsfmVerseTextUpdater class #160

merged 1 commit into from
Jan 30, 2024

Conversation

ddaspit
Copy link
Contributor

@ddaspit ddaspit commented Jan 25, 2024

  • add class to update the verse text in an existing USFM file
  • refactor Paratext project settings parsing into a reusable class

This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Jan 25, 2024

Codecov Report

Attention: 71 lines in your changes are missing coverage. Please review.

Comparison is base (79c0832) 66.04% compared to head (267de3c) 66.59%.

Files Patch % Lines
...achine/Corpora/ZipParatextProjectSettingsParser.cs 51.11% 17 Missing and 5 partials ⚠️
...chine/Corpora/ParatextProjectSettingsParserBase.cs 80.76% 7 Missing and 8 partials ⚠️
src/SIL.Machine/Corpora/UsfmParser.cs 31.81% 14 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/UsfmVerseTextUpdater.cs 92.85% 8 Missing and 2 partials ⚠️
...chine/Corpora/FileParatextProjectSettingsParser.cs 80.00% 3 Missing and 1 partial ⚠️
...c/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs 83.33% 2 Missing and 1 partial ⚠️
src/SIL.Machine/Corpora/ParatextProjectSettings.cs 94.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #160      +/-   ##
==========================================
+ Coverage   66.04%   66.59%   +0.55%     
==========================================
  Files         428      433       +5     
  Lines       33529    33738     +209     
  Branches     4511     4525      +14     
==========================================
+ Hits        22144    22469     +325     
+ Misses      10301    10200     -101     
+ Partials     1084     1069      -15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135 johnml1135 self-requested a review January 26, 2024 15:54
@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/FileParatextProjectSettingsParser.cs line 5 at r1 (raw file):

using System.IO;
using System.Linq;
using System.Text;

Wouldn't you put the usings into their own file?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/FileParatextProjectSettingsParser.cs line 5 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Wouldn't you put the usings into their own file?

Ah, older version of .net.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 67 at r1 (raw file):

            string prefix = "";
            string form = "41MAT";

So, we default to Matthew? Shouldn't we default to crashing? I don't think I fully understand this logic and intention.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

            string form = "41MAT";
            string suffix = ".SFM";
            XElement namingElem = settingsDoc.Root.Element("Naming");

Part of the general question of if the data is not meeting expectations. I don't know the history of this code (I am assuming it is copied from somewhere else with minor edits). If that is the case, are there any substantive logic changes?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Part of the general question of if the data is not meeting expectations. I don't know the history of this code (I am assuming it is copied from somewhere else with minor edits). If that is the case, are there any substantive logic changes?

How to handle missing data, exceptions, etc.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 86 at r1 (raw file):

            string biblicalTerms = settingsDoc.Root.Element("BiblicalTermsListSetting").Value;
            string[] parts = biblicalTerms.Split(new[] { ':' }, 3);

There are so many ways to fail, including not getting 3 sections. Are we ok with just throwing an exception and failing the process?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextTextCorpus.cs line 9 at r1 (raw file):

        public ParatextTextCorpus(string projectDir, bool includeMarkers = false)
        {
            var parser = new FileParatextProjectSettingsParser(projectDir);

For my information, what is the difference between ParatextTextCorpus and ParatextBackupTextCorpus?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParser.cs line 47 at r1 (raw file):

        }

        private static readonly Regex OptBreakSplitter = new Regex("(//)", RegexOptions.Compiled);

I'm assuming this is just a naming update. Descriptive names are better than comments?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 31 at r1 (raw file):

                    return;

                var settingsParser = new ZipParatextProjectSettingsParser(archive);

Ah, the backup is using the zipped one...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextTextCorpus.cs line 9 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

For my information, what is the difference between ParatextTextCorpus and ParatextBackupTextCorpus?

Ah - one grabs from a zip file, the other from files on the computer.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmVerseTextUpdater.cs line 117 at r1 (raw file):

            string pubNumber
        )
        {

A really simple explanation of what this is doing would be great. This class is not used in machine - likely only in Serval (which hasn't been pushed yet). It appears to be used to update an existing USFM file with pretranslations - but the logic appears split between a few locations.

Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 13 files reviewed, 8 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine/Corpora/FileParatextProjectSettingsParser.cs line 5 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ah, older version of .net.

Yes, that is correct.


src/SIL.Machine/Corpora/ParatextBackupTermsCorpus.cs line 31 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ah, the backup is using the zipped one...

Yes, that is correct.


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 67 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, we default to Matthew? Shouldn't we default to crashing? I don't think I fully understand this logic and intention.

These are the defaults in Paratext for these settings. In this case, it indicates the file name pattern for USFM files in the project.


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How to handle missing data, exceptions, etc.

Yes, this code was essentially extracted from another class and made to be reusable.


src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 86 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There are so many ways to fail, including not getting 3 sections. Are we ok with just throwing an exception and failing the process?

Throwing exceptions is the right approach. If anything throws an exception, then the settings file is corrupt or invalid and we don't know how to interpret it. We could always add in more descriptive exceptions later.


src/SIL.Machine/Corpora/ParatextTextCorpus.cs line 9 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ah - one grabs from a zip file, the other from files on the computer.

Yes, that is correct.


src/SIL.Machine/Corpora/UsfmParser.cs line 47 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I'm assuming this is just a naming update. Descriptive names are better than comments?

This field was moved to a public property in the parser state, so that it is available while parsing.


src/SIL.Machine/Corpora/UsfmVerseTextUpdater.cs line 117 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

A really simple explanation of what this is doing would be great. This class is not used in machine - likely only in Serval (which hasn't been pushed yet). It appears to be used to update an existing USFM file with pretranslations - but the logic appears split between a few locations.

Done.

@johnml1135
Copy link
Collaborator

There is a bit of a lack of coverage for the TextUpdater, specifically the Sidebar, Milestone, Ref, OptBreak , Unmatched. Also, there is some missing coverage in the other files (as seen above in the report).

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/ParatextProjectSettingsParserBase.cs line 69 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, this code was essentially extracted from another class and made to be reusable.

In aerospace, if code has been working for a long time, it's grandfathered in. That's fine - we can address issues if they come up.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

@johnml1135
Copy link
Collaborator

I would like more testing for these cases...

Copy link
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more unit tests.

Dismissed @johnml1135 from a discussion.
Reviewable status: 11 of 18 files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator

I suspected that there could be an issue there :-)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@johnml1135 johnml1135 merged commit d70813c into master Jan 30, 2024
4 checks passed
@ddaspit ddaspit deleted the update-usfm branch January 30, 2024 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants