Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix USFM parsing bugs #229

Merged
merged 18 commits into from
Aug 6, 2024
Merged

Fix USFM parsing bugs #229

merged 18 commits into from
Aug 6, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Jul 25, 2024

Fixes #228

  • Add custom exception for parsing
  • Fix off-by-one error
  • Handle triplicate, quadruplicate, n-plicate verses *Add test to cover triplicate verse

This change is Reviewable

*Fix off-by-one error
*Handle triplicate, quadruplicate, n-plicate verses
*Add test to cover triplicate verse
@codecov-commenter
Copy link

codecov-commenter commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 55.00000% with 27 lines in your changes missing coverage. Please review.

Project coverage is 69.66%. Comparing base (136a3ba) to head (6a2d8c9).

Files Patch % Lines
....Machine/Corpora/ParatextProjectTextUpdaterBase.cs 58.82% 13 Missing and 1 partial ⚠️
...L.Machine/Corpora/ZipParatextProjectTextUpdater.cs 0.00% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #229      +/-   ##
==========================================
- Coverage   69.67%   69.66%   -0.02%     
==========================================
  Files         374      377       +3     
  Lines       31317    31374      +57     
  Branches     4387     4391       +4     
==========================================
+ Hits        21821    21856      +35     
- Misses       8478     8498      +20     
- Partials     1018     1020       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/UsfmMemoryTextTests.cs line 86 at r1 (raw file):

        );

        Assert.That(rows, Has.Length.EqualTo(1));

I would check a few more things:

  • That verse one text truly concatenates
  • That a verse 2 is added
  • That non-verse text in all three parts of verse 1 are added properly

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParsingException.cs line 10 at r1 (raw file):

        public UsfmParsingException(UsfmParserState state, Exception exception)
            : base(
                $"Failed to parse at line {state.LineNumber} column {state.ColumnNumber} verse ref {state.VerseRef} with surrounding tokens [{string.Join(",", state.Tokens.ToList().GetRange(Math.Max(state.Index - 3, 0), Math.Min(7, state.Tokens.Count - (state.Index - 3))).Select(t => $"{t.Text} (TokenType={t.Type})"))}]",

How does this interact with the other sources of errors?

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParser.cs line 582 at r1 (raw file):

                }
            }
            catch (Exception e)

Why put the try/catch here rather than in the ProcessTokens function? The ProcessToken function is already 300 lines of code ...

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParsingException.cs line 10 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How does this interact with the other sources of errors?

Ahh - no more "stack empty" errors. Please look into the other error handling for these functions and make sure that there is no redundant information.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r1, all commit messages.
Reviewable status: 4 of 5 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/UsfmParser.cs line 582 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why put the try/catch here rather than in the ProcessTokens function? The ProcessToken function is already 300 lines of code ...

That's a good point. Somehow I wanted to not break the lovely simplicity of ProcessTokens 😆 . Done.


src/SIL.Machine/Corpora/UsfmParsingException.cs line 10 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ahh - no more "stack empty" errors. Please look into the other error handling for these functions and make sure that there is no redundant information.

Exactly! This way, a readable error will propagate up. I couldn't see any other error handling with specific info.


tests/SIL.Machine.Tests/Corpora/UsfmMemoryTextTests.cs line 86 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I would check a few more things:

  • That verse one text truly concatenates
  • That a verse 2 is added
  • That non-verse text in all three parts of verse 1 are added properly

Done.

@johnml1135
Copy link
Collaborator

src/SIL.Machine/Corpora/UsfmParsingException.cs line 10 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Exactly! This way, a readable error will propagate up. I couldn't see any other error handling with specific info.

In UsfmTextBase, it also calls ProcessTokens and uses the ColumnNumber among other things to propagate the error upward. It would likely be good to remove the duplicate information from the error, even if you catch and rethrow as you have here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 5 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/UsfmParsingException.cs line 10 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

In UsfmTextBase, it also calls ProcessTokens and uses the ColumnNumber among other things to propagate the error upward. It would likely be good to remove the duplicate information from the error, even if you catch and rethrow as you have here.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 5 files at r1, 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/UsfmParser.cs line 57 at r4 (raw file):

                );
                sb.Append($"column: {parser.State.ColumnNumber}, error: '{ex.Message}'");
                throw new InvalidOperationException(sb.ToString(), ex);

I would rather not throw a new exception here. I want to avoid a deeply nested hierarchy of inner exceptions that can hide the initial exception. I intentionally shifted the responsibility for these types of exceptions to the classes that use the parser, so that they can add information that is relevant to their context. Also, this function is one of many ways that the UsfmParser class can be used. I don't want to have inconsistent exception handling. It would be better to throw this type of exception from the calling code.


src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 396 at r4 (raw file):

        private void PopNewTokens()
        {
            // if (_replace.Any())

Was this left in by accident?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/UsfmTextUpdater.cs line 396 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Was this left in by accident?

Yes, it was - thank you! Done.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/SIL.Machine/Corpora/UsfmParser.cs line 57 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would rather not throw a new exception here. I want to avoid a deeply nested hierarchy of inner exceptions that can hide the initial exception. I intentionally shifted the responsibility for these types of exceptions to the classes that use the parser, so that they can add information that is relevant to their context. Also, this function is one of many ways that the UsfmParser class can be used. I don't want to have inconsistent exception handling. It would be better to throw this type of exception from the calling code.

OK. I initially had the ProcessToken throwing a custom UsfmParsingException that gave line, column, verse ref, and token context information (that way, it's the same regardless of how it is called), but John said you weren't a fan of that either judging from an earlier PR and committed this instead. What do you suggest, @ddaspit? The way it was was pretty unhelpful. When running the CreateUsfm test, all you're getting is Stack empty or Index out of bounds - it would be helpful to have a more meaningful exception that can help you diagnose the issue. I mean, of course you can just use the debugger 😆 , but it doesn't seem particularly clean.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 5 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/UsfmParser.cs line 57 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK. I initially had the ProcessToken throwing a custom UsfmParsingException that gave line, column, verse ref, and token context information (that way, it's the same regardless of how it is called), but John said you weren't a fan of that either judging from an earlier PR and committed this instead. What do you suggest, @ddaspit? The way it was was pretty unhelpful. When running the CreateUsfm test, all you're getting is Stack empty or Index out of bounds - it would be helpful to have a more meaningful exception that can help you diagnose the issue. I mean, of course you can just use the debugger 😆 , but it doesn't seem particularly clean.

Here is my concern. If we add another layer of exception handling in-between the original exception and the exception at the Serval/SIL.Machine.Corpora layer, then you end up with multiple levels of nesting (InvalidOperationException -> UsfmParsingException -> original exception). Each layer obfuscates the original exception even more. I was hoping to keep it to two levels: the original exception and the calling code exception that adds relevant contextual information.

@Enkidu93
Copy link
Collaborator Author

src/SIL.Machine/Corpora/UsfmParser.cs line 57 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Here is my concern. If we add another layer of exception handling in-between the original exception and the exception at the Serval/SIL.Machine.Corpora layer, then you end up with multiple levels of nesting (InvalidOperationException -> UsfmParsingException -> original exception). Each layer obfuscates the original exception even more. I was hoping to keep it to two levels: the original exception and the calling code exception that adds relevant contextual information.

I understand - particularly in regards to having a helpful error message bubble up to the other end - you're forced to not only pass the exception into the constructor but also append the message to the exception's message to ensure it's interpretable all the way up which is awkward. And it needs to be an InvalidOperationException, right? If so and if your main concern is the nesting, could we throw the InvalidOperationException from ProcessTokens? Or perhaps instead of calling ProcessTokens directly, overload Parse again to take tokens as a first argument and then that can be called GetVersesInDocOrder? That way, we wouldn't have to have multiple places that we're throwing this special, parsing-info-laden InvalidOperationException. Or is the concern that you'd like to have project information in the exception and that isn't possible from within the parser?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/SIL.Machine/Corpora/UsfmParser.cs line 57 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I understand - particularly in regards to having a helpful error message bubble up to the other end - you're forced to not only pass the exception into the constructor but also append the message to the exception's message to ensure it's interpretable all the way up which is awkward. And it needs to be an InvalidOperationException, right? If so and if your main concern is the nesting, could we throw the InvalidOperationException from ProcessTokens? Or perhaps instead of calling ProcessTokens directly, overload Parse again to take tokens as a first argument and then that can be called GetVersesInDocOrder? That way, we wouldn't have to have multiple places that we're throwing this special, parsing-info-laden InvalidOperationException. Or is the concern that you'd like to have project information in the exception and that isn't possible from within the parser?

The calling code might have contextual information that isn't available to the parser. The corpus class adds project information to the exception. Pushing the responsibility to the calling code for how exceptions are handled allows the calling code to customize the exceptions for their specific requirements.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a commit that adds a new ParatextProjectTextUpdaterBase class. It mirrors the ParatextProjectSettingsParserBase class and it can be used to update the USFM from a Paratext project. @Enkidu93, you will need to fill out the missing zip classes. You can follow the ZipParatextProjectSettingsParserBase and ZipParatextProjectSettingsParser classes. This will give us a logical place to throw an exception with project information similar to what we do in UsfmTextBase. We will need to update Serval to use the new classes.

Reviewable status: 4 of 9 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good. I'm on it 👍.

Reviewable status: 4 of 9 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

@Enkidu93
Copy link
Collaborator Author

I submitted a commit that adds a new ParatextProjectTextUpdaterBase class. It mirrors the ParatextProjectSettingsParserBase class and it can be used to update the USFM from a Paratext project. @Enkidu93, you will need to fill out the missing zip classes. You can follow the ZipParatextProjectSettingsParserBase and ZipParatextProjectSettingsParser classes. This will give us a logical place to throw an exception with project information similar to what we do in UsfmTextBase. We will need to update Serval to use the new classes.

Reviewable status: 4 of 9 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

What is the rationale behind having a ZipParatextProjectTextUpdaterBase class and a ZipParatextProjectTextUpdater class besides symmetry. What were you picturing having just in one or in the other, @ddaspit ?

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Aug 1, 2024

I submitted a commit that adds a new ParatextProjectTextUpdaterBase class. It mirrors the ParatextProjectSettingsParserBase class and it can be used to update the USFM from a Paratext project. @Enkidu93, you will need to fill out the missing zip classes. You can follow the ZipParatextProjectSettingsParserBase and ZipParatextProjectSettingsParser classes. This will give us a logical place to throw an exception with project information similar to what we do in UsfmTextBase. We will need to update Serval to use the new classes.

Reviewable status: 4 of 9 files reviewed, 4 unresolved discussions (waiting on @Enkidu93 and @johnml1135)

What is the rationale behind having a ZipParatextProjectTextUpdaterBase class and a ZipParatextProjectTextUpdater class besides symmetry. What were you picturing having just in one or in the other, @ddaspit ?

I'll just push what I have and you can review haha.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a draft - I'll ping when it's time for review (i.e., once I've prepped parallel PR in Serval).

Reviewable status: 4 of 11 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnml1135 @ddaspit I think it's ready for review

Reviewable status: 3 of 11 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

@Enkidu93
Copy link
Collaborator Author

Enkidu93 commented Aug 2, 2024

Parallel PR: sillsdev/serval#447

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 5 files at r6, 1 of 4 files at r7, 5 of 5 files at r9, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @johnml1135)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, 1 of 2 files at r2, 2 of 3 files at r3, 2 of 5 files at r6, 1 of 4 files at r7, 5 of 5 files at r9, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)


tests/SIL.Machine.Tests/Corpora/UsfmMemoryTextTests.cs line 86 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Done.

Ok, so the duplicates are thrown on the ground, but each non-verse

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit cd90626 into master Aug 6, 2024
3 of 4 checks passed
@johnml1135 johnml1135 deleted the fix_usfm_parsing_bugs branch August 6, 2024 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix USFM parsing issues
4 participants