Port USFM code from Machine up to commit c93f8dc #118

mshannon-sil · 2024-08-30T01:37:49Z

This is the final PR for catching machine.py up to the USFM changes in Machine (until more USFM changes are added to Machine).

This change is

codecov-commenter · 2024-08-30T01:38:34Z

Codecov Report

Attention: Patch coverage is 83.75350% with 116 lines in your changes missing coverage. Please review.

Project coverage is 87.99%. Comparing base (3912c6a) to head (0778bb7).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
tests/corpora/test_usfm_manual.py	4.44%	43 Missing ⚠️
machine/corpora/text_corpus.py	45.16%	17 Missing ⚠️
machine/corpora/alignment_corpus.py	45.00%	11 Missing ⚠️
machine/corpora/parallel_text_corpus.py	85.71%	6 Missing ⚠️
...chine/corpora/zip_paratext_project_text_updater.py	62.50%	6 Missing ⚠️
machine/corpora/dictionary_alignment_corpus.py	66.66%	5 Missing ⚠️
...hine/corpora/paratext_project_text_updater_base.py	83.33%	5 Missing ⚠️
machine/corpora/dictionary_text_corpus.py	69.23%	4 Missing ⚠️
...hine/corpora/paratext_project_terms_parser_base.py	96.42%	4 Missing ⚠️
machine/corpora/standard_parallel_text_corpus.py	72.72%	3 Missing ⚠️
... and 6 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
+ Coverage   87.81%   87.99%   +0.17%     
==========================================
  Files         249      260      +11     
  Lines       15082    15569     +487     
==========================================
+ Hits        13245    13700     +455     
- Misses       1837     1869      +32

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mshannon-sil

Reviewable status: 0 of 49 files reviewed, 3 unresolved discussions (waiting on @ddaspit)

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

from io import BytesIO

In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.

machine/corpora/usfm_parser.py line 72 at r13 (raw file):

        self.state.index += 1

        assert self.state.token is not None

I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None

tests/corpora/test_paratext_project_terms_parser.py line 9 at r13 (raw file):

    UsfmStylesheet,
)
from machine.corpora.paratext_project_terms_parser_base import _get_glosses, _strip_parens

I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.

johnml1135 · 2024-08-30T12:52:30Z

machine/corpora/usfm_text_base.py line 53 at r1 (raw file):

                f"{f' in project {self.project}' if self.project else ''}"
                f". Verse: {parser.state.verse_ref}, line: {parser.state.line_number}, "
                f"column: {parser.state.line_number}, error: '{e}'"

See sillsdev/machine#236 - call it "character"

johnml1135 · 2024-08-30T12:54:47Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

\v 6 Chapter two, verse \w six|strong="12345" \w*.
\p
\v 6 Bad verse. \x - \xo abc\xt 123\x* and more content.

Does this bring it up to match the tests fully? Weren't there more changes than this?

mshannon-sil

Reviewable status: 0 of 49 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/corpora/usfm_text_base.py line 53 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

See sillsdev/machine#236 - call it "character"

Done.

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Does this bring it up to match the tests fully? Weren't there more changes than this?

There aren't any more changes than this that I'm aware of. Although maybe the changes you're thinking of were in my two previous PRs, since I ended up splitting the porting process into 3 chunks so that it would be easier to review.

ddaspit

Reviewed 3 of 11 files at r1, 4 of 5 files at r2, 7 of 7 files at r3, 2 of 5 files at r4, 3 of 7 files at r5, 2 of 2 files at r7, 1 of 3 files at r8, 1 of 3 files at r9, 5 of 9 files at r11, 3 of 3 files at r12, 18 of 18 files at r13, 1 of 1 files at r14, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)

machine/corpora/alignment_corpus.py line 30 at r13 (raw file):

    def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int:
        with self.get_rows(text_ids) as rows:
            return len(list(rows)) if include_empty else sum(1 for r in self.get_rows(text_ids) if not r.is_empty)

You accidentally called get_rows here. Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

machine/corpora/alignment_corpus.py line 50 at r13 (raw file):

        return _FilterAlignmentCorpus(self, predicate)

    def filter_texts(self, text_ids: Iterable[str]) -> AlignmentCorpus:

You should rename text_ids to filter, so that it is consistent with my comment on TextCorpus. Also, the parameter should be optional.

machine/corpora/dictionary_alignment_corpus.py line 3 at r13 (raw file):

from typing import Generator, Iterable, Optional, overload

from machine.corpora.alignment_row import AlignmentRow

This import should be relative.

machine/corpora/dictionary_text_corpus.py line 3 at r13 (raw file):

from typing import Generator, Iterable, Optional, overload

from machine.corpora.text_row import TextRow

This import should be relative.

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.

This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.

machine/corpora/parallel_text_corpus.py line 108 at r13 (raw file):

    def _get_rows(
        self, text_ids: Optional[Iterable[str]] = None
    ) -> ContextManagedGenerator[ParallelTextRow, None, None]: ...

The return type should be Generator.

machine/corpora/parallel_text_corpus.py line 594 at r13 (raw file):

                return len(self._df)
            return len(self._df[(self._df[self._source_column] != "") & (self._df[self._target_column] != "")])
        return sum(1 for row in self._get_rows(text_ids) if include_empty or not row.is_empty)

You should be able to filter the dataframe by the list of text ids. Something like this:

len(self._df[self._df[self._text_column].isin(text_ids_set)])

machine/corpora/parallel_text_corpus.py line 602 at r13 (raw file):

            if self._text_id_column is not None and self._text_id_column in self._df:
                text_id = cast(str, row[self._text_id_column])
                if text_ids_set is not None and text_id not in text_ids_set:

I would move this if up a level.

machine/corpora/parallel_text_corpus.py line 695 at r13 (raw file):

            if self._text_id_column is not None and self._text_id_column in example:
                text_id = cast(str, example[self._text_id_column])
                if text_ids_set is not None and text_id not in text_ids_set:

I would move this if up a level.

machine/corpora/paratext_project_settings_parser_base.py line 15 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: str) -> bool: ...

These methods should be prefixed with an underscore.

machine/corpora/paratext_project_terms_parser_base.py line 8 at r13 (raw file):

from xml.etree import ElementTree

from machine.corpora.paratext_project_settings import ParatextProjectSettings

This should be a relative import.

machine/corpora/paratext_project_terms_parser_base.py line 92 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: str) -> bool: ...

These methods should be prefixed with an underscore.

machine/corpora/paratext_project_text_updater_base.py line 23 at r13 (raw file):

        self,
        book_id: str,
        rows: Optional[List[Tuple[List[ScriptureRef], str]]] = None,

Instead of List, you should use Sequence.

machine/corpora/paratext_project_text_updater_base.py line 48 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: StrPath) -> bool: ...

These methods should be prefixed with an underscore.

machine/corpora/text_corpus.py line 45 at r13 (raw file):

    def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int:
        with self.get_rows(text_ids) as rows:
            return len(list(rows)) if include_empty else sum(1 for row in rows if include_empty or not row.is_empty)

Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

machine/corpora/text_corpus.py line 110 at r13 (raw file):

    def filter_texts(
        self, predicate: Optional[Callable[[Text], bool]] = None, text_ids: Optional[Iterable[str]] = None

I would merge predicate and text_ids into a single parameter called filter that is a Union of the two types. You will need to check the type of filter using something like isinstance or callable.

machine/corpora/usfm_memory_texts.py line 0 at r13 (raw file):
This file name should be called usfm_memory_text.py.

machine/corpora/usfm_parser.py line 72 at r13 (raw file):

Previously, mshannon-sil wrote…

I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None

That should be fine.

machine/corpora/zip_paratext_project_terms_parser.py line 5 at r13 (raw file):

from zipfile import ZipFile

from machine.corpora.paratext_project_settings import ParatextProjectSettings

These imports should be relative.

machine/corpora/zip_paratext_project_text_updater.py line 5 at r13 (raw file):

from zipfile import ZipFile

from machine.corpora.paratext_project_text_updater_base import ParatextProjectTextUpdaterBase

These imports should be relative.

tests/corpora/test_paratext_project_terms_parser.py line 9 at r13 (raw file):

Previously, mshannon-sil wrote…

I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.

This is fine when testing private functions.

tests/corpora/test_update_usfm_parser_handler.py line 6 at r14 (raw file):

from machine.corpora import ScriptureRef, parse_usfm
from machine.corpora.file_paratext_project_text_updater import FileParatextProjectTextUpdater

These classes should be imported from machine. Corpora.

mshannon-sil

Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/corpora/alignment_corpus.py line 30 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You accidentally called get_rows here. Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

Done.

machine/corpora/alignment_corpus.py line 50 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should rename text_ids to filter, so that it is consistent with my comment on TextCorpus. Also, the parameter should be optional.

Done.

machine/corpora/dictionary_alignment_corpus.py line 3 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This import should be relative.

Done.

machine/corpora/dictionary_text_corpus.py line 3 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This import should be relative.

Done.

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.

Should we move all the memory files to the test folder then?

machine/corpora/parallel_text_corpus.py line 108 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The return type should be Generator.

Done.

machine/corpora/parallel_text_corpus.py line 594 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should be able to filter the dataframe by the list of text ids. Something like this:
len(self._df[self._df[self._text_column].isin(text_ids_set)])

Done.

machine/corpora/parallel_text_corpus.py line 602 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would move this if up a level.

Done.

machine/corpora/parallel_text_corpus.py line 695 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would move this if up a level.

Done.

machine/corpora/paratext_project_settings_parser_base.py line 15 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.

machine/corpora/paratext_project_terms_parser_base.py line 8 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be a relative import.

Done.

machine/corpora/paratext_project_terms_parser_base.py line 92 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.

machine/corpora/paratext_project_text_updater_base.py line 23 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of List, you should use Sequence.

Done. I also changed it for some other function definitions where Machine specified an IReadOnlyList.

machine/corpora/paratext_project_text_updater_base.py line 48 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.

machine/corpora/text_corpus.py line 45 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

Done.

machine/corpora/text_corpus.py line 110 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would merge predicate and text_ids into a single parameter called filter that is a Union of the two types. You will need to check the type of filter using something like isinstance or callable.

Done.

machine/corpora/usfm_memory_texts.py line at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This file name should be called usfm_memory_text.py.

Done.

machine/corpora/zip_paratext_project_terms_parser.py line 5 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These imports should be relative.

Done.

machine/corpora/zip_paratext_project_text_updater.py line 5 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These imports should be relative.

Done.

tests/corpora/test_update_usfm_parser_handler.py line 6 at r14 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These classes should be imported from machine. Corpora.

Done. I also checked the rest of my files for this and fixed every other instance I found this occurring.

johnml1135 · 2024-09-04T11:37:53Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, mshannon-sil wrote…

There aren't any more changes than this that I'm aware of. Although maybe the changes you're thinking of were in my two previous PRs, since I ended up splitting the porting process into 3 chunks so that it would be easier to review.

Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two.

ddaspit

Reviewed 24 of 24 files at r15, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)

ddaspit

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

Should we move all the memory files to the test folder then?

As we discussed in the standup, you just need to move this class.

mshannon-sil

Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

As we discussed in the standup, you just need to move this class.

Done. And just to confirm, it's okay that I'm importing it as from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.

mshannon-sil

Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two.

Yes, the test cases look to match up between both repos.

ddaspit

Reviewed 3 of 3 files at r16, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

Done. And just to confirm, it's okay that I'm importing it as from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.

Yes, that is perfectly fine.

…including zipped projects

mshannon-sil added the development label Aug 30, 2024

mshannon-sil requested a review from ddaspit August 30, 2024 01:37

mshannon-sil self-assigned this Aug 30, 2024

mshannon-sil commented Aug 30, 2024

View reviewed changes

ddaspit requested changes Aug 30, 2024

View reviewed changes

mshannon-sil commented Sep 4, 2024

View reviewed changes

ddaspit reviewed Sep 5, 2024

View reviewed changes

mshannon-sil commented Sep 5, 2024

View reviewed changes

ddaspit approved these changes Sep 5, 2024

View reviewed changes

mshannon-sil added 14 commits September 5, 2024 16:44

port commit afc6328, fix usfm parsing/generation issues

0a66055

port commit 3182ff2, fix error on terms id duplication

97053cd

port commit 8efc011, skip parsing excluded books when preprocessing

46c0a3d

port commit 4e88953, fix missing last verse

47f328a

port commit ceb586c, remove obsolete handling of \d marker

4289310

port commit 1011777; fix 2, 2a, 2b verse merging

f4812e3

port commit 2469d89, fix sillsdev/serval#424

c70f6e5

port commit 385213e, fix sillsdev/serval#424

785128e

port commit 2659e19, usfm nonverse multichapter

2a2884b

port commit 3911dd6, extend manual test to process multiple projects …

7eec4f3

…including zipped projects

port commit cd90626, fix USFM parsing bugs

af0604f

port commit a15d24e, fixes sillsdev/serval#449

ac8d33f

port commits from PR #233, use terms localizations

0d73c5b

call the horizontal position 'character'

1ec92f9

mshannon-sil added 3 commits September 5, 2024 16:44

address review feedback

7426e20

move memory_paratext_project_terms_parser.py to tests folder

161e65a

port commit 7a819ed, remove duplicate test

56ba148

johnml1135 force-pushed the #102_usfm_3 branch from 0778bb7 to 56ba148 Compare September 5, 2024 20:44

johnml1135 merged commit 88f5bad into main Sep 5, 2024
12 of 13 checks passed

ddaspit deleted the #102_usfm_3 branch September 23, 2024 15:53

mshannon-sil linked an issue Oct 8, 2024 that may be closed by this pull request

Port Machine's USFM generation changes to machine.py #102

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port USFM code from Machine up to commit c93f8dc #118

Port USFM code from Machine up to commit c93f8dc #118

mshannon-sil commented Aug 30, 2024 •

edited by ddaspit

Loading

codecov-commenter commented Aug 30, 2024 •

edited

Loading

mshannon-sil left a comment

johnml1135 commented Aug 30, 2024

johnml1135 commented Aug 30, 2024

mshannon-sil left a comment

ddaspit left a comment

mshannon-sil left a comment

johnml1135 commented Sep 4, 2024

ddaspit left a comment

ddaspit left a comment

mshannon-sil left a comment

mshannon-sil left a comment

ddaspit left a comment

Port USFM code from Machine up to commit c93f8dc #118

Port USFM code from Machine up to commit c93f8dc #118

Conversation

mshannon-sil commented Aug 30, 2024 • edited by ddaspit Loading

codecov-commenter commented Aug 30, 2024 • edited Loading

Codecov Report

mshannon-sil left a comment

Choose a reason for hiding this comment

johnml1135 commented Aug 30, 2024

johnml1135 commented Aug 30, 2024

mshannon-sil left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

johnml1135 commented Sep 4, 2024

ddaspit left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

mshannon-sil left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

mshannon-sil commented Aug 30, 2024 •

edited by ddaspit

Loading

codecov-commenter commented Aug 30, 2024 •

edited

Loading