Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port USFM code from Machine up to commit c93f8dc #118

Merged
merged 17 commits into from
Sep 5, 2024
Merged

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Aug 30, 2024

This is the final PR for catching machine.py up to the USFM changes in Machine (until more USFM changes are added to Machine).


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 83.75350% with 116 lines in your changes missing coverage. Please review.

Project coverage is 87.99%. Comparing base (3912c6a) to head (0778bb7).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
tests/corpora/test_usfm_manual.py 4.44% 43 Missing ⚠️
machine/corpora/text_corpus.py 45.16% 17 Missing ⚠️
machine/corpora/alignment_corpus.py 45.00% 11 Missing ⚠️
machine/corpora/parallel_text_corpus.py 85.71% 6 Missing ⚠️
...chine/corpora/zip_paratext_project_text_updater.py 62.50% 6 Missing ⚠️
machine/corpora/dictionary_alignment_corpus.py 66.66% 5 Missing ⚠️
...hine/corpora/paratext_project_text_updater_base.py 83.33% 5 Missing ⚠️
machine/corpora/dictionary_text_corpus.py 69.23% 4 Missing ⚠️
...hine/corpora/paratext_project_terms_parser_base.py 96.42% 4 Missing ⚠️
machine/corpora/standard_parallel_text_corpus.py 72.72% 3 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
+ Coverage   87.81%   87.99%   +0.17%     
==========================================
  Files         249      260      +11     
  Lines       15082    15569     +487     
==========================================
+ Hits        13245    13700     +455     
- Misses       1837     1869      +32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 49 files reviewed, 3 unresolved discussions (waiting on @ddaspit)


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

from io import BytesIO

In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.


machine/corpora/usfm_parser.py line 72 at r13 (raw file):

        self.state.index += 1

        assert self.state.token is not None

I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None


tests/corpora/test_paratext_project_terms_parser.py line 9 at r13 (raw file):

    UsfmStylesheet,
)
from machine.corpora.paratext_project_terms_parser_base import _get_glosses, _strip_parens

I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.

@johnml1135
Copy link
Collaborator

machine/corpora/usfm_text_base.py line 53 at r1 (raw file):

                f"{f' in project {self.project}' if self.project else ''}"
                f". Verse: {parser.state.verse_ref}, line: {parser.state.line_number}, "
                f"column: {parser.state.line_number}, error: '{e}'"

See sillsdev/machine#236 - call it "character"

@johnml1135
Copy link
Collaborator

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

\v 6 Chapter two, verse \w six|strong="12345" \w*.
\p
\v 6 Bad verse. \x - \xo abc\xt 123\x* and more content.

Does this bring it up to match the tests fully? Weren't there more changes than this?

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 49 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/corpora/usfm_text_base.py line 53 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

See sillsdev/machine#236 - call it "character"

Done.


tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Does this bring it up to match the tests fully? Weren't there more changes than this?

There aren't any more changes than this that I'm aware of. Although maybe the changes you're thinking of were in my two previous PRs, since I ended up splitting the porting process into 3 chunks so that it would be easier to review.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 11 files at r1, 4 of 5 files at r2, 7 of 7 files at r3, 2 of 5 files at r4, 3 of 7 files at r5, 2 of 2 files at r7, 1 of 3 files at r8, 1 of 3 files at r9, 5 of 9 files at r11, 3 of 3 files at r12, 18 of 18 files at r13, 1 of 1 files at r14, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)


machine/corpora/alignment_corpus.py line 30 at r13 (raw file):

    def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int:
        with self.get_rows(text_ids) as rows:
            return len(list(rows)) if include_empty else sum(1 for r in self.get_rows(text_ids) if not r.is_empty)

You accidentally called get_rows here. Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.


machine/corpora/alignment_corpus.py line 50 at r13 (raw file):

        return _FilterAlignmentCorpus(self, predicate)

    def filter_texts(self, text_ids: Iterable[str]) -> AlignmentCorpus:

You should rename text_ids to filter, so that it is consistent with my comment on TextCorpus. Also, the parameter should be optional.


machine/corpora/dictionary_alignment_corpus.py line 3 at r13 (raw file):

from typing import Generator, Iterable, Optional, overload

from machine.corpora.alignment_row import AlignmentRow

This import should be relative.


machine/corpora/dictionary_text_corpus.py line 3 at r13 (raw file):

from typing import Generator, Iterable, Optional, overload

from machine.corpora.text_row import TextRow

This import should be relative.


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

In Machine, this file is under the test folder. I moved it to machine/corpora because there are other similar "memory" files that are in the corpora folder, like usfm_memory_texts.py, that seem to only be used for tests. I can move it back to the test folder though if that would be preferable.

This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.


machine/corpora/parallel_text_corpus.py line 108 at r13 (raw file):

    def _get_rows(
        self, text_ids: Optional[Iterable[str]] = None
    ) -> ContextManagedGenerator[ParallelTextRow, None, None]: ...

The return type should be Generator.


machine/corpora/parallel_text_corpus.py line 594 at r13 (raw file):

                return len(self._df)
            return len(self._df[(self._df[self._source_column] != "") & (self._df[self._target_column] != "")])
        return sum(1 for row in self._get_rows(text_ids) if include_empty or not row.is_empty)

You should be able to filter the dataframe by the list of text ids. Something like this:

len(self._df[self._df[self._text_column].isin(text_ids_set)])

machine/corpora/parallel_text_corpus.py line 602 at r13 (raw file):

            if self._text_id_column is not None and self._text_id_column in self._df:
                text_id = cast(str, row[self._text_id_column])
                if text_ids_set is not None and text_id not in text_ids_set:

I would move this if up a level.


machine/corpora/parallel_text_corpus.py line 695 at r13 (raw file):

            if self._text_id_column is not None and self._text_id_column in example:
                text_id = cast(str, example[self._text_id_column])
                if text_ids_set is not None and text_id not in text_ids_set:

I would move this if up a level.


machine/corpora/paratext_project_settings_parser_base.py line 15 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: str) -> bool: ...

These methods should be prefixed with an underscore.


machine/corpora/paratext_project_terms_parser_base.py line 8 at r13 (raw file):

from xml.etree import ElementTree

from machine.corpora.paratext_project_settings import ParatextProjectSettings

This should be a relative import.


machine/corpora/paratext_project_terms_parser_base.py line 92 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: str) -> bool: ...

These methods should be prefixed with an underscore.


machine/corpora/paratext_project_text_updater_base.py line 23 at r13 (raw file):

        self,
        book_id: str,
        rows: Optional[List[Tuple[List[ScriptureRef], str]]] = None,

Instead of List, you should use Sequence.


machine/corpora/paratext_project_text_updater_base.py line 48 at r13 (raw file):

    @abstractmethod
    def exists(self, file_name: StrPath) -> bool: ...

These methods should be prefixed with an underscore.


machine/corpora/text_corpus.py line 45 at r13 (raw file):

    def count(self, include_empty: bool = True, text_ids: Optional[Iterable[str]] = None) -> int:
        with self.get_rows(text_ids) as rows:
            return len(list(rows)) if include_empty else sum(1 for row in rows if include_empty or not row.is_empty)

Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.


machine/corpora/text_corpus.py line 110 at r13 (raw file):

    def filter_texts(
        self, predicate: Optional[Callable[[Text], bool]] = None, text_ids: Optional[Iterable[str]] = None

I would merge predicate and text_ids into a single parameter called filter that is a Union of the two types. You will need to check the type of filter using something like isinstance or callable.


machine/corpora/usfm_memory_texts.py line 0 at r13 (raw file):
This file name should be called usfm_memory_text.py.


machine/corpora/usfm_parser.py line 72 at r13 (raw file):

Previously, mshannon-sil wrote…

I moved up the assert statement in usfm parser so that the error checking tools don't complain about the possibility of self.state.token being None

That should be fine.


machine/corpora/zip_paratext_project_terms_parser.py line 5 at r13 (raw file):

from zipfile import ZipFile

from machine.corpora.paratext_project_settings import ParatextProjectSettings

These imports should be relative.


machine/corpora/zip_paratext_project_text_updater.py line 5 at r13 (raw file):

from zipfile import ZipFile

from machine.corpora.paratext_project_text_updater_base import ParatextProjectTextUpdaterBase

These imports should be relative.


tests/corpora/test_paratext_project_terms_parser.py line 9 at r13 (raw file):

Previously, mshannon-sil wrote…

I know most of the time we just import from machine.corpora when we're in the test folder, rather than from the next level down inside the corpora module. However, that only works if the method is in the init.py, and I didn't think we wanted to add a private helper function for a class to init.py. Is there an issue with importing these specific methods the way I did? I also thought about making them static methods of the class they're helping, since they're static methods in Machine, but I've noticed that the preferred way to handle Machine's static methods has been to define them as private helper functions below the class.

This is fine when testing private functions.


tests/corpora/test_update_usfm_parser_handler.py line 6 at r14 (raw file):

from machine.corpora import ScriptureRef, parse_usfm
from machine.corpora.file_paratext_project_text_updater import FileParatextProjectTextUpdater

These classes should be imported from machine. Corpora.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/corpora/alignment_corpus.py line 30 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You accidentally called get_rows here. Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

Done.


machine/corpora/alignment_corpus.py line 50 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should rename text_ids to filter, so that it is consistent with my comment on TextCorpus. Also, the parameter should be optional.

Done.


machine/corpora/dictionary_alignment_corpus.py line 3 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This import should be relative.

Done.


machine/corpora/dictionary_text_corpus.py line 3 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This import should be relative.

Done.


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This was intentionally moved to the test folder. Because it defines "fake" files, it seems more like a test mock.

Should we move all the memory files to the test folder then?


machine/corpora/parallel_text_corpus.py line 108 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The return type should be Generator.

Done.


machine/corpora/parallel_text_corpus.py line 594 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should be able to filter the dataframe by the list of text ids. Something like this:

len(self._df[self._df[self._text_column].isin(text_ids_set)])

Done.


machine/corpora/parallel_text_corpus.py line 602 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would move this if up a level.

Done.


machine/corpora/parallel_text_corpus.py line 695 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would move this if up a level.

Done.


machine/corpora/paratext_project_settings_parser_base.py line 15 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.


machine/corpora/paratext_project_terms_parser_base.py line 8 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be a relative import.

Done.


machine/corpora/paratext_project_terms_parser_base.py line 92 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.


machine/corpora/paratext_project_text_updater_base.py line 23 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of List, you should use Sequence.

Done. I also changed it for some other function definitions where Machine specified an IReadOnlyList.


machine/corpora/paratext_project_text_updater_base.py line 48 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These methods should be prefixed with an underscore.

Done.


machine/corpora/text_corpus.py line 45 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Just sum(1 for row in rows if include_empty or not row.is_empty) should be sufficient. This will avoid unnecessarily creating a list when include_empty is set to True.

Done.


machine/corpora/text_corpus.py line 110 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would merge predicate and text_ids into a single parameter called filter that is a Union of the two types. You will need to check the type of filter using something like isinstance or callable.

Done.


machine/corpora/usfm_memory_texts.py line at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This file name should be called usfm_memory_text.py.

Done.


machine/corpora/zip_paratext_project_terms_parser.py line 5 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These imports should be relative.

Done.


machine/corpora/zip_paratext_project_text_updater.py line 5 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These imports should be relative.

Done.


tests/corpora/test_update_usfm_parser_handler.py line 6 at r14 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These classes should be imported from machine. Corpora.

Done. I also checked the rest of my files for this and fixed every other instance I found this occurring.

@johnml1135
Copy link
Collaborator

tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, mshannon-sil wrote…

There aren't any more changes than this that I'm aware of. Although maybe the changes you're thinking of were in my two previous PRs, since I ended up splitting the porting process into 3 chunks so that it would be easier to review.

Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 24 of 24 files at r15, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @mshannon-sil)


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

Should we move all the memory files to the test folder then?

As we discussed in the standup, you just need to move this class.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

As we discussed in the standup, you just need to move this class.

Done. And just to confirm, it's okay that I'm importing it as from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 50 of 53 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


tests/testutils/data/usfm/Tes/41MATTes.SFM line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - I am not reviewing in depth, I am just wanting to make sure that the testing is the same between the two.

Yes, the test cases look to match up between both repos.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r16, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/corpora/memory_paratext_project_terms_parser.py line 1 at r13 (raw file):

Previously, mshannon-sil wrote…

Done. And just to confirm, it's okay that I'm importing it as from tests.corpora.memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser right? I first tried to just use `from .memory_paratext_project_terms_parser import MemoryParatextProjectTermsParser, but that caused a pytest test discovery error, so I think there's something about how packages work that I might not be understanding.

Yes, that is perfectly fine.

@johnml1135 johnml1135 merged commit 88f5bad into main Sep 5, 2024
12 of 13 checks passed
@ddaspit ddaspit deleted the #102_usfm_3 branch September 23, 2024 15:53
@mshannon-sil mshannon-sil linked an issue Oct 8, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Port Machine's USFM generation changes to machine.py
4 participants