Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: Generalize the method of obtaining space_code #2891

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

ssjkamei
Copy link
Contributor

@ssjkamei ssjkamei commented Oct 5, 2024

I made changes to separate the part that gets the space code from the part that analyzes it for clarity. I hope you like it.

I was wondering about the space_code determination for type1_alternative.
Perhaps the comparison of the if statement itself is not being used, as I think the content is inadequate.

            if words[2].decode() == b" ":
                space_code = i

Copy link

codecov bot commented Oct 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.43%. Comparing base (fcb103a) to head (83d4899).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   96.36%   96.43%   +0.07%     
==========================================
  Files          52       52              
  Lines        8739     8724      -15     
  Branches     1727     1721       -6     
==========================================
- Hits         8421     8413       -8     
+ Misses        186      182       -4     
+ Partials      132      129       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 5, 2024

At the point where space_code is returned, there is logic that is incorrectly returning Width, and when this part of the code is erased, it no longer enters the code below.
I was not sure if I should delete it, but I did because it would cause an error in code coverage.
I don't think I will put back no encoding on the code.

The part that was wrong: return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]

def parse_encoding(
    ft: DictionaryObject, space_code: int
) -> Tuple[Union[str, Dict[int, str]], int]:
    encoding: Union[str, List[str], Dict[int, str]] = []
    if "/Encoding" not in ft:
        try:
            if "/BaseFont" in ft and cast(str, ft["/BaseFont"]) in charset_encoding:
                encoding = dict(
                    zip(range(256), charset_encoding[cast(str, ft["/BaseFont"])])
                )
            else:
                encoding = "charmap"
            return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]
        except Exception:
            if cast(str, ft["/Subtype"]) == "/Type1":
                return "charmap", space_code
            else:
                return "", space_code
.....

Codes that were erased:

    # encoding can be either a string for decode
    # (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
    # if empty string, it means it is than encoding field is not present and
    # we have to select the good encoding from cmap input data
    if encoding == "":
        if -1 not in map_dict or map_dict[-1] == 1:
            # I have not been able to find any rule for no /Encoding nor /ToUnicode
            # One example shows /Symbol,bold I consider 8 bits encoding default
            encoding = "charmap"
        else:
            encoding = "utf-16-be"

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 5, 2024

Please review the code check as it has passed.

encoding = _parse_encoding(ft)
map_dict, int_entry = _parse_to_unicode(ft)

# apply rule from PDF ref 1.7 §5.9.1, 1st bullet :
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# apply rule from PDF ref 1.7 §5.9.1, 1st bullet :
# Apply rule from PDF ref 1.7 §5.9.1, 1st bullet:

# apply rule from PDF ref 1.7 §5.9.1, 1st bullet :
# if cmap not empty encoding should be discarded
# (here transformed into identity for those characters)
# if encoding is an str it is expected to be a identity translation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if encoding is an str it is expected to be a identity translation
# If encoding is a string it is expected to be an identity translation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants