MAINT: Generalize the method of obtaining space_code #2891

ssjkamei · 2024-10-05T13:05:56Z

I made changes to separate the part that gets the space code from the part that analyzes it for clarity. I hope you like it.

I was wondering about the space_code determination for type1_alternative.
Perhaps the comparison of the if statement itself is not being used, as I think the content is inadequate.

            if words[2].decode() == b" ":
                space_code = i

codecov · 2024-10-05T13:38:15Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.43%. Comparing base (fcb103a) to head (83d4899).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   96.36%   96.43%   +0.07%     
==========================================
  Files          52       52              
  Lines        8739     8724      -15     
  Branches     1727     1721       -6     
==========================================
- Hits         8421     8413       -8     
+ Misses        186      182       -4     
+ Partials      132      129       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ssjkamei · 2024-10-05T15:48:24Z

At the point where space_code is returned, there is logic that is incorrectly returning Width, and when this part of the code is erased, it no longer enters the code below.
I was not sure if I should delete it, but I did because it would cause an error in code coverage.
I don't think I will put back no encoding on the code.

The part that was wrong: return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]

def parse_encoding(
    ft: DictionaryObject, space_code: int
) -> Tuple[Union[str, Dict[int, str]], int]:
    encoding: Union[str, List[str], Dict[int, str]] = []
    if "/Encoding" not in ft:
        try:
            if "/BaseFont" in ft and cast(str, ft["/BaseFont"]) in charset_encoding:
                encoding = dict(
                    zip(range(256), charset_encoding[cast(str, ft["/BaseFont"])])
                )
            else:
                encoding = "charmap"
            return encoding, _default_fonts_space_width[cast(str, ft["/BaseFont"])]
        except Exception:
            if cast(str, ft["/Subtype"]) == "/Type1":
                return "charmap", space_code
            else:
                return "", space_code
.....

Codes that were erased:

    # encoding can be either a string for decode
    # (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
    # if empty string, it means it is than encoding field is not present and
    # we have to select the good encoding from cmap input data
    if encoding == "":
        if -1 not in map_dict or map_dict[-1] == 1:
            # I have not been able to find any rule for no /Encoding nor /ToUnicode
            # One example shows /Symbol,bold I consider 8 bits encoding default
            encoding = "charmap"
        else:
            encoding = "utf-16-be"

ssjkamei · 2024-10-05T15:55:54Z

Please review the code check as it has passed.

stefan6419846 · 2024-10-05T17:21:59Z

pypdf/_cmap.py

+    encoding = _parse_encoding(ft)
+    map_dict, int_entry = _parse_to_unicode(ft)
+
+    # apply rule from PDF ref 1.7 §5.9.1, 1st bullet :


Suggested change

# apply rule from PDF ref 1.7 §5.9.1, 1st bullet :

# Apply rule from PDF ref 1.7 §5.9.1, 1st bullet:

stefan6419846 · 2024-10-05T17:22:18Z

pypdf/_cmap.py

+    # apply rule from PDF ref 1.7 §5.9.1, 1st bullet :
+    #   if cmap not empty encoding should be discarded
+    #   (here transformed into identity for those characters)
+    # if encoding is an str it is expected to be a identity translation


Suggested change

# if encoding is an str it is expected to be a identity translation

# If encoding is a string it is expected to be an identity translation.

ssjkamei added 4 commits October 5, 2024 22:00

MAINT: Generalize the method of obtaining space_code

2869249

Return value correction omission

3ed2358

Style: Correcting code style issues

bc96d93

Style: Correcting code style issues

3273651

ssjkamei added 7 commits October 5, 2024 23:16

fix self-made bugs

e29a0ef

Style: Correcting code style issues

d794b08

fix self-made bugs

ad5a201

fix self-made bugs

92e6412

Style: Correcting code style issues

466b0d9

Deletion of unneeded codes

dd50d06

Deletion of unneeded codes

83d4899

stefan6419846 reviewed Oct 5, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: Generalize the method of obtaining space_code #2891

MAINT: Generalize the method of obtaining space_code #2891

ssjkamei commented Oct 5, 2024

codecov bot commented Oct 5, 2024 •

edited

Loading

ssjkamei commented Oct 5, 2024

ssjkamei commented Oct 5, 2024

stefan6419846 Oct 5, 2024

stefan6419846 Oct 5, 2024

	# apply rule from PDF ref 1.7 §5.9.1, 1st bullet :
	# Apply rule from PDF ref 1.7 §5.9.1, 1st bullet:

	# if encoding is an str it is expected to be a identity translation
	# If encoding is a string it is expected to be an identity translation.

MAINT: Generalize the method of obtaining space_code #2891

Are you sure you want to change the base?

MAINT: Generalize the method of obtaining space_code #2891

Conversation

ssjkamei commented Oct 5, 2024

codecov bot commented Oct 5, 2024 • edited Loading

Codecov Report

ssjkamei commented Oct 5, 2024

ssjkamei commented Oct 5, 2024

stefan6419846 Oct 5, 2024

Choose a reason for hiding this comment

stefan6419846 Oct 5, 2024

Choose a reason for hiding this comment

codecov bot commented Oct 5, 2024 •

edited

Loading