ENH: automatic access to pointed object #2464

Merged
merged 9 commits into from
Feb 25, 2024


pubpub-zz
Collaborator

alternative solution to #2460
fixes #2287

@pubpub-zz pubpub-zz changed the title ENH : automatic access to pointed object ENH: automatic access to pointed object Feb 20, 2024

codecov bot commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.44%. Comparing base (cd705f9) to head (d881bae).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2464   +/-   ##
=======================================
  Coverage   94.43%   94.44%           
=======================================
  Files          49       49           
  Lines        8013     8024   +11     
  Branches     1618     1618           
=======================================
+ Hits         7567     7578   +11     
  Misses        276      276           
  Partials      170      170           


@SamStephens

Am I correct that test_indirect_object_page_dimensions passes for the following reason?

  • The IndirectObject in the page dimensions is passed into the constructor of RectangleObject, which calls _ensure_is_number with it.
  • As the IndirectObject is neither a NumberObject nor a FloatObject, _ensure_is_number constructs a FloatObject from the IndirectObject.
  • The constructor of FloatObject creates a new float using the string representation of the IndirectObject.

If so, this feels wrong. In this scenario, should we not use the NumberObject that the IndirectObject refers to, rather than creating a new FloatObject?
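To make the alternative raised here concrete, the following is a minimal sketch of resolving a reference before the number check, so that the referenced object itself is reused. It does not use pypdf: `NumberLike`, `Ref`, and `ensure_is_number` are hypothetical stand-ins invented for illustration, and the real classes may behave differently.

```python
class NumberLike(int):
    """Hypothetical stand-in for an integer PDF object."""


class Ref:
    """Hypothetical stand-in for a reference to another object."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target


def ensure_is_number(value):
    # Assumption: resolve the reference first, so the pointed-to number
    # is returned as-is instead of being rebuilt from a string.
    if isinstance(value, Ref):
        value = value.get_object()
    # Anything that is still not numeric goes through the string path.
    if not isinstance(value, (int, float)):
        value = float(str(value))
    return value


width = NumberLike(612)
result = ensure_is_number(Ref(width))
```

In this sketch `result` is the very same object as `width`, which is the behaviour the question asks about; the trade-off discussed in the replies is that such a check only helps the one call site that performs it.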

@pubpub-zz
Collaborator Author

> Am I correct that test_indirect_object_page_dimensions passes for the following reason?
>
> * The `IndirectObject` in the page dimensions is passed into the constructor of `RectangleObject`, which calls `_ensure_is_number` with it.
> * As the `IndirectObject` is neither a `NumberObject` nor a `FloatObject`, `_ensure_is_number` constructs a `FloatObject` from the `IndirectObject`.
> * The constructor of `FloatObject` creates a new float using the string representation of the `IndirectObject`.
>
> If so, this feels wrong. In this scenario, should we not use the NumberObject that the IndirectObject refers to, rather than creating a new FloatObject?

The general objective of this PR is to forward function calls to the pointed object. The "fix" for str is generic and will also fix possible issues with conversion to NumberObject or FloatObject.

Your idea of pointing to the existing object is possible, but it would only cover the MediaBox case.
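The forwarding idea described above can be sketched roughly as follows. This is not the code from the PR: `Target` and `Ref` are hypothetical classes made up for illustration, and the sketch only assumes that a reference can look up the object it points to.

```python
class Target:
    """Hypothetical stand-in for a resolved PDF object."""

    def __init__(self, value):
        self.value = value

    def as_numeric(self):
        return float(self.value)

    def __str__(self):
        return str(self.value)


class Ref:
    """Hypothetical reference that forwards calls to the pointed object."""

    def __init__(self, store, key):
        self._store = store
        self._key = key

    def get_object(self):
        return self._store[self._key]

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so the
        # reference's own attributes and methods are not affected.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.get_object(), name)

    def __str__(self):
        # Special methods are looked up on the class, not through
        # __getattr__, so str() has to be forwarded explicitly.
        return str(self.get_object())


store = {7: Target(612)}
ref = Ref(store, 7)
forwarded = ref.as_numeric()
as_text = str(ref)
```

Because `str(ref)` now yields the text of the pointed object, any code that converts through a string benefits without knowing about references, which is the sense in which the change is generic.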

@stefan6419846
Collaborator

> The constructor of FloatObject creates a new float using the string representation of the IndirectObject.

> The general objective of this PR is to forward function calls to the pointed object. The "fix" for str is generic and will also fix possible issues with conversion to NumberObject or FloatObject.

Does this really involve doing some intermediate string step with this PR?

@SamStephens

SamStephens commented Feb 21, 2024

> Does this really involve doing some intermediate string step with this PR?

Unless I'm misreading or misunderstanding, yeah.

The IndirectObject in the page dimensions is passed into the constructor of RectangleObject, which calls _ensure_is_number with it.

def __init__(
    self, arr: Union["RectangleObject", Tuple[float, float, float, float]]
) -> None:
    # must have four points
    assert len(arr) == 4
    # automatically convert arr[x] into NumberObject(arr[x]) if necessary
    ArrayObject.__init__(self, [self._ensure_is_number(x) for x in arr])  # type: ignore

As the IndirectObject is neither a NumberObject or a FloatObject, _ensure_is_number constructs a FloatObject from the IndirectObject.

def _ensure_is_number(self, value: Any) -> Union[FloatObject, NumberObject]:
    if not isinstance(value, (NumberObject, FloatObject)):
        value = FloatObject(value)
    return value

The constructor of FloatObject creates a new float using the string representation of the IndirectObject.

def __new__(
    cls, value: Union[str, Any] = "0.0", context: Optional[Any] = None
) -> "FloatObject":
    try:
        value = float(str_(value))
        return float.__new__(cls, value)
    except Exception as e:
        # If this isn't a valid decimal (happens in malformed PDFs)
        # fallback to 0
        logger_warning(
            f"{e} : FloatObject ({value}) invalid; use 0.0 instead", __name__
        )
        return float.__new__(cls, 0.0)
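The effect of the string step in the three snippets above can be reproduced with a small standalone sketch. It assumes that `str_` behaves like the built-in `str`, and it uses two hypothetical reference classes: one whose string form describes the reference itself, and one whose string form is that of the pointed object. The exact text a real reference produces is an assumption here.

```python
import logging

logger = logging.getLogger("sketch")


def to_float(value):
    """Mimic the quoted fallback: convert via str(), use 0.0 on failure."""
    try:
        return float(str(value))
    except ValueError as exc:
        logger.warning("%s: invalid float (%s); using 0.0 instead", exc, value)
        return 0.0


class UnresolvedRef:
    """Hypothetical reference whose text describes the reference itself."""

    def __str__(self):
        return "IndirectObject(12, 0)"  # assumed text, not a number


class ResolvingRef:
    """Hypothetical reference whose text is that of the pointed object."""

    def __init__(self, target):
        self._target = target

    def __str__(self):
        return str(self._target)


unresolved = to_float(UnresolvedRef())
resolved = to_float(ResolvingRef(595.28))
```

With the first class the conversion fails and falls back to 0.0 with a warning, which matches the symptom in the linked issue title; with the second it yields the pointed value.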

@stefan6419846
Collaborator

I understand that this tends to solve more of the general IndirectObject issues we tend to encounter from time to time. Apparently, creating FloatObjects already involved the string conversion beforehand, thus this should not change much.

Some things which I would like to consider/clarify before merging:

  • Is there a realistic way to solve the possibly infinite recursion (exceeding the maximum recursion depth) when doing nested get_object calls?
  • Are there any (further) side effects which could/would arise from this general change?
  • Does this allow us to further simplify some of the existing code?

@pubpub-zz
Collaborator Author

> Is there a realistic way to solve the possibly infinite recursion (exceeding the maximum recursion depth) when doing nested get_object calls?

The only way I can imagine this happening is an IndirectObject referencing another IndirectObject, which should not exist.
I've added some code just in case.
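The thread does not show what the added guard looks like, so the following is only one possible sketch of protecting against references that point at other references. `Ref`, `resolve`, and the `max_depth` parameter are hypothetical names chosen for illustration, and the loop-based approach is an assumption rather than the PR's technique.

```python
class Ref:
    """Hypothetical reference: a key into a shared object store."""

    def __init__(self, store, key):
        self.store = store
        self.key = key


def resolve(obj, max_depth=10):
    """Follow references iteratively, refusing cycles and long chains."""
    seen = set()
    while isinstance(obj, Ref):
        if obj.key in seen or len(seen) >= max_depth:
            raise RecursionError(f"reference chain too deep or cyclic at {obj.key}")
        seen.add(obj.key)
        obj = obj.store[obj.key]
    return obj


# A chain of two references ending in a plain value.
chain = {}
chain[1] = Ref(chain, 2)
chain[2] = 42
chained_value = resolve(Ref(chain, 1))

# Two references pointing at each other.
cycle = {}
cycle[1] = Ref(cycle, 2)
cycle[2] = Ref(cycle, 1)
try:
    resolve(Ref(cycle, 1))
    cycle_detected = False
except RecursionError:
    cycle_detected = True
```

Using a loop with a visited set keeps the failure explicit and early, rather than relying on the interpreter's maximum recursion depth.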

> Are there any (further) side effects which could/would arise from this general change?

This should also fix some other issues.

> Does this allow us to further simplify some of the existing code?

Yes, it should allow us to remove many .get_object() calls. Some open issues should be solved too.
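As a rough illustration of the simplification mentioned here, the sketch below compares an explicit `.get_object()` call with forwarded subscript access. `Ref` is a hypothetical stand-in, and forwarding `__getitem__` is an assumption about how such access could work, not a statement about which methods the PR actually forwards.

```python
class Ref:
    """Hypothetical reference that forwards subscript access."""

    def __init__(self, target):
        self._target = target

    def get_object(self):
        return self._target

    def __getitem__(self, key):
        # Forward ref[key] to the pointed object.
        return self.get_object()[key]


page = {"/Type": "/Page", "/Rotate": 90}
ref = Ref(page)

explicit = ref.get_object()["/Rotate"]  # caller resolves the reference
implicit = ref["/Rotate"]               # the reference resolves itself
```

Both forms return the same value; the second is what lets call sites drop the explicit resolution step.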

@pubpub-zz
Collaborator Author

There is an issue in the test, but it does not seem to be linked to this PR. Any ideas?

@stefan6419846
Collaborator

The file from #1896 does not exist any more: https://www.selbst.de/paidcontent/dl/64733/72916 Do you see an easy way to replace it? Alternatively we probably have to remove this test for now. In both cases, we most likely should do this in a separate PR to keep things clean.

@stefan6419846
Collaborator

This only fails on Windows as we do not use the cache there. If we want to continue using the same test file, https://github.com/stefan6419846/pypdf/actions/runs/8031448259/artifacts/1272068136 provides the file (zipped) as retrieved from the cache (will expire, thus we would have to upload this separately).

@pubpub-zz
Collaborator Author

OK, I see. Here is the missing file:
selbst.72916.pdf

Collaborator

@stefan6419846 stefan6419846 left a comment


Thanks @pubpub-zz for your work and @SamStephens for helping with the review. As #2460 tends to cover only one case, I am going to merge #2464 for now.

@stefan6419846 stefan6419846 merged commit 03af2c2 into py-pdf:main Feb 25, 2024
15 checks passed
stefan6419846 added a commit that referenced this pull request Mar 3, 2024
## What's new

Generating name objects (`NameObject`) without a leading slash
is considered deprecated now. Previously, just a plain warning
would be logged, leading to possibly invalid PDF files. According
to our deprecation policy, this will log a *DeprecationWarning*
for now.

### New Features (ENH)
- Add get_pages_from_field  (#2494) by @pubpub-zz
- Add reattach_fields function (#2480) by @pubpub-zz
- Automatic access to pointed object for IndirectObject (#2464) by @pubpub-zz

### Bug Fixes (BUG)
- Missing error on name without leading / (#2387) by @Rak424
- encode_pdfdocencoding() always returns bytes (#2440) by @sbourlon
- BI in text content identified as image tag (#2459) by @pubpub-zz

### Robustness (ROB)
- Missing basefont entry in type 3 font (#2469) by @pubpub-zz

### Documentation (DOC)
- Improve lossless compression example (#2488) by @j-t-1
- Amend robustness documentation (#2479) by @j-t-1

### Developer Experience (DEV)
- Fix changelog for UTF-8 characters (#2462) by @stefan6419846

### Maintenance (MAINT)
- Add _get_page_number_from_indirect in writer (#2493) by @pubpub-zz
- Remove user assignment for feature requests (#2483) by @stefan6419846
- Remove reference to old 2.0.0 branch (#2482) by @stefan6419846

### Testing (TST)
- Fix benchmark failures (#2481) by @stefan6419846
- Broken test due to expired test file URL (#2468) by @pubpub-zz
- Resolve file naming conflict in test_iss1767 (#2445) by @sbourlon

[Full Changelog](4.0.2...4.1.0)
Successfully merging this pull request may close these issues.

IndirectObject warnings from PdfReader#pages result in width of 0.0