Formula tag should not imply image as content #470

Open
u-fischer opened this issue Sep 14, 2024 · 10 comments

u-fischer commented Sep 14, 2024

Some historical background: Math formulas are viewed as images ...

In the (free) PDF 1.7 reference from November 2006, both the Formula and the Figure tags are classified as Illustration Elements. They are viewed as graphics, and various requirements are based on this assumption (starting from page 911):

The illustration’s content must consist of one or more complete graphics objects. It may not appear between the BT and ET operators delimiting a text object

This structure type [Formula] is useful only for identifying an entire content element as a formula. No standard structure types are defined for identifying individual components within the formula. From a formatting standpoint, the formula is treated similarly to a figure

For purposes of reflow, however, it is moved (and perhaps resized) as a unit, without examining its internal contents. To be useful for reflow, it must have a BBox attribute.

For accessibility [...] an illustration element should always have an Alt entry or an ActualText entry (or both) in its structure element dictionary.

... but MathML should now be preferred

Historically it is quite understandable that math expressions are treated as illustration elements: in HTML, too, they have often been presented as images because of the difficulty of presenting equations and special math symbols.

However, MathML is emerging as the preferred presentation of accessible math on the Web and elsewhere and viewing math as images should be avoided:

Images of math expressions should only be used in exceptional circumstances [...]. The preferred method for displaying math expressions is MathML, which can present mathematical expressions semantically. 1

PDF 2.0 acknowledged this change and added the MathML namespace and Associated Files. But in various places the spec still contains requirements that always treat a math formula as an image and basically "hide" MathML associated files and MathML tags, and so contradict the idea that MathML can and should be the preferred method for math formulas.

Suggested Spec changes

Suggestion 1

14.8.4.8.6 Formula structure type

The standard structure type Formula shall not appear between the BT and ET operators delimiting a text object (see 9.4, "Text objects").

I never understood that sentence, but looking at the first citation above from the historical PDF 1.7 reference, I guess it still assumes that math is some graphical image. This is plainly wrong: most math is nowadays set with fonts and Unicode symbols. IMHO this sentence should be removed completely.

Suggestion 2

A Formula element may have logical substructure, including other Formula elements. For repurposing purposes it may be treated as visually static, without examining its internal contents. It should have a BBox attribute (see 14.8.5, "Standard structure attributes").

Suggested change: A Formula element may have logical substructure, including other Formula elements. It may have a BBox attribute (see 14.8.5, "Standard structure attributes") and can then for repurposing purposes be treated as visually static, without examining its internal contents.
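
As a rough illustration of how a consumer could act on this wording (not part of the proposed spec text), here is a minimal Python sketch. It assumes a structure element is modelled as a plain dict with a hypothetical "attributes" key; the function name, the return labels and the dict layout are assumptions made for this example, and no real PDF library is used.

```python
def reflow_treatment(formula: dict) -> str:
    """Decide how a reflowing consumer might treat a Formula element.

    `formula` is a hypothetical dict model of a structure element,
    not the API of any real PDF library.
    """
    attributes = formula.get("attributes", {})
    if "BBox" in attributes:
        # A BBox is present: the formula may be moved as one static unit.
        return "static-unit"
    # No BBox: handle the content like any other breakable text.
    return "reflow-as-text"


# Hypothetical sample elements: an inline formula without a BBox and a
# display formula that carries one.
inline_formula = {"S": "Formula", "attributes": {}}
display_formula = {"S": "Formula", "attributes": {"BBox": [72, 600, 300, 640]}}

print(reflow_treatment(inline_formula))
print(reflow_treatment(display_formula))
```

The point of the sketch is only that the BBox becomes an opt-in signal for static treatment instead of a default expectation.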

Suggestion 3

For repurposing and accessibility purposes, a Formula element should have either an Alt entry or an ActualText entry in its structure element dictionary (see 14.9.3, "Alternate descriptions" and 14.9.4, "Replacement text").

Suggested change: For repurposing and accessibility purposes, a Formula element that doesn't have a logical substructure describing the semantic meaning should have either an appropriate Associated File, an Alt entry or an ActualText entry in its structure element dictionary (see 14.13.6 Associated files linked to structure elements, 14.9.3, "Alternate descriptions" and 14.9.4, "Replacement text").

Remark: I refrain from adding a processing requirement, but in my opinion a MathML associated file should be preferred over an Alt entry.
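
To make the condition in the suggested change concrete, here is a minimal Python sketch of the check it implies. It assumes a structure element is modelled as a plain dict that uses the PDF key names S, K, AF, Alt and ActualText, with K as a list whose dict entries are child structure elements and whose integer entries are marked-content references. Both function names are hypothetical, and "any child structure element" is a deliberate simplification of "logical substructure describing the semantic meaning".

```python
def has_substructure(elem: dict) -> bool:
    """True if the element has at least one child structure element."""
    return any(isinstance(kid, dict) for kid in elem.get("K", []))


def formula_meets_suggestion_3(elem: dict) -> bool:
    """Check the 'should' of the suggested wording for a Formula element."""
    if has_substructure(elem):
        # Simplification: any substructure (e.g. MathML tags) is taken
        # to describe the meaning of the formula.
        return True
    # Otherwise an Associated File, Alt or ActualText entry is expected.
    return any(key in elem for key in ("AF", "Alt", "ActualText"))


# Hypothetical sample elements.
tagged = {"S": "Formula", "K": [{"S": "math", "K": [0]}]}
with_af = {"S": "Formula", "K": [0], "AF": ["formula.mml"]}
bare = {"S": "Formula", "K": [0]}

for elem in (tagged, with_af, bare):
    print(formula_meets_suggestion_3(elem))
```

Only the last element, which has neither substructure nor any of the three entries, fails the check.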

Suggestion 4

NOTE Alt is a description of the content enclosed by the Formula element, whereas ActualText gives the exact text equivalent of a formula has the appearance of text.

Suggested change: NOTE Alt is a description of the content enclosed by the Formula element, whereas ActualText gives the exact text equivalent of a formula has the appearance of text. An Associated File can associate content in other formats like for example MathML with a Formula.

Suggestion 5

Table 374 — Standard structure type Formula

Change should to may.

Suggestion 6

14.8.5.4.6 Figure, Form and Formula attributes

Remove Formula from the title and the text. Or restrict all the "shalls" to "Formulas with mainly graphical content".

u-fischer added the bug label on Sep 14, 2024
petervwyatt added this to the Tagged PDF related milestone on Sep 15, 2024
petervwyatt added the documentation label on Sep 15, 2024
hpvd commented Sep 17, 2024

Regarding adding a file for MathML, please take "math tagging and security" (latex3/tagging-project#708) as input for discussion.

@DuffJohnson
Member

Thanks, @u-fischer. Some interesting questions. It would be ideal to give each suggestion an identifier to be sure we remain on the same page... I'll take them in order.

Suggestion 1

Point taken; this restriction has the appearance of treating the Formula content as a graphic... but does that matter? I defer to @mrbhardy.

Suggestion 2

I like the re-write, but why isn't "should" reasonable in place of the "may"?

Suggestion 3

Overall this seems reasonable to me... but I'm not a fan of relaxing the "should" for Alt. Any team making software to deal with associated files on structure elements can (and should) be expected to also deal with Alt on the same element.

On your remark: I think this idea fits neatly into the current ambitions of the PDF/UA Processor LWG for a new processor specification.

Suggestion 4

I totally agree with your suggested addition in principle; it just needs a little word-smithing, IMHO. You've also found a typo in the original - I'm pretty sure that the word "that" is missing between "formula" and "has" in the NOTE before your addition. :-)

Suggestion 5

This can be considered following resolution of suggestion 2 (or, resolve this first, and then suggestion 2).

Suggestion 6

My concern here (and I feel as if I must be missing something) is that these restrictions are appropriate for cases of <Formula> that lack substructure (e.g., a <Formula> with an AF)...?

@u-fischer
Author

@DuffJohnson

(I will edit my issue above to add the numbers)

Suggestion 1

Point taken; this restriction has the appearance of treating the Formula content as a graphic... but does that matter?

Well, as it is written now (The standard structure type Formula shall not appear between the BT and ET operators) it makes IMHO no sense at all, as structure types do not appear in the content stream. If the sentence is a rewording of the PDF 1.7 sentence about the illustration types (Figure, Formula) and should actually read as The content of standard structure type Formula shall not appear between the BT and ET operators, then it is clearly wrong, as the content of a Formula is normally text.

Suggestion 2

I like the re-write, but why isn't "should" reasonable in place of the "may"?

Because breakable text should not require a BBox. The purpose of a BBox is (for example) to allow an HTML derivation to take a screenshot and move the picture around, or to stop reflow. But if you have a MathML formula you no longer have a picture. You wouldn't require a BBox for Ruby or Warichu or a Span, would you? So why for an inline MathML formula using a Unicode math font?

Suggestion 3

but I'm not a fan of relaxing the "should" for Alt

Again: you are not requiring an Alt on Ruby or Warichu or a Span, as their meaning is intrinsic, so what is its purpose on a MathML formula? It is not a picture where a blind user would miss the point without an Alt; it is text. We could easily duplicate the content of the MathML AF into the Alt key and so fulfill the requirement, as MathML is clearly an adequate description of a formula, but what would be the point? Alt is for the description of non-text content; it shouldn't repeat text content.

Suggestion 6

... that these restrictions are appropriate for cases of <Formula> that lack substructure (e.g., a <Formula> with an AF)...?

In my view a Formula with a proper MathML AF as described in UA-2 should be equivalent to a Formula with MathML substructure.

@car222222

These remarks are mainly for @u-fischer, but they may be of more general interest.

Some primary comments concerning Suggestions 2, 3 and 6:

Re 3 (and 2):
I am not sure that inline math is always simply “text” in the sense that this term is often currently used in connection with the content of PDF files.

But equally, it is not clear that math needs a compulsory Alt key, since for complex higher-level math there is often no obvious "plain text" equivalent to put in there. It also seems very unlikely that "Alt text" is intended to contain anything that is "essentially code" rather than simple text.

Note also that it is conventional in math typesetting for a single (long) inline math element to be typeset over two or more lines: in that sense inline math can be just like “text in a paragraph”. It is not clear whether the PDF rules and conventions currently allow for this common convention for typesetting inline math. Thus some extensions may be required here.

Re 6:
What do you mean here by “should be equivalent to”? This is rather vague.

Maybe it could (or will at some stage) mean that there is a precisely defined method for transforming each form into the other: if such a possibility is currently a fact, or even an achievable goal, then we need to formally define this inverse pair of transformations.

But this will not make math become “text”, since a file of MathML code is not, in current PDF terminology, a “text file” or even a “representation of some text”.

@u-fischer
Author

@car222222 I used the term "text" as a counterpart to "image", that is, as something that consists of characters, has an intrinsic meaning as a language, and can be translated (and not only described). The whole issue is about not viewing a formula as an image. Neither the TeX meaning of text versus math mode nor text files versus binary files is meant here.

car222222 commented Sep 30, 2024

All true. I was only pointing out that in PDF-speak, "text" has a different meaning that does not include code-like strings of characters, so you need to be careful.

I was definitely not suggesting anything contrary to your aim of clearly distinguishing math from anything related to pictures or images.

@car222222

I may be wrong, but I think that the "PDF meaning of text" does not, in practice, include programs or other code.

@DuffJohnson
Member

Thanks @u-fischer.

On Suggestion 1:

Ok, so I agree that "content" is missing from the sentence; that's an erratum in itself, IMO, assuming that we don't simply remove this provision.

As to that... I agree that the restriction does not appear (to me) to make sense today, even if once it did... but I am innocent as to the technical implications of this change. @mrbhardy?

Assuming your point is valid and has no negative implication, it seems as if Suggestion 1 can be stated as: "Delete the second sentence of 14.8.4.8.5, para 2". Is that accurate?

On Suggestion 2:

My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable ("should")... but a BBox does NOT make sense if the <Formula> is Inline.

If that's your meaning then can you propose specification language to this effect?

On Suggestion 3:

I don't think it's pragmatic to assume that all users encountering a Formula will prefer the MathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".

Why rob an author who elects to include an Alt of the possibility that users who wish to read the Alt will get that option? Why not lean into the notion that software can and should be able to represent all the alternatives that an author might legitimately wish to employ?

On Suggestion 6:

In my view a Formula with a proper mathml AF as described in UA-2 should be equivalent to a Formula with mathml substructure.

Can you help me by spelling out what's at issue here? Why is it bad to require an Inline <Formula> element to have a Width attribute? Or a Block to have a Height?

Or is your problem with "This value shall be the sole source of information about the element’s extent in the block-progression direction." ? If so, perhaps we can rewrite it to indicate other sources of information?

I do agree, though, that 14.8.5.4.6 should be entirely rewritten, as it's basically a hangover from the "Illustration" group of SEs in PDF 1.7. If others agree, this should be a stand-alone erratum.

@u-fischer
Author

Suggestion 1

Assuming your point is valid and has no negative implication, it seems as if Suggestion 1 can be stated as: "Delete the second sentence of 14.8.4.8.5, para 2". Is that accurate?

Formula is 14.8.4.8.6 not 5 (at least in my version), and the sentence is in a paragraph of its own, so delete the second paragraph of 14.8.4.8.6.

Suggestion 2

My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable

Why? You are again viewing the Formula as an image. Would you also claim that a BBox is reasonable on other block elements, e.g. a List or a Blockquote or a Table? I'm not saying that there aren't cases where it could be reasonable to add a BBox, but why should it be needed by default? Reflow can handle a table without a BBox, so why shouldn't it be able to handle a matrix?

Suggestion 3

I don't think it's pragmatic to assume that all users encountering a Formula will prefer the MathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".

I didn't ask to forbid adding an /Alt; I only do not think that it should be obligatory. I mean, your argument is valid for a table too: a table may be encountered casually, or investigated. A user may not be sufficiently literate to understand all the table data but might appreciate, e.g., "This table shows the failure rates of various experiments". But that doesn't mean that a table or a table cell or some other structure should always have an Alt key.

Suggestion 6

Why is it bad to require an Inline <Formula> element to have a Width attribute? Or a Block to have a Height?

Because you are not requiring that for other text like an inline Span or a Blockquote or a Table cell. We really do not want to measure every math formula in simple sentences like "if 𝑥, 𝑦, and 𝑧 are larger than 0 then 𝑓(𝑥,𝑦,𝑧) will have a maximum".

@DuffJohnson
Member

Suggestion 1

Formula is 14.8.4.8.6 not 5 (at least in my version), and the sentence is in a paragraph of its own, so delete the second paragraph of 14.8.4.8.6.

Agreed! My mistake.

Suggestion 2

My sense of your remark is that the value of the BBox is conditional on whether the Formula element is Block or Inline. If it's Block then BBox is reasonable

Why? You are again viewing the Formula as an image. Would you also claim that a BBox is reasonable on other block elements, e.g. a List or a Blockquote or a Table? I'm not saying that there aren't cases where it could be reasonable to add a BBox, but why should it be needed by default? Reflow can handle a table without a BBox, so why shouldn't it be able to handle a matrix?

WTPDF uses "BBox on a table" as one example of the "semantic" use of BBox on that SE (see Table B.2).

The "semantically significant" use of BBox for Formula is given as: "A formula that lies on a single page and occupies a single rectangle." Is this unreasonable? These things don't reflow in the same ways as e.g. lists (which have their own reflow problems!). If it's not unreasonable, and if this is a major use case (which is fair)... then maybe we simply need to be clearer about when the "should" is applicable, and when it should be "may" instead....?

Suggestion 3

I don't think it's pragmatic to assume that all users encountering a Formula will prefer the MathML (or LaTeX, for that matter) to an Alt. A formula may be encountered casually, or investigated. A user may not be sufficiently literate to understand the formula, but might appreciate, e.g., "An example of the factored form of a quadratic equation".

I didn't ask to forbid adding an /Alt; I only do not think that it should be obligatory. I mean, your argument is valid for a table too: a table may be encountered casually, or investigated.

"Should" is not "obligatory".

If we relax "should" we will encourage a situation in which a predictably substantial number of users will get a worthless (to them) result when they encounter a <Formula> instead of an Alt that an author may have otherwise provided.

A table provides many more clues as to its content, not least from its structure, which doesn't require SMEs to understand. Tables are frequently captioned to accommodate the use case you mention.

Suggestion 6:

As stated, I think we should totally rewrite this clause, and I would fully agree to pulling these SEs apart when we do so. I'm ok (I think) with removing those requirements from <Formula> elements whose content is not an image.

Care to take a shot at rewriting that clause as a new Erratum?? :-D
