Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate that description is single line #30

Closed
FFY00 opened this issue Jul 7, 2022 · 33 comments · Fixed by #76
Closed

Validate that description is single line #30

FFY00 opened this issue Jul 7, 2022 · 33 comments · Fixed by #76
Labels
enhancement New feature or request
Milestone

Comments

@FFY00
Copy link
Member

FFY00 commented Jul 7, 2022

No description provided.

@sigma67
Copy link

sigma67 commented Apr 15, 2024

@FFY00 what is the rationale here?

This breaks all our project installs via pdm-backend (which was updated yesterday to 0.8.0rc1).

Since when are multiline descriptions illegal?

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

The spec says they are one line:

https://packaging.python.org/en/latest/specifications/core-metadata/#summary

And putting multiple lines originally broke the format:

mesonbuild/meson-python#67 (comment)

But it looks like we were joining the newlines correctly since 0.6 in c74f4c5.

@frostming, thoughts?

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

description (PEP 621) is the Summary (METADATA) field. readme (PEP 621) is Description (METADATA). See https://packaging.python.org/en/latest/specifications/pyproject-toml/#description and https://packaging.python.org/en/latest/specifications/pyproject-toml/#readme.

A multiline "description" would be readme.text in PEP 621.

@sigma67
Copy link

sigma67 commented Apr 15, 2024

You are correct regarding the single line summary/description field

However due to the previous behavior it is effectively a breaking change

@henryiii
Copy link
Collaborator

@frostming can decide here (also @dnicolodi or @rgommers for meson-python). If it should be warning for a release, then error, or have an opt-in/out by backends, etc, happy to leave it to them. If I were deciding, I'd look at the numbers on PyPI of multiline project.descriptions. (I'll likely do that at some point anyway).

@rgommers
Copy link
Collaborator

Since there is already a report of breaking even with a pre-release, a deprecation rather than a hard error seems to be warranted.

For meson-python: we don't put an upper bound on pyproject-metadata, so this has the potential to break our users too. I've spot checked the most prominent users and those all use single line descriptions, but it's hard to be sure without spending a lot more time on it which packages would be broken.

@rgommers rgommers added this to the 0.8.0 milestone Apr 15, 2024
@rgommers
Copy link
Collaborator

I've just created a milestone for 0.8.0 - it'd be useful to start tagging issues/PRs with a milestone, makes it easier to investigate potential issues later.

@henryiii
Copy link
Collaborator

even with a pre-release

This is vendored into pdm-backend, so it's actually a normal release of pdm-backend that found this, not a pre-release. Scikit-build-core will be vendoring in the next release too: scikit-build/scikit-build-core#703 - partially because we guarantee reproducible SDists, and those can be affected by changes in pyproject-metadata. Also I want to ship METADATA 2.2 soon, and don't want to wait for all distros to update.

@dnicolodi
Copy link
Contributor

I'm not sure I follow. Is this about how the field is serialized in the METADATA file or is it about how it is specified in pyproject.toml? Clearly the former is possible as the format used for the METADATA file allows for splitting value into multiple lines (it just needs to be done properly, and it is actually required for very long field values). For the latter, the only thing that the specification says, is that description needs to be a string, thus, technically, newlines are allowed. However (unfortunately as most of the Python packaging specifications) the specification is very loose and lacking details on how the field is supposed to be used. I think that concatenating multiple lines into a single line is the correct thing to do. I'm not sure whether the multi-line support currently in pyproject-metadata is correct.

@sigma67
Copy link

sigma67 commented Apr 15, 2024

It's also interesting that only the core-metadata summary field is specified to be single-line, not the TOML description field - it is only specified to be a TOML string. TOML strings are generally allowed to be multiline.

Personally I also find this documentation way too obscure for something that is user-facing -- how is the average pyproject.toml user supposed to find a core-metadata specification documentation snippet (which is not even clear to be associated with the pyproject.toml description field to the uninitiated).

@dnicolodi
Copy link
Contributor

I think normalizing the description field in a way similar to

description = '\n'.join(description.splitlines())

is the best way forward. The multi-line value support currently in pyproject-metadata (linked by @henryiii a few comments above) does not seem right according to the metadata specification (summary should be a single line).

Is description the only field affected by this? Can other field contain newline characters?

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

This is validating that a user does not enter a multiline string in project.description. The spec does clearly say that Summary is a singe line. Contrast that with Description, just below, which contains detailed information about how to handle multiple lines when putting it inline (putting it as the body is a METADATA 2.1 feature).

It is not clear, however, that project.description in pyproject.toml must be one line as well even though it maps to Summary. The wrapping as I understand it is correct; https://datatracker.ietf.org/doc/html/rfc822.html#section-3.1.1 states that any indented text (any number of spaces) is equivalent to that character. I don't think this preserves line breaks; that's why the pipe is present in the Description instructions, to convey new lines? But I'm not sure, it's weirdly specific and then links to the official wrapping instructions which only describe whitespace.

So currently

description = """a
b"""

Becomes:

Summary: a
    b

Which is {"Summary": "a b"} when read (AFAIU).

Doing this hides the single-line nature of Summary, though, and can confuse a user into thinking that the new lines are preserved. And encourages much longer Summary's. If it was done earlier, I don't think anyone would complain about project.description matching Summary and requiring one line.

I'm rather curious as to how other backends (hatchling, poetry1, setuptools, for example) handle this.

By the way, I'm not sure email.Message supports this. It's one of the differences I noticed. But it could be easily simulated by replacing new lines with spaces.

Footnotes

  1. Obviously not called project.description, but they should have an equivalent.

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

I think normalizing the description field in a way similar to

I think you meant:

description = ' '.join(description.splitlines())

Technically, new lines are allowed in strings in toml. So you could put them in all the fields. IMO, that should be validated and error out, but it could be done. Many of them (like name) can't have spaces, so that rules out new lines too. license is free-form, so it probably can have newlines.

If I understand the email-format correctly (which I probably don't), new lines are not possible without workarounds like the seven spaces + pipe one, which is only implemented for Description?

@dnicolodi
Copy link
Contributor

@henryiii When reading the core metadata specification, it is very hard for me to distinguish which ones are format requirements and which ones are validation requirements. The specification says that the format is RFC822-like. And RFC822 does not preserve newlines in any field (except the message body). Thus it is very weird to say that Summary is not allowed to contain newlines: the serialization format does not allow it. Therefore, I get that this is not a content validation requirement, but a serialization format requirement (a restriction of what RFC822 would otherwise allow). This is where my conclusion that the current pyproject-metadata behavior is not correct. But obviously there is quite a lot of speculation regarding interpretation of the specification.

@dnicolodi
Copy link
Contributor

I think you meant:

description = ' '.join(description.splitlines())

Ops. Yes, indeed. Too much multitasking...

@sigma67
Copy link

sigma67 commented Apr 15, 2024

I agree that the main issue here is the relatively late introduction of the new validation.

Poetry's [tool.poetry.description] does not complain about multiline descriptions. The serialized metadata summary is single line.

@henryiii
Copy link
Collaborator

IMO, this is a standardization question. The guidelines should either be updated to describe what to do with multiline descriptions or to require a single line. If the latter, validate-pyproject and schemstore could validate that these are a single line, that might be a good first step.

@dnicolodi
Copy link
Contributor

If I understand the email-format correctly (which I probably don't), new lines are not possible without workarounds like the seven spaces + pipe one, which is only implemented for Description?

It's a decade since I've concerned myself with this kind of details, but my understanding is the same. Newlines are not preserved in email messages headers.

@henryiii
Copy link
Collaborator

I think easing back on this is fine, the question really is should it be a warning with a future error, or do we keep the newline behavior (and should we use the string join over trying to preserve them in the metadata's structure like we do now). I think a post on the Python discourse would be in order to ask. @pfmoore might know?

@dnicolodi
Copy link
Contributor

I would not introduce a warning at this point. Normalizing newlines to spaces seems natural enough to make it happen without warning, unless newlines are explicitly forbidden. Either replacing newlines with spaces or preserving newlines in the serialization format as in the current code (knowing that they would be substituted with spaces when the METADATA file is read) are fine for me. The latter may be a violation of the specification, but it seems that nothing breaks on this violation, thus it is inconsequential.

@pfmoore
Copy link
Member

pfmoore commented Apr 15, 2024

In my view, the current specifications are clear. Summary must be a single line, and description must be a string that maps to Summary. There is no mapping rule that says newlines in description get special treatment, so doing so would be a non-portable implementation detail. Changing any of this would require a spec change, and probably a PEP (as the evidence here is that the issue is not uncontroversial…)

As to whether this project should validate the field, or whether it should deprecate multi-line values, with or without a warning, that is a project decision. Projects specifying a multi-line description are technically in violation of the standards, and tools could fail as a result, but in practical terms the violation seems to do little harm. So I’m not sure that I personally see the benefit of being strict here.

I agree that there should be better user facing documentation describing how to write your pyproject.toml. Information is scattered over too many places, and at the same time too many tools don’t document things “because they are covered elsewhere”. Blaming the user for not finding the right detail in the spec isn’t helpful here. But nor is it fair to claim the standard can be ignored because it’s not obvious where to find it.

@dnicolodi
Copy link
Contributor

@pfmoore As indicated in previous comments, the format used for core metadata does not allow to encode a value containing a newline characters. Newlines must be followed by indentation and are treated as a continuation of the previous line. This kind of encoding is used when the value to be recorded becomes too long (I haven't checked where python.email.Message puts the limit when using the compat32 policy).

If the core metadata specification refers to the value before serialization, specifying that it should not contain newlines seems superfluous as newlines cannot be encoded, thus must be rejected a priori. If it refers to the encoded value, should it also be interpreted as an implicit limit on the length of the value?

@pfmoore
Copy link
Member

pfmoore commented Apr 15, 2024

The core metadata spec refers to the (abstract) value of the metadata, independent of serialisation. Specifying that Summary must be a single line is not superfluous, because the spec doesn't refer to any particular serialisation method. Indeed, PEP 566 originally included a specification for a JSON format serialisation of the core metadata. While that format was never standardised, disallowing newlines in that context definitely isn't redundant.

The email serialisation format is regrettably somewhat under-specified, as noted in the spec itself, precisely because the email format was never intended for this use and therefore doesn't cover all the nuances we need. That's a shame, but simply a fact of life that we have to deal with. The formats the standards are based on were defined in simpler times, when nobody felt the need for this level of precision.

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

IMO, if implementations are allowed to reject multiline description, then everyone should reject multiline descriptions, otherwise the point of PEP 621, having a single definition all tools can share, is somewhat invalidated. I'd assume tools that read PEP 621 that are not the backend generally don't strictly validate multiple lines, so in practice it's mostly okay, but it's still not at all ideal.

The problem with allowing multiple line descriptions is that then there is no assistance at all in keeping this short, and you end up with things like this (selecting the three longest project.description fields in all pyprojects.tomls on PyPI):

https://github.com/Abdel-RahmanSaied/mediaSecure/blob/caabcf86622a519a55a99f604d8dad60a27f94bc/pyproject.toml#L11-L47

https://github.com/ultrafunkamsterdam/nodriver/blob/9943eb8b4426e8153936da6346d77f7112b6bb2e/pyproject.toml#L4-L32

https://gitlab.tel.uva.es/juagar/qoptcraft/-/blob/89644a9ed0321bae314fedb59db3c5af22614541/pyproject.toml#L17-24

These projects obviously confused the project.description with the project.readme.text field. If this was enforced as a single line in pyproject.toml, then these projects would have been taught about the difference without ever having to look up a spec. Also, very interestingly, on PyPI these are showing the first line only, not the merging of all lines. Does setuptools just ignore the following lines?

@henryiii
Copy link
Collaborator

On PyPI a couple of weeks ago, here's a count of the new lines:

-1: 97599
0: 39283
1: 39
2: 22
3: 12
4: 6
5: 4
6: 2
7: 1
8: 1
27: 1
35: 1

(-1 means the description is empty)

That means 0.2% of all projects use newlines in the description field.

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

Out of curiosity, here's a randomly selected one new line project (I can list the projects for each of those numbers of newlines):

https://github.com/TrueLearnAI/truelearn/blob/dc159e2418a8beafab8e96b665d96ed14c23fb4f/pyproject.toml#L19-L20

And the PyPI description is cut in half, as expected for setuptools: https://pypi.org/project/truelearn

You can verify that setuptools just deletes everything after a new line:

https://inspector.pypi.io/project/truelearn/1.1.0/packages/97/ed/a383a58588db3e1cb81078dbb20ae9839ea38b4a1e36f737af0ddd3412ac/truelearn-1.1.0-py3-none-any.whl/truelearn-1.1.0.dist-info/METADATA

@pfmoore
Copy link
Member

pfmoore commented Apr 15, 2024

That example suggests that the newline is somewhat incidental, included purely to wrap the text somewhere convenient. So for that case, replacing <spaces>newline<spaces> with a single space would probably be acceptable.

But of course, that may not always be true, and "in the face of ambiguity, refuse the temptation to guess"...

With regard to "if one project rejects newlines, all should otherwise interoperability suffers" - yes, in theory. In practice, though, changing backends is never simply changing the build-system value (otherwise why bother having multiple backends) so fixing the description isn't going to be a major problem. Practicality beats purity and all that - in this case, not breaking existing users of a backend is more important than making the transition marginally easier for people changing backends (IMO).

@henryiii
Copy link
Collaborator

I’m not referring to changing backends, but tools that can read the pyproject.toml as well. A lot of work went into making sure that arbitrary tools can read this, and it has enabled some great things (like front ends allowing any backend, GitHub dependency graphs, Ruff and UV pull information here, validate-pyproject, etc. Tools like that cannot know what a multiline description will do; error, join (current behavior here), or truncate (setuptools). It may not be all that common for them to care about the description, but the fact remains that they can’t tell exactly what it will be.

I think that pyproject-metadata should not be opinionated, unless the specs are opinionated. Personally, I will check this field and throw a clear error inside scikit-build-core if it’s multi line, as that will help users understand this field.

@pfmoore
Copy link
Member

pfmoore commented Apr 15, 2024

OK. I think we're talking past each othere here - my apologies.

As far as standards are concerned, a project with a newline in the description value is invalid. Beyond that, the standards have nothing to say. In particular, the standards don't require tools to raise an error when they encounter invalid projects.

Tools have to make their own decisions as to how they handle invalid projects. Most are likely to take the stance that they will reject them, simply because that's easier. However, some tools might prefer (for backward compatibility reasons, or otherwise) to handle invalid projects. Yes, that means that if your project (or a project you're using) is invalid, some tools will fail. That's what happens once you stray outside of the standards.

As far as pyproject-metadata is concerned, I'm not sufficiently familiar with the project to have an opinion1. It really depends on what tools use the library, and how they want to handle invalid data. Presumably most of those tools are build backends, of one form or another? That may affect what features they want from pyproject-metadata.

Footnotes

  1. Not that that's ever stopped me before! 🙂

@henryiii
Copy link
Collaborator

henryiii commented Apr 15, 2024

Three build backends that I know of use pyproject-metadata: scikit-build-core, meson-python, and pdm-backend. I think that’s our major users. Others might use it directly for building utilities of some sort, but I think it’s mostly used through those build backends.

I’m happy to go with whatever the other backend maintainers like if there’s no standard suggestion: no warning, warning, or error. I’ll make it an error for scikit-build-core regardless.

The current standards don’t say much about this field at all, could that at least be updated to say it’s a single line summary?

@pfmoore
Copy link
Member

pfmoore commented Apr 15, 2024

The current standards don’t say much about this field at all, could that at least be updated to say it’s a single line summary?

They say it's mapped to Summary, which is defined to be a single line. Saying that description should therefore be a single line as well feels like a simple clarification to me.

@dnicolodi
Copy link
Contributor

The core metadata spec refers to the (abstract) value of the metadata, independent of serialisation

the specification document opens with a description of the serialization format, thus I was under the impression that the specification covers both the data specification and the serialization format, thus the data does not exist in isolation from the serialization format.

@sigma67
Copy link

sigma67 commented Apr 18, 2024

Major thanks for the user-friendly fix in 0.8!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants