Evaluate if the Quota System is providing meaningful value #11902

Open
dstufft opened this issue Jul 21, 2022 · 3 comments
Labels
needs discussion a product management/policy issue maintainers and users should discuss

Comments

@dstufft
Member

dstufft commented Jul 21, 2022

Something that was surfaced in the discussion around deletions was a concern that the quota system on PyPI, as it is currently implemented, is causing a less than ideal experience for both authors and users of PyPI. I've also gone back and read previous discussions or posts like What to do about GPUs? (and the built distributions that support them).

The problems from the maintainer side, that I have seen surfaced:

  • Projects are being forced to delete older releases in order to make room for newer releases, though thankfully it is largely pre-releases currently [1].
  • Projects are avoiding uploading wheels for a variety of platforms, because they're worried about hitting the quota limits and having to ask for a quota increase and not knowing whether they'd actually get the increase [2].
  • When they do ask for a quota increase, it can take weeks or even months for the maintainer to get a reply, blocking their ability to do releases [3].

Just to make sure that everyone is on the same page, the background of how file hosting/quotas has evolved on PyPI is roughly:

  • Originally PyPI did not support file uploads at all, nor was it intended to be used as a software repository for tools to consume.
  • At some point setuptools was written that started finding files to fetch from PyPI through a variety of mechanisms.
  • At some other point (not sure if before or after the last one), PyPI added the ability to host files on PyPI, and as a basic sanity check the... Apache I think it was at the time, host had a default limit on the total request body size (as most servers do), and over time this eventually got increased to 60M, effectively limiting files on PyPI to no more than 60M in size.
  • At some point PEP 470 removed external file hosting from PyPI, which meant that in order to have a good experience with pip install ... by default, projects were required to upload to PyPI unless they wanted to require their users to configure an additional repository.
  • As part of the migration to Warehouse, we switched from having a web server fronting Warehouse that buffered the entire request body to one that let Warehouse itself handle pulling those bytes off the wire. With no more buffering, Warehouse itself became responsible for setting limits, and it originally just hard coded the same 60M limit that PyPI originally had.
  • In Ability to increase file size limit on per-project basis? #346, Richard noted that we were starting to get requests for larger file sizes for some projects, which was implemented in Allow a per project file upload limit #655 to allow that 60M limit to be changed on a per project basis.
  • In Provide metrics for top N package storage "hogs" #4288 it was surfaced that PyPI's on disk size was currently larger than 2TB but we didn't have a great mechanism to show which projects were involved in that, which was implemented in Add a /stats endpoint #4469 [4] to add a /stats/ route that showed the top N packages and how much storage they consume.
  • In Implement limit on Project.total_size #7446 the idea of limiting the total size of a project was proposed and implemented in Ability to add total project size limit(internal changes) #8128 and Enforcing total project size limit(user-facing changes) #8129.

That brings us to where we are today.

I don't have really good information for how large PyPI has grown over time, other than that we're currently at 12TB and that in 2018 we were at "> 2TB", but the per project quotas were implemented in 2020. It was mentioned in a comment on Nov 13, 2019 that PyPI was at 6.5TB at that time.
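As a rough check on the growth figures above, the implied yearly growth factor between the two dated data points can be computed directly. This is only a sketch: it assumes the 12TB figure corresponds to the date of this issue (Jul 21, 2022) and that growth was smoothly exponential, neither of which is stated in the thread.

```python
from datetime import date

# Data points quoted above; the date for the 12TB figure is an assumption
# (the day this issue was opened).
size_2019_tb, when_2019 = 6.5, date(2019, 11, 13)
size_2022_tb, when_2022 = 12.0, date(2022, 7, 21)

years = (when_2022 - when_2019).days / 365.25

# Constant yearly factor g such that size_2019 * g**years == size_2022.
annual_factor = (size_2022_tb / size_2019_tb) ** (1 / years)

print(f"{years:.2f} years elapsed, roughly x{annual_factor:.2f} per year")
```

Under those assumptions the yearly factor comes out well below 2, which is worth keeping in mind when reading the "doubling every year" estimate in the comment quoted next.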

Picking 10GB as our default project quota in PyPI was done with this comment:

> I grandfathered in all existing projects with Project.total_size >= 10GB. I set their limits to roughly twice their current size, minus ~20%, rounded to the nearest 10GB. My thought is that PyPI's total size is roughly doubling every year, and that the rate of growth of these should probably fall under that curve.
>
> I wouldn't expect any of the projects on https://pypi.org/stats to request total size increases in the next ~1 year. I think we can give them file size increases liberally though.
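The grandfathering rule in that comment can be read in more than one way. The sketch below takes one plausible reading (double the current size, take 20% off the result, round to the nearest 10GB). The function name, the binary definition of a gigabyte, and the rounding direction for exact ties are all assumptions for illustration, not what was actually run against PyPI.

```python
import math

GB = 1024 ** 3  # assumption: binary gigabytes


def grandfathered_limit(current_size_bytes: int) -> int:
    """Hypothetical reading of 'twice the size, minus ~20%, nearest 10GB'."""
    target = 2 * current_size_bytes * 0.8  # i.e. 1.6x the current size
    step = 10 * GB
    # Round half up to the nearest multiple of 10GB.
    return math.floor(target / step + 0.5) * step


for size_gb in (12, 25, 60):
    limit_gb = grandfathered_limit(size_gb * GB) // GB
    print(f"{size_gb}GB project -> {limit_gb}GB limit")
```

An alternative reading ("twice the size, minus 20% of the original size", i.e. 1.8x) would give somewhat larger limits; the comment does not say which was meant.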

At the time, there were 73 grandfathered-in projects at >= 10GB.

Currently our process for people to ask for increased limits is to have them post a ticket on https://github.com/pypa/pypi-support, and one of the PyPI team will come around and look into it.

I went ahead and did some looking at those requests, and what I found was:

  • The oldest request in that repo goes back to Nov 13, 2019, asking for an increased file size limit [5].
  • There are a total of 383 requests in that time period, averaging out to a limit request roughly every 2.5 days since the first request.
  • Limit requests are split pretty evenly between requests for an increased file size limit and an increased project size limit, but there is a 10% tilt towards project size.
  • Out of 383 requests, 369 have been closed. Of those 369, 274 (about 74%) were accepted, 11 (about 3%) were denied, and in 19 (about 5%) the user was guided towards alternative strategies to reduce their file size. The remaining ones were generally closed due to no response to follow-up questions [6].
  • Of the 11 that were denied, most were denied because the user was hosting a large data file (including java jars, etc) in the project.
  • Of the 19 that were guided towards alternative strategies, it was largely split between:
    • Breaking the project up into sub projects, each getting its own limit [7].
    • Side loading large data files through some other fashion (e.g. a download() method).
    • Removing files from the wheels (tests, docs, etc), getting the user to try different compilation strategies, or even just paying attention to their file size, causing them to notice something they could adjust to reduce it.
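The arithmetic behind those figures can be reproduced with a few lines. This is a minimal sketch using only the counts listed above; the end of the counting window is assumed to be the date of this issue (Jul 21, 2022), and the "other" bucket is simply whatever is left of the closed requests.

```python
from datetime import date

# Counts taken from the tally above.
total_requests, closed = 383, 369
outcomes = {"accepted": 274, "denied": 11, "redirected": 19}
# Closed requests not covered by the three named outcomes.
outcomes["other"] = closed - sum(outcomes.values())

# Assumption: the counting window ends on the date of this issue.
window_days = (date(2022, 7, 21) - date(2019, 11, 13)).days
days_per_request = window_days / total_requests

percent_of_closed = {name: 100 * n / closed for name, n in outcomes.items()}

print(f"one request every {days_per_request:.1f} days")
for name, pct in percent_of_closed.items():
    print(f"{name}: {pct:.0f}% of closed requests")
```

Since the categorization was done by hand, the percentages should be read as approximate.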

That's a lot of information there, but ultimately the questions for this issue are:

  • Is the quota system providing value?
  • Is our process for requesting an increase providing value?
  • Is there anything that we can change to reduce the friction?

Footnotes

  1. This kind of flies in the face of how we typically expect PyPI to be used, as a stable archive of artifacts with deletions being rare.

  2. This directly hurts the consumers of Python packages, as they lose out on the ability to install from wheels on those platforms.

  3. Obviously this is due to the fact PyPI has no staff available to process these requests, relying on when volunteers are able/willing to do pretty tedious work going through issues.

  4. This was ultimately reverted, then reworked, then had more changes to it over the years, but this was the initial PR to add it.

  5. Since per project limits weren't added until 2020, that should mean that all of our project quota requests ended up here.

  6. Categorizing this was kind of lossy, I had to go through all of those issues manually and skim through them, so there very well might have been some miscategorizations in my tally.

  7. This feels kind of like approving the limit in spirit? If a project wants a single 20GB limit, that doesn't feel materially different to me than splitting the project into two, with two 10GB limits.

@dstufft dstufft added the needs discussion a product management/policy issue maintainers and users should discuss label Jul 21, 2022
@dstufft
Member Author

dstufft commented Jul 21, 2022

I tried to summarize the state of the world fairly objectively above, so here are some of my personal thoughts.

I haven't been involved in the quota requests, so I may lack context on some of this! However, I tried to find as much information as possible to fill in the missing information.

What I do know is that several people have made reasonable points about how the current system isn't serving them well as users, and then in my digging it appeared that it wasn't serving us well either.


I think that there is some kind of value in requiring maintainers of large projects to explicitly request larger quotas. One of the things that struck me is that a non-zero number of people in the issues I reviewed were able to reduce their file size simply by being asked to, or by being presented with additional options that they hadn't thought about.

However, I worry that our current implementation of both the mechanisms and process for handling these is creating friction with users and forcing PyPI's volunteers to spend extra effort on tedium, when it feels like that time could likely be better spent elsewhere.

In Stop Allowing Deleting Things from PyPI? a very reasonable discussion was had about the role that PyPI plays in the ecosystem. In that thread, it seemed to me like most people were in favor of PyPI offering some kind of restriction on deletions, though the devil is in the details [1], and I believe in general (though I could be wrong) that all of us who work on PyPI generally believe that the ideal case is that files are left on PyPI indefinitely to act as some kind of archive.

However, one of the common suggestions offered to (and a common strategy employed by) projects consuming a large amount of space is to go back and delete previous releases to free up extra space inside of their quota, something which flies in the face of both what we (I believe) think is the ideal, and what the general tone of that deletion thread is. I do think that there is a balancing act with that: deleting pre-releases feels relatively fine to me, but deleting actual releases feels kind of bad to me.

It also feels like a solution that doesn't scale. If we want PyPI to generally act in part as an archive, then by its nature it is going to continue to grow with each new release of any project hosted on it, even if the community itself doesn't grow [2].

One of the major improvements to Python packaging over the years has been the addition of binary wheels and what that has been able to mean for end users trying to install things without having to spend long periods of time setting up compiler tool chains and compiling software. Unfortunately, binaries are often larger than source releases, and the more platforms that a project tries to provide wheels for, the quicker they'll use their quota up. We know that this is already causing some projects to opt not to ship wheels for some platforms, to avoid that problem.

My memory could be wrong, but as I recall one of the major drivers (if not the major driver) for putting quotas in place was what it means for mirrors like bandersnatch that want to mirror a complete copy of PyPI. Unfortunately, even with the quota system, the total size of PyPI has grown to 12TB, pushing a full mirror into consuming most of even the largest drives available to consumers. With the growth of the community and the way the ecosystem is shifting, it doesn't feel sustainable to me to treat a full mirror on a single, reasonably sized drive as a reasonable goal anymore. Maybe the right solution is less about trying to constrain PyPI's storage and instead, at least as far as bandersnatch is concerned, it should be focusing on providing defaults that will limit the mirror to the set of packages that people actually use [3].

I do think that it would be worthwhile to figure out changes we can make to Warehouse and/or the process to try and reduce this friction and reduce the amount of time spent by our volunteers dealing with these issues.

A half-baked idea of what that could look like:

  • Continue having a per project quota with sane defaults that cover the majority of cases [4].
  • Add a UI inside of Warehouse for maintainers to request a quota or file size increase for their project, then:
    1. It checks to see if the limit increase is too big of a jump from their current quota and current file sizes. If so, it will immediately reject the limit increase with a message that tells them why, and directions to open an issue if they think they still need it.
    2. It checks how much space pre-release files take for that project, and if it's a large portion of their quota, rejects the request with the same messaging.
    3. It checks if the limits cross over some additional internal limit for "too big for automatic requests", and if so, rejects with the message.
    4. We can't really automatically look for things like data models and such, but it can ask the user to affirm that there either aren't any large data models, jars, etc or that they've already attempted our suggested strategies and they're not enough.
    5. If all the above passes, automatically grant the request [5].
  • Continuously check projects for large amounts of pre-releases taking up too much space, and proactively alert the maintainers asking them to remove unneeded ones and/or the PyPI team to have us reach out to them [6].
  • Put in a system that will warn maintainers when one of their file sizes is approaching their file size limit and/or their project is approaching their quota, prompting them to proactively attempt to reduce their file sizes or remove old pre-releases.
    • We could also just warn in general if one of their file sizes is some percent larger than before, since it could indicate a mistake in their packaging that they didn't notice.
  • In general, we come up with specific guidelines of what an acceptable package contains (not its file size), and if a project's package conforms to that, then I think that we should be allowing that usage [7].
  • Explore a Wheel 2.0 (and maybe a sdist 2.0? 3.0?) that uses better compression techniques.
  • Perhaps provide a way for the PyPI team to provide an unlimited quota to specific projects that we know are well established projects that are going to do what they can to keep their storage size down.
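To make the numbered checks in the request flow above a bit more concrete, here is a minimal sketch of how they could be chained. Everything in it is hypothetical: the class, the function, the field names, and especially the threshold values are made up for illustration and do not correspond to anything in Warehouse.

```python
from dataclasses import dataclass
from typing import Tuple

GB = 1024 ** 3

# Hypothetical thresholds; none of these values come from Warehouse.
MAX_GROWTH_FACTOR = 2.0          # step 1: largest allowed jump over current quota
MAX_PRERELEASE_SHARE = 0.5       # step 2: share of quota used by pre-releases
AUTO_APPROVE_CEILING = 50 * GB   # step 3: "too big for automatic requests"


@dataclass
class QuotaRequest:
    current_quota: int           # bytes
    requested_quota: int         # bytes
    prerelease_bytes: int        # bytes used by pre-release files
    affirmed_no_large_data: bool # step 4: maintainer's affirmation


def evaluate(req: QuotaRequest) -> Tuple[bool, str]:
    """Run the checks in order and return (granted, message)."""
    if req.requested_quota > req.current_quota * MAX_GROWTH_FACTOR:
        return False, "increase is too large a jump; open an issue if still needed"
    if req.prerelease_bytes > req.current_quota * MAX_PRERELEASE_SHARE:
        return False, "pre-releases use a large share of the quota; remove old ones first"
    if req.requested_quota > AUTO_APPROVE_CEILING:
        return False, "too big for automatic approval; open an issue"
    if not req.affirmed_no_large_data:
        return False, "please confirm the suggested size-reduction strategies were tried"
    return True, "granted"


granted, message = evaluate(QuotaRequest(10 * GB, 15 * GB, 1 * GB, True))
print(granted, message)
```

Returning a message alongside the decision mirrors the idea that every rejection should tell the maintainer why and what to do next; recording each result somewhere for after-the-fact review would be a natural addition.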

Ultimately, I think that the current system appears to be wasting a non-trivial amount of everyone's time with how manual the process is, and how many of the limit requests appear to be relatively simple rubber stamps, and it's only in the egregious edge cases that we really need to get involved.

I think that the largest benefits of the current system come from the fact that projects can't really just accidentally take up tons of room, that whenever they start to take up a lot, there is a built-in amount of push back to guide them towards reducing their file size, when without it they might not even think about it or notice at all.

I also think that while it's good for us to push people to remove old pre-releases to take up less space, that if people feel the need to delete old actual releases, then that's a sign that our policies and tooling are possibly not calibrated correctly for the modern landscape, and we should figure out how to adapt to the way the ecosystem has changed.

I think the half-baked proposal above seems pretty reasonable to me, it:

  • Still requires people to ask, and provides some basic sanity checks to make sure their ask seems reasonable plus guides them towards not needing it.
  • Provides better tooling to let big projects reduce their file sizes.
  • Reduces the amount of effort the PyPI team has to spend on quota issues.
  • Provides a better experience for everyone?

It would be interesting to hear what people think!

Footnotes

  1. This isn't the thread to litigate whether deletions should be a thing on PyPI or not, the linked discuss thread is a better place for that. I'm just mentioning it because it's related to one of the strategies that people use to cope with our quotas.

  2. And of course, the community is growing, causing more people to release things.

  3. You could imagine, for instance, something that just mirrored the top N files by download count and left the remaining files hosted on PyPI, or omitted them from the mirror completely.

  4. It would probably be worthwhile to re-run the numbers to see if the defaults still fit, and, of the limits that we have given people, how many of them are just a little larger than the current limit.

  5. It would probably be useful to capture the results of all of these checks, and maybe send an email to the PyPI team or open a GitHub issue or something to review after the fact, or maybe only in edge cases where it starts getting close to hitting hard limits.

  6. We might even be able to add a feature to allow projects to opt into automatic reaping of old pre-releases? Not sure!

  7. There's a good argument to be made here that if a corporate owned project is using a lot of resources that we should be badgering them for resources to support that. I think I support that, but I think it might be better to handle that by surfacing metrics and alerts rather than requiring a human-in-the-loop intervention every 2.5 days on average.

@rgommers

Thanks for the detailed write-up @dstufft!

> I also think that while it's good for us to push people to remove old pre-releases to take up less space

It's worth expanding a bit on what a "pre-release" means. Alpha/beta/RC releases are one thing - they belong on PyPI for wider user testing, and can be cleaned up. However, nightlies are a completely different thing to me. Given that we (scientific Python projects) are aware of the PyPI space constraints and want to be sensitive to those and not take up a ton of space, we are putting nightlies in a separate wheelhouse that we maintain ourselves, and clean up regularly: https://anaconda.org/scipy-wheels-nightly/

On the other hand, the top four space consumers on https://pypi.org/stats/ are all packages that push nightlies to PyPI. That has always seemed to me to be a bit of an abuse of the service PyPI offers (a limit-request review seems to confirm that: pypi/support#1129 (comment)). Some clearer guidance to PyPI users on what's desired/allowed/expected would be useful.

And in case nightlies are okay to put on PyPI, I suspect a better interface for cleaning them up would be quite useful. Manually clicking a button and typing a name to verify you're deleting the right thing is not quite encouraging one to clean up.

> There's a good argument to be made here that if a corporate owned project is using a lot of resources that we should be badgering them for resources to support that.

Agreed in principle. Please do keep in mind that there's no strict corporate vs. community project division. As an example: CuPy is one of the larger consumers of space; it's a project driven by Preferred Networks (a small company), but mostly as a service to the community (it's "NumPy on a GPU") and they're very community-oriented (and may transfer the project to NumFOCUS at some point in the near future).

> I think that the largest benefits of the current system come from the fact that projects can't really just accidentally take up tons of room, that whenever they start to take up a lot, there is a built-in amount of push back to guide them towards reducing their file size, when without it they might not even think about it or notice at all.

I do have to say that this has been useful to me at least once; a co-maintainer asked for a size increase for SciPy, and that alerted me to the fact that we were bundling unwanted content into wheels for a particular release.

> I think the half-baked proposal above seems pretty reasonable to me, it:

That all does sound very reasonable and useful to me.

@njsmith
Contributor

njsmith commented Jul 22, 2022

Looking at the /stats page, it seems like there's a lot of room for relaxing these constraints. Like, idk if this is the ideal solution, but it would probably work?

  • Talk to the tensorflow folks to make sure everyone's on the same page about space usage (already happening)
  • Give everyone else an unlimited quota
  • Have an alert or two for projects using massive amounts of space, and when they trip go talk to them about it


3 participants