-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unnecessary downloads by pipenv lock #3827 #3830
Conversation
It's a change of upstream library, but I am OK to this. Just need another look by @techalchemy @jxltom @uranusjr |
Edit: Oh nvm, this is a Pipenv-only patch. |
I think I am OK with the implementation, but not entire sure whether it is a good idea to trust the URL hash. But OTOH, if the URL could be compromised, so can the downloaded package, so I guess in philosophy none is better than the other anyway, you have to trust one of them eventually. |
I guess I have mixed feelings about this. I've been saying for some time now that 'hash checking' is what takes a long time, but in reality it's the fact that we are requiring the download of multiple platforms worth of every artifact in order to lock due to our patches on pip. I'm a firm no on this specific PR for security reasons (we discussed the different approaches early in the project; the Hash Cache is itself patched into piptools and not native) and for security reasons obviously we opted to download the contents and build the hashes that way. From a reproducibilty standpoint however, it gains you nothing to download and hash the artifacts for other platforms which you yourself will not be installing, since you can't use them and aren't building your environment with them. You will never have any idea if they actually work properly. On the other hand, simply relying on a hash from a URL as a manner of trust is highly insecure and doesn't even require an SSL handshake / there is no certificate verification even, and that was the reason we never agreed to do it that way. That reasoning, fundamentally, hasn't changed, so I'm not too sure our approach on that specific point should. I am open to alternative options there though, clearly downloading the entire internet worth of packages is not a great way of handling this... |
Yeah, I think it is the purpose of the original commit cb895aa It only stores cache when the hash part exists in the URL, but never extracts it, which looks like kind of mistake. EDIT I got some insight into the reason for doing this from the comment of @techalchemy . What do you think about taking advantage of some "pypi.org only" methods( |
hmmm using the json endpoint is an interesting idea, it's probably equally secure compared with downloading from there, right? The content is still encrypted / we will still be doing the SSL handshake yeah? /cc @nateprewitt I know you had concerns about this 2 years ago when we implemented it |
Using the JSON endpoint is fundamentally the same as using the URL-provided hash. They are both strings from the index server. I’m firmly -1 on the JSON endpoint. Just use the Simple page; if that is a bad idea, so is JSON. |
On consideration, this whole discussion is a moot point. pipenv must download/build artifacts and resolve dependencies once in order to install them. That means by the time lock is run, it should already have all the information it needs to reproduce the build. pip-tools seems to follow this "fetch once" model. Why doesn't pipenv? |
It’s not entirely moot IMO. Pipenv only needs to download one artifact (ideally) for resolution, the URL-provided hash can still avoid some downloads. |
We obviously would not want to download artifacts more than one time, I've been saying everywhere this issue comes up that if someone can identify whether this is actually occurring and how we can fix it, we will absolutely merge that change. |
The point about hash hints is again not a point about trust specifically, it's a point about reproducibilty. If you are downloading the artifact, we want to ensure that we know exactly which artifacts represent the valid set of options at a given point in time. We can only actually make a guarantee about the one you installed, which is why it might make sense to avoid the downloads entirely in the alternative cases. |
No, I'm concerned about using hash hints at all, which is a fundamental shift in our implementation. The only reason I am considering it is because pipenv currently downloads and hashes artifacts for each platform and python version, no matter whether it is actively performing an installation with those artifacts. However, it also caches the outcome in a local hash cache. If we switched to using the hash hints, would we use them in place of those supplied in the existing pipfile? The difference here is that we are only using a single artifact, so actually performing the hash is only telling us anything useful about that individual artifact (i.e. from a reproducibility standpoint, we can only be sure that the one specific artifact works the same, since we didn't install any of the others). To address a point @uranusjr made the other day:
It's not fundamentally the same since the proposal is to stop downloading all the artifacts in the first place. We currently only have the option to use the url hints because we expose ourselves to each Using the simple index would require us to use the xmlrpc client which is not really a great idea. Overall this is a pretty major change to our hashing logic. I appreciate your desire to move this forward and I do think it's possibly a good idea but I also want to convey that this is a serious modification to both our security model (which I didn't design and probably don't understand well enough to sign off on) and to our core resolution logic. We can't be haphazard about this, so it will take some time to consider how to move forward. |
The current implementation has unreasonable lock times for many users. It needs changing if this is a problem you want to fix.
no, the proposal (which this PR now only partially implements) was to hash and cache the hash every artifact that is downloaded at download-time and use hash hints for what hasn't been installed (or tested) at lock time.
You've convinced me that I won't be able to fix this with the level of effort I'm willing to put in. I've only been using pipenv for a few days and at this point the simplest choice for me is to move back to pip. Thanks. |
We agree, the only thing anyone is uncertain about is the approach
Right, that is restating the same point. We currently do download every artifact at lock time. Anything that modifies this behavior (i.e. using hash hints) is therefore inherently proposing to change that fact. |
With @pilkibun on this. pipenv is dead. its been a half hour trying to install from lock file. Agree that sci packages may slow down installs but if you think of it in terms of time spent. Me spending a couple of hours resolving dependency issues once every 3-6 months or 30+ mins wasted on every ci/test/local build? its a no brainer to move back to pip |
@ekhaydarov consider not locking when you are just installing from lock file? I don't like ci/testing updating the packages on me, especially when those updates aren't going to be saved/tracked. So I find |
Hello, (1) For example, when installing Tested the following (2 cores, 4 GB RAM, high throughput network on AWS EC2 with traffic shaping): I would like to cover some concerns that were voiced in that PR: (2)
According to My interpretation is that this hash is the actual source of truth for your download - you should check your download against that hash to verify that the downloaded file is not corrupted or has been tampered with (if it doesn't match - assume your download has gone bad). (3)
When you download from PyPI, you ultimately trust the host from which you download because of the TLS. The trust doesn't go beyond that as there are no signatures for most (?) packages. Essentially you just trust the CDN and you have no other choice (but rarely signed wheels). My point is that you have equal trust for the hash in the URL and the file contents and because of the previous point we should actually prefer the hash. (4)
At the moment, the only security that pipenv provides is that the wheels did not change after the lock. @frostming @uranusjr @techalchemy Let's agree on # 1 and discuss the rest. edit: this was resolved in #4500, thank you! |
Closes #3827
The issue
pipenv lock
retrieves all possible artifacts (even for other platforms/arch) in order to calculate to their hash, even when Pypi includes the sha256 hash in the url. This means lengthy downloads (In one case 893MB vs. 50MB), which users experience as pipenv hanging.The fix
If the artifact url already contains a hash value, using the same hash function which pipenv would have used, we simply extract and use it in directly without downloading anything. It is also stored in the hash cache.
The checklist
news/
directoryIf this is a patch to the
vendor
directory…This fixes an existing vendor patch merged in cb895aa.