spaCy-models: Please Consider Distributing via PyPi #5967

Closed
jamescurtin opened this issue Aug 25, 2020 · 7 comments
Labels
install (Installation issues), models (Issues related to the statistical models), resolved (The issue was addressed / answered)


@jamescurtin

Feature Summary

Release spaCy models via PyPi

Feature Description

We use spaCy in an enterprise setting. For security, the hosts that build production docker images cannot connect to the external internet. This introduces complexity when trying to install packages like spacy-models, where the recommended installation method is to either install from a Github release (requiring a connection to github.com) or to vendor the package (avoids networking issues, but bloats individual repos).

Publishing the models through PyPi would be beneficial in that spacy-models would no longer be installed differently than other packages & would also allow us to benefit from the security that PyPi provides (e.g. ability to mirror the package index on our internal network, assurance that package versions are immutable, etc.).

Perhaps you could start with adding the small models to PyPi, as they would not run into default package size restrictions. PyPi allows package authors to file a request increasing the maximum allowable size of the package: the increased limits would easily support the medium models. There is also precedent for setting size limits that would allow for distributing the large models as well.

@adrianeboyd added the install and models labels on Aug 25, 2020
@honnibal
Member

Unfortunately we've discussed this at length and there's just no way to make it happen.

There are some packages that are quite large, up to 1gb of data. These packages cannot be served over PyPi. We don't want to have two mechanisms for distributing the models. We thought about making our own PyPi server, but this didn't work either, because it introduces security problems if someone nips in and registers the package name on the main PyPi index. If we instead preregistered those packages, users would get confusing errors if they forget our index server.

So we think the current solution is the best we can do. The models actually are pip packages --- they're just served from Github release assets. So you can totally add them to your internal PyPi server and use pip from there.
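A minimal sketch of how a host with internet access could fetch a model archive for an internal index. It assumes the release-asset URL pattern used by the spacy-models repository (`<name>-<version>` as both the tag and the file stem); the function names and the example version are illustrative, not part of spaCy.

```python
import sys

# Base URL of the spacy-models GitHub releases (assumed pattern; verify against
# the actual release page before relying on it).
RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def model_sdist_url(name: str, version: str) -> str:
    """Return the URL of the sdist attached to the `<name>-<version>` release tag."""
    tag = f"{name}-{version}"
    return f"{RELEASES_URL}/{tag}/{tag}.tar.gz"


def pip_download_command(name: str, version: str, dest: str) -> list:
    """Build a `pip download` command that fetches only the model archive into `dest`."""
    return [
        sys.executable, "-m", "pip", "download",
        "--no-deps",      # skip spaCy and its dependencies; only the model is wanted
        "--dest", dest,
        model_sdist_url(name, version),
    ]


if __name__ == "__main__":
    # Hypothetical usage: print the command rather than running it here.
    print(" ".join(pip_download_command("en_core_web_sm", "2.3.0", "./wheelhouse")))
```

The downloaded `.tar.gz` could then be uploaded to the internal index with whatever tool that index supports (for example `twine upload` with a `--repository-url` pointing at the internal server), after which build hosts install it with plain `pip install en_core_web_sm`.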

@buriy

buriy commented Aug 25, 2020

Hi @honnibal, what if you published only wrapper packages on PyPI that download the models from GitHub during setup?
They could run the "spacy download" script to do the actual downloading.
Actually, I don't like that the models are now pip packages, because they are misbehaving pip packages... They have several different names (en and en-web-sm-2.3.0), and they do not need to be imported with "import en".

@adrianeboyd
Contributor

Having the packages download the models from github wouldn't help with the security restrictions mentioned above.

The model packages are standard pip packages with longer names like en_core_web_sm. If you install the package from a downloaded .tar.gz from spacy-models or with spacy download en_core_web_sm you'll just have en_core_web_sm and no en shortcut.

In contrast, spacy download en does several things: 1) map the shortcut name en to the package en_core_web_sm, 2) download and install the en_core_web_sm package with pip, 3) add a symlink from en to en_core_web_sm. The symlink is a separate step that doesn't involve pip or how the model package is installed.

We've realized that the symlinks cause a number of headaches, so we don't recommend them anymore and are planning to remove them in spacy v3. Then you will only be able to use the full package names like en_core_web_sm with spacy.load().
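As a rough illustration of using the full package name only, the sketch below checks that the model package is importable (it is an ordinary installed package) and then passes the same name to `spacy.load()`. The helper names `is_model_installed` and `load_model` are hypothetical and not part of spaCy.

```python
import importlib.util


def is_model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def load_model(package_name: str = "en_core_web_sm"):
    """Load a model by its full package name; no `en` shortcut or symlink is used."""
    if not is_model_installed(package_name):
        raise RuntimeError(f"Model package {package_name!r} is not installed")
    import spacy  # imported lazily so the check above also works without spaCy
    return spacy.load(package_name)
```

With the model installed via pip, `nlp = load_model("en_core_web_sm")` behaves the same as calling `spacy.load("en_core_web_sm")` directly; the extra check only makes a missing package fail with a clearer message.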

@jamescurtin
Author

> There are some packages that are quite large, up to 1gb of data. These packages cannot be served over PyPi. We don't want to have two mechanisms for distributing the models.

Understood--thanks for the explanation of some of the alternatives considered.

> So you can totally add them to your internal PyPi server and use pip from there.

This is likely what we'll investigate, given that PyPi isn't an option. I may have missed this in the docs, but is there a way to programmatically generate the sdists bundled in the Github releases, such that we could dynamically build and upload the package to our internal PyPi server when a new tag is published? Or would we need to download the release & upload the artifact directly?

@honnibal
Member

@jamescurtin I'm not sure I understand the question well. I mean, you'll always be able to automate these things, regardless of which convenience scripts spaCy provides? And I'm sure you know that, so I must be missing what you're actually asking.

Maybe the only thing you might not be aware of is the spacy package command, which sets up a model directory so that it's easy to package using Python's setuptools.
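A hedged sketch of how that step might be scripted. It assumes the `python -m spacy package <input_dir> <output_dir>` form of the command and that the generated package directory contains a `setup.py`; the directory names are placeholders and the helper functions are hypothetical.

```python
import sys


def spacy_package_command(model_dir: str, output_dir: str) -> list:
    """Build the command that turns a saved model directory into a package layout."""
    return [sys.executable, "-m", "spacy", "package", model_dir, output_dir]


def sdist_command() -> list:
    """Build the setuptools command that creates the .tar.gz archive.

    Assumption: this is run from inside the package directory that
    `spacy package` generated under `output_dir`.
    """
    return [sys.executable, "setup.py", "sdist"]


if __name__ == "__main__":
    # Placeholder paths; print the commands instead of executing them.
    print(" ".join(spacy_package_command("./my_model", "./packages")))
    print(" ".join(sdist_command()))
```

In a CI job, each command list could be passed to `subprocess.run(..., check=True)`, with the resulting archive in `dist/` uploaded to the internal index.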

@honnibal added the resolved label on Sep 24, 2020
@github-actions
Contributor

github-actions bot commented Oct 2, 2020

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@github-actions
Contributor

github-actions bot commented Nov 1, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked the thread as resolved and limited conversation to collaborators on Nov 1, 2021