Should malware/spam/etc which gets removed from pypi generate advisories here? #45

Open
westonsteimel opened this issue Dec 4, 2021 · 16 comments

Comments

@westonsteimel
Collaborator

Following from some discussion in pypi/warehouse#4703, do we think that packages removed from PyPI for being classified as malware, etc. should cause advisories to be generated here?

@westonsteimel
Collaborator Author

Perhaps a package should have to pass some sort of threshold, such as a minimum number of downloads, to make an advisory worthwhile?

@di
Member

di commented Dec 4, 2021

@oliverchang I'm curious if there is precedent for this in other advisory dbs.

@westonsteimel
Collaborator Author

westonsteimel commented Dec 4, 2021

I do think npm at least publishes advisories for malicious packages (I think theirs is entirely via GitHub's advisories now if I remember correctly?) https://github.com/advisories?page=5&query=malicious+ecosystem%3Anpm

@oliverchang
Contributor

I think it's fair game to include these, and the reporting can re-use the existing infrastructure / tooling (i.e. pip-audit).

As @westonsteimel mentioned, other vuln DBs like GHSA also track these.

@westonsteimel
Collaborator Author

westonsteimel commented Dec 6, 2021

@oliverchang, I did try the existing analysis on one of these (I think one of the ones from this JFrog article). It fails because these packages have been removed from PyPI, so the attempt to extract versions from the PyPI project JSON fails. I did notice, though, that we appear to have the version info in the input pypi_versions.json from the BigQuery query.
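A minimal sketch of that fallback, in case it helps the discussion. Everything here is an assumption rather than the actual tooling: `known_versions` and `fetch_live` are hypothetical names, the live lookup is assumed to raise `LookupError` for a removed project, and the snapshot is assumed to be a mapping from normalized project name to a list of version strings (the real layout of pypi_versions.json may differ).

```python
import re
from typing import Callable, Dict, List


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def known_versions(
    name: str,
    fetch_live: Callable[[str], List[str]],
    snapshot: Dict[str, List[str]],
) -> List[str]:
    """Return versions for `name`, preferring live data.

    `fetch_live` is a hypothetical callable that raises LookupError when
    the project no longer exists on the index. `snapshot` is assumed to
    map normalized names to version lists.
    """
    try:
        return fetch_live(name)
    except LookupError:
        # Removed projects have no project JSON, so use the snapshot.
        return snapshot.get(normalize(name), [])
```

Passing the live lookup in as a callable keeps the sketch independent of any particular HTTP client.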

@di
Member

di commented Dec 6, 2021

Do we need some mechanism for "this entire project is malicious, regardless of version"?

@darakian
Contributor

Hey all 👋

Just want to chime in on this in hopes of reviving the conversation and to share some of GitHub's thinking. npm does indeed make a point of publishing advisories for malware packages. Those packages are also pulled, and the package's namespace is dead forevermore. Pulling the packages prevents future exploitation, of course, and the alert informs users who have already downloaded the package, which minimizes an attacker's window of opportunity.

So to that end I think it would 100% be valuable to publish malware takedown advisories.

> Do we need some mechanism for "this entire project is malicious, regardless of version"?

It might be nice, but this can be achieved with an uncapped version range in any advisory, e.g. >= 0.0.0. I know differing version schemes might make that hard to get perfect, but it's a start.
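As a rough illustration of the uncapped range, here is a partial OSV-style record built in Python. In the OSV schema, an `introduced` event of `"0"` with no `fixed` event covers every version. The advisory ID, package name, and the helper `whole_project_advisory` are hypothetical, and a real record needs further fields (such as `modified`) that are left out of this sketch.

```python
import json


def whole_project_advisory(advisory_id: str, package: str, summary: str) -> dict:
    """Build a partial OSV-style record that flags every version."""
    return {
        "id": advisory_id,  # hypothetical ID, not a real advisory
        "summary": summary,
        "affected": [
            {
                "package": {"ecosystem": "PyPI", "name": package},
                "ranges": [
                    {
                        "type": "ECOSYSTEM",
                        # "introduced": "0" and no "fixed" event:
                        # the range is open-ended.
                        "events": [{"introduced": "0"}],
                    }
                ],
            }
        ],
    }


record = whole_project_advisory(
    "EXAMPLE-0000-0",
    "example-malicious-pkg",
    "Malicious package removed from PyPI",
)
print(json.dumps(record, indent=2))
```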

Something else I would like to ask, which may be more controversial, is that when a package is taken down, the namespace for that package also be taken down, reserved, or otherwise made permanently unusable. The rationale is that these advisories then need not be invalidated over time as new users reuse package names.

@di
Member

di commented Dec 13, 2022

Hey, thanks for that insight!

> Something else I would like to ask, which may be more controversial, is that when a package is taken down, the namespace for that package also be taken down, reserved, or otherwise made permanently unusable. The rationale is that these advisories then need not be invalidated over time as new users reuse package names.

As great as this sounds, I think it's not possible in practice for an ecosystem like Python, which (currently) has only a single global namespace and an open registration system for projects.

A very common occurrence is that a legitimate maintainer publishes a source repo, and before they get a chance to publish this to PyPI, an attacker beats them to it and publishes a malicious version of that project name.

We'd want to take down the project and inform people that a specific release was malicious, but we wouldn't want to block the legitimate maintainers from eventually publishing that name on PyPI.

@darakian
Contributor

Totally fair. For one-off malware I certainly would not suggest burning the namespace, but maybe the specific version. Anyway, advisories on malware are very much a 👍 from me 😄

@joshbressers

It's probably worth framing this in the context of this work
https://checkmarx.com/blog/how-140k-nuget-npm-and-pypi-packages-were-used-to-spread-phishing-links/

144K packages is a pretty wild number. The list exists though
https://gist.github.com/jossef/1c1152368ff6210340644f44afec7e8e

Looks like there are 7824 PyPI packages.

I think in the case of malicious packages, there's not much to say other than "don't use this". It differs from a traditional security advisory that tends to need some additional details for the people on the receiving end.

I think for something like GitHub or OSV, it should be trivial to create the data for this based off something as simple as a CSV file. One doesn't need a lot of details, just a list of bad versions.
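To make the CSV idea concrete, here is a small sketch that groups rows into partial OSV-style `affected` entries with explicit version lists. The column names `name` and `version` and the function `records_from_csv` are assumptions for illustration; the linked gist may use a different layout.

```python
import csv
import io
from typing import Dict, List


def records_from_csv(text: str) -> List[dict]:
    """Group (name, version) rows into one partial record per package.

    Assumes a header row with hypothetical columns `name` and `version`.
    """
    versions: Dict[str, List[str]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        seen = versions.setdefault(row["name"], [])
        if row["version"] not in seen:  # drop duplicate rows
            seen.append(row["version"])
    return [
        {
            "affected": [
                {
                    "package": {"ecosystem": "PyPI", "name": name},
                    "versions": vs,  # kept in input order
                }
            ]
        }
        for name, vs in sorted(versions.items())
    ]
```

Versions are kept in input order rather than sorted, since plain string sorting does not match version ordering.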

@kaplanlior

> A very common occurrence is that a legitimate maintainer publishes a source repo, and before they get a chance to publish this to PyPI, an attacker beats them to it and publishes a malicious version of that project name.
>
> We'd want to take down the project and inform people that a specific release was malicious, but we wouldn't want to block the legitimate maintainers from eventually publishing that name on PyPI.

Which suggests authenticating package sources and their Git references, so that only the package maintainer could publish from their own repo (an idea by @jossef). Naming is of course a wider discussion.

@darakian
Contributor

I suspect package to repo authentication is out of scope for this topic. I'd love to see it even if opt-in only, but for the moment it's not in place and malware advisories would be useful with or without it.

@di let me know if there's anything I can do from the GitHub side to help 👍

@kurtseifried

One comment re the info, a package might be malware, or specific versions might have malware (but older/newer versions are fine). So unless the package is removed, some source of data saying which ones are safe/unsafe to use is advisable. It doesn't necessarily have to live in the pypa database though.

@teruokun

So for now, I think the key bit here is having a way for PyPI to communicate the list of packages you shouldn't be getting from PyPI. Eventually it should likely be part of an API definition (perhaps as an alternative 'list packages' interface). But given that PyPI and its mirrors are often used as the root, it makes sense to communicate at least what you shouldn't expect to be getting from PyPI, and to serve as an advisory so that metadata mirrors at least have a stance on how they should react when packages are added to or removed from the pypi-removed list.

I'd like to propose an idea here for a format, though I can understand if something more structured like JSON might be preferable: a pip requirements.txt format of the block list, potentially zipped. This would be almost trivial to construct from the existing warehouse table, and it would give most consumers an easy way to scan their existing environment for the packages using existing tools (i.e. any that already support prohibited requirements). If transparency of reasoning is also worthwhile, it could include the reasons as comments. In terms of later flexibility, the format can easily take version specifiers or other requirement qualifiers, though it need not do so at the outset. It also doesn't block a more structured file with more details later on, if that's desired, but it does make an easy integration point for pip users.
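For a sense of what consuming such a file could look like, here is a minimal sketch that parses a requirements-style blocklist and checks installed packages against it. It handles only bare names and `name==version` pins, not the full requirements syntax, and the function names are hypothetical.

```python
import re
from importlib import metadata
from typing import Dict, List, Optional, Set, Tuple


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_blocklist(text: str) -> Dict[str, Optional[Set[str]]]:
    """Map each blocked name to a set of versions, or None for all."""
    blocked: Dict[str, Optional[Set[str]]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # reasons live in comments
        if not line:
            continue
        name, _, version = line.partition("==")
        key = normalize(name.strip())
        if not version:
            blocked[key] = None  # bare name: every version is blocked
        elif blocked.get(key, set()) is not None:
            blocked.setdefault(key, set()).add(version.strip())
    return blocked


def find_blocked(
    installed: Dict[str, str], blocked: Dict[str, Optional[Set[str]]]
) -> List[Tuple[str, str]]:
    """Return (name, version) pairs that appear on the blocklist."""
    hits = []
    for name, version in installed.items():
        key = normalize(name)
        if key in blocked and (blocked[key] is None or version in blocked[key]):
            hits.append((name, version))
    return sorted(hits)


def installed_packages() -> Dict[str, str]:
    """Collect name -> version for the current environment."""
    return {
        d.metadata["Name"]: d.version
        for d in metadata.distributions()
        if d.metadata["Name"]
    }
```

A scan of the current environment would then be `find_blocked(installed_packages(), parse_blocklist(text))`.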

So the process would be: on a schedule and/or on change, a job generates a new blocklist file from the existing DB table and, if it differs from the existing file, opens a pull request against the advisory database to update the file.
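The generate-and-compare step might look roughly like this. It is only a sketch: the rows are assumed to be `(name, reason)` pairs already read from the table, the reason is assumed to be a string (empty when there is none), and opening the pull request is left out entirely. `render_blocklist` and `update_if_changed` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, Tuple


def render_blocklist(rows: Iterable[Tuple[str, str]]) -> str:
    """Render rows as requirements-style lines, reasons as comments."""
    by_name = dict(rows)  # one entry per name; the last reason wins
    lines = [
        f"{name}  # {reason}" if reason else name
        for name, reason in sorted(by_name.items())
    ]
    return "".join(line + "\n" for line in lines)


def update_if_changed(path: Path, rows: Iterable[Tuple[str, str]]) -> bool:
    """Rewrite `path` only when the rendered content differs.

    Returns True when the file was written, which is the point at
    which a job would go on to open a pull request.
    """
    new_text = render_blocklist(rows)
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

Sorting the names keeps the output stable, so a rerun with unchanged data produces no diff.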

What do others think? I'd be willing to spike a little work on it if this seems like a reasonable approximation.

@sethmlarson
Contributor

sethmlarson commented Jul 10, 2023

Reviving this topic a bit, I think it makes sense to have packages which have been removed from PyPI listed in this database. My thinking is:

  • Package names that have been removed aren't removed from the "name pool" forever. People can request previously removed package names via PEP 541. This will help mirrors/consumers have more confidence about which packages are malicious versus safe to use and install, knowing that names can be reused.
  • CVEs aren't emitted for malicious packages, and it's unlikely that other providers of vulnerability IDs would emit them for PyPI. A PYSEC record is likely the only place this information would be published.

I'm unsure how automated this process could be, but I think a PYSEC record that contains the removed package name, its versions (assuming that versions can't be reused post-deletion; please check this assumption), and hashes of the files would be enough for pip-audit to detect a malicious package?
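To illustrate the hash idea, here is a sketch that checks a downloaded file against advisory records carrying SHA-256 digests. The record layout (an `id` plus a `sha256` list) is hypothetical, and this is not how pip-audit works today; it only shows that file hashes are enough to make a match.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_advisories(path: Path, records: Iterable[dict]) -> List[str]:
    """Return IDs of records whose (hypothetical) `sha256` list
    contains the digest of the file at `path`."""
    file_hash = sha256_of(path)
    return [r["id"] for r in records if file_hash in r.get("sha256", [])]
```

Matching on hashes rather than names would keep an advisory valid even if the project name were later reused by a legitimate maintainer.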

@kurtseifried

> CVEs aren't emitted for malicious packages, and it's unlikely that other providers of vulnerability IDs would emit them for PyPI. A PYSEC record is likely the only place this information would be published.

I can unequivocally say that the GSD project (https://gsd.id) would like to issue IDs for these issues, especially as we can properly tag them. Currently they would get tagged as "concern":

{
  "gsd": {
    "metadata": {
      "type": "concern",

But we can definitely look at adding a "malware" category (I suspect there are enough of these across multiple ecosystems to make it worth doing).

We are also happy to support automation in order to get you GSD IDs quickly and easily, as we already do for the Linux kernel (several thousand per year).


9 participants