Update scripts to gather sources info of nix packages#10

Merged
nlewo merged 16 commits into nix-community:master from
anlambert:update-scripts
Jul 24, 2025
Conversation

@anlambert
Contributor

I am a software engineer at Software Heritage and I recently reviewed the state of the archival of NixOS source packages.

To archive the sources of nix packages, we have a dedicated nixguix lister consuming a JSON file with the relevant info to create loading tasks that archive the sources of Guix/NixOS packages. Source code can come from tarballs, plain files, or a VCS (git, mercurial, subversion).

Last year we worked with the Guix folks to improve the archiving of their source packages; you can find more details in this blog post or this one. The key feature of that work was to map the content-addressed identifiers used by Guix (standard checksums and NAR checksums) to the SoftWare Hash IDentifiers (SWHIDs) used by Software Heritage.

We updated our loaders, which ingest tarballs and files into the archive, to recompute the checksums (standard or NAR) after downloading source code artifacts, in order to ensure the integrity of the archival. We also store the mapping between checksums and SWHIDs at the end of the loading process. This now gives Guix a fallback to fetch source code from the archive in case the upstream has vanished. To do so, they query the /api/1/extid endpoint of the SWH Web API to get a SWHID, targeting a directory or a file, from a NAR hash or a standard hash. Once the SWHID is retrieved, if it targets a directory they can request the cooking of a tarball using /api/1/vault/flat/ to download the source code; if it targets a file, they download it using /api/1/content/raw/. The implementation details of the Guix bridge with SWH can be found in that source file.
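A minimal sketch of that fallback flow in Python, built only on the endpoints named above. The extid type string ("nar-sha256"), the "target" response field, and the URL shapes are assumptions for illustration, not details confirmed in this thread; check the SWH Web API documentation before relying on them.

```python
import json
import urllib.request

# Base URL of the Software Heritage Web API.
API = "https://archive.softwareheritage.org/api/1"


def swhid_from_nar_hash(nar_sha256_hex: str) -> str:
    """Map a NAR sha256 to a SWHID via the /api/1/extid endpoint.

    The extid type name and response field are assumptions here.
    """
    url = f"{API}/extid/nar-sha256/hex:{nar_sha256_hex}/"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["target"]


def fallback_download_url(swhid: str) -> str:
    """Choose the retrieval endpoint depending on what the SWHID targets."""
    if swhid.startswith("swh:1:dir:"):
        # A directory: ask the vault to cook a flat tarball of it.
        return f"{API}/vault/flat/{swhid}/"
    if swhid.startswith("swh:1:cnt:"):
        # A single file: fetch its raw bytes by sha1_git.
        sha1_git = swhid.rsplit(":", 1)[-1]
        return f"{API}/content/raw/sha1_git:{sha1_git}/"
    raise ValueError(f"unsupported SWHID type: {swhid}")
```

The URL-building logic runs offline; only `swhid_from_nar_hash` performs a network request.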

Such a fallback for downloading source code could also be added for nix packages but, as for Guix, a bridge must be implemented to communicate with the archive.

To get back to this pull request, I noticed the sources-unstable.json file had not been updated for four years.

$ curl -I https://nix-community.github.io/nixpkgs-swh/sources-unstable.json
HTTP/2 200 
server: GitHub.com
content-type: application/json; charset=utf-8
last-modified: Tue, 12 Oct 2021 05:25:39 GMT
...

This means the archive is currently not synchronized with the state of nix packages. I tried to execute the nixpkgs-swh tool but it failed with errors. So I started debugging it and ended up making a lot of improvements to it, notably:

  • avoid crashing when a field in a derivation is missing
  • detect whether sources come from a VCS (hg, git, svn); in that case, also extract info about the changeset, tag or revision targeting the source code to fetch
  • fetch the narinfo for each package in the nix HTTP cache and add it to the JSON output
  • add a GitHub Actions workflow to generate the sources-unstable.json file and upload it to GitHub Pages on a daily basis
  • improve the performance of the post-processing Python script
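The VCS detection step in the list above could look roughly like the sketch below. The field names ("url", "fetcher", "rev") are hypothetical stand-ins, not the actual attributes of a nixpkgs fetcher derivation:

```python
def classify_source(src: dict) -> dict:
    """Guess whether a source comes from a VCS and, if so, keep the
    revision info needed to fetch it.

    `src` is a hypothetical dict of derivation attributes; the key
    names are illustrative assumptions.
    """
    url = src.get("url", "")
    for vcs in ("git", "hg", "svn"):
        # Recognize either a URL scheme prefix (e.g. "git+https://...")
        # or an explicit fetcher name (e.g. "fetchgit").
        if url.startswith(f"{vcs}+") or src.get("fetcher") == f"fetch{vcs}":
            return {"type": vcs, "url": url, "rev": src.get("rev")}
    # Default: a plain tarball or file download.
    return {"type": "url", "url": url}
```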

The JSON output now contains more relevant info for a better archiving of nix source packages by SWH. Notably, having the narinfo file for each package allows SWH to download source code directly from the nix cache. It ensures that the hashes computed when verifying the integrity of downloaded tarballs (standard checksums or NAR ones) will be correct, as
NixOS can apply some transformations to upstream sources (the postFetch phase).
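For context, a narinfo file is a small "Key: value" text document (content type text/x-nix-narinfo, as shown in the curl output further down). A minimal parser sketch, splitting only on the first colon so values such as "NarHash: sha256:..." stay intact:

```python
def parse_narinfo(text: str) -> dict:
    """Parse the "Key: value" lines of a .narinfo file into a dict.

    Splitting on the first colon only preserves values that themselves
    contain colons, like "NarHash: sha256:...".
    """
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    return info
```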

On the SWH side, several changes were made to improve the archival of nix source packages:

  • we now have a swh.core.nar module for serializing to and unpacking NAR archives
  • the nixguix lister was improved to exploit the narinfo files and create loading tasks targeting NAR archives in the nix cache, see merge request
  • the content and tarball loaders were improved to handle NAR archives as input, see merge request

If this PR gets merged, we could catch up on the archiving lag of nix packages. Looking forward to advancing on this topic.

anlambert added 11 commits June 26, 2025 01:38
Detect if source comes from a VCS and set a type attribute accordingly.

Extract the VCS revision targeting the source of a package.

Extract some parameters related to sources coming from git.

Extract nix store path of source derivation.
The script processes the JSON output of the nix-instantiate call gathering
info about nix package sources.

Improve formatting with black.

It has been updated to:
- fix call to nix-hash
- remove duplicates in the input sources list
- robustify the hashes normalization
- compute new fields expected by the SWH nixguix lister
Remove the filtering of sources down to only tarballs, as SWH
can now also handle sources coming from a VCS (git, hg, svn).

Add --show-trace option to nix-instantiate to ease debugging.

Call post-process.py after renaming of add-sri.py.
The generate script can work successfully without it, so it is
better to remove it.
Apply black formatting.

Avoid errors after the update of the sources data.
Use aiohttp to greatly optimize the performance of fetching a lot of
small text files through HTTP requests.
Use asyncio to greatly improve performance of subprocess calls to
nix-hash command.
Member

@nlewo nlewo left a comment

Hello!

Thank you for all these improvements. I tried to resurrect this project a few months ago but failed because of some evaluation issues in nixpkgs. It seems they are now fixed.

I still need to run this locally before merging but this looks almost ready to get merged.

What is the purpose of commit a526639? I would prefer to avoid committing such big files, to preserve the size of this repository.

I discussed with the NixOS infra team and they were OK to run these scripts on the NixOS infra. There is PR #8, which would allow running this project via a systemd timer on a powerful server. So, I don't think we need to rely on GitHub Actions.

I should be able to run it locally during this weekend.

Comment thread scripts/post-process.py Outdated
filteredSources.append(source)


async def get(source, session):
Member

I'm wondering how sustainable it is to fetch all narinfos at each sources.json file generation.
Do you think we could add a caching mechanism? This could come in a follow-up PR.

Contributor Author

I do not think we need to implement a cache as most of the narinfo files are already cached by Varnish.

$ curl -I https://cache.nixos.org/0l30f0az52ygfy40w45hg7qmailhjdyn.narinfo
HTTP/2 200 
last-modified: Mon, 07 Jul 2025 10:54:20 GMT
etag: "58f29ebe3b456459030fc045f0b5094d"
x-amz-server-side-encryption: AES256
content-type: text/x-nix-narinfo
server: AmazonS3
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
age: 2032
date: Tue, 15 Jul 2025 13:35:57 GMT
x-served-by: cache-iad-kjyo7100035-IAD, cache-lcy-egml8630025-LCY
x-cache: HIT, HIT
x-cache-hits: 7, 0
access-control-allow-origin: *
content-length: 621

Considering the performance of the asynchronous narinfo fetches (a couple dozen seconds to fetch several tens of thousands of files), I guess there are only a few cache misses.
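The bounded-concurrency pattern behind those asynchronous fetches can be sketched with asyncio alone. In this offline sketch, a stub coroutine stands in for the aiohttp GET of https://cache.nixos.org/<hash>.narinfo so the example is self-contained; the concurrency limit of 100 is an arbitrary illustrative value:

```python
import asyncio


async def fetch_narinfo(hash_part: str) -> str:
    """Stub for an HTTP GET of a .narinfo file; a short sleep stands in
    for the aiohttp request so the sketch runs without network access."""
    await asyncio.sleep(0.001)
    return f"{hash_part}.narinfo"


async def fetch_all(hashes, limit: int = 100):
    """Fetch many small files concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(h):
        async with sem:
            return await fetch_narinfo(h)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(h) for h in hashes))


results = asyncio.run(fetch_all([f"h{i}" for i in range(50)]))
```

Overlapping the waits this way is what makes tens of thousands of small HTTP requests complete in seconds rather than hours.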

Comment thread scripts/post-process.py Outdated
@nlewo
Member

nlewo commented Jul 14, 2025

btw, I ran it locally (only on a nixpkgs subset) and it seems to be working well!

With that workflow, unstable sources of nixpkgs are computed every day
at midnight and then deployed to GitHub Pages.
Use the builtin nix function convertHash to normalize hashes instead
of spawning numerous nix-hash processes from the post-process.py
script; this greatly improves pipeline execution performance.
@anlambert
Contributor Author

Hi @nlewo and thanks for the review!

What is the purpose of commit a526639? I would prefer to avoid committing such big files, to preserve the size of this repository.

This was mainly for testing the updated nixguix lister/loader against various NixOS releases by uploading those files to my GitHub Pages. I removed the commit adding those files from this PR but I kept them in a separate branch on my fork to avoid generating them again.

Ideally, once all changes to the nixguix lister/loader are merged on the SWH side, we should try to archive the sources of all past NixOS releases by triggering one-shot listings using these files as input to the lister. This means they must be uploaded to a public HTTP server for SWH to fetch them. Then, triggering weekly or monthly listings of unstable nixpkgs should ensure the sources archival does not lag.

I discussed with the NixOS infra team and they were OK to run these scripts on the NixOS infra. There is PR #8, which would allow running this project via a systemd timer on a powerful server. So, I don't think we need to rely on GitHub Actions.

Sure. Can we keep the workflow file though? It is pretty handy to have CI for testing new developments or fixing issues before submitting a PR.

FYI, this was actually a workaround for the missing builtins.hashTo function. This is, however, now fixed: NixOS/nix#3151 and NixOS/nix#7708

This means we could set the SRI hash during the Nix evaluation. This could improve evaluation time since no processes would need to be spawned.

Note this is an improvement and doesn't need to be part of this MR.

Thanks for the pointer, I added a commit to perform hash normalizations directly in the find-tarballs.nix script. Performance of the post-process.py script is now much better.
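For illustration, the normalization that the Nix builtin now performs during evaluation amounts to re-encoding the raw digest bytes in base64 with an algorithm prefix. A Python equivalent of the hex-to-SRI case (one of several input encodings; this is a sketch of the format, not the script's actual code):

```python
import base64


def hex_to_sri(algo: str, hex_digest: str) -> str:
    """Convert a hex-encoded digest to an SRI hash string, i.e.
    "<algo>-<base64 of the raw digest bytes>"."""
    raw = bytes.fromhex(hex_digest)
    return f"{algo}-{base64.b64encode(raw).decode('ascii')}"
```

For example, the sha256 of an empty input, e3b0c442...b855 in hex, becomes sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= in SRI form.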

@nlewo
Member

nlewo commented Jul 19, 2025

Then, triggering weekly or monthly listings of unstable nixpkgs should ensure the sources archival does not lag.

Initially, my plan was to generate these files four times per day because Hydra evaluated nixpkgs every 6 hours. It seems it is now evaluating nixpkgs every 36 hours. So, if we don't want to miss any artifact pushed to the NixOS binary cache, we could generate the JSON file every day.

Regarding the ingestion of older releases, we could expose them manually and maybe manually trigger an SWH lister run. But that's more on your side. I think we could also discuss this again later, in order to focus on ingesting the current releases.

Sure. Can we keep the workflow file though ? This is pretty handy to have CI for testing new developments or fixing issues before submitting a PR.

OK, I definitely agree on running CI on each PR. When I'm working on this project, I actually only consider nixpkgs.hello instead of nixpkgs. This doesn't require a lot of memory but allows validating all the scripts. Would it be acceptable for you to run the CI on this subset?
(I think we would then no longer need the trick to get more RAM, and the CI would run faster.)

I added a commit to perform hash normalizations directly in the find-tarballs.nix script. Performance of the post-process.py script is now much better.

wow, cool! Thank you!

@nlewo
Member

nlewo commented Jul 20, 2025

@anlambert I pushed a commit with a flake into your branch. This also adds the --testing option to the script, allowing to evaluate only nixpkgs.hello, which is pretty fast.

In the CI, we would then have to run:

nix run .#nixpkgs-swh-generate -- --testing /tmp/swh unstable

(This would also allow to easily and quickly run what the CI is executing.)

Also, I finally ran your branch on the whole nixpkgs and it works nicely! Thank you very much!

This also introduces a `--testing` argument which can be used to
validate all scripts without having to evaluate the whole nixpkgs.
Comment thread scripts/post-process.py Outdated
@nlewo
Member

nlewo commented Jul 20, 2025

I added the systemd service to my small personal server (with testing = true;) and it correctly generated all files: http://nixpkgs-swh.abesis.fr/.

(So, once this MR is merged, I could ask the NixOS infra team to run it on the NixOS infra.)

Comment out the print instruction for successful HTTP requests but keep
the one for failed requests.
Use the nix run command to execute the pipeline extracting sources info
of NixOS packages.

Execute the pipeline in testing mode, considering only nixpkgs.hello,
for faster execution and lower memory consumption.

Drop the upload to GitHub Pages and prefer to simply display the
produced JSON file, as it is quite small.
@anlambert
Contributor Author

Initially, my plan was to generate these files four times per day because Hydra evaluated nixpkgs every 6 hours. It seems it is now evaluating nixpkgs every 36 hours. So, if we don't want to miss any artifact pushed to the NixOS binary cache, we could generate the JSON file every day.

On the SWH side, we will have to run the lister every day then. As we exploit the Last-Modified HTTP header value from the remote nix cache responses, loading tasks to archive sources will not be recreated if we already encountered them in a previous listing.
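That deduplication amounts to remembering the Last-Modified value seen per artifact URL and only creating a loading task when it changes. A minimal sketch under that assumption (the in-memory state dict is illustrative; the real lister persists its state):

```python
def new_loading_tasks(artifacts, seen_last_modified):
    """Return the URLs of artifacts whose Last-Modified header changed
    since the previous listing.

    `artifacts` is an iterable of (url, last_modified) pairs;
    `seen_last_modified` maps url -> last observed value and is
    updated in place (a stand-in for the lister's persisted state).
    """
    tasks = []
    for url, last_modified in artifacts:
        if seen_last_modified.get(url) != last_modified:
            seen_last_modified[url] = last_modified
            tasks.append(url)
    return tasks
```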

Regarding the ingestion of older releases, we could expose them manually and maybe manually trigger an SWH lister run. But that's more on your side. I think we could also discuss this again later, in order to focus on ingesting the current releases.

Ack, it should not be too complicated on our side to create these one-shot listing tasks for older releases. We could host the JSON files ourselves but I think it would be better to target an HTTP server managed by nix.

@anlambert I pushed a commit with a flake into your branch. This also adds the --testing option to the script, allowing to evaluate only nixpkgs.hello, which is pretty fast.

In the CI, we would then have to run:

nix run .#nixpkgs-swh-generate -- --testing /tmp/swh unstable

(This would also allow to easily and quickly run what the CI is executing.)

Awesome, thanks a lot! I have updated the GitHub Actions workflow to use the command above, and I also removed the deployment to GitHub Pages to ensure the workflow can be successfully executed from any fork of this repository.

@nlewo
Member

nlewo commented Jul 24, 2025

Thank you!
So, let's merge this MR. I will discuss with the NixOS infra team to integrate this project into the NixOS infra and keep you updated.

@nlewo nlewo merged commit ae97b4f into nix-community:master Jul 24, 2025
@nlewo
Member

nlewo commented Sep 7, 2025

@anlambert FYI, I submitted NixOS/infra#830 to deploy it in the NixOS infra.
