Update scripts to gather sources info of nix packages#10

Merged
nlewo merged 16 commits into nix-community:master from
anlambert:update-scripts
Jul 24, 2025
Conversation

@anlambert
Contributor

I am a software engineer at Software Heritage and I recently reviewed the state of the archival of NixOS source packages.

To archive the sources of nix packages, we have a dedicated nixguix lister consuming a JSON file with the relevant info to create loading tasks that archive the sources of Guix/NixOS packages. Source code can come from tarballs, plain files, or a VCS (git, mercurial, subversion).

Last year we worked with the Guix folks to improve the archiving of their source packages; you can find more details in this blog post or this one. The key feature of that work was to map the content-addressed identifiers used by Guix (standard checksums and NAR checksums) to the SoftWare Hash IDentifiers (SWHIDs) used by Software Heritage.

We updated our loaders, which ingest tarballs and files into the archive, to recompute the checksums (standard or NAR) after downloading source code artifacts, in order to ensure the integrity of the archival. We also store the mapping between checksums and SWHIDs at the end of the loading process. This now gives Guix a fallback to fetch source code from the archive in case the upstream has vanished. To do so, they query the /api/1/extid endpoint of the SWH Web API to get a SWHID, targeting a directory or a file, from a NAR hash or a standard hash. Once the SWHID is retrieved, if it targets a directory they can request the cooking of a tarball using /api/1/vault/flat/ to download the source code; if it targets a file, they download it using /api/1/content/raw/. The implementation details of the Guix bridge with SWH can be found in that source file.
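A minimal sketch of that fallback flow in Python, built only on the endpoints named above. The extid type string ("nar-sha256"), the "target" response field, and the URL shapes are assumptions for illustration, not details confirmed in this thread; check the SWH Web API documentation before relying on them.

```python
import json
import urllib.request

# Base URL of the Software Heritage Web API.
API = "https://archive.softwareheritage.org/api/1"


def swhid_from_nar_hash(nar_sha256_hex: str) -> str:
    """Map a NAR sha256 to a SWHID via the /api/1/extid endpoint.

    The extid type name and response field are assumptions here.
    """
    url = f"{API}/extid/nar-sha256/hex:{nar_sha256_hex}/"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["target"]


def fallback_download_url(swhid: str) -> str:
    """Choose the retrieval endpoint depending on what the SWHID targets."""
    if swhid.startswith("swh:1:dir:"):
        # A directory: ask the vault to cook a flat tarball of it.
        return f"{API}/vault/flat/{swhid}/"
    if swhid.startswith("swh:1:cnt:"):
        # A single file: fetch its raw bytes by sha1_git.
        sha1_git = swhid.rsplit(":", 1)[-1]
        return f"{API}/content/raw/sha1_git:{sha1_git}/"
    raise ValueError(f"unsupported SWHID type: {swhid}")
```

The URL-building logic runs offline; only `swhid_from_nar_hash` performs a network request.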

Such a fallback for downloading source code could also be added for nix packages but, as for Guix, a bridge must be implemented to communicate with the archive.

To get back to this pull request, I noticed the sources-unstable.json file had not been updated for four years.

$ curl -I https://nix-community.github.io/nixpkgs-swh/sources-unstable.json
HTTP/2 200 
server: GitHub.com
content-type: application/json; charset=utf-8
last-modified: Tue, 12 Oct 2021 05:25:39 GMT
...

This means the archive is currently not synchronized with the state of nix packages. I tried to execute the nixpkgs-swh tool but it failed with errors. So I started debugging it and ended up making a lot of improvements to it, notably:

  • avoid crashing when a field in a derivation is missing
  • detect whether sources come from a VCS (hg, git, svn); in that case, also extract info about the changeset, tag or revision targeting the source code to fetch
  • fetch the narinfo for each package in the nix HTTP cache and add it to the JSON output
  • add a GitHub Actions workflow to generate the sources-unstable.json file and upload it to GitHub Pages on a daily basis
  • improve the performance of the post-processing Python script
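The VCS detection step in the list above could look roughly like the sketch below. The field names ("url", "fetcher", "rev") are hypothetical stand-ins, not the actual attributes of a nixpkgs fetcher derivation:

```python
def classify_source(src: dict) -> dict:
    """Guess whether a source comes from a VCS and, if so, keep the
    revision info needed to fetch it.

    `src` is a hypothetical dict of derivation attributes; the key
    names are illustrative assumptions.
    """
    url = src.get("url", "")
    for vcs in ("git", "hg", "svn"):
        # Recognize either a URL scheme prefix (e.g. "git+https://...")
        # or an explicit fetcher name (e.g. "fetchgit").
        if url.startswith(f"{vcs}+") or src.get("fetcher") == f"fetch{vcs}":
            return {"type": vcs, "url": url, "rev": src.get("rev")}
    # Default: a plain tarball or file download.
    return {"type": "url", "url": url}
```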

The JSON output now contains more relevant info for a better archiving of nix source packages by SWH. Notably, having the narinfo file for each package allows SWH to download source code directly from the nix cache. It ensures that the hashes computed when verifying the integrity of downloaded tarballs (standard checksums or NAR ones) will be correct, as
NixOS can apply some transformations to upstream sources (the postFetch phase).
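For context, a narinfo file is a small "Key: value" text document (content type text/x-nix-narinfo, as shown in the curl output further down). A minimal parser sketch, splitting only on the first colon so values such as "NarHash: sha256:..." stay intact:

```python
def parse_narinfo(text: str) -> dict:
    """Parse the "Key: value" lines of a .narinfo file into a dict.

    Splitting on the first colon only preserves values that themselves
    contain colons, like "NarHash: sha256:...".
    """
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            info[key.strip()] = value.strip()
    return info
```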

On the SWH side, several changes were made to improve the archival of nix source packages:

  • we now have a swh.core.nar module for serializing to and unpacking NAR archives
  • the nixguix lister was improved to exploit the narinfo files and create loading tasks targeting NAR archives in the nix cache, see merge request
  • the content and tarball loaders were improved to handle NAR archives as input, see merge request

If this PR gets merged, we could catch up on the archiving lag of nix packages. Looking forward to advancing on this topic.

anlambert added 11 commits June 26, 2025 01:38
Detect if source comes from a VCS and set a type attribute accordingly.

Extract the VCS revision targeting the source of a package.

Extract some parameters related to sources coming from git.

Extract nix store path of source derivation.
The script processes the JSON output of the nix-instantiate call gathering
info about nix package sources.

Improve formatting with black.

It has been updated to:
- fix call to nix-hash
- remove duplicates in the input sources list
- robustify the hashes normalization
- compute new fields expected by the SWH nixguix lister
Remove the filtering of sources down to only tarballs, as SWH
can now also handle sources coming from a VCS (git, hg, svn).

Add --show-trace option to nix-instantiate to ease debugging.

Call post-process.py after renaming of add-sri.py.
The generate script can work successfully without it, so it is
better to remove it.
Apply black formatting.

Avoid errors after the update of the sources data.
Use aiohttp to greatly optimize the performance of fetching a lot of
small text files through HTTP requests.
Use asyncio to greatly improve performance of subprocess calls to
nix-hash command.
Member

@nlewo nlewo left a comment

Hello!

Thank you for all these improvements. I tried to resurrect this project a few months ago but failed because of some evaluation issues in nixpkgs. It seems they are now fixed.

I still need to run this locally before merging but this looks almost ready to get merged.

What is the purpose of commit a526639? I would prefer to avoid committing such big files, to preserve the size of this repository.

I discussed with the NixOS infra team and they were OK to run these scripts on the NixOS infra. There is PR #8, which would allow running this project via a systemd timer on a powerful server. So, I don't think we need to rely on GitHub Actions.

I should be able to run it locally during this weekend.

Comment thread scripts/post-process.py Outdated
filteredSources.append(source)


async def get(source, session):
Member

I'm wondering how sustainable it is to fetch all narinfos at each sources.json file generation.
Do you think we could add a caching mechanism? This could come in a follow-up PR.

Contributor Author

I do not think we need to implement a cache as most of the narinfo files are already cached by Varnish.

$ curl -I https://cache.nixos.org/0l30f0az52ygfy40w45hg7qmailhjdyn.narinfo
HTTP/2 200 
last-modified: Mon, 07 Jul 2025 10:54:20 GMT
etag: "58f29ebe3b456459030fc045f0b5094d"
x-amz-server-side-encryption: AES256
content-type: text/x-nix-narinfo
server: AmazonS3
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
age: 2032
date: Tue, 15 Jul 2025 13:35:57 GMT
x-served-by: cache-iad-kjyo7100035-IAD, cache-lcy-egml8630025-LCY
x-cache: HIT, HIT
x-cache-hits: 7, 0
access-control-allow-origin: *
content-length: 621

Considering the performance of the asynchronous narinfo fetches (a couple dozen seconds to fetch several tens of thousands of files), I guess there are only a few cache misses.
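The bounded-concurrency pattern behind those asynchronous fetches can be sketched with asyncio alone. In this offline sketch, a stub coroutine stands in for the aiohttp GET of https://cache.nixos.org/<hash>.narinfo so the example is self-contained; the concurrency limit of 100 is an arbitrary illustrative value:

```python
import asyncio


async def fetch_narinfo(hash_part: str) -> str:
    """Stub for an HTTP GET of a .narinfo file; a short sleep stands in
    for the aiohttp request so the sketch runs without network access."""
    await asyncio.sleep(0.001)
    return f"{hash_part}.narinfo"


async def fetch_all(hashes, limit: int = 100):
    """Fetch many small files concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(h):
        async with sem:
            return await fetch_narinfo(h)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(h) for h in hashes))


results = asyncio.run(fetch_all([f"h{i}" for i in range(50)]))
```

Overlapping the waits this way is what makes tens of thousands of small HTTP requests complete in seconds rather than hours.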

Comment thread scripts/post-process.py Outdated
@nlewo
Member

nlewo commented Jul 14, 2025

btw, I ran it locally (only on a nixpkgs subset) and it seems to be working well!

With that workflow, unstable sources of nixpkgs are computed every day
at midnight and then deployed to GitHub Pages.
Use the builtin nix function convertHash to normalize hashes instead
of spawning numerous nix-hash processes from the post-process.py
script; this greatly improves pipeline execution performance.
@anlambert
Contributor Author

Hi @nlewo and thanks for the review!

What is the purpose of commit a526639? I would prefer to avoid committing such big files, to preserve the size of this repository.

This was mainly for testing the updated nixguix lister/loader against various NixOS releases by uploading those files to my GitHub Pages. I removed the commit adding those files from this PR but I kept them in a separate branch on my fork to avoid generating them again.

Ideally, once all changes to the nixguix lister/loader are merged on the SWH side, we should try to archive the sources of all past NixOS releases by triggering one-shot listings using these files as input to the lister. This means they must be uploaded to a public HTTP server for SWH to fetch them. Then, triggering weekly or monthly listings of unstable nixpkgs should ensure the sources archival does not lag.

I discussed with the NixOS infra team and they were OK to run these scripts on the NixOS infra. There is PR #8, which would allow running this project via a systemd timer on a powerful server. So, I don't think we need to rely on GitHub Actions.

Sure. Can we keep the workflow file though? It is pretty handy to have CI for testing new developments or fixing issues before submitting a PR.

FYI, this was actually a workaround for the missing builtins.hashTo function. This is, however, now fixed: NixOS/nix#3151 and NixOS/nix#7708

This means we could set the SRI hash during the Nix evaluation. This could improve evaluation time since no processes would need to be spawned.

Note this is an improvement and doesn't need to be part of this MR.

Thanks for the pointer, I added a commit to perform hash normalizations directly in the find-tarballs.nix script. Performance of the post-process.py script is now much better.
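For illustration, the normalization that the Nix builtin now performs during evaluation amounts to re-encoding the raw digest bytes in base64 with an algorithm prefix. A Python equivalent of the hex-to-SRI case (one of several input encodings; this is a sketch of the format, not the script's actual code):

```python
import base64


def hex_to_sri(algo: str, hex_digest: str) -> str:
    """Convert a hex-encoded digest to an SRI hash string, i.e.
    "<algo>-<base64 of the raw digest bytes>"."""
    raw = bytes.fromhex(hex_digest)
    return f"{algo}-{base64.b64encode(raw).decode('ascii')}"
```

For example, the sha256 of an empty input, e3b0c442...b855 in hex, becomes sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= in SRI form.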

@nlewo
Member

nlewo commented Jul 19, 2025

Then, triggering weekly or monthly listings of unstable nixpkgs should ensure the sources archival does not lag.

Initially, my plan was to generate these files four times per day because Hydra evaluated nixpkgs every 6 hours. It seems it is now evaluating nixpkgs every 36 hours. So, if we don't want to miss any artifact pushed to the NixOS binary cache, we could generate the JSON file every day.

Regarding the ingestion of older releases, we could expose them manually and maybe manually trigger an SWH lister run. But that's more on your side. I think we could also discuss this again later, in order to focus on ingesting the current releases.

Sure. Can we keep the workflow file though ? This is pretty handy to have CI for testing new developments or fixing issues before submitting a PR.

OK, I definitely agree on running CI on each PR. When I'm working on this project, I actually only consider nixpkgs.hello instead of nixpkgs. This doesn't require a lot of memory but allows validating all the scripts. Would it be acceptable for you to run the CI on this subset?
(I think we would then no longer need the trick to get more RAM, and the CI would run faster.)

I added a commit to perform hash normalizations directly in the find-tarballs.nix script. Performance of the post-process.py script is now much better.

wow, cool! Thank you!

@nlewo
Member

nlewo commented Jul 20, 2025

@anlambert I pushed a commit with a flake into your branch. This also adds the --testing option to the script, allowing to evaluate only nixpkgs.hello, which is pretty fast.

In the CI, we would then have to run:

nix run .#nixpkgs-swh-generate -- --testing /tmp/swh unstable

(This would also allow to easily and quickly run what the CI is executing.)

Also, I finally ran your branch on the whole nixpkgs and it works nicely! Thank you very much!

This also introduces a `--testing` argument which can be used to
validate all scripts without having to evaluate the whole nixpkgs.
Comment thread scripts/post-process.py Outdated
@nlewo
Member

nlewo commented Jul 20, 2025

I added the systemd service to my small personal server (with testing = true;) and it correctly generated all files: http://nixpkgs-swh.abesis.fr/.

(So, once this MR is merged, I could ask the NixOS infra team to run it on the NixOS infra.)

Comment out the print instruction for successful HTTP requests but keep
the one for failed requests.
Use the nix run command to execute the pipeline extracting sources info
of NixOS packages.

Execute the pipeline in testing mode, considering only nixpkgs.hello,
for faster execution and lower memory consumption.

Drop the upload to GitHub Pages and prefer to simply display the
produced JSON file, as it is quite small.
@anlambert
Contributor Author

Initially, my plan was to generate these files four times per day because Hydra evaluated nixpkgs every 6 hours. It seems it is now evaluating nixpkgs every 36 hours. So, if we don't want to miss any artifact pushed to the NixOS binary cache, we could generate the JSON file every day.

On the SWH side, we will have to run the lister every day then. As we exploit the Last-Modified HTTP header value from the remote nix cache responses, loading tasks to archive sources will not be recreated if we already encountered them in a previous listing.
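That deduplication amounts to remembering the Last-Modified value seen per artifact URL and only creating a loading task when it changes. A minimal sketch under that assumption (the in-memory state dict is illustrative; the real lister persists its state):

```python
def new_loading_tasks(artifacts, seen_last_modified):
    """Return the URLs of artifacts whose Last-Modified header changed
    since the previous listing.

    `artifacts` is an iterable of (url, last_modified) pairs;
    `seen_last_modified` maps url -> last observed value and is
    updated in place (a stand-in for the lister's persisted state).
    """
    tasks = []
    for url, last_modified in artifacts:
        if seen_last_modified.get(url) != last_modified:
            seen_last_modified[url] = last_modified
            tasks.append(url)
    return tasks
```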

Regarding the ingestion of older releases, we could expose them manually and maybe manually trigger an SWH lister run. But that's more on your side. I think we could also discuss this again later, in order to focus on ingesting the current releases.

Ack, it should not be too complicated on our side to create these one-shot listing tasks for older releases. We could host the JSON files ourselves but I think it would be better to target an HTTP server managed by nix.

@anlambert I pushed a commit with a flake into your branch. This also adds the --testing option to the script, allowing to evaluate only nixpkgs.hello, which is pretty fast.

In the CI, we would then have to run:

nix run .#nixpkgs-swh-generate -- --testing /tmp/swh unstable

(This would also allow to easily and quickly run what the CI is executing.)

Awesome, thanks a lot! I have updated the GitHub Actions workflow to use the command above, and I also removed the deployment to GitHub Pages to ensure the workflow can be successfully executed from any fork of this repository.

@nlewo
Member

nlewo commented Jul 24, 2025

Thank you!
So, let's merge this MR. I will discuss with the NixOS infra team to integrate this project into the NixOS infra and keep you updated.

@nlewo nlewo merged commit ae97b4f into nix-community:master Jul 24, 2025
@nlewo
Member

nlewo commented Sep 7, 2025

@anlambert FYI, I submitted NixOS/infra#830 to deploy it in the NixOS infra.
