Update scripts to gather sources info of nix packages #10
nlewo merged 16 commits into nix-community:master
Conversation
Detect if a source comes from a VCS and set a type attribute accordingly. Extract the VCS revision targeting the source of a package. Extract some parameters related to sources coming from git. Extract the nix store path of the source derivation.
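A minimal sketch of what such a type detection could look like. The field names (`urls`, `rev`, `type`, `git_ref`) are illustrative assumptions, not the script's actual attribute names:

```python
# Hypothetical sketch: classify a package source extracted from the
# nix-instantiate JSON output. Field names ("urls", "rev", "type",
# "git_ref") are assumptions for illustration only.
def detect_source_type(source: dict) -> dict:
    urls = source.get("urls", [])
    url = urls[0] if urls else ""
    if url.startswith("git://") or url.endswith(".git"):
        source["type"] = "git"
        # VCS sources also carry the revision targeted by the package
        if "rev" in source:
            source["git_ref"] = source["rev"]
    elif url.startswith("svn://"):
        source["type"] = "svn"
    else:
        # plain tarball or simple file
        source["type"] = "url"
    return source

example = {"urls": ["https://example.org/repo.git"], "rev": "abc123"}
print(detect_source_type(example)["type"])  # git
```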
The script processes the JSON output of the nix-instantiate call gathering info about nix package sources. Improve formatting with black. It has been updated to:
- fix the call to nix-hash
- remove duplicates in the input sources list
- make the hashes normalization more robust
- compute new fields expected by the SWH nixguix lister
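The deduplication and SRI hash normalization steps above can be sketched roughly as follows; the `integrity` and `urls` field names are assumptions, and the real pipeline handles nix's base32 encoding via nix-hash / `builtins.convertHash` rather than this hex-only helper:

```python
import base64

def hex_to_sri(hex_digest: str, algo: str = "sha256") -> str:
    """Normalize a hex-encoded hash to SRI form (algo-base64digest).
    Sketch only: hex input assumed; nix's own base32 encoding is
    handled by nix-hash / builtins.convertHash in the real pipeline."""
    raw = bytes.fromhex(hex_digest)
    return f"{algo}-{base64.b64encode(raw).decode()}"

def dedupe_sources(sources: list) -> list:
    """Drop duplicate entries from the input sources list, keyed by
    their (hash, urls) pair. Field names are assumptions."""
    seen, out = set(), []
    for s in sources:
        key = (s.get("integrity"), tuple(s.get("urls", [])))
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

# sha256 of empty input, in hex
EMPTY = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hex_to_sri(EMPTY))  # sha256-47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
```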
Remove the filtering of sources to keep only tarballs, as SWH can now also handle sources coming from a VCS (git, hg, svn). Add the --show-trace option to nix-instantiate to ease debugging. Call post-process.py after the renaming of add-sri.py.
The generate script can work successfully without it, so it is better to remove it.
Apply black formatting. Ensure no errors are raised after the update of the sources data.
Use aiohttp to greatly optimize the performance of fetching a large number of small text files through HTTP requests.
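A minimal sketch of the concurrent-fetch pattern; a stub coroutine stands in for the real `aiohttp` session call so the example is self-contained, and the semaphore bound is an assumed detail:

```python
import asyncio

async def fetch_narinfo(store_hash: str) -> str:
    # Stand-in for the real aiohttp call, roughly:
    #   async with session.get(f"https://cache.nixos.org/{store_hash}.narinfo") ...
    await asyncio.sleep(0)
    return f"StorePath: /nix/store/{store_hash}-src"

async def fetch_all(hashes, limit: int = 100):
    # Bound the number of in-flight requests (limit value is an assumption)
    sem = asyncio.Semaphore(limit)

    async def bounded(h):
        async with sem:
            return await fetch_narinfo(h)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(h) for h in hashes))

results = asyncio.run(fetch_all(["aaa", "bbb"]))
print(results[0])  # StorePath: /nix/store/aaa-src
```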
Use asyncio to greatly improve the performance of subprocess calls to the nix-hash command.
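The subprocess side can be sketched with `asyncio.create_subprocess_exec`; here `echo` stands in for the real nix-hash invocation so the example runs anywhere, and the concurrency limit is an assumed value:

```python
import asyncio

async def run_hash(path: str, sem: asyncio.Semaphore) -> str:
    async with sem:
        # Real code would spawn something like:
        #   "nix-hash", "--type", "sha256", path
        proc = await asyncio.create_subprocess_exec(
            "echo", path,
            stdout=asyncio.subprocess.PIPE,
        )
        out, _ = await proc.communicate()
        return out.decode().strip()

async def hash_all(paths, limit: int = 32):
    # Cap the number of concurrently running subprocesses
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(run_hash(p, sem) for p in paths))

hashes = asyncio.run(hash_all(["a", "b"]))
print(hashes)  # ['a', 'b']
```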
nlewo left a comment
Hello!
Thank you for all these improvements. I tried to resurrect this project a few months ago but failed because of some evaluation issues in nixpkgs. It seems they are now fixed.
I still need to run this locally before merging but this looks almost ready to get merged.
What is the purpose of the commit a526639? I would prefer to avoid committing such big files, to preserve the size of this repository.
I discussed with the NixOS infra team and they were OK to run these scripts on the NixOS infra. There is PR #8, which would allow running this project via a systemd timer on a powerful server. So, I don't think we need to rely on GitHub Actions.
I should be able to run it locally during this weekend.
filteredSources.append(source)
...
async def get(source, session):
I'm wondering how sustainable it is to fetch all narinfo files at each sources.json generation.
Do you think we could add a caching mechanism? This could come in a follow-up PR.
I do not think we need to implement a cache, as most of the narinfo files are already cached by Varnish.
$ curl -I https://cache.nixos.org/0l30f0az52ygfy40w45hg7qmailhjdyn.narinfo
HTTP/2 200
last-modified: Mon, 07 Jul 2025 10:54:20 GMT
etag: "58f29ebe3b456459030fc045f0b5094d"
x-amz-server-side-encryption: AES256
content-type: text/x-nix-narinfo
server: AmazonS3
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
age: 2032
date: Tue, 15 Jul 2025 13:35:57 GMT
x-served-by: cache-iad-kjyo7100035-IAD, cache-lcy-egml8630025-LCY
x-cache: HIT, HIT
x-cache-hits: 7, 0
access-control-allow-origin: *
content-length: 621
Considering the performance of the asynchronous narinfo fetches (a couple dozen seconds to fetch several tens of thousands of files), I guess there are only a few cache misses.
By the way, I ran it locally (only on a nixpkgs subset) and it seems to be working well!
With that workflow, unstable sources of nixpkgs are computed every day at midnight, then deployed to GitHub Pages.
Use the builtin nix function convertHash to normalize hashes instead of spawning numerous nix-hash processes from the post-process.py script; this greatly improves pipeline execution performance.
Force-pushed from 5d167bf to 9532117
Hi @nlewo and thanks for the review!
This was mainly for testing the updated nixguix lister/loader with various NixOS releases, by uploading those files to my GitHub Pages. I removed the commit adding them from this PR, but I kept the files in a separate branch on my fork to avoid generating them again. Ideally, once all changes to the nixguix lister/loader are merged on the SWH side, we should try to archive the sources of all past NixOS releases by triggering one-shot listings using these files as input to the lister. This means they must be uploaded to a public HTTP server for SWH to fetch them. Then, triggering weekly or monthly listings on unstable nixpkgs should ensure the sources archival does not lag.
Sure. Can we keep the workflow file though? It is pretty handy to have CI for testing new developments or fixing issues before submitting a PR.
Thanks for the pointer, I added a commit to perform hash normalizations directly in the Nix expression using the builtin `convertHash` function.
Initially, my plan was to generate these files 4 times per day, because Hydra evaluated nixpkgs every 6 hours. It seems it is now evaluating nixpkgs every 36 hours. So, if we don't want to miss any artifact pushed to the NixOS binary cache, we could generate the JSON file every day. Regarding the ingestion of older releases, we could expose them manually and maybe manually trigger a SWH lister run. But that's more on your side. I think we could also discuss this again later, in order to focus on ingesting the current releases.
OK, I definitely agree on running a CI on each PR. When I'm working on this project, I actually only consider the
Wow, cool! Thank you!
@anlambert I pushed a commit with a flake into your branch. This also adds the

In the CI, we would then have to run:

(This would also allow to easily and quickly run what the CI is executing.)

Also, I finally ran your branch on the whole nixpkgs and it works nicely! Thank you very much!
This also introduces a `--testing` argument, which can be used to validate all scripts without having to evaluate the whole nixpkgs.
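The flag could be wired up roughly like this with argparse; the help text and the idea of restricting evaluation to `nixpkgs.hello` come from the discussion above, everything else is illustrative:

```python
import argparse

# Sketch of the --testing flag; the description and attribute wiring
# are assumptions, not the project's actual CLI definition.
parser = argparse.ArgumentParser(
    description="Extract sources info of nix packages"
)
parser.add_argument(
    "--testing",
    action="store_true",
    help="evaluate only a tiny package set (e.g. nixpkgs.hello) "
         "instead of the whole nixpkgs",
)

args = parser.parse_args(["--testing"])
print(args.testing)  # True
```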
I added the systemd service to my small personal server (with

(So, once this MR is merged, I could ask the NixOS infra team to run it on the NixOS infra.)
Comment out the print statement for successful HTTP requests but keep the one for failed requests.
Use the nix run command to execute the pipeline extracting sources info of NixOS packages. Execute the pipeline in testing mode, considering only nixpkgs.hello, for faster execution and lower memory consumption. Drop the upload to GitHub Pages and prefer to simply display the produced JSON file, as it is quite small.
On the SWH side, we will have to run the lister every day then. As we exploit the
Ack, it should not be too complicated on our side to create these one-shot listing tasks for older releases. We could host the JSON files ourselves, but I think it will be better to target an HTTP server managed by the NixOS project.
Awesome, thanks a lot! I have updated the GitHub Actions workflow to use the command above, and I also removed the deployment to GitHub Pages to ensure the workflow can be successfully executed from every fork of this repository.
Thank you!
@anlambert FYI, I submitted NixOS/infra#830 to deploy it in the NixOS infra.
I am a software engineer at Software Heritage and I recently reviewed the state of the archival of NixOS source packages.
To archive the sources of nix packages, we have a dedicated nixguix lister consuming a JSON file with relevant info, used to create loading tasks that archive the sources of Guix/NixOS packages. Source code can come from tarballs, simple files, or a VCS (git, mercurial, subversion).
Last year we worked with the Guix folks to improve the archiving of their source packages; you can find more details in that blog post or this one. The key feature of that work was to map the content-addressed identifiers used by Guix (standard checksums and NAR checksums) to the SoftWare Hash IDentifiers (SWHIDs) used by Software Heritage.
We updated the loaders that ingest tarballs and files into the archive to recompute the checksums (standard or NAR) after downloading source code artifacts, to ensure the integrity of the archival. We also store the mapping between checksums and SWHIDs at the end of the loading process. This now allows Guix to fall back to fetching source code from the archive in case the upstream has vanished. To do so, they query the /api/1/extid endpoint of the SWH Web API to get a SWHID, targeting a directory or a file, from a NAR hash or a standard hash. Once the SWHID is retrieved, if it targets a directory they can request the cooking of a tarball using the /api/1/vault/flat/ endpoint to download the source code, or, if it targets a file, download it using the /api/1/content/raw/ endpoint. The implementation details of the Guix bridge with SWH can be found in that source file.
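The fallback flow can be sketched as URL construction against the SWH Web API; the endpoint prefixes come from the paragraph above, while the exact path layout (extid type names, hash prefix syntax) is an assumption to be checked against the SWH API documentation:

```python
# Sketch of the Guix-style fallback flow against the SWH Web API.
# Endpoint prefixes come from the discussion; exact path layouts are
# assumptions, not verified against the API reference.
API = "https://archive.softwareheritage.org/api/1"

def extid_lookup_url(extid_type: str, extid: str) -> str:
    # Map a NAR or standard checksum to a SWHID via the extid endpoint
    return f"{API}/extid/{extid_type}/{extid}/"

def fetch_url_for(swhid: str) -> str:
    if swhid.startswith("swh:1:dir:"):
        # Directory: ask the vault to cook a flat tarball
        return f"{API}/vault/flat/{swhid}/"
    if swhid.startswith("swh:1:cnt:"):
        # Single file: fetch the raw content by its sha1_git
        return f"{API}/content/sha1_git:{swhid.split(':')[-1]}/raw/"
    raise ValueError(f"unsupported SWHID: {swhid}")

print(fetch_url_for("swh:1:dir:" + "0" * 40))
```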
Such a fallback for downloading source code could also be added for nix packages but, as for Guix, a bridge must be implemented to communicate with the archive.
To get back to this pull request, I noticed the `sources-unstable.json` file had not been updated for four years. This means the archive is currently not synchronized with the state of nix packages. I tried to execute the `nixpkgs-swh` tool but it ended up with errors. So I started debugging it, but ended up with a lot of improvements to it, notably:

- fetch the `narinfo` file for each package in the nix HTTP cache and add it to the JSON output
- generate the `sources-unstable.json` file and upload it to GitHub Pages on a daily basis

The JSON output now contains more relevant info for a better archiving of nix source packages by SWH. Notably, having the `narinfo` file for each package allows SWH to download source code directly from the nix cache. It ensures that the computed hashes (standard checksums or NAR ones) will be correct when verifying the integrity of downloaded tarballs, as NixOS can apply some transformations to upstream sources (the `postFetch` phase).
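The `text/x-nix-narinfo` format is just "Key: value" lines, so extracting the fields relevant for archiving can be sketched as below; the sample content is illustrative, not a real cache entry:

```python
# Sketch of parsing the text/x-nix-narinfo format served by
# https://cache.nixos.org/<store-hash>.narinfo. SAMPLE is illustrative.
SAMPLE = """\
StorePath: /nix/store/0l30f0az52ygfy40w45hg7qmailhjdyn-hello-2.12
URL: nar/example.nar.xz
Compression: xz
NarHash: sha256:0000000000000000000000000000000000000000000000000000
NarSize: 196040
"""

def parse_narinfo(text: str) -> dict:
    info = {}
    for line in text.splitlines():
        # Each non-empty line is "Key: value"
        key, _, value = line.partition(": ")
        if key and value:
            info[key] = value
    return info

info = parse_narinfo(SAMPLE)
print(info["Compression"])  # xz
```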
On the SWH side, several changes were made to improve the archival of nix source packages:

- process the `narinfo` files and create loading tasks targeting NAR archives in the nix cache, see merge request

If this PR gets merged, we could catch up with the archiving lag of nix packages. Looking forward to advancing on this topic.