
Support for kernel builds that are not a single binary #58

Open
veruu opened this issue Nov 23, 2022 · 13 comments

veruu commented Nov 23, 2022

In some distributions, such as RPM-based ones (e.g. Fedora, CentOS Stream), the result of the build usually isn't a single binary but multiple RPM packages. While these could be reported separately, doing so is very inconvenient and, because the packages depend on each other, doesn't allow easy system installation. Instead, the resulting packages are distributed in a yum/dnf repository, which can be either a browsable location or a link consumable by yum/dnf.

If KCIDB wishes to properly support distribution kernels, output_files for a kernel build should support this use case, where the link doesn't point to a single binary but to a directory that may be usable by either a human or a machine.

spbnick commented Nov 24, 2022

Thank you for a nicely-worded and detailed issue!

This is what I think we could do:

  • Add support for another, optional field in the resource schema: resources, resource_list, file_list, or files.
  • Make that field contain a nested list of "resources".
  • If that field is present for a resource, the resource is considered a directory: its name field is the name of that directory, its url is considered to point to a UI page listing the directory's contents, and the field itself holds the list of files in that directory.
  • Limit the nesting depth to e.g. three levels (would that work for CKI/dnf repos, @veruu?).
  • When mirroring such a "resource tree", do not mirror URLs of "resources" containing the field, but instead (optionally) set them to point to a new page listing the mirrored files underneath it.
  • When triaging, ignore URLs (contents) of the resources with the new field.

So, a subset of output files for this build could look something like this:

{
    "output_files": [
        {
            "name": "cki-kernel.repo",
            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/cki-kernel.repo"
        },
        {
            "name": "cki-kernel.dnf",
            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/index.html?prefix=trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/",
            "resources": [
                {
                    "name": "repodata",
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/index.html?prefix=trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/repodata/",
                    "resources": [
                        {
                            "name": "repomd.xml",
                            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/repodata/repomd.xml"
                        }
                    ]
                },
                {
                    "name": "kernel-5.14.0-201.mr1687_221124_0516.el9.src.rpm",
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/kernel-5.14.0-201.mr1687_221124_0516.el9.src.rpm"
                },
                {
                    "name": "kernel-headers-5.14.0-201.mr1687_221124_0516.el9.x86_64.rpm",
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/kernel-headers-5.14.0-201.mr1687_221124_0516.el9.x86_64.rpm"
                }
            ]
        }
    ]
}

veruu commented Nov 24, 2022

This very much raises the bar of entry for submitters 🙈 having to loop over the nested directories and list everything in them, when the simplest option would be to link only the top level and treat everything under it as it is.

From a technical standpoint it wouldn't work either, as not all dnf repos are browsable or have an index page. That's not something yum/dnf requires, and CKI doesn't provide them either. We do provide a different, browsable location (for human use), but that one in turn isn't usable by yum/dnf.

spbnick commented Nov 24, 2022

This very much raises the bar of entry for submitters 🙈 having to loop over the nested directories and list everything in them, when the simplest option would be to link only the top level and treat everything under it as it is.

Most submitters don't have to report nested directories; CKI is the only one so far, and if you don't, there's no additional complexity. And, again, linking just to the human-targeted "index" page wouldn't work for archival. However, with this schema, if you don't really want to have the yum/dnf repo archived, you can just report the index page "resource" with an empty resources array. This way you still get the link to the index page across (and shown in the UI), but you don't have to go in and list the contents.
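For example, with the field names proposed above, that entry could be as small as this (just a sketch):

{
    "name": "cki-kernel.dnf",
    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/index.html?prefix=trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/",
    "resources": []
}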

From a technical standpoint it wouldn't work either, as not all dnf repos are browsable or have an index page. That's not something yum/dnf requires, and CKI doesn't provide them either. We do provide a different, browsable location (for human use), but that one in turn isn't usable by yum/dnf.

We can make the url field optional for resources which have the resources field.
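E.g. the repodata directory from the sample above could then drop the index URL it doesn't have and carry only its nested contents (again, just a sketch):

{
    "name": "repodata",
    "resources": [
        {
            "name": "repomd.xml",
            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/repodata/repomd.xml"
        }
    ]
}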

spbnick commented Nov 24, 2022

Also, AFAIK, yum/dnf don't read that "index" page; they just assume the existence of various entry files under the given "repo" URL.
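For illustration (a sketch of mine, not something the schema needs): given only the repo base URL from the sample above, the metadata entry point is a well-known path resolved against it, with no index page involved:

from urllib.parse import urljoin

# Machine-consumable repo base URL from the sample above (not the
# index.html page).
repo_base = (
    "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/"
    "703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/"
    "5.14.0-201.mr1687_221124_0516.el9.x86_64/"
)

# yum/dnf starts from the repository metadata at a fixed location relative
# to the base URL and follows the file list inside it.
print(urljoin(repo_base, "repodata/repomd.xml"))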

veruu commented Nov 28, 2022

I meant additional complexity for new submitters who'd like to start using this feature. If the idea is to also make the builds usable for reproducing results (e.g. a kernel maintainer trying to reproduce a failure), then we need a way to provide the repo (not the index page) through KCIDB.

Now, I already mentioned that I don't see the point of KCIDB triaging the build binaries and such, and I still think so. In this particular case we may be able to get away with providing a repo file and hosting the repos ourselves for a certain period of time, which would avoid the unfriendliness of having to list the full contents of the repos.

The optional URL fields for resources would definitely be a useful trick for our side 👍 That way we could still submit the repos as they are and people could use them, but they wouldn't pollute the triaging etc. from your side.

Am I missing something else we'd need to solve here?

spbnick commented Nov 29, 2022

I meant additional complexity for new submitters who'd like to start using this feature.

Well, it's not reaally that hard. Took me maybe half an hour to write this command-line tool which takes a URL base and a directory and generates the JSON:

#!/usr/bin/python3

import os
import sys
import json
from urllib.parse import urljoin

def urljoin_dirs(url_base, root_dirpath, dirpath, name):
    # Map a local path (name inside dirpath, relative to the tree root)
    # to a URL under url_base. Note: url_base is expected to end with a
    # slash so urljoin() appends paths under it.
    return urljoin(
        url_base,
        os.path.join(os.path.relpath(dirpath, root_dirpath), name)
    )

def dir2json(url_base, root_dirpath):
    # Walk the directory tree and build the nested "resources" structure.
    resources_map = {}
    root = []
    for dirpath, dirnames, filenames in os.walk(root_dirpath):
        # The resource list the current directory's entries go into.
        resources = root if dirpath == root_dirpath else resources_map[dirpath]
        # Add an entry per subdirectory with an (initially empty) nested
        # resource list, filled in when os.walk() descends into it.
        for dirname in dirnames:
            dir_resources = []
            resources_map[os.path.join(dirpath, dirname)] = dir_resources
            resources.append(dict(
                name=dirname,
                url=urljoin_dirs(url_base, root_dirpath, dirpath, dirname),
                resources=dir_resources
            ))
        # Add an entry per regular file.
        resources += [
            dict(
                name=filename,
                url=urljoin_dirs(url_base, root_dirpath, dirpath, filename)
            )
            for filename in filenames
        ]
    return root

json.dump(
    dir2json(sys.argv[1], os.path.normpath(sys.argv[2])),
    sys.stdout,
    indent=4
)
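For example, assuming it's saved as dir2json.py (the script name and URL here are just for illustration), running ./dir2json.py https://example.org/artifacts/repo/ ./repo would print the nested resources JSON for a local ./repo directory served under that base URL.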

We could supply people with something like this as a sample and solve that problem.

If the idea is to also make the builds usable for reproducing results (e.g. a kernel maintainer trying to reproduce a failure), then we need a way to provide the repo (not the index page) through KCIDB.

Yes, that's the idea.

However, as you probably know, there's no way to discover and download directory hierarchies over HTTP, only separate files. We could've done it over FTP, but people don't use thaaat these days. So we either have to express the complete hierarchy in JSON, or pack our dnf repos into archives, which defeats the convenience of just pointing directly at a "directory" hosted by KCIDB. And no, we're not going to use something like wget --recursive, because that is a whole other can of worms.

Now, I already mentioned that I don't see the point of KCIDB triaging the build binaries and such, and I still think so. In this particular case we may be able to get away with providing a repo file and hosting the repos ourselves for a certain period of time, which would avoid the unfriendliness of having to list the full contents of the repos.

No, KCIDB is not likely to triage binaries, but archiving the artifacts would probably be among its major values (as long as LF can afford it 🙈). We wouldn't be able to offer meaningful .repo files anyway, even if we archive the repos themselves, as that would require rewriting the URLs.

The optional URL fields for resources would definitely be a useful trick for our side 👍 That way we could still submit the repos as they are and people could use them, but they wouldn't pollute the triaging etc. from your side.

Great 😁 👍

Am I missing something else we'd need to solve here?

Nooo, I think that's probably it.

spbnick commented Nov 29, 2022

One other thing I'm thinking about, though, is redoing the schema to ensure that names are unique. The current schema was based on what BigQuery could do with "repeating records", but since we'd have to forgo that anyway for materialized views and for other databases, we might as well make the schema better, so that the sample above would look something like this:

{
    "output_files": {
        "cki-kernel.repo": {
            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/cki-kernel.repo"
        },
        "cki-kernel.dnf": {
            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/index.html?prefix=trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/",
            "resources": {
                "repodata": {
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/index.html?prefix=trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/repodata/",
                    "resources": {
                        "repomd.xml": {
                            "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/repodata/repomd.xml"
                        }
                    }
                },
                "kernel-5.14.0-201.mr1687_221124_0516.el9.src.rpm": {
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/kernel-5.14.0-201.mr1687_221124_0516.el9.src.rpm"
                },
                "kernel-headers-5.14.0-201.mr1687_221124_0516.el9.x86_64.rpm": {
                    "url": "https://s3.amazonaws.com/arr-cki-prod-trusted-artifacts/trusted-artifacts/703863786/publish%20x86_64%20debug/3372589884/artifacts/repo/5.14.0-201.mr1687_221124_0516.el9.x86_64/kernel-headers-5.14.0-201.mr1687_221124_0516.el9.x86_64.rpm"
                }
            }
        }
    }
}

With this approach we could preserve the original filenames in KCIDB archives more easily and avoid a few issues.

veruu commented Nov 29, 2022

One other thing I'm thinking about, though, is redoing the schema to ensure that names are unique.

From the perspective of users (people looking at the data/results), testers, and CI developers, this proposal is really counterproductive 😞

It exposes test-system details to users, who'd have to understand all of them to figure out which file is the important one to check. This is already a problem that comes up when interacting with developers and maintainers.

Also, test systems can create multiple files that have the same standard name but different full paths (e.g. subresults of a single test in Beaker). Forcing unique filenames would mean either breaking the test system's expectations, or the CI system having to implement arbitrary renames and moves when submitting the results. It's not about how hard it is to write the code that does that; it's about every submitter having to keep this requirement in mind and do the extra work, only to end up with users refusing to check files with unrecognizable names.

Being able to keep the paths as they are while giving them a user-friendly name as a pointer is really useful for all involved parties, and it's a shame that this genuinely useful feature (a friendly name pointing at whatever the test system picked) is the one thing you're trying to get rid of.

spbnick commented Nov 30, 2022

Hey, I'm not going to do it if it's a problem for you, no worries 😁 Let's try to work something out.

First of all, before throwing more arguments at you, could you perhaps share what exact problem this (array->dict) change would prevent you from solving?

veruu commented Nov 30, 2022

First of all, before throwing more arguments at you, could you perhaps share what exact problem this (array->dict) change would prevent you from solving?

Which array->dict change? 🙈 I don't think there was any place where that part would be a problem, but at this point of the discussion I may be lost.

spbnick commented Nov 30, 2022

Ha-ha, perfect 😂 Sorry! Could we try to talk it over in a small meeting?
I spent a good chunk of time today trying to come up with a good response, and I could still do that, but I think we could save some time if we just talk. This doesn't seem urgent, so we could pick a comfortable slot whenever we feel like it.

veruu commented Nov 30, 2022

Sure! How about some time tomorrow, maybe 1PM my time? Tomorrow is otherwise kinda busy with meetings, and then we're on PTO, so that would probably work best.

spbnick commented Dec 1, 2022

OK, after the meeting we came to an agreement regarding the following:

  • The schema using filenames as keys in dictionaries of "resources" is good to go in general (some field names might change).
  • The url field will have to be optional if there's a resources field in the same entry. If present, it points to a page describing the files in the resources field.
  • If the recipient of the data copies the linked resources (e.g. upstream KCIDB archiving resources), it ignores those particular url fields, and only copies resources linked from the entries without resources fields. It is then free to generate its own index pages and link them using the url fields alongside the resources fields.
  • The automatic upgrade from the resource-lists schema to a future resource-dictionaries schema will mangle any duplicate filenames, e.g. by adding a numeric counter to the filename in front of the first "extension" (see the sketch below).
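For illustration, here's a quick sketch of that last mangling rule (my own take, not the final implementation):

def dedup_name(name, taken):
    """Return name unchanged if it's not in the taken set, otherwise a
    variant with a numeric counter inserted in front of the first
    "extension" (everything after the first dot)."""
    if name not in taken:
        return name
    stem, dot, rest = name.partition(".")
    counter = 1
    while True:
        candidate = f"{stem}.{counter}{dot}{rest}" if dot else f"{stem}.{counter}"
        if candidate not in taken:
            return candidate
        counter += 1

print(dedup_name("dmesg.log.gz", {"dmesg.log.gz"}))   # -> dmesg.1.log.gz
print(dedup_name("README", {"README", "README.1"}))   # -> README.2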
