Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: fsck repair #345

Closed
wants to merge 2 commits into from
Closed

Conversation

krnowak
Copy link
Contributor

@krnowak krnowak commented Jun 15, 2016

This is something I quickly hacked up, since I haven't seen any possibility to tell ostree to fetch the missing objects after doing ostree fsck --delete.

For now it only tries to download content objects, as doing that with metadata objects seems to be a bit more involved. And in my case all the corrupted objects were actually content objects.

WDYT?

@rh-atomic-bot
Copy link

Can one of the admins verify this patch?
I understand the following comments:

  • bot, add author to whitelist
  • bot, test pull request
  • bot, test pull request once

@cgwalters
Copy link
Member

What if we stored the remote we retrieved a commit from as a user xattr, and when we found a missing object, we look at the commits via which it's reachable and use those by default for re-pulls?

@cgwalters
Copy link
Member

I've been thinking about storing things like information about the TLS exchange as user xattrs too, so we can say in ostree admin status whether or not the commit was retrieved securely (ideally pinned TLS) etc.

@krnowak
Copy link
Contributor Author

krnowak commented Jun 16, 2016

The idea is sounds alright to me, but this information might get stale (because remote was removed/renamed), so it might be a good idea to still be able to provide a flag like --repair-from-remote, but maybe under different name (like --fallback-repair-remote or something), which would make ostree fsck to fall back to the given remote if there is something wrong with the remote from xattrs.

Also, are xattrs also checksummed? I guess they are, but only for file objects, so we should be fine here, because it won't alter commit's checksum, right?

Also, what about corrupted commit object? Can we trust user xattrs in that case?

Another thing needed for this PR would be exposing an API for fetching and writing a single object of a given type and checksum, something like:

_OSTREE_PUBLIC
gboolean ostree_repo_pull_one_object (OstreeRepo *repo,
                                      const gchar *remote,
                                      const gchar *checksum,
                                      OstreeObjectType type,
                                      GVariant *options,
                                      GCancellable *cancellable,
                                      GError **error);

That way, libsoup usage would be hidden again in the library and there wouldn't be any need for exposing a function like ostree_get_relative_object_path

@cgwalters
Copy link
Member

How about a new variant of the pull API which supported fetching multiple non-commit objects? Or just even extend the current API to support appending e.g. .file to the checksum to fetch. That makes some things more natural, like recursing on .dirtree objects.

@cgwalters
Copy link
Member

cgwalters commented Jun 16, 2016

What I've hit in the past is .file objects being corrupted by programs traversing out to /sysroot or the like, or people who mount -o remount,rw /usr and then proceed to use cp /path/to/foo /usr/bin/foo which uses O_TRUNC and hence corrupts. Corruption of metadata quickly gets into a "what can I trust" heuristic problem, same thing with filesystems.

But I'd like to know more about your situation - what case are you trying to fix? Does your use case e.g. include filesystem-level corruption?

@cgwalters
Copy link
Member

I agree though ultimately we'd need the ability to fetch from arbitrary (or all) remotes like your current code has, not just to handle the user removing them, but also because we should support current systems without xattr metadata.

@krnowak
Copy link
Contributor Author

krnowak commented Jun 16, 2016

How about a new variant of the pull API which supported fetching multiple non-commit objects? Or just even extend the current API to support appending e.g. .file to the checksum to fetch. That makes some things more natural, like recursing on .dirtree objects.

Something like "objects" option to ostree_pull_requests_with_options which would be a checksum-to-object-type map? I guess this would make the pull code to fetch only these objects (and maybe subobjects in case of a metadata object type) instead of a latest commit.

What I've hit in the past is .file objects being corrupted by programs traversing out to /sysroot or the like, or people who mount -o remount,rw /usr and then proceed to use cp /path/to/foo /usr/bin/foo which uses O_TRUNC and hence corrupts. Corruption of metadata quickly gets into a "what can I trust" heuristic problem, same thing with filesystems.

But I'd like to know more about your situation - what case are you trying to fix? Does your use case e.g. include filesystem-level corruption?

Ah, actually I'm not sure what was the source of the corruption… I'm running EndlessOS on a VM and there was a script changing the environment for development purposes, it breaks out of ostree chroot after a reboot IIUC. I also did some modifications in the system, but that was outside ostree. In the end, it was just 12 corrupted objects from 120k+. So I'd say that my use-case was "just do something to fix the thing, so I can make a copy of the ostree repository to a pendrive".

I agree though ultimately we'd need the ability to fetch from arbitrary (or all) remotes like your current code has, not just to handle the user removing them, but also because we should support current systems without xattr metadata.

Right, that too.

@cgwalters
Copy link
Member

If this is for development, have you looked at ostree admin unlock?

@krnowak
Copy link
Contributor Author

krnowak commented Jun 16, 2016

If this is for development, have you looked at ostree admin unlock?

Ah, no. I simply followed the instructions that turned the Endless OS VM into a developer machine.

@cgwalters
Copy link
Member

Probably the best is to use ostree_object_name_serialize() and what we pass is an array of those, i.e. a(su). I think the static deltas code does this in some places.

@rh-atomic-bot
Copy link

☔ The latest upstream changes (presumably 23049bb) made this pull request unmergeable. Please resolve the merge conflicts.

@cgwalters cgwalters added the WIP label Jul 6, 2016
@rh-atomic-bot
Copy link

Can one of the admins verify this patch?
I understand the following comments:

  • bot, add author to whitelist
  • bot, test pull request
  • bot, test pull request once

@cgwalters
Copy link
Member

If anyone is in this situation and wants to resync the repo, here's an example of how to do it manually. Let's delete the object for /usr/bin/bash (note this won't affect my deployment, just unlink the object in the repo, so you can safely do it live):

# atomic host status
State: idle
Deployments:
● centos-atomic-continuous:centos-atomic-host/7/x86_64/devel/continuous
       Version: 7.2016.832 (2016-11-15 14:36:43)
    BaseCommit: aa15f7eb597898638a8eb6e3973c20847e9cf80381a1b5e4cf968ffd8a4f49e1
        Commit: 0b86e4cf57f24d69711ce6f59703f824d828ccabd9e12708842004b89fe6c286
        OSName: centos-atomic-host
      Packages: flannel fuse-sshfs git
# ostree ls -C aa15f7eb597898638a8eb6e3973c20847e9cf80381a1b5e4cf968ffd8a4f49e1 /usr/bin/bash
-00755 0 0 960392 86b2316d6dc44af3706ae2abc807ce0d32584fbae51d84627344c8b6ce2e7b2c /usr/bin/bash
# rm /ostree/repo/objects/86/b2316d6dc44af3706ae2abc807ce0d32584fbae51d84627344c8b6ce2e7b2c.file
# ostree fsck
...
Object missing: 86b2316d6dc44af3706ae2abc807ce0d32584fbae51d84627344c8b6ce2e7b2c.file
# touch /ostree/repo/state/aa15f7eb597898638a8eb6e3973c20847e9cf80381a1b5e4cf968ffd8a4f49e1.commitpartial
# ostree pull centos-atomic-continuous:aa15f7eb597898638a8eb6e3973c20847e9cf80381a1b5e4cf968ffd8a4f49e1

1 metadata, 1 content objects fetched; 455 KiB transferred in 1 seconds       
# ls -al /ostree/repo/objects/86/b2316d6dc44af3706ae2abc807ce0d32584fbae51d84627344c8b6ce2e7b2c.file 
-rwxr-xr-x. 1 root root 960392 Jan  1  1970 /ostree/repo/objects/86/b2316d6dc44af3706ae2abc807ce0d32584fbae51d84627344c8b6ce2e7b2c.file

@cgwalters
Copy link
Member

I think a prereq here is for fsck to whenever it is in --delete mode, find all commits which refer to the object and mark them as partial.

@klausenbusk
Copy link

I think a prereq here is for fsck to whenever it is in --delete mode, find all commits which refer to the object and mark them as partial.

Any update on this? Is there anyway I can workaround this for now in some automatic way?

@alexlarsson
Copy link
Member

I think a prereq here is for fsck to whenever it is in --delete mode, find all commits which refer to the object and mark them as partial.

So, i looked at this a bit and it is tricky to do with the current public ostree_repo_traverse_commit () code, because it assumes that the reachable hash table is a GVariant to GVariant map, but we need to add some back-pointer to it so that we can track which commits an object came from.

But, besides that, the whole fsck operation seems to depend on the ostree_repo_traverse* methods successfully traversing all reachable objects. If one of the dirmeta objects are broken or missing, then ostree_repo_traverse_commit_union() will fail and we will never ever start fsck:ing the objects.

alexlarsson added a commit to alexlarsson/ostree that referenced this pull request Apr 6, 2018
This means we can later use various operations to heal the repository
because ostree does not assume all objects are there.

This the begining of a fix for ostreedev#345
cgwalters pushed a commit to alexlarsson/ostree that referenced this pull request Apr 13, 2018
This means we can later use various operations to heal the repository
because ostree does not assume all objects are there.

This the begining of a fix for ostreedev#345
rh-atomic-bot pushed a commit that referenced this pull request Apr 14, 2018
This means we can later use various operations to heal the repository
because ostree does not assume all objects are there.

This the begining of a fix for #345

Closes: #1533
Approved by: cgwalters
rfairley pushed a commit to rfairley/ostree that referenced this pull request Apr 17, 2019
add depcheck() to verify RPMs installed
@openshift-ci-robot
Copy link
Collaborator

@krnowak: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krnowak
Copy link
Contributor Author

krnowak commented Apr 15, 2021

Not going to pick this up anytime soon, so closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants