Skip to content

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Oct 12, 2025

This currently includes #4930 (and serves as a test for it). Draft until that one is merged.

Inspired by the discussion in #4905.

In case early stage of runc init (nsenter) fails for some reason, it
logs error(s) with FATAL log level, via bail().

The runc init log is read by a parent (runc create/run/exec) and is
logged via normal logrus mechanism, which is all fine and dandy, except
when runc init fails, we return the error from the parent (which is
usually not too helpful, for example):

runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

Now, the actual underlying error is from runc init and it was logged
earlier; here's how full runc output looks like:

FATA[0000] nsexec-1[3247792]: failed to unshare remaining namespaces: No space left on device
FATA[0000] nsexec-0[3247790]: failed to sync with stage-1: next state: Success
ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

The problem is, upper level runtimes tend to ignore everything except
the last line from runc, and thus error reported by e.g. docker is not
very helpful.

This patch tries to improve the situation by collecting FATAL errors
from runc init and appending those to the error returned (instead of
logging). With it, the above error will look like this:

ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF; runc init error(s): nsexec-1[141549]: failed to unshare remaining namespaces: No space left on device; nsexec-0[141547]: failed to sync with stage-1: next state: Success

Yes, it is long and ugly, but at least the upper level runtime will
report it.


In addition, this slightly improves error reporting in nsexec itself
(see commits prefixed with "libct/nsenter:" for details).

@kolyshkin kolyshkin force-pushed the better-init-errors branch 2 times, most recently from 08fb065 to 0200b76 Compare October 13, 2025 19:01
@kolyshkin kolyshkin marked this pull request as draft October 13, 2025 22:41
@kolyshkin kolyshkin marked this pull request as ready for review October 14, 2025 00:05
@kolyshkin kolyshkin marked this pull request as draft October 14, 2025 18:47
@kolyshkin kolyshkin marked this pull request as ready for review October 15, 2025 23:02
@kolyshkin kolyshkin requested review from AkihiroSuda, cyphar, lifubang and rata and removed request for cyphar October 15, 2025 23:02
Since sane_kill after a failed read or write, but before reporting the
error from that read or write, it may change the errno value in case
kill(2) fails.

Save and restore the errno around the call to kill.

Signed-off-by: Kir Kolyshkin <[email protected]>
We use bail to report fatal errors, and bail always append %m
(aka strerror(errno)). In case an error condition did not set
errno, the log message will end up with ": Success" or an error
from a stale errno value. Either case is confusing for users.

Introduce bailx which is the same as bail except it does not
append %m, and use it where appropriate. The naming follows
libc's err(3) and errx(3).

PS we still use bail in a few cases after read or write, even
if that read/write did not return an error, because the code
does not distinguish between short read/write and error (-1).
This will be addressed by the next commit.

Signed-off-by: Kir Kolyshkin <[email protected]>
Introduce and use CHECK_IO and CHECK_IO_KILL macros so that we can
call either bail or bailx on error, depending on read/write return.

This prevents the "Success" prefix in errors like:

	failed to sync with stage-1: next state: Success

Signed-off-by: Kir Kolyshkin <[email protected]>
In case early stage of runc init (nsenter) fails for some reason, it
logs error(s) with FATAL log level, via bail().

The runc init log is read by a parent (runc create/run/exec) and is
logged via normal logrus mechanism, which is all fine and dandy, except
when `runc init` fails, we return the error from the parent (which is
usually not too helpful, for example):

	runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

Now, the actual underlying error is from runc init and it was logged
earlier; here's how full runc output looks like:

	FATA[0000] nsexec-1[3247792]: failed to unshare remaining namespaces: No space left on device
	FATA[0000] nsexec-0[3247790]: failed to sync with stage-1: next state: Success
	ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

The problem is, upper level runtimes tend to ignore everything except
the last line from runc, and thus error reported by e.g. docker is not
very helpful.

This patch tries to improve the situation by collecting FATAL errors
from runc init and appending those to the error returned (instead of
logging). With it, the above error will look like this:

	ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF; runc init error(s): nsexec-1[141549]: failed to unshare remaining namespaces: No space left on device; nsexec-0[141547]: failed to sync with stage-1: next state: Success

Yes, it is long and ugly, but at least the upper level runtime will
report it.

Signed-off-by: Kir Kolyshkin <[email protected]>
} while(0)

/* bailx is the same as bail, except it does not add ": %m" (errno). */
#define bailx(fmt, ...) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

I think the first three commits are bug fixes — could you please move them to a separate PR?

void write_log(int level, const char *format, ...) __attribute__((format(printf, 2, 3)));

extern int logfd;
#define bail(fmt, ...) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add an internal macro to reduce code duplication?
For example:

/**
 * Write a fatal error message to stderr or logfd.
 *
 * This internal macro handles the common logic of output destination
 * selection based on the value of logfd.
 */
#define __log_fatal(fmt, ...)                                      \
    do {                                                           \
        if (logfd < 0)                                             \
            fprintf(stderr, "FATAL: " fmt "\n", ##__VA_ARGS__);   \
        else                                                       \
            write_log(FATAL, fmt, ##__VA_ARGS__);                  \
    } while (0)

/**
 * Terminate the program with a fatal error message including errno.
 *
 * Use this macro when a system call fails and you want to include
 * the corresponding strerror(errno) message via %m.
 *
 * Example: if (fork() < 0) bail("failed to fork");
 */
#define bail(fmt, ...)           \
    do {                         \
        __log_fatal(fmt ": %m", ##__VA_ARGS__); \
        exit(1);                 \
    } while (0)

/**
 * Terminate the program with a fatal error message without errno.
 *
 * Use this macro for configuration errors, programming errors, or any
 * condition not related to errno. This is the same as bail(), except
 * it does not append ": %m" (errno description).
 *
 * Example: if (!app) bailx("mapping tool not present");
 */
#define bailx(fmt, ...)          \
    do {                         \
        __log_fatal(fmt, ##__VA_ARGS__); \
        exit(1);                 \
    } while (0)

* CHECK_IO_KILL is a variant of CHECK_IO that kills PIDs before bailing.
* Use this when you need to kill child process(es) on I/O failure.
*/
#define CHECK_IO_KILL(op, fd, buf, count, pid1, pid2, ...) \
Copy link
Member

@lifubang lifubang Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my personal opinion, I'd rather use a function to replace these two macros to improve readability. For example:

void check_io(int ret, size_t size, char *err_msg, pid_t stage1_pid, pid_t stage2_pid) {
    if (ret != size) {
        sane_kill(stage1_pid, SIGKILL);
        sane_kill(stage2_pid, SIGKILL);
        if (ret < 0)
            bail(err_msg);
        bailx("%s: %d byte(s), expected: %d byte(s)", err_msg, ret, size);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants