Bug 1708602: pkg/daemon: workaround old util-linux logger in rhel7.6 #734

runcom · 2019-05-10T13:26:50Z

- What I did

RHEL7.6 ships an old util-linux logger that doesn't support the --journald flag, so it can't write structured entries to the journal. Work around that by logging a JSON string directly as the message, and fall back to that when we're on rhel7.6.
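A minimal sketch of the two log shapes described above, not the code from this PR. The pendingState type, its field names, and both helper functions are hypothetical. It assumes that logger --journald reads KEY=value lines from stdin, and that the fallback passes a single JSON string as the plain message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pendingState is a hypothetical payload; the field names are illustrative only.
type pendingState struct {
	Message string `json:"MESSAGE"`
	BootID  string `json:"BOOT_ID"`
	Pending string `json:"PENDING"`
}

// structuredInput renders one KEY=value line per field, the input format
// that `logger --journald` reads from stdin on versions that have the flag.
func structuredInput(s pendingState) string {
	return strings.Join([]string{
		"MESSAGE=" + s.Message,
		"BOOT_ID=" + s.BootID,
		"PENDING=" + s.Pending,
	}, "\n") + "\n"
}

// fallbackMessage renders the same payload as a single JSON string that can
// be handed to an old logger as an ordinary, unstructured message.
func fallbackMessage(s pendingState) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s := pendingState{Message: "pending config", BootID: "boot-1", Pending: "1"}
	fmt.Print(structuredInput(s))
	msg, err := fallbackMessage(s)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(msg)
}
```

The structured form keeps each field queryable in the journal, while the fallback has to be parsed back out of the message text by whoever reads it.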

- How to verify it

- Description for the changelog

runcom · 2019-05-10T13:27:03Z

/hold

Depends on #733

runcom · 2019-05-10T13:31:09Z

/test e2e-rhel-scaleup

cgwalters · 2019-05-10T13:32:10Z

pkg/daemon/update.go

Eeek. Scraping all journal messages for that?

How about we detect on startup whether the host is rhel7, and add that as a flag in the daemon struct. That'd be useful for a lot of other things.

We scrape the whole journal only if --journald isn't supported, so we need to do this anyway, right? I didn't want to add a specific rhel7 flag, as I was just testing for functionality, which should generally be better. If rhel7.7 upgrades util-linux to a logger that does have --journald, we'd still be using this awful hack if we just tested RHEL versions.

If rhel7.7 upgrades util-linux to a logger that does have --journald, we'd still be using this awful hack if we just tested RHEL versions.

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

Eeek. Scraping all journal messages for that?

To reply specifically to that: with this patch, we're not always scraping the whole journal. The if !loggerWithoutJournalOption() { check tells us whether we need to do that, which is only the case when logger doesn't have --journald.

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

Yeah, agree it's very unlikely indeed, but since we don't need to play with a rhel7 flag today, isn't testing for the logger functionality in itself enough? (I'm super happy to change it anyway)

OLD_LOGGER seems a bit too generic...while it's unlikely, that string could appear in log messages from any other process.

(Ideally...we'd have the MCD logging under machine-config-daemon.service that we create on the host. That's another topic)

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ? 😉

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ?

😂 I'll go for it, that's a good one indeed to avoid collisions
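To illustrate why a long, specific marker helps, here is a rough sketch of filtering scraped journal messages. It assumes the fallback message is a JSON object that carries the marker as a key, which is an assumption and may not match the actual patch. parseLegacyMessage and the PENDING field are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyMarker is the name suggested in the thread above.
const legacyMarker = "OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK"

// parseLegacyMessage returns the decoded fields only when msg is a JSON
// object of string values that contains the marker key. Anything else,
// including ordinary log lines from other processes, is rejected.
func parseLegacyMessage(msg string) (map[string]string, bool) {
	var fields map[string]string
	if err := json.Unmarshal([]byte(msg), &fields); err != nil {
		return nil, false
	}
	if _, ok := fields[legacyMarker]; !ok {
		return nil, false
	}
	return fields, true
}

func main() {
	samples := []string{
		`{"` + legacyMarker + `":"1","PENDING":"rendered-worker-abc"}`,
		"OLD_LOGGER some unrelated message",
	}
	for _, m := range samples {
		fields, ok := parseLegacyMessage(m)
		fmt.Println(ok, fields["PENDING"])
	}
}
```

Requiring both valid JSON and the marker key makes an accidental match from another process far less likely than a bare substring search for a short string.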

cgwalters · 2019-05-10T13:34:01Z

pkg/daemon/update.go

Then in this code we wouldn't try and fall back each time, we'd just do:

if (!dn.hostRhel7) { logger --journald } else { logger with literal JSON as message }

see #734 (comment)

cgwalters · 2019-05-10T14:06:11Z

pkg/daemon/update.go

I think we should use logger --help...as running this command just hangs for me; maybe it works because we're not providing stdin?

Or better, do:

if dn.OperatingSystem == machineConfigDaemonOSRHCOS { return false }

?

uhm, does it hang on RHCOS? works fine on rhel7.6 🤔

oh yeah I get it now thanks!
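A sketch of the logger --help probe suggested above, under the assumption that a logger with the feature lists --journald in its help text. The helper names are hypothetical. Note that Go's os/exec connects the child to the null device when Cmd.Stdin is nil, which would explain why a command that waits on stdin in an interactive shell returns immediately when run from the daemon.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// helpMentionsJournald reports whether the given help text advertises
// the --journald option.
func helpMentionsJournald(helpOutput string) bool {
	return strings.Contains(helpOutput, "--journald")
}

// loggerSupportsJournal runs `logger --help`, which prints usage and exits
// without reading stdin or writing a log entry. Any failure to run the
// command is treated as "not supported" so the caller takes the fallback.
func loggerSupportsJournal() bool {
	out, err := exec.Command("logger", "--help").CombinedOutput()
	if err != nil {
		return false
	}
	return helpMentionsJournald(string(out))
}

func main() {
	fmt.Println("logger supports --journald:", loggerSupportsJournal())
}
```

Keeping the string check separate from the exec call makes the detection logic testable on hosts that have no logger binary at all.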

cgwalters · 2019-05-10T14:07:14Z

pkg/daemon/update.go

note double negative here, probably better as loggerSupportsJournal

runcom · 2019-05-10T14:17:30Z

/test e2e-rhel-scaleup

runcom · 2019-05-10T15:31:56Z

/retest

runcom · 2019-05-10T16:27:25Z

/test e2e-rhel-scaleup

kikisdeliveryservice · 2019-05-10T16:30:13Z

pkg/daemon/update.go

love this. 😆

ashcrow

I feel like the logger shell-out code should probably end up being encapsulated in its own struct/func, but for now this seems sane to fix the bug.

runcom · 2019-05-10T17:18:52Z

@vrutkovs just for awareness, not sure who the owner is, but scaleup is broken somehow

EDIT: the last failure there seems to be:

level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-05-10-173304: 100% complete, waiting on authentication: timed out waiting for the condition"
2019/05/10 18:23:30 Container setup in pod e2e-rhel-scaleup failed, exit code 1, reason Error
Waiting for scaleup to complete...Another process exited
2019/05/10 18:23:40 Container test in pod e2e-rhel-scaleup failed, exit code 1, reason Error

runcom · 2019-05-10T17:30:58Z

/test e2e-rhel-scaleup

runcom · 2019-05-10T17:45:32Z

/retest

runcom · 2019-05-10T20:03:39Z

/refresh

runcom · 2019-05-10T20:50:38Z

/test e2e-rhel-scaleup

runcom · 2019-05-10T21:04:24Z

aws limits.... will retest later...

runcom · 2019-05-11T07:48:21Z

/retest

runcom · 2019-05-11T07:50:07Z

/hold cancel

cgwalters · 2019-05-11T12:04:07Z

pkg/daemon/update.go

I still think it'd be cleaner to call loggerSupportsJournal once or so on MCD startup rather than trying and failing each time.

oh I can definitely do that if it's for loggerSupportsJournal

should be fixed
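For reference, probing once at startup could look roughly like the sketch below. The Daemon struct here is a pared-down, hypothetical stand-in rather than the real MCD type, and the probe is injected as a function so that the sketch makes no claim about how detection is actually done.

```go
package main

import "fmt"

// Daemon is a hypothetical, minimal stand-in for the MCD daemon struct.
type Daemon struct {
	// loggerSupportsJournal is filled in once, when the daemon is created.
	loggerSupportsJournal bool
}

// NewDaemon runs the probe exactly once and stores the result.
func NewDaemon(probe func() bool) *Daemon {
	return &Daemon{loggerSupportsJournal: probe()}
}

// logMode reports which logging path a write would take. It only reads the
// cached field and never re-runs the probe.
func (dn *Daemon) logMode() string {
	if dn.loggerSupportsJournal {
		return "journald"
	}
	return "json-fallback"
}

func main() {
	// The probe below is a placeholder that always reports "unsupported".
	dn := NewDaemon(func() bool { return false })
	fmt.Println(dn.logMode())
}
```

This also sidesteps the upgrade hazard of a probe result changing mid-run, since the mode is fixed for the lifetime of the process.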

cgwalters · 2019-05-11T12:50:34Z

/approve
/retest

Signed-off-by: Antonio Murdaca <runcom@linux.com>

runcom · 2019-05-12T09:54:27Z

/test e2e-rhel-scaleup

runcom · 2019-05-12T10:38:39Z

ok, this is now stable in my testing against a real scaleup cluster with 3 rhcos masters + 2 centos 7 workers. The fallback to still write on disk did the trick, and I'd leave it as a redundancy fallback. We should run our e2e-aws-op in the scaleup job though, or we risk these kinds of regressions every time. What bothers me the most though is:

rhcos logs to journal just fine, we know it and CI jobs are on our side
RHEL7.6 logs to journal works from my testing
CentOS 7 logs aren't working, logger just pretends it logged something but it didn't at all
the 2 points above are the reason I've re-added the write-to-disk function and calling it now
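A rough sketch of what a write-to-disk redundancy path could look like, assuming the pending state is a small JSON document. The file name, the pendingConfig fields, and all three helpers are hypothetical and are not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// pendingConfig is a hypothetical on-disk record of the pending state.
type pendingConfig struct {
	PendingConfig string `json:"pendingConfig"`
	BootID        string `json:"bootID"`
}

// pendingFileName is an illustrative name, not the path the MCD uses.
const pendingFileName = "pending-state.json"

// writePending writes to a temporary file and renames it into place, so a
// reader never observes a half-written file.
func writePending(dir string, p pendingConfig) error {
	b, err := json.Marshal(p)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, pendingFileName+".tmp")
	if err := os.WriteFile(tmp, b, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, pendingFileName))
}

// readPending returns ok=false, with no error, when no state was recorded.
func readPending(dir string) (pendingConfig, bool, error) {
	var p pendingConfig
	b, err := os.ReadFile(filepath.Join(dir, pendingFileName))
	if os.IsNotExist(err) {
		return p, false, nil
	}
	if err != nil {
		return p, false, err
	}
	if err := json.Unmarshal(b, &p); err != nil {
		return p, false, err
	}
	return p, true, nil
}

// clearPending removes the record; a missing file is not an error.
func clearPending(dir string) error {
	err := os.Remove(filepath.Join(dir, pendingFileName))
	if os.IsNotExist(err) {
		return nil
	}
	return err
}

func main() {
	dir, err := os.MkdirTemp("", "mcd-sketch")
	if err != nil {
		fmt.Println("mkdtemp failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	if err := writePending(dir, pendingConfig{PendingConfig: "rendered-worker-abc", BootID: "boot-1"}); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	p, ok, err := readPending(dir)
	fmt.Println(p, ok, err)
}
```

Because the file does not depend on logger or journald at all, it still works on hosts where logger silently drops the entry.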

Also, see openshift/release#3748 (comment)

Follow up from my points above: CentOS7 in scaleup job uses cri-o 1.12.x (wrong and old since kubelet is at 1.13)

runcom · 2019-05-12T12:30:34Z

/retest

The cluster and the machine-config operator come up in the scaleup test, but the job itself fails. This is not related to this change, as the job wasn't passing even before it (which is why it was never enabled by default).

runcom · 2019-05-12T20:15:48Z

/test e2e-rhel-scaleup

runcom · 2019-05-12T23:00:56Z

/test e2e-rhel-scaleup

runcom · 2019-05-13T10:28:56Z

scaleup passed this time..

ashcrow · 2019-05-13T13:02:26Z

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

runcom · 2019-05-13T13:04:46Z

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

it is, if @cgwalters @kikisdeliveryservice @LorbusChris can ack on this it would be ready to merge

LorbusChris · 2019-05-13T13:37:49Z

/lgtm

openshift-ci-robot · 2019-05-13T13:38:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, LorbusChris, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LorbusChris,cgwalters,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom · 2019-05-13T14:00:49Z

/test e2e-rhel-scaleup

LorbusChris · 2019-05-13T15:03:44Z

😢

/test e2e-rhel-scaleup
/test e2e-aws

runcom · 2019-05-13T16:32:32Z

/retest

openshift-bot · 2019-05-13T17:16:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

ashcrow · 2019-05-13T18:00:21Z

/retest

openshift-bot · 2019-05-13T18:21:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2019-05-13T19:09:14Z

@runcom: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
ci/prow/e2e-rhel-scaleup	`1817d82`	link	`/test e2e-rhel-scaleup`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 10, 2019

openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 10, 2019

openshift-ci-robot requested review from ashcrow and cgwalters May 10, 2019 13:27

cgwalters reviewed May 10, 2019


runcom force-pushed the fix-old-logger branch 2 times, most recently from 69f2122 to c37d648 Compare May 10, 2019 14:15

runcom force-pushed the fix-old-logger branch from c37d648 to bfbf2f6 Compare May 10, 2019 16:26

kikisdeliveryservice reviewed May 10, 2019


ashcrow reviewed May 10, 2019


runcom force-pushed the fix-old-logger branch from bfbf2f6 to acfb35f Compare May 10, 2019 20:19

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2019

cgwalters reviewed May 11, 2019


runcom added 2 commits May 12, 2019 11:46

pkg/daemon: fire an event on pending config roll back

faae793

Signed-off-by: Antonio Murdaca <runcom@linux.com>

pkg/daemon: do not compare pointers

2948acb

Signed-off-by: Antonio Murdaca <runcom@linux.com>

runcom force-pushed the fix-old-logger branch from 6c7df75 to 2948acb Compare May 12, 2019 09:47

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2019

pkg/daemon: write pending state to disk if old logger

1817d82

Signed-off-by: Antonio Murdaca <runcom@linux.com>

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 12, 2019

openshift-ci-robot assigned LorbusChris May 13, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2019

openshift-merge-robot merged commit 6397877 into openshift:master May 13, 2019

runcom deleted the fix-old-logger branch May 13, 2019 19:40

runcom mentioned this pull request May 15, 2019

[release-4.1] Bug 1708602: pkg/daemon: workaround old util-linux logger in rhel7.6 #754

Merged
