@runcom
Member

@runcom runcom commented May 10, 2019

- What I did

RHEL 7.6 ships an old util-linux logger that doesn't support the --journald flag for structured logging to the journal. Work around that by logging JSON directly as the message, and fall back to that when we're on RHEL 7.6.

- How to verify it

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 10, 2019
@runcom
Member Author

runcom commented May 10, 2019

/hold

Depends on #733

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 10, 2019
@runcom
Member Author

runcom commented May 10, 2019

/test e2e-rhel-scaleup

Member

Eeek. Scraping all journal messages for that?

How about we detect on startup whether the host is rhel7, and add that as a flag in the daemon struct. That'd be useful for a lot of other things.
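A minimal sketch of what startup detection could look like, assuming the host's os-release content is readable by the daemon. The helper names `parseOSRelease` and `isRHEL7` are hypothetical, and the sample content in `main` is illustrative only. Note that it matches `ID=rhel` only, so a CentOS 7 host would need its own check.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseOSRelease parses os-release style KEY=VALUE lines into a map,
// stripping optional surrounding quotes from the values.
func parseOSRelease(content string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		fields[parts[0]] = strings.Trim(parts[1], `"'`)
	}
	return fields
}

// isRHEL7 reports whether the parsed fields describe a RHEL 7.x host.
// Hypothetical helper: the real daemon may identify the OS differently.
func isRHEL7(fields map[string]string) bool {
	v := fields["VERSION_ID"]
	return fields["ID"] == "rhel" && (v == "7" || strings.HasPrefix(v, "7."))
}

func main() {
	// Illustrative content; on a real host this would be read once at startup.
	sample := "ID=\"rhel\"\nVERSION_ID=\"7.6\"\n"
	hostRhel7 := isRHEL7(parseOSRelease(sample))
	fmt.Println("hostRhel7:", hostRhel7)
}
```

The result would be stored once in the daemon struct and read by every later code path that needs it.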

Member Author

@runcom runcom May 10, 2019

We scrape the whole journal only if --journald isn't supported, so we need to do this anyway, right? I didn't want to add a specific rhel7 flag, as I was just testing for functionality, which should generally be better. If rhel7.7 upgrades util-linux to a logger that does have --journald, we'd still be using this awful hack if we just test RHEL versions.

Member

If rhel7.7 upgrades util-linux to a logger that does have --journald, we'd still be using this awful hack if we just test RHEL versions.

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

Member Author

Eeek. Scraping all journal messages for that?

To reply specifically to that: with this patch, we're not always scraping the whole journal. The if !loggerWithoutJournalOption() { check tells us whether we need to do that, which is only when logger doesn't have --journald.

Member Author

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

Yeah, agree it's very unlikely indeed, but since we don't need to play with a rhel7 flag today, isn't testing for the logger functionality in itself enough? (I'm super happy to change it anyway)

Member

OLD_LOGGER seems a bit too generic... while it's unlikely, that string could appear in log messages from any other process.

(Ideally...we'd have the MCD logging under machine-config-daemon.service that we create on the host. That's another topic)

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ? 😉
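To illustrate why a long, unique marker helps, here is a rough sketch of tagging the JSON message on the write side and recognizing it again on the read side. Only the marker name comes from this thread. The helpers `encodeLegacyMessage` and `decodeLegacyMessage`, the marker value "1", and the example field name are assumptions, not the code in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyLogMarker is the key suggested in this review thread.
const legacyLogMarker = "OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK"

// encodeLegacyMessage returns JSON text carrying the fields plus the marker,
// suitable for use as a plain log message. Hypothetical helper.
func encodeLegacyMessage(fields map[string]string) (string, error) {
	payload := make(map[string]string, len(fields)+1)
	for k, v := range fields {
		payload[k] = v
	}
	payload[legacyLogMarker] = "1" // assumed value; only presence is checked
	raw, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

// decodeLegacyMessage returns the fields only if message is a JSON object of
// strings that carries the marker; anything else is treated as unrelated.
func decodeLegacyMessage(message string) (map[string]string, bool) {
	var payload map[string]string
	if err := json.Unmarshal([]byte(message), &payload); err != nil {
		return nil, false
	}
	if _, ok := payload[legacyLogMarker]; !ok {
		return nil, false
	}
	delete(payload, legacyLogMarker)
	return payload, true
}

func main() {
	msg, err := encodeLegacyMessage(map[string]string{"EXAMPLE_FIELD": "value"})
	if err != nil {
		panic(err)
	}
	fields, ok := decodeLegacyMessage(msg)
	fmt.Println(ok, fields)

	// A message from some other process is ignored.
	_, ok = decodeLegacyMessage("OLD_LOGGER: something unrelated")
	fmt.Println(ok)
}
```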

Member Author

@runcom runcom May 10, 2019

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ?

😂 I'll go for it, that's a good one indeed to avoid collisions

Member

Then in this code we wouldn't try and fall back each time, we'd just do:

if (!dn.hostRhel7) {
    logger --journald
} else {
    logger with literal JSON as message
}
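The branch above could be fleshed out roughly as follows. This is a sketch, not the code in this PR: `loggerInvocation` and `planLoggerCall` are hypothetical names, and the function only plans the arguments and stdin for logger instead of executing it. It relies on `logger --journald` reading FIELD=value lines from stdin, and it assumes single-line values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// loggerInvocation describes one call to logger(1): its arguments and stdin.
type loggerInvocation struct {
	Args  []string
	Stdin string
}

// planLoggerCall picks the structured path when --journald is available and
// otherwise falls back to a literal JSON message. Values must be single-line.
func planLoggerCall(supportsJournal bool, message string, fields map[string]string) (loggerInvocation, error) {
	if supportsJournal {
		keys := make([]string, 0, len(fields))
		for k := range fields {
			keys = append(keys, k)
		}
		sort.Strings(keys) // stable output order

		var b strings.Builder
		fmt.Fprintf(&b, "MESSAGE=%s\n", message)
		for _, k := range keys {
			fmt.Fprintf(&b, "%s=%s\n", k, fields[k])
		}
		return loggerInvocation{Args: []string{"--journald"}, Stdin: b.String()}, nil
	}

	// Fallback: the whole entry becomes one JSON string used as the message.
	payload := map[string]string{"MESSAGE": message}
	for k, v := range fields {
		payload[k] = v
	}
	raw, err := json.Marshal(payload)
	if err != nil {
		return loggerInvocation{}, err
	}
	return loggerInvocation{Args: []string{string(raw)}}, nil
}

func main() {
	fields := map[string]string{"STATE": "done"}
	for _, supports := range []bool{true, false} {
		inv, err := planLoggerCall(supports, "hello", fields)
		if err != nil {
			panic(err)
		}
		fmt.Printf("args=%q stdin=%q\n", inv.Args, inv.Stdin)
	}
}
```

Keeping the planning separate from the actual exec call also makes both paths easy to unit test without a logger binary on the test host.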

Member

@cgwalters cgwalters May 10, 2019

I think we should use logger --help...as running this command just hangs for me; maybe it works because we're not providing stdin?

Or better, do:

if dn.OperatingSystem == machineConfigDaemonOSRHCOS {
    return false
}

?
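For the logger --help idea, a sketch along these lines avoids the hang, since --help never waits on stdin and the call is bounded by a timeout anyway. It assumes the help text lists --journald whenever the flag is supported. `helpMentionsJournald` is a hypothetical helper, and the 5 second timeout is an arbitrary choice.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// helpMentionsJournald reports whether logger's help text advertises --journald.
func helpMentionsJournald(helpOutput string) bool {
	return strings.Contains(helpOutput, "--journald")
}

// loggerSupportsJournal runs "logger --help" with a timeout and inspects the
// output. A missing binary or empty output is treated as "not supported".
func loggerSupportsJournal() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The exit status is ignored on purpose; only the help text matters here.
	out, _ := exec.CommandContext(ctx, "logger", "--help").CombinedOutput()
	return helpMentionsJournald(string(out))
}

func main() {
	fmt.Println("logger supports --journald:", loggerSupportsJournal())
}
```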

Member Author

uhm, does it hang on RHCOS? works fine on rhel7.6 🤔

Member Author

oh yeah I get it now thanks!

Member Author

fixed

Member

note double negative here, probably better as loggerSupportsJournal

Member Author

fixed

@runcom runcom force-pushed the fix-old-logger branch 2 times, most recently from 69f2122 to c37d648 Compare May 10, 2019 14:15
@runcom
Member Author

runcom commented May 10, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 10, 2019

/retest

@runcom
Member Author

runcom commented May 10, 2019

/test e2e-rhel-scaleup

Contributor

love this. 😆

Member

@ashcrow ashcrow left a comment

I feel like the logger shell-out code should probably end up being encapsulated in its own struct/func, but for now this seems sane to fix the bug.

@runcom
Member Author

runcom commented May 10, 2019

@vrutkovs just for awareness, not sure who the owner is, but scaleup is broken somehow

EDIT: the last failure there seems to be:

level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-05-10-173304: 100% complete, waiting on authentication: timed out waiting for the condition"
2019/05/10 18:23:30 Container setup in pod e2e-rhel-scaleup failed, exit code 1, reason Error
Waiting for scaleup to complete...Another process exited
2019/05/10 18:23:40 Container test in pod e2e-rhel-scaleup failed, exit code 1, reason Error

@runcom
Member Author

runcom commented May 10, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 10, 2019

/retest

@runcom
Member Author

runcom commented May 10, 2019

/refresh

@runcom
Member Author

runcom commented May 10, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 10, 2019

aws limits.... will retest later...

@runcom
Member Author

runcom commented May 11, 2019

/retest

@runcom
Member Author

runcom commented May 11, 2019

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2019
Member

I still think it'd be cleaner to call loggerSupportsJournal once or so on MCD startup rather than trying and failing each time.
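One way to read this suggestion, as a sketch: run the capability probe a single time when the daemon is constructed and keep the answer in the struct. The `Daemon` type here is a stand-in rather than the real MCD struct, and `NewDaemon` and `logPath` are hypothetical names. The probe is injected as a function so the example runs without a logger binary.

```go
package main

import "fmt"

// Daemon is a stand-in for the real daemon struct; only the cached flag matters.
type Daemon struct {
	loggerSupportsJournal bool
}

// NewDaemon runs the probe exactly once and stores the result.
func NewDaemon(probe func() bool) *Daemon {
	return &Daemon{loggerSupportsJournal: probe()}
}

// logPath reports which logging path a write would take, using the cached flag
// instead of trying --journald and failing on every call.
func (dn *Daemon) logPath() string {
	if dn.loggerSupportsJournal {
		return "journald"
	}
	return "legacy-json"
}

func main() {
	calls := 0
	dn := NewDaemon(func() bool {
		calls++
		return false // pretend this host has the old logger
	})
	for i := 0; i < 3; i++ {
		fmt.Println(dn.logPath())
	}
	fmt.Println("probe calls:", calls)
}
```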

Member Author

oh I can definitely do that if it's for loggerSupportsJournal

Member Author

should be fixed

@cgwalters
Member

/approve
/retest

runcom added 2 commits May 12, 2019 11:46
Signed-off-by: Antonio Murdaca <runcom@linux.com>
Signed-off-by: Antonio Murdaca <runcom@linux.com>
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2019
Signed-off-by: Antonio Murdaca <runcom@linux.com>
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 12, 2019
@runcom
Member Author

runcom commented May 12, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 12, 2019

OK, this is now stable in my testing against a real scaleup cluster with 3 rhcos masters + 2 centos 7 workers. The fallback that still writes to disk did the trick, and I'd leave it as a redundancy fallback. We should run our e2e-aws-op in the scaleup job though, or we risk these kinds of regressions every time. What bothers me the most though is:

  • rhcos logs to journal just fine, we know it and CI jobs are on our side
  • RHEL7.6 logging to journal works in my testing
  • CentOS 7 logs aren't working: logger just pretends it logged something, but it didn't at all
  • the 2 points above are the reason I've re-added the write-to-disk function and am calling it now

Also, see openshift/release#3748 (comment)

Follow up from my points above: CentOS7 in scaleup job uses cri-o 1.12.x (wrong and old since kubelet is at 1.13)

@runcom
Member Author

runcom commented May 12, 2019

/retest

The cluster and machineconfig operator come up in the scaleup test, but the job itself fails. This is not related to this change, as it wasn't passing even before this (which is why it was never enabled by default).

@runcom
Member Author

runcom commented May 12, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 12, 2019

/test e2e-rhel-scaleup

@runcom
Member Author

runcom commented May 13, 2019

scaleup passed this time..

@ashcrow
Member

ashcrow commented May 13, 2019

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

@runcom
Member Author

runcom commented May 13, 2019

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

it is, if @cgwalters @kikisdeliveryservice @LorbusChris can ack on this it would be ready to merge

@LorbusChris
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, LorbusChris, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [LorbusChris,cgwalters,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Member Author

runcom commented May 13, 2019

/test e2e-rhel-scaleup

@LorbusChris
Contributor

😢

/test e2e-rhel-scaleup
/test e2e-aws

@runcom
Member Author

runcom commented May 13, 2019

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@ashcrow
Member

ashcrow commented May 13, 2019

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

openshift-ci-robot commented May 13, 2019

@runcom: The following test failed, say /retest to rerun them all:

Test name                  Commit    Details  Rerun command
ci/prow/e2e-rhel-scaleup   1817d82   link     /test e2e-rhel-scaleup

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants