
[4.3] Add a "reboot request" annotation flow, and request-reboot command #926

Closed

Conversation

@cgwalters (Member) commented Jul 3, 2019

Currently, the MCO only supports rebooting for config changes;
the MCC sets the `desiredConfig` annotation, and the MCD implicitly
takes `currentConfig != desiredConfig` as permission to reboot.

However, we have cases where we want to support drain+reboot
for non-config changes.

The first major motivating reason for this is to handle kernel
arguments injected via initial MachineConfig objects. Until
we do more work, the MCD will land on the nodes and then go
degraded, because `pivot` as shipped with OpenShift 4.1.0
doesn't know about `MachineConfig` (and we have no mechanism
right now to update the bootimages: openshift/os#381), so the
kernel arguments won't match what is expected.

Now, we really want to handle kernel arguments at the first boot,
and that will be done eventually.

But this gives us a mechanism to reconcile rather than go degraded.
If we had the MCD start a reboot on its own, we could exceed the
`maxUnavailable` defined by the pool, break master etcd consistency, etc.

In other words, reboots need to be managed via the MCC too.

For the MCD, the `reboot-requested` source of truth is the systemd
journal, as it gives us an easy way to "scope" a reboot request
to a given boot.

The flow here is:

  • MCC sets `desiredConfig`
  • MCD notices `desiredConfig`, logs the reboot request in the journal
  • MCD sees its own journal message, adds the `reboot-requested` annotation
  • MCC approves the reboot via the `reboot-approved` annotation
  • MCD reboots
  • After the reboot, the MCD notices it still has the reboot annotation, checks whether the request appears in the journal for the current boot; it doesn't, so the `reboot-requested` annotation is removed
  • MCC notices the node has `reboot-approved` without `reboot-requested`, and removes `reboot-approved`

This ensures that config changes are always testing the "reboot request"
API.

For now, we expose that API as `/usr/libexec/machine-config-daemon request-reboot <reason>`.
For example, one could use `oc debug node` in a loop and run that command on each node
to perform a rolling reboot of the hosts.
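
As a rough illustration of that loop (a sketch only, not code from this PR; it assumes `oc` is logged in with cluster-admin, that the host binary is reached via `chroot /host`, and a made-up "rolling-reboot" reason string):

    // rollingreboot.go: drive the request-reboot API across all nodes, one at a
    // time, via `oc debug`. In practice you would wait for each node to reboot
    // and report Ready again before moving on, to respect maxUnavailable.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "strings"
    )

    func main() {
        // List nodes in "node/<name>" form so the names can be passed to `oc debug`.
        out, err := exec.Command("oc", "get", "nodes", "-o", "name").Output()
        if err != nil {
            fmt.Fprintf(os.Stderr, "listing nodes: %v\n", err)
            os.Exit(1)
        }
        for _, node := range strings.Fields(string(out)) {
            fmt.Printf("requesting reboot of %s\n", node)
            cmd := exec.Command("oc", "debug", node, "--",
                "chroot", "/host", "/usr/libexec/machine-config-daemon",
                "request-reboot", "rolling-reboot")
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr
            if err := cmd.Run(); err != nil {
                fmt.Fprintf(os.Stderr, "%s: %v\n", node, err)
                os.Exit(1)
            }
        }
    }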

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 3, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2019
@runcom (Member) commented Jul 4, 2019

Knowing the constraints behind all this (time, mainly?), I believe this is a good start. The thing I'm still grasping is why we're using the journal to request an external reboot rather than something Kube-driven like a CRD, but that can surely come later.

@cgwalters (Member Author)

> thing I'm still grasping is why we're using the journal to request an external reboot and not something kube driven like a crd or something, but that can surely come later

My thought is that anyone who needs to reboot a node has privileged access to that node anyway - e.g. in the tuned case, it's a privileged daemonset. The journal gives us auditing and is already per-node.

The other reason is to "bind" the reboot request to a boot. We only query the journal for reboot requests logged during the current boot; if it were an annotation, we'd have to do something like require the annotation to include the boot ID, so we'd know when to delete it.
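
To make that boot-scoping concrete, here's a minimal sketch (assumptions: it shells out to `journalctl` rather than using the actual MCD code, and the `machine-config-daemon-reboot` syslog identifier is made up):

    // journalcheck.go: sketch of scoping a "reboot requested" flag to the current
    // boot. `journalctl -b` only returns entries from the running boot, so a
    // request logged before a reboot naturally disappears afterwards.
    package main

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // rebootRequestedThisBoot reports whether a request was logged since this boot
    // started. The syslog identifier is hypothetical, not the tag used by this PR.
    func rebootRequestedThisBoot() (bool, error) {
        out, err := exec.Command("journalctl", "-q", "-b", "-o", "json",
            "-t", "machine-config-daemon-reboot").Output()
        if err != nil {
            return false, err
        }
        for _, line := range bytes.Split(out, []byte("\n")) {
            if bytes.HasPrefix(bytes.TrimSpace(line), []byte("{")) {
                return true, nil
            }
        }
        return false, nil
    }

    func main() {
        requested, err := rebootRequestedThisBoot()
        if err != nil {
            fmt.Println("error reading journal:", err)
            return
        }
        fmt.Println("reboot requested this boot:", requested)
    }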

@cgwalters (Member Author)

But just to think about this a bit more, what would a CRD approach to this look like? Something like a `rebootrequest.machineconfiguration.openshift.io` with:

spec:
    nodeRef:
      kind: Node
      name: node-41234
      uid: ab3ff6bc-a186-11e9-93cc-5254005254d4
    bootID: 7cacc01d-13b0-4d3c-abd2-8a8461cccd1c

And then each MCD would watch for all instances of `RebootRequest`? Or would we have the MCC roll out via annotations, as this PR is doing?
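
For comparison, that sketch would roughly map to Go API types like the following (purely illustrative; the group, kind, and the status fields here are assumptions, not something defined by this PR):

    // Hypothetical API types mirroring the rebootrequest YAML sketch above.
    package v1

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // RebootRequest asks that one specific node be drained and rebooted.
    type RebootRequest struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec   RebootRequestSpec   `json:"spec"`
        Status RebootRequestStatus `json:"status,omitempty"`
    }

    // RebootRequestSpec carries the target node plus the boot ID the request was
    // made against, so a stale request can be garbage-collected after the reboot.
    type RebootRequestSpec struct {
        NodeRef corev1.ObjectReference `json:"nodeRef"`
        BootID  string                 `json:"bootID"`
    }

    // RebootRequestStatus is one way the MCC could record approval and completion.
    type RebootRequestStatus struct {
        Approved  bool         `json:"approved,omitempty"`
        Completed *metav1.Time `json:"completed,omitempty"`
    }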

@cgwalters (Member Author)

Worth noting here that the journal is conceptually just an implementation detail; the "API" is `execve(/usr/libexec/machine-config-daemon request-reboot <reason>)`.

That doesn't preclude that command creating a CRD in the future or something else.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2019
@cgwalters cgwalters force-pushed the reboot-coordination branch from 7374f47 to 8752108 Compare July 9, 2019 20:23
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 9, 2019
@cgwalters cgwalters force-pushed the reboot-coordination branch from 8752108 to 638784a Compare July 16, 2019 18:18
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 16, 2019
@cgwalters cgwalters force-pushed the reboot-coordination branch 4 times, most recently from d480b0c to d93b81b Compare July 22, 2019 17:38
@cgwalters (Member Author)

Slowly making progress with this... what I'm hitting now is that, for some reason, the persistent `journalctl -f` process isn't seeing updates. One thing I'm not sure is the cause, but may be related: libsystemd seems to be falling back to polling instead of using inotify.

@cgwalters cgwalters force-pushed the reboot-coordination branch from d93b81b to f9e14de Compare July 23, 2019 21:15
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 23, 2019
@cgwalters (Member Author)

The journalctl issue is still a problem; today I mostly wrestled with getting the unit tests to pass. There's a new "ping pong" in the reboot request/approval flags between the MCD and MCC, and to avoid having to entirely rewrite the unit tests, I just changed them to set the flags in advance.

@cgwalters cgwalters force-pushed the reboot-coordination branch from f9e14de to bd604b0 Compare July 23, 2019 22:23
@cgwalters (Member Author)

Hooray, this passed CI!

@cgwalters cgwalters force-pushed the reboot-coordination branch from bd604b0 to 3f8d302 Compare July 24, 2019 14:11
@lucab commented Jul 26, 2019

@LorbusChris at a very high-level, this is already very similar to the zincati+airlock way (local agent for reboots, external DB for lock handling). At lower levels, I don't think there is much that can be aligned/reused at this point.

@cgwalters cgwalters changed the title Reboot coordination Add a "reboot request" annotation flow, and request-reboot command Jul 26, 2019
@cgwalters (Member Author)

I think for some use cases (particularly e.g. privileged daemonset pods), the "exec command on host to request reboot" is a natural-enough API.

However, there are other reasons to reboot. One case is that, over time, kernel memory can get fragmented, etc., and rebooting is a way to work around it. To deal with this you'd need a controller/script/job which did e.g. `oc debug node/$x -- /usr/libexec/machine-config-daemon request-reboot "fragmentation"` or so, which isn't very kube-like. If one were writing a "fragmentation avoidance" controller/operator, it'd be nicer to e.g. watch Prometheus metrics and just set a node annotation (sketched below), or probably even something more CRD-like.
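
A minimal sketch of that annotation-setting shape (the annotation key, node name, and reason string here are made up for illustration; a real controller would derive the decision from Prometheus fragmentation metrics):

    // Sketch of a controller setting a "please reboot me" annotation on a node.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // requestReboot merge-patches only the annotation instead of updating the
    // whole Node object.
    func requestReboot(ctx context.Context, client kubernetes.Interface, node, reason string) error {
        patch := []byte(fmt.Sprintf(
            `{"metadata":{"annotations":{"machineconfiguration.openshift.io/reboot-requested":%q}}}`,
            reason))
        _, err := client.CoreV1().Nodes().Patch(ctx, node, types.MergePatchType, patch, metav1.PatchOptions{})
        return err
    }

    func main() {
        cfg, err := rest.InClusterConfig() // assumes this runs as an in-cluster controller
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        // The real decision would come from watching fragmentation metrics; it is
        // hard-coded here to keep the sketch short.
        if err := requestReboot(context.TODO(), client, "node-41234", "memory-fragmentation"); err != nil {
            panic(err)
        }
    }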

Speaking of that, https://github.com/kubevirt/node-maintenance-operator is closely related here - I feel like that functionality might just be better as core to the MCO? And we'd have a

NodeMaintenance
  type: reboot

or so?
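
If that went the MCO-core route, the shape might be as small as this (a hypothetical type mirroring the fragment above, not an existing API):

    // Hypothetical NodeMaintenance API if that functionality lived in the MCO.
    package v1

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    type NodeMaintenance struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec NodeMaintenanceSpec `json:"spec"`
    }

    type NodeMaintenanceSpec struct {
        // NodeName is the node to drain and maintain.
        NodeName string `json:"nodeName"`
        // Type of maintenance being requested, e.g. "reboot".
        Type string `json:"type"`
    }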

@cgwalters cgwalters force-pushed the reboot-coordination branch from bf63d4b to ea46adb Compare July 26, 2019 21:27
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters cgwalters changed the title Add a "reboot request" annotation flow, and request-reboot command [4.3] Add a "reboot request" annotation flow, and request-reboot command Aug 5, 2019
@cgwalters cgwalters added the 4.3 label Aug 5, 2019
@openshift-ci-robot (Contributor)

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2019
@jmencak (Contributor) commented Nov 28, 2019

> Speaking of that, https://github.com/kubevirt/node-maintenance-operator is closely related here - I feel like that functionality might just be better as core to the MCO?

I believe the functionality should be part of an SLO (MCO is probably the best candidate), rather than an optional OLM-managed component. It seems wrong for SLOs to rely on OLM-managed components.

@pliurh (Contributor) commented Dec 3, 2019

Will this PR get merged in 4.3? Or even 4.4?

@cgwalters (Member Author)

Also related: kubernetes/enhancements#1411

@smarterclayton (Contributor) commented Feb 14, 2020

In general this needs to be something more like

spec:
  generation: 2

where if someone bumps generation then all nodes are updated (generation is part of the generated config).

We use that on deployments, statefulsets, and daemonsets today by doing:

spec:
  template:
    metadata:
      annotations:
        generation: "2"

We are discussing standardizing this across all "reboot-able objects" (including machines), such that a command like

oc rollout deploy/foo mcp/bar sts/baz

would thus roll out all of those entities.

EDIT: These names are made up; it might be configGeneration or stateGeneration or some other word that implies the "rebooting / redeploying-ness" of it.
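
To illustrate the existing mechanism being referenced, here's roughly what bumping a pod-template annotation on a Deployment looks like with client-go (a sketch; the "generation" key, namespace, and deployment name are the made-up values from the comment above):

    // Sketch: changing anything under spec.template makes the Deployment
    // controller roll out new pods, which is the behavior being generalized here.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Assumes a local kubeconfig; an operator would use in-cluster config instead.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        patch := []byte(`{"spec":{"template":{"metadata":{"annotations":{"generation":"2"}}}}}`)
        if _, err := client.AppsV1().Deployments("default").Patch(
            context.TODO(), "foo", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
            panic(err)
        }
        fmt.Println("rollout triggered for deployment/foo")
    }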

@cgwalters (Member Author)

Random aside, considering this much later: one thing we can always do is take the hit of a "double reboot". In current scenarios where e.g. the bootimage is too old to do something we want, we can write a systemd unit that does the pivot, reboots, then uses new code (whether that's ostree or podman or whatever) to perform further changes like kernel arguments, and finally reboots again and runs kubelet.service.

@openshift-ci-robot (Contributor)

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Rerun command
ci/prow/e2e-aws-op ea46adb /test e2e-aws-op
ci/prow/e2e-aws ea46adb /test e2e-aws
ci/prow/e2e-aws-disruptive ea46adb /test e2e-aws-disruptive
ci/prow/e2e-gcp-op ea46adb /test e2e-gcp-op
ci/prow/e2e-aws-upgrade ea46adb /test e2e-aws-upgrade
ci/prow/e2e-gcp-upgrade ea46adb /test e2e-gcp-upgrade
ci/prow/e2e-vsphere ea46adb /test e2e-vsphere
ci/prow/build-rpms-from-tar ea46adb /test build-rpms-from-tar
ci/prow/e2e-ovn-step-registry ea46adb /test e2e-ovn-step-registry
ci/prow/e2e-aws-proxy ea46adb /test e2e-aws-proxy
ci/prow/e2e-upgrade ea46adb /test e2e-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot (Contributor)

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Rerun command
ci/prow/e2e-agnostic-upgrade ea46adb /test e2e-agnostic-upgrade
ci/prow/e2e-aws-serial ea46adb /test e2e-aws-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 19, 2021
@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot (Contributor)

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. team-mco