Commit 3d6193d (parent a16654e)

enhancement: add ignition dual spec 2/3 support

Add an enhancement proposal for supporting both ignition spec 2 and 3 for OS provisioning/updating.

Signed-off-by: Yu Qi Zhang <[email protected]>

1 file changed, +377 -0

---
title: Ignition Spec 2/3 dual support
authors:
  - "@yuqi-zhang"
reviewers:
  - "@ashcrow"
  - "@cgwalters"
  - "@crawford"
  - "@LorbusChris"
  - "@miabbott"
  - "@mrguitar"
  - "@runcom"
  - "@vrutkovs"
approvers:
creation-date: 2019-11-04
last-updated: 2019-11-19
status: provisional
see-also: https://github.com/openshift/enhancements/pull/78
replaces:
superseded-by:
---

# Ignition Spec 2/3 dual support

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift/docs]

## Summary

This enhancement proposal aims to add dual ignition specification version 2/3
(ignition version 0/2) support to OpenShift 4.x, which currently supports only
ignition version 0 spec 2 for OS provisioning and machine config updates. We
aim to introduce a non-breaking method to switch all new and existing clusters
to ignition spec version 3 at some version of the cluster, which will be
performed by the Machine Config Operator (henceforth MCO). The OpenShift
installer and underlying control plane/bootstrap configuration, as well as the
RHEL CoreOS (henceforth RHCOS) package version, will also be updated.

This overall migration will take place as a two-phase process:

Phase 1/OCP 4.4:
- The MCD gains the ability to process both v2 and v3 configs
- The MCS gains the ability to translate v3 configs to v2 configs "on the fly"
- The MCC attempts to update all v2 configs to v3, leaving those which cannot be converted
- The installer and other components generate v3 configs

Phase 2/OCP 4.5:
- The MCO enforces that all configs are v3 before allowing the CVO to start the update
- RHCOS bootimages switch to accepting only v3 configs

## Motivation

Ignition released v2.0.0 (stable) on June 3rd, 2019, which has an updated
specification format (ignition spec version 3, henceforth “Spec 3”). This
change includes some important reworks for RHCOS provisioning, most
importantly the ability to specify and mount other filesystems, and fixes for
issues where the ignition v0 spec was not declarative. In particular, this is
required to support having /var on a separate partition or disk, an important
requirement for security/compliance purposes in OCP. The existing version on
RHCOS systems (ignition v0.33) carries a spec version (spec version 2,
henceforth “Spec 2”) that is not compatible with Spec 3. Thus we would like to
update the ignition version on RHCOS/the installer/the MCO to make use of the
changes.

This proposal will also allow closer alignment with OKD, as OKD will be based
on Fedora CoreOS (henceforth FCoS), which is already on, and only supports,
ignition spec 3. We want to do this in a way that minimizes deltas between
OKD and OCP.

### Goals

#### Phase 1:

- [ ] A config translator is created to translate from ignition spec 2 to spec 3
- [ ] A config translator is created to translate from ignition spec 3 to spec 2
- [ ] The MCO is updated to support both ignition spec 2 and 3, with:
  - [ ] a translator in the Machine Config Controller that can convert spec 2 cluster configuration to spec 3
  - [ ] a detector in the Machine Config Server that can serve the correct spec version based on bootimage version
- [ ] The OpenShift installer is updated to generate ignition spec 3 configs
- [ ] Successfully create new Spec 3-only IPI/UPI clusters
- [ ] Create a migration method to Spec 3 for clusters with non-compatible (non-translatable) Spec 2 configs
- [ ] An alerting mechanism is put in place for outdated and incompatible/non-translatable configs
- [ ] Docs are updated to reflect the new config version

#### Phase 2:

- [ ] The RHCOS bootimage is updated to accept ignition spec 3 configs
- [ ] Successfully upgrade existing 4.x clusters to Spec 3 clusters

### Non-Goals

- Support FCoS/OKD directly through this change
- API support for the MCO, namely switching to RawExt-formatted machineconfig
  objects instead of explicitly referencing ignition, is not considered as
  part of this proposal

## Proposal

This change is multi-component:

### Vendoring Changes

The MCO and the installer must change to Go modules (currently dep) for
vendoring, as ignition v2 requires using Go modules. To support both spec 2
and spec 3, both ignition versions must be vendored in for typing.

### Spec Translator

To handle both spec versions, as we need the ability to upgrade existing
clusters, we will create a translator between spec 2 and spec 3. This ensures
that a cluster only has one “desiredConfig”, which will be translated to spec 3
when the MCO with dual support detects that the existing configuration of a
machine is on spec 2 (this will happen only once for all existing and new
nodes, when the MCO with dual support is first deployed onto the cluster).
This will only be required as part of the MCO.

Note that there exist three types of spec 2 configs:
- Those that are directly translatable to spec 3. This is the case for all existing IPI configs.
- Those that we’re not 100% sure we can directly translate, but for which we can infer what the user is doing and do a translation
- Those that we’re 100% sure we CANNOT translate directly, and which require user input for us to correctly translate
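
As an illustration of the third category, a best-effort checker could scan a parsed spec 2 config for features that spec 3 dropped or reworked. The sketch below is hypothetical (the function name and the exact feature set are illustrative, not the MCO's actual rules), assuming JSON-decoded configs:

```python
# Hypothetical sketch of an untranslatable-feature detector for spec 2
# configs. The checked features are examples: spec 3.0 removed the
# "networkd" section, and reworked a file's "append" from a boolean into
# a list of contents, so a bare append:true needs user input to resolve.
def find_untranslatable(cfg: dict) -> list:
    """Return reasons a spec 2 config cannot be mechanically translated."""
    reasons = []
    if cfg.get("networkd"):
        reasons.append("uses the removed 'networkd' section")
    for f in cfg.get("storage", {}).get("files", []):
        if f.get("append") is True:
            reasons.append(f"file {f.get('path')} uses boolean 'append'")
    return reasons

cfg = {
    "ignition": {"version": "2.2.0"},
    "networkd": {"units": [{"name": "00-eth0.network"}]},
    "storage": {"files": [{"path": "/etc/motd", "append": True}]},
}
print(find_untranslatable(cfg))  # two reasons found
```

A config with no flagged features would fall into the first two categories and be handled by the automatic translator.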

During phase 1 we should attempt to translate on a best-effort basis. If the
cluster config can be directly translated to spec 3, we will do so and use the
now spec 3 config in the MCD. If not, we will support both, and alert
the user that there are untranslatable configs.

During phase 2, we should fail updates unless the cluster is fully on spec
3 configs. This effectively means that UPI clusters are at risk when upgrading
to an MCO with dual support. Mitigation methods are discussed below.

For backwards support (spec 3 to spec 2), the configs are 100% translatable.
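
The mechanical core of such a translation can be sketched as follows. This is a simplified illustration under stated assumptions, not the real translator: it only bumps the version field and drops the per-file `filesystem` reference that spec 3 removed; a production translator must cover the full schema and refuse untranslatable inputs:

```python
import copy

def translate_2_to_3(cfg: dict) -> dict:
    """Illustrative spec 2 -> spec 3 translation of the trivial cases only."""
    if not cfg.get("ignition", {}).get("version", "").startswith("2."):
        raise ValueError("expected a spec 2 config")
    out = copy.deepcopy(cfg)
    out["ignition"]["version"] = "3.0.0"
    # Spec 3 dropped the per-file 'filesystem' name; paths are interpreted
    # against the root filesystem instead.
    for f in out.get("storage", {}).get("files", []):
        f.pop("filesystem", None)
    return out

v2 = {"ignition": {"version": "2.2.0"},
      "storage": {"files": [{"filesystem": "root", "path": "/etc/motd"}]}}
v3 = translate_2_to_3(v2)
print(v3["ignition"]["version"])  # 3.0.0
```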

### RHCOS

The RHCOS bootimage needs to be updated to ignition package v2.0+. Required
dependencies are discussed below.

### Installer

Phase 1:

The installer needs to be updated to generate spec 3 configs for master and
worker nodes. These are, for now, immediately translated down by the MCS
when served to spec 2 bootimages. If the user has otherwise defined custom
machinesets with spec 3 images, those are served spec 3 configs. This means
that the installer will have to vendor both ignition v0 and v2 until phase 2.

Phase 2:

The bootstrap ignition configs are also updated to serve spec 3, and the RHCOS
images pinned by the installer will be updated to ones with ignition v2
(spec 3). All spec 2 references are stripped from the installer.

### MCO

The MCO and its subcomponents are the most affected by this change. The
aforementioned spec translator will be housed in the MCO. This means that the
MCO will need to simultaneously vendor both ignition v0 and v2, and translate
between the spec versions as needed.

Phase 1:

The MCD gains the capability to understand both spec 2 and spec 3 configs,
and to lay down files as needed.
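
A dual-support daemon can dispatch on the spec version the config itself declares. A rough sketch (the handler names are hypothetical, not the MCD's real code paths):

```python
import json

def parse_and_dispatch(raw: bytes) -> str:
    """Route a raw ignition config to a spec-appropriate code path."""
    cfg = json.loads(raw)
    version = cfg.get("ignition", {}).get("version", "")
    major = version.split(".", 1)[0]
    if major == "2":
        return "apply-spec2"  # existing ignition v0 handling
    if major == "3":
        return "apply-spec3"  # new ignition v2 handling
    raise ValueError(f"unsupported ignition spec version {version!r}")

print(parse_and_dispatch(b'{"ignition": {"version": "3.0.0"}}'))  # apply-spec3
```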

The MCC gains the capability to translate spec 2 configs to spec 3. If the
translation is completely successful, the MCD will be instructed to use
the spec 3 configs. Otherwise the MCD will continue using the existing
spec 2 configs, and apply new spec 2/spec 3 configs. If at any point the
MCC notices it can fully translate to spec 3, that is the version we will
use.

The MCS will host both spec 3 and spec 2 configs, with the functionality to
translate spec 3 down to spec 2 configs, and spec 2 up to spec 3 configs.
The MCS will first check which ignition spec version a new node supports
before serving a config.
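
One plausible detection mechanism (an assumption here, not a confirmed design detail) is keying off the Accept header ignition sends when fetching its config, which advertises the spec version the bootimage understands:

```python
import re

def spec_for_request(accept_header: str) -> int:
    """Pick a config spec version from a client's Accept header,
    e.g. 'application/vnd.coreos.ignition+json;version=3.0.0'.
    Old bootimages that advertise nothing get spec 2."""
    m = re.search(r"ignition\+json\s*;\s*version=(\d+)", accept_header or "")
    if m and int(m.group(1)) >= 3:
        return 3
    return 2

print(spec_for_request("application/vnd.coreos.ignition+json;version=3.0.0"))  # 3
print(spec_for_request(""))  # 2
```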

Phase 2:

The MCO will flat out reject spec 2 configs, and refuse to upgrade clusters
that have spec 2 bits.

The MCO will also throw an alert when:
- An attempted update to the new ignition spec cannot be performed due to untranslatable configs, and the user must manually update/remove the untranslatable config before the update can continue
- A spec 2 machineconfig is applied to a cluster that has transitioned to phase 2 (pure spec 3), and the cluster rejects that machineconfig

The MCO should also add the ability to reconcile broken spec 2 -> spec 3
updates after manual intervention from the user.

### User Stories

**As the admin of an existing 4.x cluster on spec 2, I’d like to upgrade to the newest version and use ignition spec 3**

Acceptance criteria:
- The update completes without user intervention, if all machine configs existing on the cluster can be directly translated to spec 3
- The user receives an alert, if the update is unable to complete due to untranslatable configs
- The user is able to recover the cluster and finish the update if the untranslatable configs are manually translated or removed, or to roll back to the old version
- The user should have received notification that the update will be changing spec version, as well as the necessary documentation on how to recover a failed update
- CI tests are put in place to make sure the existing versions can be updated to the new payload

**As a user of OpenShift, I’d like to install from a spec 2 bootimage and immediately update to a spec 3 payload**

Acceptance criteria:
- Essentially the same as story 1

**As a user of OpenShift, I’d like to install a fresh ignition spec 3 cluster**

Acceptance criteria:
- The workflow remains the same for an IPI cluster
- The workflow remains the same for a UPI cluster, minus custom specification changes
- The user should have good documentation, based on version, of how to set up user-defined configs during install time

**As an admin of an existing spec 3 cluster, I’d like to apply a new machineconfig**

Acceptance criteria:
- The machineconfig is applied successfully, if the user has defined a correct spec 3 ignition snippet
- The user is properly alerted if they attempt to apply a spec 2 config, and the machineconfig fails to apply
- The user is given the necessary docs to remove the undesired spec 2 config and to translate it to a spec 3 config

**As an admin of an existing spec 3 cluster, I’d like to autoscale a new node**

Acceptance criteria:
- The MCS can correctly detect which ignition version to serve the bootimage
- The bootimage boots correctly and pivots to align with the rest of the cluster version

### Risks and Mitigations

**Failing to update a cluster**

The IPI configuration is fully translatable. UPI, as well as user-provided
configuration applied as day 2 operations, are not workflows we can guarantee.
Some users will simply fail to update the cluster to a new version. To
mitigate, we must allow the user to recover and/or reconcile, or at the bare
minimum have comprehensive documentation on what to do in this situation.

**Failing to apply a spec 2 machineconfig that worked prior to the final update**

Users will likely be unhappy that there is such a large breaking change. In
other similar cases, e.g. for auto-deployed metal clusters, the served ignition
configs must all be updated after a certain point of the bootimage to be able
to bring up new machines. To mitigate, we should communicate this change well
in advance, and provide methods to translate ignition configs. Failed
installation/alerting systems must clearly communicate the source of error in
this case, as well as how to mitigate it.

## Design Details

The implemented changes for the various components can be separate, with the
caveat that ignition spec 3 support in the MCO must happen first (so that
other component changes can be tested in cluster). The MCO changes can be
standalone, as they serve to bring a spec 2 cluster to spec 3, or work as-is
on a spec 3 cluster.

**RHCOS details:**

RHCOS must change to use ignition v2, which supports spec 3 configs. The actual
switching of the package is very easy on RHCOS. The building of ignition v2,
however, presents two issues:

- The util-linux package is old on RHEL, without support for “lsblk -o PTUUID”, which ignition uses. This will have to be reworked in RHCOS, or the package must be bumped and rebuilt for RHEL 8.1, or worked around as in https://github.com/coreos/ignition-dracut/pull/133
- Ignition-dracut has seen significant deltas between the spec2x and master (spec 3) branches, especially for initramfs network management. There are also minor details, such as targets, that need to be checked for existing units. Some spec2x bits need to be merged back into 3x before RHCOS can move to 3x

This change will be phase 2 only.

**Installer details:**

The installer would only need to support both ignition spec 2 and spec 3 to
have early support for spec 3 in place. Spec 3 will be generated for worker/
master during phase 1, but bootstrap will be on spec 2 until phase 2.

The generated v3 ignition configs are passed to the bootstrap MCS, which
immediately down-translates them to spec 2. Both will be served in
parallel, and the MCS will first detect a node's ignition version
before serving it the corresponding config.

At the time of writing this proposal, there exist FCOS/OKD branches of the
installer that are looking to move to spec 3, and they have had success in
installing a cluster. This work can be integrated for OCP as well. The main
remaining issue stems from the necessity of moving to Go modules as the
vendoring method: there are failures in the Azure Terraform provider that
seem to be incompatible with this change.

**MCO details:**

A spec translator will first be implemented in the MCO, with the ability to
detect untranslatable configs. The MCO should then be updated to have support
for both ignition V0S2 and V2S3. Since the MCO is responsible for the current
cluster nodes, it will be the only place at which spec translation is done.
The translation will happen when the version of the MCO with dual support and
the translator is first deployed; it will detect that the existing config is
spec 2, and generate a new renderedconfig based on a translation from spec 2
to spec 3. If this translation is successful, it will instruct the MCD that
the spec 3 config is now the complete renderedconfig of the system. Future
spec 2 configs applied to the system will undergo the same translation. After
phase 2 happens, detected spec 2 configs will be rejected and an error thrown.

If the translation fails, the MCO will throw an alert to the admin that the
cluster machineconfig will soon be switching to spec 3, and that there are
existing configs that are not translatable. If the admin takes no action,
eventually the cluster will fail to upgrade. In phase 2, the admin will see a
failed update with the reason "Unupgradeable".

The spec translator will also translate the existing bootimage configs served
to new nodes joining the cluster. The MCS will check which config version is
currently being served, and will translate spec 2 to spec 3 and vice versa,
so the MCS with dual support will always be able to serve both. Failed
spec 2 to spec 3 translations will also be handled as above, with a warning
that at some point the cluster will refuse to upgrade. Spec 3 bootimages will
also fail to join the cluster in that case.

The MCO should also add functionality to more easily reconcile broken
machineconfigs and ignition specs being served, thus allowing a cluster admin
to correctly recover/abort a failed update to spec 3.

**Other notes:**

Spec 2 -> spec 3 translation has not been fully implemented anywhere before.
There could be many edge cases we have not yet considered. There are other
potential difficulties, such as serving the correct ignition config. See the
above section on risks and mitigations.

Starting from some version of OpenShift, likely v4.6, we can remove dual
support and be fully ignition spec 3.

Kubernetes 1.16 onwards has support for CRD versioning:
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/.
If we opt to delay this change, that is potentially an alternative method of
implementation.

### Test Plan

Extensive testing of all possible paths, especially those outlined in the
user stories, is critical to the success of this major update. The existing
CI infrastructure is a good start for upgrade testing. Additional tests
should be added, especially in the MCO repo, for the edge cases
described in the user stories, to ensure we never break this behaviour.
Many existing tests will also have to be updated given the spec change.

### Upgrade / Downgrade Strategy

The spec translation will happen as part of an upgrade, when the new MCO
is deployed. See the above discussion of alerts during upgrade. For clusters
that are already on spec 3, future upgrades will proceed as usual, much
like what we have in spec 2.

### Graduation Criteria

This is a high-risk change. Success of this change will require extensive
testing during upgrades. UPI clusters are especially at risk, since there
are potentially situations we cannot reconcile with spec translations.
Some of the exact details need further fleshing out during implementation,
and some will potentially not be feasible. Existing user workflows will be
disrupted, so communication of these changes will also be very important.

## Infrastructure Needed [optional]

None extra
