-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mass pruning tool for unused RHCOS AMIs in rh-dev #3373
Conversation
db3ce8f
to
a081589
Compare
cc930b9
to
d12a701
Compare
This is great! Is there any way to get graph data for 4.5 and prior? |
@mike-nguyen - Not that the data is gone, but thankfully I ran this before the release controller team archived older releases. You can see that the <= 4.5 releases are mentioned in the analysis I posted: https://drive.google.com/file/d/19BVPz7qn-K_ADO8WZKSQBgXHB-81-3ND/view?usp=share_link . I'll build off this cache if we run this. I'll tag AMIs that are preserved so we don't have to do any fancy work to remember this fact in the future. |
Would be good to add a top-level docstring documenting the approach of this tool. IIUC, it's crawling through the update graph and then collating the AMIs to keep by looking at each node and its associated I think it's a valid approach, but I think a better approach long-term is to prune at the cosa build level since it's not just AMIs that we want to prune. The code here could build a list of cosa build IDs we want to keep, and a GC process could prune all the cosa builds (and the referenced remote resources within, which include AMIs) which aren't on that list and older than X. The larger issue for that is tracked at coreos/fedora-coreos-tracker#99 and parts of the functionality exists in https://github.com/coreos/coreos-assembler/blob/main/src/cmd-remote-prune. As is, this code will invalidate AMIs referenced in the |
I looked at the file in my browser and didn't realize it only loaded partially until I scrolled down more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems legit
# No machine-os version information in the image. In old releases, use image info for the | ||
# machine-os-content image. | ||
print(' Querying for machine-os-content info') | ||
moc_pullspec = list(filter(lambda tag: tag['name'] == 'machine-os-content', release_info['references']['spec']['tags']))[0]['from']['name'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty soon we will cease to have a machine-os-content
. a refinement here would look up the container configured as primary in group.yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add that, but the method itself is a fallback. In current builds, the rhcos version is in the payload metadata and doesn't require inspecting the rhcos image: https://github.com/openshift/aos-cd-jobs/pull/3373/files#diff-a3f55c4ee0db478ae07381ad1394f41bc69ea3b33c5f606ace1e946afd082c27R87
# machine-os-content image. | ||
print(' Querying for machine-os-content info') | ||
moc_pullspec = list(filter(lambda tag: tag['name'] == 'machine-os-content', release_info['references']['spec']['tags']))[0]['from']['name'] | ||
if 'registry.svc' in moc_pullspec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. that wouldn't ever happen in a release image we promoted would it? doesn't hurt to have this as failsafe but seems unnecessary. likewise below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this because some ancient CI payload did. Nothing ART promoted.
if region_name not in installer_ami_ids: | ||
installer_ami_ids[region_name] = {} | ||
installer_ami_ids[region_name]['images'] = [] | ||
ami_id = region_entry['image'] | ||
installer_ami_ids[region_name]['images'].append(ami_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not important but i just prefer brevity
if region_name not in installer_ami_ids: | |
installer_ami_ids[region_name] = {} | |
installer_ami_ids[region_name]['images'] = [] | |
ami_id = region_entry['image'] | |
installer_ami_ids[region_name]['images'].append(ami_id) | |
ami_id = region_entry['image'] | |
installer_ami_ids.setdefault(region_name, {}).setdefault('images', []).append(ami_id) |
cache_entry = yaml.dump({ | ||
payload_key: rhcos_in_use[payload_key] | ||
}) | ||
# Write, line by line, to the cache file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor note that yaml.dump
includes a line break.
for a moment i was confused thinking each entry would be one line. perhaps better to say:
# Write, line by line, to the cache file | |
# Write to the cache file separated by a blank line |
if payload_key_references['payload_version'] in ami_info: | ||
print(f'Will not clean up RHCOS AMI {image_id} ({image_name}) since it is referenced by a release controller release payload {payload_key}') | ||
matches_payload_keys.append(payload_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually care about release payload references to AMIs? we only use AMIs at install time, right? and i'm pretty sure we don't reference AMIs anywhere besides build metadata and the installer so don't we only need to keep things that the installer refers to?
the difference isn't going to be huge though if you want to be cautious here and keep both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the difference isn't going to be huge though if you want to be cautious here and keep both.
I don't think we do need them. So, yes, this is just a cautionary step.
if matches_payload_keys: | ||
will_preserve.add(f'{aws_region}:{image_id}') | ||
|
||
if not matches_payload_keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not matches_payload_keys: | |
else: |
? :-)
if '.ci' in payload_version or '.nightly' in payload_version: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... we don't want to preserve
everything in a nightly for the indefinite future, but we do want to preserve it for long enough to become a promoted release. So should there also just be a blanket exemption for AMIs until they're old enough (say, a month)? Presumably garbage collection config would cover that, perhaps that's why it's not built in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - If we ran this again the future, older nightlies would have been garbage collected from the release controller. As written, this tool would not try to preserve them.
I was just working on some boot image bumps and realized that this tool could mark images for deletion that are in openshift/installer pull requests that have not merged. I don't think that should affect this PR but just something to be aware of when this is setup to run. With the new RHCOS pipeline, we won't be replicating to all regions unless we run the |
- AMIs less than 30 days old will not be considered for pruning. - Recycle bin enabling in all regions - Tags are now being applied to images to indicate whether they should be garbage collected or preserved.
d12a701
to
b257519
Compare
@mike-nguyen . In the latest push, I've excluded images created in the last 30 days. |
RHCOS suggested that an exhaustive scan of installer commits was safer than isolating AMIs associated with active releases.
e991e49
to
47bf15e
Compare
|
||
git_depth = 0 | ||
while True: | ||
installer_commit: Commit = f'upstream/master~{git_depth}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going down master
only. I think we need to go down from all the release branches too, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call. I wasn't considering post-cut changes. Updated.
# or older format like https://github.com/openshift/installer/blob/bfdba2d19e3fa1105bab93aa4cce8bb7f44b4b2c/data/data/rhcos.json | ||
amis_in_file = list(re.findall('ami-[a-z0-9]+', str(gitshow_process.stdout))) | ||
amis_in_use.update(amis_in_file) | ||
git_depth += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of visiting every commit on the branch, it'd be more efficient to ask git
for only the commits which touch the files we care about and then only visit those. Something like:
git log --format=%H <branch> -- data/data/coreos/rhcos.json data/data/rhcos.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it. Updated.
That latest push scans all branches and finds 2159 AMIs from the installer repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach looks sane to me! One thing I double-checked was whether we've used rhcos.json
even from 4.1.0, which we did.
LGTM to run |
No description provided.