Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support regions without m4 machines #1163

Merged
merged 2 commits into from
Feb 22, 2019

Conversation

eparis
Copy link
Member

@eparis eparis commented Jan 31, 2019

This does not dynamically pick machine types for the bootstrap node. Instead it just uses i3.large. i3 is the only machine type in all regions of AWS. It costs a little extra (about $0.05/hour) but we only run it for like 20 minutes.

Fixes: #1127

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 31, 2019
@eparis eparis changed the title WIP: Non m4 machines Support regions without m4 machines Jan 31, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2019
@wking
Copy link
Member

wking commented Jan 31, 2019

I've pushed https://github.com/wking/openshift-installer.git fixup-bootstrap-default-type (c3d322c) with the code needed to get the bootstrap type from Go through Terraform to its eventual resource consumer. We still need to update the Go to feed in the dynamic default, but with the logic internal to pkg/asset/machines and no bootstrap machine object coming out, that's difficult at the moment. I don't think we want an install-config knob for this if we can avoid it, so how about a Bootstrap asset in pkg/asset/machines that just holds the dynamic value?

var (
_ asset.Asset = (*Master)(nil)

defaultMachineClass = map[string]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws specific code should live in pkg/asset/machines/aws

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also WDYT using m4 by default and overriding only some regions, so that we only maintain override list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws specific code should live in pkg/asset/machines/aws

This and some other feedback attached to eparis/installer@bc58267. I dunno why @eparis didn't just push that to this PR for feedback ;).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your hope was that I only keep non-m4 regions in this map instead of every region? I can probably do that, and fix CI.

}
return true
}
err := svc.GetProductsPages(input, handlePage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our unit tests are going to get rate limited :P

Copy link
Member

@wking wking Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our unit tests are going to get rate limited :P

We shouldn't be running that many installer PRs in CI at once, should we? Are product rate limits especially tight?

@wking
Copy link
Member

wking commented Feb 1, 2019

Vendor conflict; needs a rebase.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 1, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 1, 2019
@eparis
Copy link
Member Author

eparis commented Feb 1, 2019

pushed my second stab at this

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2019
@wking
Copy link
Member

wking commented Feb 14, 2019

I've picked this up and pushed ebd0559 -> 1275b04 to:

  • Rebase onto master and resolve conflicts.
  • Drop the m4 entries from the map and only keep entries for regions which needed an override.
  • Added additional pricing filters to address rate-limiting concerns (and we don't need to check for instance types that support Windows, etc., anyway).
  • Drop the complete regionMap from the unit tests, instead relying on validation.Regions for this information (although AWS doesn't seem to be particularly consistent about their display names, so I needed two special-cases).
  • Drop the InstanceClasses type. Using maps directly seemed about as expressive and saves readers from the mental overhead of absorbing a new type.
  • Added t.Run subtests to make it easier to isolate errors when things go wrong for a given region (this also lets us be a bit more generous with t.Fatal calls).

@wking
Copy link
Member

wking commented Feb 14, 2019

unit:

�[0G�[2K�[1;92m? �[0m�[1;99mAWS Access Key ID �[0m�[36m[? for help]�[0m �[?25l�7�[999;999f�[6n�[6n--- FAIL: TestGetDefaultInstanceClass (0.00s)
	platform_test.go:18: EOF
FAIL

Looks like our CI user lacks pricing:GetProducts. Or we're running the unit tests without AWS credentials at all, since an earlier unit job has:

--- FAIL: TestDefaultRegionInstanceTypeMap (0.01s)
	master_test.go:123: Error fetching from AWS: NoCredentialProviders: no valid providers in chain. Deprecated.
			For verbose messaging see aws.Config.CredentialsChainVerboseErrors
FAIL

Do we want to set AWS credentials for unit tests? Or should we skip this test when we don't have credentials? If the latter, is the idea that maintainers would periodically run the test manually?

@wking
Copy link
Member

wking commented Feb 14, 2019

I've pushed 147e63d -> f9a381e, pushing the test out into a new tests/aws directory where folks can run it manually (so we don't have to give our usual unit tests AWS creds). CC @staebler

@wking wking force-pushed the non-m4-machines branch 2 times, most recently from d9a52d5 to 3fb0ff6 Compare February 15, 2019 07:53
@wking
Copy link
Member

wking commented Feb 15, 2019

I've pushed f9a381e -> 3fb0ff6, rebasing onto master again, and moving tests/aws -> t/aws so dep ensure will find the tests and add their dependencies to the project vendor/. More on why in the 71e7f8580acf commit message.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2019
@eparis
Copy link
Member Author

eparis commented Feb 16, 2019

the AWS bulk pricing api (instead of being able to filter) does not require creds, though then we can't use the aws library and have to just get the raw url. Probably could decode it into the aws library types, but they aren't actually very helpful...

"eu-west-1": "Ireland",
"eu-west-2": "London",
"eu-west-3": "Paris",
"sa-east-1": "São Paulo",
"us-east-1": "N. Virginia",
"us-east-2": "Ohio",
"us-gov-east-1": "AWS GovCloud (US-East)",
"us-gov-west-1": "AWS GovCloud (US-West)",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You sure about this? The pricing API calls this AWS GovCloud (US) I think...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pricing API calls this AWS GovCloud (US) I think...

It does, but I have an entry for decoding that here. They use US-West in other places, and now that US-East exists, I think the qualified name makes more sense here.

@wking
Copy link
Member

wking commented Feb 16, 2019

the AWS bulk pricing api (instead of being able to filter) does not require creds, though then we can't use the aws library and have to just get the raw url.

I'm fine with a transition here, but am also fine punting that to follow-up work.

@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2019
@crawford
Copy link
Contributor

From the logs:

level=error msg="Error: Required variable not set: aws_master_root_volume_iops"

I've never seen that one before.

From Trevor:

For regions like eu-west-3 (Paris) that don't support m4.  The only
instance type available in all AWS regions today is i3.  We prefer m4
because of the higher EBS attach limit.  But for regions that don't
support m4 we use m5.

This also adds a CI job which checks the AWS pricing API to keep our
region map up to date.

The eu-north-1 region is new as of 2018-12-12 [1], and it show up in:

  $ aws ec2 describe-regions --query 'Regions[].RegionName' --output json | jq -r 'sort[]'
  ...
  eu-central-1
  eu-north-1
  eu-west-1
  ...

The GovCloud entries didn't show up in that query (maybe they are only
listed for accounts that have access?), but they are listed on [2].

The new test is in platformtests/aws to keep it out of our default
unit tests.  Maintainers can run it manually to verify that the
default fallbacks we hard-code are still appropriate for the current
AWS offerings.  It's not in tests/aws, because 'dep ensure' seems to
ignore tests.  I think it's fine to have both platformtests and tests
in parallel, with the former holding tests who share the project
vendor/ and the latter holding tests which have a distinct vendor/.
bdd-smoke is in tests with its own vendor/ intentionally to keep
gomega and other test-specific libraries out of the project vendor/
[3].  But for these tests we're using Go's built-in test framework, so
it's less maintenance to layer our minimal additional dependencies
into the project vendor/.

[1]: https://aws.amazon.com/blogs/aws/now-open-aws-europe-stockholm-region/
[2]: https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
[3]: openshift#552 (comment)
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2019
@wking
Copy link
Member

wking commented Feb 19, 2019

level=error msg="Error: Required variable not set: aws_master_root_volume_iops"

Oops. Fixed with 3745dcf -> 354554d, fixing an omitempty that had snuck back in during a broken rebase (and rebasing again onto master).

@wking
Copy link
Member

wking commented Feb 19, 2019

e2e-aws:

Failing tests:

The bootstrap user should successfully login with password decoded from kubeadmin secret [Suite:openshift/conformance/parallel]
[Conformance][Area:Networking][Feature:Router] The HAProxy router should expose prometheus metrics for a route [Suite:openshift/conformance/parallel/minimal]

/retest

@wking
Copy link
Member

wking commented Feb 19, 2019

e2e-aws:

error: unable to read image registry.svc.ci.openshift.org/ci-op-hdsq08kg/stable@sha256:cd30230b8b0fbd2cfcca3040f99a376c86f5f8abca441fb85165d48d1e23ba97: received unexpected HTTP status: 504 Gateway Time-out

/retest

@droslean
Copy link
Member

/retest

@crawford
Copy link
Contributor

Same error as before:

error: unable to read image registry.svc.ci.openshift.org/ci-op-hdsq08kg/stable@sha256:b3f34b64b11c1554b47f9b455972ce8da79367a26bc12ee20da7e330a128d0de: received unexpected HTTP status: 504 Gateway Time-out

/retest

@crawford
Copy link
Contributor

Flaky tests:

[Feature:DeploymentConfig] deploymentconfigs when run iteratively [Conformance] should only deploy the last deployment [Suite:openshift/conformance/parallel/minimal]

Failing tests:

The bootstrap user should successfully login with password decoded from kubeadmin secret [Suite:openshift/conformance/parallel]

Sigh Blocked on openshift/origin#22070.

@crawford
Copy link
Contributor

/retest

@crawford
Copy link
Contributor

2019/02/20 01:57:48 Build src failed, printing logs:
E0220 01:57:58.864097      13 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".1584efd716f757a1", GenerateName:"", Namespace:"ci-op-4r3lp5bf", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-4r3lp5bf", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job pull-ci-openshift-installer-master-e2e-aws for PR https://github.com/openshift/installer/pull/1163 in namespace ci-op-4r3lp5bf from author eparis", Source:v1.EventSource{Component:"ci-op-4r3lp5bf", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf134ae9aff55ba1, ext:19501311214, loc:(*time.Location)(0x1a064c0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf134ae9aff55ba1, ext:19501311214, loc:(*time.Location)(0x1a064c0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "ci-op-4r3lp5bf" not found' (will not retry!)

will not retry!

We'll see about that...

/retest

@wking
Copy link
Member

wking commented Feb 20, 2019

e2e-aws:

level=fatal msg="failed to initialize the cluster: Cluster operator network has not yet reported success"

/retest

@crawford
Copy link
Contributor

I haven't seen this failure before:

failed to create imagestreamtag for input image: imagestreamtags.image.openshift.io "pipeline:base" is forbidden: unable to create new content in namespace ci-op-4r3lp5bf because it is being terminated

It's an issue with CI; not OpenShift.

/retest

@crawford
Copy link
Contributor

/retest

@droslean
Copy link
Member

/test e2e-aws

1 similar comment
@droslean
Copy link
Member

/test e2e-aws

@crawford
Copy link
Contributor

/retest

@abhinavdahiya
Copy link
Contributor

green ci :yay:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford, eparis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants