Conversation
@@ -343,7 +343,9 @@ func createNS(r *ndctl.Region, percentage uint) error {
	Size: canUse,
	Align: align,
})
return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
if err != nil {
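For readers skimming the conversation, a rough sketch of what the fixed error handling amounts to. It is reconstructed from the diff hunk above, not copied from the actual source, so the exact surrounding lines are an assumption; the important part is that the error is only returned when CreateNamespace actually failed.

```go
// Sketch, assuming CreateNamespace returns its error into err as suggested
// by the diff above.
_, err := r.CreateNamespace(ndctl.CreateNamespaceOpts{
	Size:  canUse,
	Align: align,
})
if err != nil {
	// Before the fix, this return statement executed unconditionally, so the
	// driver aborted its startup even after the namespace had been created
	// successfully.
	return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v",
		canUse, r.DeviceName(), err)
}
```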
I am surprised how this state of the code ever worked, as it always returns an error... I am quite sure I wrote that if statement there when this code was developed. But it seems that between commits 5365a34 and 0748a91 the whole function was moved, leaving those (quite important) if-lines behind. Just curious, what was the reasoning for leaving out the if-part? And how did the testing pass then? We don't have tests for that part?
That means the current fix is not new code; it actually restores code that was there originally.
You are right @okartau, it was unintentionally introduced by me while moving the code as part of removing the init container from the driver.
And how did the testing pass then? We don't have tests for that part?
We do not have tests to cover this part of the code. Current tests cover only create, delete, and list devices on a pre-populated logical volume group.
We do go through this code once when installing a driver in LVM mode for the first time. The error return then causes a restart of the container. The next instance of the container then gets past this error because it uses the then-existing namespace instead of entering this faulty branch again.
The new check for "no restarts" failed when this line wasn't fixed and passed once it was.
gomega.Expect(len(pods.Items)).Should(gomega.BeNumerically(">", 0), "should have PMEM-CSI pods")
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if containerStatus.RestartCount > 0 {
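The quoted hunk is cut off at the if statement. A plausible completion of the check, assuming the Kubernetes e2e framework helpers (k8s.io/kubernetes/test/e2e/framework) used elsewhere in the suite, would fail the test on the first restarted container, which matches the failure message seen in the log further down:

```go
// Hypothetical completion of the restart check: fail the test as soon as any
// container in a PMEM-CSI pod reports a restart.
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if containerStatus.RestartCount > 0 {
			framework.Failf("container %q in pod %q restarted %d times, last state: %+v",
				containerStatus.Name, pod.Name,
				containerStatus.RestartCount, containerStatus.LastTerminationState)
		}
	}
}
```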
Maybe we should relax this condition and limit it to only our driver container (see the sketch after the log below). Test failures also showed that the driver-registrar
container restarts; I am not sure if it exits after a timeout?:
[2020-08-24T22:19:20.745Z] [BeforeEach] lvm-testing
[2020-08-24T22:19:20.745Z] /mnt/workspace/pmem-csi_PR-714/test/e2e/deploy/deploy.go:599
[2020-08-24T22:19:20.745Z] STEP: preparing for test "lvm-testing TLS controller is secure" in namespace default
[2020-08-24T22:19:20.745Z] Aug 24 22:19:20.469: INFO: No external IP address on nodes, falling back to internal IPs
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: reusing existing lvm-testing PMEM-CSI components
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: Waiting for PMEM-CSI driver.
[2020-08-24T22:19:21.540Z] pmem-csi-controller-0/[email protected]: I0824 22:19:21.379354 1 pmem-csi-driver.go:419] HTTP request: GET "/metrics" from 10.32.0.1:11586 Go-http-client/1.1
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.390: INFO: Done with waiting, PMEM-CSI driver v0.5.0-rc1-728-g149ec0c1 is ready.
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.396: FAIL: container "driver-registrar" in pod "pmem-csi-node-bb7nz" restarted 1 times, last state: {Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2020-08-24 21:54:06 +0000 UTC,FinishedAt:2020-08-24 22:17:28 +0000 UTC,ContainerID:containerd://3c29e4b8653ef6d8fcfdf77d122151c28b6f2d22d751a7c0e220891fcf9179f1,}}
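If the check were relaxed as suggested, it might look roughly like this. The driver container name used here is an assumption for illustration, not taken from the actual deployment:

```go
// Hypothetical relaxation: only flag restarts of the driver container itself
// and ignore sidecars such as driver-registrar. "pmem-driver" is an assumed
// container name.
for _, containerStatus := range pod.Status.ContainerStatuses {
	if containerStatus.Name != "pmem-driver" {
		continue
	}
	if containerStatus.RestartCount > 0 {
		framework.Failf("driver container in pod %q restarted %d times",
			pod.Name, containerStatus.RestartCount)
	}
}
```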
Or we fix this problem. If it's because of a timeout, then we may be able to increase that timeout.
I think it's much simpler: that happened during the reboot tests, so of course all containers were restarted...
I'm not sure yet how to deal with this. The check is useful because it highlights a problem that went unnoticed without it for a while. Perhaps clean up after the reboot tests by killing all pods?
In this case, checking if the pod container restart count == node reboot count might work.
I've implemented the "delete pods" approach.
Checking for node reboots is fragile and more complex. Is there even a counter for node restarts?
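For reference, a rough client-go sketch of what the "delete pods" cleanup could look like; the namespace and label selector here are placeholders for illustration, not the values used by the actual deployment:

```go
// Hypothetical cleanup helper: delete all PMEM-CSI pods after a reboot test so
// that their restart counters start again from zero once the pods are recreated.
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deletePMEMCSIPods(ctx context.Context, c kubernetes.Interface) error {
	return c.CoreV1().Pods("default").DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app=pmem-csi"},
	)
}
```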
@avalluri okay now? Tests have passed.
Yes, we can merge.
Unconditionally returning an error after a successful namespace creation caused the driver container to restart. The restarted driver then finds the existing namespace and continues. Symptom:

    State:          Running
      Started:      Mon, 24 Aug 2020 09:44:08 +0200
    Last State:     Terminated
      Reason:       Completed
      Message:      failed to run driver: failed to create PMEM namespace with size '33285996544' in region 'region0': <nil>
      Exit Code:    0
      Started:      Mon, 24 Aug 2020 09:44:03 +0200
      Finished:     Mon, 24 Aug 2020 09:44:03 +0200
    Ready:          True
    Restart Count:  1

To detect this and future issues like it, E2E testing now checks the driver for unexpected restarts. In tests that intentionally restart pods, the pods must be deleted after the test to reset the restart counter.
Creating a namespace will only work for regions of the right type. If we attempted to create the namespace anyway, the call would fail and the entire driver startup would be aborted, only to fail again on the next attempt. There's no test for this because we have no way to simulate different PMEM configurations.
- Avoid the "ctx" variable name - normally used for a standard context, not consistent with other PMEM-CSI code.
- Better names and documentation for functions - they do not always create something.
- Avoid the magic constant "2" when constructing the command line - no need to increase that number when changing parameters, easier to read.
- ctx.Free() at the end of the loops would not get called when returning from inside the loops (see the sketch below).
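The last point is worth a small illustration. This is a generic, self-contained sketch of the pattern, using a hypothetical resource type rather than the actual ndctl wrapper: cleanup placed after a loop is skipped on early returns, whereas a deferred call runs on every return path.

```go
package main

import "fmt"

// resource stands in for the ndctl context; the type and its methods are
// hypothetical and exist only to illustrate the cleanup pattern.
type resource struct{}

func acquire() (*resource, error) { return &resource{}, nil }
func (r *resource) Free()         { fmt.Println("freed") }
func (r *resource) Handle(s string) error {
	if s == "" {
		return fmt.Errorf("empty item")
	}
	return nil
}

// process defers the Free call, so an early return from inside the loop can
// no longer leak the resource - unlike calling Free only after the loop.
func process(items []string) error {
	res, err := acquire()
	if err != nil {
		return err
	}
	defer res.Free()

	for _, item := range items {
		if err := res.Handle(item); err != nil {
			return fmt.Errorf("handling %q: %v", item, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(process([]string{"a", ""})) // prints "freed", then the error
}
```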
This doesn't need to be back-ported, 0.7.x still used the init container approach.
A collection of bug fixes and code cleanup for LVM mode. See commit messages for details.