This repository was archived by the owner on Oct 22, 2024. It is now read-only.

LVM fixes #714

Merged

merged 3 commits into intel:devel on Aug 25, 2020

Conversation

@pohly (Contributor) commented Aug 24, 2020

A collection of bug fixes and code cleanup for LVM mode. See commit messages for details.

@pohly pohly added the 0.7 needs to be fixed in 0.7.x label Aug 24, 2020
```diff
@@ -343,7 +343,9 @@ func createNS(r *ndctl.Region, percentage uint) error {
 		Size:  canUse,
 		Align: align,
 	})
-	return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
+	if err != nil {
+		return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
+	}
```
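To illustrate why the missing guard mattered: with an unconditional return, even a successful create reports an error, and the nil `err` formats as `<nil>`, which matches the "...: `<nil>`" message quoted later in the commit description. A minimal, self-contained sketch (not the actual driver code):

```go
package main

import (
	"errors"
	"fmt"
)

// create mimics the buggy pre-fix flow: the error is returned
// unconditionally, even when the operation succeeded (err == nil).
func create(fail bool) error {
	var err error
	if fail {
		err = errors.New("out of space")
	}
	// BUG: no "if err != nil" guard, so a nil err is formatted as "<nil>".
	return fmt.Errorf("failed to create PMEM namespace: %v", err)
}

func main() {
	fmt.Println(create(false)) // failed to create PMEM namespace: <nil>
	fmt.Println(create(true))  // failed to create PMEM namespace: out of space
}
```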
Contributor

I am surprised that this state of the code ever worked, as it always returns an error. I am quite sure I wrote that if statement when this code was developed. But it seems that between commits 5365a34 and 0748a91 the whole function was moved, leaving those (quite important) if lines behind. Just curious, what was the reasoning for leaving out the if part? And how did the testing pass? Don't we have tests for that part?

Contributor

That means the current fix is not new code; it actually restores code that was there originally.

Contributor

You are right @okartau, it was unintentionally introduced by me while moving the code as part of removing the init container from the driver.

Contributor

> And how did the testing pass? Don't we have tests for that part?

We do not have tests covering this part of the code. The current tests only cover creating, deleting, and listing devices on a pre-populated logical volume group.

Contributor Author

We do go through this code once when installing the driver in LVM mode for the first time. The error return then causes a restart of the container. The next instance of the container then gets past this error because it uses the namespace that now exists instead of entering this faulty branch again.

The new check for "no restarts" failed when this line wasn't fixed and passed once it was.

```go
gomega.Expect(len(pods.Items)).Should(gomega.BeNumerically(">", 0), "should have PMEM-CSI pods")
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if containerStatus.RestartCount > 0 {
```
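The quoted check is cut off mid-statement. A self-contained sketch of the same restart-detection logic, with minimal stand-in types in place of the real Kubernetes and Gomega APIs (all type and field names here are assumptions for illustration, not the driver's actual test code):

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types used by the real check.
type ContainerStatus struct {
	Name         string
	RestartCount int32
}

type Pod struct {
	Name              string
	ContainerStatuses []ContainerStatus
}

// findRestartedContainers reports every container that restarted at least
// once, mirroring the E2E "no restarts" check quoted above.
func findRestartedContainers(pods []Pod) []string {
	var restarted []string
	for _, pod := range pods {
		for _, cs := range pod.ContainerStatuses {
			if cs.RestartCount > 0 {
				restarted = append(restarted, fmt.Sprintf(
					"container %q in pod %q restarted %d times",
					cs.Name, pod.Name, cs.RestartCount))
			}
		}
	}
	return restarted
}

func main() {
	pods := []Pod{
		{Name: "pmem-csi-node-bb7nz", ContainerStatuses: []ContainerStatus{
			{Name: "driver-registrar", RestartCount: 1},
			{Name: "pmem-driver", RestartCount: 0},
		}},
	}
	for _, msg := range findRestartedContainers(pods) {
		fmt.Println(msg)
	}
}
```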
Contributor

Maybe we should relax this condition to check only our own driver container. Test failures also showed that the driver-registrar container restarts; I am not sure if it exits after a timeout?:

[2020-08-24T22:19:20.745Z] [BeforeEach] lvm-testing
[2020-08-24T22:19:20.745Z]   /mnt/workspace/pmem-csi_PR-714/test/e2e/deploy/deploy.go:599
[2020-08-24T22:19:20.745Z] STEP: preparing for test "lvm-testing TLS controller is secure" in namespace default
[2020-08-24T22:19:20.745Z] Aug 24 22:19:20.469: INFO: No external IP address on nodes, falling back to internal IPs
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: reusing existing lvm-testing PMEM-CSI components
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: Waiting for PMEM-CSI driver.
[2020-08-24T22:19:21.540Z] pmem-csi-controller-0/[email protected]: I0824 22:19:21.379354       1 pmem-csi-driver.go:419] HTTP request: GET "/metrics" from 10.32.0.1:11586 Go-http-client/1.1
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.390: INFO: Done with waiting, PMEM-CSI driver v0.5.0-rc1-728-g149ec0c1 is ready.
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.396: FAIL: container "driver-registrar" in pod "pmem-csi-node-bb7nz" restarted 1 times, last state: {Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2020-08-24 21:54:06 +0000 UTC,FinishedAt:2020-08-24 22:17:28 +0000 UTC,ContainerID:containerd://3c29e4b8653ef6d8fcfdf77d122151c28b6f2d22d751a7c0e220891fcf9179f1,}}

Contributor Author

Or we fix this problem. If it's because of a timeout, then we may be able to increase that timeout.

Contributor Author

I think it's much simpler: that happened during the reboot tests, so of course all containers were restarted...

I'm not sure yet how to deal with this. The check is useful because it highlights a problem that had gone unnoticed for a while without it. Perhaps clean up after the reboot tests by killing all pods?

Contributor

In this case, checking whether the pod's container restart count == node reboot count might work.

Contributor Author

@pohly commented Aug 25, 2020

I've implemented the "delete pods" approach.

Checking for node reboots is fragile and more complex. Is there even a counter for node restarts?

Contributor Author

@avalluri okay now? Tests have passed.

Contributor

Yes, we can merge.

pohly added 3 commits August 25, 2020 11:47
Unconditionally returning an error after a successful namespace
creation caused the driver to restart. The restarted driver then
finds the existing namespace and continues.

Symptom:

    State:          Running
      Started:      Mon, 24 Aug 2020 09:44:08 +0200
    Last State:     Terminated
      Reason:       Completed
      Message:      failed to run driver: failed to create PMEM namespace with size '33285996544' in region 'region0': <nil>
      Exit Code:    0
      Started:      Mon, 24 Aug 2020 09:44:03 +0200
      Finished:     Mon, 24 Aug 2020 09:44:03 +0200
    Ready:          True
    Restart Count:  1

To detect this and future issues like it, E2E testing now checks the
driver for unexpected restarts. In tests that intentionally restart
pods, the pods must be deleted after the test to reset the restart
counter.
Creating a namespace will only work for regions of the right type.
If we attempted to create the namespace anyway, the function would
fail and the entire driver startup would be aborted, only to fail
again on the next try.

There's no test for this because we have no way to simulate different
PMEM configurations.
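The guard described in this commit message could look roughly like the following sketch; the `Region` type and its fields are stand-ins for the real ndctl bindings (assumptions for illustration, not the actual code):

```go
package main

import "fmt"

// Region is a minimal stand-in for an ndctl region (field names assumed).
type Region struct {
	DeviceName string
	Type       string // e.g. "pmem" vs. some other region type
}

// usableRegions filters out regions whose type cannot host the namespaces
// we need, so that driver startup does not abort on an impossible create.
func usableRegions(regions []Region, wanted string) []Region {
	var result []Region
	for _, r := range regions {
		if r.Type != wanted {
			fmt.Printf("skipping region %s: type %q, need %q\n",
				r.DeviceName, r.Type, wanted)
			continue
		}
		result = append(result, r)
	}
	return result
}

func main() {
	regions := []Region{
		{DeviceName: "region0", Type: "pmem"},
		{DeviceName: "region1", Type: "blk"},
	}
	fmt.Println(len(usableRegions(regions, "pmem"))) // prints 1
}
```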
- Avoid the "ctx" variable name: it is normally used for a standard
  context, which is not consistent with other PMEM-CSI code.
- Better names and documentation for functions: they do not always
  create something.
- Avoid the magic constant "2" when constructing the command line: no
  need to increase that number when changing parameters, and it is
  easier to read.
- ctx.Free() at the end of the loops would not get called when
  returning from inside the loops.
@avalluri avalluri merged commit acb46b3 into intel:devel Aug 25, 2020
@pohly (Contributor Author) commented Aug 27, 2020

This doesn't need to be back-ported; 0.7.x still used the init-container approach.

@pohly pohly removed the 0.7 needs to be fixed in 0.7.x label Aug 27, 2020