This repository was archived by the owner on Oct 22, 2024. It is now read-only.

LVM fixes #714

Merged

merged 3 commits into intel:devel on Aug 25, 2020

Conversation

@pohly (Contributor) commented Aug 24, 2020

A collection of bug fixes and code cleanup for LVM mode. See commit messages for details.

@pohly pohly added the 0.7 needs to be fixed in 0.7.x label Aug 24, 2020
```diff
@@ -343,7 +343,9 @@ func createNS(r *ndctl.Region, percentage uint) error {
 		Size:  canUse,
 		Align: align,
 	})
-	return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
+	if err != nil {
+		return fmt.Errorf("failed to create PMEM namespace with size '%d' in region '%s': %v", canUse, r.DeviceName(), err)
+	}
```
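To illustrate why the missing guard mattered: with an unconditional return, even a successful create reports an error, and the nil `err` formats as `<nil>`, which matches the "...: `<nil>`" message quoted later in the commit description. A minimal, self-contained sketch (not the actual driver code):

```go
package main

import (
	"errors"
	"fmt"
)

// create mimics the buggy pre-fix flow: the error is returned
// unconditionally, even when the operation succeeded (err == nil).
func create(fail bool) error {
	var err error
	if fail {
		err = errors.New("out of space")
	}
	// BUG: no "if err != nil" guard, so a nil err is formatted as "<nil>".
	return fmt.Errorf("failed to create PMEM namespace: %v", err)
}

func main() {
	fmt.Println(create(false)) // failed to create PMEM namespace: <nil>
	fmt.Println(create(true))  // failed to create PMEM namespace: out of space
}
```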
Contributor

I am surprised that this state of the code ever worked, as it always returns an error. I am quite sure I wrote that if statement when this code was developed. But it seems that between commits 5365a34 and 0748a91 the whole function was moved, leaving those (quite important) if lines behind. Just curious, what was the reasoning for leaving out the if part? And how did the testing pass? Don't we have tests for that part?

Contributor

That means the current fix is not new code; it actually restores code that was there originally.

Contributor

You are right @okartau, it was unintentionally introduced by me while moving the code as part of removing the init container from the driver.

Contributor

> And how did the testing pass? Don't we have tests for that part?

We do not have tests covering this part of the code. The current tests only cover creating, deleting, and listing devices on a pre-populated logical volume group.

Contributor Author

We do go through this code once when installing the driver in LVM mode for the first time. The error return then causes a restart of the container. The next instance of the container then gets past this error because it uses the namespace that now exists instead of entering this faulty branch again.

The new check for "no restarts" failed when this line wasn't fixed and passed once it was.

```go
gomega.Expect(len(pods.Items)).Should(gomega.BeNumerically(">", 0), "should have PMEM-CSI pods")
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if containerStatus.RestartCount > 0 {
```
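The quoted check is cut off mid-statement. A self-contained sketch of the same restart-detection logic, with minimal stand-in types in place of the real Kubernetes and Gomega APIs (all type and field names here are assumptions for illustration, not the driver's actual test code):

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types used by the real check.
type ContainerStatus struct {
	Name         string
	RestartCount int32
}

type Pod struct {
	Name              string
	ContainerStatuses []ContainerStatus
}

// findRestartedContainers reports every container that restarted at least
// once, mirroring the E2E "no restarts" check quoted above.
func findRestartedContainers(pods []Pod) []string {
	var restarted []string
	for _, pod := range pods {
		for _, cs := range pod.ContainerStatuses {
			if cs.RestartCount > 0 {
				restarted = append(restarted, fmt.Sprintf(
					"container %q in pod %q restarted %d times",
					cs.Name, pod.Name, cs.RestartCount))
			}
		}
	}
	return restarted
}

func main() {
	pods := []Pod{
		{Name: "pmem-csi-node-bb7nz", ContainerStatuses: []ContainerStatus{
			{Name: "driver-registrar", RestartCount: 1},
			{Name: "pmem-driver", RestartCount: 0},
		}},
	}
	for _, msg := range findRestartedContainers(pods) {
		fmt.Println(msg)
	}
}
```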
Contributor

Maybe we should relax this condition to check only our own driver container. Test failures also showed that the driver-registrar container restarts; I am not sure if it exits after a timeout?:

[2020-08-24T22:19:20.745Z] [BeforeEach] lvm-testing
[2020-08-24T22:19:20.745Z]   /mnt/workspace/pmem-csi_PR-714/test/e2e/deploy/deploy.go:599
[2020-08-24T22:19:20.745Z] STEP: preparing for test "lvm-testing TLS controller is secure" in namespace default
[2020-08-24T22:19:20.745Z] Aug 24 22:19:20.469: INFO: No external IP address on nodes, falling back to internal IPs
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: reusing existing lvm-testing PMEM-CSI components
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.361: INFO: Waiting for PMEM-CSI driver.
[2020-08-24T22:19:21.540Z] pmem-csi-controller-0/[email protected]: I0824 22:19:21.379354       1 pmem-csi-driver.go:419] HTTP request: GET "/metrics" from 10.32.0.1:11586 Go-http-client/1.1
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.390: INFO: Done with waiting, PMEM-CSI driver v0.5.0-rc1-728-g149ec0c1 is ready.
[2020-08-24T22:19:21.540Z] Aug 24 22:19:21.396: FAIL: container "driver-registrar" in pod "pmem-csi-node-bb7nz" restarted 1 times, last state: {Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2020-08-24 21:54:06 +0000 UTC,FinishedAt:2020-08-24 22:17:28 +0000 UTC,ContainerID:containerd://3c29e4b8653ef6d8fcfdf77d122151c28b6f2d22d751a7c0e220891fcf9179f1,}}

Contributor Author

Or we fix this problem. If it's because of a timeout, then we may be able to increase that timeout.

Contributor Author

I think it's much simpler: that happened during the reboot tests, so of course all containers were restarted...

I'm not sure yet how to deal with this. The check is useful because it highlights a problem that had gone unnoticed for a while without it. Perhaps clean up after the reboot tests by killing all pods?

Contributor

In this case, checking whether the pod's container restart count == node reboot count might work.

Contributor Author

@pohly commented Aug 25, 2020

I've implemented the "delete pods" approach.

Checking for node reboots is fragile and more complex. Is there even a counter for node restarts?

Contributor Author

@avalluri okay now? Tests have passed.

Contributor

Yes, we can merge.

pohly added 3 commits August 25, 2020 11:47
Unconditionally returning an error after a successful namespace
creation caused the driver to restart. The restarted driver then
finds the existing namespace and continues.

Symptom:

    State:          Running
      Started:      Mon, 24 Aug 2020 09:44:08 +0200
    Last State:     Terminated
      Reason:       Completed
      Message:      failed to run driver: failed to create PMEM namespace with size '33285996544' in region 'region0': <nil>
      Exit Code:    0
      Started:      Mon, 24 Aug 2020 09:44:03 +0200
      Finished:     Mon, 24 Aug 2020 09:44:03 +0200
    Ready:          True
    Restart Count:  1

To detect this and future issues like it, E2E testing now checks the
driver for unexpected restarts. In tests that intentionally restart
pods, the pods must be deleted after the test to reset the restart
counter.
Creating a namespace will only work for regions of the right type.
If we attempted to create the namespace anyway, the function would
fail and the entire driver startup would be aborted, only to fail
again on the next try.

There's no test for this because we have no way to simulate different
PMEM configurations.
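The guard described in this commit message could look roughly like the following sketch; the `Region` type and its fields are stand-ins for the real ndctl bindings (assumptions for illustration, not the actual code):

```go
package main

import "fmt"

// Region is a minimal stand-in for an ndctl region (field names assumed).
type Region struct {
	DeviceName string
	Type       string // e.g. "pmem" vs. some other region type
}

// usableRegions filters out regions whose type cannot host the namespaces
// we need, so that driver startup does not abort on an impossible create.
func usableRegions(regions []Region, wanted string) []Region {
	var result []Region
	for _, r := range regions {
		if r.Type != wanted {
			fmt.Printf("skipping region %s: type %q, need %q\n",
				r.DeviceName, r.Type, wanted)
			continue
		}
		result = append(result, r)
	}
	return result
}

func main() {
	regions := []Region{
		{DeviceName: "region0", Type: "pmem"},
		{DeviceName: "region1", Type: "blk"},
	}
	fmt.Println(len(usableRegions(regions, "pmem"))) // prints 1
}
```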
- Avoid the "ctx" variable name: it is normally used for a standard
  context, which is not consistent with other PMEM-CSI code.
- Better names and documentation for functions: they do not always
  create something.
- Avoid the magic constant "2" when constructing the command line: no
  need to increase that number when changing parameters, and it is
  easier to read.
- ctx.Free() at the end of the loops would not get called when
  returning from inside the loops.
@avalluri avalluri merged commit acb46b3 into intel:devel Aug 25, 2020
@pohly (Contributor Author) commented Aug 27, 2020

This doesn't need to be back-ported; 0.7.x still used the init-container approach.

@pohly pohly removed the 0.7 needs to be fixed in 0.7.x label Aug 27, 2020