Skip to content
This repository was archived by the owner on Oct 22, 2024. It is now read-only.

DeviceManager/lvm: Remove device from cache on delete #408

Merged
merged 1 commit into from
Sep 19, 2019

Conversation

avalluri
Copy link
Contributor

LVM device manager caches the devices that it creates, but not cleaning up those
device references on delete. This change fixes this issue.

@okartau
Copy link
Contributor

okartau commented Sep 18, 2019

caches the devices that it creates, but not cleaning up those

yes, I noticed this when idempotency-enabled testing in #393 started to fail after uuid->hash change. Problem remained hidden before because we always used fresh uuid so old entries did not surface in checking. With repeatable hash as VolumeID, we hit the previous entry in the cache.

But about LV entry cache: do we really need this? Is it meant for performance?
We have state of volumes now (at least) at 3 levels: Controller has set Volume registry, Node has own, and LVM layer has one more of it's own. With every such state layer, we increase the risk that something gets out of sync. PMEM-CSI driver is meant for long run and bad things may happen, so we should minimize risk of getting states out of sync.
I don't think we need to value perf. gain of cache over reliability risks.
Also, similar layer on ndctl side does not keep track of volumes separately.

@avalluri
Copy link
Contributor Author

avalluri commented Sep 18, 2019

Is it meant for performance?

Yes, When doing this change I felt like, almost every operation in LVM mode has the need to fetch the current list of devices from lvm. So instead of parsing lvm output on every call, cache the device info is better.

We have state of volumes now (at least) at 3 levels: Controller has set Volume registry, Node has own,

Information at the Controller is different from Node.

and LVM layer has one more of it's own. With every such state layer, we increase the risk that something gets out of sync.

I agree.

Also, similar layer on ndctl side does not keep track of volumes separately.

Because it's not executing ndctl binary, instead it's using library calls.

@okartau
Copy link
Contributor

okartau commented Sep 18, 2019

Information at the Controller is different from Node.

true, I was just counting where we create some data entries when a Volume is created.
In LVM mode:

  1. we have LVM volume itself
  2. we have cache entry in lvm.devices
  3. we have entry in Node Volume registry
  4. we have entry in Controller Volume registry

The players are Node, Controller, and k8s (and possibly even more, some sidecars).
With rapid RPC, messages sent, repeated, delayed, lost, or mutexes (which we also have at multiple players and layers) being waited, the PMEM-CSI will face so many combinations of possible errors that it is impossible to test everything and in long run those 4 entities may get out of sync.
Debugging those problems is difficult/impossible without detailed logs (which are not enabled in production)
That's why I am looking for ways to reduce the amount of state we maintain.
If some LVM operation takes some (minor) extra time, it is no problem whatsoever, as provisioning of Volume is not generally time-critical operation, right.

@pohly
Copy link
Contributor

pohly commented Sep 18, 2019

I don't think we need to value perf. gain of cache over reliability risks.

I fully agree.

@pohly
Copy link
Contributor

pohly commented Sep 18, 2019

Before we add complex code for the sake of performance we have to have:

  • a need to improve performance, i.e. the operation is known to be too slow
  • tests that measure performance and detect regressions

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clean up the Go style before merging?

@@ -156,12 +156,18 @@ func (lvm *pmemLvm) DeleteDevice(name string, flush bool) error {
if err != nil {
return err
}
if err := ClearDevice(device, flush); err != nil {
if err = ClearDevice(device, flush); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use err :=, it's better Go style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return err
}

_, err = pmemexec.RunCommand("lvremove", "-fy", device.Path)
return err
if _, err = pmemexec.RunCommand("lvremove", "-fy", device.Path); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

LVM device manager caches the devices that it creates, but not cleaning up those
device references on delete. This change fixes this issue.
@pohly pohly merged commit b22cf8d into intel:devel Sep 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants