Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PV health monitor KEP #1077

Merged
merged 1 commit into from
Jan 29, 2020
Merged

Conversation

NickrenREN
Copy link
Contributor

@NickrenREN NickrenREN commented May 30, 2019

@k8s-ci-robot
Copy link
Contributor

Welcome @NickrenREN!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 30, 2019
@k8s-ci-robot k8s-ci-robot requested review from childsb and saad-ali May 30, 2019 08:35
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels May 30, 2019
@cdickmann
Copy link

This KEP is quite interesting to me, and I think this topic is quite important, so I am curious about your answers.

Some questions about the goals:
Q1: The goals mention Local PVs several times, but I felt the same health monitoring needs apply to any PV. Remote PVs can also suffer network partitions, temporary failure, permanent failure, IO issues leading to mount issues, etc. What do you think? Would it make sense to update the KEP to be less Local PV centric?
Q2: Do I understand correctly that the purpose of the health reporting is to allow for automated actions to be taken, e.g. by a database operator or Statefulset controller? In contrast to providing insights aimed at human users. it would be great to clarify this in the goals.

I also have some questions/suggestions on the concrete proposal:
Q3: The API change calls for 3 specific taints. Would this be extensible just like Node taints, i.e. would the use of custom taint keys be supported?
Q4: I was curious if you had considered a few more conditions, which I couldn’t map to the currently proposed conditions.

  • One would be “evacuation suggested”, which could either be because the underlying provider wishes to perform maintenance and will take back the volume, or because it was detected that the physical media is degrading, and so evacuation may still be possible, but performance may already be impacted.
  • Another would be to differentiate “Absent” and “Failed” underlying media when NotUsable it hit. My background is more bare metal software defined storage, and there it is a frequent situation that a physical drive is physically removed, as opposed to having failed. In network attached there is similarly the condition of “network partition” versus media failure. In my experience the distinction matters because the likelihood that a volume comes back from network disconnect is high, while coming back from “failure” is low (turns out it is not 0 though, because IO failures may be related to a bad firmware which a host reset can fix).
  • Finally, it may be valuable to expose that a volume is experiencing conditions of “reduced redundancy” or “temporarily reduced performance” due to physical media failure that is being automatically remediated. Applications could choose to direct less traffic to these volumes when possible.

@xing-yang
Copy link
Contributor

Hi @cdickmann, thanks for your comments! In addition to what @NickrenREN has replied to you on Slack, I want to add that I have an action item to add the CSI support in this KEP, which should cover the remote PV part. In general new features will only be added to support CSI drivers.

keps/sig-storage/20190530-pv-health-monitor.md Outdated Show resolved Hide resolved


## Implementation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case of a reserved volume: shared between multiple nodes, but RWO and this node does not own the reservation, do we need such a state? The volume could be functioning perfectly well, but is not available to this node at the moment.

Copy link
Contributor Author

@NickrenREN NickrenREN Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, if the volume itself is problematical, we should definitely taint it.
If the volume is shared between multiple nodes and RWX, and because of network problem or sth like that, it is not available to one specific node, i think we need to taint it, because it may cause data loss.
But if it is RWO, and the node does not own the reservation, IIUC, the volume is healthy and will not lead to data loss, as you mentioned above, it may lead to reduced performance.I prefer not tainting it at first.
It seems that we need to list the specific causes. I will rethink the extension mechanism, and validate if Taint is suitable here.

Copy link

@cdickmann cdickmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I had some thoughts especially on the CSI portion.

* Checks if volume still exists and/or is attached
* Checks if volume is still mounted and usable

Container Storage Interface (CSI) specification will be modified to add two RPCs ControllerCheckVolume and NodeCheckVolume. A monitor controller (external-monitor) that is responsible for watching the PersistentVolumeClaim, PersistentVolume, and VolumeAttachment API objects will be added. The external-monitor can be deployed as a sidecar container together with the CSI driver. For PVs, the external-monitor calls CSI driver’s monitor interface to check volume health status.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it necessary to deploy a new daemon and further add to complexity? Intuitively, it feels to me that this should be able to fit into existing daemons, as it just extends CSI slightly to cover more state. CSI already communicates multiple pieces of state in both directions, and this is a natural extension to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently for every new feature added in kubernetes-csi, a new sidecar container will be added accordingly. For example, we have external-provisioner, external-attacher, external-snapshotter, and external-resizer, etc. This way it is easier to maintain each component, but I do see your point that this adds more complexity as well. I'll think more about this.


Container Storage Interface (CSI) specification will be modified to add two RPCs ControllerCheckVolume and NodeCheckVolume. A monitor controller (external-monitor) that is responsible for watching the PersistentVolumeClaim, PersistentVolume, and VolumeAttachment API objects will be added. The external-monitor can be deployed as a sidecar container together with the CSI driver. For PVs, the external-monitor calls CSI driver’s monitor interface to check volume health status.

* ControllerCheckVolume RPC

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to define a poll model. Would it also be possible for a push mechanism? I would assume that many storage systems underlying the CSI would actively notice and react to issues, and be able to notify upper layers in an expedient way. Thus, it should be possible for the CSI to communicate this. This is similar to the SCSI device specs, where a physical drive or physical SCSI controller can raise a notification when it detects an issue, so that upper layers can take immediate action. Being expedient can be important for applications that depend on application level replication and needing to know when to kick in rebuilds to reduce downtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Let me think about this.

@NickrenREN
Copy link
Contributor Author

Hi, @xing-yang @cdickmann @richardelling
FYI, i updated the KEP, please take a look when you get a chance, thanks


The reactions needs further discussion, and is not in the scope of this doc.
- all the VolumeTaintEffects are NoEffect now at first, we may talk about the reactions later in another proposal;
- the taint Value is string now, it is theoretically possible that several errors are detected for one PV, we may extend the string to cover this situation: combine the errors together and splited by semicolon or other symbols.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only one key and values are in free text form, it will be difficult to add reactions later based on those information. Should we keep the keys such as "VolumeNotAttached" or "VolumeNotMounted" and define a few constant string values, i.e., VOLUME_DELETED, FILESYSTEM_CORRUPTED? Maybe allow user to define their own custom values too but those will always have "NoEffect".

- External controller calls ControllerCheckVolume() to check the health condition of PVs themselves, for example, if the PVs are deleted ...
- Move Attach checking here ?
- NodeCheckVolume RPC
- If volume is attached to this node, the external agent calls NodeCheckVolume() to see if volume is still attached; Move this check to ControllerCheckVolume ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep this here. In the case of multi-attach, it is easier to check on each node as discussed earlier.

@xing-yang xing-yang force-pushed the pv-health-monitor branch 4 times, most recently from 7b7bf42 to 54a0473 Compare November 4, 2019 16:30
// Identity information for a specific volume. This field is
// OPTIONAL. It can be used to list only a specific volume.
// ListVolumes will return with current volume information.
string volume_id = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new RPC GetVolume()

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 30, 2019
@xing-yang xing-yang mentioned this pull request Jan 5, 2020
9 tasks
@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 13, 2020
@kubernetes kubernetes deleted a comment from xing-yang Jan 27, 2020
Instead of adding a new RPC, we can leverage the existing NodeGetVolumeStats RPC.

```
rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull based method for fetching utilization makes sense, but for status changes (health) maybe we should consider push?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Push based method is brought up in the section "HTTP(RPC) Service". We didn't adopt that approach as the main proposal because CSI uses pull based method. I'll add some clarification there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a push model, can the node agent define some RPC service that CSI drivers can call out to?

@serathius can you point to any existing design that uses a push based model?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current CSI spec, I don't see a push model. Therefore the design is based on a pull model. We can ask about it at the CSI community meeting during the CSI spec review. If a push model is possible, we can come back to update the KEP later.

Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves.

#### Node failure event
Watch node failure events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is node failure event here? Can you give example?
Assuming that this is some k8s event on node object, does it mean that it will be coping events from Node to PVC?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to a node down event. So the monitoring controller will be checking if the node is still up and only report event when the node is down. I'll reword this section to clarify.

@xing-yang
Copy link
Contributor

Is network storage being handled differently than local volumes for this case?
How do you detect that a PVC for network storage is on that node?

Yes, it is handled differently. For local volumes, we can figure out what local volumes are on a particular node. So an event can be reported for every PVC affected. For network storage, we will only report a general node down event, not specific to any volume. Added clarifications.

@msau42
Copy link
Member

msau42 commented Jan 28, 2020

Is network storage being handled differently than local volumes for this case?
How do you detect that a PVC for network storage is on that node?

Yes, it is handled differently. For local volumes, we can figure out what local volumes are on a particular node. So an event can be reported for every PVC affected. For network storage, we will only report a general node down event, not specific to any volume. Added clarifications.

What object is the node down event reported on? Also how will you tell the difference between a local volume vs network volume?

@xing-yang
Copy link
Contributor

What object is the node down event reported on? Also how will you tell the difference between a local volume vs network volume?

For local storage, @NickrenREN said there is a way to tell what volumes are on that node. So this will be reported on the PVC objects.
For all other storage, we don't really have an object to report the event. So maybe we can only log a message. I'll drop the event part for this one.
In the external controller sidecar, we should know the driver name. Based on that we can tell whether it is local storage or not.


If the CSI driver has implemented the CSI volume health function proposed in this design document, Kubernetes could communicate with the CSI driver to retrieve any errors detected by the underlying storage system. Kubernetes can report an event and log an error about this PVC so that user can inquire this information and decide how to handle them. For example, if the volume is out of capacity, user can request a volume expansion to get more space. In the first phase, volume health monitoring is informational only as it is only reported in the events and logs. In the future, we will also look into how Kubernetes may use volume health information to automatically reconcile.

There could be conditions that cannot be reported by a CSI driver. There could be network failure where Kubernetes may not be able to get response from the CSI driver. In this case, a call to the CSI driver may time out. There could be network congestion which causes slow response. One or more nodes where the volume is attached to may be down. This can be monitored and detected by the volume health controller so that user knows what has happened.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A controller sidecar calling to a CSI driver goes through localhost. I don't think you can detect network errors that way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I said "conditions that cannot be reported by a CSI driver" on line 83 so I meant CSI driver cannot report a network failure. Maybe I should just remove it to avoid confusion.


The main architecture is shown below:

![pv health monitor architecture](./pv-health-monitor.png)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagram seems a little out of date after addressing the comments. Should we remove it for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure


The following areas will be the focus of this proposal at first:

- The health condition checking of volumes themselves. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal is not actually doing the health or mount checks right? The proposal is about providing a mechanism for plugins to report volume health/errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will modify this.

```

The following common error codes are proposed for volume health:
* VolumeNotFound
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can discuss the error codes in more detail during CSI spec review

Instead of adding a new RPC, we can leverage the existing NodeGetVolumeStats RPC.

```
rpc NodeGetVolumeStats (NodeGetVolumeStatsRequest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a push model, can the node agent define some RPC service that CSI drivers can call out to?

@serathius can you point to any existing design that uses a push based model?

#### Node down event
Watch node down events.
In the case that a node goes down, the controller will report an event for all local PVCs on that node.
For network storage in the case of a node failure, the controller will just log a general message, not specific to individual PVCs because the controller has no knowledge of what PVCs are on the affected node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there's a local volume CSI driver, I'm not sure how the controller will be able to tell the difference between network and local.

One possibility is to track which pods are using which PVCs and what nodes they got scheduled to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

### External controller

#### CSI interface
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunalbe. A configure option will be available in the external controller to adjust this value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: tunable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still see typo

The following areas will be the focus of this proposal at first:

- Provide a mechanism for CSI drivers to report volume health problems. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Mounting conditions checking.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the node agent actual check for mounts? Or is it the responsibility of the driver?

Maybe we should just say to provide a way for CSI drivers to report volume health problems at the controller and node levels.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, driver will check it. I'll reword this.


- Provide a mechanism for CSI drivers to report volume health problems. For example, whether the volume is deleted, whether the usage is reaching the threshold, and so on.
- Mounting conditions checking.
- Other errors that could affect the usability of the volume.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these "other errors" things that a CSI driver would not be able to detect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to errors the CSI drivers can detect. I'll reword.

### External controller

#### CSI interface
Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunalbe. A configure option will be available in the external controller to adjust this value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still see typo

Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunalbe. A configure option will be available in the external controller to adjust this value.

#### Node down event
* Watch node down events.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be watching Node status?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will watch node status. In addition I think we also need to ping the node as it could be an unplanned shut down.

* Volume health feature deployed in production and have gone through at least one K8s upgrade.

## Test Plan
### Unit tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to be more specific about the test cases, what error scenarios will we be testing? What drivers are we going to use to test?

Also, any plan for stress/scale tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@xing-yang xing-yang force-pushed the pv-health-monitor branch 2 times, most recently from 7333982 to 41ece7c Compare January 28, 2020 20:37
@xing-yang
Copy link
Contributor

@msau42 comments addressed. PTAL. Thanks.

Call GetVolume() RPC for volumes periodically to check the health condition of volumes themselves. The frequency of the check should be tunable. A configure option will be available in the external controller to adjust this value.

#### Node down event
* Watch node down events by checking node status and also pinging the node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already NodeLease objects that kubelet periodically updates (like a heartbeat). We could possibly use that, but even then, I'm still not sure why we cannot just rely on the NodeController to mark the Node as unhealthy when the NodeLease is stale. It seems like we're crossing over responsibilities here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@xing-yang
Copy link
Contributor

Squashed commits into 1.

@msau42
Copy link
Member

msau42 commented Jan 28, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2020
@xing-yang
Copy link
Contributor

/assign @saad-ali

Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Overall I like that this KEP has reduced scope to just creating events (inform user) no programmatic response. That makes it easier to implement. I'm still concerned about encoding error codes in CSI, but we can go over that in the CSI API review.

* DiskDegrading
* VolumeUnmounted
* RWIOError
* FilesystemCorruption
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerating the failure types as error codes in CSI makes me worry. Are we going to be able to capture all possible use cases for all possible storage systems? How will we handle backwards compat?

The main reason to have error codes (as opposed to opaque strings) is to enable programmatic response. Like the existing CSI error codes we should detail the expected recovery behavior for each error code that may help us reduce to a minimum set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let's discuss more in the CSI API review.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, NickrenREN, saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2020
@k8s-ci-robot k8s-ci-robot merged commit 24c0d15 into kubernetes:master Jan 29, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 29, 2020
@NickrenREN NickrenREN deleted the pv-health-monitor branch February 3, 2020 07:18
@blackgold
Copy link

will the volume health event be attached to the pod using the volume?

@NickrenREN
Copy link
Contributor Author

NickrenREN commented Jul 22, 2020

@blackgold yes, we are planning to do this, track issue: kubernetes-csi/external-health-monitor#10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.