
KEP-5018: DRA: AdminAccess for ResourceClaims and ResourceClaimTemplates #5019

Open
wants to merge 1 commit into base: master
Conversation

@ritazh ritazh commented Jan 3, 2025

  • One-line PR description: DRAAdminAccess: allow creation of ResourceClaims and ResourceClaimTemplates in privileged mode to grant access to devices that are in use by other users, for admin tasks like monitoring the health or status of a device.
  • Other comments:

NOTE: The DRAAdminAccess feature was initially part of https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters. In 1.32, the DRAAdminAccess feature gate was added to keep the adminAccess field in alpha while promoting structured parameters to beta. It was discussed that this feature should be part of a separate KEP to push it forward. Most references to DRAAdminAccess have been removed from the original KEP and moved to this KEP.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ritazh
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 3, 2025
//
// +required
// +optional
Member Author
should we update any stale API definitions in a separate PR so there isn't a suggestion that KEP-5018 is responsible for the change from required to optional?

Signed-off-by: Rita Zhang <[email protected]>
@kubernetes kubernetes deleted a comment from k8s-ci-robot Jan 3, 2025
@ritazh
Member Author

ritazh commented Jan 3, 2025

/assign @pohly

@ritazh
Member Author

ritazh commented Jan 3, 2025

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Jan 3, 2025
@ritazh
Member Author

ritazh commented Jan 6, 2025

/sig node


As the Dynamic Resource Allocation (DRA) feature evolves, cluster administrators require a privileged mode to grant access to devices already in use by other users. This feature, referred to as DRAAdminAccess, allows administrators to perform tasks such as monitoring device health or status while maintaining device security and integrity.

This KEP proposes a mechanism for cluster administrators to mark a request in a ResourceClaim or ResourceClaimTemplate with an admin access flag. This flag allows privileged access to devices, enabling administrative tasks without compromising security. Access to this mode is restricted to users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces marked with the DRA admin label, ensuring that non-administrative users cannot misuse this feature.


Is "without compromising security" too strong here? Is it more correct to say something like "This flag allows conditional, privileged access to devices. Conditional access to this mode is restricted..."


## Summary

As the Dynamic Resource Allocation (DRA) feature evolves, cluster administrators require a privileged mode to grant access to devices already in use by other users. This feature, referred to as DRAAdminAccess, allows administrators to perform tasks such as monitoring device health or status while maintaining device security and integrity.


should we distinguish between "administrators" and "regular users" here to clarify the security story? Like "This feature, referred to as DRAAdminAccess, allows administrators to perform tasks such as monitoring device health or status across all devices while ensuring that regular users only have access to run containers on the devices their workloads are scheduled onto."?


* Potential conflicts or misuse of shared hardware.

As the adoption of DRA expands, the lack of privileged administrative access becomes a bottleneck for cluster operations, particularly in shared environments where devices are critical resources.


"As the adoption of DRA expands, the inability of administrators to perform privileged device introspection becomes a bottleneck for cluster operations,..."

Something like the above gets rid of the word "access", which may be confusing ("lack of privileged administrative access" is usually something we are trying to ensure! 😄)

resourceClassName: admin-resource-class
adminAccess: true
```
1. Namespace Label for DRA Admin Mode:


As specified (thus far, I haven't yet read the whole doc 😛) it seems that labeled-namespace creation must happen before the DRA resources can refer to `adminAccess: true`. Should we put this step first in this overview to reflect its serial place in the required order?


I see in the workflow doc below that you describe creating the namespace as the first step, so this comment is less important.


1. Grants privileged access to the requested device:

For requests with `adminAccess: true`, the DRA controller bypasses standard allocation checks and allows administrators to access devices already in use. This ensures privileged tasks like monitoring or diagnostics can be performed without disrupting existing allocations. The controller also logs and audits admin-access requests for security and traceability.


Is there an assumption that monitoring/diagnostics processes are already running on the underlying host OS, and so any CPU/memory allocation can be guaranteed? And/or do we assume that any new operational headroom required by these privileged tasks is already pre-accounted for, with no chance for container allocation to take up 100% of the available headroom of a node's host OS?


1. No impact on availability of claims:

The scheduler ignores claims with `adminAccess: true`; normal usage is not impacted, as claims in other namespaces can still be allocated using the same devices that are also accessed by workloads in the admin namespace.


Does this assume that privileged tasks are either (1) not using the specialized hardware (e.g., GPU) or (2) are using specialized hardware in a cooperative way that is non-invasive to assumptions that kubernetes workloads have? For example, if my workload declares expression: "device.attributes['gpu.nvidia.com'].profile == '1g.10gb'" then I expect to be able to use the entire 10GB of that single GPU. What happens if my k8s workload container needs all 10GB and the entire GPU processing while a privileged task simultaneously has access to that GPU+memory?


I also wonder if this assumption is something we can make generally across vendors and devices. I imagine some device eventually may come along that requires exclusive access (i.e. no other allocated claims) for some specific admin task. Could that work if an admin creates two requests in a claim: where one has adminAccess: true and another that allocates the entire device?

Overall I think the idea that a ResourceClaim isn't claiming any resources in this case is a little confusing. I suppose it is "claiming" some administrative domain though, so is that something that could be represented in a ResourceSlice? e.g. What if a GPU DRA driver also declared a "metrics" device alongside partitions like MIGs, where that "metrics" device would be marked as requiring admin access somehow in the ResourceSlice? That might remove the need to treat admin access specially in some of the resource accounting changes here too.


1. A cluster administrator labels a namespace with `kubernetes.io/dra-admin-access`.

1. Authorized users create `ResourceClaim` or `ResourceClaimTemplate` objects with `adminAccess: true`.


I think we want to be clear that the ResourceClaim or ResourceClaimTemplate needs to be created in the admin namespace.


1. Authorized users create `ResourceClaim` or `ResourceClaimTemplate` objects with `adminAccess: true`.

1. Only users with access to the admin namespace can use them in their pod spec.


Is this clearer in spite of being longer?

"Only users with access to the admin namespace can reference these ResourceClaims or ResourceClaimTemplates in new pod or deployment specs."
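Putting the workflow steps above together, the objects involved might look roughly like the following (names, API version, and field layout are illustrative of the alpha API, not normative):

```yaml
# Step 1: a cluster administrator labels (or creates) the admin namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: dra-admin
  labels:
    kubernetes.io/dra-admin-access: "true"
---
# Step 2: an authorized user creates a claim in that namespace
# with adminAccess: true on the device request.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-health-monitor
  namespace: dra-admin
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
      adminAccess: true
```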

The `DRAAdminAccess` feature gate controls whether users can set the `adminAccess` field to
true when requesting devices. That is checked in the apiserver. In addition,
the scheduler will not allocate claims with admin access when the feature is
turned off, or if the field was set prior to the introduction of the feature gate (for example, set in 1.31 when it


I don't think I understand the "or if the field was set prior..." part.

}
}
if adminRequested {
logger.V(5).Info("ResourceClaim has admin access, bypassing standard allocation checks", "claim", klog.KRef(claim.Namespace, claim.Name))


this is probably not the place for a code review 😜, but I'd want to see this at v=2

```
### ResourceQuota
Requests asking for `adminAccess` contribute to the quota. In practice,


Can we be more assertive with our language here? Because we are describing something new, we don't know yet what folks are doing in practice. Can we say that we don't recommend enforcing resource quotas on DRA AdminAccess namespaces, given the unique way these workloads co-exist without competing for user resources?



### API Changes

Add `adminAccess` field to `DeviceRequest` which is part of `ResourceClaim` and `ResourceClaimTemplate`:

To clarify, isn't this field already a part of the API, albeit behind an alpha feature gate?

// omitted
```
In pkg/controller/resourceclaim/controller.go, process requests in `handleClaim` functino to prevent creation of

nit:

Suggested change
In pkg/controller/resourceclaim/controller.go, process requests in `handleClaim` functino to prevent creation of
In pkg/controller/resourceclaim/controller.go, process requests in `handleClaim` function to prevent creation of
