Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Knative #1682

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

tombanksme
Copy link

This is a proof-of-concept pull request to add the canary release deployment strategy for Knative (#903). It's far from production ready but I wanted to see if there's appetite for this feature before I invest more time into completing it.

Currently Flagger can target a Knative service & complete the canary release process. I've not tested rollbacks yet although I think it should just work. The pull request needs some extra work to add test coverage; throw errors when using unsupported release processes & add Knative specific Kubernetes events.

Let me know what you think!

Example

The following canary will target a Knative Service. Once the canary has been initialised you can start the canary release process by creating a new revision of the service.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: example
spec:
  targetRef:
    apiVersion: serving.knative.dev/v1
    kind: Service
    name: example
  service:
    port: 3000
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5

@tombanksme tombanksme marked this pull request as ready for review July 15, 2024 12:40
@tombanksme tombanksme requested a review from stefanprodan as a code owner July 15, 2024 12:40
@edude03
Copy link

edude03 commented Sep 3, 2024

(I haven't used flagger / knative in awhile but) I'm very excited to see this taking shape!

@arubino
Copy link

arubino commented Oct 4, 2024

Hi @tombanksme! This looks great, are there any news on this PR?

@tombanksme
Copy link
Author

I haven't heard anything back from the Flagger maintainers. I would be happy to finish up the PR if it's something that fluxcd would consider merging.

@aryan9600
Copy link
Member

while i see the value added by this PR, is it not possible to use Gateway API as a bridge to get Flagger and Knative to work together: https://github.com/knative-extensions/net-gateway-api

@tombanksme
Copy link
Author

tombanksme commented Nov 27, 2024

Sorry for the delay. It looks like net-gateway-api isn't ready for production yet according to the readme

@kahirokunn
Copy link
Contributor

kahirokunn commented Dec 12, 2024

@aryan9600

I'll help craft a detailed technical response explaining why the Gateway API approach wouldn't work for Flagger-Knative integration:

Thank you for bringing this up. As someone familiar with Knative, I'd like to clarify why using Gateway API as a bridge between Flagger and Knative wouldn't be possible, and why the current PR approach is actually the correct solution.

The net-gateway-api project has a different purpose - it's not meant to be an control interface for Knative, but rather it allows Knative to use Gateway API as its final networking output. Let me explain how Knative's networking architecture works:

  1. Knative uses a strictly defined traffic flow architecture where:
  • Traffic control is managed through KService resources
  • The networking layer uses KIngress resources to configure the Ingress Gateway
  • The Ingress Gateway then routes requests either to the activator or directly to Knative Service Pods

image

  1. To integrate with Knative properly, external tools must interact through the KService API - this is the only supported entry point for managing Knative services and their traffic patterns.

  2. The net-gateway-api project's scope is specifically about allowing Knative to output Gateway API resources instead of other ingress types - it's about the implementation layer, not the control layer.

Therefore, while the suggestion to use Gateway API as a bridge is interesting, it wouldn't provide the deep integration needed for proper traffic management. The current PR's approach of integrating directly with KService is actually the only correct way to achieve this integration - there are no alternative approaches in the current Knative roadmap that would provide the same level of proper integration.

I've reviewed the implementation in the PR, and it aligns perfectly with Knative's architectural principles and control patterns.

Reference: https://knative.dev/docs/serving/architecture/#traffic-flow-and-dns

@aryan9600
Copy link
Member

i did a quick first pass and the approach looks good. just to confirm, both the user and flagger own the Knative Service object, with the former owning the workload configuration and the latter owning the traffic configuration?

furthermore, can you add some preliminary docs? or share the steps you took to test this change?

@kahirokunn
Copy link
Contributor

kahirokunn commented Jan 13, 2025

Yes, We share one Knative Service object.

Copy link

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Knative Serving Lead here 👋. Awesome to see a change like this 🎉 .

Just some comments but I'm not a flagger expert so I'll defer to others.

@@ -27,6 +27,7 @@ require (
k8s.io/client-go v0.30.1
k8s.io/code-generator v0.30.1
k8s.io/klog/v2 v2.120.1
knative.dev/serving v0.41.1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note:
v0.43.0 tag is out now (v1.16)
v0.44.0 tag will be out next week (v1.17)

Though the API definitions are stable so you should be fine not using the latest

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! 🥳

I'll get these updated


switch kind {
case "DaemonSet":
switch {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dunno what coding style this project has but i'd just switch on kind and maybe have a nested if for Service

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup that's my bad. I was very new to go when I wrote this. I'll get it cleaned up

return true, fmt.Errorf("service %s.%s get query error: %w", cd.Spec.TargetRef.Name, cd.Namespace, err)
}

revision, err := kc.knativeClient.ServingV1().Revisions(cd.Namespace).Get(context.TODO(), service.Status.LatestCreatedRevisionName, metav1.GetOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not a flagger expert)

I think you'd want to guard against LatestCreatedRevisionName being equal to the value of annotations[flagger.app/primary-revision] - this would ensure you wait for a new revision to be created which becomes your canary.

Otherwise I'm thinking if there was ever a delay in the knative reconcilers it'll look at the wrong revision here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: revisions have a configurationGeneration label

}

func (kc *KnativeController) GetMetadata(canary *flaggerv1.Canary) (string, string, map[string]int32, error) {
// TODO: Do we need this for Knative?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what does flagger use this for?

}

func (kc *KnativeController) ScaleToZero(canary *flaggerv1.Canary) error {
// Not Implemented: Not needed for Knative deployments

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious what this does for other providers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other providers flagger maintains two instances of the application. From memory the process looks like this:

  1. When a new version is created update the canary & scale it up
  2. Slowly send traffic from the primary to the canary & monitor
  3. Once the canary reaches 100% & is performing as expected, update the primary
  4. Switch 100% traffic over to the primary
  5. Scale the canary back to zero

return true, fmt.Errorf("service %s.%s get query error: %w", cd.Spec.TargetRef.Name, cd.Namespace, err)
}

return hasSpecChanged(cd, service.Status.LatestCreatedRevisionName)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not a flagger dev)

What 'applies' the new Service spec.template that would trigger the rollout?

Copy link
Author

@tombanksme tombanksme Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the process that I'd envisioned:

  1. User creates a Knative service for their application
  2. User creates a Flagger configuration pointing at the Knative service
  3. Flagger takes over the traffic management of the Knative service
    4.a. It adds the metadata it needs to operate
    4.b. Wasn't done in this original pull request but it can also probably set the default traffic routing to zero for all revisions & manually set the latest to 100%
  4. When the user wants to update an application they update the Knative service like they would usually
  5. Flagger watches for the new revision & then does the traffic management changeover

So the user is still responsible for triggering the rollout like they would with any other Knative service here

- apiGroups:
- serving.knative.dev
resources:
- services

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just an FYI and this is less common)

You can have a Knative Route and point it to revisions from different knative Configurations. Knative Services are a convenience over it.

I doubt anyone needs that kind of flexibility. Tackling Services rollout will probably cover 99% of use cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at this. If it's easy to add I'll do it in this pull request; if not I'll spin it out into another PR

return
}

return int(*service.Status.Traffic[primaryRevisionIdx].Percent), int(*service.Status.Traffic[canaryRevisionIdx].Percent), false, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsure when this method is called but in general Knative Services's percent can be nil

@tombanksme
Copy link
Author

Morning folks. Thank you all for taking a look & investing your time into this. I'll allocate some time over the next couple of weeks to get this cleaned up as promised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants