Mark NEG refs from deleted subnet as to-be-deleted state. #2744

Open · wants to merge 1 commit into master
43 changes: 32 additions & 11 deletions pkg/neg/syncers/transaction.go
@@ -25,6 +25,7 @@ import (
"sync"
"time"

nodetopologyv1 "github.com/GoogleCloudPlatform/gke-networking-api/apis/nodetopology/v1"
"github.com/GoogleCloudPlatform/k8s-cloud-provider/pkg/cloud"
"github.com/GoogleCloudPlatform/k8s-cloud-provider/pkg/cloud/meta"
"google.golang.org/api/googleapi"
@@ -513,7 +514,7 @@ func (s *transactionSyncer) ensureNetworkEndpointGroups() error {
}

if updateNEGStatus {
s.updateInitStatus(negObjRefs, errList)
s.updateInitStatus(negObjRefs, subnetConfigs, errList)
}

s.syncMetricsCollector.UpdateSyncerNegCount(s.NegSyncerKey, negsByLocation)
@@ -835,7 +836,7 @@ func (s *transactionSyncer) logEndpoints(endpointMap map[negtypes.EndpointGroupI
// Before patching the NEG CR, it also includes NEG refs for NEGs are no longer
// needed and change status as INACTIVE.
// If neg client is nil, will return immediately.
func (s *transactionSyncer) updateInitStatus(negObjRefs []negv1beta1.NegObjectReference, errList []error) {
func (s *transactionSyncer) updateInitStatus(negObjRefs []negv1beta1.NegObjectReference, subnetConfigs []nodetopologyv1.SubnetConfig, errList []error) {
if s.svcNegClient == nil {
return
}
@@ -849,8 +850,8 @@ func (s *transactionSyncer) updateInitStatus(negObjRefs []negv1beta1.NegObjectRe

neg := origNeg.DeepCopy()
if flags.F.EnableMultiSubnetClusterPhase1 {
inactiveNegObjRefs := getInactiveNegRefs(origNeg.Status.NetworkEndpointGroups, negObjRefs, s.logger)
negObjRefs = append(negObjRefs, inactiveNegObjRefs...)
additionalNegObjRefs := getInactiveAndTBDNegRefs(origNeg.Status.NetworkEndpointGroups, negObjRefs, subnetConfigs, s.logger)
negObjRefs = append(negObjRefs, additionalNegObjRefs...)
}
neg.Status.NetworkEndpointGroups = negObjRefs

@@ -998,9 +999,11 @@ func ensureCondition(neg *negv1beta1.ServiceNetworkEndpointGroup, expectedCondit
return expectedCondition
}

// getInactiveNegRefs creates NEG references for NEGs in Inactive State.
// Inactive NEG are NEGs that are no longer needed.
func getInactiveNegRefs(oldNegRefs []negv1beta1.NegObjectReference, currentNegRefs []negv1beta1.NegObjectReference, logger klog.Logger) []negv1beta1.NegObjectReference {
// getInactiveAndTBDNegRefs creates NEG references for NEGs in the Inactive and
// to-be-deleted states.
// Inactive NEGs are NEGs that are no longer needed, while to-be-deleted NEGs
// are NEGs in a subnet that is no longer in the cluster.
func getInactiveAndTBDNegRefs(oldNegRefs []negv1beta1.NegObjectReference, currentNegRefs []negv1beta1.NegObjectReference, subnetConfigs []nodetopologyv1.SubnetConfig, logger klog.Logger) []negv1beta1.NegObjectReference {
activeNegs := make(map[negtypes.NegInfo]struct{})
for _, negRef := range currentNegRefs {
negInfo, err := negtypes.NegInfoFromNegRef(negRef)
@@ -1011,11 +1014,29 @@ func getInactiveNegRefs(oldNegRefs []negv1beta1.NegObjectReference, currentNegRe
activeNegs[negInfo] = struct{}{}
}

var inactiveNegRefs []negv1beta1.NegObjectReference
clusterSubnets := make(map[string]struct{})
for _, subnetConfig := range subnetConfigs {
clusterSubnets[subnetConfig.Name] = struct{}{}
}

var updatedNegRefs []negv1beta1.NegObjectReference
for _, origNegRef := range oldNegRefs {
negInfo, err := negtypes.NegInfoFromNegRef(origNegRef)
if err != nil {
logger.Error(err, "Failed to extract name and zone information of a neg from the previous snapshot, skipping validating if it is an Inactive NEG", "negId", origNegRef.Id, "negSelfLink", origNegRef.SelfLink)
logger.Error(err, "Failed to extract name and zone information of a neg from the previous snapshot, skipping validating if it is an Inactive or to-be-deleted NEG", "negId", origNegRef.Id, "negSelfLink", origNegRef.SelfLink)
continue
}

resourceID, err := cloud.ParseResourceURL(origNegRef.SubnetURL)
Member:
Can we do a walkthrough of what happens when:

  1. NEGs already exist at a GKE version that has no knowledge of MSC, and the SvcNEGs do not have subnet information.
  2. The GKE cluster is upgraded to a GKE version that has all of these changes.
  3. So, for the first time after the upgrade, what will the SvcNEG status updates look like?

Do we have the code in place to update existing SvcNEGs with the subnet information such that it is available here for the first time?

Contributor Author:
What we are doing here is comparing the subnet of the NEGs in the NEG CR with the current set of subnets from the nodeTopology CRD/subnetConfigs.

After the upgrade, before the customer adds any subnets:
The SvcNeg will only have NEG refs from the default subnet, and the nodeTopology CRD will only have the default subnet, so we won't find any NEGs that belong to a subnet that is not in the current set of subnets.

The update of the SvcNeg with subnet information happens in updateInitStatus. We do ensureNEG for each subnet and zone, and update the NEG CR with the NEG refs returned from ensure.
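For concreteness, here is a standalone toy of that comparison. The subnet names and URL are made up, and subnetNameFromURL is only a stand-in for the cloud.ParseResourceURL call used in the actual change:

```go
package main

import (
	"fmt"
	"strings"
)

// subnetNameFromURL extracts the trailing resource name from a subnetwork URL,
// e.g. ".../regions/us-central1/subnetworks/default" -> "default".
func subnetNameFromURL(url string) string {
	parts := strings.Split(strings.TrimSuffix(url, "/"), "/")
	return parts[len(parts)-1]
}

func main() {
	// Subnets currently listed in the nodeTopology CR (the cluster's subnets).
	clusterSubnets := map[string]struct{}{
		"default": {},
	}

	// Subnet recorded on a NEG ref from the previous NEG CR snapshot.
	negSubnetURL := "https://www.googleapis.com/compute/v1/projects/p/regions/us-central1/subnetworks/removed-subnet"

	if _, ok := clusterSubnets[subnetNameFromURL(negSubnetURL)]; !ok {
		fmt.Println("subnet no longer in the cluster: mark the NEG ref as ToBeDeleted")
	}
}
```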

Member:
Thanks!

Still have some questions:

So for the first time after that upgrade, origNegRef.SubnetURL will be empty, right? Which means we will run into the error case below, because ParseResourceURL() will fail. This seems a bit destructive, especially if someone also has a NEG transitioning to INACTIVE, which we will then fail to mark, and it will get lost. Is this right? Can we think of any other problems?

Is the answer to simply swap the order, checking Inactive followed by ToBeDeleted? Thinking about this, it seems logical to first classify things as Inactive, and out of those Inactive refs, further classify them as ToBeDeleted. If the current ordering was intentional and swapping it would cause other issues, we may need to think of something else.
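To make that suggestion concrete, a rough sketch (not part of this PR) of a helper that would be called only for refs already known to be non-active; it assumes the negv1beta1 and cloud imports already present in this file, and classifyStaleRef is a hypothetical name:

```go
// classifyStaleRef defaults a no-longer-needed ref to Inactive, and tightens it
// to ToBeDeleted only when the subnet can be parsed and is absent from the
// cluster's subnet set. An empty or unparseable SubnetURL (e.g. a ref written
// before the upgrade) therefore degrades to Inactive instead of being skipped.
func classifyStaleRef(ref negv1beta1.NegObjectReference, clusterSubnets map[string]struct{}) negv1beta1.NegObjectReference {
	out := *ref.DeepCopy()
	out.State = negv1beta1.InactiveState
	if ref.SubnetURL == "" {
		return out
	}
	resourceID, err := cloud.ParseResourceURL(ref.SubnetURL)
	if err != nil {
		return out
	}
	if _, inCluster := clusterSubnets[resourceID.Key.Name]; !inCluster {
		out.State = negv1beta1.ToBeDeletedState
	}
	return out
}
```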

if err != nil {
logger.Error(err, "Failed to extract subnet information from the previous snapshot, skipping validating if it is an Inactive or to-be-deleted NEG", "negId", origNegRef.Id, "negSelfLink", origNegRef.SelfLink)
continue
}
negSubnet := resourceID.Key.Name
if _, exists := clusterSubnets[negSubnet]; !exists {
toBeDeletedNegRef := origNegRef.DeepCopy()
toBeDeletedNegRef.State = negv1beta1.ToBeDeletedState
updatedNegRefs = append(updatedNegRefs, *toBeDeletedNegRef)
continue
}

@@ -1027,10 +1048,10 @@ func getInactiveNegRefs(oldNegRefs []negv1beta1.NegObjectReference, currentNegRe
if _, exists := activeNegs[negInfo]; !exists {
inactiveNegRef := origNegRef.DeepCopy()
inactiveNegRef.State = negv1beta1.InactiveState
inactiveNegRefs = append(inactiveNegRefs, *inactiveNegRef)
updatedNegRefs = append(updatedNegRefs, *inactiveNegRef)
}
}
return inactiveNegRefs
return updatedNegRefs
}

// getSyncedCondition returns the expected synced condition based on given error