
Make Cl2 tests to work in <100 node clusters #1667

Merged

Conversation

Contributor

@lyzs90 lyzs90 commented Jan 19, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:
Details: #887

Which issue(s) this PR fixes:

Fixes #887

Special notes for your reviewer:
The integer division was rounding fractions down to 0. Hence, when the .Nodes value passed in is less than NODES_PER_NAMESPACE (defaults to 100), namespaces = DivideInt .Nodes NODES_PER_NAMESPACE = 0. Thereafter, namespaces is used as the denominator in several computations, e.g. https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/config.yaml#L7
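
For illustration, a plain-Go sketch of the truncation and the guard (these helpers only loosely mirror the template functions; this is not the clusterloader2 code):

```go
// Illustrative only: plain Go mirroring the template arithmetic, not the real
// clusterloader2 helpers. DivideInt-style integer division truncates, so any
// cluster smaller than NODES_PER_NAMESPACE ends up with 0 namespaces; clamping
// with a MaxInt-style guard keeps the later divisions from hitting zero.
package main

import "fmt"

func divideInt(a, b int) int { return a / b } // truncating, like DivideInt
func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	const nodesPerNamespace = 100
	for _, nodes := range []int{1, 10, 100, 500} {
		raw := divideInt(nodes, nodesPerNamespace) // 0 when nodes < 100
		clamped := maxInt(1, raw)                  // never 0, so no divide-by-zero later
		fmt.Printf("nodes=%d raw=%d clamped=%d\n", nodes, raw, clamped)
	}
}
```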

Does this PR introduce a user-facing change?:

Users can now run load tests with < 100 nodes

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/bug (Categorizes issue or PR as related to a bug.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels on Jan 19, 2021
@k8s-ci-robot
Contributor

Hi @lyzs90. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) labels on Jan 19, 2021
@lyzs90
Contributor Author

lyzs90 commented Jan 19, 2021

/cc @mm4tt

@k8s-ci-robot k8s-ci-robot requested a review from mm4tt January 19, 2021 16:01
@@ -39,7 +39,8 @@
{{$ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS false}}
{{$ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS_SIMPLE := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS_SIMPLE true}}
#Variables
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}
# set min namespaces to 1 to avoid divide by zero errors. See https://github.com/kubernetes/perf-tests/issues/887
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE | MaxInt 1}}
{{$totalPods := MultiplyInt $namespaces $NODES_PER_NAMESPACE $PODS_PER_NODE}}
Contributor

We should also adjust this value. Otherwise, if I run this test on a 1 node cluster, the total pods will be calculated as if there were 100 nodes (and it will be equal to 3000 pods). Obviously, it won't work.

Contributor Author

@lyzs90 lyzs90 Jan 20, 2021

Ah, my bad, I didn't look beyond that line. The real culprit seems to be the NODES_PER_NAMESPACE constant. Is it OK to set NODES_PER_NAMESPACE to the minimum of the number of nodes and 100 (or whatever value was passed in)? Then we can do away with the rounding fix on namespaces.
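
A rough sketch of that alternative in plain Go (defaults of 100 nodes per namespace and 30 pods per node are assumed from the config quoted in this thread; this is not the actual template code):

```go
// Sketch of the proposed alternative, with assumed defaults (100 nodes per
// namespace, 30 pods per node). Clamping NODES_PER_NAMESPACE to the real node
// count avoids the zero-namespace problem and keeps totalPods proportional to
// the cluster size instead of pretending there are 100 nodes.
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	const podsPerNode = 30
	for _, nodes := range []int{1, 10, 100} {
		nodesPerNamespace := minInt(nodes, 100) // min of node count and the default 100
		namespaces := nodes / nodesPerNamespace // 1 for small clusters, no rounding fix needed
		totalPods := namespaces * nodesPerNamespace * podsPerNode
		fmt.Println(nodes, namespaces, totalPods) // 1 1 30; 10 1 300; 100 1 3000
	}
}
```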

Contributor

Yeah, that should work.

@mm4tt
Contributor

mm4tt commented Jan 20, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.) and size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) labels and removed the needs-ok-to-test and size/XS labels on Jan 20, 2021
@lyzs90
Contributor Author

lyzs90 commented Jan 28, 2021

@mm4tt Noticed a few other places where this fix might not work for small clusters, e.g. https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/config.yaml#L67 and https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/config.yaml#L17. These may have to be scaled down as well - is the goal for anyone to be able to run load / density tests on a 1 node cluster?

@lyzs90
Contributor Author

lyzs90 commented Jan 28, 2021

Instead of trying to patch the load and density tests so that they will work on small clusters, wdyt about defining a minimal test config that will run on a 1 node cluster?

I'm thinking of an examples folder with a few configs like minimal, with-modules, with-overrides, etc.

@mm4tt
Contributor

mm4tt commented Jan 29, 2021

I'd really like to make the load test work at any scale. All the hard-coded values we have there are technical debt and we should aim to get rid of them. A few notes:

  • Please disregard the density test, we're aiming to deprecate it soon
  • Is there anything else other than MIN_PODS_IN_SMALL_CLUSTERS that's blocking us?
  • For MIN_PODS_IN_SMALL_CLUSTERS, I believe we have some options:
    1. Ideally, we'd just fix the Re-implement scheduling-throughput measurement #1027, so we don't need this constant. Unfortunately, that's not that simple, so let's think of something else
    2. Instead of having the MIN_PODS_IN_SMALL_CLUSTERS as a constant, hard-coded in the config.yaml, it could be a test param that is set to 500 in the test-infra presets only for our 100+ node tests
    3. Given that this part of the test doesn't really work if we have fewer than 100 nodes (with fewer, we're not able to accurately measure the scheduling throughput), we should probably conditionally disable this part if # nodes < 100.

I believe 3 is the best option currently. Thoughts?

@lyzs90
Contributor Author

lyzs90 commented Jan 29, 2021

Yes I think approach 3 sounds good. Will also need to handle BIG_GROUP_SIZE and MEDIUM_GROUP_SIZE in the same manner I suppose?

@mm4tt
Contributor

mm4tt commented Jan 29, 2021

Will also need to handle BIG_GROUP_SIZE and MEDIUM_GROUP_SIZE in the same manner I suppose?

Hmm, wouldn't it work if we just round down everywhere? E.g. if we have just 10 nodes, the total number of pods will be 300 (30*10). Then we'll calculate how many small/medium/big deployments we can fit and e.g. for big we'll do: 300 / 4 / 250 = 0 (assuming we round down). For medium it'll be 300/4/30 = 2. For small it'll be 300/2/5 = 30. Does it make sense?
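
For reference, the same round-down arithmetic as a small Go check (group sizes 250/30/5 and the 1/4, 1/4, 1/2 pod split are assumed from the comment above):

```go
// Worked check of the round-down arithmetic for a 10-node cluster, assuming
// 30 pods per node, group sizes 250/30/5, and a 1/4-1/4-1/2 pod split.
package main

import "fmt"

func main() {
	nodes, podsPerNode := 10, 30
	bigGroup, mediumGroup, smallGroup := 250, 30, 5

	totalPods := nodes * podsPerNode      // 300
	big := totalPods / 4 / bigGroup       // 300/4/250 = 0 big deployments
	medium := totalPods / 4 / mediumGroup // 300/4/30  = 2 medium deployments
	small := totalPods / 2 / smallGroup   // 300/2/5   = 30 small deployments
	fmt.Println(totalPods, big, medium, small)
}
```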

@lyzs90
Contributor Author

lyzs90 commented Jan 29, 2021

Will also need to handle BIG_GROUP_SIZE and MEDIUM_GROUP_SIZE in the same manner I suppose?

Hmm, wouldn't it work if we just round down everywhere? E.g. if we have just 10 nodes, the total number of pods will be 300 (30*10). Then we'll calculate how many small/medium/big deployments we can fit and e.g. for big we'll do: 300 / 4 / 250 = 0 (assuming we round down). For medium it'll be 300/4/30 = 2. For small it'll be 300/2/5 = 30. Does it make sense?

But it's also used as the number of replicas: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/config.yaml#257. Seems like BIG_GROUP_SIZE and MEDIUM_GROUP_SIZE need to be factors of the number of nodes?

@mm4tt
Contributor

mm4tt commented Jan 29, 2021

It will be fine if we have 250 there even if there is no place for so many pods. This is because this object will never be created. The replicasPerNamespace will be 0 - https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/config.yaml#L244

@lyzs90
Contributor Author

lyzs90 commented Jan 29, 2021

Ah yes. I'm curious if that (medium + small) will all fit on 1 node 😅

@lyzs90 lyzs90 force-pushed the num-nodes-divide-by-zero-error branch from 91a2abe to 9c7c0d0 on January 29, 2021 13:03
@lyzs90
Contributor Author

lyzs90 commented Jan 30, 2021

Noting down some issues on a single node run:

@k8s-ci-robot k8s-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label and removed the size/S label on Jan 30, 2021
@lyzs90
Contributor Author

lyzs90 commented Jan 30, 2021

@mm4tt Added defaults and disabled heavier workloads when running on clusters < 100 nodes (I tested with a single node cluster)

Contributor

@mm4tt mm4tt left a comment

Thanks, this is great! I know how painful it is to write in yaml and go-templates, so I really appreciate it!

@@ -90,7 +97,7 @@

name: load
namespace:
number: {{AddInt $namespaces $schedulerThroughputNamespaces}}
number: {{MinInt .Nodes (AddInt $namespaces $schedulerThroughputNamespaces)}}
Contributor

Not sure if I understand this line. Wouldn't it be better to make sure $namespaces and $schedulerThroughputNamespaces are just set properly in all cases?

Contributor Author

I added some calculations below so it's easier to make sense of the change.

Without MinInt:

# 100 nodes
$MIN_PODS_IN_SMALL_CLUSTERS := 500
$totalSchedulerThroughputPods := MaxInt (MultiplyInt 2 $MIN_PODS_IN_SMALL_CLUSTERS) (MultiplyInt 2 .Nodes) # 1000
$schedulerThroughputPodsPerDeployment := .Nodes # 100
$schedulerThroughputNamespaces := DivideInt $totalSchedulerThroughputPods $schedulerThroughputPodsPerDeployment # 10
number: AddInt $namespaces $schedulerThroughputNamespaces # 11
# 1 node
$MIN_PODS_IN_SMALL_CLUSTERS := 500
$totalSchedulerThroughputPods := MaxInt (MultiplyInt 2 $MIN_PODS_IN_SMALL_CLUSTERS) (MultiplyInt 2 .Nodes) # 1000
$schedulerThroughputPodsPerDeployment := .Nodes # 1
$schedulerThroughputNamespaces := DivideInt $totalSchedulerThroughputPods $schedulerThroughputPodsPerDeployment # 1000
number: AddInt $namespaces $schedulerThroughputNamespaces # 1001 <- !!!

With MinInt:

# 100 nodes
same as above
number: MinInt .Nodes (AddInt $namespaces $schedulerThroughputNamespaces) # 11
# 1 node
same as above
number: MinInt .Nodes (AddInt $namespaces $schedulerThroughputNamespaces) # 1
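
The same two scenarios condensed into a plain-Go check (assuming NODES_PER_NAMESPACE=100 and MIN_PODS_IN_SMALL_CLUSTERS=500 as in the quoted defaults; this is not the template code itself):

```go
// Reproduces the namespace counts above, assuming NODES_PER_NAMESPACE=100 and
// MIN_PODS_IN_SMALL_CLUSTERS=500. Returns the namespace number without and
// with the MinInt clamp.
package main

import "fmt"

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func namespaceNumber(nodes int) (without, with int) {
	namespaces := maxInt(1, nodes/100)                     // $namespaces
	totalSchedulerThroughputPods := maxInt(2*500, 2*nodes) // 1000 for both examples below
	schedulerThroughputNamespaces := totalSchedulerThroughputPods / nodes
	without = namespaces + schedulerThroughputNamespaces
	with = minInt(nodes, without)
	return without, with
}

func main() {
	fmt.Println(namespaceNumber(100)) // 11 11
	fmt.Println(namespaceNumber(1))   // 1001 1
}
```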

Contributor

Hmm, I'd rather leave this line as is and change $schedulerThroughputNamespaces to 0 if $IS_SMALL_CLUSTER. It makes the code easier to understand

@@ -229,6 +236,7 @@ steps:
objectTemplatePath: daemonset.yaml
templateFillMap:
Image: k8s.gcr.io/pause:3.0
{{if not $IS_SMALL_CLUSTER}}
Contributor

Is this if block needed here? I believe the $bigDeploymentsPerNamespace will be 0 if .Nodes < 100. In that case no object is going to be created

Contributor Author

Will try again, I may have mixed this up with the error coming from a negative $bigDeploymentsPerNamespace

Contributor Author

Big deployment objects are still being created even though bigDeploymentsPerNamespace is 0. Seems like ReplicasMin and ReplicasMax have the final say. So maybe targeting the BIG_GROUP_SIZE variable would be better, as it will affect the jobs created at the same time.


Contributor

Nah, we're missing something. If the replicasPerNamespace is set to 0 then no new object should be created -

for replicaCounter := instancesStates[0].CurrentReplicaCount; replicaCounter < phase.ReplicasPerNamespace; replicaCounter++ {

The only explanation is that the $bigDeploymentsPerNamespace is not set to 0. Isn't it because in line 67 we're doing:

{{$bigDeploymentsPerNamespace := MaxInt 1 (SubtractInt $bigDeploymentsPerNamespace 1)}}

?

Contributor Author

Yes! This is it, addressed in my latest commit 99c29b4

@@ -345,6 +354,7 @@ steps:
templateFillMap:
ReplicasMin: {{$MEDIUM_GROUP_SIZE}}
ReplicasMax: {{$MEDIUM_GROUP_SIZE}}
{{if not $IS_SMALL_CLUSTER}}
Contributor

Instead of having this if block, we could introduce a variable $bigJobsPerNamespace and make sure it is set to 0 if #Nodes < 100.


- name: Scaling and updating objects
phases:
{{if not $IS_SMALL_CLUSTER}}
Contributor

Same as above, I believe it's not needed

@@ -652,6 +671,7 @@ steps:

- name: Deleting objects
phases:
{{if not $IS_SMALL_CLUSTER}}
Contributor

Same here

@@ -627,6 +644,7 @@ steps:
templateFillMap:
ReplicasMin: {{MultiplyInt $MEDIUM_GROUP_SIZE 0.5}}
ReplicasMax: {{MultiplyInt $MEDIUM_GROUP_SIZE 1.5}}
{{if not $IS_SMALL_CLUSTER}}
Contributor

Same here, once we introduce a variable it shouldn't be needed anymore.

# Ensure non zero or negative after subtraction.
{{$smallDeploymentsPerNamespace := MaxInt 1 (SubtractInt $smallDeploymentsPerNamespace 1)}}
{{$mediumDeploymentsPerNamespace := MaxInt 1 (SubtractInt $mediumDeploymentsPerNamespace 1)}}
{{$bigDeploymentsPerNamespace := MaxInt 1 (SubtractInt $bigDeploymentsPerNamespace 1)}}
Contributor

We should have MaxInt 0 ... instead of MaxInt 1 ..., here and above. Then we won't need lines 69-71.

Contributor Author

I added lines 69-71 because there is nothing limiting $bigDeploymentsPerNamespace above anymore; it will be 250 * 4 - 1 = 999 on this line.

Contributor

Hmm, the bigDeploymentsPerNamespace is calculated as

{{$bigDeploymentsPerNamespace := DivideInt $podsPerNamespace (MultiplyInt 4 $BIG_GROUP_SIZE)}}

And if we have $namespaces=1 then $podsPerNamespace = 30 * #Nodes

So, e.g. in a 10 node cluster

bigDeploymentsPerNamespace := 30 * 10 / 4 / 250 = 0

I believe it should work, we don't need any custom logic there

Contributor Author

Ah, did I miss something? Let me try again later. I vaguely recall it wasn't working as expected when I set it to MaxInt 0.

Contributor Author

Ok it works! Made the change accordingly

Contributor

@mm4tt mm4tt left a comment

Thanks!

# Stateful sets are enabled. Reduce the number of small and medium deployments per namespace
# See https://github.com/kubernetes/perf-tests/issues/1036#issuecomment-607631768
# Ensure non zero or negative after subtraction.
{{$smallDeploymentsPerNamespace := MaxInt 1 (SubtractInt $smallDeploymentsPerNamespace $SMALL_STATEFUL_SETS_PER_NAMESPACE)}}
Contributor

Let's have MaxInt 0 everywhere. It's more correct. We should err on the side of downsizing so that the "~30 pods per node on avg" assumption holds.

@@ -112,7 +122,7 @@ tuningSets:
# The expected number of created/deleted pods is totalPods/4 when scaling,
# as each RS changes its size from X to a uniform random value in [X/2, 3X/2].
# To match 10 [pods/s] requirement, we need to divide saturationTime by 4.
timeLimit: {{DivideInt $saturationTime 4}}s
timeLimit: {{MaxInt 1 (DivideInt $saturationTime 4)}}s
Contributor

Let's revert this. It shouldn't really matter whether the time limit is 0s or 1s. Let's keep the code simple.

Contributor Author

The RNG panics if timeLimit is 0:

panic: invalid argument to Int63n

goroutine 321 [running]:
math/rand.(*Rand).Int63n(0xc0000a8420, 0x0, 0xc0008b67c0)
	/usr/local/go/src/math/rand/rand.go:111 +0x11d
math/rand.Int63n(...)
	/usr/local/go/src/math/rand/rand.go:327
k8s.io/perf-tests/clusterloader2/pkg/tuningset.(*randomizedTimeLimitedLoad).Execute.func1()
	/Users/shemleong/go/src/perf-tests/clusterloader2/pkg/tuningset/randomized_time_limited.go:43 +0x4a
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000b00420, 0xc000b7a750)
	/Users/shemleong/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/Users/shemleong/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0x65
exit status 2

Contributor

Oh, so let's fix that here -

time.Sleep(time.Duration(rand.Int63n(r.params.TimeLimit.ToTimeDuration().Nanoseconds())))

Contributor

Basically, call it only if timeLimit > 0
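
A minimal sketch of that guard (illustrative only, not the exact randomized_time_limited.go change): rand.Int63n panics on a non-positive argument, so the randomized sleep is simply skipped when the limit is zero.

```go
// Illustrative guard only, not the exact clusterloader2 code: rand.Int63n
// panics when its argument is not positive, so skip the randomized sleep
// entirely when the configured time limit is zero.
package tuningset

import (
	"math/rand"
	"time"
)

func sleepUpTo(limit time.Duration) {
	if ns := limit.Nanoseconds(); ns > 0 {
		time.Sleep(time.Duration(rand.Int63n(ns)))
	}
}
```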

Contributor Author

Cool, added the check there


@mm4tt mm4tt changed the title from "Set default number of namespaces to 1" to "Make Cl2 tests to work in <100 node clusters" on Feb 5, 2021
@lyzs90
Contributor Author

lyzs90 commented Feb 5, 2021

Other than the timeLimit one, I've addressed the other comments!

Contributor

@mm4tt mm4tt left a comment

LGTM with one comment

{{$PODS_PER_NODE := DefaultParam .PODS_PER_NODE 30}}
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$DELETE_TEST_THROUGHPUT := DefaultParam .CL2_DELETE_TEST_THROUGHPUT $LOAD_TEST_THROUGHPUT}}
{{$BIG_GROUP_SIZE := DefaultParam .BIG_GROUP_SIZE 250}}
{{$MEDIUM_GROUP_SIZE := DefaultParam .MEDIUM_GROUP_SIZE 30}}
{{$MEDIUM_GROUP_SIZE := DefaultParam .MEDIUM_GROUP_SIZE 15}}
Contributor

Revert

Contributor Author

Oops my bad, forgot to undo the change after testing

@lyzs90 lyzs90 force-pushed the num-nodes-divide-by-zero-error branch from db6047b to 1dbc08f on February 5, 2021 15:45
@lyzs90
Contributor Author

lyzs90 commented Feb 5, 2021

LGTM with one comment

Cool :) will squash commits

@lyzs90 lyzs90 force-pushed the num-nodes-divide-by-zero-error branch from 1dbc08f to 411b1af on February 5, 2021 15:58
@mm4tt
Contributor

mm4tt commented Feb 5, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label on Feb 5, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lyzs90, mm4tt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label on Feb 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit 11eae44 into kubernetes:master Feb 5, 2021
Successfully merging this pull request may close these issues: CL2 density and load test shouldn't require # nodes to be multiple of 100