Experimental config with merged density and load #1008
Conversation
Force-pushed from 122cffd to a17ac5e
Force-pushed from a17ac5e to 0025dfb
I ran some manual 100-node tests and things look good.
/hold
/assign @wojtek-t
Force-pushed from 0025dfb to 93aac3e
# BEGIN scheduler-throughput section
# Min number of pods per deployment to be used for measuring scheduler throughput
# to get enough samples and accurate measurements in small clusters.
{{$MIN_PODS_PER_DEPLOYMENT_TO_MEASURE_SCHEDULER_THROUGHPUT := 250}}
Where is 250 coming from? Only the fact that it's a divisor of 500? :D
What I'm mostly interested in is ensuring that we test large deployments (which is what we do in the existing density test) - there are deployments of size 3000 there.
However, I think that constraining it to the cluster size is actually reasonable - we would test 100-pod deployments in a 100-node cluster and 5000-pod deployments in a 5k-node cluster.
So I personally think that instead of 250 here I would simply use ".Nodes", and just create many more namespaces there, to ensure that at least N (=500?) pods are created.
[+ ensure that there are at least 2]
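For illustration, a minimal sketch of that suggestion in clusterloader2's template syntax (the variable names and the 500/2 minimums are placeholders, not necessarily the exact values adopted in this PR):
# Sketch only: deployment size follows the cluster size, and the namespace count is
# derived so that at least 500 scheduler-throughput pods exist, in at least 2 namespaces.
{{$MIN_SCHEDULER_THROUGHPUT_PODS := 500}}
{{$schedulerThroughputPodsPerDeployment := .Nodes}}
{{$schedulerThroughputNamespaces := MaxInt 2 (DivideInt $MIN_SCHEDULER_THROUGHPUT_PODS $schedulerThroughputPodsPerDeployment)}}
Under this sketch, a 100-node cluster gets 5 namespaces with 100-pod deployments (500 pods total), while a 5k-node cluster gets 2 namespaces with 5000-pod deployments.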
# BEGIN pod-startup-latency section
# Min number of pods to be used for measuring pod startup latency to get enough
# samples and accurate measurements in small clusters.
{{$MIN_PODS_TO_MEASURE_STARTUP_LATENCY := 500}}
Can we make this a more generic constant and use it both to determine the min number of latency pods and the min number of pods needed for scheduler throughput?
Done.
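As a hedged illustration of the "generic constant" idea (names here are illustrative, not necessarily those in the final config), one shared minimum can feed both sections:
# Sketch: one generic minimum reused by both measurement sections.
{{$MIN_PODS_IN_SMALL_CLUSTERS := 500}}
# pod-startup-latency: measure on at least this many pods, even in small clusters.
{{$totalLatencyPods := MaxInt $MIN_PODS_IN_SMALL_CLUSTERS .Nodes}}
# scheduler-throughput: enough namespaces of .Nodes-sized deployments to reach the same minimum.
{{$schedulerThroughputNamespaces := MaxInt 2 (DivideInt $MIN_PODS_IN_SMALL_CLUSTERS .Nodes)}}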
Params:
  action: start
  labelSelector: group = scheduler-throughput
  threshold: 60s
Why 60s?
In 100-node tests the 99th percentile was around 10s; I extrapolated that to 1min for a 5k-node test :)
But I'm changing it to 1h to match what we currently have for pods in the load test, and I added a TODO for it - basically to check whether we can get rid of these artificial pod-startup-latency setups and measure it (and assert on it) across the whole test.
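For context, a hedged sketch of what the relaxed measurement step could look like; the Identifier name is assumed, and only the threshold value is the point here:
- measurements:
  - Identifier: SchedulerThroughputPodStartupLatency  # assumed name
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = scheduler-throughput
      # Relaxed from 60s to match what the load test already uses for its pods.
      threshold: 1h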
- namespaceRange:
    min: 1
    max: {{$namespaces}}
  replicasPerNamespace: {{$latencyReplicas}} # TODO
What TODO?
Good catch :) Removed.
Force-pushed from 93aac3e to d43144e
PTAL
Force-pushed from d43144e to 192a1dd
I ran the tests at scale; the baseline was the gce-performance-scale run from 02-04 - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1224739883763896323. Below, the relevant results from both runs (baseline vs. this PR) are compared for:
- Pod startup latency
- Scheduler throughput
- API call latency
Summary: I think the results look reasonable enough to merge this PR. We see some regression in
# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 1200}}

# TODO(https://github.com/kubernetes/perf-tests/issues/1007): Get rid of this file
There is also the high-density test, which relies on it.
Medium-term, we should modify the new load config to support it (e.g. instead of creating 2 deployments, create N), but short-term maybe let's leave this test.
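A hedged sketch of that medium-term idea; the CL2_DEPLOYMENTS_PER_NAMESPACE override and the variable name are hypothetical, not something this PR introduces:
# Hypothetical override: keep a default of 2 deployments per namespace,
# but let the high-density variant raise it to N instead of keeping a separate config.
{{$deploymentsPerNamespace := DefaultParam .CL2_DEPLOYMENTS_PER_NAMESPACE 2}}
- namespaceRange:
    min: 1
    max: {{$namespaces}}
  replicasPerNamespace: {{$deploymentsPerNamespace}}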
Sent you PRs to address it
{{$schedulerThroughputThreshold := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# END scheduler-throughput section

# TODO(https://github.com/kubernetes/perf-tests/issues/1024): Ideally, we wouldn't need this section.
nit:
s/Ideally .../Investigate and get rid of this section./
Done.
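For completeness, a hedged sketch of how the $schedulerThroughputThreshold defined in the hunk above could be consumed, assuming the SchedulerThroughput measurement accepts a threshold parameter (the parameter plumbing here is an assumption, not taken from this PR):
- measurements:
  - Identifier: SchedulerThroughput
    Method: SchedulerThroughput
    Params:
      action: gather
      # 0 disables the assertion; CL2_SCHEDULER_THROUGHPUT_THRESHOLD can enable it per suite.
      threshold: {{$schedulerThroughputThreshold}}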
    min: {{AddInt $namespaces 1}}
    max: {{AddInt $namespaces $schedulerThroughputNamespaces}}
  replicasPerNamespace: 1
  tuningSet: PodThroughputParallel
The PodThroughputParallel name is misleading, as in fact we create deployments with that throughput. Given those are fairly big ones, the pod throughput may actually be much higher.
I don't have a good name though...
Doh, I meant to write SchedulerThroughputParallel - i.e. a dedicated tuningSet for SchedulerThroughput that results in fully parallel creation of the deployments.
Changed and documented it better.
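A hedged sketch of such a dedicated tuning set, assuming clusterloader2's parallelismLimitedLoad tuning set type; the limit value is illustrative, not necessarily what this PR uses:
tuningSets:
- name: SchedulerThroughputParallel
  parallelismLimitedLoad:
    # High enough that all scheduler-throughput deployments are effectively created in parallel.
    parallelismLimit: 1000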
This is to allow deprecating density tests in all other places except high-density. See kubernetes#1008 (comment)
Force-pushed from 192a1dd to 6c5052d
/lgtm
/hold
@oxddr (current scalability oncall) FYI: I'm merging this today so we can check it tomorrow and revert before the weekend if needed.
/hold cancel
/test pull-perf-tests-clusterloader2
Force-pushed from 6c5052d to 0d48e27
@mm4tt - what has changed?
Force-pushed from 0d48e27 to e1bb781
Given some flakes I noticed in the presubmits, I changed the code to make it a no-op for existing tests.
Force-pushed from e1bb781 to 1f5b392
@wojtek-t, PTAL. This should be a no-op now.
Force-pushed from 1f5b392 to 68ee035
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mm4tt, wojtek-t
Ref. #1007