
Experimental config with merged density and load #1008

Merged (1 commit) on Feb 7, 2020

Conversation

@mm4tt (Contributor) commented Feb 3, 2020

Ref. #1007

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 3, 2020
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2020
@mm4tt mm4tt force-pushed the get_rid_of_density branch 2 times, most recently from 122cffd to a17ac5e Compare February 4, 2020 15:53
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 4, 2020
@mm4tt mm4tt force-pushed the get_rid_of_density branch from a17ac5e to 0025dfb Compare February 5, 2020 11:45
@mm4tt mm4tt changed the title from "<WIP> Get rid of density test" to "Merge density test into load test" Feb 5, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2020
@mm4tt (Contributor, Author) commented Feb 5, 2020

I ran some manual 100 node tests and things look good.

/hold
I'll run it at 5k node scale today before merging

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2020
@mm4tt (Contributor, Author) commented Feb 5, 2020

/assign @wojtek-t

# BEGIN scheduler-throughput section
# Min number of pods per deployment to be used for measuring scheduler throughput
# to get enough samples and accurate measurements in small clusters.
{{$MIN_PODS_PER_DEPLOYMENT_TO_MEASURE_SCHEDULER_THROUGHPUT := 250}}
Member:

Where is 250 coming from? Only the fact that it's a divisor of 500? :D

What I'm mostly interested in is ensuring that we test large deployments (which is what we are doing in the existing density test) - there are deployments of size 3000 there.
However, I think that constraining it to "cluster size" is actually reasonable - we will test 100-pod deployments in a 100-node cluster and 5000-pod deployments in 5k-node clusters.

So I personally think that instead of 250 here, I would simply use ".Nodes"; I would just create many more namespaces there, to ensure that at least N (=500?) pods will be created.
[+ ensure that there are at least 2]
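
A minimal sketch of what that suggestion could look like in the config (variable names are illustrative, and it assumes the template library provides DivideInt alongside the MaxInt/AddInt helpers already used in this file):

# Size each scheduler-throughput deployment by the cluster size, and spread the load over
# enough namespaces (at least 2) to still get a minimum number of pods in small clusters.
{{$MIN_SCHEDULER_THROUGHPUT_PODS := 500}}
{{$schedulerThroughputPodsPerDeployment := .Nodes}}
{{$schedulerThroughputNamespaces := MaxInt 2 (DivideInt $MIN_SCHEDULER_THROUGHPUT_PODS $schedulerThroughputPodsPerDeployment)}}

With 100 nodes this would yield 5 namespaces of 100-pod deployments (500 pods total); with 5000 nodes, 2 namespaces of 5000-pod deployments.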

# BEGIN pod-startup-latency section
# Min number of pods to be used for measuring pod startup latency to get enough
# samples and accurate measurements in small clusters.
{{$MIN_PODS_TO_MEASURE_STARTUP_LATENCY := 500}}
Member:

Can we make this a more generic constant?
And use it both to determine the min number of latency pods and the min number of pods needed for scheduler throughput?

Contributor Author:

Done.
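
For illustration, a single shared constant for both sections could look roughly like this (the names below are illustrative, not necessarily what the PR settled on):

# Min number of pods used by both the pod-startup-latency and scheduler-throughput
# sections, to get enough samples and accurate measurements in small clusters.
{{$MIN_PODS_IN_SMALL_CLUSTERS := 500}}
{{$totalLatencyPods := MaxInt $MIN_PODS_IN_SMALL_CLUSTERS .Nodes}}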

Params:
  action: start
  labelSelector: group = scheduler-throughput
  threshold: 60s
Member:

Why 60s?

Contributor Author:

In 100-node tests the 99th percentile was around 10s; I extrapolated that to 1 min for the 5k-node test :)
But I'm changing it to 1h to match what we currently have for pods in the load test, and I added a TODO for that - basically to see whether we can get rid of these artificial pod-startup-latency setups and measure it (and assert on it) across the whole test.
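
As a sketch, the measurement discussed here is a PodStartupLatency start/gather pair scoped to the scheduler-throughput pods via the label selector (the identifier name below is illustrative):

- Identifier: SchedulerThroughputPodStartupLatency
  Method: PodStartupLatency
  Params:
    action: start
    labelSelector: group = scheduler-throughput
    threshold: 1h
# ... the scheduler-throughput deployments are created and awaited in between ...
- Identifier: SchedulerThroughputPodStartupLatency
  Method: PodStartupLatency
  Params:
    action: gather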

- namespaceRange:
    min: 1
    max: {{$namespaces}}
  replicasPerNamespace: {{$latencyReplicas}} # TODO
Member:

What TODO?

Contributor Author:

Good catch :) Removed.

@mm4tt mm4tt force-pushed the get_rid_of_density branch from 93aac3e to d43144e Compare February 5, 2020 15:08
@mm4tt (Contributor, Author) left a comment:
PTAL

@mm4tt mm4tt force-pushed the get_rid_of_density branch from d43144e to 192a1dd Compare February 5, 2020 15:25
@mm4tt (Contributor, Author) commented Feb 6, 2020

I ran the tests at scale; the baseline was the gce-performance-scale run from 02-04 - https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1224739883763896323

Below is a comparison of the relevant results from both tests.

Pod startup latency

Baseline

{
  "version": "1.0",
  "dataItems": [
    {
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 1000
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    },
    {
      "data": {
        "Perc50": 1000,
        "Perc90": 1000,
        "Perc99": 2000
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_run"
      }
    },
    {
      "data": {
        "Perc50": 1138.736383,
        "Perc90": 1752.331418,
        "Perc99": 2060.449444
      },
      "unit": "ms",
      "labels": {
        "Metric": "run_to_watch"
      }
    },
    {
      "data": {
        "Perc50": 2119.650554,
        "Perc90": 2752.734037,
        "Perc99": 3131.267012
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_watch"
      }
    },
    {
      "data": {
        "Perc50": 2134.250446,
        "Perc90": 2773.996255,
        "Perc99": 3228.573599
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    }
  ]
}

This PR

{
  "version": "1.0",
  "dataItems": [
    {
      "data": {
        "Perc50": 2435.624141,
        "Perc90": 3059.753256,
        "Perc99": 3718.617312
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_watch"
      }
    },
    {
      "data": {
        "Perc50": 2457.899606,
        "Perc90": 3130.831216,
        "Perc99": 3983.652041
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    {
      "data": {
        "Perc50": 0,
        "Perc90": 0,
        "Perc99": 1000
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    },
    {
      "data": {
        "Perc50": 1000,
        "Perc90": 2000,
        "Perc99": 2000
      },
      "unit": "ms",
      "labels": {
        "Metric": "schedule_to_run"
      }
    },
    {
      "data": {
        "Perc50": 1157.777974,
        "Perc90": 1742.486806,
        "Perc99": 2192.646648
      },
      "unit": "ms",
      "labels": {
        "Metric": "run_to_watch"
      }
    }
  ]
}

Scheduler Throughput

Baseline

{
  "average": 99.00990099009915,
  "perc50": 100.6,
  "perc90": 102,
  "perc99": 104.8
}

This PR

{
  "average": 95.23809523809524,
  "perc50": 99.2,
  "perc90": 101.2,
  "perc99": 115
}

API-Call-Latency

Baseline

W0205 04:51:05.880] I0205 04:51:05.879556   14942 prometheus.go:108] Executing "histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{resource!=\"events\", verb!~\"WATCH|WATCHLIST|PROXY|proxy|CONNECT\"}[579m])) by (resource, subresource, verb, scope, le))" at 2020-02-05T04:50:52Z
W0205 04:51:13.491] I0205 04:51:13.490485   14942 prometheus.go:108] Executing "sum(increase(apiserver_request_duration_seconds_count{resource!=\"events\", verb!~\"WATCH|WATCHLIST|PROXY|proxy|CONNECT\"}[579m])) by (resource, subresource, scope, verb)" at 2020-02-05T04:50:52Z
W0205 04:51:13.710] I0205 04:51:13.710240   14942 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:perc50: 33.872901ms, perc90: 92.702588ms, perc99: 3.094022988s Count:49727}; threshold: 5s
W0205 04:51:13.711] I0205 04:51:13.711072   14942 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:deployments Subresource: Verb:LIST Scope:cluster Latency:perc50: 50ms, perc90: 1.45s, perc99: 1.495s Count:4}; threshold: 30s
W0205 04:51:13.711] I0205 04:51:13.711503   14942 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:services Subresource: Verb:LIST Scope:cluster Latency:perc50: 135.646387ms, perc90: 194.890109ms, perc99: 1.339166666s Count:393}; threshold: 30s
W0205 04:51:13.712] I0205 04:51:13.711817   14942 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:perc50: 281.161616ms, perc90: 339.464751ms, perc99: 1.151083333s Count:3087}; threshold: 30s
W0205 04:51:13.712] I0205 04:51:13.712131   14942 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:persistentvolumes Subresource: Verb:LIST Scope:cluster Latency:perc50: 28.921078ms, perc90: 65.846153ms, perc99: 405.249999ms Count:1158}; threshold: 30s


This PR

W0206 03:24:30.481] I0206 03:24:30.481494   13663 prometheus.go:108] Executing "sum(increase(apiserver_request_duration_seconds_count{resource!=\"events\", verb!~\"WATCH|WATCHLIST|PROXY|proxy|CONNECT\"}[612m])) by (resource, subresource, scope, verb)" at 2020-02-06T03:24:12Z
W0206 03:24:30.658] I0206 03:24:30.658187   13663 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:deployments Subresource: Verb:LIST Scope:cluster Latency:perc50: 1.59057971s, perc90: 1.969354838s, perc99: 3.09s Count:246}; threshold: 30s
W0206 03:24:30.659] I0206 03:24:30.658242   13663 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:perc50: 36.181327ms, perc90: 93.696929ms, perc99: 3.027289416s Count:59781}; threshold: 5s
W0206 03:24:30.659] I0206 03:24:30.658254   13663 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:perc50: 284.575371ms, perc90: 346.103999ms, perc99: 1.167596153s Count:3257}; threshold: 30s
W0206 03:24:30.659] I0206 03:24:30.658262   13663 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:statefulsets Subresource: Verb:DELETE Scope:namespace Latency:perc50: 26.612903ms, perc90: 47.903225ms, perc99: 800.999999ms Count:99}; threshold: 1s
W0206 03:24:30.659] I0206 03:24:30.658270   13663 api_responsiveness_prometheus.go:282] APIResponsivenessPrometheusSimple: Top latency metric: {Resource:services Subresource: Verb:LIST Scope:cluster Latency:perc50: 138.931297ms, perc90: 193.414634ms, perc99: 672ms Count:412}; threshold: 30s


Summary

I think the results look reasonable enough to merge this PR. We see some regression in pod-startup-latency (from 3.2s to 3.9s), but this is still well within the 5s SLO. Moreover, we'd like to get rid of this artificial way of measuring it (see #1024), and this small regression shouldn't block us from doing that; quite the opposite, it should encourage us to debug and improve it. On the other hand, we see improvements in api-call-latency, and the whole test (density + load) now takes 2h less.

# failure won't fail the test. See https://github.com/kubernetes/kubernetes/issues/73461#issuecomment-467338711
{{$saturationDeploymentHardTimeout := MaxInt $saturationDeploymentTimeout 1200}}

# TODO(https://github.com/kubernetes/perf-tests/issues/1007): Get rid of this file
Member:

There is also the high-density test, which relies on it.
Medium-term, we should modify the new load test to support it (e.g., instead of creating 2 deployments, create N), but short-term, maybe let's leave this test.

Contributor Author:

Sent you PRs to address it

{{$schedulerThroughputThreshold := DefaultParam .CL2_SCHEDULER_THROUGHPUT_THRESHOLD 0}}
# END scheduler-throughput section

# TODO(https://github.com/kubernetes/perf-tests/issues/1024): Ideally, we wouldn't need this section.
Member:

nit:
s/Ideally .../Investigate and get rid of this section./

Contributor Author:

Done.
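
For context on the DefaultParam line in the snippet above: CL2_SCHEDULER_THROUGHPUT_THRESHOLD falls back to 0 unless a job overrides it, which can be done through a clusterloader2 overrides file passed via --testoverrides. A hypothetical override:

# overrides.yaml, passed with --testoverrides=overrides.yaml
CL2_SCHEDULER_THROUGHPUT_THRESHOLD: 100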

  min: {{AddInt $namespaces 1}}
  max: {{AddInt $namespaces $schedulerThroughputNamespaces}}
replicasPerNamespace: 1
tuningSet: PodThroughputParallel
Member:

The PodThroughputParallel name is misleading, as in fact we create deployments with that throughput. Given those are fairly big ones, the pod throughput may in fact be much higher.

I don't have a good name though...

Contributor Author:

Doh, I meant to write SchedulerThroughputParallel, meaning that this is a dedicated tuningSet for SchedulerThroughput that results in fully parallel creation of deployments.
Changed and documented better.
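
A sketch of what the dedicated tuning set could look like, assuming the parallelismLimitedLoad tuning set type; the limit value below is a placeholder, not taken from this PR:

tuningSets:
- name: SchedulerThroughputParallel
  parallelismLimitedLoad:
    parallelismLimit: 1000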

mm4tt added a commit to mm4tt/perf-tests that referenced this pull request Feb 6, 2020
This is to allow deprecating density tests in all other places except high-density. See kubernetes#1008 (comment)
@mm4tt mm4tt force-pushed the get_rid_of_density branch from 192a1dd to 6c5052d Compare February 6, 2020 10:59
@wojtek-t (Member) commented Feb 6, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020
@mm4tt (Contributor, Author) commented Feb 6, 2020

/hold
Will merge once #1026 gets merged

@mm4tt (Contributor, Author) commented Feb 6, 2020

@oxddr (current scalability oncall) FYI

I'm merging this today so we can check it tomorrow and revert before the weekend if needed.

@mm4tt (Contributor, Author) commented Feb 6, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2020
@mm4tt (Contributor, Author) commented Feb 6, 2020

/test pull-perf-tests-clusterloader2

@mm4tt mm4tt force-pushed the get_rid_of_density branch from 6c5052d to 0d48e27 Compare February 6, 2020 15:23
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2020
@wojtek-t (Member) commented Feb 6, 2020

@mm4tt - what has changed?

@mm4tt mm4tt force-pushed the get_rid_of_density branch from 0d48e27 to e1bb781 Compare February 7, 2020 11:38
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 7, 2020
@mm4tt mm4tt changed the title from "Merge density test into load test" to "Experimental config with merged density and load" Feb 7, 2020
@mm4tt (Contributor, Author) commented Feb 7, 2020

Given some flakes I noticed in the presubmits, I changed the code to make it a no-op for existing tests.
I'm forking the load test into a separate config where I'll make my changes and enable it as an experimental job. Once we get enough results and confidence that the new test works as expected, I'll start enabling it in CI/CD jobs.

@mm4tt mm4tt force-pushed the get_rid_of_density branch from e1bb781 to 1f5b392 Compare February 7, 2020 11:41
@mm4tt (Contributor, Author) commented Feb 7, 2020

@wojtek-t, PTAL

This should be a no-op now.

@wojtek-t (Member) commented Feb 7, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mm4tt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
