rework on poller auto scaler #1411

Open · wants to merge 16 commits into base: master
Conversation

shijiesheng (Member)

What changed?

  • read a new signal (poller wait time) to drive scaling (sketched below)
  • allow kill-switching the poller auto scaler from the server
  • new implementation that reacts more quickly to traffic changes
  • removed the no-longer-used autoscaler package entirely (the original implementation was over-complicated)
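
For context, a minimal sketch of the scaling idea described above (pollerWaitTimeInMsLog2 and targetPollerWaitTimeInMsLog2 follow the diff; SetQuota and the exact update rule are assumptions, not the merged code):

```go
// Sketch only: move the poller quota multiplicatively by how far the
// smoothed wait time is from the target, working in log2 space.
// pollerWaitTimeInMsLog2 is a rolling average of log2(wait time in ms).
smoothed := c.pollerWaitTimeInMsLog2.Average()
newQuota := float64(currentQuota) * math.Exp2(targetPollerWaitTimeInMsLog2-smoothed)
c.concurrency.PollerPermit.SetQuota(int(newQuota)) // SetQuota is an assumed name
```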

Why?

  • a more reliable signal for scaling
  • quicker reaction to traffic changes

How did you test it?

unit tests
[WIP] canary test + bench test

Potential risks


codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 91.90751% with 14 lines in your changes missing coverage. Please review.

Project coverage is 82.64%. Comparing base (1fd8ba0) to head (636c433).

Files with missing lines                    Patch %   Lines
internal/internal_worker_base.go            76.74%    10 Missing ⚠️
internal/worker/concurrency_auto_scaler.go  96.92%    3 Missing and 1 partial ⚠️

Files with missing lines                    Coverage Δ
internal/worker/concurrency_auto_scaler.go  96.92% <96.92%>  (ø)
internal/internal_worker_base.go            81.57% <76.74%>  (-1.06%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fd8ba0...636c433.

@@ -421,3 +422,20 @@ func (bw *baseWorker) Stop() {
	}
	return
}

func getAutoConfigHint(task interface{}) *shared.AutoConfigHint {
	switch t := task.(type) {
@Groxx Groxx (Member) · Dec 31, 2024

when I was poking through these types, to see if this was complete, I believe I found a fourth. I don't remember if it was locally-dispatched activity types or query tasks or something else though.

what happens if this is wrong? does it just consider [something] as having no cost and... do what?

Member

minor nonblocking suggestion: not to be too OO-pilled or whatever, but if this is a property of the task type, maybe make it a method on the task interface?

i.e., instead consider:

type autoConfigAwareTask interface {
	getAutoConfigHint() *shared.AutoConfigHint
}

// for each task, make them responsible for their internal getters and internal state
func (t *workflowTask) getAutoConfigHint() *shared.AutoConfigHint {
	return t.task.AutoConfigHint
}
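
With that interface in place, the existing type switch collapses to a single assertion (a sketch, not the merged code):

```go
func getAutoConfigHint(task interface{}) *shared.AutoConfigHint {
	if t, ok := task.(autoConfigAwareTask); ok {
		return t.getAutoConfigHint()
	}
	return nil // task types without a hint contribute nothing
}
```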

Comment on lines 315 to 316
inputChan <- tt.input[i]
<-doneC
Member

this is serializing the whole thing, so it doesn't really test anything concurrently - you can even remove the mutexes and it'll still pass the race detector.

what's the goal with the concurrency? it's how it's used so 👍 concurrency in tests is good, but this has artificially nerfed it.
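
One hedged way to restore real concurrency, reusing the test's tt.input/inputChan/doneC names (splitting producer and consumer is my assumption about the test's intent):

```go
// produce and consume in parallel instead of lock-stepping each item,
// so the race detector actually exercises the scaler's shared state
go func() {
	for i := range tt.input {
		inputChan <- tt.input[i]
	}
}()
for range tt.input {
	<-doneC // drain results while production continues
}
```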

Comment on lines 246 to 290
r.index %= len(r.window)
r.sum += value - r.window[r.index]
r.window[r.index] = value
r.index++
Member

general design critique: where possible, make sure all state is valid when it leaves the function boundaries.

e.g. in this case, r.index is allowed to be == len(r.window) (and unsafe to use) once every len(r.window) calls. if this were instead:

Suggested change:

-	r.index %= len(r.window)
 	r.sum += value - r.window[r.index]
 	r.window[r.index] = value
 	r.index++
+	r.index %= len(r.window)

it'd always be correct.
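
For context, the whole method with that reordering applied would look roughly like this (a sketch; field names follow the excerpt, the count bookkeeping and the omitted locking are assumptions):

```go
func (r *rollingAverage[T]) Add(value T) {
	r.sum += value - r.window[r.index] // evict the oldest sample from the sum
	r.window[r.index] = value
	r.index++
	r.index %= len(r.window) // index is back in range before we return
	if r.count < len(r.window) {
		r.count++ // filled-slot count, used by Average()
	}
}
```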

Comment on lines 182 to 183
c.scope.Gauge("poller_in_action").Update(float64(c.concurrency.PollerPermit.Count()))
c.scope.Gauge("poller_quota").Update(float64(c.concurrency.PollerPermit.Quota()))
@Groxx Groxx (Member) · Dec 31, 2024

no complaint at all about emitting both, but: when are we expecting them to be different, aside from for extremely small amounts of time (between "poll complete" and "started the next poll")?

if it's just for general monitoring / if there is a difference we can investigate because it might be a bug, yea seems great. I'm just wondering if there's some other intent / expected difference, like from other limiters or something.

shijiesheng (Member, Author)

Makes sense. Maybe just the poller quota is needed here.

Comment on lines 161 to 176
if switched := c.enable.CompareAndSwap(!shouldEnable, shouldEnable); switched {
	if shouldEnable {
		c.logEvent(autoScalerEventEnable)
	} else {
		c.resetConcurrency()
		c.logEvent(autoScalerEventDisable)
	}
}
@Groxx Groxx (Member) · Dec 31, 2024

tbh I think I'd prefer just using a mutex.

right now there's nothing preventing:

  • updatePollerPermit begins, checks enable...
  • disable
  • reset concurrency resets the quota
  • ... updatePollerPermit sets quota to something else

and leaving things in an abnormal state with no ability to notice it.

with the possible exception of metrics/log buffer flushing and whatnot, there's no I/O in here and no expected lock contention - should be no issue at all with using a mutex.

shijiesheng (Member, Author)

Makes sense.
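
Roughly what the mutex version looks like (a sketch; the enabled field and the quota update are paraphrased from the surrounding diff):

```go
func (c *ConcurrencyAutoScaler) updatePollerPermit() {
	c.lock.Lock()
	defer c.lock.Unlock()
	if !c.enabled { // checked under the same lock that disable/reset take
		return
	}
	// ... compute and apply the new quota here; a concurrent disable can
	// no longer interleave between the check and the update
}
```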

// 2. enable/disable auto scaler
func (c *ConcurrencyAutoScaler) ProcessPollerHint(hint *shared.AutoConfigHint) {
	if hint == nil {
		c.log.Warn("auto config hint is nil, this results in no action")
@Groxx Groxx (Member) · Dec 31, 2024

maybe just info? otherwise this'll make a flood of warns in user services when rolling out or disabling, but it's not really an unexpected / worrying thing in that state.

or maybe no log if not enabled?

shijiesheng (Member, Author)

Makes sense. I'll mark it as info level.

const (
	defaultAutoScalerUpdateTick = time.Second
	// concurrencyAutoScalerObservabilityTick = time.Millisecond * 500
	targetPollerWaitTimeInMsLog2 = 4 // 16 ms
@Groxx Groxx (Member) · Dec 31, 2024

took a while to figure out what exactly the goal here was, I think because the log2/exp2 steps are so separated...

might be easier to follow if you make this a normal ms value, and do a current*log2(actual/target) at the point where you're calculating the quota instead? to me that feels more obviously like a smoothing operation at that point / a "react to the magnitude rather than the value", where currently it's kinda hidden as a current*log2(actual)/log2(target) that's being done in three separate locations.


re TimeInMs vs a time.Duration: seems fine to me. it's a ton of value-casting noise otherwise, sadly. and/or maybe we should consider a "duration math" util somewhere, as we do a moderate amount of duration-related floaty calculations in both client and server, and they're super verbose.

shijiesheng (Member, Author)

The only concern is that I'm currently averaging the wait time as average(log2(wait_time)).
If I adopt this approach, large wait times will play a bigger role in the average, which is not desired.
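
A quick illustration of that point with made-up numbers: averaging in log2 space is a geometric mean, so a single spike barely moves it.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	waits := []float64{1, 1, 64} // ms; two quiet samples and one spike
	var sumLog2, sum float64
	for _, w := range waits {
		sumLog2 += math.Log2(w)
		sum += w
	}
	fmt.Println(math.Exp2(sumLog2 / 3)) // geometric mean: 4 ms
	fmt.Println(sum / 3)                // arithmetic mean: 22 ms
}
```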

Comment on lines +199 to +209
updateTime := c.clock.Now()
if updateTime.Before(c.pollerPermitLastUpdate.Add(c.cooldown)) { // before cooldown
	c.logEvent(autoScalerEventPollerSkipUpdateCooldown)
	return
}
@Groxx Groxx (Member) · Dec 31, 2024

since this func (and therefore updateTime's last value) is only ever called in a loop driven by a timer, it seems like this only has two modes of operation:

  • cooldown < tick: every tick causes an update
  • cooldown > tick: every N ticks cause an update

neither of which seems particularly useful. probably remove / leftovers from an older idea?

or is this intended as a one-time warm-up delay of some kind? I'm not sure that would be useful here / with the current math setup, but there's often some reason to have warmup periods.

shijiesheng (Member, Author)

It's actually updated when the poller permit is updated, so it serves as a sustain time to avoid updating multiple times within a small window.
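
In other words, the timestamp advances only when a quota change is actually applied, so the cooldown acts as a sustain window rather than a tick divider (a sketch; SetQuota is an assumed name):

```go
// reached only when the quota actually changes
c.concurrency.PollerPermit.SetQuota(newQuota)
c.pollerPermitLastUpdate = updateTime // the cooldown restarts from here
```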

@Groxx Groxx (Member) commented Jan 2, 2025

to stick it in here too: overall looks pretty good. simpler and the overall goal (and why it achieves it) is clearer too. seems like just minor tweaks (many optional) and it's probably good to go

@3vilhamster 3vilhamster (Contributor) left a comment

Overall looks good, but I left some nits

@@ -301,7 +308,7 @@ func (bw *baseWorker) pollTask() {
 	var err error
 	var task interface{}

-	if bw.pollerAutoScaler != nil {
+	if bw.concurrencyAutoScaler != nil {
 		if pErr := bw.concurrency.PollerPermit.Acquire(bw.limiterContext); pErr == nil {
Contributor

nit: this looks like a leaking abstraction; it should be handled inside concurrencyAutoScaler. I suggest moving all the concurrencyAutoScaler != nil checks inside the methods where they're required. This code would then be simpler: just call methods on the autoscaler, and if it is nil, do nothing.
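
For example, with a nil-receiver guard the call site needs no check at all (a sketch):

```go
// methods on a nil *ConcurrencyAutoScaler become no-ops, so the base
// worker can call them unconditionally
func (c *ConcurrencyAutoScaler) ProcessPollerHint(hint *shared.AutoConfigHint) {
	if c == nil {
		return
	}
	// ...
}
```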

shijiesheng (Member, Author)

Makes sense.

		return
	case <-ticker.Chan():
		c.logEvent(autoScalerEventMetrics)
		c.lock.Lock()
Contributor

nit: push the lock/unlock into updatePollerPermit; then you can use defer inside the function and ensure the unlock happens even if anything causes a panic.

shijiesheng (Member, Author)

right, it's simpler

c.wg.Add(1)

go func() {
	defer c.wg.Done()
@3vilhamster 3vilhamster (Contributor) · Jan 8, 2025

nit: any call that starts a goroutine should have a panic handler. If a bug exists, it will crash the worker process, significantly impacting customer services. This is optional functionality that should be safe to break; worst case, it won't update concurrency.
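
Something like this (a sketch; the logger field and the zap call are assumptions):

```go
go func() {
	defer c.wg.Done()
	defer func() {
		if p := recover(); p != nil {
			// swallow the panic: the scaler is best-effort, the worker must survive
			c.log.Error("concurrency auto scaler panicked", zap.Any("panic", p))
		}
	}()
	// ... ticker loop
}()
```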

shijiesheng (Member, Author)

good catch

if hint.EnableAutoConfig != nil && *hint.EnableAutoConfig {
	shouldEnable = true
}
if shouldEnable != c.enabled { // flag switched
Member

nit: consider an early abort?

if hint.EnableAutoConfig == nil || !*hint.EnableAutoConfig {
	return
}

if r.count == 0 {
	return 0
}
return r.sum / T(r.count)
Member

are there any conditions under which sum might become zero?

4 participants