Support for Stateful Map Vertex #2384

th0ger · 2025-02-07T10:35:05Z

Summary

Proposing a stateful map vertex.

Inside the map handler, the user must be able to read and write a global state for the vertex.

Use Cases

Change detection (between current and previous message).
Cumulative averages
Exponential smoothing (EMA)
Online Machine Learning, like River
Kalman filters (sensor fusion)
KS test for outlier detection

The state content is fully up to the user, but may for example hold:

the previous message,
a list of the N previous messages,
an aggregated metric of prior events,
an online ML model with updateable weights,
a histogram of prior events.
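As a rough illustration of two of the use cases above (change detection and exponential smoothing), here is a minimal sketch of what a handler with user-defined state could look like. `EmaState`, `stateful_map`, and the `alpha` parameter are hypothetical names made up for this example; they are not part of any existing Numaflow SDK, and persistence is ignored entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmaState:
    """Hypothetical user-defined state: the previous raw value and the running EMA."""
    previous: Optional[float] = None
    ema: Optional[float] = None


def stateful_map(state: EmaState, value: float, alpha: float = 0.5) -> dict:
    """Hypothetical handler: reads the state, builds the output, then updates the state."""
    # Change detection: compare the current value with the previous message.
    changed = state.previous is not None and value != state.previous
    # EMA recurrence: ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}, seeded with the first value.
    state.ema = value if state.ema is None else alpha * value + (1 - alpha) * state.ema
    state.previous = value
    return {"value": value, "changed": changed, "ema": state.ema}


state = EmaState()
results = [stateful_map(state, v) for v in (10.0, 10.0, 20.0)]
```

The same shape would work for the other state contents listed above by swapping `EmaState` for a list, a histogram, or a model object.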

This proposal could also solve the GPS data smoothing raised in #2235 in cases where only "past" events are needed. (If "future" events are needed, one could use the state as an N-message ring buffer, where the incoming message is added to the buffer and the oldest buffered message is output from the map.)
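The ring-buffer idea could be sketched roughly as follows. `delayed_map` and its parameters are hypothetical names for illustration only; the buffer here is a plain in-memory `deque` standing in for the vertex state.

```python
from collections import deque
from typing import Deque, List, Optional


def delayed_map(buffer: Deque[float], value: float, n: int) -> Optional[float]:
    """Add the incoming value; once more than n values are buffered, emit the oldest.

    The emitted value therefore has n "future" values still sitting in the buffer,
    which a smoothing function could inspect before the value is output.
    """
    buffer.append(value)
    if len(buffer) > n:
        return buffer.popleft()
    return None  # Still filling the buffer: nothing to emit yet.


buffer: Deque[float] = deque()
outputs: List[Optional[float]] = [delayed_map(buffer, v, n=2) for v in (1.0, 2.0, 3.0, 4.0)]
```

The cost of this approach is a fixed output delay of N messages per key.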

Design Considerations

The state must be persisted so that it is resilient to pod/pipeline restarts.
State writes and message acks should be covered by the same transaction. (For example, a handler should not be able to increment a count in the state, crash while writing the message out, and then repeat this in a loop on every pod restart.)
The state should be keyed (keyed streams have one global state per key).
Only single-partition is required (per key).
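One way to approximate the transactional requirement, offered here as an assumption and not as a proposed Numaflow design, is to store the offset of the last applied message together with the state in a single write. A redelivered message is then recognized and does not update the state twice. The sketch below assumes one partition per key with monotonically increasing offsets; `Record`, `KeyedStateStore`, and `count_handler` are hypothetical names, and the dictionary stands in for a persisted store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Record:
    """Hypothetical per-key record: the user state plus the last applied offset."""
    count: int = 0
    last_offset: int = -1


class KeyedStateStore:
    """In-memory stand-in for a persisted store with one record per key."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def get(self, key: str) -> Record:
        return self._records.get(key, Record())

    def put(self, key: str, record: Record) -> None:
        # A single write, so the count and the offset always change together.
        self._records[key] = record


def count_handler(store: KeyedStateStore, key: str, offset: int) -> int:
    """Count messages per key, ignoring redeliveries of already applied offsets."""
    record = store.get(key)
    if offset <= record.last_offset:
        return record.count  # Redelivery after a crash: the state was already updated.
    updated = Record(count=record.count + 1, last_offset=offset)
    store.put(key, updated)
    return updated.count


store = KeyedStateStore()
first = count_handler(store, "sensor-a", offset=0)
replayed = count_handler(store, "sensor-a", offset=0)  # Same message delivered again.
other_key = count_handler(store, "sensor-b", offset=0)
second = count_handler(store, "sensor-a", offset=1)
```

This makes the state update idempotent under at-least-once delivery; it does not by itself make the downstream write and the ack atomic.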

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

vigith · 2025-02-10T16:16:38Z

There are two concerns I have with implementing this within Numaflow.

Choosing a Store

The current stores that come with Numaflow are optimized for data and metadata movement. They won't be able to support workloads that deviate from what we have optimized for. E.g., we will see OOMs as the state size grows, or throughput will be severely compromised, causing a lot of unwanted side effects.

On the other hand, there are lots of open-source cloud-native stores out there and they can be deployed very easily in K8s. One can choose any optimal store of any API style and configure it specifically for their needs.

NOTE: Even in the Flink pipelines we write, we move the state out from Flink to external DBs because Flink simply cannot scale as these stores can get huge at high TPS.

Fulfilling Completeness Property

For a platform, implementing a Stateful Map Vertex that obeys the "completeness property" (it should work in all use cases) is quite tricky.
As a platform, we will need to have concrete answers for:

How big should these stores be (a priori knowledge)? (We have auto-scaling mechanisms that react to backpressure on the ISB, but these cannot be translated to custom state.)
What should the APIs look like? Just put and get (KV style), or should we also support push and pop (list style)?
When should these datasets be GC'd/deleted (element based or time based)? We can provide APIs, but then someone will have to track the names of these datasets.
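To make the API and GC questions more concrete, here is a minimal sketch of one possible answer: a KV-style interface (put and get only) with time-based expiry. Everything in it is an assumption for discussion purposes: `TtlKvState`, its method names, and the injectable `clock` are hypothetical, and the in-memory dictionary sidesteps the sizing question entirely.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlKvState:
    """Hypothetical KV-style state with time-based GC (entries expire after a TTL)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # Injectable so expiry can be exercised without sleeping.
        self._data: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        # Store the value together with its expiry time.
        self._data[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # Lazy GC: expired entries are dropped on read.
            return None
        return value


now = [0.0]  # Fake clock for the example.
kv = TtlKvState(ttl_seconds=10.0, clock=lambda: now[0])
kv.put("previous", b"42")
fresh = kv.get("previous")
now[0] = 10.0  # Advance the fake clock to the expiry time.
expired = kv.get("previous")
```

A list-style API (push, pop) or element-based GC would need different bookkeeping, which is part of why a single platform-level answer is hard to pick.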

vigith · 2025-02-10T16:17:36Z

I am moving this to a GitHub discussion; we can convert it to an issue once we have a better picture.

th0ger added the enhancement label Feb 7, 2025

numaproj locked and limited conversation to collaborators Feb 10, 2025

vigith converted this issue into discussion #2388 Feb 10, 2025
