Summary

Proposing a stateful map vertex.
Inside the map handler, the user must be able to read and write a global state for the vertex.

Use Cases
The state content is fully up to the user, but it may, for example, hold:

- the previous message,
- a list of the N previous messages,
- an aggregated metric of prior events,
- an online ML model with updateable weights,
- a histogram of prior events.
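To make the "aggregated metric of prior events" case concrete, here is a minimal sketch of what a stateful map handler could look like. The `state` argument, its `get`/`put` methods, and the `InMemoryState` stand-in are all assumptions for illustration — none of this is an existing Numaflow API.

```python
# Hypothetical sketch: a stateful map handler that keeps an aggregated
# metric (running count and total) of prior events for one key.
# The state API shown here is an assumption, not part of any SDK.

class InMemoryState:
    """Stand-in for a per-key, persisted KV state store."""
    def __init__(self):
        self._kv = {}

    def get(self, key, default=None):
        return self._kv.get(key, default)

    def put(self, key, value):
        self._kv[key] = value

def stateful_map(state: InMemoryState, value: float) -> float:
    """Return the running mean of all values seen so far for this key."""
    count = state.get("count", 0) + 1
    total = state.get("total", 0.0) + value
    state.put("count", count)
    state.put("total", total)
    return total / count

state = InMemoryState()
print(stateful_map(state, 10.0))  # 10.0
print(stateful_map(state, 20.0))  # 15.0
```

In a real implementation the store would be persisted and keyed per the design considerations below, rather than held in process memory.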
This proposal could also solve the GPS data smoothing raised in #2235 in cases where only "past" events are needed. (If "future" events are needed, one could use the state as an N-message ring buffer: the incoming message is added to the buffer, and the oldest buffered message is output from the map.)
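The ring-buffer idea above can be sketched in a few lines. The function name and the choice of `collections.deque` are illustrative assumptions, not an existing API:

```python
# Sketch of the N-message ring buffer described above: the incoming
# message is appended, and once the buffer holds more than `capacity`
# messages, the oldest one is emitted from the map.
from collections import deque

def ring_buffer_map(buffer: deque, msg, capacity: int = 3):
    """Append msg; emit (and remove) the oldest message once full,
    otherwise emit nothing (None)."""
    buffer.append(msg)
    if len(buffer) > capacity:
        return buffer.popleft()
    return None  # not enough messages buffered yet

buf = deque()
outputs = [ring_buffer_map(buf, m) for m in range(5)]
print(outputs)  # [None, None, None, 0, 1]
```

Each emitted message lags the input by `capacity` messages, which is what gives the smoothing step access to "future" events relative to the message being output.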
Design Considerations
- The state must be persisted so that it is resilient to pod/pipeline restarts.
- State writes and message acks should somehow be covered by the same transaction. (For example, a handler should not be able to increase a count in the state, crash on message write-out, have the pod restart, and repeat all of this in a loop.)
- The state should be keyed (keyed streams have one global state per key).
- Only a single partition is required (per key).
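The transactional-ack concern above is essentially an effectively-once requirement. One common workaround, sketched below under the assumption of a store that can commit a state update and a processed offset together, is to record the last processed offset alongside the state so that a redelivered message is detected and skipped. All names here are hypothetical:

```python
# Sketch: storing the last processed offset with the state makes a
# replayed message (after a crash before ack) detectable, so the count
# is not incremented twice. A plain dict stands in for a store that
# commits both fields atomically.

def process(state: dict, offset: int, increment: int = 1) -> int:
    """Update the count and record the offset in one state mutation."""
    if offset <= state.get("last_offset", -1):
        return state["count"]  # duplicate delivery after a crash: skip
    state["count"] = state.get("count", 0) + increment
    state["last_offset"] = offset
    return state["count"]

st = {}
print(process(st, 0))  # 1
print(process(st, 1))  # 2
print(process(st, 1))  # 2  (redelivery after restart is ignored)
```

This only works if the state write and the offset write are atomic with respect to each other, which is exactly why the design consideration ties state writes and acks into one transaction.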
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
There are two concerns I have with implementing this within Numaflow.
Choosing a Store
The current stores that come with Numaflow are optimized for data and metadata movement. They won't be able to support workloads that deviate from what we have optimized for; e.g., we will experience OOMs if the state size grows, or the throughput will be severely compromised, causing a lot of unwanted side effects.
On the other hand, there are lots of open-source cloud-native stores out there and they can be deployed very easily in K8s. One can choose any optimal store of any API style and configure it specifically for their needs.
NOTE: Even in the Flink pipelines we write, we move the state out from Flink to external DBs because Flink simply cannot scale as these stores can get huge at high TPS.
Fulfilling Completeness Property
For a platform, implementing the Stateful Map Vertex so that it obeys the "completeness property" (it should work in all use cases) is quite tricky.
As a platform, we will need to have concrete answers for:
- How big should these stores be (a priori knowledge)? (We have auto-scaling mechanisms when we see back pressure on the ISB, but these cannot be translated to custom states.)
- What should the APIs look like? Just put and get (KV style), or should we also support pop and push (list style)?
- When should these datasets be GC'd/deleted (element-based or time-based)? We can provide APIs, but then someone will have to track the names of these datasets.
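To make the API-shape question concrete, here is one possible surface combining KV-style `get`/`put`, list-style `push`/`pop`, and a time-based GC hook. This is entirely hypothetical — nothing here exists in Numaflow today — and it is only meant to show how the three questions above interact:

```python
# Hypothetical user-facing state API: KV-style get/put, list-style
# push/pop, and a time-based GC sweep. Illustrative only.
import time

class StateAPI:
    def __init__(self, ttl_seconds: float):
        self._kv = {}          # key -> (value, write_time)
        self._ttl = ttl_seconds

    def put(self, key, value):
        self._kv[key] = (value, time.monotonic())

    def get(self, key, default=None):
        entry = self._kv.get(key)
        return default if entry is None else entry[0]

    def push(self, key, value):
        lst = self.get(key, [])
        lst.append(value)
        self.put(key, lst)

    def pop(self, key):
        lst = self.get(key, [])
        return lst.pop(0) if lst else None

    def gc(self):
        """Time-based GC: drop entries older than the TTL."""
        now = time.monotonic()
        self._kv = {k: v for k, v in self._kv.items()
                    if now - v[1] <= self._ttl}

s = StateAPI(ttl_seconds=60)
s.push("events", "a")
s.push("events", "b")
print(s.pop("events"))  # a
```

Even this small sketch surfaces the tracking problem raised above: something still has to enumerate the keys (dataset names) so that `gc()` can be run over them.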