I have split our compute node into compute-serving
and compute-streaming
components with multiple tables being replicated from PostgreSQL. I've noticed that when I have multiple compute-streaming
nodes, only one connects to the database for all sources, which are not load balanced across other nodes. I expected sources to be round-robin distributed between nodes. Also, when I kill the node that is doing all the work, it takes a long time for another node to pick up, and I'm wondering if this failover time is tunable.
Rick Otten
Asked on Feb 12, 2024
The lack of horizontal scalability for compute-streaming
nodes is due to the fact that each source's number of active source actors won't exceed the number of splits in the upstream data source. Since your tables are singleton, only one source actor on one node will ingest data to maintain data consistency. The scheduling strategy might be causing all active actors to be on the same node. For faster failover, you should use risectl
to scale the horizon and unregister the compute node gracefully before killing it. This process is currently manual but will be automated in future releases. The long failover time you're experiencing is likely due to the meta node waiting to see if the killed node comes back before reassigning tasks, which is a safety feature to handle unexpected failures.