How does RisingWave ensure data distribution and node responsibility?
I'm trying to understand how RisingWave manages data distribution across nodes and how it ensures that each node or executor is responsible for its own portion of data. I saw a diagram that seemed to specify node responsibilities, but I'm not clear on whether the nodes are chosen by RisingWave or if there's a specific configuration that needs to be done. Can you explain how the ingest_batch
command works in this context and how the shared buffer and materialized views play a role in data distribution?
JJ
Asked on May 16, 2022
Alex Chi clarified that RisingWave automatically handles the distribution of data. The ingest_batch
command is called by streaming executors, which put the write batch into the shared buffer. The compute worker node then checkpoints it onto shared storage, like S3. The distribution of data is not done at the storage layer but rather at the compute layer through the stream plan. For example, when creating a materialized view with an exchange by
clause, RisingWave ensures that each executor reads only its own portion of data. The shared buffer is local to each compute node and allows executors on the same node to see each other's writes, but it does not control data distribution across nodes. The actual distribution is defined by the SQL plan, which specifies how data is partitioned (e.g., hash(c) % n == x
).