How Scalable is a Materialized View Aggregate in RisingWave?

I'm trying to understand the scalability of a materialized view aggregate in RisingWave, specifically when counting unique events from a Kafka stream based on a hash field in the JSON payload. Here's an example query I'm considering:

create materialized view mv2 as select field1, count(field1) from s1 group by field1

I want to know how much throughput an individual RisingWave deployment can handle and when it's necessary to distribute data streams to separate deployments.


Bat Milkyway

Asked on Oct 23, 2023

The scalability of a materialized view aggregate in RisingWave depends on several factors:

  1. Node Configuration: The number of vCores and memory in the cluster can affect throughput.
  2. Cardinality of the Field: If the cardinality of field1 is small and can be cached in memory and disk, the query is very scalable. Throughput likely scales linearly with the number of CPUs.
  3. Cache Efficiency: With high cardinality and limited memory, cache misses may occur, leading to frequent state fetches from external storage like S3, which introduces latency and potential bottlenecks.
  4. Window Functions: Using TUMBLE or HOP can confine the state to a specified interval, affecting how state is managed and potentially improving performance.

It's recommended to run a workload on RisingWave to get a clearer idea of performance, especially since a small cache might still be effective if there's a small number of hot keys. Benchmarking results can also provide a general idea of throughput capabilities.

Nov 02, 2023Edited by