
How to resolve a timeout error when creating a materialized view in RisingWave?

I'm trying to create a materialized view (MV) in RisingWave that depends on two other views, but I keep encountering a timeout error, with the following message from the metadata server:

thread 'risingwave-main' panicked at src/meta/src/barrier/mod.rs:303:17:
duplicated table in concurrent checkpoint

This happens after the first attempt times out. The MV I'm trying to create is as follows:

CREATE MATERIALIZED VIEW port_snapshot AS
SELECT
    timestamp, nd.source, nd.iface, up
FROM
    net_device_ifaces AS nd
RIGHT JOIN
    port_history AS ph ON
        nd.source = ph.source AND
        nd.iface  = ph.iface
WHERE
    timestamp > NOW() - INTERVAL '15 days'
    AND timestamp = (
        SELECT
            MAX(timestamp)
        FROM
            port_history
        WHERE
            source = nd.source AND
            iface = nd.iface
    )
GROUP BY timestamp, source, iface, up;
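
For completeness, the streaming plan for this statement can be inspected without actually launching the backfill by prefixing it with EXPLAIN (assuming 1.5.0 supports EXPLAIN on CREATE MATERIALIZED VIEW, which I believe it does):

-- Prints the streaming plan without starting the backfill job
EXPLAIN CREATE MATERIALIZED VIEW port_snapshot AS
SELECT timestamp, nd.source, nd.iface, up
FROM net_device_ifaces AS nd
RIGHT JOIN port_history AS ph
    ON nd.source = ph.source AND nd.iface = ph.iface
WHERE timestamp > NOW() - INTERVAL '15 days'
    AND timestamp = (
        SELECT MAX(timestamp) FROM port_history
        WHERE source = nd.source AND iface = nd.iface
    )
GROUP BY timestamp, source, iface, up;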

The expected row count for port_snapshot over that time window is approximately 2.8 million. Both input views are themselves materialized views. The metadata server has 2 cores and 2 GB of RAM. I'm deploying on Kubernetes (k8s) with etcd as the meta store, using the RisingWave operator, and running a 1.5.0 cluster. The creation job stalls at around 3.45% and eventually fails with an error indicating that the metadata server crashed and restarted. I've also noticed warnings about using an in-memory remote object store for meta backup, which is not recommended for a production environment. I'm currently using Google Cloud Storage (GCS) as the backing store for Hummock.
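
Progress figures like the 3.45% above can be read from RisingWave's internal catalog; a minimal query, assuming the rw_catalog.rw_ddl_progress view is available in 1.5.0:

-- Shows backfill progress for in-flight CREATE MATERIALIZED VIEW jobs
SELECT * FROM rw_catalog.rw_ddl_progress;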

What could be causing this timeout issue, and how can I resolve it to successfully create the materialized view?


Josh Toft

Asked on Dec 21, 2023

It seems that the root cause is GCS read operations taking an excessive amount of time, which halts barrier handling on the meta node. To mitigate this, you can lower the streaming read timeout in the RisingWave configuration so that slow reads fail fast instead of stalling the checkpoint: set object_store_streaming_read_timeout_ms to a lower value, such as 60000 milliseconds (60 seconds). This workaround should help until the underlying issue is fixed in a future release, such as version 1.6. Additionally, make sure that backup_storage_url and backup_storage_directory are correctly configured, either in risingwave.toml or through the ALTER SYSTEM SET command, so that meta backups don't fall back to the in-memory object store you saw the warning about.

Here's an example of how to set the streaming read timeout in risingwave.toml:

[storage.object_store]
object_store_streaming_read_timeout_ms = 60000
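
Note that risingwave.toml is read at node startup, so the affected pods need a restart (or a config rollout through the operator) before the new timeout takes effect.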

And for setting the backup_storage_url:

ALTER SYSTEM SET backup_storage_url TO 'gcs://your-bucket-name/your-subdirectory';
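
backup_storage_directory can be set the same way, and SHOW PARAMETERS should confirm both values afterwards (the directory name below is just a placeholder):

ALTER SYSTEM SET backup_storage_directory TO 'backup'; -- placeholder directory
SHOW PARAMETERS; -- lists system parameters, including the two backup settings
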
Answered on Dec 22, 2023