I left a long-running query over the weekend, and it's hanging at around 9%. I noticed compactor errors and warnings indicating issues with Amazon S3, such as `remote storage error streaming_read_read_bytes` and `Failed to vacuum task: ObjectStore failed with IO error s3 error`. Additionally, I observed that Hummock is getting disconnected frequently, and my materialized view (MV) throughput has effectively dropped to zero since the errors started. I'm using the docker-compose (with-s3) testbed from the RisingWave git repository for this proof-of-concept project. I've also noticed a high number of 'in-flight' barrier messages but can't identify why my container would have trouble reaching S3. Can you help me resolve these issues and get my query and compaction back on track?
Dominic Lindsay
Asked on Oct 09, 2023
From the discussion, it seems the compactor is having trouble reading from S3, which degrades read performance and causes the entire stream to get stuck. This could be due to insufficient compaction resources, leading to a pile-up of uncompacted files. It's recommended to upgrade all components to the latest image, as there have been fixes related to compaction. Increasing the number of CPU cores on the compactor and adjusting the compaction configuration may also help. Running `./risingwave/bin/risingwave ctl trace` on the compute node can help determine where the process is stuck. Monitoring the Grafana panels `Hummock (Read)`, `Hummock (Write)`, `Compaction Failure Count`, and `Write Stop Compaction Groups`
can provide insight into the compaction process and whether it's functioning properly. If writes have been stopped, check the `Write Stop Compaction Groups` panel under `Hummock Manager` in Grafana and share the result of `select * from rw_hummock_compaction_group_configs;`, which
can help identify config tweaks to accelerate compaction. If the cluster was created from an old image, there may be stale config entries that need updating. Finally, if writes are being throttled, enabling emergency compaction by running the appropriate commands in the meta node's shell may resolve the issue.
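To locate where the stream is stuck, the trace command mentioned above is run from inside the compute-node container. A minimal sketch, assuming the docker-compose service is named `compute-node-0` and the binary lives at `/risingwave/bin/risingwave` (both are assumptions; check `docker compose ps` and your compose file for the actual names):

```shell
# Open a shell in the compute node container; the service name
# "compute-node-0" is an assumption -- adjust to your compose file.
docker compose exec compute-node-0 /bin/bash

# Inside the container, dump the trace of running actors to see
# which part of the stream the stuck process is blocked on.
./risingwave/bin/risingwave ctl trace
```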
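The catalog query from the answer can be run through any Postgres client against the RisingWave frontend. A sketch using `psql`, assuming the frontend is reachable on the default port 4566 with user `root` and database `dev` (host, port, user, and database here are assumptions based on the common docker-compose defaults; adjust to your setup):

```shell
# Query the compaction group configs so they can be shared for review.
# Connection parameters are assumptions -- match them to your deployment.
psql -h localhost -p 4566 -d dev -U root \
  -c 'select * from rw_hummock_compaction_group_configs;'
```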
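Since the original question also asks why the container might fail to reach S3, a quick reachability check from inside the compactor container can separate network problems from compaction problems. A sketch, assuming the service is named `compactor-0` and using plain `curl` in case the AWS CLI is not installed in the image (the bucket name and region are placeholders):

```shell
# Service name, bucket, and region are placeholders -- substitute your own.
# Prints only the HTTP status code of the response.
docker compose exec compactor-0 \
  curl -sS -o /dev/null -w '%{http_code}\n' \
  https://your-bucket.s3.us-east-1.amazonaws.com/

# A 403 still proves TCP/TLS reachability (the request reached S3 but
# was unauthorized); timeouts or DNS errors point to a networking
# problem from inside the container.
```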