I'm implementing a custom source/sink for a format similar to Debezium streams stored in Parquet. I have a basic sink working, but I'm struggling with understanding the state management of SplitEnumerator
s and the use of SplitMetaData::update_with_offset
. Specifically, I'm facing issues with re-execution of splits after a restart, and the start_offset
always being 0
when I emit records in my SplitReader
. Here's a snippet of my concerns:
SplitEnumerator
s managed? After a restart, it seems to re-execute splits already processed.SplitMetaData::update_with_offset
called, and is it the mechanism to mark a split as fully/partially processed?I'm also dealing with the challenge of having to read all data in-order and managing the end_offset
to be advanced periodically by an external system.
Sergii Mikhtoniuk
Asked on Apr 02, 2024
Eric and Bohan Zhang provided insights into my questions:
SplitEnumerator
s do not keep states internally; they list all splits and track changes in meta.SplitMetaData::update_with_offset
is called when a barrier comes, and the source exec updates the offset in the state table. RisingWave is a streaming system, so splits are not marked as done; it keeps reading each split.start_offset
is always 0
, it might be expected as the source part may get the split assigned from the enumerator and read from the offset given.end_offset
advance periodically by an external system.