Class FlinkKafkaShuffle


  • @Experimental
    @Deprecated
    public class FlinkKafkaShuffle
    extends Object
    Deprecated.
This experimental feature never graduated to a stable feature and will be removed in future releases. If you are interested in porting it to the Source/Sink API, please reach out to the Flink community.
    FlinkKafkaShuffle uses Kafka as a message bus to shuffle and persist data at the same time.

    Persisting shuffle data is useful when:
    - you would like to reuse the shuffle data, and/or
    - you would like to avoid a full restart of a pipeline during failure recovery

    Persistent shuffling is achieved by wrapping a FlinkKafkaShuffleProducer and a FlinkKafkaShuffleConsumer together into a FlinkKafkaShuffle. Here is an example of how to use a FlinkKafkaShuffle:


     StreamExecutionEnvironment env = ...               // create execution environment
     DataStream<X> source = env.addSource(...)          // add data stream source
     DataStream<Y> dataStream = ...                     // some transformation(s) based on source

     KeyedStream<Y, KEY> keyedStream = FlinkKafkaShuffle
         .persistentKeyBy(                              // keyBy shuffle through Kafka
             dataStream,                                // data stream to be shuffled
             topic,                                     // Kafka topic written to
             producerParallelism,                       // the number of tasks of the Kafka producer
             numberOfPartitions,                        // the number of partitions of the Kafka topic written to
             kafkaProperties,                           // Kafka properties for the Kafka producer and consumer
             keySelector<Y, KEY>);                      // key selector to retrieve the key from `dataStream`

     keyedStream.transform...                           // some other transformation(s)

     KeyedStream<Y, KEY> keyedStreamReuse = FlinkKafkaShuffle
         .readKeyBy(                                    // read the Kafka shuffle data again for other usages
             topic,                                     // the Kafka topic where data is persisted
             env,                                       // execution environment; it can be a new environment
             typeInformation<Y>,                        // type information of the data persisted in Kafka
             kafkaProperties,                           // Kafka properties for the Kafka consumer
             keySelector<Y, KEY>);                      // key selector to retrieve the key

     keyedStreamReuse.transform...                      // some other transformation(s)


    Usage of persistentKeyBy(org.apache.flink.streaming.api.datastream.DataStream<T>, java.lang.String, int, int, java.util.Properties, org.apache.flink.api.java.functions.KeySelector<T, K>) is similar to DataStream.keyBy(KeySelector). The differences are:

    1). Partitioning is done through FlinkKafkaShuffleProducer, which decides which partition a key goes to when writing to Kafka.
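    The idea behind key-based partitioning can be sketched in plain Java. This is not the actual FlinkKafkaShuffleProducer logic (which relies on Flink's internal key-group assignment); the class and method names below are illustrative assumptions.

    ```java
    import java.util.Objects;

    // Hypothetical sketch: assign each key to a fixed Kafka partition by hashing,
    // so every record with the same key lands in the same partition and a single
    // downstream consumer task sees all records for that key.
    class PartitionAssignmentSketch {

        // Map a key to one of numberOfPartitions partitions.
        static int partitionFor(Object key, int numberOfPartitions) {
            // Math.floorMod keeps the result non-negative even for negative hash codes.
            return Math.floorMod(Objects.hashCode(key), numberOfPartitions);
        }

        public static void main(String[] args) {
            int partitions = 4;
            // The same key is always assigned to the same partition.
            System.out.println(partitionFor("user-42", partitions));
            System.out.println(partitionFor("user-42", partitions));
        }
    }
    ```

    Determinism is the important property here: because the key-to-partition mapping is stable, the persisted topic can later be re-read with readKeyBy and still behave like a keyed stream.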

    2). Shuffle data can be reused through readKeyBy(java.lang.String, org.apache.flink.streaming.api.environment.StreamExecutionEnvironment, org.apache.flink.api.common.typeinfo.TypeInformation<T>, java.util.Properties, org.apache.flink.api.java.functions.KeySelector<T, K>), as shown in the example above.

    3). Job execution is decoupled by the persistent Kafka message bus. In the example, the job execution graph is decoupled into three regions: `KafkaShuffleProducer`, `KafkaShuffleConsumer` and `KafkaShuffleConsumerReuse`, connected through `PERSISTENT DATA` as shown below. If any region fails, the other two keep progressing.

         source -> ... KafkaShuffleProducer -> PERSISTENT DATA -> KafkaShuffleConsumer -> ...
                                                    |
                                                    | ----------> KafkaShuffleConsumerReuse -> ...