Class FlinkKafkaShuffle

java.lang.Object
org.apache.flink.streaming.connectors.kafka.shuffle.FlinkKafkaShuffle

@Experimental @Deprecated public class FlinkKafkaShuffle extends Object
Deprecated.
This experimental feature never graduated to a stable feature and will be removed in future releases. In case of interest to port it to the Source/Sink API, please reach out to the Flink community.
FlinkKafkaShuffle uses Kafka as a message bus to shuffle and persist data at the same time.

Persisting shuffle data is useful when:
- you would like to reuse the shuffle data, and/or
- you would like to avoid a full restart of a pipeline during failure recovery

Persisting a shuffle is achieved by wrapping a FlinkKafkaShuffleProducer and a FlinkKafkaShuffleConsumer together into a FlinkKafkaShuffle. Here is an example of how to use a FlinkKafkaShuffle.


 StreamExecutionEnvironment env = ...                   // create execution environment
 DataStream<X> source = env.addSource(...)              // add data stream source
 DataStream<Y> dataStream = ...                         // some transformation(s) based on source

 KeyedStream<Y, KEY> keyedStream = FlinkKafkaShuffle
     .persistentKeyBy(                                  // keyBy shuffle through Kafka
         dataStream,                                    // data stream to be shuffled
         topic,                                         // Kafka topic written to
         producerParallelism,                           // the number of tasks of the Kafka producer
         numberOfPartitions,                            // the number of partitions of the Kafka topic written to
         kafkaProperties,                               // Kafka properties for the Kafka producer and consumer
         keySelector<Y, KEY>);                          // key selector to retrieve the key from `dataStream`

 keyedStream.transform...                               // some other transformation(s)

 KeyedStream<Y, KEY> keyedStreamReuse = FlinkKafkaShuffle
     .readKeyBy(                                        // read the Kafka shuffle data again for other usages
         topic,                                         // the Kafka topic where data is persisted
         env,                                           // execution environment; it can be a new environment
         typeInformation<Y>,                            // type information of the data persisted in Kafka
         kafkaProperties,                               // Kafka properties for the Kafka consumer
         keySelector<Y, KEY>);                          // key selector to retrieve the key

 keyedStreamReuse.transform...                          // some other transformation(s)

Usage of persistentKeyBy(org.apache.flink.streaming.api.datastream.DataStream<T>, java.lang.String, int, int, java.util.Properties, org.apache.flink.api.java.functions.KeySelector<T, K>) is similar to DataStream.keyBy(KeySelector). The differences are:

1). Partitioning is done through FlinkKafkaShuffleProducer. FlinkKafkaShuffleProducer decides which partition a key goes to when writing to Kafka.

2). Shuffle data can be reused through readKeyBy(java.lang.String, org.apache.flink.streaming.api.environment.StreamExecutionEnvironment, org.apache.flink.api.common.typeinfo.TypeInformation<T>, java.util.Properties, org.apache.flink.api.java.functions.KeySelector<T, K>), as shown in the example above.

3). Job execution is decoupled by the persistent Kafka message bus. In the example, the job execution graph is decoupled into three regions: `KafkaShuffleProducer`, `KafkaShuffleConsumer` and `KafkaShuffleConsumerReuse`, connected through `PERSISTENT DATA` as shown below. If any region fails, the other two can keep progressing.

     source -> ... KafkaShuffleProducer -> PERSISTENT DATA -> KafkaShuffleConsumer -> ...
                                                |
                                                | ----------> KafkaShuffleConsumerReuse -> ...
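The key-to-partition decision described in 1) can be illustrated with a minimal, self-contained sketch. This is a simplified hash-based assignment for illustration only; the class name `ShufflePartitioning` and the method `assignPartition` are hypothetical, and the actual FlinkKafkaShuffleProducer uses Flink's internal key-group assignment rather than this exact scheme.

```java
import java.util.Objects;

/**
 * Simplified sketch of hash-based partition assignment, showing how a
 * producer can map a record's key to a fixed Kafka partition. Hypothetical
 * illustration; not the actual FlinkKafkaShuffleProducer implementation.
 */
public class ShufflePartitioning {

    /** Maps a key to a partition index in the range [0, numberOfPartitions). */
    public static int assignPartition(Object key, int numberOfPartitions) {
        if (numberOfPartitions <= 0) {
            throw new IllegalArgumentException("numberOfPartitions must be positive");
        }
        // Mask off the sign bit so the modulo result is non-negative,
        // even for keys whose hashCode() is Integer.MIN_VALUE.
        int hash = Objects.hashCode(key) & 0x7fffffff;
        return hash % numberOfPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition; this stability is
        // what lets the persisted data be re-read later as a KeyedStream.
        System.out.println(assignPartition("user-42", 4));
        System.out.println(assignPartition("user-42", 4));
    }
}
```

Whatever the concrete scheme, the essential property is that the mapping is deterministic per key, so that readKeyBy can re-derive the keyed partitioning from the persisted topic without reshuffling.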