How does Spark Streaming set up its computation?
I have a basic question about how Spark Streaming works.
In here, there is a block that says: "Note that when these lines are executed, Spark Streaming only sets up the computation it will perform after it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call start method."
I am unable to digest the description above. Ideally, the computation would belong to a function, and ssc.start() would somehow be asked to execute that function in a loop.
import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.examples.streaming.StreamingExamples;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public final class JavaNetworkWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
      System.exit(1);
    }

    StreamingExamples.setStreamingLogLevels();

    // Create the context with a 1 second batch size.
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Create a JavaReceiverInputDStream on the target ip:port and count the
    // words in the input stream of \n-delimited text (e.g. generated by 'nc').
    // Note that skipping replication in the storage level is acceptable only
    // when running locally; replication is necessary in a distributed
    // scenario for fault tolerance.
    /* The four statements below represent the core computation. */
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
    JavaPairDStream<String, Integer> wordCounts = words.mapToPair(s -> new Tuple2<>(s, 1))
        .reduceByKey((i1, i2) -> i1 + i2);
    wordCounts.print();

    /* Starting the actual stream processing. */
    ssc.start();
    ssc.awaitTermination();
  }
}
But here, the core computation is executed first and then the streaming context is started. How does this make sense?
因为 "core computation" 不是 "computation"。它描述了如果 StreamingContext
曾经启动并且是否收到任何数据,应该做什么。
如果仍然不清楚 - 它是 什么 而不是 如何 并且内部不会泄露给用户代码。
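To make that laziness visible, here is a minimal, self-contained sketch (not from the original post; the class name LazySetupDemo, the in-memory queue input, and the 5-second timeout are invented for illustration). It runs with a local master and shows that the lambda passed to map executes only after ssc.start() is called:

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class LazySetupDemo {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("LazySetupDemo");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Feed one small RDD through an in-memory queue (a local-only testing input).
    JavaSparkContext jsc = ssc.sparkContext();
    Queue<JavaRDD<Integer>> queue = new LinkedList<>();
    queue.add(jsc.parallelize(Arrays.asList(1, 2, 3)));
    JavaDStream<Integer> input = ssc.queueStream(queue);

    // This only *declares* the computation; the lambda body has not run yet.
    JavaDStream<Integer> doubled = input.map(x -> {
      System.out.println("processing " + x); // appears only after ssc.start()
      return x * 2;
    });
    doubled.print();

    System.out.println("computation defined, nothing processed yet");

    ssc.start();                         // now the scheduler starts executing the plan
    ssc.awaitTerminationOrTimeout(5000); // let a few batches run, then exit
    ssc.stop();
  }
}

Running it prints "computation defined, nothing processed yet" before any "processing ..." line, which is exactly what the quoted documentation means by "only sets up the computation".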