
Cannot output result at Python lines.first() from SparkContext

I am writing my first test.py in Spark.

Code:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My Test")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///home/hduser/spark-1.5.2-bin-hadoop2.6/README.md") # Create an RDD called lines

lines.count()
lines.first()

Output:

hduser@borischow-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ bin/spark-submit test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hduser/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/28 17:42:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/28 17:42:46 WARN Utils: Your hostname, borischow-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)
15/12/28 17:42:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/28 17:42:48 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/12/28 17:42:48 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
hduser@borischow-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ 

Questions:

  1. I cannot get the expected output from lines.count() and lines.first(). Why?

  2. What is the reason behind these warning messages?

15/12/28 17:42:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

15/12/28 17:42:46 WARN Utils: Your hostname, borischow-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface eth0)

15/12/28 17:42:46 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

15/12/28 17:42:48 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

15/12/28 17:42:48 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.

Thanks a lot!

You don't see any output because neither count nor first prints anything to stdout; both simply return a value to the driver.

Just use print:

from __future__ import print_function  # makes print a function on Python 2 as well
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My Test")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///home/hduser/spark-1.5.2-bin-hadoop2.6/README.md")

print(lines.count())
print(lines.first())
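
If you want to do more than print, both methods return plain Python values, and the RDD itself can be written to disk. A minimal sketch (the output path is only an example):

n = lines.count()      # count() returns an int
head = lines.first()   # first() returns the first line as a str

print("line count:", n)
print("first line:", head)

# Persist the whole RDD instead of printing it; the target directory is
# hypothetical and must not already exist, or saveAsTextFile will fail.
lines.saveAsTextFile("file:///tmp/readme-out")

As for the messages in your log: they are all WARN-level and non-fatal, so they are not what is hiding the output; the job still runs to completion.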