Spark RDD problems

Question

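I am just getting started with Spark and have never used Hadoop. I have 10 iMacs on which I installed Spark 1.6.1 with Hadoop 2.6. I downloaded the precompiled version and simply copied the extracted contents into /usr/local/spark/. I set all the environment variables, including SCALA_HOME, and made the changes to PATH and the other Spark conf. I am able to run both spark-shell and pyspark (with Anaconda's Python).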

I have set up the standalone cluster; all the nodes show up in my web UI. Now, using the Python shell (running on the cluster, not locally), I followed this link's python interpreter word count example.

This is the code I used:

from operator import add

def tokenize(text):
    return text.split()

text = sc.textFile("Testing/shakespeare.txt")
words = text.flatMap(tokenize)
wc = words.map(lambda x: (x,1))
counts = wc.reduceByKey(add)

counts.saveAsTextFile("wc")

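It gives me an error saying that the file shakespeare.txt was not found on the slave nodes. Searching around, I learned that if I am not using HDFS, the file has to be present on every slave node at the same path. Here is the stack trace - github gist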

Now, I have a few questions:

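Isn't an RDD supposed to be distributed? That is, shouldn't it distribute the file to all the nodes (when an action is run on the RDD) instead of requiring me to distribute it?
I downloaded Spark with Hadoop 2.6, but none of the Hadoop commands are available for creating an HDFS. I extracted the Hadoop jar file found in spark/lib hoping to find some executable, but there was nothing. So which Hadoop-related files are provided in the Spark download?
Lastly, how can I run a distributed application (spark-submit) or a distributed analysis (using pyspark) on the cluster? If I have to create an HDFS, what extra steps are required? And how do I create an HDFS here?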

Answer 1

If you read the Spark Programming Guide, you will find the answer to your first question:

To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.

Remember that transformations are executed on the Spark workers (see link, slide n. 21).
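The quoted passage can be made concrete with a plain-Python simulation. This is not Spark code, only a sketch of the idea: the data is split into partitions, each worker maps and reduces its own partition, and the driver combines the small per-worker results. The helper names (split_into_partitions, run_on_worker, total_length) and the sample lines are invented for illustration.

```python
from functools import reduce


def split_into_partitions(items, num_partitions):
    """Deal items round-robin into lists that stand in for Spark partitions."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def run_on_worker(partition):
    """What one worker does: map each line to its length, then reduce locally."""
    line_lengths = map(len, partition)
    return reduce(lambda a, b: a + b, line_lengths, 0)


def total_length(lines, num_workers=3):
    """Simulate the driver: hand out partitions, then combine the partial sums."""
    partitions = split_into_partitions(lines, num_workers)
    # In a real cluster these calls would run in parallel on separate machines.
    partial_sums = [run_on_worker(p) for p in partitions]
    # Only one small number per worker travels back to the driver.
    return sum(partial_sums)


lines = ["to be", "or not", "to be", "that is the question"]
print(total_length(lines))  # prints 36
```

In real Spark it is the workers, not the driver, that open the path given to sc.textFile. That is why, without a shared file system such as HDFS, a local file has to exist at the same path on every worker.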

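Regarding your second question, as you saw, Spark only contains the libraries needed to use the Hadoop infrastructure. You need to set up the Hadoop cluster (HDFS, etc.) first in order to use it (through the libraries in Spark): have a look at Hadoop Cluster Setup.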
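If a full HDFS setup is more than a small experiment needs, one possible alternative for files that fit in the driver's memory is to read the file on the driver and hand the lines to sc.parallelize, which ships an in-memory collection to the cluster. The sketch below is an assumption about how this could be done, not something the answer prescribes; the helper name lines_from_driver_file is hypothetical, and sc is the SparkContext the pyspark shell already provides.

```python
import io


def lines_from_driver_file(sc, local_path, num_slices=None):
    """Read a file that exists only on the driver and turn its lines into an RDD.

    Hypothetical helper: suitable only when the whole file fits in driver memory.
    """
    with io.open(local_path, encoding="utf-8") as handle:
        # Strip the trailing newline from each line.
        lines = [line.rstrip("\n") for line in handle]
    # parallelize distributes the driver-side list across the cluster.
    return sc.parallelize(lines, num_slices)
```

In the pyspark shell the first line of the word count would then become text = lines_from_driver_file(sc, "Testing/shakespeare.txt"), with the remaining steps unchanged. For data too large for the driver, a shared store such as HDFS remains the usual route.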

To answer your last question, I hope the official documentation helps, in particular Spark Standalone.
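As a rough illustration of what a standalone application could look like, here is a minimal sketch of the word count from the question packaged as a script for spark-submit. The file layout, the helper master_url and the three command-line arguments are assumptions made for this example; the standalone master URL has the form spark://HOST:PORT, with 7077 as the default port.

```python
import sys
from operator import add


def tokenize(text):
    """Split a line into whitespace-separated words."""
    return text.split()


def master_url(host, port=7077):
    """Build a standalone master URL (hypothetical helper)."""
    return "spark://{0}:{1}".format(host, port)


def main(master_host, input_path, output_path):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("WordCount").setMaster(master_url(master_host))
    sc = SparkContext(conf=conf)
    try:
        counts = (sc.textFile(input_path)
                    .flatMap(tokenize)
                    .map(lambda word: (word, 1))
                    .reduceByKey(add))
        counts.saveAsTextFile(output_path)
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

The input path must still be readable from every worker, so it should be either a local path present on all nodes or a URI on a shared file system such as HDFS.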

hadoop

apache-spark

hadoop2

rdd

pyspark