Spark data in memory

Question

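I am using PySpark SQL. I want to retrieve tables from Redshift, keep them in memory, and then apply some joins and transformations. I want the joins and transformations to run on the in-memory data, rather than having the SQL plan built from those transformations applied directly to Redshift.

When I retrieve the data, it only saves the schema, right?

If I use createTempView(), it saves a view in the SparkContext but not the data, right?

If I use cache() after getting the DataFrame, does it save the data in memory? And are the subsequent transformations applied in memory?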

df = manager.session.read.jdbc(url=url, table=table, properties={"driver": driver, "user": user, "password": password})

df1 = manager.session.read.jdbc(url=url, table=table1, properties={"driver": driver, "user": user, "password": password})

df2 = manager.session.read.jdbc(url=url, table=table2, properties={"driver": driver, "user": user, "password": password})

df_res = df.union(df2)

df_res = df_res.groupBy("seq_rec", "seq_res").agg({'impuesto': 'sum'}).withColumnRenamed("SUM(impuesto)", "pricing")

df_result = df.join(df_res, [df.seq == df_res.seq_rec, df.res == df_res.seq_res])

After that I save the DataFrame to an Avro file. Is that the point where all the transformations are applied?

Answer 1

When I'm retrieving the data, it saves the schema only, right?

Yes, that's correct.

If I use createTempView(), it saves a view in the SparkContext but not the data, right?

Same here.

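If I use cache() after getting the DataFrame, does it save the data in memory? And are the next transformations applied in memory?

No. cache() does not fetch the data eagerly. Depending on the available resources, it may cache the data in memory when the dataset is loaded for the first time, that is, when an action first evaluates it.

Spark SQL also has the CACHE TABLE statement, which has been used to fetch the data eagerly (unless LAZY is specified) and try to cache it.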
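As a rough sketch of what this means in practice, assuming `df` is one of the DataFrames read over JDBC in the question: `cache()` only marks the DataFrame for caching, and an action such as `count()` is what actually pulls the rows from Redshift and fills the cache. The helper name `cache_and_materialize` is hypothetical and not part of PySpark.

```python
def cache_and_materialize(df):
    """Mark df for caching, then force one full pass so the cache is filled.

    `df` is assumed to be a pyspark.sql.DataFrame, for example one returned
    by session.read.jdbc(...) in the question.
    """
    cached = df.cache()      # lazy: nothing is read from the source yet
    n_rows = cached.count()  # action: scans the source and populates the cache
    return cached, n_rows
```

Calling `df, n = cache_and_materialize(df)` before the union, groupBy and join should let those later steps read from the cached data rather than going back to Redshift, as long as the cached partitions are still available. Depending on resources, part of the data may be kept on disk rather than in memory.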

spark.sql("CACHE TABLE foo")
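For a DataFrame obtained in Python, the statement above needs a name to refer to. A minimal sketch, assuming `spark` is the active SparkSession and `df` is a DataFrame such as the ones in the question; the helper name `cache_eagerly` and the view name passed to it are hypothetical:

```python
def cache_eagerly(spark, df, view_name):
    """Register df as a temporary view and cache it eagerly through SQL.

    `spark` is assumed to be a pyspark.sql.SparkSession and `df` a
    pyspark.sql.DataFrame; `view_name` is any valid SQL identifier.
    """
    # Registering the view stores only the query plan, not the data.
    df.createOrReplaceTempView(view_name)
    # CACHE TABLE is eager unless LAZY is given: it loads and caches the data now.
    spark.sql(f"CACHE TABLE {view_name}")
    # Ask the catalog whether the view is now marked as cached.
    return spark.catalog.isCached(view_name)
```

For example, `cache_eagerly(manager.session, df, "foo")` would register the view `foo` and run the same CACHE TABLE statement against it.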

apache-spark

pyspark

pyspark-sql